Dueling Bandits with Delayed Feedback
DOI:
https://doi.org/10.11576/dataninja-1163Keywords:
Multi-Armed Bandits, Dueling Bandits, Online LearningAbstract
Dueling Bandits is a well-studied extension of the Multi-Armed Bandits problem, in which the learner must select two arms in each time step and receives a binary feedback as an outcome of the chosen duel. However, all of the existing best arm identification algorithms for the Dueling Bandits setting assume that the feedback can be observed immediately after selecting the two arms. If this is not the case, the algorithms simply do nothing and wait until the feedback of the recent duel can be observed, which is a waste of runtime. We propose an algorithm that can already start a new duel even if the previous one is not finished and thus is much more time efficient. Our arm selection strategy balances the expected information gain of the chosen duel and the expected delay until we observe the feedback. By theoretically grounded confidence bounds we can ensure that the arms we discard are not the best arms with high probability.
Downloads
Published
Issue
Section
License
Copyright (c) 2024 Jasmin Brandt, Björn Haddenhorst, Viktor Bengs, Eyke Hüllermeier
This work is licensed under a Creative Commons Attribution 4.0 International License.