Dueling Bandits with Delayed Feedback

Jasmin  Brandt; Björn Haddenhorst; Viktor Bengs; Eyke Hüllermeier

doi:10.11576/dataninja-1163

Authors

Jasmin Brandt
Björn Haddenhorst
Viktor Bengs
Eyke Hüllermeier

DOI:

https://doi.org/10.11576/dataninja-1163

Keywords:

Multi-Armed Bandits, Dueling Bandits, Online Learning

Abstract

Dueling Bandits is a well-studied extension of the Multi-Armed Bandits problem, in which the learner must select two arms in each time step and receives a binary feedback as an outcome of the chosen duel. However, all of the existing best arm identification algorithms for the Dueling Bandits setting assume that the feedback can be observed immediately after selecting the two arms. If this is not the case, the algorithms simply do nothing and wait until the feedback of the recent duel can be observed, which is a waste of runtime. We propose an algorithm that can already start a new duel even if the previous one is not finished and thus is much more time efficient. Our arm selection strategy balances the expected information gain of the chosen duel and the expected delay until we observe the feedback. By theoretically grounded confidence bounds we can ensure that the arms we discard are not the best arms with high probability.

Dueling Bandits with Delayed Feedback

Authors

DOI:

Keywords:

Abstract

Downloads

Published

Issue

Section

License