Olivier Cappé, Aurélien Garivier,
Odalric-Ambrym Maillard,
Rémi Munos, Gilles Stoltz.
In The Annals of Statistics, 2013.
[Download]
Abstract: |
We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: The kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas et Katehakis (1996), respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state-of-the-art. |
You can dowload the paper from the Annals of Statistics (here) or from the HAL online open depository* (here).
Bibtex: |
@article{CaGaMaMuSt2013, AUTHOR = {Olivier Capp\'{e}, Aur\'{e}lien Garivier, Odalric-Ambrym Maillard, R\'{e}mi Munos, Gilles Stoltz}, TITLE = {Kullback–Leibler upper confidence bounds for optimal sequential allocation}, JOURNAL = {Ann. Statist.}, FJOURNAL = {Annals of Statistics}, YEAR = {2013}, VOLUME = {41}, NUMBER = {3}, PAGES = {1516-1541}, ISSN = {0090-5364}, DOI = {10.1214/13-AOS1119} } |
Related Publications: |
Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences. Odalric-Ambrym Maillard, Gilles Stoltz, Rémi Munos. In Proceedings of the 24th annual Conference On Learning Theory, COLT, 2011. |