Latent Bandits.

2013, Discussing articles

Odalric-Ambrym Maillard, Shie Mannor
In International Conference on Machine Learning, 2014.

[Download]

Abstract:

We consider a multi-armed bandit problem where the reward distributions are indexed by two sets, one for arms and one for types, and can be partitioned into a small number of clusters according to the type. First, we consider the setting where all reward distributions are known and all types have the same underlying cluster; the type’s identity is, however, unknown. Second, we study the case where types may come from different classes, which is significantly more challenging. Finally, we tackle the case where the reward distributions are completely unknown. In each setting, we introduce specific algorithms and derive non-trivial regret guarantees. Numerical experiments show that, in the most challenging agnostic case, the proposed algorithm achieves excellent performance in several difficult scenarios.
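
For intuition about the first setting (known reward distributions, unknown type), here is a minimal Python sketch: it keeps a set of candidate clusters, discards any cluster whose known arm means become inconsistent with the empirical means observed so far, and plays optimistically among the survivors. The names (latent_bandit_sketch, cluster_means, pull) and the confidence radius are illustrative assumptions; this is a rough sketch of the idea, not the algorithm analyzed in the paper.

import math
import random

# Sketch only: known per-cluster arm means, unknown active cluster.
# Eliminate clusters contradicted by the data, play optimistically among the rest.
def latent_bandit_sketch(cluster_means, pull, horizon):
    """cluster_means[c][a]: known mean of arm a under cluster c.
    pull(a): returns a reward in [0, 1] drawn from the (unknown) active cluster."""
    n_arms = len(cluster_means[0])
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    active = set(range(len(cluster_means)))  # clusters still deemed plausible

    for t in range(1, horizon + 1):
        def radius(a):  # Hoeffding-style confidence radius (arbitrary constants)
            return math.sqrt(2.0 * math.log(max(t, 2)) / max(counts[a], 1))

        consistent = {
            c for c in active
            if all(counts[a] == 0
                   or abs(cluster_means[c][a] - sums[a] / counts[a]) <= radius(a)
                   for a in range(n_arms))
        }
        active = consistent or active  # never let the candidate set become empty

        # Optimism: play the arm with the best known mean over surviving clusters.
        arm = max(range(n_arms),
                  key=lambda a: max(cluster_means[c][a] for c in active))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return active

# Example: two candidate clusters over two arms, cluster 0 actually active.
means = [[0.9, 0.1], [0.1, 0.9]]
rng = random.Random(0)
latent_bandit_sketch(means, lambda a: float(rng.random() < means[0][a]), horizon=2000)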

You can download the paper from the JMLR website (here) or from the HAL online open repository* (here).

You can download the Java code used to generate the experiments here.

Bibtex:
@inproceedings{maillard2014latent,
  title={Latent Bandits.},
  author={Maillard, Odalric-Ambrym and Mannor, Shie},
  booktitle={Proceedings of The 31st International Conference on Machine Learning},
  pages={136--144},
  year={2014}
}

Robust Risk-Averse Stochastic Multi-armed Bandits.

2013, Discussing articles

Odalric-Ambrym Maillard.
In Algorithmic Learning Theory, 2013.

[Download]

Abstract:

We study a variant of the standard stochastic multi-armed bandit problem in which one is not interested in the arm with the best mean, but instead in the arm maximizing some coherent risk measure criterion. Further, we study the deviations of the regret rather than the less informative expected regret. We provide an algorithm, called RA-UCB, to solve this problem, together with a high-probability bound on its regret.
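
To make the criterion concrete, a coherent risk measure such as the conditional value-at-risk (CVaR) can replace the mean, and an index policy then scores each arm by an optimistic risk-adjusted estimate. The Python sketch below combines the empirical CVaR of an arm's observed rewards with a Hoeffding-style exploration bonus; the level alpha, the bonus form, and the names are assumptions made for illustration, not the RA-UCB index or the bound derived in the paper.

import math

def empirical_cvar(samples, alpha=0.1):
    """Average of the worst alpha-fraction of the observed rewards
    (empirical conditional value-at-risk, a coherent risk measure)."""
    k = max(1, int(math.ceil(alpha * len(samples))))
    worst = sorted(samples)[:k]
    return sum(worst) / k

def risk_averse_index(samples, t, alpha=0.1):
    """Optimistic risk-adjusted score: empirical CVaR plus an (assumed) exploration bonus."""
    return empirical_cvar(samples, alpha) + math.sqrt(2.0 * math.log(t) / len(samples))

# After pulling each arm once, round t would play the arm maximizing
# risk_averse_index(observed_rewards[arm], t).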

You can download the paper from the ALT website (here) or from the HAL online open repository* (here).

Bibtex:
@incollection{Maillard2013,
  year={2013},
  isbn={978-3-642-40934-9},
  booktitle={Algorithmic Learning Theory},
  volume={8139},
  series={Lecture Notes in Computer Science},
  editor={Jain, Sanjay and Munos, Rémi and Stephan, Frank and Zeugmann, Thomas},
  title={Robust Risk-Averse Stochastic Multi-armed Bandits},
  publisher={Springer Berlin Heidelberg},
  author={Maillard, Odalric-Ambrym},
  pages={218--233}
}
Related Publications:
Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation.
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz.
In The Annals of Statistics, 2013.

Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences.
Odalric-Ambrym Maillard, Gilles Stoltz, Rémi Munos.
In Proceedings of the 24th Annual Conference on Learning Theory (COLT), 2011.

Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation.

2013, Discussing articles

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz.
In The Annals of Statistics, 2013.

[Download]

Abstract:

We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: the kl-UCB algorithm is designed for one-parameter exponential families, while the empirical KL-UCB algorithm is designed for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state of the art.
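
Concretely, for Bernoulli rewards the kl-UCB index of an arm with empirical mean p_hat after n pulls is the largest value q that the data cannot yet rule out, i.e. the largest q with n * kl(p_hat, q) below an exploration threshold, where kl denotes the Bernoulli Kullback-Leibler divergence. The bisection sketch below is one simple way to compute this index; using log t as the threshold is a common practical choice (the paper's analysis uses a slightly larger threshold with additional log log t terms), and the code itself is only an illustration.

import math

def bernoulli_kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def klucb_index(p_hat, n, t, precision=1e-6):
    """Largest q in [p_hat, 1] such that n * kl(p_hat, q) <= log(t), found by bisection."""
    threshold = math.log(max(t, 2)) / n
    lo, hi = p_hat, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2.0
        if bernoulli_kl(p_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

At each round the arm with the largest index is pulled. For other one-parameter exponential families the Bernoulli divergence is replaced by the corresponding one, and the empirical KL-UCB variant instead maximizes the mean over distributions supported on the observed rewards subject to a KL constraint.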

You can download the paper from the Annals of Statistics (here) or from the HAL online open repository* (here).

Bibtex:
@article{CaGaMaMuSt2013,
  AUTHOR = {Capp\'{e}, Olivier and Garivier, Aur\'{e}lien and Maillard, Odalric-Ambrym and Munos, R\'{e}mi and Stoltz, Gilles},
  TITLE = {Kullback-Leibler upper confidence bounds for optimal sequential allocation},
  JOURNAL = {Ann. Statist.},
  FJOURNAL = {Annals of Statistics},
  YEAR = {2013},
  VOLUME = {41},
  NUMBER = {3},
  PAGES = {1516--1541},
  ISSN = {0090-5364},
  DOI = {10.1214/13-AOS1119}
}
Related Publications:
Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences.
Odalric-Ambrym Maillard, Gilles Stoltz, Rémi Munos.
In Proceedings of the 24th Annual Conference on Learning Theory (COLT), 2011.

Competing with an Infinite Set of Models in Reinforcement Learning.

2013, Discussing articles

Phuong Nguyen, Odalric-Ambrym Maillard, Daniil Ryabko, Ronald Ortner.
In International Conference on Artificial Intelligence and Statistics, 2013.

[Download]

Abstract:

We consider a reinforcement learning setting where the learner also has to deal with the problem of finding a suitable state-representation function from a given set of models. This has to be done while interacting with the environment in an online fashion (no resets), and the goal is to have small regret with respect to any Markov model in the set. For this setting, the BLB algorithm has recently been proposed; it achieves regret of order T^{2/3}, provided that the given set of models is finite. Our first contribution is to extend this result to a countably infinite set of models. Moreover, the BLB regret bound suffers from an additive term that can be exponential in the diameter of the underlying MDP, because the diameter has to be guessed. The algorithm we propose avoids guessing the diameter, thus improving the regret bound.
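
One standard device for handling a countably infinite set of candidate models is to let model j enter the pool only after enough time has elapsed, so the pool stays finite at every step while any fixed model is eventually considered. The tiny Python sketch below shows such a schedule; the 2^j entry time is an arbitrary illustration and is not claimed to be the schedule or the algorithm used in the paper.

import math

def active_models(t):
    """Indices of the candidate models considered at time t (model j enters at time 2**j)."""
    return list(range(int(math.log2(t)) + 1)) if t >= 1 else []

# At t = 1 only model 0 is active; at t = 1000 models 0 through 9 are active.
print(active_models(1))
print(active_models(1000))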

You can download the paper from the JMLR website (here) or from the HAL online open repository* (soon).

Bibtex:
@InProceedings{Nguyen13,
  author = {Nguyen, P. and Maillard, O. and Ryabko, D. and Ortner, R.},
  title = {Competing with an Infinite Set of Models in Reinforcement Learning},
  booktitle = {AISTATS},
  series = {JMLR W\&CP 31},
  address = {Arizona, USA},
  year = {2013},
  pages = {463--471}
}
Related Publications:
Optimal regret bounds for selecting the state representation in reinforcement learning.
Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, Daniil Ryabko.
In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Selecting the State-Representation in Reinforcement Learning.
Odalric-Ambrym Maillard, Daniil Ryabko, Rémi Munos.
In Proceedings of the 24th Conference on Advances in Neural Information Processing Systems (NIPS 2011), pages 2627–2635, 2011.