# “How hard is my MDP?” The distribution-norm to the rescue.

Odalric-Ambrym Maillard, Timothy A. Mann, Shie Mannor.
In Advances in Neural Information Processing Systems, 2014.

 Abstract: In Reinforcement Learning (RL), state-of-the-art algorithms require a large number of samples per state-action pair to estimate the transition kernel p. In many problems, a good approximation of p is not needed. For instance, if from one state-action pair (s,a), one can only transit to states with the same value, learning p(⋅|s,a) accurately is irrelevant (only its support matters). This paper aims at capturing such behavior by defining a novel hardness measure for Markov Decision Processes (MDPs) we call the distribution-norm. The distribution-norm w.r.t. a measure ν is defined on zero ν-mean functions f by the standard variation of f with respect to ν. We first provide a concentration inequality for the dual of the distribution-norm. This allows us to replace the generic but loose ||⋅||1 concentration inequalities used in most previous analyses of RL algorithms, to benefit from this new hardness measure. We then show that several common RL benchmarks have low hardness when measured using the new norm. The distribution-norm captures finer properties than the number of states or the diameter and can be used to assess the difficulty of MDPs.

You can download the paper from the NIPS website (here) or from the HAL online open depository* (soon).

 Bibtex: @incollection{MaiManMan14, title = {``How hard is my MDP?'' The distribution-norm to the rescue}, author = {Maillard, Odalric-Ambrym and Mann, Timothy A and Mannor, Shie}, booktitle = {Advances in Neural Information Processing Systems 27}, editor = {Z. Ghahramani and M. Welling and C. Cortes and N.D. Lawrence and K.Q. Weinberger}, pages = {1835--1843}, year = {2014}, publisher = {Curran Associates, Inc.}, url = {http://papers.nips.cc/paper/5441-how-hard-is-my-mdp-the-distribution-norm-to-the-rescue.pdf} }

# Sub-sampling for multi-armed bandits.

Akram Baransi, Odalric-Ambrym Maillard, Shie Mannor.
In European Conference on Machine Learning, 2014.

 Abstract: The stochastic multi-armed bandit problem is a popular model of the exploration/exploitation trade-off in sequential decision problems. We introduce a novel algorithm that is based on sub-sampling. Despite its simplicity, we show that the algorithm demonstrates excellent empirical performance against state-of-the-art algorithms, including Thompson sampling and KL-UCB. The algorithm is very flexible: it does not need to know a set of reward distributions in advance, nor the range of the rewards. It is not restricted to Bernoulli distributions and is also invariant under rescaling of the rewards. We provide a detailed experimental study comparing the algorithm to the state of the art, the main intuition that explains the striking results, and conclude with a finite-time regret analysis for this algorithm in the simplified two-arm bandit setting.

You can download the paper from the ECML website (here) or from the HAL online open depository* (here).

 Bibtex: @incollection{baransi2014sub, title={Sub-sampling for Multi-armed Bandits}, author={Baransi, Akram and Maillard, Odalric-Ambrym and Mannor, Shie}, booktitle={Machine Learning and Knowledge Discovery in Databases}, pages={115--131}, year={2014}, publisher={Springer} }

# Concentration inequalities for sampling without replacement.

Rémi Bardenet, Odalric-Ambrym Maillard.
In Bernoulli Journal, 2014.

 Abstract: Concentration inequalities quantify the deviation of a random variable from a fixed value. In spite of numerous applications, such as opinion surveys or ecological counting procedures, few concentration results are known for the setting of sampling without replacement from a finite population. Until now, the best general concentration inequality has been a Hoeffding inequality due to Serfling (1974). In this paper, we first improve on the fundamental result of Serfling (1974), and further extend it to obtain a Bernstein concentration bound for sampling without replacement. We then derive an empirical version of our bound that does not require the variance to be known to the user.
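For orientation, the Hoeffding-type bound of Serfling (1974) that the abstract takes as its starting point is usually stated as follows. The notation is an assumption made for this note, not taken from the paper: $x_1,\dots,x_N$ is a finite population with values in $[a,b]$ and mean $\mu$, and $X_1,\dots,X_n$ are drawn from it uniformly without replacement.

$$
\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu \ge \varepsilon\right) \le \exp\left(-\frac{2 n \varepsilon^2}{\left(1 - \frac{n-1}{N}\right)(b-a)^2}\right), \qquad \varepsilon > 0 .
$$

The factor $1 - \frac{n-1}{N}$ is what distinguishes this from the classical Hoeffding bound for independent samples: the bound tightens as the sample size $n$ approaches the population size $N$.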

You can download the paper from the Bernoulli website (here) or from the HAL online open depository* (here).

 Bibtex: (soon)

# Latent Bandits.

Odalric-Ambrym Maillard, Shie Mannor
In International Conference on Machine Learning, 2014.

 Abstract: We consider a multi-armed bandit problem where the reward distributions are indexed by two sets (one for arms, one for types) and can be partitioned into a small number of clusters according to the type. First, we consider the setting where all reward distributions are known and all types have the same underlying cluster; the type’s identity is, however, unknown. Second, we study the case where types may come from different classes, which is significantly more challenging. Finally, we tackle the case where the reward distributions are completely unknown. In each setting, we introduce specific algorithms and derive non-trivial regret performance. Numerical experiments show that, in the most challenging agnostic case, the proposed algorithm achieves excellent performance in several difficult scenarios.

You can download the paper from the JMLR website (here) or from the HAL online open depository* (here).

You can download the Java code used to generate the experiments here.

 Bibtex: @inproceedings{maillard2014latent, title={Latent Bandits.}, author={Maillard, Odalric-Ambrym and Mannor, Shie}, booktitle={Proceedings of The 31st International Conference on Machine Learning}, pages={136--144}, year={2014} }

# Robust Risk-averse Multi-armed Bandits.

Odalric-Ambrym Maillard.
In Algorithmic Learning Theory, 2013.

 Abstract: We study a variant of the standard stochastic multi-armed bandit problem in which one is not interested in the arm with the best mean, but instead in the arm maximizing some coherent risk measure criterion. Further, we study the deviations of the regret instead of the less informative expected regret. We provide an algorithm, called RA-UCB, to solve this problem, together with a high probability bound on its regret.

You can download the paper from the ALT website (here) or from the HAL online open depository* (here).

 Bibtex: @incollection{Maillard2013, year={2013}, isbn={978-3-642-40934-9}, booktitle={Algorithmic Learning Theory}, volume={8139}, series={Lecture Notes in Computer Science}, editor={Jain, Sanjay and Munos, Rémi and Stephan, Frank and Zeugmann, Thomas}, title={Robust Risk-Averse Stochastic Multi-armed Bandits}, publisher={Springer Berlin Heidelberg}, author={Maillard, Odalric-Ambrym}, pages={218-233} }
 Related Publications: Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation. Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz. In The Annals of Statistics, 2013. Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences. Odalric-Ambrym Maillard, Gilles Stoltz, Rémi Munos. In Proceedings of the 24th annual Conference On Learning Theory, COLT, 2011.

# Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz.
In The Annals of Statistics, 2013.

 Abstract: We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: the kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state-of-the-art.
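To make the idea of a Kullback-Leibler upper confidence bound concrete, here is a minimal sketch of the index for the special case of Bernoulli rewards. It is an illustration under stated assumptions, not the code used in the paper: the function names, the exploration level `log(t) + c*log(log(t))` with default `c = 0`, and the bisection tolerance are choices made for this example.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, pulls, t, c=0.0, tol=1e-6):
    """Largest q in [mean, 1] such that pulls * kl(mean, q) <= log(t) + c * log(log(t)).

    `mean` is the empirical mean of the arm, `pulls` its number of draws,
    and `t` the current round (t >= 1). The exploration level is an
    assumption of this sketch.
    """
    if pulls == 0:
        return 1.0  # an unexplored arm gets the most optimistic value
    level = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / pulls
    lo, hi = mean, 1.0
    # kl(mean, q) is increasing in q on [mean, 1], so bisection finds the boundary.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```

At each round, an index policy of this kind would pull the arm with the largest index; an arm that has been drawn more often gets an index closer to its empirical mean.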

You can download the paper from the Annals of Statistics (here) or from the HAL online open depository* (here).

 Bibtex: @article{CaGaMaMuSt2013, AUTHOR = {Olivier Capp\'{e} and Aur\'{e}lien Garivier and Odalric-Ambrym Maillard and R\'{e}mi Munos and Gilles Stoltz}, TITLE = {Kullback--Leibler upper confidence bounds for optimal sequential allocation}, JOURNAL = {Ann. Statist.}, FJOURNAL = {Annals of Statistics}, YEAR = {2013}, VOLUME = {41}, NUMBER = {3}, PAGES = {1516--1541}, ISSN = {0090-5364}, DOI = {10.1214/13-AOS1119} }
 Related Publications: Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences. Odalric-Ambrym Maillard, Gilles Stoltz, Rémi Munos. In Proceedings of the 24th annual Conference On Learning Theory, COLT, 2011.

# Apprentissage Séquentiel : Bandits, Statistique et Renforcement.

Odalric-Ambrym Maillard.
PhD thesis, Université de Lille 1, October 2011.
[AFIA PhD Prize 2012]

 Abstract: This thesis studies the following topics in Machine Learning: Bandit theory, Statistical learning and Reinforcement learning. The common underlying thread is the non-asymptotic study of various notions of adaptation: to an environment or an opponent in part I about bandit theory, to the structure of a signal in part II about statistical theory, to the structure of states and rewards or to some state-model of the world in part III about reinforcement learning. First, we derive a non-asymptotic analysis of a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit that makes it possible to match, in the case of distributions with finite support, the asymptotic distribution-dependent lower bound known for this problem. Next, for a multi-armed bandit with a possibly adaptive opponent, we introduce history-based models to catch some weakness of the opponent, and show how one can benefit from such models to design algorithms adaptive to this weakness. Then we contribute to the regression setting and show how the use of random matrices can be beneficial both theoretically and numerically when the considered hypothesis space has a large, possibly infinite, dimension. We also use random matrices in the sparse recovery setting to build sensing operators that allow for recovery when the basis is far from being orthogonal. Finally, we combine parts I and II to first provide a non-asymptotic analysis of reinforcement learning algorithms such as Bellman-residual minimization and a version of Least squares temporal-difference that uses random projections and then, upstream of the Markov Decision Problem setting, discuss the practical problem of choosing a good model of states.

You can download my Ph.D. manuscript from the University website (here).

 Bibtex: @phdthesis{maillard2011apprentissage, title={APPRENTISSAGE S{\'E}QUENTIEL: Bandits, Statistique et Renforcement.}, author={Maillard, Odalric-Ambrym}, year={2011}, school={Universit{\'e} des Sciences et Technologie de Lille - Lille I} }

# Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences.

Odalric-Ambrym Maillard, Gilles Stoltz, Rémi Munos.
In Proceedings of the 24th annual Conference On Learning Theory,
COLT 2011.