Apprentissage Séquentiel : Bandits, Statistique et Renforcement.

2011, Discussing articles

Odalric-Ambrym Maillard.
PhD thesis, Université de Lille 1, October 2011.
[AFIA PhD Prize 2012]

[Download]

Abstract:

This thesis studies the following topics in Machine Learning: Bandit theory, Statistical learning and Reinforcement learning. The common underlying thread is the non-asymptotic study of various notions of adaptation: to an environment or an opponent in part I about bandit theory, to the structure of a signal in part II about statistical theory, to the structure of states and rewards or to some state-model of the world in part III about reinforcement learning.
First, we derive a non-asymptotic analysis of a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit which, in the case of distributions with finite support, matches the asymptotic distribution-dependent lower bound known for this problem. Then, for a multi-armed bandit facing a possibly adaptive opponent, we introduce history-based models that capture potential weaknesses of the opponent, and show how one can exploit such models to design algorithms adaptive to these weaknesses.
Then we contribute to the regression setting and show how the use of random matrices can be beneficial both theoretically and numerically when the considered hypothesis space has a large, possibly infinite, dimension. We also use random matrices in the sparse recovery setting to build sensing operators that allow for recovery when the basis is far from being orthogonal.
Finally, we combine Parts I and II: we first provide a non-asymptotic analysis of reinforcement-learning algorithms such as Bellman-residual minimization and a version of least-squares temporal difference that uses random projections, and then, upstream of the Markov Decision Process setting, discuss the practical problem of choosing a good model of states.
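
As an illustration of the random-projection idea from Part II, here is a minimal Python sketch (not the thesis algorithm) of least-squares regression performed after a Gaussian random projection of a large feature dictionary; the dictionary, dimensions and data below are hypothetical choices made only for the example.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a large dictionary of K features on [0, 1].
K, d, n = 500, 20, 100          # number of features, projection size, samples
X = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(n)

def features(x):
    # Fourier-type dictionary (illustrative choice).
    k = np.arange(1, K + 1)
    return np.cos(np.outer(x, k) * np.pi) / k    # shape (len(x), K)

Phi = features(X)                                # n x K design matrix

# Gaussian random projection of the K features down to d dimensions,
# then ordinary least squares in the compressed space.
G = rng.standard_normal((K, d)) / np.sqrt(d)
w, *_ = np.linalg.lstsq(Phi @ G, y, rcond=None)

# Predict at new points through the same projection.
X_test = np.linspace(0, 1, 5)
print(features(X_test) @ G @ w)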

You can download my Ph.D. manuscript from the University website (here).

Bibtex:
@phdthesis{maillard2011apprentissage,
title={APPRENTISSAGE S{\'E}QUENTIEL: Bandits, Statistique et Renforcement.},
author={Maillard, Odalric-Ambrym},
year={2011},
school={Universit{\'e} des Sciences et Technologies de Lille -- Lille I}
}

Selecting the State-Representation in Reinforcement Learning.

2011, Discussing articles

Odalric-Ambrym Maillard, Daniil Ryabko, Rémi Munos.
In Proceedings of the 24th conference on advances in Neural Information Processing Systems, NIPS ’11, pages 2627–2635, 2011.

[Download]

Abstract:

The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Knowing neither which of the models is the correct one, nor the probabilistic characteristics of the resulting MDP, the goal is to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves this, with a regret of order T^{2/3}, where T is the time horizon.
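
To make the notion of a state-representation model concrete, the toy sketch below shows candidate models as functions mapping the observation history to a finite state, for instance a window of the last k observations or the parity of the time step; these models are hypothetical examples invented for illustration, not the ones of the paper, whose algorithm must then identify, from rewards alone, a model for which the induced dynamics are Markovian.

# Illustrative "state-representation" models: each maps the past observation
# sequence to an element of a finite set. The models are hypothetical examples.
def last_k_model(k):
    def phi(history):                       # history: list of discrete observations
        return tuple(history[-k:])          # finite state = last k observations
    return phi

def parity_model(history):
    return len(history) % 2                 # finite state = parity of the time step

models = {"last-1": last_k_model(1),
          "last-2": last_k_model(2),
          "parity": parity_model}

history = [0, 1, 1, 0, 1]
for name, phi in models.items():
    print(name, "->", phi(history))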

You can download the paper from the NIPS website (here) or from the HAL online open repository* (here).

Bibtex:
@INPROCEEDINGS{Maillard2011,
author = {Odalric-Ambrym Maillard and Daniil Ryabko and R{\'e}mi Munos},
title = {Selecting the State-Representation in Reinforcement Learning},
booktitle = {Proceedings of the 24th conference on advances in Neural Information
Processing Systems},
year = {2011},
pages = {2627-2635},
keywords = {Reinforcement Learning},
location = {Granada, Spain}
}

Sparse recovery with Brownian sensing.

2011, Discussing articles

Alexandra Carpentier, Odalric-Ambrym Maillard, Rémi Munos.
In Proceedings of the 24th conference on advances in Neural Information Processing Systems, NIPS ’11, 2011.

[Download]

Abstract:

We consider the problem of recovering the parameter α ∈ R^K of a sparse function f (i.e. the number of non-zero entries of α is small compared to the number K of features) given noisy evaluations of f at a set of well-chosen sampling points. We introduce an additional randomization process, called Brownian sensing, based on the computation of stochastic integrals, which produces a Gaussian sensing matrix for which good recovery properties are proven, independently of the number of sampling points N, even when the features are arbitrarily non-orthogonal. Under the assumption that f is Hölder continuous with exponent at least 1/2, we provide an estimate â of the parameter such that ||α − â||_2 = O(||η||_2 / sqrt(N)), where η is the observation noise. The method uses a set of sampling points uniformly distributed along a one-dimensional curve selected according to the features. We report numerical experiments illustrating our method.
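
Below is a toy numerical sketch of the Brownian-sensing idea, written under simplifying assumptions and not meant as the authors' implementation: stochastic integrals of the features and of the noisy observations against M independent Brownian motions are approximated from the sampled points, which yields a Gaussian-like sensing system that is then solved by l1 minimization; the dictionary, all dimensions and the use of scikit-learn's Lasso are choices made only for the example.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Illustrative problem: f(x) = sum_k alpha_k * phi_k(x) with a sparse alpha.
K, M, N = 50, 30, 200                   # features, Brownian motions, sample points

def phi(x):                             # non-orthogonal dictionary (toy choice)
    k = np.arange(K)
    return np.cos(np.outer(x, k + 0.5) * 1.7) + 0.3

alpha_true = np.zeros(K)
alpha_true[[3, 17, 41]] = [1.0, -2.0, 0.5]               # 3-sparse parameter

x = np.linspace(0, 1, N)                                 # points along the curve [0, 1]
y = phi(x) @ alpha_true + 0.01 * rng.standard_normal(N)  # noisy evaluations of f

# Brownian sensing: approximate the stochastic integrals against M independent
# Brownian motions from the sampled points.
dB = rng.standard_normal((M, N - 1)) * np.sqrt(np.diff(x))   # Brownian increments
A = dB @ phi(x[:-1])                    # A[m, k] ~ int phi_k dB^m   (M x K, Gaussian)
b = dB @ y[:-1]                         # b[m]    ~ int f dB^m       (noisy)

# Sparse recovery by l1 minimization on the sensing system b ~ A alpha.
alpha_hat = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000).fit(A, b).coef_
print(np.flatnonzero(np.abs(alpha_hat) > 0.1))           # recovered support (toy check)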

You can download the paper from the NIPS website (here) or from the HAL online open repository* (here).

Bibtex:
@incollection{CarpentierMM11,
title = {Sparse Recovery with Brownian Sensing},
author = {Alexandra Carpentier and Odalric-Ambrym Maillard and R\'{e}mi Munos},
booktitle = {Advances in Neural Information Processing Systems 24},
editor = {J. Shawe-Taylor and R.S. Zemel and P.L. Bartlett and F. Pereira and K.Q. Weinberger},
pages = {1782--1790},
year = {2011},
publisher = {Curran Associates, Inc.}
}

Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences.

2011, Discussing articles

Odalric-Ambrym Maillard, Gilles Stoltz, Rémi Munos.
In Proceedings of the 24th annual Conference On Learning Theory,
COLT 2011.

[Download]

Abstract:

We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of Burnetas and Katehakis (1996). Our contribution is to provide a finite-time analysis of this algorithm; we get bounds whose main terms are smaller than those of previously known algorithms with finite-time analyses (such as UCB-type algorithms).
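
For intuition, here is a minimal sketch of a Kullback-Leibler upper-confidence index in the special case of Bernoulli rewards (the paper treats general finite-support distributions): the index of an arm is the largest mean q whose divergence from the empirical mean stays below an exploration threshold, computed here by bisection. The exploration term used below is a common choice, not necessarily the exact one analyzed in the paper.

import math

def kl_bernoulli(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q), with clipping for safety.
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean_hat, n_pulls, t):
    # Largest q in [mean_hat, 1] such that n_pulls * KL(mean_hat, q) <= log(t),
    # found by bisection.
    threshold = math.log(t) / n_pulls
    lo, hi = mean_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if kl_bernoulli(mean_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

# Toy usage: an arm pulled 10 times with empirical mean 0.3, at round t = 100.
print(kl_ucb_index(0.3, 10, 100))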

You can download the paper from the JMLR website (here) or from the HAL online open repository* (here).

Bibtex:
@INPROCEEDINGS{MaiMunSto2011KL,
author = {Odalric-Ambrym Maillard and R{\'e}mi Munos and Gilles Stoltz},
title = {Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler
Divergences},
booktitle = {Proceedings of the 24th annual Conference On Learning Theory},
year = {2011},
series = {COLT ’11},
keywords = {Bandit Theory},
location = {Budapest, Hungary},
numpages = {13}
}

Adaptive bandits: Towards the best history-dependent strategy.

2011, Discussing articles

Odalric-Ambrym Maillard, Rémi Munos.
In Proceedings of the 14th international conference on Artificial Intelligence and Statistics,
AI&Statistics 2011, volume 15 of JMLR W&CP, 2011.

[Download]

Abstract:

We consider multi-armed bandit games with possibly adaptive opponents. We introduce models Θ of constraints based on equivalence classes on the common history (information shared by the player and the opponent), which define two learning scenarios: (1) The opponent is constrained, i.e. she provides rewards that are stochastic functions of equivalence classes defined by some model θ* in Θ; the regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in Θ. This makes it possible to model opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits and other situations. When Θ = {θ}, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by Õ(√(TAC)), where C is the number of classes of θ. When many models are available, all known algorithms achieving a regret of order O(√T) are unfortunately not tractable and scale poorly with the number of models |Θ|. Our contribution here is to provide tractable algorithms with regret bounded by T^{2/3}C^{1/3}log(|Θ|)^{1/2}.
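
As a rough illustration of the single-model case Θ = {θ} (a sketch, not the algorithm of the paper), one can run one independent bandit strategy per equivalence class of the history and, at each round, play the strategy attached to the current class; in the toy code below the class is the previous (action, reward) pair and the per-class strategy is EXP3, both hypothetical choices made for the example.

import math, random

class Exp3:
    # Basic EXP3 instance attached to one history-equivalence class (illustrative).
    def __init__(self, n_actions, gamma=0.1):
        self.w = [1.0] * n_actions
        self.gamma = gamma

    def probs(self):
        s, k = sum(self.w), len(self.w)
        return [(1 - self.gamma) * wi / s + self.gamma / k for wi in self.w]

    def play(self):
        return random.choices(range(len(self.w)), weights=self.probs())[0]

    def update(self, action, reward):
        p = self.probs()[action]
        self.w[action] *= math.exp(self.gamma * reward / (p * len(self.w)))

def opponent_reward(cls, action):
    # Toy constrained opponent: the reward distribution depends only on the
    # current equivalence class and the played action (hypothetical choice).
    base = 0.2 if cls is None else 0.3 + 0.1 * cls[0]
    return 1.0 if random.random() < min(base + 0.2 * action, 1.0) else 0.0

random.seed(0)
n_actions, T = 3, 1000
bandits = {}                         # one EXP3 instance per equivalence class
cls = None                           # class of the initial (empty) history
for t in range(T):
    algo = bandits.setdefault(cls, Exp3(n_actions))
    a = algo.play()
    r = opponent_reward(cls, a)
    algo.update(a, r)
    cls = (a, r)                     # next round's class: last (action, reward) pair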

You can download the paper from the JMLR website (here) or from the HAL online open repository* (here).

Bibtex:
@inproceedings{MaillardM11,
author = {Odalric{-}Ambrym Maillard and R{\'{e}}mi Munos},
title = {Adaptive Bandits: Towards the best history-dependent strategy},
booktitle = {Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, {AISTATS} 2011, Fort Lauderdale, USA, April 11-13, 2011},
year = {2011},
pages = {570--578},
editor = {Geoffrey J. Gordon and David B. Dunson and Miroslav Dud{\'{\i}}k},
series = {{JMLR} Proceedings},
volume = {15}
}