Adaptive bandits: Towards the best history-dependent strategy

2011, Discussing articles

Odalric-Ambrym Maillard, Rémi Munos.
In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics,
AISTATS 2011, volume 15 of JMLR W&CP, 2011.



We consider multi-armed bandit games with possibly adaptive opponents. We introduce models Theta of constraints based on equivalence classes on the common history (information shared by the player and the opponent), which define two learning scenarios: (1) The opponent is constrained, i.e. she provides rewards that are stochastic functions of equivalence classes defined by some model theta* in Theta. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in Theta. This allows modeling opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits and other situations. When Theta={theta}, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by Õ(\sqrt{TAC}), where C is the number of classes of theta. When many models are available, all known algorithms achieving a regret of order O(\sqrt{T}) are unfortunately not tractable and scale poorly with the number of models |Theta|. Our contribution here is to provide tractable algorithms with regret bounded by T^{2/3}C^{1/3}log(|Theta|)^{1/2}.
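
To illustrate the single-model case Theta={theta}, here is a minimal sketch rather than the paper's algorithm as published. It assumes one independent Exp3 learner per equivalence class and reads A as the number of actions (the abstract does not define it). The `class_of` function, the toy opponent, the `gamma` value and all other names are hypothetical choices made for the example.

```python
import math
import random


class Exp3:
    """Standard Exp3 over n_arms actions; rewards are assumed to lie in [0, 1]."""

    def __init__(self, n_arms, gamma, rng):
        self.n_arms = n_arms
        self.gamma = gamma  # exploration rate (hypothetical tuning)
        self.rng = rng
        self.log_weights = [0.0] * n_arms

    def probabilities(self):
        # Subtract the max log-weight before exponentiating, for numerical stability.
        top = max(self.log_weights)
        weights = [math.exp(lw - top) for lw in self.log_weights]
        total = sum(weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in weights]

    def draw(self):
        probs = self.probabilities()
        arm = self.rng.choices(range(self.n_arms), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        # Importance-weighted reward estimate for the arm that was played.
        self.log_weights[arm] += self.gamma * (reward / prob) / self.n_arms


def run_per_class_exp3(n_arms, class_of, opponent, horizon, gamma=0.1, seed=0):
    """Play `horizon` rounds with one Exp3 learner per history class."""
    rng = random.Random(seed)
    learners = {}  # class label -> Exp3 instance, created on first visit
    history = []   # common history: list of (action, reward) pairs
    total_reward = 0.0
    for _ in range(horizon):
        label = class_of(history)
        if label not in learners:
            learners[label] = Exp3(n_arms, gamma, rng)
        learner = learners[label]
        arm, prob = learner.draw()
        reward = opponent(history, arm)
        learner.update(arm, prob, reward)
        history.append((arm, reward))
        total_reward += reward
    return total_reward / horizon


def last_action(history):
    # Hypothetical finite-memory model: the class is the player's previous action.
    return history[-1][0] if history else None


def switching_opponent(history, arm):
    # Toy constrained opponent: pays 1 when the player changes action, else 0.
    previous = last_action(history)
    return 1.0 if previous is not None and arm != previous else 0.0


average = run_per_class_exp3(2, last_action, switching_opponent, horizon=5000)
print(f"average reward: {average:.3f}")
```

In this toy game no fixed action does well, since repeating an action always pays 0, while the best history-dependent strategy alternates between the two actions. Conditioning the learner on the history class is what lets it find that alternation.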

You can download the paper from the JMLR website (here) or from the HAL open online repository* (here).

author = {Odalric{-}Ambrym Maillard and
R{\'{e}}mi Munos},
title = {Adaptive Bandits: Towards the best history-dependent strategy},
booktitle = {Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, {AISTATS} 2011, Fort Lauderdale, USA, April 11-13, 2011},
year = {2011},
pages = {570--578},
editor = {Geoffrey J. Gordon and David B. Dunson and Miroslav Dud{\'{\i}}k},
series = {{JMLR} Proceedings},
volume = {15}

Online Learning in Adversarial Lipschitz Environments

2010, Discussing articles

Odalric-Ambrym Maillard, Rémi Munos.
In ECML-PKDD’10, pages 305–320, 2010



We consider the problem of online learning in an adversarial environment when the reward functions chosen by the adversary are assumed to be Lipschitz. This setting extends previous works on linear and convex online learning. We provide a class of algorithms with cumulative regret upper bounded by O(\sqrt{dT ln(\lambda)}) where d is the dimension of the search space, T the time horizon, and \lambda the Lipschitz constant. Efficient numerical implementations using particle methods are discussed. Applications include online supervised learning problems for both full and partial (bandit) information settings, for a large class of non-linear regressors/classifiers, such as neural networks.
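
The full-information case can be illustrated with a deliberately simplified sketch: an exponentially weighted forecaster run over a fixed uniform grid of [0, 1] (so d = 1), used here in place of the particle methods mentioned in the abstract. The grid size, the learning rate, and the toy 1-Lipschitz reward functions are assumptions made for the example. The code tracks the forecaster's expected reward instead of sampling actions.

```python
import math


def hedge_on_grid(reward_fns, n_points):
    """Exponentially weighted forecaster over a uniform grid of [0, 1].

    Rewards are assumed to lie in [0, 1]. Returns the forecaster's expected
    cumulative reward and the cumulative reward of the best grid point.
    """
    horizon = len(reward_fns)
    grid = [i / (n_points - 1) for i in range(n_points)]
    # Classical tuning for rewards in [0, 1] when the horizon is known in advance.
    eta = math.sqrt(8.0 * math.log(n_points) / horizon)
    cumulative = [0.0] * n_points
    expected_total = 0.0
    for reward_fn in reward_fns:
        # Weights are proportional to exp(eta * cumulative reward); shift by the
        # maximum to keep the exponentials in a safe numerical range.
        top = max(cumulative)
        weights = [math.exp(eta * (c - top)) for c in cumulative]
        total = sum(weights)
        rewards = [reward_fn(x) for x in grid]
        expected_total += sum(w * r for w, r in zip(weights, rewards)) / total
        # Full information: every grid point observes its own reward.
        cumulative = [c + r for c, r in zip(cumulative, rewards)]
    return expected_total, max(cumulative)


# Toy 1-Lipschitz rewards f_t(x) = 1 - |x - c_t| with centers chosen in [0, 1].
centers = [0.3 if t % 3 else 0.8 for t in range(2000)]
reward_fns = [lambda x, c=c: 1.0 - abs(x - c) for c in centers]

expected_total, best_total = hedge_on_grid(reward_fns, n_points=51)
print(f"regret against the best grid point: {best_total - expected_total:.2f}")
```

The Lipschitz assumption is what makes a grid meaningful here: for a \lambda-Lipschitz reward function, the best grid point loses at most \lambda times half the grid spacing per round compared with the best point of the interval.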

You can download the paper from the ECML website (here) or from the HAL open online repository* (here).

author = {{Odalric-Ambrym} Maillard and
R\'{e}mi Munos},
title = {Online Learning in Adversarial Lipschitz Environments},
booktitle = {Machine Learning and Knowledge Discovery in Databases, European Conference,
{ECML} {PKDD} 2010, Barcelona, Spain, September 20-24, 2010, Proceedings,
Part {II}},
year = {2010},
pages = {305--320},
editor = {Jos\'{e} L. Balc\'{a}zar and
Francesco Bonchi and
Aristides Gionis and
Mich\`{e}le Sebag},
series = {Lecture Notes in Computer Science},
volume = {6322},
publisher = {Springer}