Selecting Near-Optimal Approximate State Representations in Reinforcement Learning.

2014, Discussing articles

Ronald Ortner, Odalric-Ambrym Maillard, Daniil Ryabko.
In Algorithmic Learning Theory, 2014.

[Download]

Abstract:
We consider a reinforcement learning setting introduced in (Maillard et al., NIPS 2011) where the learner does not have explicit access to the states of the underlying Markov decision process (MDP). Instead, she has access to several models that map histories of past interactions to states. Here we improve over known regret bounds in this setting, and more importantly generalize to the case where the models given to the learner do not contain a true model resulting in an MDP representation but only approximations of it. We also give improved error bounds for state aggregation

You can dowload the paper from the  ALT Website (here) or from the HAL online open depository* (here).

Bibtex:
@inproceedings{ortner2014selecting,
  title={Selecting near-optimal approximate state representations in reinforcement learning},
  author={Ortner, Ronald and Maillard, Odalric-Ambrym and Ryabko, Daniil},
  booktitle={Algorithmic Learning Theory},
  pages={140--154},
  year={2014},
  organization={Springer}
}
Related Publications:
Competing with an infinite set of models in reinforcement learning, Phuong Nguyen, Odalric-Ambrym Maillard, Daniil Ryabko, and Ronald Ortner, in Proceedings of the International Conference on Artificial Intelligence and Statistics (AI&STATS), volume 31 of JMLR W&CP , pages 463–471, Arizona, USA, 2013.

Optimal regret bounds for selecting the state representation in reinforcement learning.
Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, Daniil Ryabko.
In Proceedings of the 30th international conference on machine learning, ICML 2013, 2013.

Selecting the State-Representation in Reinforcement Learning.
Odalric-Ambrym Maillard, Daniil Ryabko, Rémi Munos.
In Proceedings of the 24th conference on advances in Neural Information Processing Systems, NIPS ’11, pages 2627–2635, 2011.

Advertisements

Latent Bandits.

2013, Discussing articles

Odalric-Ambrym Maillard, Shie Mannor
In International Conference on Machine Learning, 2014.

[Download]

Abstract:

We consider a multi-armed bandit problem where the reward distributions are indexed by two sets –one for arms, one for type– and can be partitioned into a small number of clusters according to the type. First,we consider the setting where all reward distributions are known and all types have the same underlying cluster, the type’s identity is, however, unknown. Second, we study the case where types may come from different classes, which is significantly more challenging. Finally, we tackle the case where the reward distributions are completely unknown. In each setting, we introduce specific algorithms and derive non-trivial regret performance. Numerical experiments show that, in the most challenging agnostic case, the proposed algorithm achieves excellent performance in several difficult scenarios.

You can dowload the paper from the JMLR website (here) or from the HAL online open depository* (here).

You can download the Java code used to generate the experiments here.

Bibtex:
@inproceedings{maillard2014latent,
  title={Latent Bandits.},
  author={Maillard, Odalric-Ambrym and Mannor, Shie},
  booktitle={Proceedings of The 31st International Conference on Machine Learning},
  pages={136--144},
  year={2014}
}

Robust Risk-averse Multi-armed Bandits.

2013, Discussing articles

Odalric-Ambrym Maillard.
In Algorithmic Learning Theory, 2013.

[Download]

Abstract:

We study a variant of the standard stochastic multi-armed bandit problem when one is not interested in the arm with the best mean, but instead in the arm maximizing some coherent risk measure criterion. Further, we are studying the deviations of the regret instead of the less informative expected regret. We provide an algorithm, called RA-UCB to solve this problem, together with a high probability bound on its regret.

You can dowload the paper from the ALT website (here) or from the HAL online open depository* (here).

Bibtex:
@incollection{Maillard2013,
year={2013},
isbn={978-3-642-40934-9},
booktitle={Algorithmic Learning Theory},
volume={8139},
series={Lecture Notes in Computer Science},
editor={Jain, Sanjay and Munos, Rémi and Stephan, Frank and Zeugmann, Thomas},
title={Robust Risk-Averse Stochastic Multi-armed Bandits},
publisher={Springer Berlin Heidelberg},
author={Maillard, Odalric-Ambrym},
pages={218-233}
}
Related Publications:
Kullback-Leibler Upper Confidence Bounds for  Optimal Sequential Allocation.
Olivier Cappé, Aurelien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz.
In The Annals of Statistics, 2013.

Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences.
Odalric-Ambrym Maillard, Gilles Stoltz, Rémi Munos.
In Proceedings of the 24th annual Conference On Learning Theory, COLT, 2011.

Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation.

2013, Discussing articles

Olivier Cappé, Aurélien Garivier,
Odalric-Ambrym Maillard,
Rémi Munos, Gilles Stoltz.
In The Annals of Statistics, 2013.

[Download]

Abstract:

We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of  Gittins (1979), based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: The kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas et Katehakis (1996), respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state-of-the-art.

You can dowload the paper from the Annals of Statistics (here) or from the HAL online open depository* (here).

Bibtex:
@article{CaGaMaMuSt2013,
AUTHOR = {Olivier Capp\'{e}, Aur\'{e}lien Garivier, Odalric-Ambrym Maillard, R\'{e}mi Munos, Gilles Stoltz},
TITLE = {Kullback–Leibler upper confidence bounds for optimal sequential allocation},
JOURNAL = {Ann. Statist.},
FJOURNAL = {Annals of Statistics},
YEAR = {2013},
VOLUME = {41},
NUMBER = {3},
PAGES = {1516-1541},
ISSN = {0090-5364},
DOI = {10.1214/13-AOS1119} }
Related Publications:
Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences.
Odalric-Ambrym Maillard, Gilles Stoltz, Rémi Munos.
In Proceedings of the 24th annual Conference On Learning Theory, COLT, 2011.

Competing with an Infinite Set of Models in Reinforcement Learning.

2013, Discussing articles

Phuong Nguyen, Odalric-Ambrym Maillard, Daniil Ryabko,Ronald Ortner.
In International Conference on Artificial Intelligence and Statistics, 2013.

[Download]

Abstract:

We consider a reinforcement learning setting where the learner also has to deal with the problem of finding a suitable state-representation function from a given set of models. This has to be done while interacting with the environment in an online fashion (no resets), and the goal is to have small regret with respect to any Markov model in the set. For this setting, recently the BLB algorithm has been proposed, which achieves regret of order T^{2/3}, provided that the given set of models is finite. Our first contribution is to extend this result to a countably infinite set of models. Moreover, the BLB regret bound suffers from an additive term that can be exponential in the diameter of the MDP involved, since the diameter has to be guessed. The algorithm we propose avoids guessing the diameter, thus improving the regret bound.

You can dowload the paper from the JMLR website (here) or from the HAL online open depository* (soon).

Bibtex:
@InProceedings{Nguyen13,
author = “Nguyen, P. and Maillard, O. and Ryabko, D. and Ortner, R. “,
title = “Competing with an Infinite Set of Models in Reinforcement Learning”,
booktitle = “AISTATS”,
series = {JMLR W\&CP 31},
address = “Arizona, USA”,
year = “2013”,
pages = “463–471” }
Related Publications:
Optimal regret bounds for selecting the state representation in reinforcement learning.
Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, Daniil Ryabko.
In Proceedings of the 30th international conference on machine learning, ICML 2013, 2013.

Selecting the State-Representation in Reinforcement Learning.
Odalric-Ambrym Maillard, Daniil Ryabko, Rémi Munos.
In Proceedings of the 24th conference on advances in Neural Information Processing Systems, NIPS ’11, pages 2627–2635, 2011.

Optimal regret bounds for selecting the state representation in reinforcement learning.

2012, Discussing articles

Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, Daniil Ryabko.
In Proceedings of the 30th international conference on machine learning, ICML 2013, 2013.

[Download]

Abstract:

We consider an agent interacting with an environment in a single stream of actions, observations, and rewards, with no reset. This process is not assumed to be a Markov Decision Process (MDP). Rather, the agent has several representations (mapping histories of past interactions to a discrete state space) of the environment with unknown dynamics, only some of which result in an MDP. The goal is to minimize the average regret criterion against an agent who knows an MDP representation giving the highest optimal reward, and acts optimally in it. Recent regret bounds for this setting are of order O(T^{2/3}) with an additive term constant yet exponential in some characteristics of the optimal MDP. We propose an algorithm whose regret after T time steps is O(\sqrt{T}), with all constants reasonably small. This is optimal in T since O(\sqrt{T}) is the optimal regret in the setting of learning in a (single discrete) MDP.

You can dowload the paper from the ICML website (here), and a corrected version from the HAL online open depository* (here), or from arXiv (here) (the correction is minor and changes only a constant 2 into 2\sqrt{2}). See also a talk presenting this work here.

Bibtex:
@inproceedings{MaillardNguyenOrtnerRyabko13,
author = “Maillard, O. and Nguyen, P. and Ortner, R. and Ryabko, D.”,
title = “Optimal Regret Bounds for Selecting the State Representation in Reinforcement Learning”,
booktitle = “International conference on Machine Learning”,
series = {JMLR W\&CP 28(1)},
address = “Atlanta, USA”,
year = “2013”,
pages = ” 543-551″
}
Related Publications:
Selecting the State-Representation in Reinforcement Learning.
Odalric-Ambrym Maillard, Daniil Ryabko, Rémi Munos.
In Proceedings of the 24th conference on advances in Neural Information Processing Systems, NIPS ’11, pages 2627–2635, 2011.

Hierarchical Optimistic Region Selection driven by Curiosity.

2012, Discussing articles

Odalric-Ambrym Maillard.
In Proceedings of the 25th conference on advances in Neural Information Processing Systems, NIPS ’12, 2012.

[Download]

Abstract:

This paper aims to take a step forwards making the term ”intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiosity-driven learning. To that end, we consider the setting where, a fixed partition \P of a continuous space \X being given, and a process \nu defined on \X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample \nu in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis.

You can dowload the paper from the NIPS website (here) or from the HAL online open depository* (here).

Bibtex:
@inproceedings{Maillard12,
title ={Hierarchical Optimistic Region Selection driven by Curiosity},
author={Odalric-Ambrym Maillard},
booktitle = {Advances in Neural Information Processing Systems 25},
editor = {P. Bartlett and F.C.N. Pereira and C.J.C. Burges and L. Bottou and K.Q. Weinberger},
pages = {1457–1465},
year = {2012}
}

Online allocation and homogeneous partitioning for piecewise constant mean-approximation.

2012, Discussing articles

 Alexandra Carpentier, Odalric-Ambrym Maillard.
In Proceedings of the 25th conference on advances in Neural Information Processing Systems, NIPS ’12, 2012.

[Download]

Abstract:

In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X \subset \Real^d. Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of \alpha-Hölder functions.

You can dowload the paper from the NIPS website (here) or from the HAL online open depository* (here).

Bibtex:
@inproceedings{CarpentierMaillard12,
title = {Online allocation and homogeneous partitioning for piecewise constant mean-approximation.},
author = {Carpentier, Alexandra and Maillard, Odalric-Ambrym},
booktitle = {Advances in Neural Information Processing Systems 25},
editor = {P. Bartlett and F.C.N. Pereira and C.J.C. Burges and L. Bottou and K.Q. Weinberger},
pages = {1970-1978},
year = {2012}
}

Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences.

2011, Discussing articles

Odalric-Ambrym Maillard, Gilles Stoltz, Rémi Munos.
In Proceedings of the 24th annual Conference On Learning Theory,
COLT 2011.

[Download]

Abstract:

We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of \cite{Burnetas96}. Our contribution is to provide a finite-time analysis of this algorithm; we get bounds whose main terms are smaller than the ones of previously known algorithms with finite-time analyses (like UCB-type algorithms).

You can dowload the paper from the JMLR website (here) or from the HAL online open depository* (here).

Bibtex:
@INPROCEEDINGS{MaiMunSto2011KL,
author = {Odalric-Ambrym Maillard and R{\’e}mi Munos and Gilles Stoltz},
title = {Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler
Divergences},
booktitle = {Proceedings of the 24th annual Conference On Learning Theory},
year = {2011},
series = {COLT ’11},
keywords = {Bandit Theory},
location = {Budapest, Hungary},
numpages = {13}
}

Adaptive bandits: Towards the best history-dependent strategy

2011, Discussing articles

Odalric-Ambrym Maillard, Rémi Munos.
In Proceedings of the 14th international conference on Artificial Intelligence and Statistics,
AI&Statistics 2011, volume 15 of JMLR W&CP, 2011.

[Download]

Abstract:

We consider multi-armed bandit games with possibly adaptive opponents. We introduce models Theta of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios: (1) The opponent is constrained, i.e. she provides rewards that are stochastic functions of equivalence classes defined by some model theta* in Theta. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in  Theta. This allows to model opponents (case 1) or strategies (case 2) which handles finite memory, periodicity, standard stochastic bandits and other situations. When  Theta={theta}, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by  Õ(\sqrt{TAC}), where  C is the number of classes of  theta. Now, when many models are available, all known algorithms achieving a nice regret  O(\sqrt{T}) are unfortunately not tractable and scale poorly with the number of models  |Theta| . Our contribution here is to provide tractable algorithms with regret bounded by  T^{2/3}C^{1/3}log(|Theta|)^{1/2}.

You can dowload the paper from the JMLR webiste (here) or from the HAL online open depository* (here).

Bibtex:
@inproceedings{MaillardM11,
author = {Odalric{-}Ambrym Maillard and
R{\'{e}}mi Munos},
title = {Adaptive Bandits: Towards the best history-dependent strategy},
booktitle = {Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, {AISTATS} 2011, Fort Lauderdale, USA, April 11-13, 2011},
year = {2011},
pages = {570–578}
editor = {Geoffrey J. Gordon and David B. Dunson and Miroslav Dud{\'{\i}}k},
series = {{JMLR} Proceedings},
year = {2011},
volume = {15}
}