Optimal regret bounds for selecting the state representation in reinforcement learning.

2012, Discussing articles

Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, Daniil Ryabko.
In Proceedings of the 30th international conference on machine learning, ICML 2013, 2013.

[Download]

Abstract:

We consider an agent interacting with an environment in a single stream of actions, observations, and rewards, with no reset. This process is not assumed to be a Markov Decision Process (MDP). Rather, the agent has several representations (mapping histories of past interactions to a discrete state space) of the environment with unknown dynamics, only some of which result in an MDP. The goal is to minimize the average regret criterion against an agent who knows an MDP representation giving the highest optimal reward, and acts optimally in it. Recent regret bounds for this setting are of order O(T^{2/3}) with an additive term constant yet exponential in some characteristics of the optimal MDP. We propose an algorithm whose regret after T time steps is O(\sqrt{T}), with all constants reasonably small. This is optimal in T since O(\sqrt{T}) is the optimal regret in the setting of learning in a (single discrete) MDP.

You can dowload the paper from the ICML website (here), and a corrected version from the HAL online open depository* (here), or from arXiv (here) (the correction is minor and changes only a constant 2 into 2\sqrt{2}). See also a talk presenting this work here.

Bibtex:
@inproceedings{MaillardNguyenOrtnerRyabko13,
author = “Maillard, O. and Nguyen, P. and Ortner, R. and Ryabko, D.”,
title = “Optimal Regret Bounds for Selecting the State Representation in Reinforcement Learning”,
booktitle = “International conference on Machine Learning”,
series = {JMLR W\&CP 28(1)},
address = “Atlanta, USA”,
year = “2013”,
pages = ” 543-551″
}
Related Publications:
Selecting the State-Representation in Reinforcement Learning.
Odalric-Ambrym Maillard, Daniil Ryabko, Rémi Munos.
In Proceedings of the 24th conference on advances in Neural Information Processing Systems, NIPS ’11, pages 2627–2635, 2011.

Hierarchical Optimistic Region Selection driven by Curiosity.

2012, Discussing articles

Odalric-Ambrym Maillard.
In Proceedings of the 25th conference on advances in Neural Information Processing Systems, NIPS ’12, 2012.

[Download]

Abstract:

This paper aims to take a step forwards making the term ”intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiosity-driven learning. To that end, we consider the setting where, a fixed partition \P of a continuous space \X being given, and a process \nu defined on \X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample \nu in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis.

You can dowload the paper from the NIPS website (here) or from the HAL online open depository* (here).

Bibtex:
@inproceedings{Maillard12,
title ={Hierarchical Optimistic Region Selection driven by Curiosity},
author={Odalric-Ambrym Maillard},
booktitle = {Advances in Neural Information Processing Systems 25},
editor = {P. Bartlett and F.C.N. Pereira and C.J.C. Burges and L. Bottou and K.Q. Weinberger},
pages = {1457–1465},
year = {2012}
}

Online allocation and homogeneous partitioning for piecewise constant mean-approximation.

2012, Discussing articles

 Alexandra Carpentier, Odalric-Ambrym Maillard.
In Proceedings of the 25th conference on advances in Neural Information Processing Systems, NIPS ’12, 2012.

[Download]

Abstract:

In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X \subset \Real^d. Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of \alpha-Hölder functions.

You can dowload the paper from the NIPS website (here) or from the HAL online open depository* (here).

Bibtex:
@inproceedings{CarpentierMaillard12,
title = {Online allocation and homogeneous partitioning for piecewise constant mean-approximation.},
author = {Carpentier, Alexandra and Maillard, Odalric-Ambrym},
booktitle = {Advances in Neural Information Processing Systems 25},
editor = {P. Bartlett and F.C.N. Pereira and C.J.C. Burges and L. Bottou and K.Q. Weinberger},
pages = {1970-1978},
year = {2012}
}

Linear regression with random projections.

2012, Discussing articles

Odalric-Ambrym Maillard, Rémi Munos.
In Journal of Machine Learning Research 2012, vol:13, pp:2735-2772.

[Download]

Abstract:

We investigate a method for regression that makes use of a randomly generated subspace G_P (of finite dimension P) of a given large (possibly infinite) dimensional function space F, for example, L_{2}([0,1]^d). G_P is defined as the span of P random features that are linear combinations of a basis functions of F weighted by random Gaussian i.i.d.~coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method share with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in G_P rather than in F, and derive excess risk bounds for a specific regression algorithm (least squares regression in G_P). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanations purpose and leaves room for future developments.

You can dowload the paper from the JMRL website (here) or from the HAL online open depository* (here).

Bibtex:
@article{MaillardMunos12,
title = {{Linear Regression with Random Projections}},
author = {Maillard, Odalric-Ambrym and Munos, R{\’e}mi},
affiliation = {SEQUEL – INRIA Lille – Nord Europe},
pages = {2735-2772},
journal = {Journal of Machine Learning Research},
volume = {13},
year = {2012}
}
Related Publications:
Scrambled Objects for Least-Squares Regression.
Odalric-Ambrym Maillard, Rémi Munos.
In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Proceedings of the 23rd conference on advances in Neural Information Processing Systems, NIPS ’10, pages 1549–1557, 2010.

Compressed Least Squares Regression.
Odalric-Ambrym Maillard, Rémi Munos.
In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Proceedings of the 22nd conference on advances in Neural Information Processing Systems,
NIPS ’09, pages 1213–1221, 2009.