Akram Baransi, Odalric-Ambrym Maillard, Shie Mannor.
In European Conference on Machine Learning (ECML), 2014.
Abstract: 
The stochastic multi-armed bandit problem is a popular model of the exploration/exploitation trade-off in sequential decision problems. We introduce a novel algorithm that is based on sub-sampling. Despite its simplicity, we show that the algorithm demonstrates excellent empirical performance against state-of-the-art algorithms, including Thompson sampling and KL-UCB. The algorithm is very flexible: it does not need to know a set of reward distributions in advance, nor the range of the rewards. It is not restricted to Bernoulli distributions and is also invariant under rescaling of the rewards. We provide a detailed experimental study comparing the algorithm to the state of the art, the main intuition that explains the striking results, and conclude with a finite-time regret analysis for this algorithm in the simplified two-armed bandit setting.
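The core idea of the sub-sampling approach in the two-armed case can be sketched as a "duel": the arm pulled more often is represented by a random sub-sample of its reward history, of the same size as the other arm's history, and the arm with the larger empirical mean on these equal-sized samples is pulled next. The sketch below is a hedged illustration of that idea, not the authors' reference implementation; the function names (`besa_two_arms`, `pull`) and the tie-breaking rule in favour of the less-pulled arm are assumptions for illustration.

```python
import random

def besa_two_arms(pull, horizon, seed=0):
    """Sketch of a sub-sampling duel between two arms.

    pull(i) returns a stochastic reward for arm i in {0, 1}.
    Returns the lists of rewards collected from each arm.
    NOTE: illustrative sketch; names and tie-breaking are assumptions.
    """
    rng = random.Random(seed)
    rewards = [[], []]
    # Initialization: pull each arm once.
    for i in (0, 1):
        rewards[i].append(pull(i))
    for _ in range(horizon - 2):
        n = [len(rewards[0]), len(rewards[1])]
        lead, lag = (0, 1) if n[0] >= n[1] else (1, 0)
        # Sub-sample n[lag] rewards without replacement from the
        # more-pulled arm's history, so both means use the same sample size.
        sub = rng.sample(rewards[lead], n[lag])
        mean_lead = sum(sub) / n[lag]
        mean_lag = sum(rewards[lag]) / n[lag]
        # Pull the duel's winner; ties favour the less-pulled arm.
        winner = lag if mean_lag >= mean_lead else lead
        rewards[winner].append(pull(winner))
    return rewards
```

Note that the comparison only ranks empirical means, so rescaling all rewards by a constant leaves the decisions unchanged, and no knowledge of the reward range is required.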

You can download the paper from the ECML website (here) or from the HAL open-access repository (here).
Bibtex: 
@incollection{baransi2014sub,
  title={Sub-sampling for Multi-armed Bandits},
  author={Baransi, Akram and Maillard, Odalric-Ambrym and Mannor, Shie},
  booktitle={Machine Learning and Knowledge Discovery in Databases},
  pages={115--131},
  year={2014},
  publisher={Springer}
}