Thompson Sampling For Combinatorial Bandits: Polynomial Regret and Mismatched Sampling Paradox

Abstract
We consider Thompson Sampling (TS) for linear combinatorial semi-bandits and
subgaussian rewards. We propose the first known TS whose finite-time regret
does not scale exponentially with the dimension of the problem. We further
show the “mismatched sampling paradox”: A learner who knows the rewards
distributions and samples from the correct posterior distribution can perform
exponentially worse than a learner who does not know the rewards and simply
samples from a well-chosen Gaussian posterior.
Type
Publication
In Neural Information Processing Systems 2024

Authors
PhD Student at CentraleSupélec
I was a PhD Student at the L2S lab at CentraleSupelec under the supervision of
Richard Combes and
Sheng Yang