Skip to Main Content
In this paper, we study the online learning problem involving rested and restless bandits, in both a centralized and a decentralized setting. In a centralized setting, the system consists of a single player/user and a set of K finite-state discrete-time Markov chains (arms) with unknown state spaces (rewards) and statistics. The objective of the player is to decide in each step which M of the K arms to play over a sequence of trials so as to maximize its long-term reward. In a decentralized setting, multiple uncoordinated players each makes its own decision on which arm to play in a step, and if two or more players select the same arm simultaneously, a collision results and none of the players selecting that arm gets a reward. The objective of each player again is to maximize its long-term reward. We first show that logarithmic regret algorithms exist both for the centralized rested and restless bandit problems. For the decentralized setting, we propose an algorithm with logarithmic regret with respect to the optimal centralized arm allocation. Numerical results and extensive discussion are also provided to highlight insights obtained from this study.