This paper solves the classical two-armed-bandit problem under the finite-memory constraint described below. Given are probability densities $p_0$ and $p_1$, and two experiments $A$ and $B$. It is not known which density is associated with which experiment. Thus the experimental outcome of experiment $A$ is as likely to be distributed according to $p_0$ as it is to be distributed according to $p_1$. It is desired to sequentially choose an experiment to be performed on the basis of past observations according to the algorithm $(T_n, e_n)$, where $T_n = f(T_{n-1}, e_n, X_n)$ is the state of memory at time $n$, $e_n = e(T_{n-1})$ is the choice of experiment, and $X_n$ is the random variable observation. The goal is to maximize the asymptotic proportion $r$ of uses of the experiment associated with density $p_0$. Let $l(x) = p_0(x)/p_1(x)$, and let $\underline{l}$ and $\bar{l}$ denote the almost everywhere greatest lower bound and least upper bound on $l(x)$. Let $\gamma = \max\{\bar{l}, 1/\underline{l}\}$. Then the optimal value of $r$, over all $m$-state algorithms $(f, e)$, will be shown to be $\gamma^{m-1}/(\gamma^{m-1} + 1)$. An $\epsilon$-optimal family of $m$-state algorithms will be demonstrated. In general, optimal algorithms do not exist, and $\epsilon$-optimal algorithms require artificial randomization.
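The setup above can be made concrete with a small simulation. The sketch below is illustrative, not the paper's construction: it assumes binary outcomes where the $p_0$ experiment yields 1 with probability $q$ and the $p_1$ experiment with probability $1-q$ (so $\gamma = q/(1-q)$), and uses a simple $m$-state random-walk memory rule in which the end states are left only with a small probability `delta` — a stand-in for the artificial randomization the abstract mentions. The function names and parameters are choices made here, not taken from the paper.

```python
import random

def bandit_bound(gamma, m):
    """Optimal asymptotic proportion over all m-state algorithms:
    gamma^(m-1) / (gamma^(m-1) + 1), as stated in the abstract."""
    g = gamma ** (m - 1)
    return g / (g + 1.0)

def simulate(m=4, n=200_000, q=0.8, delta=0.05, seed=1):
    """Simulate an illustrative m-state finite-memory rule (an
    assumption of this sketch, not the paper's epsilon-optimal family).

    Outcomes are binary: the p0 experiment yields 1 w.p. q and the p1
    experiment yields 1 w.p. 1-q, so l(1) = q/(1-q), l(0) = (1-q)/q,
    and gamma = q/(1-q). Experiment A is secretly the p0 ("correct")
    one; the algorithm never uses that fact -- its decisions depend
    only on the memory state T and the observation X, as required by
    the finite-memory constraint.
    """
    rng = random.Random(seed)
    state = m // 2                        # memory state T in {1, ..., m}
    correct = 0
    for _ in range(n):
        use_a = state > m // 2            # e_n = e(T_{n-1})
        correct += use_a
        p_one = q if use_a else 1.0 - q   # A ~ p0, B ~ p1 (unknown to rule)
        x = rng.random() < p_one
        # Evidence favoring the chosen arm pushes the state toward that
        # arm's end of the chain; contrary evidence pushes it back.
        toward_a = x if use_a else not x
        nxt = state + (1 if toward_a else -1)
        if 1 <= nxt <= m:
            # T_n = f(T_{n-1}, e_n, X_n): end states are left only with
            # probability delta (the artificial randomization).
            if state not in (1, m) or rng.random() < delta:
                state = nxt
    return correct / n

gamma = 0.8 / 0.2                         # q/(1-q) for q = 0.8
print(round(bandit_bound(gamma, 4), 4))   # -> 0.9846, i.e. 64/65
print(round(simulate(m=4), 4))            # empirical proportion r
```

For this symmetric Bernoulli instance the walk's empirical proportion lands close to the $\gamma^{m-1}/(\gamma^{m-1}+1)$ bound for small `delta`, which is consistent with the abstract's remark that near-optimal play requires randomization while exactly optimal algorithms do not exist.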