An OCBA-Based Method for Efficient Sample Collection in Reinforcement Learning

This work focuses on sample collection in reinforcement learning (RL), where interaction with the environment is typically time-consuming and expensive. In order to collect samples in a more valuable way, we propose a confidence-based sampling strategy built on the optimal computing budget allocation (OCBA) algorithm, which actively allocates computing efforts to actions with different predictive uncertainties. We estimate the uncertainty with ensembles and generalize the estimation from tabular representations to function approximations. The OCBA-based sampling strategy can be easily integrated into various off-policy RL algorithms; we take Q-learning, DQN, and SAC as examples to show the incorporation. Besides, we provide a theoretical convergence analysis and evaluate the algorithms experimentally. According to the experiments, the incorporated algorithms obtain remarkable gains compared with modern ensemble-based RL algorithms. Note to Practitioners—Reinforcement learning is a powerful tool for handling sequential decision-making problems, e.g., autonomous driving and robotics control, where behaviors typically have a long-term effect on future events. However, although RL achieves human-level control in some tasks, it severely suffers from low sample efficiency. Therefore, implementing RL in some practical areas, e.g., healthcare and rescue, is extremely hard due to the requirement of massive samples. This work aims to enhance the exploration of RL by incorporating OCBA, which provides an asymptotically optimal data-collection strategy for simulation-based optimization. Based on ensemble-based uncertainty estimation and OCBA-based action selection, the incorporated RL algorithms show competitive performance on many benchmarks and significantly reduce the sampling efforts during iterations.


I. INTRODUCTION
IN THE past decade, reinforcement learning [1] has received widespread attention for its effectiveness on a series of sequential decision-making problems. Furthermore, the combination with deep learning makes it possible to achieve human-level control on some complex tasks, e.g., playing Go [2], video games [3], [4], and robot control [5], [6], [7]. However, it usually comes with a tradeoff between exploration and exploitation; namely, it might be extravagantly expensive to reach an expected performance. During the learning process, data collection and policy evolution are strongly correlated, which implies that a well-designed strategy for action selection may benefit the following iterations.
This topic is intensively studied in multi-armed bandits (MAB) [1] and statistical ranking and selection (R&S) [8]. Both of them study sampling strategies over finite actions whose rewards have unknown distributions. Within a limited sampling budget, MAB focuses on gathering more cumulative reward (or, equivalently, less regret), while R&S aims to identify the best action with higher confidence. Classical MAB algorithms include optimistic initial values (OIV) [1], upper confidence bound (UCB) [9], and Thompson sampling (TS) [10], [11]. These algorithms perform competitively in minimizing regret but are more conservative in identifying the best action [12]. In contrast, R&S (or best arm identification, i.e., BAI, in computer science) performs better in identifying the best alternative with higher confidence. A representative approach is optimal computing budget allocation (OCBA) [13], [14], [15], which gives a closed-form budget-allocation strategy to maximize the probability of correct selection (PCS). The allocation problem can also be formulated as stochastic control, and an approximately optimal allocation strategy can be derived from the associated Bellman equation [16]. There are also approaches designed to maximize the expected value of information (EVI) [17]. For example, linear loss (LL) [17] and its variant LL1 [18], [19] allocate the sampling budget to minimize the expected opportunity cost (EOC) [17]. A more comprehensive review can be found in [20]. These algorithms provide effective sampling strategies based on posterior performance distributions, but extending them to Markov decision processes (MDPs) is not straightforward. On the one hand, the crucial interplay between states and actions is not taken into consideration. On the other hand, the performance distributions in MAB and R&S are time-invariant, whereas in MDPs they frequently vary as the behavior policy is updated.
The most widely used sampling strategy for interacting with an MDP is ϵ-greedy, which selects the empirically best action with probability 1 − ϵ or a random one with probability ϵ. It is widely applied in Q-learning [21] and Deep Q-network (DQN) type algorithms [3], [22], [23], [24]. However, the ϵ-greedy strategy allocates sampling efforts based on the estimated discounted cumulative reward without considering the predictive uncertainty. The predictive uncertainty measures the risk of false selection; without it, training can be inefficient, requiring exponentially many steps to learn [25]. Besides this, Bayesian Q-learning [26] selects the action with the maximal posterior probability of being the best, where the probability is calculated from the approximated distribution of the remaining rewards. Similarly, the predictive uncertainty can also serve as a bonus during policy evaluation, which also improves efficiency [25]. However, these methods can only be applied to MDPs with discrete states and actions, where the action values (expected cumulative rewards starting from the given state-action pair) have tabular representations. These tabular algorithms are hard to extend to more complex tasks, especially those with continuous states or actions.
In practice, many real-world tasks can be characterized as MDPs with continuous states and actions, where RL algorithms with function approximations are typically adopted. These algorithms can be categorized into deterministic-policy-based and stochastic-policy-based algorithms. Similar to ϵ-greedy, the sampling strategies of deterministic-policy-based algorithms, e.g., deep deterministic policy gradient (DDPG) [27] and twin delayed deep deterministic policy gradient (TD3) [28], explore the neighborhood of the estimated best action without considering the predictive uncertainty. For stochastic-policy-based algorithms, e.g., trust region policy optimization (TRPO) [29], proximal policy optimization (PPO) [30], and soft actor-critic (SAC) [31], [32], [33], the policies are trained with the estimated action values, without considering the predictive uncertainty either. This may cause estimation errors to accumulate along sequential states and lead to sub-optimal policies or even divergence [28].
A potential approach to addressing the above issues is incorporating predictive uncertainties into the sampling process, as in the aforementioned MAB and R&S literature. However, for continuous MDPs, efficient sampling strategies with counting-based uncertainty estimation are hard to apply, since storing statistics for infinitely many state-action pairs is impractical. Besides, the correlation between adjacent state-action pairs is not well utilized. Nevertheless, it is feasible to combine confidence-based action selection with ensemble-based predictive-uncertainty estimation, which measures the diversity of multiple function approximations. For example, Chen et al. [34] use Q-ensembles to estimate the predictive uncertainty and design a UCB-based sampling strategy for discrete actions. A similar idea arises in [35], where the UCB-based sampling strategy is further generalized to handle continuous actions by incorporating policy ensembles. As shown in [36], by incorporating techniques to enforce diversity, the predictive uncertainty can be effectively estimated. This makes it possible to further improve the performance of modern off-policy RL algorithms by incorporating confidence-based sampling strategies, e.g., UCB and TS [35]. However, such MAB algorithms are somewhat conservative in identifying the best action, since exploitation is slightly overweighted [12].
Since RL algorithms are designed to accumulate more long-term reward, a better sampling-effort allocation strategy has great potential to accelerate training. Namely, the tradeoff between exploration and exploitation should be adequately considered. On the one hand, if an action can be selected with high confidence, we tend to exploit it and focus on the remaining trajectories. On the other hand, if the confidence is low, enhancing exploration has a more important long-term effect. A similar idea of efficiently accumulating evidence for decision-making arises in the aforementioned OCBA, which provides an asymptotically optimal strategy to maximize PCS for R&S problems. Motivated by this, we design an efficient sampling strategy by extending the results of OCBA to the sampling process of RL. In the proposed algorithms, both the confidence evaluation and the sampling-effort allocation are carefully considered. We first estimate the predictive uncertainty with ensembles. Then, under the tabular setting, we propose a confidence-based sampling strategy by relating the sampling process of RL to computing-effort allocation in OCBA. We theoretically establish the convergence property and further generalize the strategy to continuous MDPs. The proposed OCBA-based sampling strategy can be incorporated into various off-policy algorithms; in this work, we take Q-learning, DQN, and SAC as instances to show the incorporation. Finally, we evaluate the proposed algorithms with experiments, where the OCBA-based RL algorithms show superior performance over the baselines. The contribution of this work can be seen from three aspects. First, compared with prior art in R&S [13], [17], [18], [19], we extend the sampling-effort allocation to the more general setting of MDPs. Second, compared with existing ensemble-based RL algorithms [34], [35], we replace the widely used UCB with OCBA, which is more effective in identifying the best actions. Third, we relax the bounded-reward assumption of prior work [37] to a more general situation, where the rewards are only assumed to have bounded mean and variance.
The main contributions are as follows: 1) We facilitate the sampling process of RL with computing-effort allocation in OCBA, based on which an efficient sampling strategy is proposed. In addition, we give a criterion to measure confidence and provide a sampling strategy to efficiently identify the best action. To the best of the authors' knowledge, this is the first work to incorporate OCBA into the sampling process of infinite-horizon MDPs. 2) We adopt ensemble-based predictive-uncertainty estimation, which makes the OCBA-based sampling strategy effective for both tabular representations and nonlinear function approximations. Besides, by incorporating policy ensembles, the OCBA-based sampling strategy can be applied to almost all common RL settings. 3) We integrate the OCBA-based sampling strategy with three modern RL algorithms, i.e., Q-learning, DQN, and SAC, and validate their effectiveness through numerical experiments. The results show that the OCBA-based sampling strategy remarkably improves the performance compared with the baseline algorithms. The rest of this paper is organized as follows. We provide the preliminaries in Section II, introduce the proposed algorithms in Section III, give the convergence analysis in Section IV, present the experimental results in Section V, and briefly conclude in Section VI.

II. PRELIMINARY
In this section, we first introduce some basic RL concepts and off-policy RL algorithms. Then, we present the main results of OCBA, which are used to develop the sampling strategy later.

A. Reinforcement Learning
We consider a sequential decision-making problem, which can be characterized as an MDP ⟨S, A, P, R, γ⟩, where S and A are the state and action spaces, R(s, a): S × A → R¹ is the reward function that assigns each state-action pair a stochastic reward whose mean and variance are bounded, and γ ∈ (0, 1) is the discount factor balancing instantaneous and future rewards. When S is discrete, P(s′|s, a): S × A × S → [0, 1] defines the probability of transiting from state s to s′ by taking action a, while when S is continuous, P(s′|s, a): S × A × S → [0, +∞) defines the corresponding state-transition probability density. The agent intends to maximize the discounted cumulative reward (the "return"), R_t = Σ_{k=0}^∞ γ^k r_{t+k}, where r_{t+k} is the reward given by the reward function R at time t + k.
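For a finite rollout, the return above can be computed by backward recursion; a minimal sketch (the function name and the truncation to a finite horizon are our own illustration, not part of the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k r_{t+k} over a finite reward sequence.

    Iterating backwards turns the sum into the recursion
    R_t = r_t + gamma * R_{t+1}, truncated at the end of the rollout.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With gamma = 0.5, the rewards (1, 1, 1) yield 1 + 0.5 + 0.25 = 1.75.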
The action-value function Q^π(s, a) = E[R_t | s_t = s, a_t = a] is defined as the expected return of taking action a in state s and then following policy π, where, when A is discrete, π(a|s): S × A → [0, 1] defines the probability of taking action a in state s, while when A is continuous, π(a|s): S × A → [0, +∞) defines the corresponding probability density. The optimal action-value function is defined as Q*(s, a) = max_π Q^π(s, a), (1) which follows the Bellman equation and can be obtained through value-iteration algorithms.
There are various algorithms for solving MDPs under different settings; we summarize representative works in TABLE I. For example, when both S and A are finite, the action-value
¹R is the set of real numbers.

TABLE I CATEGORIZATION OF REPRESENTATIVE RL ALGORITHMS
function can be described with a look-up table. Then, Q-learning [21] can be applied to estimate the optimal action-value function. It starts from a random initialization and recursively updates the estimation following

Q_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Q_t(s_t, a_t) + α_t(s_t, a_t)(r_t + γ max_{a′} Q_t(s_{t+1}, a′)), (3)

where α_t(s, a) ∈ [0, 1] is the step size, taking a non-zero value only on (s, a) = (s_t, a_t). If all state-action pairs are performed infinitely often and

Σ_t α_t(s, a) = ∞, Σ_t α_t²(s, a) < ∞ (4)

holds for all (s, a) ∈ S × A, the action values converge with probability 1 (w.p.1) to Q* [37].
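The tabular update rule can be sketched in a few lines; `q_update` is a hypothetical helper with a numpy table standing in for the look-up table:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One step of the tabular Q-learning recursion:
    Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q
```

Repeated calls with a decaying `alpha` schedule satisfying the condition above drive the table towards Q*.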
Since tabular representations have limited capacity, they are typically replaced with function approximations when the states are continuous. For example, DQN approximates the action values with neural networks and realizes value iteration by minimizing the residual error

L(θ) = (1/N) Σ_{e_t ∈ B} (r_t + γ max_{a′} Q_{θ⁻}(s_{t+1}, a′) − Q_θ(s_t, a_t))², (5)

where θ, θ⁻ are the parameters of the current network Q_θ and the target network Q_{θ⁻}, respectively. The target network is adopted to stabilize learning and is gradually updated towards θ during training. Besides, e_t = (s_t, a_t, r_t, s_{t+1}) ∈ B is the experience collected at time t, and N is the size of the mini-batch B.
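The residual error over a mini-batch can be illustrated with tabular stand-ins for the two networks (the function name and the tabular simplification are our own illustration):

```python
import numpy as np

def dqn_loss(q, q_target, batch, gamma):
    """Mean squared residual over a mini-batch B:
    (1/N) sum (r + gamma max_a' Q_target(s', a') - Q(s, a))^2.

    q, q_target: arrays of shape (|S|, |A|) standing in for the
    current and target networks. batch: experiences (s, a, r, s_next).
    """
    residuals = [
        r + gamma * q_target[s_next].max() - q[s, a]
        for (s, a, r, s_next) in batch
    ]
    return np.mean(np.square(residuals))
```

In the real algorithm, `q` is differentiable and this quantity is minimized by stochastic gradient descent while `q_target` stays frozen between updates.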
A more complex situation arises when actions are continuous, which makes the max(·) operator in (3) and (5) hard to compute. To handle this situation, some actor-critic algorithms, e.g., DDPG [27], TD3 [28], and SAC [31], [32], [33], incorporate a separate actor to infer the best action. In this way, they separate the training process into two phases, i.e., policy evaluation and policy improvement. Take SAC, a state-of-the-art off-policy actor-critic algorithm, as an example. During policy evaluation, the action-value function, i.e., the critic, is updated by minimizing the residual error

L(θ) = (1/N) Σ_{e_t ∈ B} (y_t − Q_θ(s_t, a_t))², (6)

where

y_t = r_t + γ(Q_{θ⁻}(s_{t+1}, a′_{t+1}) − ν log π_φ(a′_{t+1}|s_{t+1})) (7)

is the soft target value, and the temperature ν determines the relative importance of the entropy against the reward. The action a′_{t+1} is freshly sampled from the actor π_φ(·|s), which embeds a probability density over the continuous action space, and φ is its parameter. During policy improvement, the actor is updated by maximizing the entropy-regularized action value

J(φ) = (1/N) Σ_{e_t ∈ B} (Q_θ(s_t, a′_t) − ν log π_φ(a′_t|s_t)), (8)

where a′_t is freshly sampled from π_φ(·|s_t). In order to address the overestimation error and stabilize learning, SAC also incorporates target networks and clipped double-Q estimation, whose details can be found in [32].

B. Optimal Computing Budget Allocation
OCBA [13] is an effective technique for resource allocation in the field of simulation-based optimization [42], [43]. As shown in Fig. 1, there are k alternatives {i}_{i=1}^k whose performances follow Gaussian distributions N(μ_i, σ_i²), i = 1, …, k, respectively. The agent aims to identify the best alternative in the sense of mean performance, i.e., b = argmax_i μ_i, with higher confidence.
Consider a thought experiment in which a total of T simulation replications are allocated to the alternatives, where each alternative i is allocated N_i replications and Σ_{i=1}^k N_i = T. Then, given the samples, the confidence is evaluated by the PCS

PCS = P(μ_b ≥ μ_i, ∀i ≠ b | X), (9)

where X collects the samples, X_i^{(m)} is the m-th sample from alternative i, and P(e) denotes the probability of event e occurring. If we use a Gaussian distribution to approximate the posterior distribution of the unknown mean and adopt a non-informative prior distribution, the posterior distribution of μ_i is given in [44] as

μ̃_i ∼ N(X̄_i, σ_i²/N_i), X̄_i = (1/N_i) Σ_{m=1}^{N_i} X_i^{(m)}. (10)

Then, PCS can be rewritten as

PCS = P(μ̃_b > μ̃_i, ∀i ≠ b), (11)

which is lower bounded by the approximate probability of correct selection (APCS)

APCS = 1 − Σ_{i≠b} P(μ̃_b < μ̃_i) ≤ PCS. (12)

Since APCS provides an approximation of PCS and is easier to compute [13], it will be used as a criterion to control the sampling process of RL in the next section.
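Under the Gaussian-posterior assumption above, the APCS bound can be computed directly from the sample statistics; a minimal sketch (the function name is our own):

```python
import math

def apcs(means, stds, counts):
    """Approximate probability of correct selection:
    1 - sum_{i != b} P(posterior of mu_b < posterior of mu_i),
    with Gaussian posteriors N(mean_i, std_i^2 / N_i)."""
    b = max(range(len(means)), key=lambda i: means[i])
    total = 0.0
    for i in range(len(means)):
        if i == b:
            continue
        gap = means[b] - means[i]
        scale = math.sqrt(stds[b] ** 2 / counts[b] + stds[i] ** 2 / counts[i])
        # P(mu_b_tilde < mu_i_tilde) = Phi(-gap / scale), via erf
        total += 0.5 * (1.0 + math.erf(-gap / (scale * math.sqrt(2.0))))
    return 1.0 - total
```

When the best alternative is far ahead of the rest, the bound approaches 1; when two alternatives are statistically indistinguishable, it drops towards 0.5.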
It can be seen from (10) that the computing-budget allocation {N_i}_{i=1}^k affects PCS and APCS through the posterior distributions. Therefore, towards higher confidence, OCBA formulates the problem as

max_{N_1,…,N_k} APCS s.t. Σ_{i=1}^k N_i = T, (13)

and gives an asymptotically optimal solution

N_i/N_j = (σ_i/δ_{b,i})² / (σ_j/δ_{b,j})², i, j ≠ b, i ≠ j; N_b = σ_b √(Σ_{i≠b} N_i²/σ_i²), (14)

where δ_{b,i} = μ_b − μ_i is the performance difference. Outside the thought experiment, when allocating the budget as in (14), both μ_i and σ_i are estimated from earlier samples. Although the estimated parameters are not accurate, the allocation results still make sense by alternating between re-estimation and re-allocation [8]. Other details of OCBA can be found in [13].
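The closed-form split can be sketched as follows, assuming the asymptotic ratios stated above with distinct means (the normalization into absolute budgets is ours):

```python
import numpy as np

def ocba_allocation(mu, sigma, total_budget):
    """Asymptotically optimal OCBA budget split for k alternatives.

    Assumes a unique best alternative (distinct means), so every
    gap delta_{b,i} is non-zero for i != b.
    """
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    b = int(np.argmax(mu))                      # empirically best alternative
    delta = mu[b] - mu                          # performance gaps delta_{b,i}
    ratios = np.zeros_like(mu)
    mask = np.arange(len(mu)) != b
    # N_i proportional to (sigma_i / delta_{b,i})^2 for i != b
    ratios[mask] = (sigma[mask] / delta[mask]) ** 2
    # N_b = sigma_b * sqrt(sum_{i != b} (N_i / sigma_i)^2)
    ratios[b] = sigma[b] * np.sqrt(np.sum((ratios[mask] / sigma[mask]) ** 2))
    return ratios / ratios.sum() * total_budget
```

Alternatives that are close to the best or have high variance receive more replications, which is exactly the behavior the RL sampling strategy inherits later.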
Based on the observation that OCBA performs effectively in identifying the best alternative in R&S problems, we attempt to expand it to MDP situations and develop an OCBA-based sampling strategy for RL.The details are introduced in the next section.

III. METHODOLOGY
In this section, we propose the main OCBA-based sampling strategy. We first put forward the ensemble-based predictive-uncertainty estimation, which generalizes from tabular representations to function approximations, e.g., neural networks. Then, based on the distributions estimated with ensembles, the allocation strategy of OCBA is mapped into a sampling distribution over the action space. In principle, this sampling strategy can be integrated into all off-policy algorithms in TABLE I. In this section, we take Q-learning, DQN, and SAC as instances to show the incorporation in the situations of 1) discrete states and actions, 2) continuous states and discrete actions, and 3) continuous actions, respectively. The incorporation with other off-policy algorithms can be realized in a similar way.
We first introduce the predictive-uncertainty estimation. Since the solution of (2) is hard to compute precisely in large-scale problems, the action values are typically estimated from sample trajectories. Therefore, the estimated action values might differ across multiple runs. Namely, for each state-action pair, the estimated action value at a given time step is
a random variable that follows a certain distribution, and the estimated action value in each run is just a sample. Then, in order to estimate the distributions to make better decisions, we approximate the action values with an ensemble of M independent estimations {Q_{θ_i}}_{i=1}^M and calculate the mean and variance with

μ(Q_θ(s, a)) = (1/M) Σ_{i=1}^M Q_{θ_i}(s, a), (15)

σ²(Q_θ(s, a)) = (1/M) Σ_{i=1}^M (Q_{θ_i}(s, a) − μ(Q_θ(s, a)))², (16)

where, by slight abuse of notation, Q_θ represents the ensemble {Q_{θ_i}}_{i=1}^M, and θ = {θ_i}_{i=1}^M is the aggregation of parameters. As shown in [36], with a proper ensemble size M, σ²(Q_θ) is effective in estimating the predictive uncertainty, and the estimation precision is even better than that of the benchmark Bayesian neural networks.
For each state-action pair, we empirically use Gaussian distributions to approximate the posterior distributions of the estimated action values across multiple runs. As in OCBA, we adopt a non-informative prior distribution, and therefore the posterior distribution is estimated as N(μ(Q_θ(s, a)), σ²(Q_θ(s, a))). Then, maximizing the probability of identifying the best action with noisy action-value estimations falls into the scope of OCBA, where (14) provides an asymptotically optimal solution. Note that normalizing the results in (14) yields an allocation strategy that is independent of T, which implies that (14) actually provides a proportional relationship among the budgets allocated to the actions. Based on this observation, we normalize (14) to a sampling distribution p_i = N_i/T, with which the long-term allocation converges to the optimal solution. By replacing μ_i, σ_i in (14) with the ensemble-based estimations in (15) and (16), the probability of taking each action a under state s is obtained as

p_θ(a|s) = N_a/T, (17)

where N_a follows (14) with μ_i, σ_i replaced by μ(Q_θ(s, a)), σ(Q_θ(s, a)),

a* = argmax_a μ(Q_θ(s, a)), (18)

and

δ_{a*,a} = μ(Q_θ(s, a*)) − μ(Q_θ(s, a)). (19)

After that, we integrate the OCBA-based sampling strategy (17) into three off-policy algorithms to show the incorporation with RL.
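Putting the ensemble statistics and the normalized allocation together gives a per-state action distribution; a tabular sketch with hypothetical array shapes (the variance floor `eps` is our own safeguard against zero ensemble spread):

```python
import numpy as np

def ocba_action_distribution(q_ensemble, state, eps=1e-6):
    """Map ensemble statistics to a sampling distribution p(a|s).

    q_ensemble: array (M, |S|, |A|) of M independent action-value tables.
    Returns a probability vector over actions, proportional to the
    asymptotic OCBA split with mu_i, sigma_i replaced by the ensemble
    mean and standard deviation for the given state.
    """
    qs = q_ensemble[:, state, :]                   # (M, |A|) estimates
    mu = qs.mean(axis=0)
    sigma = qs.std(axis=0, ddof=1) + eps           # avoid zero variance
    best = int(np.argmax(mu))
    delta = mu[best] - mu
    w = np.zeros_like(mu)
    others = np.arange(len(mu)) != best
    w[others] = (sigma[others] / delta[others]) ** 2
    w[best] = sigma[best] * np.sqrt(np.sum((w[others] / sigma[others]) ** 2))
    return w / w.sum()
```

Sampling actions from this vector reproduces, in the long run, the proportional OCBA allocation for the current posterior estimates.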
A. OCBA-Based Q-Learning

Q-learning is an off-policy temporal-difference algorithm, which directly estimates the optimal action-value function Q*. Different from the original ϵ-greedy policy, inspired by the concept of APCS, the proposed sampling strategy starts by estimating the decision confidence with

C_θ(s) = 1 − Σ_{a≠a*} ∫_{ι_θ(s,a)}^{∞} f(x) dx, (20)

where ι_θ(s, a) = δ_{a*,a} / √(σ²(Q_θ(s, a*)) + σ²(Q_θ(s, a))), a* is the estimated best action defined in (18), and f(·) is the probability density function of the standard Gaussian distribution. Since APCS provides a lower bound of PCS, C_θ(s) is used as a measurement of decision confidence. If C_θ(s) is larger than a threshold η, the agent exploits a*. Otherwise, the agent explores actions to accumulate evidence for a reliable decision; in this case, the OCBA-based sampling distribution in (17) is adopted to accumulate evidence more efficiently. Indeed, similar to ϵ-greedy, we perform an ϵ-OCBA policy to alleviate model error,

p̃_θ(a|s) = (1 − ϵ)(I(C_θ(s) ≥ η) I(a = a*) + I(C_θ(s) < η) p_θ(a|s)) + ϵ/|A|, (23)

where |A| is the number of feasible actions, and I(e) is the indicator function taking value one if and only if event e occurs (and zero otherwise). The value estimations are recurrently updated following

Q_{t+1}^i(s_t, a_t) = (1 − α_t) Q_t^i(s_t, a_t) + α_t (r_t + γ max_{a′} μ(Q_t(s_{t+1}, a′))), i = 1, …, M, (24)

where α_t ∈ [0, 1] satisfies (4), and all the action-value estimations converge to Q*. The pseudo-code is provided in Algorithm 1, and the convergence analysis is given in Section IV. In this work, we finish the training process when the total sample budget T is used up.
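The confidence gate and the ϵ-OCBA mixture can be sketched as follows; the exact form of the confidence criterion here is our reading of the text and should be treated as an assumption rather than the paper's verbatim formula:

```python
import math
import numpy as np

def epsilon_ocba_policy(mu, p_ocba, sigma, eta=0.9, epsilon=0.1):
    """epsilon-OCBA action distribution: exploit argmax mu when the
    APCS-style confidence C >= eta, otherwise follow the OCBA
    distribution p_ocba; then mix with epsilon-uniform exploration.
    """
    n = len(mu)
    best = int(np.argmax(mu))
    conf = 1.0
    for a in range(n):
        if a == best:
            continue
        # standardized gap between the best action and action a
        iota = (mu[best] - mu[a]) / math.sqrt(sigma[best] ** 2 + sigma[a] ** 2)
        conf -= 0.5 * (1.0 + math.erf(-iota / math.sqrt(2.0)))
    greedy = np.zeros(n)
    greedy[best] = 1.0
    core = greedy if conf >= eta else np.asarray(p_ocba, float)
    return (1.0 - epsilon) * core + (epsilon / n) * np.ones(n)
```

With well-separated action values the policy is nearly greedy; with overlapping estimates it falls back to the OCBA exploration distribution.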

B. OCBA-Based DQN
Algorithm 1 OCBA-Based Q-Learning
Randomly initialize {Q_{θ_i}}_{i=1}^M; set the step size α_t, the step counter t = 0, the exploration parameter ϵ, and the total sample budget T; observe the initial state s_0.
repeat
  Sample action a_t with the OCBA-based sampling strategy, a_t ∼ p̃_θ(·|s_t).
  Execute a_t and observe s_{t+1}, r_t.
  Update the action-value estimations with (24).
  Set the iteration counter t ← t + 1.
until t = T.

As an extension of Q-learning, DQN replaces the look-up tables with deep neural networks. Namely, each item in the ensemble {Q_{θ_i}}_{i=1}^M is a separate neural network. Besides, OCBA-based DQN adopts the same sampling strategy as OCBA-based Q-learning, which first estimates the decision confidence with C_θ(s) and then decides to exploit a* or to explore actions with the OCBA-based sampling strategy; the overall sampling strategy is given in (23). In order to estimate the optimal action values, the agent alternates between interacting with the environment and updating the estimations. After each iteration, an experience e_t = (s_t, a_t, r_t, s_{t+1}) is stored in the replay buffer D. Then, a mini-batch B is randomly sampled from D to update the action-value estimations. The loss for each neural network is given by the residual error

L(θ_i) = (1/N) Σ_{e_t ∈ B} (r_t + γ max_{a′} Q_{θ_i⁻}(s_{t+1}, a′) − Q_{θ_i}(s_t, a_t))², (26)

where θ⁻ = {θ_i⁻}_{i=1}^M is the aggregation of target-network parameters. Finally, the neural networks are updated by performing stochastic gradient descent on L(θ_i) recurrently.
The pseudo-code is provided in Algorithm 2, where we incorporate experience replay and target networks to improve the performance. In order to stabilize training, at each time step the target networks are slightly moved towards the current networks with a small step size κ, i.e.,

θ_i⁻ ← κθ_i + (1 − κ)θ_i⁻. (27)

Algorithm 2 OCBA-Based DQN (excerpt of the main loop)
  Update the action-value estimations by performing gradient descent on L(θ_i), which is defined in (26).
  Update the target networks with (27).
  Set the iteration counter t ← t + 1.
until t = T.
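The soft target update is a one-liner over parameter lists; a sketch with numpy arrays standing in for the network weights:

```python
import numpy as np

def soft_update(theta_target, theta, kappa=0.005):
    """Move target-network parameters slightly towards the current ones:
    theta_target <- kappa * theta + (1 - kappa) * theta_target."""
    return [kappa * w + (1.0 - kappa) * wt
            for w, wt in zip(theta, theta_target)]
```

A small κ keeps the regression target in the loss nearly stationary between updates, which is what stabilizes training.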

C. OCBA-Based SAC
As a state-of-the-art actor-critic algorithm for handling MDPs with continuous actions, SAC maintains a separate actor to generate actions. Similarly, OCBA-based SAC maintains an action-value ensemble {Q_{θ_i}}_{i=1}^M and a policy ensemble {π_{φ_i}}_{i=1}^M, where π_{φ_i}(a|s) is the probability density of taking action a in state s. Correspondingly, the optimization process is divided into two phases, i.e., policy evaluation and policy improvement.
In the policy-evaluation phase, each action-value estimation is updated towards minimizing the residual error

L(θ_i) = (1/N) Σ_{e_t ∈ B} (y_t^i − Q_{θ_i}(s_t, a_t))², (28)

where

y_t^i = r_t + γ(Q_{θ_i⁻}(s_{t+1}, a_{t+1}^{i′}) − ν log π_{φ_i}(a_{t+1}^{i′}|s_{t+1})), (29)

and a_{t+1}^{i′} is freshly sampled from π_{φ_i}(·|s_{t+1}). The policy ensemble makes it possible to apply the OCBA-based sampling strategy (17) to a proposed action set. Namely, at each time step, we first collect a set of actions {a_t^i ∼ π_{φ_i}(·|s_t)}_{i=1}^M and then select one based on the OCBA-based sampling strategy.
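The selection over the proposed action set can be sketched as follows; the callables standing in for the critics and the reuse of the OCBA ratios over the finite candidate set are our own illustration of the idea:

```python
import numpy as np

def select_from_candidates(critics, state, candidates, rng, eps=1e-6):
    """Pick one action from the set proposed by the policy ensemble.

    critics: list of M callables q(state, action) -> float (stand-ins
    for the critic ensemble); candidates: the M proposed actions.
    Scores every candidate with every critic, then samples from the
    OCBA distribution over the resulting finite action set.
    """
    scores = np.array([[q(state, a) for a in candidates] for q in critics])
    mu = scores.mean(axis=0)
    sigma = scores.std(axis=0, ddof=1) + eps       # avoid zero spread
    best = int(np.argmax(mu))
    delta = mu[best] - mu
    w = np.zeros_like(mu)
    others = np.arange(len(mu)) != best
    w[others] = (sigma[others] / delta[others]) ** 2
    w[best] = sigma[best] * np.sqrt(np.sum((w[others] / sigma[others]) ** 2))
    p = w / w.sum()
    return candidates[rng.choice(len(candidates), p=p)]
```

The policy ensemble thus reduces the continuous action space to a small discrete candidate set, on which the discrete OCBA machinery applies directly.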
In the policy-improvement phase, each actor is updated by performing gradient ascent on the entropy-regularized action value

J(φ_i) = (1/N) Σ_{e_t ∈ B} (Q_{θ_i}(s_t, a_t^{i′}) − ν log π_{φ_i}(a_t^{i′}|s_t)), (30)

where a_t^{i′} is freshly sampled from π_{φ_i}(·|s_t). By training each actor with a separate critic as in (30), we further enhance the diversity of policies to learn multi-modal behaviors.
The pseudo-code is provided in Algorithm 3, where the ϵ-OCBA policy takes the same form as (23), applied over the proposed action set A_t = {a_t^i}_{i=1}^M, with

a*_t = argmax_{a_t^i ∈ A_t} μ(Q_θ(s_t, a_t^i)). (31)

Algorithm 3 OCBA-Based SAC (excerpt of the main loop)
  Update the critics by performing gradient descent on L(θ_i), which is defined in (28).
  Update the actors by performing gradient ascent on J(φ_i), which is defined in (30).
  Update the target networks with the soft update using step size κ.
  Set the iteration counter t ← t + 1.
until t = T.

IV. THEORETICAL RESULTS
In this section, we discuss the convergence property of the proposed OCBA-based Q-learning algorithm. Compared with prior works [37], [45], which consider a single action-value estimation and bounded rewards, we study a more general case, where multiple action-value estimations are updated dependently, and the rewards are only assumed to have bounded mean and variance. For ease of presentation, the notations are slightly different from those in prior sections. We replace the prior notation with Q_t = {Q_t^i}_{i=1}^M to represent the estimated action values at time step t. Besides, for consistency of notation, we use p_t and p̃_t to represent the same quantities defined in (17) and (23), respectively.
The main theorem is developed on the following lemmas.

Lemma 1 [46]: Consider a random process {Δ_t} taking values in R^n and defined as

Δ_{t+1}(x) = (1 − α_t(x)) Δ_t(x) + α_t(x) F_t(x), (33)

where α_t(x) is the step size for x at time step t, and F_t is a random process. Let ℱ_t = {F_i | ∀i < t}; then Δ_t converges to 0 w.p.1 if the following conditions are satisfied for all x:
1) 0 ≤ α_t(x) ≤ 1, Σ_t α_t(x) = ∞, and Σ_t α_t²(x) < ∞;
2) ∥E[F_t(x) | ℱ_t]∥_W ≤ γ∥Δ_t∥_W for some γ < 1;
3) Var[F_t(x) | ℱ_t] ≤ C(1 + ∥Δ_t∥_W²) for some constant C > 0.

Lemma 2 [37]: Given a finite MDP ⟨S, A, P, R, γ⟩, the Q-learning algorithm given by the update rule

Q_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Q_t(s_t, a_t) + α_t(s_t, a_t) y_t, (34)

where y_t = r_t + γ max_{a′} Q_t(s_{t+1}, a′), converges w.p.1 to the optimal action-value function as long as each state-action pair is performed infinitely often and

Σ_t α_t(s, a) = ∞, Σ_t α_t²(s, a) < ∞ (35)

holds for all (s, a) ∈ S × A.

Lemma 1 gives a general criterion for the convergence of a random process, based on which Lemma 2 establishes the convergence property of Q-learning. However, the original proof of Lemma 2 [37] assumes the rewards to be bounded. In order to relax this to rewards with bounded mean and variance, e.g., Gaussian rewards, we give a modified proof in Appendix A.
Then, based on the lemmas, we provide the main theorem as follows.
Theorem 1: Given a finite MDP ⟨S, A, P, R, γ⟩ whose rewards have bounded mean and variance, suppose the OCBA-based Q-learning agent maintains an ensemble of action-value estimations {Q_t^i}_{i=1}^M and selects actions following the ϵ-OCBA policy p̃_t, where p_t is defined in (17). The estimations are updated with

Q_{t+1}^i(s_t, a_t) = (1 − α_t(s_t, a_t)) Q_t^i(s_t, a_t) + α_t(s_t, a_t) y_t, i = 1, …, M, (37)

where y_t = r_t + γ max_{a′} μ(Q_t(s_{t+1}, a′)), α_t(s, a) ∈ [0, 1] takes non-zero values only on (s, a) = (s_t, a_t), and (35) holds for all (s, a) ∈ S × A. Then the estimations converge w.p.1 to the optimal action-value function, i.e.,

lim_{t→∞} Q_t^i = Q* w.p.1, i = 1, …, M. (38)

Proof: We start by proving that μ(Q_t) converges to the optimal action-value function Q*, which is defined in (1). By averaging both sides of (37) over i, we obtain the update rule for μ(Q_t), i.e.,

μ(Q_{t+1})(s_t, a_t) = (1 − α_t(s_t, a_t)) μ(Q_t)(s_t, a_t) + α_t(s_t, a_t) y_t. (39)

Since p̃_t(a|s) ≥ ϵ/|A| > 0 holds for all (s, a) ∈ S × A, each reachable state-action pair is performed infinitely often, and it is easy to validate that μ(Q_t) converges to the optimal action-value function, i.e.,

lim_{t→∞} μ(Q_t) = Q* w.p.1, (40)

by replacing Q_t in Lemma 2 with μ(Q_t).
Then, we prove that each action-value estimation converges to μ(Q_t). Subtracting (39) from (37) gives

Q_{t+1}^i(s_t, a_t) − μ(Q_{t+1})(s_t, a_t) = (1 − α_t(s_t, a_t))(Q_t^i(s_t, a_t) − μ(Q_t)(s_t, a_t)). (41)

Since α_t(s_t, a_t) ∈ [0, 1], the update rule of σ(Q_t) can be obtained as

σ(Q_{t+1})(s_t, a_t) = (1 − α_t(s_t, a_t)) σ(Q_t)(s_t, a_t). (42)

Further, by replacing Δ_t with σ(Q_t) and setting F_t(x) = 0, Lemma 1 establishes the convergence property, i.e.,

lim_{t→∞} σ(Q_t) = 0 w.p.1. (43)

Based on (40) and (43), all of the action-value estimations converge to Q*, which concludes the proof. □ To sum up, Theorem 1 establishes the convergence property of OCBA-based Q-learning. We remark that, as in related works [37], [45], [47], [48], the convergence property is only established under the finite setting. However, as shown in most temporal-difference-learning-based algorithms, e.g., [3], [27], [28], replacing the look-up tables with function approximations, e.g., neural networks, also yields satisfactory results.

V. EXPERIMENTAL RESULTS
In this section, we conduct several experiments to evaluate the effectiveness of the proposed OCBA-based sampling strategy.For the selection of ensemble size M, we give a qualitative analysis in Appendix B, based on which we set M = 5 for all the experiments.Besides, detailed hyper-parameters are provided in Appendix C.
First, we compare vanilla Q-learning [21], UCB-based Q-learning [34], and OCBA-based Q-learning on some MDPs with finite states and actions. Taking the MDP with state space S = {1, …, S} and action space A = {1, …, A} as an example, the state transition probability is designed as in (44), where g(s, s′, a) = min_{j∈Z} |s + jS − a + (A + 1)/2|, and Z is the set of integers. Besides, the reward is given by R(s, a) = e^{−5((2s−S−1)/(S−1))²} + 0.1 e^{−5((2a−A−1)/(A−1))²} + ς, where ς ∼ N(0, 0.1²) is Gaussian noise. We remark that UCB-based Q-learning adopts a similar ensemble-based uncertainty estimation but selects actions based on the UCB score μ(Q_θ(s, a)) + α σ(Q_θ(s, a)), where α > 0 is a temperature coefficient. In the experiments, the sizes of the decision spaces, i.e., S × A, are set as 10 × 5, 10 × 10, 30 × 10, and 50 × 50, respectively. The learned policies are tested at fixed intervals, and the performance is evaluated by cumulative rewards. The results are shown in Fig. 2, where OCBA-based Q-learning shows the best performance in all tasks. Besides, OCBA-based Q-learning has the smallest variance, which implies that the OCBA-based sampling strategy makes the training process more stable. Compared with UCB-based Q-learning, OCBA-based Q-learning converges faster, which is reasonable since identifying the best action for each state in RL is similar to R&S, where OCBA is typically more efficient. This also reveals that the OCBA-based sampling strategy makes a better tradeoff between exploration and exploitation: if the decision confidence is low, it provides an effective approach to accumulate more evidence for a reliable decision.
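The benchmark reward can be transcribed as follows; `benchmark_reward` and the injected rng object are our own naming, and the expression follows our reading of the printed formula:

```python
import numpy as np

def benchmark_reward(s, a, S, A, rng):
    """Noisy reward of the synthetic MDP used in the tabular experiments:
    a Gaussian bump over states plus a smaller bump over actions,
    corrupted by additive N(0, 0.1^2) noise."""
    state_term = np.exp(-5.0 * ((2 * s - S - 1) / (S - 1)) ** 2)
    action_term = 0.1 * np.exp(-5.0 * ((2 * a - A - 1) / (A - 1)) ** 2)
    return state_term + action_term + rng.normal(0.0, 0.1)
```

The deterministic part peaks at the central state and central action, so the agent must locate both bumps despite the observation noise.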
Then, we conduct experiments on four classic-control tasks, i.e., CartPole, MountainCar, Acrobot, and Spread, to evaluate OCBA-based DQN. The first three tasks are provided by OpenAI Gym [49], and the last is provided in [50]. Since these tasks have continuous states and discrete actions, we take SUNRISE [35] and vanilla DQN [3] as baselines. SUNRISE incorporates ensemble-based uncertainty estimation and a UCB-based sampling strategy, while vanilla DQN [3] adopts the ϵ-greedy sampling strategy. The results are shown in Fig. 3, where OCBA-based DQN outperforms the baseline algorithms in all the environments. Similar to the tabular situations in Fig. 2, OCBA-based DQN converges faster and performs more stably, which implies that both the ensemble-based predictive-uncertainty estimation and the OCBA-based sampling strategy can be effectively extended to nonlinear function approximations. Besides, the OCBA-based sampling strategy performs better in complex situations, e.g., Fig. 2d and Fig. 3d, which implies its potential to handle large-scale problems. Moreover, from these experiments, it can be found that the OCBA-based algorithms significantly accelerate training in the initial phase, which further benefits later iterations due to the sequential relationship among states. As a consequence, the OCBA-based sampling strategy shows advantages in reducing sampling efforts.
Finally, we evaluate OCBA-based SAC on some continuous-control tasks, i.e., HalfCheetah, Walker, Hopper, and Ant, which are provided by OpenAI Gym and the MuJoCo simulator. Since these tasks have continuous states and actions, we provide their dimension of states d(S) and dimension of actions d(A) in TABLE III. We take eight state-of-the-art algorithms as baselines, including three model-based algorithms (PETS [53], POPLIN [52], and ME-TRPO [54]), two on-policy model-free algorithms (TRPO [29] and PPO [30]), two off-policy model-free algorithms (TD3 [28] and SAC [32]), and an ensemble-based algorithm (SAC-version SUNRISE [35]), which incorporates ensemble-based predictive-uncertainty estimation and a UCB-based sampling strategy. The results are reported in TABLE IV. In the first three environments, OCBA-based SAC obtains the best scores, especially in HalfCheetah, where it surpasses the other algorithms by a remarkable margin. In the Ant task, OCBA-based SAC does not obtain the best score but still performs similarly to TD3 and SAC, which are state-of-the-art off-policy model-free algorithms. In TABLE IV, compared with the model-based algorithms, i.e., PETS, POPLIN, and ME-TRPO, OCBA-based SAC performs competitively in terms of sample efficiency. Empirically, model-based algorithms have higher sample efficiency but suffer from relatively worse asymptotic performance due to model bias. Therefore, OCBA-based SAC provides an effective approach that not only has competitive sample efficiency but also executes in a model-free manner. Compared with the on-policy model-free algorithms, i.e., TRPO and PPO, OCBA-based SAC performs better in all the tasks, which validates the positive impact of the OCBA-based sampling strategy. To intuitively show the superior performance of OCBA-based SAC, we provide the learning curves in Fig. 4, where OCBA-based SAC shows remarkable gains over the baseline algorithms in the first three tasks. In the last task, OCBA-based SAC and UCB-based SAC perform similarly, but compared with vanilla SAC, the proposed algorithm still shows a remarkable improvement in the initial training phases. These results verify the effectiveness of the OCBA-based sampling strategy in continuous-control tasks.
To sum up, we implement the proposed algorithms on several benchmarks and compare them with baseline algorithms. It is shown that the OCBA-based sampling strategy significantly reduces the sampling efforts while stabilizing training. Moreover, as an orthogonal technique, the OCBA-based sampling strategy can be easily combined with other techniques, e.g., dueling networks [24], prioritized experience replay [55], and noisy nets [56], to further improve performance.

VI. CONCLUSION
In this work, we focus on sample collection in RL and develop an OCBA-based sampling strategy. Firstly, we estimate the action values with ensembles, from which the predictive uncertainties can be estimated. Based on this, we develop an OCBA-based sampling strategy and integrate it with three modern off-policy algorithms, i.e., Q-learning, DQN, and SAC. Then, we establish the convergence property and evaluate its performance through several experiments. It is shown that the OCBA-based sampling strategy effectively reduces the sampling efforts and surpasses other ensemble-based algorithms by a remarkable margin.
To the best of the authors' knowledge, this work is the first to incorporate OCBA-based sampling-effort allocation into ensemble-based RL algorithms. In the current work, we only focus on off-policy RL algorithms. As for future work, we will take on-policy RL algorithms, e.g., TRPO and PPO, into consideration. Besides, it is also interesting to incorporate the OCBA-based sampling strategy into curiosity-driven algorithms, e.g., [57], and decentralized networked systems, e.g., [48], [58], [59], [60], where the predictive-uncertainty estimation is more complicated. We hope this work will shed light on related directions.

APPENDIX A MODIFIED PROOF OF LEMMA 2
In this section, we prove that Lemma 2 holds for rewards with bounded mean and variance. Compared with the original work [37], which establishes the convergence property for bounded rewards, we consider a more general situation, where the variance term in Condition 3 of Lemma 1 cannot be easily bounded by a given constant. For the rigor of the theoretical analysis, we give a modified proof below.
Firstly, by subtracting Q*(s_t, a_t) from both sides of (34), we obtain a decomposition in which ② holds since r_t is independent of F_t and s_{t+1} given (s_t, a_t), and ③ holds since, for any random variable X, var(X) = E[X²] − (E[X])² ≤ E[X²]. Because the rewards have bounded mean and variance, both var(r_t) and ||Q*||_∞ are bounded, which verifies Condition 3 in Lemma 1. Finally, since Condition 1 is naturally satisfied, Lemma 1 establishes the convergence property of Lemma 2.
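Written out, the step at ③ combines the variance identity with the bounded-moment assumption on the rewards:

```latex
\operatorname{var}(X) \;=\; \mathbb{E}\!\left[X^2\right] - \left(\mathbb{E}[X]\right)^2 \;\le\; \mathbb{E}\!\left[X^2\right],
\qquad\text{so}\qquad
\mathbb{E}\!\left[r_t^2\right] \;=\; \operatorname{var}(r_t) + \left(\mathbb{E}[r_t]\right)^2 \;<\; \infty .
```

That is, bounded mean and variance of r_t imply a bounded second moment, which is exactly what Condition 3 of Lemma 1 requires.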

APPENDIX B IMPACT OF ENSEMBLE SIZE
In this section, we give a qualitative analysis of the impact of the ensemble size M, which can be seen from the following two aspects.
Firstly, we show the impact of M from the perspective of statistical inference. In the algorithms, we use M estimated action values to approximate the posterior distributions, which can be related to estimating the parameters of a Gaussian distribution N(µ, σ²) with M samples. From the results in [61], the expected lengths of the confidence intervals for the estimated µ and σ are c_µ = 2σN_{υ/2}/√M and c_σ = σ(√((M − 1)/χ²_{M−1,1−υ/2}) − √((M − 1)/χ²_{M−1,υ/2})), respectively, where N_{υ/2}, χ²_{M−1,υ/2}, and χ²_{M−1,1−υ/2} are quantiles of the Gaussian and chi-square distributions. The lengths of the confidence intervals with respect to M are shown in Fig. 5, where the improvement from increasing M gradually diminishes. However, since the required computational resources increase linearly, we must select an appropriate ensemble size to make a tradeoff.
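The two interval lengths are easy to evaluate numerically. The sketch below assumes SciPy for the chi-square quantiles (the function name and defaults are illustrative); the formulas themselves follow the expressions above.

```python
from statistics import NormalDist
from scipy.stats import chi2

def ci_lengths(M, sigma=1.0, upsilon=0.05):
    """Expected confidence-interval lengths (c_mu, c_sigma) for the mean and
    std of N(mu, sigma^2) estimated from M samples, per the formulas above."""
    z = NormalDist().inv_cdf(1 - upsilon / 2)               # N_{upsilon/2}
    c_mu = 2 * sigma * z / M ** 0.5
    lo = ((M - 1) / chi2.ppf(1 - upsilon / 2, M - 1)) ** 0.5
    hi = ((M - 1) / chi2.ppf(upsilon / 2, M - 1)) ** 0.5
    c_sigma = sigma * (hi - lo)
    return c_mu, c_sigma

# Diminishing returns: both intervals shrink sublinearly as M grows,
# while the computational cost of the ensemble grows linearly.
for M in (3, 5, 7, 10):
    print(M, ci_lengths(M))
```

Running this reproduces the qualitative shape of Fig. 5: both c_µ and c_σ decrease monotonically in M, but with rapidly flattening gains.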
In practice, due to the noise of function approximators, e.g., neural networks, increasing the ensemble size beyond a threshold does not lead to a remarkable improvement. To show this intuitively, we take the CartPole task [49] and the OCBA-based DQN algorithm as an example of the practical impact of M. We set the ensemble size as M = 3, M = 5, and M = 7, respectively, and the results are shown in Fig. 6. It can be found that M = 5 already yields satisfactory performance, and increasing it from M = 5 to M = 7 only brings a slight improvement. This is also pointed out in [36], which shows that M = 5 provides effective estimates, especially when neural networks are adopted.
Based on the above observations, to make a tradeoff between performance and the required computational resources, we choose M = 5 for all the experiments.

Fig. 1. Illustration of OCBA. The shaded regions denote the distributions of rewards.

Fig. 2. Testing results of vanilla Q-learning, UCB-based Q-learning (denoted as UCB Q-learning), and OCBA-based Q-learning (denoted as OCBA Q-learning). The lines and shaded regions represent the mean and standard deviation across five runs.

Fig. 3. Testing results of vanilla DQN, SUNRISE (DQN version), and OCBA-based DQN (denoted as OCBA DQN). The lines and shaded regions represent the mean and standard deviation across five runs.

Fig. 4. Testing results of vanilla SAC, SUNRISE (SAC version), and OCBA-based SAC (denoted as OCBA SAC). The lines and shaded regions represent the mean and standard deviation across three runs.

Fig. 5. Length of confidence intervals with respect to M.
Algorithm fragment: …, step size for target networks κ, replay buffer D = ∅, step counter t = 0, and total sample budget T; observe the initial state s_0.
repeat
    Collect candidate actions A_t = {a_t^i ∼ π_{φ_i}(·|s_t)}, i = 1, …, M.
    Sample an action a_t ∼ p_θ̂(·|s_t) from A_t.
    Execute a_t and observe s_{t+1} and r_t.
    Store e_t = (s_t, a_t, r_t, s_{t+1}) in D.
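The interaction loop described above can be sketched as a single collection step. Here `policies`, `sampler`, and `env_step` are hypothetical callables standing in for the ensemble policies π_{φ_i}, the sampling distribution p_θ̂, and the environment, respectively; the signatures are assumptions for illustration.

```python
def collect_step(policies, sampler, env_step, state, buffer, rng):
    """One interaction step: each of the M ensemble policies proposes a
    candidate action (A_t), the sampler picks one (a_t ~ p_theta), the
    environment is stepped, and the transition e_t is stored in the buffer."""
    candidates = [pi(state, rng) for pi in policies]   # A_t = {a_t^i}
    action = sampler(state, candidates, rng)           # a_t ~ p_theta(.|s_t)
    next_state, r = env_step(state, action)            # execute a_t
    buffer.append((state, action, r, next_state))      # store e_t in D
    return next_state
```

Repeating this step until the sample budget T is exhausted, interleaved with the usual SAC updates from the replay buffer, recovers the loop in the algorithm box.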