Minimax Optimal Bandits for Heavy Tail Rewards

Stochastic multiarmed bandits (stochastic MABs) model sequential decision-making with noisy rewards, where an agent sequentially chooses actions under unknown reward distributions to minimize cumulative regret. The majority of prior works on stochastic MABs assume that the reward distribution of each action has bounded support or follows a light-tailed, i.e., sub-Gaussian, distribution. However, in a variety of decision-making problems, the rewards follow a heavy-tailed distribution. In this regard, we consider stochastic MABs with heavy-tailed rewards, whose pth moment is bounded by a constant ν_p for 1 < p ≤ 2. First, we provide a theoretical analysis of the sub-optimality of existing exploration methods for heavy-tailed rewards, proving that they do not guarantee a minimax optimal regret bound. Second, to achieve minimax optimality under heavy-tailed rewards, we propose a minimax optimal robust upper confidence bound (MR-UCB) by providing a tight confidence bound of a p-robust estimator. Furthermore, we also propose a minimax optimal robust adaptively perturbed exploration (MR-APE), which is a randomized version of MR-UCB. In particular, unlike existing robust exploration methods, both proposed methods have no dependence on ν_p. Third, we provide gap-dependent and gap-independent regret bounds for the proposed methods and prove that both guarantee the minimax optimal regret bound for the heavy-tailed stochastic MAB problem.
To the best of our knowledge, the proposed methods are the first algorithms that theoretically guarantee minimax optimality under heavy-tailed reward settings. Finally, we demonstrate the superiority of the proposed methods with respect to regret in simulations with Pareto and Fréchet noises.


I. INTRODUCTION
A STOCHASTIC multiarmed bandit (stochastic MAB) is a fundamental decision-making problem under an uncertain environment. In this problem, an intelligent agent selects an action among a set of K actions and receives a noisy reward corresponding to the selected action. The goal of the agent is to find an optimal action, whose expected reward is the maximum, over a total of T rounds. However, due to the noise in the rewards, the agent needs to estimate the true expected rewards.
Hence, the agent should explore the entire set of actions, including suboptimal ones, to obtain accurate estimates; however, selecting suboptimal actions for exploration makes the agent lose a large amount of reward compared to the optimal action. In this regard, the agent faces a natural dilemma between exploration and exploitation: collecting more information to estimate rewards accurately (exploration) and selecting the best action based on experience (exploitation).
Exploration methods should carefully balance this dilemma to efficiently find an optimal action. Specifically, the efficiency of an exploration method can be measured by the cumulative regret, defined as the expected cumulative difference between the maximum expected reward and the expected reward of the selected actions. Hence, the smaller the regret, the more efficient the algorithm. Most exploration methods come with a regret analysis to guarantee their efficiency. In particular, the majority of studies have assumed that the reward noise follows a sub-Gaussian distribution, whose tail probability is dominated by the tail of a Gaussian distribution. Under the sub-Gaussian assumption, it is well known that, for any algorithm, the gap-independent cumulative regret cannot be lower than Ω(√(KT)) [1]. Several approaches have been proposed to achieve the gap-independent lower bound Ω(√(KT)); such methods are called minimax optimal [2], [3], [4], [5], [6].
While many methods have been studied under sub-Gaussian noise, robust exploration methods still need to be developed to address real-world problems that are not covered by sub-Gaussian assumptions. However, few studies have investigated the stochastic MAB problem under heavy-tailed noise, whose pth moment is bounded by a constant ν_p. In general, heavy-tailed noise covers a wider range of noise distributions than sub-Gaussian noise. Bubeck et al. [7] first addressed heavy-tailed noise in a bandit problem by proposing a robust upper confidence bound (robust UCB) whose gap-independent regret is O((K ln(T))^{1−1/p} T^{1/p}). Furthermore, Bubeck et al. [7] showed that, for any algorithm, the worst case cumulative regret cannot be lower than Ω(K^{1−1/p} T^{1/p}) under heavy-tailed noise assumptions. To approach this lower bound, Lee et al. [8] proposed a perturbation-based exploration called APE², which achieves an O(K^{1−1/p} T^{1/p} ln(K)) regret bound that is optimal with respect to T but still suboptimal with respect to K by a factor of ln(K). Furthermore, Wei and Srivastava [9] proposed a minimax optimal strategy by modifying the upper confidence bound of the truncated mean estimator, but the truncated mean estimator requires prior knowledge of ν_p, which is not desirable in the bandit setting that assumes no prior knowledge about reward distributions. In this regard, we develop a minimax optimal exploration method for heavy-tailed rewards without using the problem-dependent knowledge ν_p.
In this article, we propose a truly minimax optimal exploration method that guarantees an O(K^{1−1/p} T^{1/p}) regret bound without using prior information on ν_p. To remove the dependency on ν_p, we employ a robust estimator proposed in our prior work [8]. In [8], we proposed a robust estimator, called a p-robust estimator, whose error probability decays exponentially fast while not depending on ν_p. More specifically, for an error bound ε, the error probability decays as O(exp(−ε n^{1−1/p})), where n is the number of samples. Note that the p-robust estimator has a worse decay rate than other existing robust mean estimators, which achieve O(exp(−ε^{p/(p−1)} n)), but it requires no prior information about ν_p, while the other estimators essentially need ν_p to guarantee the decay rate O(exp(−ε^{p/(p−1)} n)). Since the p-robust estimator has a worse decay rate, a naïve UCB-style exploration with the p-robust estimator shows a suboptimal regret bound, although it removes the dependency on ν_p. Hence, to reduce the regret bound, we modify the confidence bound of the p-robust estimator by borrowing a technique from MOSS [10]. Furthermore, we also extend the modified upper confidence bound to perturbation-based exploration and derive the condition on the perturbation for minimax optimality. From the theoretical results in Lee et al. [8], we first prove that an unbounded perturbation, whose support is unbounded, cannot achieve minimax optimality since the ln(K) factor cannot be removed. Hence, we prove that a bounded perturbation should be employed to remove the suboptimal factor ln(K). We finally propose a randomized version of robust UCB for minimax optimality by combining the bounded perturbation method and the modified confidence bound. We believe that the proposed methods can be extended to further structured bandit problems such as [11], [12], [13], [14], [15], [16], [17], and [18]. Our contributions can be summarized as follows.
1) We show that robust UCB and perturbation-based exploration cannot achieve minimax optimality. In particular, unbounded perturbation cannot remove the suboptimal factor ln(K).
2) For minimax optimality, we propose a modified upper confidence bound and prove that its gap-independent regret bound matches the lower bound Ω(K^{1−1/p} T^{1/p}). Hence, the modified upper confidence bound method is minimax optimal.
3) We also propose a bounded perturbation method combined with the modified upper confidence bound and prove that its gap-independent bound also matches the lower bound Ω(K^{1−1/p} T^{1/p}). Thus, the proposed bounded perturbation method is minimax optimal.
4) For both the modified upper confidence bound and the bounded perturbation method, we employ the p-robust estimator of [8], which does not require ν_p as prior knowledge. In this regard, the proposed exploration methods have no dependency on ν_p while, interestingly, still achieving minimax optimality.
5) We also verify that the proposed methods show superior performance compared to other robust exploration methods for heavy-tailed noise.

II. BACKGROUND
Consider a set of K actions, A := {a_1, ..., a_K}, and corresponding mean rewards {r_{a_1}, ..., r_{a_K}}. At time t = 1, 2, ..., T, an exploration algorithm chooses an action a_t and receives a noisy reward for the selected action
R_{t,a_t} := r_{a_t} + ε_{t,a_t}    (1)
where ε_{t,a_t} is an independent and identically distributed zero-mean random noise for each time step and each action.
In multiarmed bandits (MABs), r_{a_k} is generally assumed to be unknown. Then, the goal of the exploration method is to efficiently identify an optimal action a* := arg max_a r_a. The performance of an exploration strategy is often measured by the cumulative regret over the total of T rounds, defined as
R_T := E[Σ_{a∈A} Δ_a n_a(T)]
where Δ_a := r_{a*} − r_a, r_{a*} := max_{a∈A} r_a, and n_a(t) is the number of times a is selected over t rounds, i.e., n_a(t) := Σ_{s=1}^t I[a_s = a]. Hence, the smaller R_T, the better the exploration performance.
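The regret above is straightforward to compute in a simulation; a minimal sketch (the function name, toy means, and action sequence are illustrative, not from the article):

```python
def cumulative_regret(mean_rewards, chosen_actions):
    # Pseudo-regret: sum over rounds of (r_{a*} - r_{a_t}).
    r_star = max(mean_rewards)
    return sum(r_star - mean_rewards[a] for a in chosen_actions)

# Toy run with K = 3 actions over T = 6 rounds; action 2 is optimal.
means = [0.1, 0.5, 0.9]
actions = [2, 0, 2, 2, 1, 2]   # hypothetical choices of some policy
print(cumulative_regret(means, actions))  # regret from one pull each of arms 0 and 1
```

Only the suboptimal pulls (arms 0 and 1) contribute; pulls of the optimal arm add zero regret.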

A. Minimax Optimality Under Sub-Gaussian Noise
In stochastic MABs, many studies assume that each ε_{t,a} follows a σ_a-sub-Gaussian distribution with zero mean, that is, the following inequality holds for all s ∈ R and a ∈ A:
E[exp(s ε_{t,a})] ≤ exp(σ_a² s²/2).
Under the sub-Gaussian assumption, it is well known that the gap-dependent lower bound is Ω(Σ_{a_k ≠ a*} ln(T)/Δ_{a_k}) and the gap-independent lower bound is Ω(√(KT)), respectively, where Ω(·) indicates a lower bound [2], [19], [20]. There exist several minimax optimal methods that match the lower bound for stochastic MABs under sub-Gaussian noise. In this article, we introduce two well-known algorithms using confidence bounds under sub-Gaussian assumptions, which are highly related to the proposed method. Auer et al. [19] proposed the upper confidence bound (UCB) algorithm using the confidence bound of the sample mean estimator, i.e., (2 ln(T)/n_a(t))^{1/2}. Audibert and Bubeck [2] showed that the minimax regret bound of UCB is O((KT ln(T))^{1/2}), which is suboptimal. Hence, Audibert and Bubeck [2] extended UCB to the minimax optimal strategy in stochastic MABs (MOSS) by modifying the confidence bound to (ln_+(T/(K n_a(t)))/n_a(t))^{1/2}, where ln_+(x) := max(ln(x), 0). With this modification, MOSS achieves the minimax optimal regret bound O(√(KT)).
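The effect of MOSS's modification can be seen numerically; the sketch below (horizon and arm counts are illustrative) compares the classical UCB bonus √(2 ln(T)/n) with the MOSS bonus √(ln_+(T/(K n))/n):

```python
import math

def ucb_bonus(T, n):
    # Classical UCB exploration bonus [19]: sqrt(2 ln(T) / n).
    return math.sqrt(2.0 * math.log(T) / n)

def moss_bonus(T, K, n):
    # MOSS bonus [2]: sqrt(ln+(T / (K n)) / n), with ln+(x) = max(ln(x), 0).
    return math.sqrt(max(math.log(T / (K * n)), 0.0) / n)

T, K = 10_000, 10
for n in (1, 10, 100, 1000):
    print(n, round(ucb_bonus(T, n), 3), round(moss_bonus(T, K, n), 3))
# The MOSS bonus is always smaller here and vanishes once n >= T/K,
# which is what removes the extra sqrt(ln T) factor from the regret.
```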

B. Minimax Optimality Under Heavy-Tailed Noise
While the sub-Gaussian assumption has been well analyzed, only a few methods have extended the noise assumption to heavy-tailed noise whose pth moment is bounded, i.e.,
E[|ε_{t,a}|^p] ≤ ν_p
where ν_p is a constant and p ∈ (1, 2] is the maximum order of the bounded moments. For heavy-tailed rewards, the gap-dependent lower bound is Ω(Σ_{a ≠ a*} ln(T)/Δ_a^{1/(p−1)}) and the gap-independent lower bound is Ω(K^{1−1/p} T^{1/p}) [7]. However, most algorithms suffer from sub-optimality in terms of the gap-independent regret bound. Bubeck et al. [7] first proposed the robust UCB using the confidence bounds of general robust estimators. Bubeck et al. [7] analyzed the regret bound of the robust UCB, where the gap-dependent bound is O(Σ_{a ≠ a*} ln(T)/Δ_a^{1/(p−1)} + Δ_a) and the gap-independent bound is O((K ln(T))^{1−1/p} T^{1/p}), respectively. However, the robust UCB requires ν_p a priori to define the confidence bound of the robust estimator. This condition restricts the viability of robust UCB since ν_p is generally not accessible in bandit settings. Furthermore, the upper regret bound of robust UCB has the suboptimal factor ln(T)^{1−1/p}. More precisely, Lee et al. [8] proved that the lower bound of robust UCB is also Ω((K ln(T))^{1−1/p} T^{1/p}); hence, the bound is tight. In other words, unfortunately, the suboptimal factor ln(T)^{1−1/p} cannot be removed. A similar restriction also appears in [21].
Vakili et al. [21] proposed a deterministic sequencing of exploration and exploitation (DSEE) that explores every action with a deterministic sequence. It is shown that DSEE has a gap-dependent bound of O(ln(T)), but this result holds only when ν_p and the minimum gap min_{a ≠ a*} Δ_a are known as prior information. Furthermore, in practice, DSEE often shows poor performance since the deterministic sequence cannot perform adaptive exploration. While other existing robust exploration methods do not guarantee minimax optimality, Wei and Srivastava [9] recently proposed a robust version of MOSS that guarantees O(K^{1−1/p} T^{1/p}); however, robust MOSS has the limitation that ν_p is essential prior information for achieving minimax optimality. Agrawal et al. [22] also proposed KL_inf-UCB by adding two variants to the original UCB algorithm and proved that the problem-dependent regret bound of KL_inf-UCB is O(log(T)^{2/3}); however, it also requires ν_p as prior knowledge to achieve the proposed regret bound.
The dependence on ν_p is a crucial issue in a bandit problem since ν_p is problem-dependent prior information. Cesa-Bianchi et al. [23] first removed this dependence, only for p = 2, by developing a robust estimator using the influence function of Catoni's M estimator [24]. For exploration, the Boltzmann-Gumbel exploration (BGE) was proposed. We observe the interesting fact that the robust estimator proposed in [23] has a weak tail bound, whose error probability decays more slowly than that of the original Catoni's M estimator [24]. Nevertheless, BGE achieves a gap-dependent bound of O(Σ_{a ≠ a*} ln(T Δ_a²)²/Δ_a + Δ_a) and a gap-independent bound of O(√(KT) ln(K)) for p = 2. While the ln(K) factor remains, BGE has a better bound than robust UCB in terms of T. Lee et al. [8] extended this estimator to a p-robust estimator for p ∈ (1, 2] and applied a perturbation-based exploration inspired by BGE, named adaptively perturbed exploration with a p-robust estimator (APE²). By combining the p-robust estimator and perturbation methods, Lee et al. [8] showed that APE² achieves a regret bound of O(K^{1−1/p} T^{1/p} ln(K)), which is optimal with respect to T but suboptimal with respect to K by the factor ln(K).
In this article, we apply the idea of MOSS to our p-robust estimator in [8], where the upper confidence bound of the p-robust estimator is modified to be tighter than that of the original UCB. By combining MOSS and the p-robust estimator, we enjoy both the benefit of MOSS, i.e., minimax optimality, and that of the p-robust estimator, i.e., independence of ν_p. We also propose a randomized version of robust UCB by extending this modification to the perturbation-based exploration method. A comparison of existing robust exploration methods, including ours, is shown in Table I, which lists the gap-dependent and gap-independent regret bounds and the required prior information.

III. SUB-OPTIMALITY OF EXISTING METHODS
In this section, we discuss pessimistic results about existing methods. First, we restate the sub-optimality of the robust UCBs of Bubeck et al. [7]. Second, we newly prove the sub-optimality of the unbounded perturbation methods in Lee et al. [8]. Perturbation-based exploration employs a random perturbation to encourage exploration. Hence, its cumulative regret is closely related to the distribution of the random perturbation, and Lee et al. [8] revealed the relationship between the distribution of the perturbation and cumulative regret bounds. Unfortunately, from the results of Lee et al. [8], we prove that perturbation-based exploration is minimax suboptimal if the random perturbation is unbounded.

A. Sub-Optimality of Robust UCBs
The robust UCB employs a class of robust estimators that satisfies the following assumption.
Assumption 1 (in [7]): Let {R_k}_{k=1}^∞ be i.i.d. random variables with a finite pth moment for p ∈ (1, 2]. Assume that, for all δ ∈ (0, 1) and n observations, there exists an estimator r̂_n(η, ν_p, δ) with a parameter η such that
P(r̂_n > r + ν_p^{1/p} (η ln(1/δ)/n)^{1−1/p}) ≤ δ
and
P(r̂_n < r − ν_p^{1/p} (η ln(1/δ)/n)^{1−1/p}) ≤ δ.
There exist several robust estimators that satisfy Assumption 1, such as the truncated mean, median of means, and Catoni's M estimator [24]. This assumption naturally provides the confidence bound of the estimator r̂_n; hence, we can easily employ UCB-based exploration with the robust estimators in Assumption 1. However, we note that the estimators in Assumption 1 essentially require ν_p as prior information to define the estimator, which is not available in the bandit setting.
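For concreteness, here is a sketch of one estimator in this class, the truncated empirical mean, following the general recipe in [7] (the exact truncation level is a plausible choice, not necessarily the paper's). Note how ν_p enters the truncation threshold, which is exactly the prior knowledge a bandit agent lacks:

```python
import math, random

def truncated_mean(samples, nu_p, p, delta):
    # Keep observation X_i only if |X_i| <= (nu_p * i / ln(1/delta))^(1/p):
    # large (likely heavy-tail) values are discarded, but the threshold
    # itself needs the moment bound nu_p in advance.
    n = len(samples)
    total = 0.0
    for i, x in enumerate(samples, start=1):
        if abs(x) <= (nu_p * i / math.log(1.0 / delta)) ** (1.0 / p):
            total += x
    return total / n

random.seed(0)
# Zero-mean heavy-tailed toy noise: Pareto(1.5), shifted by its mean 3.
noise = [random.paretovariate(1.5) - 3.0 for _ in range(2000)]
print(truncated_mean(noise, nu_p=10.0, p=1.4, delta=0.01))
```

With heavy-tailed input the estimate stays near the true mean 0, whereas the plain empirical mean would be destabilized by rare huge samples.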
Using the confidence bound in Assumption 1, we can derive a robust UCB strategy. At every step, robust UCB chooses an action as
a_t = arg max_{a∈A} r̂_{t−1,a} + ν_p^{1/p} (η ln(1/δ)/n_a(t−1))^{1−1/p}    (7)
where r̂_{t−1,a} is an estimator that satisfies Assumption 1 with δ := t^{−2}. In our previous work, we have shown that there exists a MAB problem that makes the strategy (7) have the following lower bound on R_T.
Theorem 1 (in [8]): There exists a K-armed stochastic bandit problem for which the regret of robust UCB is, for sufficiently large T, lower bounded as
R_T = Ω((K ln(T))^{1−1/p} T^{1/p}).
The proof can be found in [8]. Theorem 1 clearly shows that the lower regret bound of the robust UCB is Ω((K ln(T))^{1−1/p} T^{1/p}): there always exists a MAB problem that causes the suboptimal regret bound for the robust UCB. Hence, the robust UCB cannot remove the suboptimal factor ln(T)^{1−1/p} from the gap-independent regret bound. Consequently, the robust UCB has two main drawbacks for a stochastic MAB. First, Theorem 1 tells us the pessimistic fact that the sub-optimality of the robust UCB is caused by a fundamental issue of the exploration strategy, rather than by a lack of mathematical techniques such as employing a loose upper bound. Second, the estimators employed in the robust UCB usually require ν_p as prior knowledge.

B. Sub-Optimality of Adaptively Perturbed Exploration With Unbounded Perturbation
Lee et al. [8] proposed APE², which guarantees minimax optimality with respect to T while removing the dependency on ν_p. However, it still has the limitation that its gap-independent regret bound is suboptimal with respect to K. In particular, we prove that unbounded perturbation cannot guarantee minimax optimality in heavy-tailed MAB problems.
In APE², Lee et al. [8] extended Catoni's M estimator by generalizing Catoni's influence function, where the new influence function ψ_p(x) is defined as
ψ_p(x) := I[x ≥ 0] ln(1 + x + b_p x^p) − I[x < 0] ln(1 − x + b_p|x|^p) = sgn(x) ln(1 + |x| + b_p|x|^p)
where sgn(x) is the sign of x, I[·] is the indicator function, and b_p > 0 is a constant depending on p. Using ψ_p(x), Lee et al. [8] define a p-robust estimator and derive its confidence bounds as follows.
Theorem 2 (in [8]): Let {Y_k}_{k=1}^∞ be i.i.d. random variables sampled from a heavy-tailed distribution with a finite pth moment, and define an estimator as
r̂_n := (c n^{1/p}/n) Σ_{k=1}^n ψ_p(Y_k/(c n^{1/p}))    (10)
where c > 0 is an arbitrary constant. Then, for all δ ∈ (0, 1),
P(r̂_n > r + c(b_p ν_p/c^p + ln(1/δ))/n^{1−1/p}) ≤ δ    (11)
P(r̂_n < r − c(b_p ν_p/c^p + ln(1/δ))/n^{1−1/p}) ≤ δ.    (12)
The entire proof can be found in [8]. Compared with Assumption 1, the p-robust estimator has the clear benefit that it does not depend on ν_p, while the robust estimators defined in Assumption 1 require ν_p as prior knowledge to guarantee their confidence bounds. This property of the p-robust estimator makes APE² independent of ν_p. However, the p-robust estimator has a drawback: the confidence bound of (10) is looser than that of Assumption 1 for a fixed δ.
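A sketch of the p-robust estimator, assuming the influence function ψ_p(x) = sgn(x) ln(1 + |x| + b_p|x|^p) and the scaling r̂ = (c n^{1/p}/n) Σ_k ψ_p(Y_k/(c n^{1/p})); the constant b_p = 1 and all inputs are illustrative. The logarithmic growth of ψ_p is what caps the damage of a single heavy-tailed outlier, with no ν_p anywhere:

```python
import math, random

def psi_p(x, p, b_p=1.0):
    # Influence function: odd, ~linear near 0, logarithmic in the tails.
    return math.copysign(math.log(1.0 + abs(x) + b_p * abs(x) ** p), x)

def p_robust_estimate(samples, p, c=1.0):
    # r_hat = (c n^{1/p} / n) * sum_k psi_p(Y_k / (c n^{1/p})).
    n = len(samples)
    scale = c * n ** (1.0 / p)
    return (scale / n) * sum(psi_p(y / scale, p) for y in samples)

random.seed(1)
clean = [random.gauss(0.5, 1.0) for _ in range(5000)]
spiked = clean + [1e6]              # one catastrophic outlier
print(round(p_robust_estimate(clean, p=1.5), 3))
print(round(p_robust_estimate(spiked, p=1.5), 3))
# The empirical mean of `spiked` jumps to ~200; the robust estimate moves
# only by roughly log(outlier) * n^{1/p - 1}.
```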
By combining the estimator in (10) with a perturbation method, APE² selects an action based on the following decision rule:
a_t = arg max_{a∈A} r̂_{t−1,a} + β_{t−1,a} G_{t,a}
where β_{t−1,a} := c/n_a(t−1)^{1−1/p}, n_a(t−1) is the number of times a has been selected, G_{t,a} is sampled from F, and F(g) := P(G < g).
The lower bound of APE² is derived by constructing a counterexample as follows.
Theorem 3 (in [8]): Let F(g) be a log-concave CDF. Then, for sufficiently large T, there exists a K-armed stochastic bandit problem where the regret of APE² is lower bounded as
R_T = Ω(F^{−1}(1 − 1/K) K^{1−1/p} T^{1/p}).
The proof is done by constructing a worst case bandit problem whose rewards are deterministic. When the rewards are deterministic, no exploration is required, but APE² unnecessarily explores suboptimal actions due to the perturbation. In other words, the lower bound captures the regret of APE² caused by useless exploration. The lower bound tells us that the tail behavior of the perturbation plays a crucial role in determining the effect of K on the regret bound. From the lower bound, we can derive a novel pessimistic result on APE² with unbounded perturbation.
Corollary 1: If the support of F(g) is bounded, then the lower bound on R_T of APE² becomes Ω(K^{1−1/p} T^{1/p}).
Corollary 1 is induced by Theorem 3. Due to the term F^{−1}(1 − 1/K), if G has an unbounded support, then F^{−1}(1 − 1/K) grows as K increases and, thus, the lower bound of APE² has a suboptimal dependency on K. In other words, if G is unbounded, then the lower bound of APE² cannot match Ω(K^{1−1/p} T^{1/p}). From this observation, we conclude that a bounded perturbation is needed to obtain minimax optimality. Furthermore, from the observation of the sub-optimality of the robust UCB, we argue that the confidence bound of the robust estimator in [7] is too loose to capture the error tightly and, thus, causes unnecessary exploration. To handle this issue, we make the confidence bound of the p-robust estimator much tighter and extend the modified confidence bound to the perturbation method.
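The growth of F^{−1}(1 − 1/K) can be checked numerically; the sketch below uses a standard Gumbel perturbation as a representative unbounded, log-concave choice (an illustration, not necessarily the perturbation used in [8]):

```python
import math

def gumbel_quantile(q):
    # Standard Gumbel: F(g) = exp(-exp(-g)), so F^{-1}(q) = -ln(-ln(q)).
    return -math.log(-math.log(q))

for K in (10, 100, 1000, 10_000):
    print(K, round(gumbel_quantile(1.0 - 1.0 / K), 2))
# F^{-1}(1 - 1/K) grows like ln(K), inflating the lower bound of APE^2;
# for any perturbation supported on [-1, 1] it can never exceed 1.
```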

IV. MINIMAX OPTIMAL STRATEGY FOR HEAVY-TAILED REWARDS
We propose two novel exploration methods that guarantee minimax optimality under heavy-tailed noise. The first is a minimax optimal robust upper confidence bound (MR-UCB), whose confidence bound is modified to a much tighter one, and the second is a minimax optimal robust adaptively perturbed exploration (MR-APE), which is a randomized version of robust UCB using a bounded perturbation. The main benefit of MR-UCB and MR-APE is not only minimax optimality but also the minimal requirement of prior knowledge.

A. Minimax Optimal Robust UCB
In general, the regret bound of UCB depends on the convergence rate of the estimator. In particular, a robust estimator should satisfy two key properties to achieve efficient exploration: first, its error probability should decay exponentially fast; second, it should admit a tight confidence bound. Our main idea for designing a minimax optimal exploration method without dependency on ν_p is to employ the p-robust estimator with a tight confidence bound. The p-robust estimator satisfies exponential decay by Theorem 2. However, if we employ the naïve confidence bound in (11) and (12), the resulting minimax regret bound is suboptimal with respect to T. Hence, we propose a tighter confidence bound than the naïve one. In MR-UCB, the selection rule is defined as
a_t = arg max_{a∈A} r̂_{t−1,a} + β_{t−1,a},  β_{t−1,a} := c ln_+(T/(K n_a(t−1)))/n_a(t−1)^{1−1/p}
where ln_+(x) := max(ln(x), 1). Similar to MOSS [2], we simply modify the confidence bound from the naïve O(ln(T)) to the tighter O(ln(T/n_a(t−1))), which becomes tighter than O(ln(T)) as the number of times a is selected increases. We then derive the gap-dependent and gap-independent regret bounds as follows.
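A sketch of the MR-UCB selection index, assuming the bonus takes the MOSS-style form c · ln_+(T/(K n_a))/n_a^{1−1/p} (a reconstruction; the constant c, horizon, and running estimates are illustrative):

```python
import math

def ln_plus(x):
    return max(math.log(x), 1.0)

def mr_ucb_index(r_hat, n_a, T, K, p, c=1.0):
    # p-robust estimate plus a MOSS-style bonus; note that no nu_p appears,
    # and ln+(T / (K n_a)) replaces the naive ln(T).
    return r_hat + c * ln_plus(T / (K * n_a)) / n_a ** (1.0 - 1.0 / p)

T, K, p = 10_000, 3, 1.5
r_hats = [0.40, 0.55, 0.50]      # hypothetical p-robust estimates
counts = [200, 150, 5]           # arm 2 is heavily under-explored ...
indices = [mr_ucb_index(r, n, T, K, p) for r, n in zip(r_hats, counts)]
print(max(range(K), key=indices.__getitem__))  # ... so MR-UCB pulls arm 2
```

As an arm's count grows, its bonus shrinks both through n_a^{1−1/p} and through the ln_+(T/(K n_a)) factor.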

B. Minimax Optimal Robust Adaptively Perturbed Exploration
MR-APE is a randomized version of MR-UCB that replaces optimism with simple randomization. Instead of directly employing the confidence bound of the p-robust estimator, MR-APE employs a value randomly chosen between the lower and upper confidence intervals using a bounded perturbation within [−1, 1]. The selection rule of MR-APE is defined as
a_t = arg max_{a∈A} r̂_{t−1,a} + (1 + ε)β_{t−1,a} G_{t,a}
where G_{t,a} is a bounded random perturbation within [−1, 1] and ε is an auxiliary hyperparameter. If the sampled perturbation is negative, the perturbation term can be interpreted as a lower confidence bound; otherwise, it is similar to the upper confidence bound. Hence, MR-APE employs both lower and upper confidence bounds for decision-making. Furthermore, if we set G_{t,a} = 1 and ε = 0 almost surely, then MR-APE is equivalent to MR-UCB. The entire algorithm is summarized in Algorithm 1.
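A sketch of the randomized selection, assuming the index r̂ + (1 + ε)β G with β in the MOSS-style form used by MR-UCB; the uniform perturbation on [−1, 1] is just one admissible bounded choice, and all constants are illustrative:

```python
import math, random

def ln_plus(x):
    return max(math.log(x), 1.0)

def mr_ape_action(r_hats, counts, T, K, p, eps=0.1, c=1.0, rng=random):
    # Randomize MR-UCB's bonus with a bounded perturbation G in [-1, 1]:
    # G < 0 acts like a lower confidence bound, G > 0 like an upper one;
    # G = 1 and eps = 0 recover MR-UCB exactly.
    def index(a):
        beta = c * ln_plus(T / (K * counts[a])) / counts[a] ** (1.0 - 1.0 / p)
        g = rng.uniform(-1.0, 1.0)    # fresh bounded perturbation per arm
        return r_hats[a] + (1.0 + eps) * beta * g
    return max(range(K), key=index)

random.seed(42)
T, K, p = 10_000, 3, 1.5
picks = [mr_ape_action([0.40, 0.55, 0.50], [200, 150, 5], T, K, p)
         for _ in range(1000)]
print({a: picks.count(a) for a in range(K)})  # the under-explored arm 2 dominates
```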

C. Theoretical Analysis
We provide gap-dependent and gap-independent upper bounds on the cumulative regret of MR-UCB and MR-APE. First, we derive the gap-dependent regret bounds and then extend them to gap-independent bounds. The main idea of our proof is to decompose the event of selecting suboptimal actions into three events. Before the decomposition, we assume r_{a_1} > r_{a_2} > ··· > r_{a_K} without loss of generality. Then, let us define Z := min_{1<t≤T} r̂_{t−1,a*} + β_{t−1,a*} and z_a := r_{a*} − Δ_a/6. Using Z and z_a, we define the event Ē_a := {z_a ≤ Z}. Based on Ē_a, we decompose the expected regret, for any k_0 ∈ {1, ..., K}, as
R_T ≤ T Δ_{a_{k_0}} + T Σ_{j=k_0+1}^K (Δ_{a_j} − Δ_{a_{j−1}}) P(Ē^c_{a_j})    (18)
     + Σ_{j=k_0+1}^K Δ_{a_j} Σ_{t=1}^T P(E_{t,a_j} ∩ Ē_{a_j})    (19)
where E_{t,a} := {a_t = a} indicates the event of selecting a at time t. By bounding each term, we can derive the gap-dependent and gap-independent upper bounds. We note that this decomposition technique follows the proof of MOSS [2] and holds generally without any special assumption on reward distributions. However, in [2], the remaining part of the proof of the minimax optimality of MOSS heavily depends on the sub-Gaussian assumption.
In particular, to prove minimax optimality, the second term is bounded in [2] using Hoeffding's maximal inequality, which cannot be employed under unbounded heavy-tailed noise. Hence, to bound the second term, we employ an integration bound that provides an upper bound on the summation. Consequently, we achieve the minimax optimal regret bound without using Hoeffding's maximal inequality.
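The integration bound is elementary: for a positive decreasing f, Σ_{s=1}^T f(s) ≤ f(1) + ∫_1^T f(s) ds. A quick numerical check with a stand-in f (not the actual bound from the proof):

```python
def f(s):
    # Stand-in decreasing positive function, f(s) = s^{-3/2}.
    return s ** -1.5

T = 10_000
sum_f = sum(f(s) for s in range(1, T + 1))
# f(1) + integral_1^T s^{-3/2} ds = 1 + 2 (1 - T^{-1/2})
integral_bound = 1.0 + 2.0 * (1.0 - T ** -0.5)
print(round(sum_f, 4), round(integral_bound, 4))  # the sum never exceeds the bound
```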
1) Gap-Dependent Regret Bound of MR-UCB: Now, we provide the gap-dependent bound of each term for MR-UCB. The upper bound of the second term can be obtained via the following lemma.
Lemma 1: For the second term of (18), MR-UCB satisfies the following inequality. The entire proof of this lemma can be found in Appendix B. The main idea for computing the upper bound of P(Ē^c_{a_j}) is to employ an integration bound: there exists an upper bound f(s) such that P(Ē^c_{a_j}) = P(Z < z_{a_j}) ≤ Σ_{s=1}^T f(s) holds, and the summation of f(s) can be bounded by the integral of f(s). This trick of bounding a summation by an integration is used throughout the proofs of our lemmas.
The final term of (19) can be bounded by the following lemma.
Lemma 2: For the final term of (19), MR-UCB satisfies the following inequality. The entire proof of this lemma can be found in Appendix B.
The main idea of the proof is to count the number of rounds needed to make the confidence bound β_{t−1,a} small enough. Once β_{t−1,a} is small, the final term can be bounded by the summation of the probabilities of estimation error. By combining the two lemmas and setting k_0 = 1, for which Δ_{a_1} = 0 by definition, we can obtain the gap-dependent regret bound.
Theorem 4: Assume that ν_p < ∞ and r̂_{t,k} is a p-robust estimator. Then, MR-UCB satisfies the following gap-dependent regret bound. The proof is simply done by combining the two lemmas and picking k_0 = 1, which makes T Δ_{a_{k_0}} = 0 since r_{a_1} = r_{a*}. Compared with robust UCB, the logarithmic factor of MR-UCB is ln(Δ_a^{p/(p−1)} T)^{p/(p−1)} rather than ln(T); for large gaps, robust UCB can therefore have a smaller regret bound, since ln(T) < ln(T)^{p/(p−1)}. On the other hand, if Δ_a is sufficiently small, then ln(Δ_a^{p/(p−1)}) becomes negative for Δ_a ≪ 1 and, hence, dominantly reduces the term ln(Δ_a^{p/(p−1)} T)^{p/(p−1)}. In this regard, MR-UCB can have a smaller regret than robust UCB. From this fact, we observe that MR-UCB is superior to robust UCB for challenging MAB problems with small gaps, which require many samples to distinguish the optimal action from suboptimal ones. This property enables MR-UCB to guarantee the minimax optimal regret bound.
2) Gap-Independent Regret Bound of MR-UCB: The gap-independent regret bounds can be derived by strategies similar to those for the gap-dependent bound. Now, we compute the gap-independent bounds for each term in (18) and (19).
Lemma 3: For the second term of (18), MR-UCB satisfies the following gap-independent inequality. The proof can be found in Appendix C. The main strategy of the proof is to bound the summation Σ_{j=k_0+1}^K P(Ē^c_{a_j})(Δ_{a_j} − Δ_{a_{j−1}}) by the integration ∫_{Δ_{a_{k_0+1}}}^∞ P(Z < r_{a*} − u/6) du, which is borrowed from [2]. Then, the probability P(Z < r_{a*} − u/6) can be bounded using the same technique as in Lemma 1.
The final term of (19) can be bounded as follows.
Lemma 4: For the final term of (19), MR-UCB satisfies the following gap-independent inequality. The proof can be found in Appendix C and starts from Lemma 2. We pick k_0 such that Δ_{a_{k_0}} < max(e^p, e^{−3(p−1)/(2p)})(K/T)^{1−1/p} < Δ_{a_{k_0+1}}. The gap-dependent bound in Lemma 2 is a decreasing function of Δ_a above this threshold; hence, we can replace Δ_a in Lemma 2 with the threshold to obtain an upper bound. By combining the two lemmas, we can obtain the gap-independent bound of MR-UCB as the following theorem.
Theorem 5: Assume that ν_p < ∞ and r̂_{t,k} is a p-robust estimator. Then, the gap-independent regret bound of MR-UCB is O(K^{1−1/p} T^{1/p}).
From Lemmas 3 and 4, we can bound the second term of (18) and the third term of (19) by O(K^{1−1/p} T^{1/p}). Then, the remaining part of the proof is to check that the first term in (18), T Δ_{a_{k_0}}, is bounded by O(K^{1−1/p} T^{1/p}). Fortunately, since we pick k_0 such that Δ_{a_{k_0}} < max(e^p, e^{−3(p−1)/(2p)})(K/T)^{1−1/p} holds, we have T Δ_{a_{k_0}} < max(e^p, e^{−3(p−1)/(2p)}) K^{1−1/p} T^{1/p}. Consequently, we can guarantee that the gap-independent regret bound of MR-UCB is O(K^{1−1/p} T^{1/p}), which matches the minimax optimal regret bound for heavy-tailed MAB problems.
3) Gap-Dependent Regret Bound of MR-APE: Now, we derive the gap-dependent regret bound of MR-APE. It suffices to bound the third term of (19), since the other two terms in (18) can be bounded in the same way as for MR-UCB. For the third term of (19), we first introduce x_a := r_a + Δ_a/3 and y_a := r_a − Δ_a/3. Then, let us define three events, Ê_{t,a} := {r̂_{t−1,a} ≤ x_a}, Ẽ_{t,a} := {r̂_{t−1,a} + (1 + ε)β_{t−1,a} G_{t,a} ≤ y_a}, and Ē_{t,a} := {z_a ≤ r̂_{t−1,a*} + β_{t−1,a*}}. From these definitions, we have E_{t,a} ∩ Ē_a ⊂ E_{t,a} ∩ Ē_{t,a}, since z_a ≤ min_{1<t≤T} r̂_{t−1,a*} + β_{t−1,a*} implies z_a ≤ r̂_{t−1,a*} + β_{t−1,a*}. Then, we decompose E_{t,a} ∩ Ē_{t,a} into three subsets
E_{t,a} ∩ Ē_{t,a} = E^{(1)}_{t,a} ∪ E^{(2)}_{t,a} ∪ E^{(3)}_{t,a}    (27)
where E^{(1)}_{t,a} := E_{t,a} ∩ Ē_{t,a} ∩ Ê^c_{t,a}, E^{(2)}_{t,a} := E_{t,a} ∩ Ē_{t,a} ∩ Ê_{t,a} ∩ Ẽ^c_{t,a}, and E^{(3)}_{t,a} := E_{t,a} ∩ Ē_{t,a} ∩ Ê_{t,a} ∩ Ẽ_{t,a}. Hence, the final term of (19) can be bounded using the following inequality:
P(E_{t,a} ∩ Ē_a) ≤ P(E^{(1)}_{t,a}) + P(E^{(2)}_{t,a}) + P(E^{(3)}_{t,a}).
Each event has the following meaning.
1) The first event, E^{(1)}_{t,a}, mainly counts the number of times the suboptimal action a is selected due to the estimation error of r̂_{t−1,a}. Hence, this term is bounded by the error probability of the p-robust estimator.
2) The second event, E^{(2)}_{t,a}, considers the case of choosing a suboptimal action due to a large perturbation, G_{t,a}, while its reward estimate is well concentrated. This term can be controlled by the coefficient β_{t−1,a}, since the event depends on the magnitude of the sampled perturbation.
3) The final event, E^{(3)}_{t,a}, indicates that the suboptimal action was selected even though r̂_{t−1,a} is well estimated and the perturbation G_{t,a} is not too large. This event can happen when the estimate of the optimal reward is incorrect and the perturbation of the optimal action is not large enough to overcome the under-estimation.
The basic idea for deriving the bounds of E^{(1)}_{t,a}, E^{(2)}_{t,a}, and E^{(3)}_{t,a} follows Kim and Tewari [5], Lee et al. [8], and Cesa-Bianchi et al. [23]. We apply the techniques in [5], [8], and [23] to our modified confidence bound. Now, we provide the gap-dependent bounds for the three terms.
Lemma 5: The probabilities of $E^{(1)}_{t,a}$, $E^{(2)}_{t,a}$, and $E^{(3)}_{t,a}$ can each be bounded as shown in Appendix D, where the entire proofs of the lemma can be found. By combining all lemmas, we can bound the third term of (19). Consequently, we have the following gap-dependent regret bound of MR-APE.
Theorem 6: Assume that the $p$th moment of the rewards is bounded by a constant $\nu_p < \infty$, $\hat{r}_{t,k}$ is a $p$-robust estimator, $G$ is a bounded perturbation within $[-1, 1]$, and there exists a constant $M_\epsilon$ depending only on $\epsilon$ such that $P(G < 1/(1+\epsilon))/P(G > 1/(1+\epsilon)) < M_\epsilon$. Then, the gap-dependent regret bound of MR-APE follows, where $M^{+}_\epsilon := \max(M_\epsilon, 1)$. The proof is done by combining the two lemmas with the proof of MR-UCB and picking $k_0 = 1$, which makes $T\Delta_{a_{k_0}} = 0$ since $r_{a_1} = r^\star$. We can observe that the gap-dependent bound of MR-APE is the same as that of MR-UCB up to a constant.
4) Gap-Independent Regret Bound of MR-APE: We can now derive the gap-independent regret bound of MR-APE using the same technique as for MR-UCB.
Theorem 7: Assume that the $p$th moment of the rewards is bounded by a constant $\nu_p < \infty$, $\hat{r}_{t,k}$ is a $p$-robust estimator, $G$ is a bounded perturbation within $[-1, 1]$, and there exists a constant $M_\epsilon$ depending only on $\epsilon$ such that $P(G < 1/(1+\epsilon))/P(G > 1/(1+\epsilon)) < M_\epsilon$. Then, the gap-independent regret bound of MR-APE follows; the proof is omitted here and can be found in Appendix E. Similar to the gap-dependent bound, the gap-independent bound of MR-APE has the same order in $T$ and $K$ as that of MR-UCB. Consequently, MR-APE also guarantees the minimax optimal regret bound.

5) Comparison Between MR-UCB and MR-APE:
While MR-APE and MR-UCB have the same minimax optimal regret bound, the main difference between them comes from the gap-dependent regret bounds in Theorems 4 and 6. In Theorem 4, the logarithmic term in the gap-dependent bound of MR-UCB is independent of $c$, and only the final term $O(K c^{p/(p-1)} \exp(b_p \nu_p / c^p) / \Delta_{a_k}^{1/(p-1)})$ depends on $c$. In this regard, controlling $c$ does not affect the order in $T$. However, in Theorem 6, MR-APE has auxiliary controllable parameters $M_\epsilon$ and $\epsilon$ that can affect the logarithmic term of its gap-dependent bound. Furthermore, we can interpret MR-APE as a unifying framework between UCB-like exploration and perturbed exploration. Intuitively speaking, under the condition $P(G < 1/(1+\epsilon))/P(G > 1/(1+\epsilon)) < M_\epsilon$, most of the probability mass of the perturbation is located near one; hence, MR-APE has randomness but works similarly to MR-UCB. More specifically, the condition on $G$ and $M_\epsilon$ can be rewritten as $P(G < 1/(1+\epsilon)) < M_\epsilon P(G > 1/(1+\epsilon))$. If $P(G > 1/(1+\epsilon))$ gets smaller, the constant $M_\epsilon$ should become larger to satisfy the condition; in this case, MR-APE mainly employs the perturbation for exploration rather than relying on the confidence bound. On the other hand, if $P(G > 1/(1+\epsilon))$ gets bigger, most of the probability mass should be located near $G = 1$ to keep $M_\epsilon$ small enough; under this condition, MR-APE behaves similarly to MR-UCB. This property enables more adaptive exploration in practice, since we can control not only $c$ but also $M_\epsilon$ and $\epsilon$ through the distribution of the perturbation.
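The condition above is easy to check numerically for a candidate perturbation distribution. The following sketch (the Monte Carlo approach and all names are ours, not from the paper) estimates the ratio $P(G < 1/(1+\epsilon))/P(G > 1/(1+\epsilon))$, so one can verify that a chosen bounded perturbation admits a small $M_\epsilon$:

```python
import numpy as np

def perturbation_ratio(sampler, eps, n=200_000, rng=None):
    """Monte Carlo estimate of P(G < 1/(1+eps)) / P(G > 1/(1+eps))
    for a bounded perturbation G drawn by `sampler`."""
    rng = rng or np.random.default_rng(0)
    g = sampler(rng, n)
    thr = 1.0 / (1.0 + eps)
    below = np.mean(g < thr)
    above = np.mean(g > thr)
    return below / above

# Two of the bounded perturbations used in the experiments.
uniform01 = lambda rng, n: rng.uniform(0.0, 1.0, n)
rademacher = lambda rng, n: rng.choice([-1.0, 1.0], n)

# Uniform(0,1): exact ratio is (1/(1+eps)) / (eps/(1+eps)) = 1/eps
print(perturbation_ratio(uniform01, 0.5))   # ≈ 2.0
# Rademacher: P(G < thr) = P(G = -1) = 1/2, P(G > thr) = P(G = 1) = 1/2
print(perturbation_ratio(rademacher, 0.5))  # ≈ 1.0
```

For Uniform(0,1) the exact ratio is $1/\epsilon$, so $M_\epsilon$ must grow as $\epsilon$ shrinks, matching the trade-off described above.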

A. Experimental Setup
We verify the properties of the proposed methods and compare them to existing methods. First, we compare the two proposed methods, MR-UCB and MR-APE. In particular, we prepare various types of MR-APE that use different bounded perturbations, separated into two groups. The first group consists of positive bounded perturbations, whose random variable takes only positive values. The second group consists of both-sided bounded perturbations, whose random variable can take both positive and negative values. For the first group, we employ the Bernoulli distribution and the Uniform distribution on [0, 1] as the bounded perturbation in MR-APE. For the second group, we employ the Rademacher distribution, whose value is −1 or 1, and the Uniform distribution on [−1, 1]. Hence, the proposed exploration scheme is tested in five different algorithms: MR-UCB, MR-APE with Bernoulli, MR-APE with Uniform(0, 1), MR-APE with Rademacher, and MR-APE with Uniform(−1, 1). We compare the proposed methods with existing robust exploration methods: robust UCB [7], DSEE [21], and APE² with unbounded perturbations [8]. For APE², we utilize the Fréchet and Pareto distributions as unbounded perturbations; hence, the comparisons are conducted with APE² with Fréchet and APE² with Pareto. Note that robust UCB and DSEE utilize the truncated mean estimator, while APE², MR-UCB, and MR-APE mainly utilize the p-robust estimator.
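Of these two estimator families, the truncated mean is the simplest to state. The sketch below is a generic Bubeck-style truncated mean — the threshold shape and parameter names are our assumption, not the exact estimator of [7] — which discards the $s$th sample when its magnitude exceeds $(\nu_p s / \ln(1/\delta))^{1/p}$:

```python
import numpy as np

def truncated_mean(x, nu_p, p, delta):
    """Truncated empirical mean for heavy-tailed data: sample s is kept
    only if |x_s| <= (nu_p * s / ln(1/delta))**(1/p); truncated samples
    contribute zero, which controls the influence of the heavy tail."""
    x = np.asarray(x, dtype=float)
    s = np.arange(1, len(x) + 1)
    thresh = (nu_p * s / np.log(1.0 / delta)) ** (1.0 / p)
    return np.mean(np.where(np.abs(x) <= thresh, x, 0.0))
```

Note that the truncation threshold depends on $\nu_p$ — exactly the prior knowledge that the proposed $p$-robust methods avoid requiring.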
We prepare synthetic and real-world data for the simulations. First, for all synthetic simulations, we synthesize a heavy-tailed MAB problem with $K$ actions. The optimal action has mean reward 1 and the $K-1$ suboptimal actions have mean reward $1 - \Delta$. Hence, $\Delta$ determines the gap between the maximum reward and the other rewards, and by controlling $\Delta$ we can measure how the gap influences the regret of each exploration method. Then, we add a heavy-tailed noise to the observed rewards. The heavy-tailed noise is created by transforming a Pareto or Fréchet random variable. Let $z_t$ be a heavy-tailed random variable, $z_t \sim \mathrm{Pareto}(\alpha, \lambda)$, where $\alpha$ is a shape parameter and $\lambda$ is a scale parameter. Then, a noise is defined as $\eta_t := b_t(z_t - E[z_t])$, where $b_t$ is a Rademacher random variable that takes the value $+1$ with probability $1/2$ and $-1$ with probability $1/2$. By this definition, $\eta_t$ is a mean-zero heavy-tailed noise. In the simulation, we observe a noisy reward $R_{t,a} := r_a + \eta_{t,a}$ at every step. Each simulation runs $T$ rounds, and we measure the time-average regret $R_t/t := \sum_{k=1}^{t} (r^\star - r_{a_k})/t$ for $t \in [1, T]$. Second, for real-world data, we employ a cryptocurrency dataset [25] that contains daily returns of cryptocurrencies from April 1, 2019 to July 1, 2021. We select ten cryptocurrencies, including Bitcoin, Ethereum, Doge, Monero, Stellar, and EOS, based on market value. The goal of this simulation is to identify the most profitable currency, motivated by the practical scenario in which an investor wants to invest a fixed budget in a cryptocurrency and obtain as much return as possible. In this scenario, an action is defined as buying a specific currency, and the corresponding reward is the daily profit. It is well known that financial data often show the inherent characteristic of heavy tails [26], [27]; hence, we believe that identifying the most profitable cryptocurrency is a practical application of the proposed methods.
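The Pareto variant of this noise construction and the time-average regret can be sketched as follows. NumPy's `pareto` draws the shifted (Lomax) form, so we add 1 before scaling; the uniformly random policy at the end is only for illustration and is not one of the compared algorithms:

```python
import numpy as np

def heavy_tailed_noise(alpha, lam, size, rng):
    # z ~ classical Pareto(shape=alpha, scale=lam); numpy's pareto() is
    # the shifted (Lomax) form, so add 1 and multiply by the scale.
    z = lam * (rng.pareto(alpha, size) + 1.0)
    mean_z = alpha * lam / (alpha - 1.0)       # E[z], finite for alpha > 1
    b = rng.choice([-1.0, 1.0], size)          # Rademacher sign b_t
    return b * (z - mean_z)                    # mean-zero, heavy-tailed

rng = np.random.default_rng(0)
K, T, gap = 10, 5000, 0.7
means = np.full(K, 1.0 - gap)
means[0] = 1.0                                 # one optimal action
noise = heavy_tailed_noise(alpha=1.8, lam=1.0, size=T, rng=rng)
pulls = rng.integers(0, K, T)                  # uniformly random policy
rewards = means[pulls] + noise                 # noisy observations R_{t,a}
regret = np.cumsum(1.0 - means[pulls]) / np.arange(1, T + 1)
```

A uniformly random policy plateaus near $\Delta (K-1)/K$, whereas a sound bandit algorithm should drive $R_t/t$ toward zero.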
Consequently, we prepare four simulations. The first compares the performance of the exploration methods for various $p$ and $\Delta$ with the two heavy-tailed noises. The second verifies the effect of increasing $K$ on the regret. The third measures the effect of the scale hyperparameter $c$ on the performance of the exploration methods. The final simulation compares the performance of the exploration methods on the real-world cryptocurrency dataset.

B. Performance Comparison for Various Noises, $p$, and $\Delta$
We compare the performance of every exploration method. For MR-UCB, robust UCB, MR-APE, and APE², we optimize the hyperparameter $c$ using a grid search. We note that, for robust UCB, we modify the confidence bound in Assumption 1 by multiplying it by a scale parameter $c$, since the original robust UCB consistently shows poor performance even when $\nu_p$ is given; $c$ for robust UCB is then also optimized using a grid search. We prepare six synthetic MAB problems by combining $\Delta = 0.3, 0.7$ and $p = 1.2, 1.5, 1.8$ for the two noise types. Figs. 1 and 2 show the results for Pareto noise and Fréchet noise, respectively.
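The grid search over $c$ used throughout can be sketched as below; `run_final_regret` is an assumed user-supplied callable (not from the paper) that runs one algorithm with scale `c` and a seed and returns the final $R_T/T$:

```python
import numpy as np

def grid_search_scale(run_final_regret, c_grid, n_seeds=10):
    """Pick the scale c minimizing the mean final time-average regret,
    averaged over random seeds, on a (typically log-spaced) grid."""
    scores = {c: np.mean([run_final_regret(c, s) for s in range(n_seeds)])
              for c in c_grid}
    return min(scores, key=scores.get)
```

In practice the grid is log-spaced (e.g., `np.logspace(-2, 2, 50)`) because the useful range of $c$ spans several orders of magnitude.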
As shown in Fig. 1, MR-UCB consistently outperforms the other exploration methods except in the case $(p = 1.2, \Delta = 0.7)$, where it shows performance comparable to robust UCB and MR-APE with Bernoulli. For $\Delta = 0.3$, as shown in Fig. 1(a), (c), and (e), MR-UCB significantly outperforms the other methods, while the performance gap between MR-UCB and the other methods is marginal when $\Delta = 0.7$. Furthermore, the performance gap increases as the order of the moment, $p$, decreases. This observation implies that MR-UCB is more robust against heavy-tailed noise. As $p$ approaches 2, a robust estimator generally converges much faster than when $p$ is close to 1; hence, the reward estimators used by all exploration methods concentrate after fewer trials. This reduces the performance gap between MR-UCB and the other methods, since each algorithm can distinguish the optimal action from the suboptimal actions with fewer trials. However, as $p$ approaches 1, the convergence of the reward estimators slows and many samples are required for concentration around the true mean. This hinders the convergence of all exploration methods, yet MR-UCB still outperforms the others, as shown in Fig. 1.
For MR-APE, we can observe in Fig. 1 that bounded perturbations usually yield the second-best performance across various settings under Pareto noise. In particular, MR-APE with positive perturbations generally outperforms MR-APE with both-sided perturbations, and MR-APE with Bernoulli often shows the second-best performance in settings such as $(p = 1.2, \Delta = 0.3)$, $(p = 1.2, \Delta = 0.7)$, $(p = 1.5, \Delta = 0.7)$, and $(p = 1.8, \Delta = 0.3)$. However, there is no clear dominance among MR-APE, APE², and robust UCB. From the tendency shown in Fig. 1, we can observe that the performance gap between MR-APE and the other exploration methods, such as the APE² variants, robust UCB, and DSEE, increases as $\Delta$ decreases from 0.7 to 0.3.
For the Fréchet noise setting, MR-UCB also outperforms APE² with unbounded perturbations and robust UCB, as shown in Fig. 2.

C. Performance Comparison for Varying K
In this experiment, we verify the effect of the number of actions in heavy-tailed bandits. We employ a Pareto noise setting with $p = 1.8$ and $\Delta = 0.7$. For all exploration methods, we measure the final time-average regret after 20 000 rounds. The simulation is conducted with $K$ varying over 10, 30, 50, 70, and 100. In Fig. 3, we plot the average value over ten random seeds. For each $K$, we conduct hyperparameter optimization using a grid search.
As shown in Fig. 3, all algorithms show a similar tendency: $R_T/T$ increases with the number of actions, since the number of times each individual action can be explored is reduced when $K$ increases with fixed $T$. Hence, the plot in Fig. 3 shows the effect of $K$ on the cumulative regret. First, the algorithm most robust to an increasing number of actions is MR-UCB; in particular, MR-UCB outperforms all other exploration methods as $K$ increases. In contrast, the performance of APE² with Fréchet degrades drastically as $K$ increases, while the MR-APE variants with bounded perturbations show only a moderate performance drop. This result clearly supports the fact that using the modified confidence bound helps to reduce the regret by removing the suboptimal factor $\ln(K)$ present in APE² with unbounded perturbations. The other methods, except MR-UCB and APE² with Fréchet, show performance comparable to each other. Interestingly, robust UCB and DSEE show performance similar to the MR-APE variants with Uniform, Bernoulli, and Rademacher perturbations. These results indicate that the regret bounds of robust UCB and DSEE have the same dependency on $K^{1-1/p}$ as the regret bound of the MR-APE variants, while carrying a suboptimal factor $\ln(T)^{1/p}$ with respect to $T$. In summary, we conclude that MR-UCB outperforms the other exploration methods as the number of actions increases under heavy-tailed settings, since the modified confidence bound removes the suboptimal factor of $K$ from the minimax regret bound of MR-UCB.

D. Effect of Hyperparameter
In this experiment, we verify the sensitivity of each exploration method with respect to the hyperparameter $c$. For MR-UCB, robust UCB, and MR-APE, the exploration tendency depends on the scale parameter $c$. To verify the effect of $c$ for each algorithm, we measure the final time-average regret with 50 different values of $c$ after 20 000 rounds. For this simulation, we set $K = 10$, $\Delta = 0.7$, and $T = 20\,000$ and run each algorithm with ten different random seeds. In Fig. 4, we can observe valley-shaped plots over the hyperparameter $c$. In general, if $c$ is small, an algorithm shows poor regret, since small $c$ makes the algorithm rarely explore the action space. For a similar reason, if $c$ is large, an exploration method also shows poor regret, since large $c$ hinders exploitation, i.e., convergence to the optimal action. Hence, the regret is reduced only in a proper range of $c$, as shown by the valley-shaped plots in Fig. 4. For each algorithm, we focus on analyzing the plateau of the valley, which shows the sensitivity of the exploration tendency with respect to the hyperparameter. A wide plateau implies that the algorithm is less sensitive to the hyperparameter, so a proper value can be found with a coarser grid search. On the contrary, a narrow plateau indicates that the algorithm is more sensitive to hyperparameter optimization. To visually measure the range of the plateau, we mark a threshold with the red dotted line in Fig. 4.
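The plateau can also be quantified directly. The sketch below is our own measurement, with `threshold` playing the role of the red dotted line $R_T/T = \Delta/3$: it returns the log10-width of the widest contiguous run of $c$ values whose final regret stays below the threshold.

```python
import numpy as np

def plateau_width(c_grid, final_regrets, threshold):
    """Width (in log10-c units) of the widest contiguous run of the grid
    whose final time-average regret stays below the threshold."""
    below = np.asarray(final_regrets) < threshold
    best, run_start = 0.0, None
    for i, ok in enumerate(below):
        if ok and run_start is None:
            run_start = i                       # a sub-threshold run begins
        if (not ok or i == len(below) - 1) and run_start is not None:
            end = i if ok else i - 1            # the run ends here
            best = max(best, np.log10(c_grid[end]) - np.log10(c_grid[run_start]))
            run_start = None
    return best
```

A larger returned width corresponds to the wider plateau discussed above, i.e., lower sensitivity to the choice of $c$.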
In Fig. 4(a), it can be observed that MR-UCB has a wider plateau than robust UCB; hence, MR-UCB is less sensitive than robust UCB. Moreover, the regret at the best hyperparameter of MR-UCB is lower than that of robust UCB. Consequently, MR-UCB is both more robust and better performing than robust UCB with respect to hyperparameter optimization. Comparing MR-APE with Uniform(0, 1) against MR-APE with Bernoulli, the Bernoulli variant shows more robust performance, with a much wider plateau in the hyperparameter space. Furthermore, the plateau of MR-APE with Bernoulli is much wider, and shows lower cumulative regret, than the plateau of MR-UCB. This result shows that randomizing MR-UCB with a Bernoulli perturbation has the effect of widening the plateau of the valley in hyperparameter space.
In practice, finding a proper hyperparameter $c$ is a demanding task when applying exploration methods to practical problems. Hence, algorithms that are less sensitive to hyperparameters are more suitable for practical use, since this property reduces the cost of hyperparameter optimization. From the experimental results, we conclude that MR-UCB and MR-APE with bounded perturbations are more desirable for practical applications.

E. Performance Comparison for Cryptocurrency Dataset
In this experiment, we test all exploration methods on the real-world cryptocurrency dataset [25]. As in the other simulations, we optimize the hyperparameters of each algorithm using a grid search. In Fig. 5, we plot the average value over ten random seeds. As shown in Fig. 5, MR-UCB shows the best performance and MR-APE with Uniform(−1, 1) shows the second-best performance; among the bounded perturbations, the uniform perturbation on (−1, 1) performs best. Furthermore, the results show a tendency similar to that of the synthetic simulations. It is worth mentioning that MR-UCB and MR-APE with Uniform(−1, 1) clearly outperform robust UCB, DSEE, and APE². Overall, with the synthetic and real-world simulations, we have verified the superiority of the proposed methods.
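To make the setup concrete, a schematic bandit loop over a $T \times K$ matrix of daily returns might look as follows. This is a stand-in, not MR-UCB itself: the index uses a plain empirical mean plus a $c(\ln t / n)^{1-1/p}$ bonus, whereas the paper's methods use the $p$-robust estimator and its modified confidence bound.

```python
import numpy as np

def run_bandit(returns, c=1.0, p=1.5):
    """Schematic UCB-style loop over a T x K matrix of daily returns.
    Index = empirical mean + c * (ln t / n)**(1 - 1/p), a stand-in for
    the paper's p-robust estimator and confidence bound."""
    T, K = returns.shape
    n = np.zeros(K)                 # pull counts per action
    s = np.zeros(K)                 # reward sums per action
    picks = np.empty(T, dtype=int)
    for t in range(T):
        if t < K:
            a = t                   # pull each action once to initialize
        else:
            bonus = c * (np.log(t) / n) ** (1.0 - 1.0 / p)
            a = int(np.argmax(s / n + bonus))
        r = returns[t, a]           # observed daily profit of the chosen currency
        n[a] += 1
        s[a] += r
        picks[t] = a
    return picks
```

On data with one clearly more profitable asset, the loop should concentrate its pulls on that asset after an initial exploration phase.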

VI. CONCLUSION
We have studied minimax optimality under the heavy-tailed noise assumption for stochastic MABs, where the $p$th moment of the rewards is bounded by a constant $\nu_p$ for $1 < p \le 2$. We first identified two critical drawbacks of existing robust exploration methods. First, they often depend on a robust mean estimator that requires prior knowledge of $\nu_p$, which is not accessible in many real-world problems. Second, we proved the sub-optimality of existing robust exploration methods for heavy-tailed rewards. Based on this analysis, we proposed two algorithms, MR-UCB and MR-APE, that guarantee minimax optimality with minimal information. Both methods are independent of $\nu_p$, which allows them to be employed with minimal prior knowledge compared to existing exploration methods. MR-UCB utilizes modified confidence bounds that provide a more precise confidence bound for robust mean estimators, and MR-APE is the randomized version of MR-UCB that employs a bounded perturbation whose scale follows the modified confidence bound of MR-UCB. Furthermore, we analyzed both gap-dependent and gap-independent regret bounds of the two proposed methods and proved that both achieve the minimax optimal regret bound. In simulations, we demonstrated the superiority of the proposed methods on various heavy-tailed synthetic and real-world data; in particular, MR-UCB clearly outperforms the other algorithms as the number of actions increases under heavy-tailed noise. Consequently, we conclude that the proposed methods are beneficial for heavy-tailed MAB problems.

APPENDIX A PROOF OF COROLLARY 1
Proof of Corollary 1: If the supporting interval of $F(g)$ is bounded, then there exist constants $A$ and $B$ such that $A < F^{-1}(y) < B$ holds for all $y \in [0, 1]$. In particular, $A < F^{-1}(1 - 1/K) < B$ due to the bounded support of $F(g)$. The claimed relation, and hence the corollary, follows directly from this fact.
Proof of Lemma 1: To prove Lemma 1, we employ the concentration property of the $p$-robust estimator. We first bound the probability $P(Z < r_a - \Delta_{a_j}/6)$, and then obtain a gap-dependent bound for the second term of (18), where $C$ is a constant independent of $c$, $T$, $K$, and $\Delta_a$.
Proof of Lemma 2: To prove the upper bound, we first introduce the stopping time $\tau_k$ and compute an upper bound on its expectation. We then tighten the bound on $E[\tau_k]$ by properly setting $l_0$. For $l > l_0$, we have $l > \Delta_{a_k}^{-p/(p-1)}$ by the definition of $l_0$, and thus $\ln_+(T/(Kl))/l^{1-1/p} \le 4\Delta_{a_k}/6$. Hence, the stated condition holds, and we can bound $E[\tau_k]$ accordingly.
Finally, using the bound on $E[\tau_k]$, we obtain the claim.
Proof of Theorem 4: The proof follows by combining Lemmas 1 and 2. By combining all gap-dependent bounds and setting $k_0 = 1$, we obtain the gap-dependent bounds.
Proof of Lemma 3: The proof starts from Lemma 2. For all gap-independent bounds, we set $k_0$ such that $\Delta_{a_{k_0}} < \delta < \Delta_{a_{k_0+1}}$, where $\delta = \max(e^p, e^{-3(p-1)/(2p)})(K/T)^{1-1/p}$. For $\Delta_a > e^p (K/T)^{1-1/p}$ and $e^p (K/T)^{1-1/p} > e^{-3(p-1)/(2p)}(K/T)^{1-1/p}$, the upper bound in (61) is a decreasing function; hence, replacing $\Delta_a$ with $\delta$ makes the upper bound larger, and the claim follows.
Proof of Lemma 4: Let $\delta$ be $(e^{1/4}(K/T))^{1-1/p}$, and let $k_0$ be the index of the action such that the stated inequality holds, where $C$ is a large constant depending only on $c$ and $p$. From this inequality, we can bound the integration, where $C'$ is a constant greater than $C(p-1)$. Finally, we bound $P(\bar{E}^c_{a_j})(\Delta_{a_j} - \Delta_{a_{j-1}})$ as in (77).
Proof of Theorem 5: The proof follows by combining Lemmas 3 and 4. We combine all gap-independent bounds and set $k_0$ such that $\Delta_{a_{k_0}} < \delta < \Delta_{a_{k_0+1}}$, where $\delta = \max(e^p, e^{-3(p-1)/(2p)})(K/T)^{1-1/p}$. Then, $T\Delta_{a_{k_0}} < T\delta = O(K^{1-1/p} T^{1/p})$ holds, and we obtain the gap-independent bounds.
For the sum $\sum_{t=1}^{T} P(E^{(1)}_{t,a})$, the bound in (82) can be derived. The probability of $E^{(2)}_{t,a}$ can be bounded by the probability of $\hat{E}_{\tau_k,a} \cap \tilde{E}^c_{\tau_k,a}$, where $(\Delta_a/3(1+\epsilon))\beta^{-1}_{\tau_k,a} > 1$ for $l > l_0$, and $G$ is a random variable within $[-1, 1]$, so $P(G > 1) = 0$ holds.
Finally, to prove the bound on the sum of the probabilities of $E^{(3)}_{t,a}$, we borrow the idea from [3] and [23]. Let $F_{k,a} := P(\hat{r}_{\tau_k,a} + \beta_{\tau_k,a} G_{\tau_k,a} > y_a)$. Using $F_{k,a}$, we can derive the stated bound.
Proof of Theorem 6: First, using Lemma 5, we can bound the final term in (19) by $\Delta_{a_k} \sum_{t=1}^{T} \left[ P(E^{(1)}_{t,a}) + P(E^{(2)}_{t,a}) + P(E^{(3)}_{t,a}) \right]$, where $M^{+}_\epsilon := \max(M_\epsilon, 1)$ combines (89) and (99).

Fig. 2.
However, unlike the Pareto noise setting, MR-APE with bounded positive perturbations shows performance comparable to MR-UCB in various problem settings, and even outperforms MR-UCB in several settings. In particular, MR-APE with Uniform(0, 1) shows performance similar to MR-UCB in settings including $(p = 1.8, \Delta = 0.7)$, $(p = 1.8, \Delta = 0.3)$, and $(p = 1.5, \Delta = 0.7)$, and outperforms MR-UCB for $(p = 1.2, \Delta = 0.7)$. While MR-APE, which randomizes MR-UCB, shows inferior performance in the Pareto noise settings, it has advantages over MR-UCB in the Fréchet noise settings. In summary, from the empirical results shown in Figs. 1 and 2, MR-UCB, which employs the modified upper confidence bound, clearly outperforms the other exploration methods for heavy-tailed MAB problems, while the MR-APE variants show comparable performance in general and dominate the other algorithms in several special cases.

Fig. 3. Effect of the number of actions. The time-average final regret $R_T/T$ at the final round is plotted for different $K$. All regrets are measured under $p = 1.8$ and $\Delta = 0.7$ with the Pareto noise distribution. The bold line is the average of $R_T/T$ over ten different random seeds, and the shaded area indicates a half-standard-deviation region.

Fig. 5.
Fig. 4. Effect of the hyperparameter. The time-average final regret $R_T/T$ at the final round is plotted for different $c$. All regrets are measured under $p = 1.8$ and $\Delta = 0.7$ with the Pareto noise distribution. The red dotted line indicates $R_T/T = \Delta/3$. The bold line is the average of $R_T/T$ over ten different random seeds, and the shaded area indicates a half-standard-deviation region. (a) MR-UCB and robust UCB. (b) MR-APE with Uniform(0, 1). (c) MR-APE with Uniform(−1, 1). (d) APE² with unbounded perturbations.

Proof of Lemma 5: For a fixed $a \in \mathcal{A}$, we first define a stopping time $\tau_k$ for the $k$th selection of $a$. Using $\tau_k$, the following bound can be derived.

$P(\hat{r}_{a,\tau_k} > x_a) \le \exp(b_p \cdots)$
By combining with Lemma 6, the gap-dependent regret bound of MR-APE can be obtained as follows:

$K^{1-1/p} T^{1/p}$. (107)

Algorithm 1 Minimax Optimal Robust Adaptively Perturbed Exploration (MR-APE)
Input: $p$, $c$, $T$, $\epsilon$, and $F^{-1}(y)$
Output: $\{\hat{r}_{T,a}\}_{a \in \mathcal{A}}$
1: Initialize $\{\hat{r}_{0,a} = 0, n_a(0) = 0\}$ for all $a \in \mathcal{A}$, and select $a_1, \ldots, a_K$ once, receiving $R_{1,a_1}, \ldots, R_{K,a_K}$
2: for $t = K+1, \ldots, T$ do
3: ...

Then, we have $\{Z > z_{a_k}\} \subset \{n_{a_k,T} < \tau_k\}$ from the definition of $Z$ and the selection rule of MR-UCB. Then, $\sum_{t=1}^{T} P(\bar{E}_{a_k} \cap E_{t,a_k})$ can first be bounded by $E[\mathbb{I}[Z > z_{a_k}] n_{a_k,T}]$. Hence, by combining the two facts, we have