Quantum bandit with amplitude amplification exploration in an adversarial environment

The rapid proliferation of learning systems in arbitrarily changing environments calls for careful management of the tension between exploration and exploitation. This work proposes a quantum-inspired bandit learning approach for the learning-and-adapting-based offloading problem, where a client observes and learns the costs of each task offloaded to the candidate resource providers, e.g., fog nodes. In this approach, a new action update strategy and a novel probabilistic action selection are adopted, inspired by amplitude amplification and the collapse postulate in quantum computation theory, respectively. We devise a locally linear mapping between a quantum-mechanical phase in a quantum domain, e.g., a Grover-type search algorithm, and a distilled probability magnitude in a value-based decision-making domain, e.g., an adversarial multi-armed bandit algorithm. The proposed algorithm is generalized, via the devised mapping, for better learning-weight adjustments on favourable/unfavourable actions, and its effectiveness is verified via simulation.


I. INTRODUCTION
Fog computing domains, such as vehicular networks, have proliferated rapidly [1]. Enabling such emerging applications to work in a pervasively uncertain environment calls for intelligent decision-making (DM) to choose a suitable computing server that guarantees the quality of service, e.g., offloading to nodes equipped with powerful computing capability. To solve the provider identification problem, sequential DM has been leveraged for its ability to learn in a trial-and-error fashion without explicit knowledge of the environment, while facing the exploration/exploitation (ExR/ExT) dilemma [2]. The exploration strategy is a crucial ingredient of learning-based DM: under-ExR makes the decision stick to a sub-optimal strategy, while over-ExR may incur an ExR cost.
Various exploration strategies have been introduced to address the balancing issue, which can be categorized into three main methods of selecting an action, e.g., a service provider: i) An upper-confidence bound (UCB)-type strategy, referred to as the interval-estimation method [3], selects the action that has the highest estimated action-value plus the UCB exploration term, making it possible to play an action that has not been explored sufficiently; ii) A greedy-type strategy, referred to as the semi-uniform (SU) method [4], consists of choosing a random action with frequency $\epsilon$ or choosing the action with the highest estimated mean otherwise, where the estimation is based on the costs observed so far; iii) A softmax-type strategy, referred to as the probability-matching (PM) method [5], chooses actions according to a Gibbs-type probability distribution reflecting how likely the actions are to be optimal, with a free parameter corresponding to the inverse temperature $\beta$. With careful tuning, a UCB-type rule is asymptotically optimal for specific cost distributions, but optimality may be reached only after a long period of time, particularly in an adversarial environment. Using the SU and PM methods requires tuning the ExR parameter, $\epsilon$ or $\beta$, which is vital in a varying environment but non-trivial to set systematically due to the lack of generality in how to adjust the factors on favourable/unfavourable actions.
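The three selection rules above can be sketched in a few lines of Python. This is a minimal illustration, not the formulation of [3]-[5]: the function names, the bonus constant `c`, and the cost-based sign convention (lowest cost instead of highest value) are assumptions made for this sketch.

```python
import math
import random


def ucb_select(mean_cost, pulls, t, c=2.0):
    """UCB-type rule for costs: lowest (estimated cost - exploration bonus)."""
    for k, n in enumerate(pulls):
        if n == 0:  # play each arm once before trusting its estimate
            return k
    index = [m - math.sqrt(c * math.log(t) / n) for m, n in zip(mean_cost, pulls)]
    return min(range(len(index)), key=index.__getitem__)


def epsilon_greedy_select(mean_cost, eps, rng):
    """Semi-uniform rule: uniform arm with probability eps, else best estimate."""
    if rng.random() < eps:
        return rng.randrange(len(mean_cost))
    return min(range(len(mean_cost)), key=mean_cost.__getitem__)


def softmax_probs(mean_cost, beta):
    """Probability matching: Gibbs distribution with inverse temperature beta."""
    base = min(mean_cost)  # shift by the minimum for numerical stability
    weights = [math.exp(-beta * (c - base)) for c in mean_cost]
    total = sum(weights)
    return [w / total for w in weights]
```

In all three rules the exploration level is governed by a hand-set parameter (`c`, `eps`, or `beta`), which is the tuning burden the paragraph above refers to.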
As a promising direction to overcome the difficulties of controlling the ExR factors, adopting a quantum mechanism in learning algorithms has been considered. Existing works in [6], [7] show that quantum learning algorithms can achieve a better ExR/ExT trade-off and improved learning efficiency compared with classical learning. Such quantum enhancement arises from the use of quantum subroutines such as quantum amplitude amplification (QAA) and quantum measurement (QM). QM provides natural ExR based on the collapse postulate of quantum mechanics, which can be used for importance-weighted Gibbs sampling without specific exploration parameters. QAA, a core component of Grover's algorithm [8], updates the probability amplitudes of actions with a certain degree of importance through multiple iterations, each of which can be generalized to adjust weights on favourable actions.
Existing probability amplitude updating strategies [7], [9]-[11] suffer from arbitrary phase variation and probability amplitude jumping issues. Such uncertainty may lead to severe outcomes in an arbitrarily varying environment with incomplete feedback, since the probability amplitude of a sub-optimal action could be amplified by an arbitrary degree. These concerns have not been resolved due to challenges associated with i) the nonexistence of a one-to-one mapping between phase and probability amplitudes and ii) the nonsmoothness of arbitrary cost estimates, both of which this work sheds light on. To the best of our knowledge, this is the first work aiming at devising a quantum-inspired learning process in an adversarial environment with limited feedback. The features of this work can be summarized as follows.
• This work proposes a quantum exploration-based decision-making algorithm, where a novel probabilistic action selection is adopted for enhancing an adversarial multi-armed bandit (MAB) learning strategy [4], inspired by the amplitude amplification and collapse phenomenon in quantum computation theory.
• This work extends non-classical learning algorithms using a fixed phase with flexible iterations [7] to their counterparts, flexible phases with a single iteration, in a way resembling existing works [9]-[11]. Our work differs from previous works in the way the phases are tuned, overcoming the difficulty of justifying the setting of a free parameter.
• This work generalizes the MAB algorithm by increasing the probability amplitude of a dominant action while decreasing those of the others. This is realized by adjusting importance weights via the devised one-to-one mapping between a quantum-mechanical phase and a learning-based decision probability, which otherwise conventionally requires an extra normalization [10].
• This work alleviates an undesirable situation where a sub-optimal action is amplified due to uncertainty of the empirical cost estimates in an adversarial bandit setting. This is enabled by an implicit exploration estimate process, which reduces variance and bias simultaneously and thus achieves a better ExR/ExT balance [2]. Simulation results verify its effectiveness.

II. RELATED WORKS
This section presents related works in the area of quantum-enhanced exploration strategies, in terms of quantum bandit problems and amplitude amplification methods.
Quantum algorithms for bandit problems have been proposed recently [12]-[15]. The work in [12] initiated the study of quantum algorithms for best-arm identification in MAB, the research in [13] proved optimal results for best-arm identification in MAB with Bernoulli arms, and the authors in [14] proposed quantum algorithms to find an optimal policy for a Markov decision process with quantum speedup. These algorithms investigate potential improvements in the respective multi-armed stochastic bandit problems. The stochastic model may be unrealistic in many applications: data collected in a sequence rarely satisfy the i.i.d. assumption, and it would be naive to think that corruptions never occur. The work in [15] studied a quantum version of the Hedging algorithm, related to the adversarial model, which is considered pessimistic in contexts where we expect learning to be reasonably possible. However, it is limited to a bandit setting.
Quantum algorithms with probability amplitude updating are in general supported by two different approaches [7], [9]-[11]. One is to make use of a fixed phase with multiple Grover iterations, which however suffers from an amplitude jumping issue [7] due to discrete operations. The other is to consider a varied phase with a single iteration, which however suffers from the effects of arbitrary phase variations on the amplitudes due to the nonexistence of a one-to-one mapping between phase rotation and probability amplitudes [9]-[11]. The work in [9] considered an empirical function mapping, e.g., setting relevant free parameters manually. However, such a manual strategy is only valid when sufficient data are available, causing unreliability. The work in [10] considered a parametric mapping that is not reliant on empirical data. However, a substantial number of function forms remain largely unexplored, and thus such a parametric strategy cannot be generalized, causing incompatibility. The work in [11] relaxed the limitations of both empirical and parametric approaches. However, their approach suffers from inflexibility due to a non-monotonic mapping, which fails to simultaneously amplify the dominant action and attenuate the others. Additionally, none of these works considers the uncertainty of the empirical costs generated in an adversarial fashion under an information-limited environment, which could increase the probability amplitude of a sub-optimal action, leading to fatal outcomes.
This work addresses the aforementioned limitations by introducing a novel action updating strategy. This strategy utilizes a local one-to-one mapping between the available phase rotation and the relative disparity of learning scores for both dominant and dominated actions. This approach allows for the simultaneous amplification and attenuation of probabilities. In addition, cumulative learning scores are used in conjunction with an implicit exploration-based biased cost estimation. This technique effectively mitigates the uncertainty associated with importance-weighted estimators in adversarial environments.

III. SYSTEM MODEL AND LEARNING STRATEGY
This section presents the system model and learning-based decision-making, applicable to offloading services.

A. System model
A service client (SC) generates tasks, while a set of service providers (SPs) $k \in \mathcal{K} = \{1, \ldots, K\}$ execute the requested tasks with their own available resources. An SC can send a task $t$, e.g., offloading a computational task [2], to any SP $k$ in the set. Each task $t$ is considered a basic unit for offloading. The demand for resources from each SC may vary depending on the nature of the applications performed, expressed as the product of the input size $q^t$ (bits/task) and the computational complexity (cycles/bit). The service capability of an SP $k$ depends on its resource availability (cycles/sec). The achievable up/down-link transmission rates between an SC and an SP are determined by the wireless medium characteristics. The cost for offloading a task, $D^t_k$, includes the cost of uploading the input to SP $k$, the execution cost at the SP, and the cost of downloading the result to the SC.
This work defines the unit service cost reflecting the service capability of each candidate SP $k$, e.g., the cost of processing one bit of input data for task $t$ on SP $k$, as $l^t_k = D^t_k / q^t$. One aim of this work is to minimize the average unit cost by optimizing the SP selection for each task in each round, $k^t$. We design a learning-based task offloading (TO) algorithm minimizing the expectation of the unit cost, formulated as
$$\mathbf{P}: \min_{\{k^t\}_{t \in \mathcal{T}}} \mathbb{E}\left[\frac{1}{T}\sum_{t \in \mathcal{T}} l^t_{k^t}\right],$$
where $\{l^t_{k^t}\}_{t \in \mathcal{T}}$ is a sequence of unit costs for the $t$-th task in the task set $\mathcal{T}$, and $T = |\mathcal{T}|$ is the number of tasks. The significance of a learning algorithm depends on the adopted benchmark policy against which the algorithm is measured. The learning regret, measuring how much the SC regrets choosing its pulled action sequence over the one given by the optimal policy, is expressed as
$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} l^t_{k^t}\right] - \sum_{t=1}^{T} l^t_{k^*},$$
where the two terms correspond to the expected cumulative costs incurred by an algorithm and by the optimal solution $k^* = \arg\min_k \sum_{t=1}^{T} l^t_k / T$.
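As a small illustration of the unit cost and the regret against the best fixed SP, the following sketch computes the realized (not expected) regret on a given cost table. The function names and the table layout `cost_table[t][k]` are assumptions made for this example.

```python
def unit_cost(offloading_cost, input_bits):
    """Unit service cost l = D / q: cost per input bit for one task."""
    return offloading_cost / input_bits


def empirical_regret(cost_table, chosen):
    """Realized regret of a played sequence against the best fixed SP.

    cost_table[t][k] is the unit cost of SP k for task t,
    chosen[t] is the index of the SP actually played for task t.
    """
    num_tasks = len(cost_table)
    num_sps = len(cost_table[0])
    incurred = sum(cost_table[t][chosen[t]] for t in range(num_tasks))
    best_fixed = min(
        sum(cost_table[t][k] for t in range(num_tasks)) for k in range(num_sps)
    )
    return incurred - best_fixed
```

Because the benchmark is a single fixed SP chosen in hindsight, this quantity can be negative for a sequence that switches SPs at the right moments.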

B. Online learning decision-making in bandit setting
Consider a framework of online learning where an SC selects one SP, $k \in \mathcal{K}$, based on an unknown cost function. There exists a trade-off between exploiting the empirically best SP for instantaneous costs and exploring the other SPs for potential benefits. The trade-off is formulated as a MAB problem specified by $\mathcal{K}$ and $l^t_k$, $t \in \mathcal{T}$. In an adversarial MAB, a randomized policy is used such that an SC draws an arm according to a probability distribution, $k' \sim \mathbf{p}^t = [p^t_k]_{k \in \mathcal{K}}$. One may employ a weighted-average randomized strategy with potentials to achieve a cumulative cost as small as that of the best action [19]. An arm $k$ is assigned the selection probability for task $t$, $p^t_k$, according to the weighted accumulated cost caused by that arm in the past, $\hat{L}^{t-1}_k = \sum_{t'=1}^{t-1} \eta_{t'} \hat{l}^{t'}_k$, where $\hat{l}^{t'}_k$ is the cost estimate from arm $k$ for task $t'$ and $\eta_{t'} \in (0, 1]$ is the learning rate. Considering an exponential potential with the score, $W^t_k = e^{-\hat{L}^{t-1}_k}$, the importance-weighted mechanism assigns exponentially higher probability to strategies with lower cumulative scores up to $t-1$ due to $\partial p^t_k / \partial L_k < 0$ where $L_k = \hat{L}^{t-1}_k$. The scores reinforce the success of each strategy measured by the estimated TO cost, so an SC would rely on the strategy with the lowest one.
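A minimal sketch of the exponential-potential weighting follows, assuming the selection probability is the normalized potential, i.e., $p_k \propto e^{-L_k}$ for a cumulative score $L_k$. The function names are illustrative.

```python
import math


def exp_weights_probs(cum_scores):
    """Selection probabilities proportional to exp(-cumulative score)."""
    base = min(cum_scores)  # subtracting the minimum avoids underflow
    weights = [math.exp(-(s - base)) for s in cum_scores]
    total = sum(weights)
    return [w / total for w in weights]


def update_score(cum_scores, arm, cost_estimate, eta):
    """Add the learning-rate-weighted cost estimate to one arm's score."""
    updated = list(cum_scores)
    updated[arm] += eta * cost_estimate
    return updated
```

Raising one arm's score lowers its probability and raises all the others, which is the monotonicity expressed by the negative partial derivative above.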

IV. QUANTUM AMPLIFICATION EXPLORATION STRATEGY
We develop a quantum learning-based TO algorithm, enabling an SC to learn the TO costs of candidate SPs and to choose an SP with the aid of quantum subroutines.

A. Learning system with quantum concepts
An action in a learning system is represented with a quantum state, inspired by the advantages of quantum computation. Prior to the action selection, carried out by observing the state according to the collapse postulate of QM, the state specified by its probability amplitudes is updated by a QAA process.
1) Quantum basics: The fundamental information unit in quantum computation is the quantum bit (qubit). The qubit states $|0\rangle$ and $|1\rangle$ correspond to the states 0 and 1 of a classical bit. A qubit can also lie in both $|0\rangle$ and $|1\rangle$ at the same time, i.e., in a linear combination of $|0\rangle$ and $|1\rangle$, expressed as $|\Psi\rangle = g_0|0\rangle + g_1|1\rangle$ where $g_0$ and $g_1$ are complex coefficients. This quantum phenomenon is called the state superposition principle. When we measure a qubit in superposition $|\Psi\rangle$, the qubit system collapses into one of its basis states, $|0\rangle$ with probability $|g_0|^2$ or $|1\rangle$ with probability $|g_1|^2$. Thus, $g_0$ and $g_1$ are in general called probability amplitudes, whose magnitude and argument represent amplitude and phase, respectively, satisfying $|g_0|^2 + |g_1|^2 = 1$. According to quantum computation theory, a fundamental operation in the quantum computing process is a unitary transformation $U$ on the qubits. If one applies a transformation $U$ to a superposition state, the transformation acts on all basis vectors of this state, and the output is a new superposition state obtained by superposing the results of all basis vectors. The transformation can simultaneously evaluate the different values of a function for a certain input, which is called quantum parallelism.
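The collapse rule can be simulated classically by sampling a basis state with probability equal to the squared magnitude of its amplitude. The sketch below is an illustrative classical simulation; the function names are assumptions and no quantum hardware or library is involved.

```python
import random


def born_probabilities(amplitudes):
    """Collapse probabilities |g|^2 of a normalized amplitude vector."""
    probs = [abs(g) ** 2 for g in amplitudes]
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("amplitudes must satisfy sum |g|^2 = 1")
    return probs


def measure(amplitudes, rng):
    """Sample the index of the basis state the superposition collapses to."""
    probs = born_probabilities(amplitudes)
    draw = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if draw < cumulative:
            return index
    return len(probs) - 1  # guard against rounding in the last partial sum
```

Only the magnitudes matter for the outcome: the state with amplitudes $(1/\sqrt{2}, i/\sqrt{2})$ collapses to either basis state with probability 1/2, exactly like $(1/\sqrt{2}, 1/\sqrt{2})$.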
2) Collapsing action selection: A quantum state $|\Psi\rangle$ can describe the state of a quantum system. The work in [7] proposed a formal representation for the quantum system with multiple actions. Let $K$ be the number of actions, $K = 2^n$, where $n$ qubits are used to represent eigenactions$^1$. For an $n$-qubit system, its quantum state can be represented with the tensor product of $n$ independent qubits, $|\Psi\rangle = |\Psi_1\rangle \otimes |\Psi_2\rangle \otimes \cdots \otimes |\Psi_n\rangle$, where $\otimes$ denotes the tensor product and $|\Psi_v\rangle$ represents the $v$-th ($v \in [1, n]$) qubit in a superposition of $|0\rangle$ and $|1\rangle$. According to [7, Prop. 1], for an $n$-qubit learning system, its quantum state at $t$ can be expressed as $|\Psi^t\rangle = \sum_{a \in \mathcal{A}^t} g^t_a |a\rangle$, where $\mathcal{A}^t$ is the set of $2^n$ eigenactions, each of which is a binary string of length $n$, and $g^t_a$ is the complex coefficient, the probability amplitude$^2$ of eigenaction $|a\rangle$, subject to $\sum_{a \in \mathcal{A}^t} |g^t_a|^2 = 1$. The index $t$ is omitted below for ease of description. The quantum representation establishes a bridge between the eigenactions $\mathcal{A}$ and the arms $\mathcal{K}$, shown by $|\Psi\rangle = \sum_{a \in \mathcal{A}} g_a |a\rangle \rightarrow \sum_{k \in \mathcal{K}} g_k |k\rangle$. The actions can be represented by $\log_2 K$ qubits, denoted by $|1\rangle, \cdots, |K\rangle$. Before any QM, the SP to be selected by an SC is held in a superposition state $|\Psi\rangle$, which collapses to one of its eigenactions with probability $p_k = |g_k|^2$, $|\Psi\rangle \rightarrow |k\rangle$, when an agent measures the quantum state according to the collapse postulate of quantum mechanics [7]. Such a quantum collapse phenomenon can be considered as creating information on the action selection strategy, e.g., $k' \sim \mathbf{p}$ where $\mathbf{p} = [p_k]_{k \in \mathcal{K}}$.
3) Amplifying probability amplitude: Before the collapse, the probability amplitudes of eigenactions can be reshaped via a QAA subroutine, e.g., Grover iterations, each of which gradually modifies the collapsing probabilities. The evolution of a system is described by a unitary transformation performed on the superposition states of its possible eigenactions to amend the probability amplitudes, updated after $n$ Grover iterations on $|\Psi_0\rangle$, a state before amplification, viewed as
$$|\Psi_n\rangle = G^n |\Psi_0\rangle, \quad (1)$$
where $|\Psi_0\rangle = \sum_{k \in \mathcal{K}} g_k |k\rangle$ and $G$ is a Grover iteration which has two substeps, an oracle query and a diffusion operation, built in the form of the unitary as follows
$$G = U_{(\phi_2, \Psi_0)} U_{(\phi_1, m)}, \quad U_{(\phi_1, m)} = I - (1 - e^{i\phi_1})|m\rangle\langle m|, \quad U_{(\phi_2, \Psi_0)} = (1 - e^{i\phi_2})|\Psi_0\rangle\langle\Psi_0| - I, \quad (2)$$
where $U_{(\phi_1, m)}$ is an operation based on an oracle query, shifting the phase of the target action$^3$ $|m\rangle$ with $\phi_1$, and $U_{(\phi_2, \Psi_0)}$ is a diffusion operation, rearranging the phases of all actions with $\phi_2$; $\langle m|$ and $\langle\Psi_0|$ are the conjugate transposes of $|m\rangle$ and $|\Psi_0\rangle$. While two operators have no effect on $m$ except normalization, they amend the target action's amplitude.

$^1$ The actions in the classical system are denoted as the corresponding orthogonal bases and are called the eigenactions in a quantum system.
$^2$ Amplitudes correspond to quantum probabilities representing the chance that a quantum state will collapse to a given eigenaction when being observed.
$^3$ Classically, $m = \arg\max_k p_k$, while non-classically done by [18].
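A single generalized Grover iteration can be simulated classically on an amplitude vector. The sketch below assumes the commonly used operator forms $U_1 = I - (1 - e^{i\phi_1})|m\rangle\langle m|$ for the oracle step and $U_2 = (1 - e^{i\phi_2})|\Psi_0\rangle\langle\Psi_0| - I$ for the diffusion step; these forms and the function name are assumptions of this example.

```python
import cmath
import math


def grover_step(amplitudes, target, phi1, phi2):
    """Apply G = U2 U1 once to the initial state given by `amplitudes`."""
    # Oracle step: multiply the target amplitude by exp(i * phi1).
    shifted = list(amplitudes)
    shifted[target] = cmath.exp(1j * phi1) * amplitudes[target]
    # Diffusion step: (1 - exp(i * phi2)) |Psi0><Psi0| - I applied to `shifted`.
    overlap = sum(a.conjugate() * b for a, b in zip(amplitudes, shifted))
    coeff = (1 - cmath.exp(1j * phi2)) * overlap
    return [coeff * a - b for a, b in zip(amplitudes, shifted)]
```

With $\phi_1 = \phi_2 = \pi$ and four equally weighted actions this reduces to the textbook Grover step, which moves all probability onto the target in one iteration; with $\phi_1 = \phi_2 = 0$ the probabilities are unchanged.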

B. Quantum amplitude amplification based exploration
The effect of the Grover iterations on $|\Psi_0\rangle$, due to its probability updating nature, can be used as a quantum learning strategy. A natural question is how to amplify/attenuate the amplitudes appropriately, yielding a better exploration strategy.
1) Controlling probability amplitude: Note that the parameters $\phi_1$, $\phi_2$, and $n$ in (1) and (2) determine how the probability amplitudes are updated. The transformation can be executed with proper values of the parameters corresponding to importance weights for the eigenactions. Different amplitude updating approaches have been considered in [7], [9]-[11]. Generally, one is to fix $n = 1$ with varied values of $\phi_1$ and $\phi_2$ as learning-related factors, and another is to use a feasible value of $n$ with fixed values of $\phi_1$ and $\phi_2$. Since the latter suffers from intermittent updates of the amplitudes, the former is adopted in this work, i.e., $n = 1$ with varied $\phi_1$ and $\phi_2$.
2) Mapping phase/probability amplitudes: Note that the overall effect of $G$ on $|\Psi_0\rangle$ is a two-substep phase rotation that updates the probability amplitudes, i.e., by selecting feasible $\phi_1$ and $\phi_2$, it is possible to manipulate the values of $\varrho$ and $\varsigma$. While existing works in [9]-[11] focused on updating the probability amplitude of a target action only, e.g., amplifying/attenuating the amplitude for a good/bad action, they are limited in generalizability and complexity, requiring i) a free parameter selection indicating an amplified/attenuated degree that varies across situations and ii) a re-normalization updating the probability amplitudes of untargeted actions, both of which are due to the lack of a one-to-one mapping between quantum probability and phase rotation amplitudes. This work proposes a pipeline to support the mapping operation by designing a local monotone function.

Lemma 1. (Impact of G) The updated coefficients in amplification/attenuation, defined as the ratio between the amplitudes of targeted/untargeted actions after being acted on by an operator $G$ and before that, can be expressed as
$$\varrho = 1 + (1 - p_m)\kappa, \qquad \varsigma = 1 - p_m \kappa,$$
where $\kappa = 4(2p_m - 1)\sin^2(\phi_1/2)(\cos\phi_2 - 1) + 2\sin\phi_1 \sin\phi_2$. It is straightforward to conclude that $\varrho$ and $\varsigma$ lie on opposite sides of 1, i.e., whenever one is larger than 1 the other is smaller, irrespective of $\phi_1$ and $\phi_2$. Based on the phase matching condition [17], $\phi = \phi_1 = \phi_2$, their second derivatives w.r.t. $\phi$ also have signs opposite to each other due to
$$\frac{\partial^2 (1 - \varrho)}{\partial \phi^2} = -(1 - p_m)\kappa', \qquad \frac{\partial^2 (1 - \varsigma)}{\partial \phi^2} = p_m \kappa',$$
where $\kappa' = (4 - 8p_m)\cos\phi + 8p_m \cos 2\phi$. Such a converse relation between $\varrho$ and $\varsigma$ allows focusing on updating one of them.
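The amplification and attenuation ratios can be checked numerically. The sketch below assumes the generalized Grover operator $G = U_2 U_1$ with $U_1 = I - (1 - e^{i\phi_1})|m\rangle\langle m|$ and $U_2 = (1 - e^{i\phi_2})|\Psi_0\rangle\langle\Psi_0| - I$, and takes the ratios as collapse probabilities after $G$ divided by those before. Under these assumptions a direct calculation gives $\varsigma = 1 - p_m\kappa$ and, by normalization, $\varrho = 1 + (1 - p_m)\kappa$ with $\kappa = 4(2p_m - 1)\sin^2(\phi_1/2)(\cos\phi_2 - 1) + 2\sin\phi_1\sin\phi_2$. All function names are illustrative.

```python
import cmath
import math


def kappa(p_m, phi1, phi2):
    """Shared factor of both ratios for a target with prior probability p_m."""
    return (4 * (2 * p_m - 1) * math.sin(phi1 / 2) ** 2 * (math.cos(phi2) - 1)
            + 2 * math.sin(phi1) * math.sin(phi2))


def ratios_closed_form(p_m, phi1, phi2):
    """(target ratio, non-target ratio) from the closed-form expressions."""
    k = kappa(p_m, phi1, phi2)
    return 1 + (1 - p_m) * k, 1 - p_m * k


def ratios_simulated(probs, target, phi1, phi2):
    """Same ratios obtained by applying G to real amplitudes sqrt(probs)."""
    g = [math.sqrt(p) for p in probs]
    shifted = list(g)
    shifted[target] = cmath.exp(1j * phi1) * g[target]   # oracle step
    overlap = sum(a * b for a, b in zip(g, shifted))     # g is real
    coeff = (1 - cmath.exp(1j * phi2)) * overlap         # diffusion step
    out = [coeff * a - b for a, b in zip(g, shifted)]
    other = next(k for k in range(len(probs)) if k != target)
    return (abs(out[target]) ** 2 / probs[target],
            abs(out[other]) ** 2 / probs[other])
```

Since the target ratio minus 1 and the non-target ratio minus 1 are $(1 - p_m)\kappa$ and $-p_m\kappa$, they always have opposite signs, so amplifying one side necessarily attenuates the other without a separate re-normalization.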
An action is rewarded/punished with higher/lower unit effort. To determine an updating degree, e.g., establishing how much it would be amplified/attenuated, the differences in learning scores between the optimal arm and sub-optimal ones can be considered, [...], representing the relative disparity between targeted and untargeted actions. Since the values are lower than or equal to 1 for all actions, we map the average obtained relative disparity $D$ to the ratio $\varsigma$ via an appropriate adjustment of $\phi$. To diminish the probabilities of untargeted actions in proportion to $D$, one may find a range where the probability amplitudes vary monotonically.
Next, we show how to establish $\phi$ for the amplitude amplification, by identifying a locally monotonic function of $\varsigma$ on $\phi$ and specifying a one-to-one mapping between $D$ and $\varsigma$.

Proposition 1. (Finding of φ) The ratios $\varrho$ and $\varsigma$ can be controlled via a phase $\phi = -\arccos\left(W_{(1 - \varsigma_{\min})D + \varsigma_{\min}}\right)$ where $W$ [...]

Proof. Note that the ratio $\varsigma$ is monotonically increasing within a specified range. The ratio $\varsigma$ has local maximum/minimum points at $\phi = 0$, $\pi$, $\arccos\left(1 - \frac{1}{2p_m}\right)$, each of which satisfies $\frac{\partial \varsigma}{\partial \phi} = 0$. It increases in $\phi$, $\frac{\partial \varsigma}{\partial \phi} > 0$, when case i) $\sin\phi < 0$ and $\cos\phi > 1 - \frac{1}{2p_m}$, or case ii) $\sin\phi > 0$ and $\cos\phi < 1 - \frac{1}{2p_m}$. While for case ii) $\varsigma$ may have different maximum values for different $p_m$ in its increasing range, for case i) the ratio $\varsigma$ monotonically increases in $\phi$ over the range
$$-\arccos\left(1 - \frac{1}{2p_m}\right) < \phi \le 0, \quad (3)$$
[...] and reaches the maximum equal to 1 only at $\phi = 0$ irrespective of $p_m$, which allows us to focus on case i), see Fig. 2. Note that the ratio $\varsigma = 1 - p_m\kappa$ in Lemma 1 increases w.r.t. the phase $\phi = -\arccos(W_\varsigma)$ satisfying Eq. (3). The feasible $\phi$ is set in proportion to the average obtained relative disparity $D$, which can be one-to-one mapped to the range of $\varsigma$ given $p_m$. Thus, the ratios $\varrho$ and $\varsigma$ can be controlled via $\phi = -\arccos\left(W_{(1 - \varsigma_{\min})D + \varsigma_{\min}}\right)$.

Algorithm 1 Quantum amplification exploration strategy
1: Input: [...]
[...]
5: Set $|\Psi\rangle \leftarrow$ updating $(\varrho, \varsigma)$ with $\phi$ set by Prop. 1
6: Set $k' \leftarrow$ measuring $|\Psi\rangle$ and play the strategy $k'$
7: Get $l_{k'}$ and update $W$ with $\eta_t$, $\gamma_t$ by Prop. 2
8: end for
9: Output: sequences $l^t_{k'} > 0$, $t \in \mathcal{T}$
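To illustrate how a phase could be obtained from a desired attenuation ratio, the sketch below inverts the phase-matched relation $\varsigma(\phi) = (1 - 2p_m(1 - \cos\phi))^2$ on the branch ending at $\phi = 0$. That relation follows from $\varsigma = 1 - p_m\kappa$ with $\phi_1 = \phi_2 = \phi$ under the standard generalized Grover operator, which is an assumption here, as are the explicit inverse, the form of $\varsigma_{\min}$, and the linear map from a disparity $D \in [0, 1]$ to $\varsigma$. It is a hypothetical stand-in and not necessarily the closed form used in the proposition.

```python
import math


def sigma_of_phase(p_m, phi):
    """Attenuation ratio of non-target arms for phi1 = phi2 = phi."""
    return (1.0 - 2.0 * p_m * (1.0 - math.cos(phi))) ** 2


def sigma_min(p_m):
    """Smallest ratio reachable on the increasing branch that ends at phi = 0."""
    return max(0.0, 1.0 - 4.0 * p_m) ** 2


def phase_for_sigma(p_m, sigma):
    """Non-positive phase phi on that branch with sigma_of_phase(p_m, phi) = sigma."""
    if not (sigma_min(p_m) - 1e-12 <= sigma <= 1.0 + 1e-12):
        raise ValueError("sigma is outside the reachable range for this p_m")
    root = math.sqrt(min(1.0, max(0.0, sigma)))
    cos_phi = 1.0 - (1.0 - root) / (2.0 * p_m)
    cos_phi = max(-1.0, min(1.0, cos_phi))  # clamp rounding errors
    return -math.acos(cos_phi)


def phase_for_disparity(p_m, disparity):
    """Map a disparity D in [0, 1] linearly onto [sigma_min, 1], then invert."""
    low = sigma_min(p_m)
    sigma = min(1.0, max(low, (1.0 - low) * disparity + low))
    return phase_for_sigma(p_m, sigma)
```

For $p_m \ge 1/4$ the ratio can be driven all the way to zero, whereas for smaller $p_m$ the branch stops at $\phi = -\pi$ with a strictly positive minimum, so the usable phase range depends on how dominant the target already is.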
Remark 1. (Profiles of φ and ς) Note that $\varsigma$ decreases in $p_m$ due to $\frac{\partial \varsigma}{\partial p_m} < 0$ in Prop. 1, and thus attenuated probabilities are achieved, see Fig. 2. For a high $p_m$, the impact of $\phi$ on $\varsigma$ becomes large, and thus $\phi$ can be tuned within a small variation range for the updating. In contrast, for a relatively small $p_m$, a much larger degree of freedom for adjusting $\phi$ is available, which is a natural way to avoid local maxima with a relatively small $p_m$. Setting $\phi$ tunes $\varrho$ and $\varsigma$ simultaneously.
3) Processing implicit cost estimation: An SC selects an arm for a task and receives the cost from the selected arm only, not from the others. The cost from an arm $k \neq k'$ cannot be observed due to incomplete feedback in the bandit problem. One may use an unbiased estimate, $\hat{l}^t_k = \frac{l^t_k}{p^t_k}\mathbb{1}\{k = k'\}$, but it could cause large fluctuations in the cost because it is inversely proportional to $p^t_k$. Instead, this work considers the Exp3 algorithm endowed with implicit exploration (IX)-style cost estimates [16], which controls the variance at the price of extra bias. After each action, the cost estimate is calculated as $\tilde{l}^t_k = \frac{l^t_k}{p^t_k + \gamma_t}\mathbb{1}\{k = k'\}$, where $\gamma_t \in (0, 1]$ is the implicit learning rate. While actions with large costs are assigned negligible probabilities by the classical recipe [19], such an implicit price allows them to keep low but non-negligible probabilities and to be chosen occasionally. Thus, the estimator could guarantee performance with high probability.
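The difference between the two estimators is easy to see numerically. The sketch below follows the usual importance-weighted form and the usual IX form, in which $\gamma$ is added to the denominator; the function names are illustrative.

```python
def unbiased_estimate(cost, chosen, probs):
    """Importance-weighted estimate: cost / p for the played arm, 0 otherwise."""
    return [cost / p if k == chosen else 0.0 for k, p in enumerate(probs)]


def ix_estimate(cost, chosen, probs, gamma):
    """IX-style estimate: cost / (p + gamma) for the played arm, 0 otherwise."""
    return [cost / (p + gamma) if k == chosen else 0.0 for k, p in enumerate(probs)]
```

For a rarely played arm with probability 0.1 and observed cost 0.5, the unbiased estimate is 5.0, while the IX estimate with $\gamma = 0.1$ is 2.5: the estimate is biased downwards but its magnitude is capped at cost/$\gamma$, which limits the variance.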

C. Proposed algorithm
The workflow of the proposed algorithm (Algorithm 1) can be divided into three parts: i) interaction, ii) estimation, and iii) selection. While the first part is a typical interaction as an external learning process, the last two parts correspond to classical and quantum-inspired operations as an internal learning process. An iterative method is used to link the conventional outer and inner processes such that the classical information is conveyed from a step $t$ to the next step $t + 1$ via interaction between an agent and the adversary, including strategy playing, feedback getting, and cost suffering. The internal learning process is characterized by the score updating rule (estimation) and the local selection rule defined by what action is output given the score (selection). The algorithm is designed in a modular way so that its quantum-inspired part can be treated as a separate building block where the quantum enhancement is exhibited, whose source lies in the use of quantum subroutines to perform each internal selection process. The probability distributions $\mathbf{p}^t \in \mathbb{R}^K$ are passed to the quantum subroutines where, instead of sampling one action in a classical manner, one sample is obtained in a quantum setting by preparing the state $|\Psi_0\rangle = \sum_{k \in \mathcal{K}} g^t_k |k\rangle$ where $|g^t_k| = \sqrt{p^t_k}$, updating it with the proposed amplification, see Prop. 1, and measuring the updated $|\Psi\rangle$, e.g., collapsing action selection.
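The interaction, estimation, and selection parts can be put together in a classical simulation. The sketch below is illustrative only and makes several assumptions: a constant phase `phi` is supplied by the caller instead of being derived from the disparity mapping, the phase matching $\phi_1 = \phi_2 = \phi$ is used, the dominant arm is taken as the arg max of the current probabilities, the standard generalized Grover operator is used, and the IX-style update divides by the amplified probability actually used for sampling. All names are hypothetical.

```python
import cmath
import math
import random


def exp_weights_probs(scores):
    """Probabilities proportional to exp(-score)."""
    base = min(scores)
    weights = [math.exp(-(s - base)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def amplify(probs, target, phi):
    """Collapse probabilities after one generalized Grover step on sqrt(probs)."""
    g = [math.sqrt(p) for p in probs]
    shifted = list(g)
    shifted[target] = cmath.exp(1j * phi) * g[target]   # oracle phase shift
    overlap = sum(a * b for a, b in zip(g, shifted))     # g is real
    coeff = (1 - cmath.exp(1j * phi)) * overlap          # diffusion about Psi0
    out = [coeff * a - b for a, b in zip(g, shifted)]
    return [abs(x) ** 2 for x in out]


def run(costs, phi, eta, gamma, seed=0):
    """Play one arm per row of `costs`; return the played arms and final scores."""
    rng = random.Random(seed)
    num_arms = len(costs[0])
    scores = [0.0] * num_arms
    played = []
    for row in costs:
        p = exp_weights_probs(scores)                    # estimation -> probabilities
        m = max(range(num_arms), key=p.__getitem__)      # dominant arm
        q = amplify(p, m, phi)                           # amplify m, attenuate others
        k = rng.choices(range(num_arms), weights=q)[0]   # simulated measurement
        scores[k] += eta * row[k] / (q[k] + gamma)       # IX-style score update
        played.append(k)
    return played, scores
```

With `phi = 0` the amplification step leaves the probabilities unchanged, so the loop falls back to plain exponential weighting with IX-style estimates, which makes the effect of a nonzero phase easy to isolate in experiments.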
Proposition 2. The quantum strategy with $\phi \neq 0$ can achieve better regret than the one with $\phi = 0$, when $\eta_t > \frac{1}{t}$ and $\gamma_t > \frac{1}{2t}$.

Remark 2. Note that the collapse of a quantum state is not a real selection, but just a fundamental phenomenon when the state is measured, resulting in i) a good ExR/ExT balance and ii) a natural action selection without setting parameters, unlike conventional approaches. The agent can explore its strategies in superposition in a way that guarantees a provable regret improvement in its learning time over its classical analogue.

V. PERFORMANCE EVALUATION
This section conducts numerical studies to assess the learning performance of the proposed algorithm.
Consider an SC requesting computational resources from candidate SPs. The distance between the SC and each SP follows a uniform distribution, $d \sim \mathcal{U}[0, d_r]$, where $d_r$ is the communication range equal to 400 m. The transmission power of the SC is 24 dBm, the channel bandwidth is 10 MHz, the noise power is −174 dBm/Hz, and the large/small-scale fading gains follow $128.1 + 37.6\log_{10}(d)$ and a Rayleigh distribution with unit variance, respectively. The interference effects on the co/adjacent channel are ignored [2]. Consider 5 SPs with maximum CPU frequency $F_k \in \{6, 6, 5, 4, 3.5\}$ GHz for $T = 3 \times 10^3$. For an SP, the CPU frequency allocated to the SC is a fraction of the maximum, distributed from 20% to 50% but arbitrarily constrained [2]. The computational complexity and task size are set to $10^3$ cycles/bit and $10^6$ bits/task.
The proposed quantum algorithm is compared to the conventional counterparts in terms of the learning regret. Those counterparts include choosing arms based on i) an upper confidence bound such as UCB [20], ii) current knowledge with a probability $1 - \epsilon$ such as $\epsilon$-Greedy [...] actions in an adversarial environment, compared to the counterparts. This is because the QAA process associated with implicit exploration-style cost estimates allows the probabilities yielded from the learning scores to be amplified/attenuated simultaneously and smoothly, thus reducing the average regret by 50% and 40% from those of Exp3IX and QB with a sole ratio tune case ($\varrho > 1$, $\varsigma = 1$) requiring re-normalization [11]. Fig. 3(b) demonstrates that the superior performance of the proposed algorithm holds for different numbers of SPs $K$. A fine-grained implicit exploration approach could achieve higher and more robust performance, obtaining a lower empirical mean and standard deviation of the regret than the others.
Fig. 3(c) depicts the corresponding solution behaviors of $\varsigma$ and $\phi$ w.r.t. $p_m$. i) The probability of a dominant action increases alongside the learning progress. A larger gap of probabilities between the dominant action and the overall dominated actions, $p^t_m$ and $\sum_{k \in \mathcal{K} \setminus m} \frac{p^t_k}{K-1}$, guides us to set a lower $\phi$ (Prop. 1). ii) As $K$ increases, the selected action $m$ with a given $p^t_m$ has higher dominance than the others, $p^t_m \gg \frac{1 - p^t_m}{K - 1}$, and thus the chosen $\phi$ becomes lower, resulting in larger variability of $\phi$. iii) Meanwhile, the minimum limit of $\phi$ increases starting from $p^t_m$ equal to $\frac{1}{4}$ by Eq. (3), and the probability gap taken proportionally relative to the reduced range of $\phi$ yields a larger $\phi$. Choosing an appropriate value of $\phi \neq 0$ allows for simultaneously amplifying the amplitude of a dominant action while attenuating those of the others, thereby leading to better performance (Prop. 2).
The proposed algorithm has the potential for powerful computation in complex unknown environments, leveraging related quantum apparatuses.The quantum-inspired bandit algorithm is designed for quantum computers and motivated by quantum mechanics, but it is effective on traditional computers as well.This is due to two key aspects: (i) the collapse action selection strategy uses quantum measurement postulates to balance ExR-ExT trade-offs, without relying on empirical exploration parameter settings, and (ii) the probability magnitude updating strategy leverages quantum-mechanical phase control to simultaneously boost/suppress learning strategies based on the learning score, following the quantum superposition principle and without requiring additional normalization.

VI. CONCLUSION
This work proposed a quantum-inspired bandit learning algorithm to reduce the service cost under an adversarial environment. The proposed QAA approach allows for the new action update strategy and novel probabilistic action selection, inspired by amplitude amplification and the collapse postulate in quantum computation theory, respectively, together with a devised mapping between a quantum-mechanical phase in a quantum domain and a distilled probability magnitude in a value-based decision-making domain. This method effectively balances convergence speed and learning quality, outperforming traditional exploration approaches. Numerical results demonstrate its superiority over conventional methods.

Fig. 3(c) depicts the corresponding solution behaviors of $\varsigma$ and $\phi$ w.r.t. $p_m$. i) The probability of a dominant action increases alongside the learning progress. A larger gap of probabilities between the dominant action and the overall dominated actions, $p^t_m$ and $\sum_{k \in \mathcal{K} \setminus m} \frac{p^t_k}{K-1}$, [...]

[...] $\hat{L}^{t-1}_k = \sum_{t'=1}^{t-1} \eta_{t'} \hat{l}^{t'}_k$, where $\hat{l}^{t'}_k$ is the cost estimate from arm $k$ for task $t'$ and $\eta_{t'} \in (0, 1]$ is the learning rate. Considering an exponential potential with the score, $W^t_k = e^{-\hat{L}^{t-1}_k}$ [...]

[...] $\frac{1}{t}$ and $\gamma_t > \frac{1}{2t}$.
Proof. Assume that a dominant arm's index is $m$, $L_m \le L_k$, $\forall k \in \mathcal{K}$, one non-dominant arm selection $k \in \mathcal{K} \setminus m$ [...]