Global and Local Convergence Analysis of a Bandit Learning Algorithm in Merely Coherent Games

Non-cooperative games serve as a powerful framework for capturing the interactions among self-interested players and have broad applicability in modeling practical scenarios ranging from power management to the path planning of self-driving vehicles. Although most existing solution algorithms assume the availability of first-order information or full knowledge of the objectives and others' action profiles, there are situations where the only information at players' disposal is the realized objective function values. In this article, we devise a bandit online learning algorithm that integrates the optimistic mirror descent scheme and multi-point pseudo-gradient estimates. We further prove that the generated actual sequence of play converges a.s. to a critical point if the game under study is globally merely coherent, without resorting to extra Tikhonov regularization terms or additional norm conditions. We also discuss the convergence properties of the proposed bandit learning algorithm in locally merely coherent games. Finally, we illustrate the validity of the proposed algorithm via two two-player minimax problems and a cognitive radio bandwidth allocation game.


I. INTRODUCTION
Recent years have witnessed increasing interest in the analysis of multi-agent systems and large-scale networks, which find a wide range of applications such as thermal load management of autonomous buildings [1], power management in sensor networks [2], and path planning and control of self-driving cars [3], with prospects for further applicability in optimal drug delivery in the treatment of diseases [4] and control of environmental pollution [5], etc. One primary objective in multi-agent systems is to devise local protocols for each agent, by following which the resulting group behavior is optimal as measured by a certain system-level metric [6]. With its origins in [7], game theory offers the theoretical tools to model and examine the strategic choices and associated outcomes of rational players who make decisions in a non-cooperative manner. In particular, in the Nash equilibrium problem (NEP), this group of players seeks to reach a stationary point known as a Nash equilibrium (NE), from which no rational player has any incentive to unilaterally deviate.
In order to devise an algorithm for the NEP or its variants, it is crucial to have access to first-order information, i.e., the partial gradient of the local objective function of each player, the evaluation of which nevertheless usually requires the action profiles of all players. In view of this, in some studies [8], [9], [10], the availability of first-order oracles is taken as a given, whereas some other studies [11], [12], [13] investigate scenarios where a communication network exists and players are willing to communicate with their trusted neighbors and keep local estimates of others' action profiles. Despite the progress discussed above, there are many real-world scenarios where players only have access to the observed objective values of selected actions, which makes the bandit/zeroth-order learning strategy a compelling choice. Our primary objective in this work is to develop an online learning algorithm, driven by bandit information, for multi-player continuous games that are globally or locally merely coherent.
Related Work: There have been several recent notable contributions to the field of bandit learning in games. In their work [14], Bravo et al. proposed a bandit version of mirror descent (MD), which guarantees a.s. convergence to an NE when the game is strictly monotone and achieves a convergence rate of O(1/t^{1/3}) for strongly monotone cases. By employing a barrier-based method, Lin et al. [15] improved the convergence rate for strongly monotone games from O(1/t^{1/3}) to O(1/t^{1/2}). Similar convergence rates have also been reported in [16], [17], [18]. Huang et al. [19] developed two bandit learning algorithms by integrating residual pseudo-gradient estimates into single-call extra-gradient schemes that ensure a.s. convergence to critical points of pseudo-monotone plus games. Moreover, in strongly pseudo-monotone plus games, by employing the proposed algorithms, the convergence rate is further elevated to O(1/t^{1−ε}).
To extend the analysis beyond the realm of strictly monotone and pseudo-monotone plus games, Tatarenko et al. [20] utilized a single time-scale Tikhonov regularization and a doubly regularized approximate gradient descent strategy to develop an algorithm that converges to NEs in probability when the game is monotone and four decaying sequences are tuned properly. In a recent study [21], Gao et al. introduced an algorithm that integrates second-order learning dynamics and Tikhonov regularization and established the a.s. convergence of the sequence of play under the assumption that there exists at least one interior variationally stable state (VSS). Yet, the convergence is contingent on the norm condition that the 2-norm of the state sequence should be greater than that of the VSS, which can be challenging to verify during the iterative process.
In the literature on variational inequalities (VIs) and their stochastic versions (SVIs), Mertikopoulos et al. [22] showed that vanilla MD converges when the problem is strictly coherent, a relaxed variant of strict monotonicity, but fails to converge in merely coherent VIs. In contrast, the extra-gradient (EG) method is capable of achieving convergence to a solution in all coherent VIs, but it requires exact operator values. In the presence of random noise in the operator values, strict coherence is necessary to establish the convergence of the EG iteration. A similar convergence analysis is also reported in [23] for pseudo-monotone plus SVIs. To address the challenges posed by random noise, Iusem et al. [24] developed an extra-gradient method for pseudo-monotone SVIs that incorporates an iterative variance reduction procedure and established both asymptotic convergence and convergence rates in terms of the residual function for the proposed algorithm.
In the realm of multi-player games without global monotonicity or coherence, there exists a body of research that delves into games satisfying the weak Minty variational inequality or negative comonotonicity: Pethick et al. [25] and Cai et al. [26], [27] devised algorithms that ensure convergence in the deterministic setting; Diakonikolas et al. [28] proposed a generalization of the extra-gradient method that ensures convergence to a stationary point for unconstrained problems; Pethick et al. [29] extended their previous work to stochastic cases and designed algorithms whose random iterates converge to solutions of constrained problems. Another significant body of research has focused on local solutions when global regularity conditions are absent: Mertikopoulos and Zhou [8] investigated the local convergence properties of mirror descent in deterministic and stochastic cases; Hsieh et al. [30] focused on a class of single-call extra-gradient methods in Euclidean space and established local geometric convergence results for deterministic cases and a local convergence rate of O(1/t) for stochastic cases. These local convergence rate results were later generalized by Azizian et al. [31] to Banach spaces over a range of Legendre exponents.
Contributions: In this work, we develop a bandit online learning algorithm and establish the a.s. convergence of the generated sequence of play under the regularity condition that the game is merely coherent, which is broader and more general than the classes of games investigated in [14], [15], [16], [17], [18]. The proposed algorithm leverages the optimistic mirror descent (OMD) [30], [31] and a single-call extra-gradient scheme as the backbone, which allows us to deal with the absence of strict coherence and reduces the query cost induced by the extra step. Alongside the OMD updates, multi-point pseudo-gradient estimation is employed, and the decaying rate of the variance of the zeroth-order estimates can be controlled by properly tuning the query count per iteration. In contrast to [21], despite the requirement in our approach that every solution be globally merely variationally stable, we avoid enforcing the additional norm condition of [21, Thm. 1]. Additionally, we investigate games with only local mere coherence and establish that, with appropriate initializations, the generated actual sequences of play converge to critical points (CPs) with sufficiently high probability. Furthermore, the validity of the proposed algorithm is verified through two two-player minimax problems and a cognitive radio bandwidth allocation game.
Organization: In Section II, we provide a formal formulation of the multi-player games and briefly introduce optimistic mirror descent. Section III presents the multi-point pseudo-gradient estimate and offers insights into the associated systematic and stochastic errors. Subsequently, in Section IV, we present the proposed algorithm and provide the main convergence results in globally merely coherent games. Section V is dedicated to the examination of local convergence for the proposed learning algorithm. In Section VI, to demonstrate the theoretical findings and the effectiveness of the proposed algorithm, we conduct simulations for two-player zero-sum games and the cognitive radio bandwidth allocation game. Section VII concludes the article and highlights potential extensions and applications.
Basic Notations: Let R_{++} := (0, +∞) and N_+ := N \ {0}. For a set of vectors {v_j}_{j∈I}, [v_j]_{j∈I} denotes their vertical stack.

A conference version of this article can be found in [32], which mainly focuses on the convergence analysis under the global mere coherence assumption.

A. GAME FORMULATION
In a multi-player non-cooperative game G with N players, indexed by N := {1, . . ., N}, each player i ∈ N aims to optimize its own local objective J_i by adjusting its action x_i ∈ X_i ⊆ R^{n_i}, which can be described as follows:

minimize_{x_i ∈ X_i}  J_i(x_i; x_{−i}),

where x_{−i} := [x_j]_{j∈N_{−i}} denotes the stacked action of the other players that parameterizes the objective J_i, with N_{−i} := N \ {i} and x := [x_j]_{j∈N}; X_i denotes the feasible set of player i, and for brevity, we let X := ∏_{j∈N} X_j ⊆ R^n represent the global strategy space and X_{−i} := ∏_{j∈N_{−i}} X_j ⊆ R^{n_{−i}}, with n := Σ_{j∈N} n_j and n_{−i} := Σ_{j∈N_{−i}} n_j. Our analysis primarily lies within Euclidean space; however, it has the potential to be extended to finite-dimensional Hilbert spaces. Our blanket assumptions on the objective functions J_i and the local feasible sets X_i are as follows.
Assumption 1: For each player i, the local objective function J i is continuously differentiable in x over the global strategy space X.Moreover, its individual strategy space X i is compact and convex, and has a non-empty interior.
Given the smoothness posited in Assumption 1, a single-valued operator that we will leverage extensively throughout is the pseudo-gradient operator F : R^n → R^n. It is defined as the concatenation of all the partial gradient operators, i.e., F(x) := [∇_{x_i} J_i(x_i; x_{−i})]_{i∈N}. Before proceeding, we remark that Assumption 1 implicitly implies that F is Lipschitz continuous on X with some constant L, i.e., for any x and x′ ∈ X, we have ‖F(x) − F(x′)‖ ≤ L‖x − x′‖. As for the solution concept, we focus on critical points (CPs) [33, Sec. 2.2], a more relaxed solution concept than Nash equilibria (NEs), whose definition is given as follows.
Definition 1 (Critical Points): A decision profile x* ∈ X is a critical point of the game G if it is a solution to the associated (Stampacchia) variational inequality (VI), i.e., ⟨F(x*), x − x*⟩ ≥ 0 for all x ∈ X, where ⟨•, •⟩ represents the canonical inner product. CPs are the fixed points of the "linearized" best-response iterate x ↦ argmin_{x′∈X} ⟨F(x), x′⟩ and can be perceived as local NEs [33]. CPs form a superset of NEs and coincide with them when J_i is convex and continuously differentiable in x_i for all i [34, Sec. 1.4.2]. We postulate that the games discussed in this work admit at least one CP inside X.
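Since CPs are exactly the solutions of the Stampacchia VI, they can be checked numerically through the fixed-point residual of the "linearized" best-response map. A minimal sketch in the Euclidean case (the box feasible set, step size, and toy pseudo-gradient below are our own illustrative choices, not taken from the article):

```python
import numpy as np

def project_box(x, lo, hi):
    """Euclidean projection onto the box X = [lo, hi]^n."""
    return np.clip(x, lo, hi)

def vi_residual(x, F, tau=0.1, lo=-1.0, hi=1.0):
    """eps(x) = ||x - P_x(-tau * F(x))||^2 with the Euclidean prox (plain
    projection); a CP is exactly a point with zero residual."""
    step = project_box(x - tau * F(x), lo, hi)
    return np.sum((x - step) ** 2)

# Toy pseudo-gradient of a two-player game with a unique CP at the origin:
# J_1 = x1^2 + x1*x2, J_2 = x2^2 - x1*x2  ->  F(x) = (2x1 + x2, 2x2 - x1).
F = lambda x: np.array([2 * x[0] + x[1], 2 * x[1] - x[0]])

print(vi_residual(np.zeros(2), F))                 # 0.0 at the CP
print(vi_residual(np.array([0.5, -0.3]), F) > 0)   # True away from it
```

The same residual ε(•) reappears later as the Lyapunov-style quantity in the convergence analysis.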
In this work, our aim is to propose a new algorithm that is applicable to a broader class of games as compared to strictly monotone games and pseudo-monotone plus games. Moreover, we intend to further relax the pseudo-monotonicity assumptions that are usually imposed upon the structure of the game to ones imposed merely upon its equilibria. Two assumptions are employed in Sections IV and V to facilitate the analysis of global and local convergence, respectively.
Assumption 2 (Global Mere Coherence): The game G is globally merely coherent if, for every CP x* ∈ X*, ⟨F(x), x − x*⟩ ≥ 0 for all x ∈ X.

Assumption 3 (Local Mere Coherence): The game G is locally merely coherent (around a CP set X* ⊆ X) if there exists a neighborhood U with a non-empty interior, such that X* ⊆ int(U) and, for every CP x* ∈ X*, ⟨F(x), x − x*⟩ ≥ 0 for all x ∈ U ∩ X.
We can infer that the set X * is compact due to the inherent properties of the problem setup, i.e., the feasible set is compact, and a CP should fulfill (4).
Remark 1: The reason why we assume that every CP is merely variationally stable in the above assumptions is that we leverage the residual function ε(•) defined in Lemma 4 to prove convergence. Since the convergence of ε(•) only implies the existence of a subsequence convergent to a CP, this condition is needed to pass from subsequence convergence to whole-sequence convergence. In contrast, [21] only requires that there exist a variationally stable x*, and constructs an energy function specific to x* to prove convergence. Yet, another norm condition, ‖X_k‖_2 ≥ ‖x*‖_2 for all k, is posited on the generated sequence (X_k)_{k∈N_+}, and its verification can be challenging. If there are multiple solutions satisfying variational stability, at the very beginning of the iteration it might be unclear which solution one should focus on and to which the sequence will converge, so the choice of the energy function requires some extra care.

B. OPTIMISTIC MIRROR DESCENT
In this subsection, we shall provide a brief overview of the optimistic mirror descent (OMD) algorithm, as well as related concepts and results. As an extension of the Euclidean projection, the mirror map ∇ψ* : R^n → R^n is defined as

∇ψ*(y) := argmax_{x ∈ domψ} {⟨y, x⟩ − ψ(x)},

where ψ : domψ → R is a so-called distance-generating function (DGF), with domψ denoting a convex and open set on which ψ is well-defined. The DGF ψ fulfills the following conditions [35, Sec. 4.1]: i) ψ is continuously differentiable and μ-strongly convex for some μ > 0; ii) ∇ψ(domψ) = R^n; iii) cl(domψ) ⊇ X and lim_{x→∂(domψ)} ‖∇ψ(x)‖_* = +∞. The definition of the DGF ψ allows us to introduce a pseudo-distance called the Bregman divergence, which is defined as

D(p, x) := ψ(p) − ψ(x) − ⟨∇ψ(x), p − x⟩, ∀p, x ∈ domψ.

To let D(p, •) represent a certain distance measure to p and use this measure to define a neighborhood of p, we make the following assumption.

Assumption 4 (Bregman Reciprocity): The chosen DGF ψ satisfies that if the sequence (x_k)_{k∈N_+} converges to some point p, i.e., ‖x_k − p‖ → 0, then D(p, x_k) → 0.
Then, the Bregman divergence generates the prox-mapping P_{x,X} : R^n → domψ ∩ X for some fixed x ∈ domψ ∩ X, which plays a critical role in mirror descent and its variants:

P_{x,X}(y) := argmin_{x′ ∈ X ∩ domψ} {⟨y, x − x′⟩ + D(x′, x)}.

With all these in hand, the OMD [30], [31] can be expressed as below:

X_{k+1/2} = P_{X_k,X}(−τ_k F(X_{k−1/2})),   X_{k+1} = P_{X_k,X}(−τ_k F(X_{k+1/2})),
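For a concrete instance of the prox-mapping P_{x,X}, one can take the negative entropy as the DGF over the probability simplex, in which case the prox admits a closed-form multiplicative update (this is the setting the article later mentions for the RPS experiment, where the mirror map reduces to a softmax). A minimal sketch, with illustrative vectors of our own choosing:

```python
import numpy as np

def entropy_bregman(p, x):
    """Bregman divergence of the negative entropy psi(x) = sum_j x_j log x_j
    restricted to the simplex: D(p, x) = sum_j p_j log(p_j / x_j) (the KL
    divergence)."""
    return float(np.sum(p * np.log(p / x)))

def prox_step(x, y):
    """Prox-mapping P_x(y) = argmin_{x' in simplex} {<y, x - x'> + D(x', x)}.
    For the entropy DGF this has the closed form x'_j proportional to
    x_j * exp(y_j), i.e., a multiplicative-weights update."""
    z = x * np.exp(y - y.max())          # max-shift for numerical stability
    return z / z.sum()

x = np.array([0.2, 0.3, 0.5])
v = np.array([1.0, -0.5, 0.0])           # e.g. y = -tau * F(x)
xp = prox_step(x, v)
print(xp.sum())                          # stays on the simplex (sums to 1)
print(entropy_bregman(x, x))             # D(x, x) = 0.0
```

Note that the Bregman reciprocity requirement of Assumption 4 is satisfied here, since KL divergence to a fixed interior point vanishes along any norm-convergent sequence.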
where (τ_k)_{k∈N_+} denotes a proper sequence of step sizes. The update consists of the following two steps. Given the base state X_k at step k, in the look-forward step, the leading state X_{k+1/2} is procured by updating X_k with the proxy F(X_{k−1/2}), queried at X_{k−1/2}, rather than the exact pseudo-gradient F(X_k) queried at X_k, to reduce the oracle calls per iteration. This step is essential in anticipating the landscape of F and facilitating convergence when F is merely monotone, i.e., ⟨F(x) − F(y), x − y⟩ ≥ 0 for all feasible x and y [36]. In the state-updating step, the base state X_k is revised to X_{k+1} following the pseudo-gradient information F(X_{k+1/2}). The OMD falls into the single-call category, distinguishing itself from the conventional extra-gradient algorithm [24] by exclusively utilizing the first-order information at the leading state X_{k+1/2}, rather than at both X_k and X_{k+1/2}.
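The value of the look-forward step can be checked on the classic unconstrained bilinear saddle f(x, y) = xy, whose pseudo-gradient is merely monotone; this toy problem is our own illustration, not one of the article's examples. With the Euclidean DGF and no constraints, the prox-mapping reduces to a plain gradient step:

```python
import numpy as np

# Bilinear saddle f(x, y) = x * y; pseudo-gradient F(x, y) = (y, -x).
# F is merely monotone: <F(z) - F(z'), z - z'> = 0 for all z, z' (skew map).
F = lambda z: np.array([z[1], -z[0]])

def gda(z0, tau=0.1, iters=3000):
    """Vanilla (mirror) descent: z <- z - tau * F(z)."""
    z = z0.copy()
    for _ in range(iters):
        z = z - tau * F(z)
    return z

def omd(z0, tau=0.1, iters=3000):
    """Optimistic mirror descent with a single oracle call per iteration."""
    z, z_half_prev = z0.copy(), z0.copy()
    for _ in range(iters):
        z_half = z - tau * F(z_half_prev)   # look-forward step (past gradient)
        z = z - tau * F(z_half)             # state-updating step
        z_half_prev = z_half
    return z

z0 = np.array([1.0, 1.0])
print(np.linalg.norm(gda(z0)))   # vanilla step spirals outward (norm grows)
print(np.linalg.norm(omd(z0)))   # OMD contracts toward the unique CP (0, 0)
```

The contrast matches the discussion above: without the anticipatory proxy, the plain descent step fails on merely monotone F, while the optimistic variant converges.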

III. MULTI-POINT PSEUDO-GRADIENT ESTIMATION
In this article, we examine the scenario where the first-order information at the leading state, i.e., F(X_{k+1/2}), is not readily available, and players need to estimate it based on the realized objective function values. A prevalent technique in the literature on first-order information estimation is the simultaneous perturbation stochastic approximation (SPSA) approach [14]. For each i ∈ N, let B_i, S_i ⊆ R^{n_i} denote the unit ball and the unit sphere centered at the origin, respectively. At each iteration k, before implementing the SPSA estimate, we initially undertake the following perturbation step:

X̂^i_{k+1/2} := (1 − δ_k/r_i) X^i_{k+1/2} + (δ_k/r_i) p_i + δ_k u^i_k,   (9)

where u^i_k is randomly sampled from S_i ⊆ R^{n_i} and we define u_k := [u^i_k]_{i∈N}; δ_k represents the query radius at iteration k; and B(p_i, r_i) ⊆ X_i is an arbitrary fixed Euclidean ball within the feasible set X_i centered at p_i with radius r_i. By merit of the feasibility adjustment in (9), the action to be taken will sit within the feasible set, i.e., X̂^i_{k+1/2} ∈ X_i and X̂_{k+1/2} := [X̂^i_{k+1/2}]_{i∈N} ∈ X. The SPSA estimate can then be expressed as (n_i/δ_k) J_i(X̂_{k+1/2}) u^i_k. Nevertheless, as previously noted in [14], the SPSA approach incurs an increasing estimation variance when the query radius is reduced to improve the estimation accuracy, which results in conservative choices of the updating step sizes τ_k and significant degradation of the convergence rate. To resolve this conundrum, increased consideration has been given to schemes such as two-point estimation and residual estimation, which keep the variance bounded. On account of this, we consider the multi-point pseudo-gradient estimation (MPG) scheme, whose counterparts in the field of optimization can be found in [37]. At every iteration k, each player i executes the perturbation step in (9) (T_k + 1) times in an independent manner, takes the actions X̂^i_{k+1/2,t}, and observes the associated realized objective function values J_i(X̂_{k+1/2,t}), where the variable t ∈ N indexes the multiple samples taken per iteration. The multi-point pseudo-gradient estimate can be formulated as below:

F̂^i_k := (n_i / (δ_k T_k)) Σ_{t=1}^{T_k} (Ĵ^i_{k,t} − Ĵ^i_{k,0}) u^i_{k,t},

where (u^i_{k,t})_{t=0,...,T_k} are i.i.d. random variables uniformly distributed over S_i, and the action taken by player i is given by X̂^i_{k+1/2,t} := (1 − δ_k/r_i) X^i_{k+1/2} + (δ_k/r_i) p_i + δ_k u^i_{k,t}. To simplify the presentation, we will henceforth use Ĵ^i_{k,t} to represent the realized objective value J_i(X̂_{k+1/2,t}) for the t-th sample at iteration k. Prior to delving into the properties of MPG, we first outline the probability setup to streamline our later discussion. Let (Ω, F, P) denote the underlying probability space. The filtration (F_k)_{k∈N_+} is constructed such that F_k captures the update that results in X_k, i.e., the entire information up to and including iteration k − 1. Then, to characterize MPG, we start by considering the following decomposition of it:

F̂^i_k = F_i(X_k) + B^i_k + V^i_k,

where B^i_k := E[F̂^i_k | F_k] − F_i(X_k) denotes the systematic error and V^i_k := F̂^i_k − E[F̂^i_k | F_k] denotes the stochastic error. To facilitate later analysis, for each J_i, we introduce the δ-smoothed objective function J̃^i_δ, obtained by averaging J_i over balls of radius δ.

Algorithm 1: Zeroth-Order Variance-Reduced Learning of CPs Based on Optimistic Mirror Descent (Player i).
Require: step size τ; query radii (δ_k)_{k∈N_+}; sample counts (T_k)_{k∈N_+}; p_i and r_i, the center and radius of an arbitrary ball within the set X_i.
for t = 0, 1, . . ., T_k do
  Randomly sample the direction u^i_{k,t} from S_i.
  Take the action X̂^i_{k+1/2,t}.
  Observe the realized objective function value Ĵ^i_{k,t} := J_i(X̂^i_{k+1/2,t}; X̂^{−i}_{k+1/2,t}).
end for

The lemmas presented below provide an examination of the properties of B^i_k and V^i_k, which will be later employed in the proof of the main theorem. Their proofs are reported in Appendix A.
Lemma 1: Suppose that Assumption 1 holds. Then at each iteration k, the conditional expectation E[F̂^i_k | F_k] satisfies a bias bound of order O(δ_k), i.e., ‖B^i_k‖ ≤ c δ_k for some constant c > 0 independent of k.

In contrast to the single-point or two-point estimates, the advantage of utilizing MPG is primarily demonstrated in the following lemma, which measures the decaying rate of the stochastic error w.r.t. the number of samples.
Lemma 2: Suppose that Assumption 1 holds. Then at each iteration k, the squared norm of the stochastic error satisfies E[‖V^i_k‖² | F_k] = O(1/T_k).
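The flavor of Lemmas 1 and 2 can be checked empirically on a smooth test function: with a shared baseline query, the multi-point estimate is nearly unbiased for small δ, and its variance shrinks as the sample count grows. The sketch below is a simplified single-agent rendition of the estimator; the quadratic objective and all parameters are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def mpg_estimate(J, x, n, delta, T):
    """Multi-point pseudo-gradient estimate (sketch of the MPG idea):
    F_hat = (n / (delta * T)) * sum_t (J(x + delta*u_t) - J(x + delta*u_0)) * u_t,
    with u_0, ..., u_T drawn i.i.d. uniformly from the unit sphere."""
    u = rng.standard_normal((T + 1, n))
    u /= np.linalg.norm(u, axis=1, keepdims=True)    # uniform on the sphere
    J0 = J(x + delta * u[0])                          # shared baseline query
    diffs = np.array([J(x + delta * u[t]) - J0 for t in range(1, T + 1)])
    return (n / (delta * T)) * diffs @ u[1:]

# Quadratic test objective: its delta-smoothed gradient equals the true one.
b = np.array([0.5, -1.0, 0.25])
J = lambda x: 0.5 * np.dot(x, x) + np.dot(b, x)
x0 = np.array([0.3, -0.2, 0.1])
true_grad = x0 + b

est = np.mean([mpg_estimate(J, x0, 3, 0.01, 50) for _ in range(500)], axis=0)
print(np.abs(est - true_grad).max() < 0.1)   # near-unbiased (Lemma-1 flavor)
```

Averaging the per-sample differences against a common baseline keeps each term O(1) even as δ shrinks, which is exactly why the variance decays like 1/T_k rather than blowing up like 1/δ_k².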

IV. A VARIANCE-REDUCTION LEARNING ALGORITHM AND CONVERGENCE ANALYSIS
In view of the properties of OMD introduced in Section II-B, we design a zeroth-order algorithm for merely coherent games by incorporating MPG into OMD, the precision of which can be controlled by adjusting the sample size per iteration. Each player of the group possesses its own local μ_i-strongly convex DGF, denoted by ψ_i. Additionally, the function ψ(x) := Σ_{i∈N} ψ_i(x_i) with x := [x_i]_{i∈N} represents the group DGF, which is μ-strongly convex. The proposed approach is outlined in Algorithm 1.
The Robbins-Siegmund (R-S) theorem serves as a heavy-lifting tool in the field of stochastic optimization for examining the convergence of sequences. Its formal statement is presented as follows.
In the meantime, unlike the typical extra-gradient method, OMD leverages the pseudo-gradient F(X_{k−1/2}) from the last iteration when updating to the leading state X_{k+1/2}. This approximation brings the stochastic error ‖V_{k−1}‖² into the recurrent inequality, which, due to the absence of an averaging effect, does not possess a decaying upper bound and prevents us from applying the R-S theorem. Motivated by the consideration above, our next step is to establish a variant of the R-S theorem by relaxing the condition imposed upon the sequence (ξ_k)_{k∈N_+}. The proofs for the results in this section are reported in Appendix B.

Theorem 1 (A Variant of the R-S Theorem): Let (Ω, F, P) be a probability space equipped with a filtration (F_k)_{k∈N_+}, and let (V_k), (ζ_k), (θ_k), and (ξ_k) be nonnegative adapted sequences s.t. E[V_{k+1} | F_k] ≤ (1 + ζ_k)V_k − θ_k + ξ_k a.s., with Σ_k ζ_k < ∞ a.s. and E[Σ_k ξ_k] < ∞. Then (V_k)_{k∈N_+} converges a.s. to a finite random variable and Σ_k θ_k < ∞ a.s.

Lemma 4 (Standing Inequality): Suppose Assumption 1 holds and the step size τ satisfies (τL/μ)² ≤ 1/12. For each iteration k ≥ 3, a recurrent relation of the above form holds, where the residual function is defined as ε(x) := ‖x − P_{x,X}(−τF(x))‖² and the errors are captured by ‖V_k‖². With these results available, we can establish the following conclusion about the convergence of Algorithm 1 and the sufficient conditions that guarantee it.
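The mechanics of such Robbins-Siegmund-type results can be illustrated on a deterministic instance of the recursion V_{k+1} = (1 + ζ_k)V_k − θ_k + ξ_k with summable ζ_k and ξ_k: the value sequence settles down and the consumed terms θ_k remain summable. The particular sequences below are hypothetical:

```python
import math

# Deterministic instance of the R-S recursion V_{k+1} = (1+z_k)V_k - t_k + e_k
# with z_k, e_k summable; the theorem predicts V_k converges and sum(t_k) < inf.
V, theta_sum = 1.0, 0.0
history = []
for k in range(1, 100001):
    zeta, xi = 1.0 / k**2, 1.0 / k**2    # summable perturbation sequences
    theta = V / k                        # the "progress" consumed from V_k
    V = (1 + zeta) * V - theta + xi
    theta_sum += theta
    history.append(V)

print(abs(history[-1] - history[-1000]) < 1e-5)   # V_k has settled: True
print(math.isfinite(theta_sum))                   # sum of theta_k is finite
```

Note that θ_k here is summable only because V_k itself converges, which mirrors how the theorem is used: summability of the residual terms is a conclusion, not an input.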
Theorem 2: Consider a game G. Suppose that Assumptions 1, 2, and 4 hold. In addition, the sequence of query radii (δ_k)_{k∈N_+} and the sequence of the reciprocals of the sample sizes (1/T_k)_{k∈N_+} are monotonically decreasing and absolutely summable, i.e., Σ_k δ_k < ∞ and Σ_k 1/T_k < ∞. The step size τ satisfies (τL/μ)² ≤ 1/12. Then the base state (X_k)_{k∈N_+} as well as the leading state (X_{k+1/2})_{k∈N_+} converge a.s. to a CP x* of G. Moreover, the actual sequence of play also satisfies lim_{k→∞} X̂_{k+1/2,t} = x* a.s., for arbitrary t.
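To make the overall scheme concrete, the following sketch runs OMD with an MPG-style estimator on a hypothetical two-player game on X_i = [−1, 1] (strongly monotone, hence in particular merely coherent); the payoffs, step size, and schedules δ_k, T_k are all illustrative choices of ours, not the article's experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-player game on X_i = [-1, 1]: J_i(x) = x_i^2 + x_1 * x_2, so the
# pseudo-gradient F(x) = (2x_1 + x_2, 2x_2 + x_1) is strongly monotone and
# the unique CP is x* = (0, 0).
def payoffs(Xp):
    """Realized J_1, J_2 for a batch of jointly perturbed profiles (rows)."""
    cross = Xp[:, 0] * Xp[:, 1]
    return Xp[:, 0] ** 2 + cross, Xp[:, 1] ** 2 + cross

tau, x, g_prev = 0.05, np.array([0.8, -0.6]), np.zeros(2)
for k in range(1, 201):
    delta, T = 0.1 * k ** -1.1, 10 + k ** 2        # summable delta_k and 1/T_k
    x_half = np.clip(x - tau * g_prev, -1, 1)      # look-forward (past grad)
    U = rng.choice([-1.0, 1.0], size=(T + 1, 2))   # directions on S_i in R^1
    Xp = (1 - delta) * x_half + delta * U          # perturbation, p_i=0, r_i=1
    J1, J2 = payoffs(Xp)
    g = np.array([np.sum((J1[1:] - J1[0]) * U[1:, 0]),     # MPG estimates
                  np.sum((J2[1:] - J2[0]) * U[1:, 1])]) / (delta * T)
    x = np.clip(x - tau * g, -1, 1)                # state-updating step
    g_prev = g

print(np.linalg.norm(x))   # small: the play approaches the CP (0, 0)
```

With the Euclidean DGF the prox-mapping is a projected gradient step, which keeps the sketch short; a Bregman version would replace the two `np.clip` updates with the corresponding prox-mapping.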

V. LOCAL CONVERGENCE OF THE BANDIT LEARNING ALGORITHM
This section is dedicated to exploring the scenario in which the mere coherence property does not hold on the whole feasible set X but only on a limited vicinity of certain CPs.
In preparation for further analysis, we postulate the following Lipschitz assumption on the group DGF ψ. As a reminder, the group DGF ψ is defined as the sum of the individual DGFs, i.e., ψ(x) := Σ_{i∈N} ψ_i(x_i).
Assumption 5: The group DGF ψ is L̄-smooth on X, i.e., for arbitrary x_a and x_b in X, ‖∇ψ(x_a) − ∇ψ(x_b)‖ ≤ L̄‖x_a − x_b‖. An equivalent condition is that ∇ψ : R^n → R^n is L̄-Lipschitz continuous on X. As a result, for each player i ∈ N, its DGF ψ_i is L̄_i-smooth with the constant L̄_i ≤ L̄.

Assuming Assumption 3 holds, we can identify a smaller region around the CP set X* as Ũ_ε := {x : D(X*, x) ≤ ε} ⊆ U, where D(X*, x) := inf_{x′∈X*} D(x′, x). For each x* ∈ X*, we also let Ũ_ε(x*) := {x : D(x*, x) ≤ ε} ⊆ Ũ_ε. It is straightforward to verify that Ũ_ε = ∪_{x*∈X*} Ũ_ε(x*). In light of the inclusion Ũ_ε ⊆ U, the coherence inequality ⟨F(x), x − x*⟩ ≥ 0 holds for all x ∈ Ũ_ε(x*) as well. In the forthcoming analysis, we will center around the following two sets that take feasibility into account: U_ε := Ũ_ε ∩ X and U_ε(x*) := Ũ_ε(x*) ∩ X. To facilitate our analysis, for an arbitrary x* ∈ X*, we define the error terms B̂_{k,x*} and V̂_{k,x*} associated with x*; for conciseness, we shall henceforth drop the subscript x* in Ṽ_{k,x*}, V̄_{k,x*}, etc., for notational simplicity. The variables S_k and R_k represent upper bounds for the accumulated errors introduced by the stochastic errors (V_t)_{t=1,...,k}. For k ≥ 3, define the event E^{x*}_k as the event on which these accumulated error bounds S_t and R_t remain below the threshold ε/16 for all t ≤ k. Furthermore, we draw attention to the fact that the values of S_k and R_k depend on x* and that E^{x*}_k is tied to ε, although this dependence is not explicitly captured in the notation. The proofs of this section are reported in Appendix C.
The event E^{x*}_k represents that, up to iteration k, the accumulated error induced by the stochastic error never goes beyond the chosen threshold ε/16. In Lemma 7, we will prove that if E^{x*}_k happens, then the leading state X_{t+1/2} will stay within the region of attraction U_ε(x*) for t = 1, . . ., k + 1.
Lemma 7: Suppose Assumptions 1, 3, and 5 hold, and there exists an x* ∈ X* such that X_1 ∈ U_{ε/2}(x*) ⊆ U_ε. Moreover, suppose τ and the monotonically decreasing sequence (δ_k)_{k∈N_+} are properly selected such that the conditions in (16) are satisfied. Then, on the event E^{x*}_K, the sequence (X_{t+3/2})_{t≤K} will not escape U_ε(x*).
To leverage the conditional invariance of U_ε(x*) with respect to the whole sequence (X_{k+1/2})_{k∈N_+}, we construct the limiting event E^{x*}_∞ := ∩_{k≥3} E^{x*}_k, which imposes a uniform upper bound on the stochastic errors. In Theorem 3, we will prove that the probability measure of the event E^{x*}_∞ can be made arbitrarily close to 1 by letting the sample size sequence (T_k)_{k∈N_+} increase rapidly enough.
Theorem 3: Suppose Assumptions 1, 3, and 5 hold, X_1 ∈ U_{ε/2}, and τ and (δ_k)_{k∈N_+} satisfy the conditions listed in (16). Let p ∈ (0, 1) be an arbitrary but fixed constant. The sequence (T_k)_{k∈N_+} is monotonically increasing and fulfills a summability condition of the form Σ_k (α_V/(ε̃ T_k))(2τ²/μ + · · ·) ≤ p, with ε̃ := ((ε/16 − 1/4) ∨ 1/4) ∧ (ε/16)². Then, for any x* ∈ X* with X_1 ∈ U_{ε/2}(x*), the probability of the event E^{x*}_∞ satisfies P(E^{x*}_∞) ≥ 1 − p. While the occurrence of the event E^{x*}_∞ depends on the particular x* ∈ X* selected, the conditions outlined in Theorem 3 that ensure its probability can be made close to 1 are uniform across X* and do not rely on x*. Likewise, the conditions stated in Lemma 7 that guarantee the invariance of X_{t+3/2} regarding U_ε(x*) do not depend on x*.
Finally, we will show in Theorem 4 that if the random sample ω belongs to the event E^{x*}_∞, the actual sequence of play will locally converge to a CP x ∈ U_ε(x*).
Remark 2: Combining the results of Theorems 3 and 4 yields that, if all the conditions given in these two theorems are fulfilled and (T_k)_{k∈N_+} increases rapidly enough, the generated sequence of play X̂_{k+1/2,t} will converge to a CP with probability no less than 1 − p.

VI. NUMERICAL EXPERIMENTS
In the conference version [32, Sec. V], a rock-paper-scissors (RPS) game and a least-squares estimation game are examined, both of which satisfy global mere coherence. The RPS game leverages the negative entropy as the DGF, and its mirror map reduces to a softmax function; a numerical comparison with [21] is also included there. In this section, we conduct two sets of numerical experiments that only satisfy local mere coherence but not global mere coherence. We note that the scope of these two games is not covered by the results in [10], [15], [21].

A. TWO-PLAYER MINIMAX PROBLEMS
In this subsection, we use two two-player minimax saddle-point problems to illustrate the effectiveness of the proposed method. Similar numerical examples have been previously discussed in [25], [39], and they take the form

minimize_{x^1 ∈ X^1} maximize_{x^2 ∈ X^2}  f(x^1, x^2).

Specifically, we first consider a minimax problem for which x* := [0; 0] is a global CP of the feasible region. For the pseudo-gradient field F underlying this saddle-point problem, the region X under consideration contains both an attracting and a repellent limit cycle, as proved in [25, Prop. 2]. The experimental results are depicted in Fig. 1. The background color displays the value of ⟨F(x), x − x*⟩. The underlying pseudo-gradient field and the attracting limit cycle are graphically presented in Fig. 1(a). In the simulation, we choose the query radius as δ_k = 0.1(k + 10)^{−1.1}. Fig. 1(b) displays the actual sequences of play, with the legends providing a comprehensive account of the parameters selected. Fig. 1 indicates that the appropriate selection of the initial point within the basin of attraction results in a sequence converging towards the CP x*. When the sample count T_k per iteration is insufficient, the estimation error may temporarily or even permanently drive the sequence away from the solution, as evidenced by the red curve. In a similar vein, another example featuring a smaller basin of attraction is considered, whose CP x = [0.1422; 0.2346] we procure via direct numerical computation. In Fig. 2(a), we visualize the underlying pseudo-gradient field and the values of ⟨F(x), x − x*⟩ and highlight the attracting limit cycle with the solid grey curve. In our simulations, we vary T_k and X_0, and the results in Fig. 2(b) indicate that, while increasing T_k decreases the estimation error, proper selection of X_0 remains a crucial factor for achieving convergence to the CP.

B. COGNITIVE RADIO BANDWIDTH ALLOCATION PROBLEM
We consider a cognitive radio bandwidth allocation game in which transmissions occur over single-input single-output (SISO) frequency-selective channels [36], [40]. It is composed of P primary users (PUs) and N secondary users (SUs), with the SUs indexed by N := {1, . . ., N}. Each SU i competes against the others to maximize its own information rate, while simultaneously accounting for the cost incurred, by determining its power allocation vector x_i ∈ X_i ⊆ R^S over the S ∈ N_+ subcarriers. The objective for each SU i ∈ N can be characterized by an information-rate expression in which (σ^i_s)² represents the thermal noise power over the subcarrier s; H^{ij}_s denotes the channel transfer function between the secondary transmitter j and the receiver i; and [x_i]_s represents the s-th entry of the vector x_i, which accounts for the power allocation decision on subcarrier s. Additionally, each SU i must adhere to a set of local constraints, which include a prescribed transmit power and acceptable levels of degradation of the performance of the PUs. The local feasible set of SU i is described by linear constraints with parameters b_i ∈ R_{++}; Q^{pi}_s denotes the channel transfer function between the secondary transmitter i and the primary receiver p over the subcarrier s; and I^{pi}_tot is the maximum interference allowed to be generated by SU i at the PU p over the whole spectrum.
When conducting the experiments, we consider a game with P = 3 PUs, N = 10 SUs, and S = 5 subcarriers. Let τ = 0.01. An interior CP x* is found, and we verify numerically that the symmetric part of the Jacobian of the pseudo-gradient operator at x* is positive definite, which entails that it fulfills Assumption 3. The starting point X_0 is initialized in a proper neighborhood of x*. Four different sets of query radii δ_k and query counts T_k have been chosen for implementation. In Fig. 3(a), a comparison of the relative distances to x* reveals that the convergence rate of the actual sequence of play is positively correlated with the rates of increase in T_k and decrease in δ_k. When T_k remains constant or grows merely sublinearly, the actual sequence of play stays bounded away from x* and fails to converge to it. Fig. 3(b) displays a comparison of the updating step lengths for various choices of parameters, indicating that the curves associated with summable (δ_k)_{k∈N_+} and (1/T_k)_{k∈N_+} exhibit fewer fluctuations and maintain a decreasing trend. The rolling averages with a window size of 100 are depicted by the opaque curves, while the original fluctuations are illustrated by semi-transparent curves.
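For readers who want to reproduce the flavor of this game, a plausible form of the SU information-rate objective, assuming the standard SISO Gaussian interference model used in this class of problems, is sketched below; the channel gains, noise powers, and dimensions are hypothetical, and the article's exact objective additionally includes a cost term:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical instance: N SUs, S subcarriers, random channel gains |H^{ij}_s|^2.
N, S = 3, 5
H2 = rng.uniform(0.1, 1.0, size=(N, N, S))     # squared channel gains
sigma2 = np.full((N, S), 0.1)                   # thermal noise (sigma^i_s)^2
x = rng.uniform(0.0, 1.0, size=(N, S))          # power allocations [x_i]_s

def rate(i, x):
    """Information rate of SU i: sum_s log(1 + SINR_{i,s}), treating the
    other SUs' transmissions as Gaussian interference."""
    interf = sigma2[i] + sum(H2[i, j] * x[j] for j in range(N) if j != i)
    return float(np.sum(np.log(1.0 + H2[i, i] * x[i] / interf)))

r = rate(0, x)
x_more = x.copy(); x_more[0] += 0.1             # SU 0 raises its own power...
print(rate(0, x_more) > r)                      # ...and its rate increases: True
```

Each SU's rate is increasing in its own power and decreasing in the others' powers, which is the coupling that makes the bandwidth allocation problem a genuine non-cooperative game.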

VII. CONCLUSION
In this work, we investigate bandit learning in multi-player continuous games with an emphasis on handling merely coherent cases. A new learning algorithm is proposed by integrating the ideas of optimistic mirror descent and multi-point pseudo-gradient estimation. Under the assumptions posited and the conditions that the sequences of the query radii δ_k and of the reciprocals of the sample sizes T_k are absolutely summable, the actual sequence of play generated by the proposed algorithm is shown to converge a.s. to a CP of the globally merely coherent game. For games featuring only local mere coherence, we establish the convergence of actual sequences of play in some neighborhoods of CPs with high probability. There are several potential directions for future exploration. The first one is relaxing the requirements on the number of samples per iteration T_k, since the superlinear growth of T_k may prevent the application of the proposed algorithm when the bandit feedback is inadequate. Furthermore, when it comes to a large-scale player network, the asynchronicity of updates is a prevalent issue and the cost of synchronization is prohibitive, which is further exacerbated by the multi-point scheme considered. We intend to address these questions in future work.

A. PROOF OF SECTION III
1) PROOF OF LEMMA 1
By the tower property with F̃_k := σ{F_k ∪ σ{u_{k,0}}} ⊇ F_k and the linearity of conditional expectation, together with the fact that X̃_{k+1/2} is F_k-measurable, the stated relation holds a.s. With these results in hand, the norm of the systematic error B^i_k can be reformulated accordingly, and the proof of Lemma 2 in [19] directly carries over.
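For intuition, the standard sphere-smoothing argument behind systematic-error bounds of this type can be sketched as follows; the constants are generic (the exact constants in the paper's Lemma 1 may differ), and we assume each J_i has an L_i-Lipschitz gradient.

```latex
% Bias of a sphere-smoothing gradient estimate (generic sketch).
% With u uniform on the unit sphere S^{d-1}, the smoothed objective
% J_i^\delta(x) := \mathbb{E}_u[J_i(x + \delta u)] satisfies
\[
  \nabla J_i^{\delta}(x)
  \;=\; \frac{d}{\delta}\,\mathbb{E}_{u}\!\big[\, J_i(x + \delta u)\, u \,\big],
  \qquad
  \big\| \nabla J_i^{\delta}(x) - \nabla J_i(x) \big\| \;\le\; L_i\,\delta ,
\]
% so the conditional mean of the bandit estimate deviates from the true
% partial gradient by O(\delta_k), matching a bound of the form
% \|B_k\| \le \alpha_B \delta_k.
```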

2) PROOF OF LEMMA 2
Using the definition of the MPG and the linearity of conditional expectation, we obtain the stated decomposition. For each pair (s, t) with s ≠ t, denote F̃_{k,s} := σ{F̃_k ∪ σ{u_{k,s}}}; the conditional expectation of the inner product can then be reformulated, where (a) follows from the facts that F̃_{k,s} ⊇ F̃_k and (Ĵ^i_{k,s} − Ĵ^i_{k,0}) u^i_{k,s} is F̃_{k,s}-measurable, and (b) and (c) can be deduced by the same arguments as in Lemma 1. Combining the observations above yields the bound on the stochastic error. The difference (Ĵ^i_{k,t} − Ĵ^i_{k,0})² can be further bounded as follows: in (a), we apply the mean value theorem for differentiable functions and let Z denote some convex combination of X̃_{k+1/2,t} and X̃_{k+1/2,0}; for relation (b), we let ∇̄_i := max_{x∈X} ‖∇_x J_i(x)‖ and apply the definition in (9). The claimed estimate can then be directly inferred.
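As a rough, executable counterpart to these bounds, the following single-function sketch implements a residual-style multi-point estimator of the kind analyzed here: averaging T two-point residuals drives the stochastic error down, while the δ-perturbation keeps the bias at O(δ). The function names and the quadratic test objective are assumptions of this sketch, not the paper's code, and the game-theoretic structure (per-player perturbations) is deliberately omitted.

```python
import math
import random

def sphere_sample(dim, rng):
    # Uniform direction on the unit sphere via a normalized Gaussian vector.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def mpg_estimate(J, x, delta, T, rng):
    # Average T residual two-point differences (J(x + delta*u_t) - J(x)) u_t,
    # rescaled by dim/delta; its conditional mean is the gradient of the
    # delta-smoothed objective, hence within O(delta) of the true gradient.
    dim = len(x)
    J0 = J(x)
    g = [0.0] * dim
    for _ in range(T):
        u = sphere_sample(dim, rng)
        diff = J([xi + delta * ui for xi, ui in zip(x, u)]) - J0
        for i in range(dim):
            g[i] += (dim / delta) * diff * u[i] / T
    return g

# Sanity check on J(x) = ||x||^2, whose true gradient at x is 2x.
rng = random.Random(0)
x = [1.0, -0.5]
g = mpg_estimate(lambda z: sum(c * c for c in z), x, delta=0.01, T=20000, rng=rng)
```

With δ = 0.01 and T = 20000 samples, the estimate lands close to the true gradient 2x = (2.0, −1.0), illustrating how a large per-iteration sample count T_k suppresses the stochastic error term bounded in this lemma.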

B. PROOF OF SECTION IV
1) PROOF OF THEOREM 1
Before proceeding, we note that the proof technique leveraged below is adapted from [41, Th. 2.3.5]; we provide a complete proof of a simplified version and fill in some steps omitted in the reference for the completeness of this work. By letting ζ̄_k := Σ_{t=2}^{k} ζ_{t−1} for k ≥ 2 and ζ̄_1 := 0, the recurrent inequality can be rewritten accordingly. Likewise, let ξ̄_k := Σ_{t=2}^{k} ξ_{t−1} for k ≥ 2 and ξ̄_1 := 0, so that 0 ≤ ξ̄_k ↑ ξ̄_∞, and the monotone convergence theorem applies. Integrating this definition into (B.1), we can construct a new recurrent inequality. Based on the observation that ξ̄_∞ − ξ̄_k ≥ 0, we can define Z̄_k, which forms a sequence of nonnegative random variables, and deduce the corresponding supermartingale-type bound. Since the sequence (ζ̄_k)_{k∈N+} is nonnegative, monotonically increasing, and bounded from above, its limit exists a.s., i.e., lim_{k→∞} ζ̄_k = ζ̄_∞ a.s. Moreover, owing to the surrogate relation ζ̄_∞ ≤ Z̄_∞ and E[Z̄_∞] < ∞, we obtain E[ζ̄_∞] < ∞. Therefore, we arrive at the stated conclusion, where we apply the decomposition of the MPG. By appealing to the Cauchy-Schwarz inequality and the L-Lipschitz continuity of F, we can derive the bound, in which the remaining squared term represents the error, which we aim to show is suitably small.
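For reference, the classical Robbins-Siegmund lemma that this argument simplifies can be stated as follows; the symbols are generic and chosen only to parallel the paper's sequences.

```latex
% Robbins--Siegmund lemma (classical form).
% Let (Z_k), (\beta_k), (\xi_k), (\zeta_k) be nonnegative
% \mathcal{F}_k-measurable sequences satisfying
\[
  \mathbb{E}\big[ Z_{k+1} \,\big|\, \mathcal{F}_k \big]
  \;\le\; (1 + \beta_k)\, Z_k + \xi_k - \zeta_k ,
  \qquad
  \sum_{k} \beta_k < \infty, \quad \sum_{k} \xi_k < \infty \ \text{a.s.}
\]
% Then Z_k converges a.s. to a finite random variable and
% \sum_k \zeta_k < \infty a.s.
```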
To facilitate the convergence analysis in the merely coherent scenario, we upper bound −‖X_{k+1/2} − X_k‖² as follows: in (a), ε(x) := ‖x − P_{x,X}(−τF(x))‖² serves as a residual function, and we leverage the 1/μ-Lipschitz continuity of P_{X_k,X} [19, Lemma A.1 iv)]. By the observation that ε(x′) = 0 is equivalent to the zero inclusion 0 ∈ N_X(x′) + τF(x′), we can assert that x′ is a CP of G ⇐⇒ ε(x′) = 0. In light of the upper bound derived above and the choice of step size (τL/μ)² ≤ 1/12, (B.4) can be reformulated with a correspondingly rescaled error term ε̂_{k,2}, and it can be obtained recursively, for all k ≥ 3, that adding ‖X_{k+1/2} − X_{k−1/2}‖² to both sides of (B.5) and substituting the occurrence of ‖X_{k+1/2} − X_{k−1/2}‖² on the R.H.S. with the preceding inequality produces a relation with ε̂_{k,3} := μ/(15L²) · ε_{k,3}. Further manipulating the coefficients of ‖X_{k+1/2} − X_{k−1/2}‖² then gives the claimed inequality for all k ≥ 3.

3) PROOF OF THEOREM 2
Utilizing the fact that x* is a critical point of the game and that Assumption 2 is satisfied, we can infer that, whenever X_{k+1/2} ∈ X, ⟨F(X_{k+1/2}), X_{k+1/2} − x*⟩ ≥ 0. By invoking Lemma 4 and taking the conditional expectation E[· | F_k] of both sides, we can deduce the stated recursion under the posited parameter conditions. The application of Lemma 2 allows us to characterize the squared norm of the stochastic error V_k, and similarly the inner product involving the stochastic error. Synthesizing the aforementioned findings, we can ascertain that Σ_{k≥3} E[ε̂_k] < ∞. The application of Theorem 1 then allows us to assert the following: i) the designated quantity converges a.s.
to some L¹ random variable. These results entail that there exists a sample set Ω̂ ⊆ Ω with P(Ω̂) = 1 such that, for any ω ∈ Ω̂, the statements i)-iv) above hold for the deterministic sequences (X_k(ω))_{k∈N+} and (X_{k+1/2}(ω))_{k∈N+}. Moreover, since (X_k(ω))_{k∈N+} ⊆ X and the map P_{x,X}(−τF(x)) is continuous in x, there exists a subsequence (k_m)_{m∈N+} such that X_{k_m}(ω) → x′ as m → ∞ and lim_{m→∞} ε(X_{k_m}(ω)) = ε(x′) = 0, i.e., x′ is a CP of G. We can then substitute x′ for x* in iv). Since ii) implies that the corresponding term vanishes as k → ∞, we can assert from iv) that D(x′, X_k(ω)) admits a finite limit. In conjunction with Assumption 4, it follows that D(x′, X_{k_m}(ω)) → 0 as m → ∞ and hence D(x′, X_k(ω)) → 0 as k → ∞, i.e., the base states (X_k(ω))_{k∈N+} converge to x′. Combining this result with iii) yields that the leading states (X_{k+1/2}(ω))_{k∈N+} also converge to x′, and the a.s. convergence of the actual sequence of play (X̃_{k+1/2,t}(ω))_{k∈N+} to x′ follows directly from (9) and δ_k → 0.

C. PROOF OF SECTION V
1) PROOF OF LEMMA 5
By employing the definition of the MPG, the stated bound can be attained, where (a) stems from the mean value theorem for differentiable functions, letting Z denote some convex combination of X̃_{k+1/2,t} and X̃_{k+1/2,0}; in (b), we apply the Cauchy-Schwarz inequality and let ∇̄_i := max_{x∈X} ‖∇_x J_i(x)‖.

2) PROOF OF LEMMA 6
By applying the "three-point identity" of the Bregman divergence [35, Sec. 4.1], we can relate X_{k+1/2} to X_k as follows: (a) is the outcome of Assumption 5; (b) can be deduced from the facts that X_{k+1/2} = P_{X_k,X}(−τG_{k−1}) and P_{X_k,X} is 1/μ-Lipschitz continuous; in (c), we employ Lemma 5 and D(p, x) ≥ (μ/2)‖p − x‖². It immediately entails the desired bound on D(x*, X_{k+1/2}).
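For completeness, the "three-point identity" invoked here admits the following standard form, where D(p, x) := h(p) − h(x) − ⟨∇h(x), p − x⟩ is the Bregman divergence of a generator h; this is the textbook identity, which [35, Sec. 4.1] may state with different variable names.

```latex
% Three-point identity of the Bregman divergence: for any p, x, y,
\[
  D(p, x) + D(x, y) - D(p, y)
  \;=\; \big\langle \nabla h(y) - \nabla h(x),\; p - x \big\rangle .
\]
% It follows by expanding each divergence via
% D(a,b) = h(a) - h(b) - \langle \nabla h(b), a - b \rangle
% and cancelling the h-terms.
```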
3) PROOF OF LEMMA 7
We prove this property by induction. For the first iteration, X_{3/2} = X_1 and X_2 = P_{X_1,X}(−τG_1), and the stated inclusion follows; by similar arguments, the same holds for the next iterate. Again using Lemma 6, we have X_{7/2} ∈ U(x*).
To prove the statement, we utilize an inductive argument. For an arbitrary k ∈ {3, 4, . . ., K}, suppose that X_{t+1/2} ∈ U(x*) holds for all 3 ≤ t ≤ k; we aim to show X_{k+3/2} ∈ U(x*). By applying Lemma 4, neglecting the negative terms on the R.H.S., and telescoping across t = 3, . . ., k, we obtain a bound that, combined with the inductive hypothesis, yields the desired relation. We then proceed to upper bound ε̂_k by separating it into the parts associated with the systematic and the stochastic errors, i.e., ε̂_k ≤ ε̂^B_{k,x*} + ε̂^V_{k,x*}. After applying the Cauchy-Schwarz inequality and the triangle inequality, ε̂^B_{k,x*} can be upper bounded accordingly. On account of the postulated summability Σ_{k∈N+} δ_k < ∞, we can choose a proper sequence of query radii such that Σ_{k≥3} ε̄^B_{k,x*} ≤ (1/16)ε. We then move on to examine the stochastic part.

4) PROOF OF THEOREM 3
Under the condition that X_1 ∈ U_{ε/2}, it ensues that {x ∈ X* : D(x, X_1) ≤ ε/2} ≠ ∅, and we can select an arbitrary x* ∈ X* satisfying X_1 ∈ U_{ε/2}(x*). In the subsequent proof, unless otherwise stated, we adopt the shorthand notation Ṽ_k and V̄_k for Ṽ_{k,x*} and V̄_{k,x*}, respectively. With this in hand, we construct a recurrent relation for k ≥ 3, in which we further expand the indicator term, since X_{k+3/2} ∈ U on the event E^{x*}_k, as proved in Lemma 7. Taking the expectation of both sides of (C.1) then gives the stated bound. Using the results above, it can be shown that E[(S_3)² + R_3] ≤ 2τ²α_V/(μT_3) + 5μ/(4L²) · α_V/T_1. By telescoping (C.2), we obtain the desired estimate. Since (E^{x*}_k)_{k≥2} is a contracting sequence of events, E^{x*}_k ↓ E^{x*}_∞; hence ((E^{x*}_k)^c)_{k≥2} is an expanding sequence of events with (E^{x*}_k)^c ↑ (E^{x*}_∞)^c. By the continuity of the probability measure, P((E^{x*}_k)^c) ↑ P((E^{x*}_∞)^c), and choosing a proper (T_k)_{k∈N+} yields P((E^{x*}_∞)^c) ≤ p, and hence P(E^{x*}_∞) ≥ 1 − p.
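The concluding step rests on the elementary continuity of probability measures along monotone sequences of events; in the notation used above:

```latex
% Continuity from above for a contracting (nested decreasing) family:
\[
  E_{2} \supseteq E_{3} \supseteq \cdots, \qquad
  E_{\infty} := \bigcap_{k \ge 2} E_{k}
  \;\Longrightarrow\;
  \mathbb{P}(E_{\infty}) \;=\; \lim_{k \to \infty} \mathbb{P}(E_{k}),
\]
% so a uniform lower bound P(E_k) >= 1 - p for all k passes to the
% limit event E_\infty.
```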

5) PROOF OF THEOREM 4
We start by fixing an arbitrary x* ∈ X* such that X_1 ∈ U_{ε/2}(x*). Applying the standing inequality from Lemma 4 with respect to an x ∈ X* that may differ from x*, and taking the indicator function into account, we note that on the event E^{x*}_{k−1} we immediately have X_{k+1/2} ∈ U(x*) ⊆ U and ⟨F(X_{k+1/2}), X_{k+1/2} − x⟩ ≥ 0, regardless of the specific choice of x*. By applying the extended version of the R-S theorem (Theorem 1), we arrive at the following claims: i)-iv), with the relevant quantity converging a.s. to some L¹ random variable. These results entail that there exists a sample set Ω̂ ⊆ Ω with P(Ω̂) = 1 such that, for any ω ∈ Ω̂ ∩ E^{x*}_∞, the statements i)-iv) hold for the deterministic sequences (X_k(ω))_{k∈N+} ⊆ U_{7ε/8}(x*) and (X_{k+1/2}(ω))_{k∈N+} ⊆ U(x*), and all the indicator functions admit the constant value 1.
denotes their vertical stack. For a vector v and a positive integer i, [v]_i denotes the i-th entry of v. We let ‖·‖ denote the ℓ₂-norm and ⟨·, ·⟩ represent the canonical inner product. Let cl(S) denote the closure of a set S, int(S) its interior, and ∂S its boundary. The symbols a ∧ b and a ∨ b stand for the lesser and the greater of two real numbers a and b, respectively.
a.s. for every i ∈ N. Moreover, the systematic error B_k := [B^i_k]_{i∈N} possesses the decaying upper bound ‖B_k‖ ≤ α_B δ_k for some positive constant α_B.

FIGURE 1. The pseudo-gradient field F of (19) and the actual sequences of play X̃_{k+1/2} generated by Algorithm 1.

FIGURE 2. The pseudo-gradient field F of (20) and the actual sequences of play X̃_{k+1/2} generated by Algorithm 1.

2) PROOF OF LEMMA 4
By applying the standing recurrent inequality of OMD [19, Lemma A.2], [22, Prop. B.3] and letting x* denote one CP of G, we can obtain the following relation for the k-th iteration: