Load Balancing Multi-Player MAB Approaches for RIS-Aided mmWave User Association

In this paper, multiple reconfigurable intelligent surface (RIS) boards are deployed to enhance millimeter wave (mmWave) communication in a harsh blockage environment, where the mmWave line-of-sight (LoS) link is completely blocked. Herein, RIS-user association should be designed to maximize the users’ achievable data rates while ensuring load balance among the installed RIS panels. However, maximum received power (MRP) based RIS-user association overloads some of the RIS boards while leaving others lightly loaded, which causes RIS load imbalance and decreases the users’ achievable data rates. Instead, in this paper, an online learning methodology using centralized multi-player multi-armed bandit (MP-MAB) with arms’ load balancing is proposed. In this formulation, the mmWave users, the RIS boards, and the users’ achievable rates act as the players, arms, and rewards of the bandit game, respectively. During the MAB game, the users learn to avoid associating with heavily loaded RIS boards, which maximizes their achievable data rates and balances the RIS loads. Three centralized MP-MAB algorithms with arms’ load balancing are proposed from the family of upper confidence bound (UCB) MAB algorithms. These algorithms are UCB1, Kullback-Leibler divergence UCB (KLUCB), and Minimax optimal stochastic strategy (MOSS) with arms’ load balancing, i.e., UCB1-LB, KLUCB-LB, and MOSS-LB. Numerical analysis confirms the superior performance of the proposed algorithms over MRP-based RIS-user association and other benchmarks.


I. INTRODUCTION
Reconfigurable intelligent surface (RIS) is a promising enabler for future sixth-generation (6G) wireless communications due to its ability to smartly configure the wireless communication environment [1]. An RIS consists of a large number of passive antenna elements capable of intelligently controlling the wireless communication channel utilizing phase shifters (PSs). This is done by smartly controlling electromagnetic (EM) wave propagation, either strengthening or weakening it in specific directions based on the application at hand, without any need for the complicated radio frequency (RF) chains of conventional communication systems [1]. This cheap and easy-to-install technology has attracted researchers and practitioners to investigate RIS-aided communication systems and apply them in various wireless communication applications [2], [3], [4], [5], [6], [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Bilal Khawaja.
Due to the small coverage of mmWave transmission, especially in harsh blockage environments, many RIS boards need to be deployed within the mmWave base station (BS) coverage area to support mmWave users. Thus, RIS-user association becomes a critical and challenging problem, as each mmWave user should be associated with the RIS board that maximizes its achievable data rate. At the same time, load balance should be maintained among the deployed RIS boards. This problem is computationally intractable because the number of candidate RIS-user association patterns grows exponentially with the number of RIS boards and mmWave users. The RIS load balancing constraint further increases the complexity of the problem. Furthermore, applying the conventional maximum received power (MRP) based RIS-user association overloads some RIS boards, especially those located in the vicinity of numerous users. Consequently, RIS load imbalance occurs, and the users' rates decrease because the RIS resources are shared among many associated users. In this paper, we focus on the mmWave RIS-user association issue, assuming that the transmit (TX)/receive (RX) beams between the BS and the RIS, and between the RIS and the UE, can be optimally adjusted using exhaustive search or any of the advanced beam training (BT) techniques existing in the literature, such as those given in [6], [11], and [12].
In this paper, online learning is adopted as an efficient tool to address the mmWave RIS-user association problem. In this context, the problem is handled using centralized multi-player multi-armed bandit (MP-MAB) [28] with arms' load balancing. MAB is an efficient online learning methodology, where the player aims to maximize its long-term reward by playing over the arms of the bandit game [29], [30], [31]. In this formulation, the users, acting as the players of the MP-MAB game, try to associate with lightly loaded RIS boards, acting as the arms, to increase their long-term achievable data rates, acting as the rewards. The information about the RISs' loads is distributed among the mmWave users by the mmWave BS using control channels. Thus, users play the game concurrently while avoiding collisions, i.e., associating with heavily loaded RIS boards. This is done by knowing the current RISs' loads based on the current RIS-user association pattern. Accordingly, users learn, in a time-by-time fashion, the RIS-user association pattern that maximizes their attainable rates while maintaining load balance among RIS boards with negligible computational complexity. The main contributions of this paper can be summarized as follows:
• The problem of mmWave RIS-user association in multiple RIS multi-user scenarios is considered to maximize users' achievable data rates while maintaining load balance among the deployed RIS boards.
• This optimization problem is reformulated as a sequential centralized MP-MAB game with arms' load balancing, where the users, the RIS boards, and the achievable users' data rates act as the players, the arms, and the profit of the bandit game, respectively. During the game, the users learn how to avoid associating with the heavily loaded RIS boards, increasing their achievable data rates while assuring load balance among RIS boards.
• Three centralized MP-MAB algorithms with arms' load balancing, coming from the family of upper confidence bound (UCB) [32], namely UCB1-LB, Kullback-Leibler UCB-LB (KLUCB-LB), and minimax optimal stochastic strategy-LB (MOSS-LB), are proposed to address the formulated bandit game and to compare their performances. These algorithms are modified versions of the famous UCB1 [32], KLUCB [33], and MOSS [34] MAB algorithms, where a normalization factor indicating the current RIS loads is added to the typical UCB1, KLUCB, and MOSS equations. Thus, the players, i.e., mmWave users, can jointly consider the current RIS loads when selecting their associating RIS boards.
• Extensive numerical analysis is conducted to demonstrate the effectiveness of the proposed UCB1-LB, KLUCB-LB, and MOSS-LB algorithms over their naïve counterparts, i.e., UCB1, KLUCB, and MOSS without load balancing. Their effectiveness over the conventional MRP and random-based user association is also demonstrated. The obtained results reveal the superior performance of MOSS-LB over the other proposed schemes.
The remainder of this paper is organized as follows. Section II summarizes the literature review. Section III details the proposed mmWave multi-RIS multi-user scenario, including the mmWave RIS channel model and the RIS-user association optimization problem. Section IV provides the proposed UCB1-LB, KLUCB-LB, and MOSS-LB MP-MAB algorithms. Finally, Section V provides performance evaluation results of our envisioned MAB schemes, followed by the concluding remarks in Section VI.

II. LITERATURE REVIEW
The evolution of EM materials has improved the attractiveness of the prospective RIS concept for addressing future wireless network difficulties [1]. RISs not only handle half-wavelength antenna configuration challenges but also provide cost and energy savings. Furthermore, RISs can redistribute EM waves to reduce power usage considerably. Additionally, advanced wave propagation settings rely on RISs to deliver extensive, energy-efficient, and continuous wireless connectivity [4]. In communication networks, the RIS modulator and the RIS relay are two key technologies. Phase/amplitude modulations employing RIS were addressed in [7], where tuning the RIS reflection coefficients modulates the signals received from the antenna. In RIS relaying, the power transmitted from the BS to the RIS board is reflected toward the mobile user via intelligent tuning of the PSs [12], [17].
VOLUME 11, 2023
Few studies have investigated the influence of RIS adoption on mmWave networks. The authors of this paper proposed in [12] and [35] a two-stage MAB-based methodology for mmWave RIS networks to find the optimum link that maximizes the achievable spectral efficiency. Furthermore, they deployed a MAB-aided RIS-mounted aerial in [5] to improve the mmWave connectivity for user equipments (UEs) in hotspot regions. The work in [6] investigated a dual approach of the BS's hybrid precoders and the RIS's passive precoder to optimize the spectral efficiency in an RIS-empowered mmWave MIMO configuration via managing the various timing channel state information (CSI). The work in [13] proposed a tractable probabilistic framework for the coverage analysis of RIS-enhanced mmWave systems. In [14], a federated learning (FL) approach with mmWave-enabled RIS networks was presented. In [15], the authors examined an efficient cascaded channel estimation formulation for RIS-assisted mmWave MIMO systems, whereas in [16], atomic norm minimization was employed for RIS-aided mmWave channel estimation. The work in [17] dealt with the hybrid precoding (HP) method for multi-user RIS-enabled mmWave systems. In addition, a deep learning-based methodology was introduced in [18] to obtain the best possible transmission rate for RIS systems. The authors of [19] explored beam management of RIS-empowered mmWave communication and provided a machine learning (ML)-based approach. Static and adaptive mixed relay RIS topologies were studied in [20] and were shown to outperform traditional benchmark RIS architectures. The authors of [21] provided precise signal-to-noise power ratio (SNR) expressions for RIS-empowered mmWave amplify-and-forward (AF) relay networks. They developed the AF relay's optimal power allocation approach to acquire the ideal PSs for optimizing the end-to-end SNR.
The authors of [22] employed sequential fractional programming (SFP) and forward-reverse auction (FRA) techniques to handle the joint optimization of passive beamforming, power allocation, and user association in RIS-assisted mmWave systems. Furthermore, the research in [23] examined RIS-aided mmWave BT configurations to determine the most suitable beams and reflection coefficients.
Recently, some research works have studied the problem of RIS-user association, such as those given in [24], [25], [26], and [27]. In [24], the authors performed signal-to-interference-plus-noise ratio (SINR) analysis for an RIS-distributed multi-input single-output (MISO) system under a particular RIS-user association pattern. However, the authors considered neither the problem of finding the optimal RIS-user association pattern nor mmWave communication. The authors of [25] investigated the issue of joint beamforming and user association in RIS-assisted mmWave communication. However, they considered the case of one RIS board and did not address the optimization of multi-RIS multi-user scenarios as in the current work. In [26], the BS-RIS-user association was considered in downlink communications. Still, the authors simplified the problem by considering one user associated with one RIS and one BS, without considering load balance among the RIS boards or mmWave communications. In [27], RIS-aided BS-user association was investigated, where one RIS board is deployed in a multi-BS multi-user scenario, and the BS-user association is addressed utilizing the RIS reflected paths. Joint active/passive beamforming was also investigated during the BS-user association optimization. However, the authors considered neither a multi-RIS multi-user scenario nor mmWave communications. Also, they used an exhaustive search algorithm to find the optimal BS-user association, which requires high-performance computational resources, especially for large numbers of BSs and users. To the best of our knowledge, there is no work in the literature considering the problem of RIS-user association in multi-RIS multi-user scenarios using mmWave communications.
Figure 1 shows the proposed system model of mmWave RIS-user association. In this figure, Q RIS boards, e.g., Q = 4, are deployed in the mmWave BS coverage area to support mmWave communication to K UEs, where the direct LoS links between the mmWave BS and the UEs are entirely blocked by the existing blockers. The mmWave BS controls the PSs of the distributed RIS boards using dedicated control links. In this scenario, each UE should associate with one of the deployed RIS boards, which provides virtual LoS communication by routing around the blocker. The scenario where a UE is connected to multiple RISs is left for our future investigations. The selected RIS board should maximize the UE's achievable data rate by considering both the received power and the RIS load. The traditional MRP-based RIS-user association approaches may result in RIS load imbalance, as shown in Fig. 1, where RIS boards 2 and 4 are heavily loaded while RIS boards 1 and 3 are lightly loaded. This reduces the achievable data rates of the users associated with the heavily loaded RISs due to resource sharing. Thus, an efficient RIS-user association algorithm is needed to realize the mmWave multi-RIS multi-user scenario efficiently.

A. mmWave BS-RIS-UE CHANNEL MODEL
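In this channel model, the mmWave BS is equipped with a uniform linear array (ULA) of $M$ antenna elements, RIS board $q$ is equipped with a uniform planar array (UPA) of $N_q$ antenna elements, and each UE is assumed to have a single antenna. The mmWave BS controls the PSs of the RIS antenna elements through the dedicated control channel. Thus, the received signal at UE $k$ from the mmWave BS through RIS board $q$ can be expressed as:

$$y_k = \sqrt{P}\, \mathbf{h}_{qk}^{H} \boldsymbol{\Phi}_q^{j} \mathbf{H}_{Bq} \mathbf{f}_i\, s + n_k, \tag{1}$$

where $P$ is the TX power, $s$ is the TX symbol, $y_k$ is the received (RX) symbol at UE $k$, and $n_k \sim \mathcal{CN}(0, \sigma_0^2)$ is the complex additive white Gaussian noise (AWGN) at UE $k$. For simplicity, in (1) we neglect the interference among UEs, as mmWave RIS communication is highly directional with negligible interference among users [8]. $\mathbf{f}_i \in \mathbb{C}^{M \times 1}$ is the precoding vector containing the PS vector at the BS, $1 \le i \le |\mathcal{F}|$, where $\mathcal{F}$ is the space of all available precoding vectors at the BS. $\boldsymbol{\Phi}_q^{j} \in \mathbb{C}^{N_q \times N_q}$ is the diagonal matrix containing the PS vector of RIS board $q$ on its diagonal, $1 \le j \le |\Phi_q|$, where $\Phi_q$ is the space of all available PS matrices of RIS $q$. $\mathbf{H}_{Bq} \in \mathbb{C}^{N_q \times M}$ is the channel matrix between the BS and RIS board $q$. $\mathbf{h}_{qk} \in \mathbb{C}^{N_q \times 1}$ is the channel vector between RIS board $q$ and UE $k$, where $(\cdot)^{H}$ denotes the Hermitian transpose. Assuming the limited-scatterer mmWave channel model given in [12] and [36], $\mathbf{H}_{Bq}$ and $\mathbf{h}_{qk}^{H}$ can be expressed as:

$$\mathbf{H}_{Bq} = \sqrt{\frac{M N_q}{L_{Bq}}} \sum_{l=1}^{L_{Bq}} a_l\, \mathbf{V}_q\!\left(\theta_l^{\mathrm{AoA}}, \varphi_l^{\mathrm{AoA}}\right) \mathbf{g}_B^{H}\!\left(\vartheta_l^{\mathrm{AoD}}\right), \qquad \mathbf{h}_{qk}^{H} = \sqrt{\frac{N_q}{L_{qk}}} \sum_{l=1}^{L_{qk}} b_l\, \mathbf{V}_q^{H}\!\left(\theta_l^{\mathrm{AoD}}, \varphi_l^{\mathrm{AoD}}\right), \tag{2}$$

where $L_{Bq}$ indicates the number of channel paths between the BS and RIS board $q$, while $L_{qk}$ indicates that between RIS board $q$ and UE $k$. $a_l$ and $b_l$ represent the large-scale fading coefficients, which follow complex Gaussian distributions determined by the path loss, as given in [36]. Herein, $d$ is the separation distance between TX and RX, $\rho_l(d_0)$ represents the path loss at a reference distance $d_0 = 1$ m, and $\alpha$ is the path loss exponent. $\chi_l \sim \mathcal{CN}(0, \sigma_{\chi_l}^2)$ is the shadowing term of path $l$. In (2), $\mathbf{V}_q(\theta, \varphi)$ and $\mathbf{g}_B(\vartheta)$ are the array response vectors of RIS board $q$ and the BS, respectively, $\theta_l^{\mathrm{AoA}}$ and $\varphi_l^{\mathrm{AoA}}$ are the azimuth and elevation angles of arrival (AoA) at RIS board $q$, $\vartheta_l^{\mathrm{AoD}}$ is the angle of departure (AoD) at the BS, and $\theta_l^{\mathrm{AoD}}$ and $\varphi_l^{\mathrm{AoD}}$ are the azimuth and elevation AoD at RIS board $q$.
Generally, $\mathbf{V}_q(\theta, \varphi)$ is expressed as [12]:

$$\mathbf{V}_q(\theta, \varphi) = \frac{1}{\sqrt{N_q}} \left[ 1, \ldots, e^{j \frac{2\pi}{\lambda} r \left( c \sin\theta \sin\varphi + s \cos\varphi \right)}, \ldots, e^{j \frac{2\pi}{\lambda} r \left( (\sqrt{N_q} - 1) \sin\theta \sin\varphi + (\sqrt{N_q} - 1) \cos\varphi \right)} \right]^{T},$$

where $0 \le c, s \le \sqrt{N_q} - 1$, $r$ is the antenna spacing, and $\lambda$ is the wavelength. Likewise, $\mathbf{g}_B(\vartheta)$ is expressed as [12]:

$$\mathbf{g}_B(\vartheta) = \frac{1}{\sqrt{M}} \left[ 1, \ldots, e^{j \frac{2\pi}{\lambda} r\, m \sin\vartheta}, \ldots, e^{j \frac{2\pi}{\lambda} r (M - 1) \sin\vartheta} \right]^{T},$$

where $0 \le m \le M - 1$.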
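As a rough illustration of the array response vectors used in the channel model, the following Python sketch builds unit-norm ULA and square-UPA responses. It assumes the standard textbook forms with half-wavelength spacing by default; the function names, the `spacing_over_wavelength` parameter, and the row-major element ordering are assumptions made for this example and are not taken from [12].

```python
import cmath
import math


def ula_response(num_elements, angle, spacing_over_wavelength=0.5):
    """Unit-norm ULA response for a given angle in radians (assumed standard form)."""
    phase = 2.0 * math.pi * spacing_over_wavelength * math.sin(angle)
    scale = 1.0 / math.sqrt(num_elements)
    return [scale * cmath.exp(1j * phase * m) for m in range(num_elements)]


def upa_response(num_elements, azimuth, elevation, spacing_over_wavelength=0.5):
    """Unit-norm response of a square UPA with sqrt(N) x sqrt(N) elements."""
    side = math.isqrt(num_elements)
    if side * side != num_elements:
        raise ValueError("num_elements must be a perfect square")
    k = 2.0 * math.pi * spacing_over_wavelength
    scale = 1.0 / math.sqrt(num_elements)
    return [
        scale * cmath.exp(1j * k * (c * math.sin(azimuth) * math.sin(elevation)
                                    + s * math.cos(elevation)))
        for c in range(side)
        for s in range(side)
    ]


def beamforming_gain(weights, response):
    """Squared magnitude of the inner product between a beam and a response."""
    return abs(sum(w.conjugate() * a for w, a in zip(weights, response))) ** 2


if __name__ == "__main__":
    g = ula_response(8, 0.3)
    print("matched gain:", beamforming_gain(g, g))
    print("mismatched gain:", beamforming_gain(g, ula_response(8, 1.0)))
```

A beam matched to the true angle gives a gain close to one, while a mismatched beam gives a much smaller gain, which is why the beams are assumed to be optimally adjusted before the association problem is addressed.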

B. RIS-USER ASSOCIATION OPTIMIZATION PROBLEM
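Assuming optimal beamforming between the BS, RIS, and UE, the RIS-user association aims to find the optimal RIS-user association pattern, which maximizes the sum rate of the users while maintaining load balance among the deployed RIS boards. This optimization problem can be expressed mathematically as:

$$\mathbf{I}_{QK}^{*} = \arg\max_{\mathbf{I}_{QK} \in \mathcal{I}_{QK}} \sum_{q=1}^{Q} \sum_{k=1}^{K} \frac{I_{qk}}{\sum_{k'=1}^{K} I_{qk'}}\, W \psi_{qk}, \tag{6a}$$

subject to

$$I_{qk} \in \{0, 1\} \quad \forall q, k, \tag{6b}$$
$$\sum_{q=1}^{Q} I_{qk} = 1 \quad \forall k, \tag{6c}$$
$$\sum_{k=1}^{K} I_{qk} \le K \quad \forall q, \tag{6d}$$
$$\frac{1}{Q} \le \frac{\left( \sum_{q=1}^{Q} \sum_{k=1}^{K} I_{qk} \right)^{2}}{Q \sum_{q=1}^{Q} \left( \sum_{k=1}^{K} I_{qk} \right)^{2}} \le 1, \tag{6e}$$

where $W$ is the available bandwidth and $\psi_{qk}$ is the achievable spectral efficiency in bps/Hz of UE $k$ when associated with RIS board $q$, which can be expressed as:

$$\psi_{qk} = \log_2\!\left( 1 + \frac{P \left| \mathbf{h}_{qk}^{H} \boldsymbol{\Phi}_q^{j} \mathbf{H}_{Bq} \mathbf{f}_i \right|^{2}}{\sigma_0^{2}} \right).$$

In (6), $\mathbf{I}_{QK} \in \mathcal{I}_{QK}$ is the $Q \times K$ association matrix containing the association pattern between RIS boards and UEs, where $\mathcal{I}_{QK} = \{0, 1\}^{Q \times K}$ is the space of all association matrices. If user $k$ is associated with RIS board $q$, the $(q, k)$ element of $\mathbf{I}_{QK}$ is set to 1, and to 0 otherwise. In (6a), the term $1 / \sum_{k'=1}^{K} I_{qk'}$ indicates that the resources of RIS board $q$ are equally shared among the UEs associated with it. The second constraint in (6c), i.e., $\sum_{q=1}^{Q} I_{qk} = 1\ \forall k$, means that UE $k$ should be associated with one RIS board only, while the third constraint in (6d), i.e., $\sum_{k=1}^{K} I_{qk} \le K\ \forall q$, means that up to $K$ UEs can be associated with one RIS board $q$. The fourth constraint in (6e) implies that Jain's fairness index of the RIS loads should be bounded between $1/Q$ and 1, which means that load balance should be maintained among the deployed RIS boards.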
The optimization problem given in (6) is a non-linear integer programming (NIP) problem, or more specifically a non-linear 0-1 programming problem. This is because the decision variables are restricted to be 0 or 1, in addition to the non-linear constraint given in (6e). As stated in [37] and [38], these kinds of problems are categorized as NP-hard problems. If we exhaustively search all possible solutions of (6), we need to search over $Q^K$ different $\mathbf{I}_{QK}$ matrices, i.e., the number of candidate association matrices grows rapidly with $Q$ and exponentially with $K$. Moreover, the non-linear constraint given in (6e) further complicates the exhaustive search, as the selected association matrix $\mathbf{I}_{QK}^{*}$ should not only maximize the achievable data rates of the UEs but also maintain load fairness among the deployed RIS boards. To the best of our knowledge, there is no efficient algorithm for this problem, as explained in [37] and [38] and their associated references.
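To make the size of the search space concrete, the following Python sketch enumerates all $Q^K$ association patterns for a toy instance, evaluates the equally shared sum rate, and reports Jain's fairness index of the resulting RIS loads. The spectral efficiency values, the function names, and the choice of measuring the load of an RIS board by its number of associated users are assumptions made for this illustration only.

```python
from itertools import product


def jain_index(loads):
    """Jain's fairness index of a list of non-negative loads."""
    total = sum(loads)
    squares = sum(x * x for x in loads)
    return (total * total) / (len(loads) * squares) if squares else 0.0


def ris_loads(assignment, num_ris):
    """Number of users associated with each RIS board."""
    loads = [0] * num_ris
    for q in assignment:
        loads[q] += 1
    return loads


def sum_rate(assignment, psi, bandwidth=1.0):
    """Sum rate when each RIS board is shared equally among its users.

    assignment[k] is the RIS index of user k; psi[q][k] is the spectral
    efficiency of user k through RIS board q.
    """
    loads = ris_loads(assignment, len(psi))
    return sum(bandwidth * psi[q][k] / loads[q] for k, q in enumerate(assignment))


def exhaustive_search(psi, bandwidth=1.0):
    """Brute-force search over all Q**K patterns (feasible only for tiny Q, K)."""
    num_ris, num_users = len(psi), len(psi[0])
    best_rate, best_assignment, visited = None, None, 0
    for assignment in product(range(num_ris), repeat=num_users):
        visited += 1
        rate = sum_rate(assignment, psi, bandwidth)
        if best_rate is None or rate > best_rate:
            best_rate, best_assignment = rate, assignment
    return best_assignment, best_rate, visited


if __name__ == "__main__":
    psi = [[4.0, 4.0, 4.0], [3.0, 3.0, 3.0]]  # hypothetical values in bps/Hz
    print(exhaustive_search(psi))
    print(jain_index(ris_loads((0, 0, 0), 2)), jain_index(ris_loads((0, 0, 1), 2)))
```

In this toy instance, every user sees the highest spectral efficiency through RIS board 0, so a greedy per-user choice places all users on it and yields a lower sum rate than a split pattern, which mirrors the load imbalance of MRP-based association.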

IV. PROPOSED MP-MAB WITH ARMS' LOAD BALANCING
In this section, we reformulate the RIS-user association problem as a time sequential optimization problem that maximizes the long-term users' achievable data rates while considering RIS load fairness. An online learning approach is then adopted to address the problem by modeling it as an MP-MAB game with arms' load balancing. We first briefly explain the MAB principle. Then, we reformulate (6) as a time sequential optimization problem. Finally, we detail the proposed MP-MAB algorithms with arms' load balancing, i.e., UCB1-LB, KLUCB-LB, and MOSS-LB, which implement the proposed bandit game.

A. MAB PRINCIPLE
MAB is an efficient self-learning methodology where a player aims to increase its long-term profit while playing over the arms of the bandit. The only information available to the player is the observed arms' rewards. Throughout the bandit game, the player compromises between consistently exploiting the arm with the highest average reward/payoff and exploring new ones, which is known as the exploitation-exploration tradeoff of MAB games [39]. Bandits can be categorized as stochastic, when the rewards are independent and identically distributed (i.i.d.), or adversarial, when they come from unknown distributions. Also, based on the number of players involved in the MAB game, it can be a single-player MAB (SP-MAB) or a multi-player MAB (MP-MAB). In the former category, only one player plays over the arms of the bandit, while in the latter category, multiple players play simultaneously over the arms of the bandit. In MP-MAB, collisions happen when several players simultaneously select the same arm. Based on the collision model, the arm's reward may be shared among the collided players, or none of the collided players gains a reward. Also, based on the information distributed among the players, the MP-MAB game can be classified as centralized or decentralized. In centralized MP-MAB, the players are permitted to exchange their individual previous reward observations, which reduces the collisions as much as possible. However, in decentralized MP-MAB, no information is exchanged among the players, and every player acts selfishly, which results in frequent collisions among the players. In this case, each player tries to learn the collision patterns and to avoid them in order to maximize its profit while playing the game. In this paper, when a collision happens, i.e., many users associate with the same RIS board, they share the RIS resources equally.
As the BS in the RIS-user association problem knows the load on every RIS board through the dedicated control link between them, BS can easily share the previous RISs' load observations among the users. Thus, centralized MP-MAB can be implemented to avoid collisions among the users when selecting their associated RIS boards, which results in maximizing the users' achievable data rate and maintaining load balance among the deployed RIS boards. Hence, it can fulfill the targets of the optimization problem given in (6).

B. RIS-USER ASSOCIATION OPTIMIZATION PROBLEM RE-FORMULATION
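Applying the MAB hypothesis to (6) reformulates it as a time sequential optimization problem whose long-term objective is maximized sequentially over the time horizon. Thus, (6) can be re-written as follows:

$$\mathbf{I}_{QK}^{*} = \arg\max_{\mathbf{I}_{QK,t} \in \mathcal{I}_{QK}} \sum_{t=1}^{T_H} \sum_{q=1}^{Q} \sum_{k=1}^{K} \frac{I_{qk,t}}{\sum_{k'=1}^{K} I_{qk',t}}\, W \psi_{qk,t}, \tag{7a}$$

subject to

$$I_{qk,t} \in \{0, 1\} \quad \forall q, k, t, \tag{7b}$$
$$\sum_{q=1}^{Q} I_{qk,t} = 1 \quad \forall k, t, \tag{7c}$$
$$\sum_{k=1}^{K} I_{qk,t} \le K \quad \forall q, t, \tag{7d}$$
$$\frac{1}{Q} \le \frac{\left( \sum_{q=1}^{Q} \sum_{k=1}^{K} I_{qk,t} \right)^{2}}{Q \sum_{q=1}^{Q} \left( \sum_{k=1}^{K} I_{qk,t} \right)^{2}} \le 1 \quad \forall t, \tag{7e}$$
$$1 \le t \le T_H, \tag{7f}$$

where

$$\psi_{qk,t} = \log_2\!\left( 1 + \frac{P \left| \mathbf{h}_{qk,t}^{H} \boldsymbol{\Phi}_{q,t}^{j} \mathbf{H}_{Bq,t} \mathbf{f}_{i,t} \right|^{2}}{\sigma_0^{2}} \right). \tag{7g}$$

In (7), $T_H \in \mathbb{Z}^{+}$ is the time horizon and $\mathbb{Z}^{+}$ is the set of all positive integers. $\mathbf{I}_{QK,t}$ is the selected RIS-user association matrix at time $t$, where $I_{qk,t}$ is set to 1 if user $k$ is associated with RIS board $q$ at time $t$, and to 0 otherwise. $\psi_{qk,t}$ given in (7g) stands for the achievable spectral efficiency of user $k$ when associated with RIS board $q$ at time $t$. In (7g), $\mathbf{H}_{Bq,t}$ and $\mathbf{h}_{qk,t}$ refer to the channel matrix between the BS and RIS board $q$ and the channel vector between RIS board $q$ and user $k$ at time $t$, respectively. Similarly, $\boldsymbol{\Phi}_{q,t}^{j}$ and $\mathbf{f}_{i,t}$ represent the diagonal PS matrix of RIS board $q$ and the BS PS vector at time $t$, respectively. In (7), the optimal RIS-user pattern $\mathbf{I}_{QK}^{*}$ is obtained by maximizing the long-term achievable user sum rate over the time horizon in a time-by-time fashion. Also, the term $1 / \sum_{k'=1}^{K} I_{qk',t}$ indicates that the resources of RIS board $q$ are equally shared among the UEs associated with it at time $t$.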

C. PROPOSED MP-MAB ALGORITHMS WITH ARMS' LOAD BALANCING
The time sequential optimization problem given in (7) can be considered a centralized MP-MAB game in which the users/players intend to maximize their long-term achievable data rates over time when selecting their associated RIS boards while avoiding collisions with the other users. The strength of online learning schemes like MAB over conventional solutions like brute force in the user association problem comes from their ability to address the problem in a distributed manner, unlike the conventional, highly complex centralized schemes. These centralized methods are not suitable for massive mmWave RIS systems since they require an excessive amount of information and incur large computational and signaling costs. In other words, they need to gather all network information and then decide which mmWave RIS-user association pattern maximizes the total system rate while maintaining load balance among RIS boards. Instead, by means of online learning, the RIS-user association is done by the users themselves in a time-by-time fashion, based only on their previous observations of achievable data rates and RIS loads. Thus, they learn by themselves which lightly loaded RIS boards maximize their achievable data rates, without complicated centralized management/control from the network side. Although we consider a static environment in this paper, the proposed MAB technique can track network dynamics without collecting the whole network information every time to reconsider the whole RIS-user association, which is left for our future investigations. Recently, the application of MAB schemes to conventional BS-user association without RIS, both with and without load balancing, has gained a lot of attention, as given in [37] and [38]. In these papers, the authors emphasized the strength of online MAB schemes over conventional centralized approaches in addressing the BS-user association problem, which motivates us to apply them to the mmWave RIS-user association problem with load balancing.

Algorithm 1 UCB1-LB-Based mmWave RIS-User Association
Input: $\Phi_q$, $\forall q \in \mathcal{Q}$, $\mathcal{F}$, $T_H$.
Output: $q^{*}_{k,t}$
Initialization: Each RIS board $q$ is selected once, and the corresponding $\psi_{qk,t}$ is observed, $1 \le t \le |\mathcal{Q}|$.
1: for $t = |\mathcal{Q}| + 1 : T_H$ do
2:   Associate with RIS board $q^{*}_{k,t}$ selected using (8) and obtain the reward $\psi_{q^{*}k,t}$.
3:   Update $\beta_{q^{*}k,t}$ and $\bar{\psi}_{q^{*}k,t}$ using (10) and (11).
4: end for
Herein, we propose three UCB-based MP-MAB algorithms with arms' load balancing, namely UCB1-LB, KLUCB-LB, and MOSS-LB, each implemented by every user. We aim to compare their performances when applied to the mmWave RIS-user association problem. Generally, UCB is one of the best-known MAB algorithm families [32]; it selects the arm with the highest upper confidence bound, which balances exploiting the arm with the highest average reward against exploring the less selected ones. UCB1 was the first and simplest algorithm of the UCB family [32]; KLUCB [33] and MOSS [34] were later proposed as two more efficient UCB variants with lower regret bounds than UCB1.

1) PROPOSED UCB1-LB ALGORITHM
UCB1 [32] is one of the most widely used MAB algorithms for addressing the exploitation-exploration tradeoff of the MAB game, as it compromises between always selecting the arm with the highest average reward so far and exploring the less selected ones. The proposed UCB1-LB is implemented in each user in the centralized MP-MAB game, where the information on the RIS loads is distributed among the users by the mmWave BS. To this end, a term representing the normalized RIS loads is added to the exploration term of the original UCB1 equation. Thus, the user explores not only the less selected RIS boards but also the lightly loaded ones when selecting its associated RIS board at time t.
Algorithm 1 gives the proposed UCB1-LB algorithm. The inputs to the algorithm are $\Phi_q$, $\forall q \in \mathcal{Q}$, $\mathcal{F}$, and $T_H$, and the output is $q^{*}_{k,t}$, i.e., the selected RIS board for associating user $k$ at time $t$. For initialization, user $k$ associates with each available RIS board once and observes the attainable spectral efficiencies $\psi_{qk,t}$ for $1 \le t \le |\mathcal{Q}|$. Then, for $|\mathcal{Q}| + 1 \le t \le T_H$, it associates with the RIS board $q^{*}_{k,t}$ based on the following equation:

$$q^{*}_{k,t} = \arg\max_{q \in \mathcal{Q}} \left( \bar{\psi}_{qk,t-1} + \sqrt{\frac{2 \ln(t)}{\beta_{qk,t-1}}} - \eta_{q,t-1} \right), \tag{8}$$

where $\bar{\psi}_{qk,t-1}$ is the average spectral efficiency achieved by user $k$ from associating with RIS board $q$ till time $t-1$, and $\beta_{qk,t-1}$ is the number of times RIS board $q$ was selected for associating user $k$ till time $t-1$. The added parameter $\eta_{q,t-1}$, $0 \le \eta_{q,t-1} \le 1$, indicates the observed normalized load on RIS board $q$ just before the decision time $t$. Herein, $\eta_{q,t-1}$ is defined as follows:

$$\eta_{q,t-1} = \frac{\sum_{k=1}^{K} I_{qk,t-1}}{\max_{q' \in \mathcal{Q}} \sum_{k=1}^{K} I_{q'k,t-1}}. \tag{9}$$

In (9), the numerator indicates the traffic load on RIS board $q$ at $t-1$, while the denominator indicates the maximum traffic load over all deployed RIS boards at time $t-1$. Thus, in UCB1-LB, the term $\bar{\psi}_{qk,t-1}$ represents the exploitation term, while the term $\sqrt{2 \ln(t) / \beta_{qk,t-1}} - \eta_{q,t-1}$ represents the exploration term, where both the less selected RIS boards and the lightly loaded ones are explored. Thus, (8) means that user $k$ selects its next associating RIS board at time $t$ based on its past observations up to time $t-1$, where the time index $t$ could be on a frame-by-frame basis. The selected RIS board should have a high $\bar{\psi}_{qk,t-1}$ value and a low load, i.e., a low $\eta_{q,t-1}$ value. Thus, if an RIS board has a low value of $\eta_{q,t-1}$, i.e., $\eta_{q,t-1} \approx 0$, it is considered a lightly loaded RIS board by user $k$, and its selection probability is high, and vice versa. In this way, the lightly loaded RIS boards are preferred at every time slot, which promotes a fair load distribution among RIS boards and consequently improves Jain's fairness index in constraint (6e).
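The selection step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the load penalty is subtracted from the standard UCB1 index, measures the load of each RIS board by its number of associated users, and uses the usual incremental update for the selection count and running mean. All function and variable names are hypothetical.

```python
import math


def normalized_loads(loads):
    """Divide each RIS load by the maximum load (all zeros if nobody is associated)."""
    peak = max(loads)
    return [x / peak if peak else 0.0 for x in loads]


def ucb1_lb_select(mean_se, counts, loads, t):
    """Pick the RIS index with the largest load-penalized UCB1 score.

    mean_se[q]: average spectral efficiency observed through RIS q so far
    counts[q]:  number of times RIS q was selected (must be >= 1)
    loads[q]:   current number of users associated with RIS q
    """
    eta = normalized_loads(loads)
    scores = [
        mean_se[q] + math.sqrt(2.0 * math.log(t) / counts[q]) - eta[q]
        for q in range(len(mean_se))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def update_statistics(mean_se, counts, q, reward):
    """Increment the selection count of RIS q and refresh its running mean in place."""
    counts[q] += 1
    mean_se[q] += (reward - mean_se[q]) / counts[q]


if __name__ == "__main__":
    mean_se, counts = [1.0, 1.0], [5, 5]
    chosen = ucb1_lb_select(mean_se, counts, loads=[4, 1], t=11)
    update_statistics(mean_se, counts, chosen, reward=4.0)
    print(chosen, mean_se, counts)
```

With equal averages and selection counts, the two RIS boards differ only in their load term, so the lightly loaded board is chosen.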
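After selecting $q^{*}_{k,t}$, the corresponding $\psi_{q^{*}k,t}$ is obtained, and its related parameters $\beta_{q^{*}k,t}$ and $\bar{\psi}_{q^{*}k,t}$ are updated as follows:

$$\beta_{q^{*}k,t} = \beta_{q^{*}k,t-1} + 1, \tag{10}$$
$$\bar{\psi}_{q^{*}k,t} = \frac{1}{\beta_{q^{*}k,t}} \left( \beta_{q^{*}k,t-1}\, \bar{\psi}_{q^{*}k,t-1} + \psi_{q^{*}k,t} \right). \tag{11}$$

Algorithm 2 KLUCB-LB-Based mmWave RIS-User Association
Input: $\Phi_q$, $\forall q \in \mathcal{Q}$, $\mathcal{F}$, $T_H$.
Output: $q^{*}_{k,t}$
Initialization: Each RIS board $q$ is selected once, and the corresponding $\psi_{qk,t}$ is observed, $1 \le t \le |\mathcal{Q}|$.
1: for $t = |\mathcal{Q}| + 1 : T_H$ do
2:   Associate with RIS board $q^{*}_{k,t}$ selected using (12) and obtain the reward $\psi_{q^{*}k,t}$.
3:   Update $\beta_{q^{*}k,t}$ and $\bar{\psi}_{q^{*}k,t}$ using (10) and (11).
4: end for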

2) PROPOSED KLUCB-LB ALGORITHM
The KLUCB [33] approach is a near-optimal MAB strategy that effectively addresses the exploration-exploitation dilemma. The main distinction between KLUCB and UCB1 is that the upper confidence bound is defined using Chernoff's bound [33]. As a result, it not only outperforms UCB1 in terms of the regret bound for rewards bounded in [0,1] but is also more efficient for limited time horizons [33]. Furthermore, it is adaptable, fast, and simple, and it offers a significant theoretical and practical advance over the commonly used UCB1 [33]. Algorithm 2 explains the proposed KLUCB-LB in detail. The inputs to the algorithm are $\Phi_q$, $\forall q \in \mathcal{Q}$, $\mathcal{F}$, and $T_H$, while the output is the selected RIS board for associating user $k$ at time $t$, i.e., $q^{*}_{k,t}$. For initialization, user $k$ associates with each RIS board $q$ once, and its corresponding spectral efficiency, i.e., $\psi_{qk,t}$, is obtained. For $|\mathcal{Q}| + 1 \le t \le T_H$, user $k$ associates with one of the deployed RIS boards based on the following modified KLUCB equation:

$$q^{*}_{k,t} = \arg\max_{q \in \mathcal{Q}} \left( \max \left\{ \mu_{q,t} : \beta_{qk,t-1}\, \gamma\!\left( \bar{\psi}_{qk,t-1}, \mu_{q,t} \right) \le f(t) \right\} - \eta_{q,t-1} \right), \tag{12}$$

where $f(t) = \log(t) + 3 \log(\log(t))$ and $\gamma(\mu_1, \mu_2) = 2 (\mu_1 - \mu_2)^2$. Accordingly, $\beta_{qk,t-1} \cdot 2 \left( \bar{\psi}_{qk,t-1} - \mu_{q,t} \right)^2 \le f(t)$ is solved to find $\mu_{q,t}$ for RIS board $q$ at time $t$. In (12), the RIS load parameter, i.e., $\eta_{q,t-1}$, is added to the exploration term of the standard KLUCB equation. After selecting the RIS board $q^{*}_{k,t}$ for associating user $k$ at time $t$, its related parameters $\beta_{q^{*}k,t}$ and $\bar{\psi}_{q^{*}k,t}$ are updated using (10) and (11), as highlighted in Algorithm 2.

Algorithm 3 MOSS-LB-Based mmWave RIS-User Association
Input: $\Phi_q$, $\forall q \in \mathcal{Q}$, $\mathcal{F}$, $T_H$.
Output: $q^{*}_{k,t}$
Initialization: Each RIS board $q$ is selected once, and the corresponding $\psi_{qk,t}$ is observed, $1 \le t \le |\mathcal{Q}|$.
1: for $t = |\mathcal{Q}| + 1 : T_H$ do
2:   Associate with RIS board $q^{*}_{k,t}$ selected using (13) and obtain the reward $\psi_{q^{*}k,t}$.
3:   Update $\beta_{q^{*}k,t}$ and $\bar{\psi}_{q^{*}k,t}$ using (10) and (11).
4: end for
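The KLUCB-LB index can be illustrated with the short Python sketch below. It is a sketch under stated assumptions, not the authors' code: rewards are assumed to be normalized to [0, 1], the divergence is taken as $2(\mu_1 - \mu_2)^2$, the exploration budget $f(t)$ is clipped at zero for small $t$, the upper bound is found by bisection, and the normalized load is subtracted from the resulting index. All names are hypothetical.

```python
import math


def divergence(mu1, mu2):
    """Quadratic divergence assumed in this sketch: 2 * (mu1 - mu2)^2."""
    return 2.0 * (mu1 - mu2) ** 2


def exploration_budget(t):
    """f(t) = log(t) + 3 log(log(t)), clipped at zero (clipping is an assumption)."""
    log_t = math.log(t)
    if log_t <= 0.0:
        return 0.0
    return max(log_t + 3.0 * math.log(log_t), 0.0)


def kl_ucb_index(mean, count, t, upper=1.0, iterations=60):
    """Largest mu in [mean, upper] with count * divergence(mean, mu) <= f(t)."""
    budget = exploration_budget(t)
    low, high = mean, upper
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if count * divergence(mean, mid) <= budget:
            low = mid
        else:
            high = mid
    return low


def klucb_lb_select(mean_se, counts, loads, t):
    """Pick the RIS index with the largest load-penalized KLUCB index."""
    peak = max(loads)
    eta = [x / peak if peak else 0.0 for x in loads]
    scores = [kl_ucb_index(mean_se[q], counts[q], t) - eta[q]
              for q in range(len(mean_se))]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    print(kl_ucb_index(0.5, 100, 100))
    print(klucb_lb_select([0.5, 0.5], [10, 10], loads=[3, 1], t=50))
```

For the quadratic divergence, the bisection result coincides with the closed form $\bar{\psi} + \sqrt{f(t) / (2\beta)}$ whenever that value stays below the upper limit, which gives a quick sanity check of the solver.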

3) PROPOSED MOSS-LB ALGORITHM
MOSS [34] is another modified UCB policy, which removes the log factor from the confidence levels of UCB1 [32]. MOSS is the first scheme that made such a modification, and it relies on prior knowledge of the time horizon, a requirement that can be relaxed. In MOSS, the confidence level is selected based on the number of plays of the individual arms, the time horizon value, and the number of actions/arms. MOSS has a better regret bound than UCB1 and is flexible to stochastic and adversarial environments.
Algorithm 3 gives the proposed MOSS-LB, where the inputs to the algorithm are $\Phi_q$, $\forall q \in \mathcal{Q}$, $\mathcal{F}$, and $T_H$, while the output is $q^{*}_{k,t}$, indicating the selected RIS board for associating user $k$ at time $t$. For initialization, $\bar{\psi}_{qk,t} = 0$ and $\beta_{qk,t} = 1$, $\forall q \in \mathcal{Q}$. For $2 \le t \le T_H$, an associating RIS board is selected based on the following equation:

$$q^{*}_{k,t} = \arg\max_{q \in \mathcal{Q}} \left( \bar{\psi}_{qk,t-1} + \sqrt{\frac{\max\!\left( \ln\!\left( \frac{T_H}{|\mathcal{Q}|\, \beta_{qk,t-1}} \right), 0 \right)}{\beta_{qk,t-1}}} - \eta_{q,t-1} \right), \tag{13}$$

where $\bar{\psi}_{qk,t-1}$ is the average spectral efficiency resulting from associating user $k$ with RIS board $q$ till $t-1$. Like UCB1-LB, the RIS load term $\eta_{q,t-1}$ is added to the exploration term of the standard MOSS equation, where $\eta_{q,t-1}$ is given in (9). After selecting $q^{*}_{k,t}$, its corresponding spectral efficiency, i.e., $\psi_{q^{*}k,t}$, is obtained, and its corresponding parameters $\beta_{q^{*}k,t}$ and $\bar{\psi}_{q^{*}k,t}$ are updated using (10) and (11), as shown in Algorithm 3. Table 1 summarizes the comparisons between the proposed MAB algorithms with respect to their advantages and disadvantages.
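A compact Python sketch of the MOSS-LB selection step is given below. It assumes the original MOSS confidence term, i.e., the square root of $\max(\ln(T_H / (Q\beta)), 0) / \beta$, with the normalized RIS load subtracted from the index; both the exact constants and the way the load enters are assumptions of this example, and all names are hypothetical.

```python
import math


def moss_bonus(count, num_arms, horizon):
    """MOSS exploration bonus; it vanishes once count exceeds horizon / num_arms."""
    ratio = horizon / (num_arms * count)
    return math.sqrt(max(math.log(ratio), 0.0) / count)


def moss_lb_select(mean_se, counts, loads, horizon):
    """Pick the RIS index with the largest load-penalized MOSS index."""
    num_arms = len(mean_se)
    peak = max(loads)
    eta = [x / peak if peak else 0.0 for x in loads]
    scores = [
        mean_se[q] + moss_bonus(counts[q], num_arms, horizon) - eta[q]
        for q in range(num_arms)
    ]
    return max(range(num_arms), key=scores.__getitem__)


if __name__ == "__main__":
    print(moss_bonus(2, 2, 1000), moss_bonus(600, 2, 1000))
    print(moss_lb_select([1.0, 1.0], [2, 2], loads=[2, 1], horizon=1000))
```

Unlike the UCB1 bonus, this bonus depends on the horizon rather than the current time, and it drops to zero for boards that have already been selected more than $T_H / Q$ times, after which only the average reward and the load term drive the choice.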
In the proposed MAB schemes, re-association is done at every time t, which may raise the concern of re-association overhead. However, the mmWave RIS-user association problem differs from the conventional BS-user association given in [37] and [38], in that all users are already linked with the main BS, and the RIS boards are used only to provide additional reflected paths by adjusting their PSs towards their related UEs. That is, all PS adjustments (i.e., BT) among the BS, RIS boards, and UEs can be done in the initial phase. Then, re-association can be done on a frame-by-frame basis with negligible overhead thanks to the beamforming information distributed by the mmWave BS to both RIS boards and UEs. Nevertheless, the investigation of re-association overhead in conjunction with a dynamic channel environment will be the subject of our future investigations.
RIS-user association, will be given. In the simulation scenario, a mmWave BS is located at the center of the simulation area, where several RIS boards are deployed at its borders, and users are distributed inside it. Also, in the case of collision, where several users associate with the same RIS board, resources are equally shared among the collided users using time division multiple access (TDMA). The other essential simulation parameters are summarized in Table 2, unless otherwise stated.

A. PERFORMANCE COMPARISONS WITH NON-LOAD BALANCING MP-MAB COUNTERPARTS
In this part of the numerical analysis, we compare the performance of the proposed UCB1-LB, KLUCB-LB, and MOSS-LB algorithms against their naïve counterparts, i.e., UCB1, KLUCB, and MOSS without RIS load balancing. To implement the naïve UCB1, KLUCB, and MOSS algorithms, the RIS load balancing term η q,t−1 is removed from (8), (12), and (13), respectively. A simulation area of 2500 m² is constructed, with 4 RIS boards/panels located at its boundaries and the mmWave BS at its center.

Figure 2 shows the average total system rate against the number of randomly distributed users, i.e., K. As the number of users increases, the average total system rate increases for all compared schemes. Moreover, the proposed MP-MAB algorithms with (w) arms' load balancing outperform their counterparts without (w/o) load balancing. Notably, MOSS outperforms both KLUCB and UCB1 in both cases, w and w/o arms' load balancing. This is due to its adaptability to stochastic and adversarial environments with better regret bounds than KLUCB and UCB1, which are unaffected even after adding the load balancing term. At K = 4, MOSS-LB, KLUCB-LB, and UCB1-LB outperform their MOSS, KLUCB, and UCB1 counterparts by 12%, 14%, and 12.5%, respectively. These values become 31%, 36.5%, and 43% when K = 20. Moreover, MOSS-LB outperforms KLUCB-LB and UCB1-LB by 1.6% and 9% when K = 4, and by 0.9% and 4.6% when K = 20, respectively.

Figure 3 compares the average user data rate of the proposed MP-MAB algorithms w and w/o arms' load balancing against the number of users. As the number of users increases, the average user rate decreases due to the increased number of collisions, i.e., multiple users associated with the same RIS board, where resource sharing occurs.
Still, the proposed MP-MAB algorithms with arms' load balancing achieve better average user rates than the MP-MAB algorithms without load balancing. This comes from the collision avoidance capability of the proposed algorithms, thanks to the proposed RIS load balancing term. Also, the average user rates of the MP-MAB algorithms without load balancing nearly match each other and decrease steeply as the number of users increases. This is due to the dominance of the users' collision effect. In contrast, the average user rates of the proposed MP-MAB algorithms with arms' load balancing decrease only slightly as the number of users increases, because users' collisions are resolved effectively. Moreover, MOSS-LB shows the best performance among all compared schemes. From Fig. 3, at K = 4, MOSS-LB, KLUCB-LB, and UCB1-LB outperform MOSS, KLUCB, and UCB1 by 21%, 23%, and 22%, respectively. These gains grow to 3.73, 3.27, and 2.74 times, respectively, when K = 20. Also, MOSS-LB outperforms KLUCB-LB and UCB1-LB by 1.63% and 6.6% when K = 4, and by 20% and 61% when K = 20, respectively.

Figure 4 shows the RIS load fairness index against the number of users, where Jain's fairness index is used to evaluate RIS load fairness as given in constraint (6e). As shown in Fig. 4, thanks to their added RIS load balancing term, the proposed MP-MAB algorithms with arms' load balancing have a higher RIS load fairness index than those without load balancing. Generally, the load fairness among the RIS boards increases with the number of users due to the high number of associated users per RIS panel, which causes the RIS boards to saturate at nearly equal data rates.
For example, at K = 4, MOSS-LB, KLUCB-LB, and UCB1-LB have better RIS load fairness than their MOSS, KLUCB, and UCB1 counterparts by 2.4%, 1.7%, and 4%, respectively. These values become 8%, 9%, and 13% when K = 20. The proposed MP-MAB algorithms have nearly the same RIS load fairness performance, with MOSS-LB performing best: it outperforms the proposed KLUCB-LB and UCB1-LB by 1.2% and 1.7% when K = 4, and by 1.4% and 3% when K = 20. This nearly equal RIS load balancing performance comes from the added RIS load balancing term given in (8), (12), and (13).
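For readers unfamiliar with the metric, Jain's fairness index of n non-negative values x_1, ..., x_n is (Σ x_i)² / (n Σ x_i²), ranging from 1/n (one board carries everything) to 1 (perfectly equal loads). The Python sketch below computes it; which quantity is treated as the per-board load (e.g., number of associated users or board rate) follows constraint (6e) in the paper and is left as an assumption here.

```python
def jain_index(loads):
    """Jain's fairness index of a list of non-negative per-board loads."""
    total = sum(loads)
    sum_sq = sum(x * x for x in loads)
    if sum_sq == 0:
        # The index is undefined when every load is zero.
        raise ValueError("at least one load must be non-zero")
    return total * total / (len(loads) * sum_sq)


print(jain_index([5, 5, 5, 5]))  # 1.0, perfectly balanced
print(jain_index([4, 0, 0, 0]))  # 0.25, i.e. 1/n with n = 4 boards
```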

B. PERFORMANCE COMPARISONS WITH OTHER BENCHMARK SCHEMES
Herein, we compare the performance of the proposed MP-MAB schemes with arms' load balancing with two benchmark RIS-user association schemes: the traditional MRP-based RIS-user association and random association. In this simulation analysis, a simulation area of 400 m² is considered, where the mmWave BS is at its center, the RIS boards are deployed at its borders, and the mmWave users are randomly distributed inside it. We considered this small simulation area to simulate a highly dense network and show the effectiveness of the proposed MP-MAB schemes over the other benchmarks.

Figures 5, 6, and 7 show the average total system rate, the average user rate, and the RIS load fairness index of the compared schemes against the number of users using 4 RIS boards. As shown in Fig. 5, the proposed MP-MAB schemes with arms' load balancing achieve a higher average total system rate than the benchmark schemes, where random association shows the worst performance. This comes from the load balancing awareness of the proposed schemes, where a user will associate with a lightly loaded RIS board, maximizing its achievable data rate. Moreover, MOSS-LB has the best performance and is slightly better than KLUCB-LB. Interestingly, the average total system rate of all compared schemes tends to saturate after a certain value of K. This comes from the high number of associated users per RIS board in this highly dense scenario, which causes the RIS boards to operate at their saturated data rates. Still, the saturated data rates of the proposed schemes are higher than those of MRP and random-based user association, because the load balance among the RIS boards offered by the proposed MP-MAB schemes raises these saturated rates. On the other hand, the conventional MRP-based RIS-user association performs poorly at all tested K values and saturates at low data rates.
In effect, MRP associates the users with their nearest RIS boards irrespective of the boards' load status. This results in load imbalance among the distributed RIS boards, where users may associate with highly loaded RIS boards while lightly loaded RIS boards are kept free. As RIS resources are equally shared among the associated users, users associated with heavily loaded RIS boards will receive low rates. This is why the proposed MAB algorithms have higher average total system (user) rate performances than the conventional MRP scheme. Yet, MRP still has better total system rate performance than associating users randomly. From Fig. 5, at K = 4, the proposed MOSS-LB, KLUCB-LB, and UCB1-LB outperform MRP (random) based association by 52% (70%), 53% (68%), and 24% (38.23%), respectively. These values become 22% (42%), 22% (42%), and 21.4% (41.4%) when K = 20.
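The load imbalance caused by MRP can be illustrated with a small Python sketch. It assumes a hypothetical matrix `received_power[k][q]` holding the power user k receives via RIS board q; the values below are made up purely for illustration and are not simulation results from the paper.

```python
from collections import Counter


def mrp_association(received_power):
    """Associate each user with the board giving maximum received power.

    received_power[k][q]: power received by user k via board q (hypothetical).
    Ties are broken in favor of the lowest board index.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in received_power]


def board_loads(association, num_boards):
    """Number of users associated with each board."""
    counts = Counter(association)
    return [counts[q] for q in range(num_boards)]


# Illustrative values: all three users see board 0 as the strongest.
powers = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.6]]
assoc = mrp_association(powers)
print(assoc)                  # [0, 0, 0]
print(board_loads(assoc, 2))  # [3, 0]: board 0 overloaded, board 1 idle
```

Because MRP ignores the load, all three users share board 0 while board 1 stays free, which is exactly the situation the load balancing term is designed to avoid.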
As shown by Fig. 6, the average user rates of all compared schemes decrease as the number of users increases, due to sharing the RIS resources among a high number of associated users. Still, the proposed MP-MAB algorithms with arms' load balancing have better average user rate performances than MRP and random-based user association, and MOSS-LB has the best performance. At K = 4, the proposed MOSS-LB, KLUCB-LB, and UCB1-LB outperform MRP (random) based association by 43% (62%), 40% (58%), and 19% (34%), respectively. These values become 43% (62%), 30% (47%), and 29% (44%) when K = 20.
As shown by Fig. 7, the proposed MP-MAB algorithms with arms' load balancing have a higher RIS load fairness index than the other benchmarks. It is interesting to note that MRP has the worst load fairness performance, even worse than random user association. This is because users associate with the RIS boards providing maximum received power irrespective of whether they are lightly or heavily loaded, causing RIS load imbalance. MOSS-LB gives the best RIS load fairness performance among the compared schemes. Generally, as the number of users increases, the RIS load fairness index increases and saturates due to the increased number of associated users per RIS board, which causes the RIS boards to saturate at nearly equal data rates. However, the load balancing capability of the proposed schemes increases their RIS load fairness index over MRP and random association. From Fig. 7, at K = 4, the proposed MOSS-LB, KLUCB-LB, and UCB1-LB outperform MRP (random) based association by 65% (38%), 64% (37%), and 53% (28%), respectively. These values become 25% (7%), 24.3% (7%), and 23.7% (46.4%) when K = 20.

Figures 8-10 show the average total system rate, the average user rate, and the RIS load fairness index of the compared schemes against the number of RIS boards using 20 users.
As shown by Fig. 8, as the number of RIS boards increases, the average total system rates of all compared schemes increase, but they do not saturate as in Fig. 5. This is because, as the number of RIS boards increases for the same number of users, the number of associated users per RIS board decreases. Thus, RIS resources are not shared among many users, which greatly increases the achievable data rates of the RIS boards. Again, the proposed MP-MAB algorithms with arms' load balancing show better average total system rate performance than MRP and random association, and MOSS-LB gives the best performance. From Fig. 8, with 4 RIS boards, the proposed MOSS-LB, KLUCB-LB, and UCB1-LB outperform MRP (random) based association by 20% (36%), 20% (36%), and 19.4% (35%), respectively. These values become 37.3% (52%), 30% (44%), and 13.3% (25.4%) with 20 RIS boards.
As the number of associated users per RIS board decreases when more RIS boards are deployed, the average user rates of all compared schemes increase, as shown by Fig. 9. This is due to the increase in available RIS resources per user, as stated above. As in Fig. 8, the proposed MP-MAB algorithms with arms' load balancing have better average user rate performances than MRP and random association, and MOSS-LB has the best performance. From Fig. 9, with 4 RIS boards, the proposed MOSS-LB, KLUCB-LB, and UCB1-LB outperform MRP (random) based association by 20% (25.6%), 11.4% (16.5%), and 7.7% (12.1%), respectively. These values become 34% (48.4%), 25% (38.3%), and 12.5% (24.5%) with 20 RIS boards.
As shown by Fig. 10, the RIS load fairness indices of all compared schemes decrease as the number of RIS boards increases because of the high variety of their associated loads. Due to their load balance awareness, the proposed MP-MAB schemes achieve higher load fairness indices, which also decrease at slower rates than those of MRP and random association. As previously explained for Fig. 7, the conventional MRP-based user association has the worst RIS load fairness performance due to its unfair RIS-user association approach. Moreover, its RIS load fairness index decreases faster than that of the other compared schemes. From Fig. 10, with 4 RIS boards, the proposed MOSS-LB, KLUCB-LB, and UCB1-LB outperform MRP (random) based association by 24.65% (7.2%), 23.8% (6.5%), and 23.52% (6.2%), respectively. These values become 2.44 times (25%), 2.4 times (23%), and 2.37 times (21.4%) with 20 RIS boards.

Figures 11 and 12 show the total system rate convergence of the proposed MOSS-LB, KLUCB-LB, and UCB1-LB against t using 20 users with 4 RIS boards and 12 RIS boards, respectively. These figures show that all proposed schemes converge after t = 100, and MOSS-LB has the best convergence rate among the proposed techniques. Interestingly, in Fig. 11, all proposed methods converge to almost the same total system rate. However, in Fig. 12, a high divergence in their achieved maximum total system rates appears, where MOSS-LB converges to the highest total system rate, followed by KLUCB-LB and then UCB1-LB. This is because, in Fig. 11, the 20 users are associated with only 4 RIS boards, which increases the load on them and drives them to operate at their full capacities. However, in Fig. 12, the 20 users are associated with 12 RIS boards, resulting in high divergence in their data rates.
As MOSS-LB is better than KLUCB-LB, and KLUCB-LB is better than UCB1-LB, at finding the RIS-user association pattern that fairly spreads the users' loads among the RIS boards, a high divergence in their achievable total system rates occurs, as shown by Fig. 12. At t = 100, MOSS-LB, KLUCB-LB, and UCB1-LB converge to 97.1% (93%), 96% (94.1%), and 93% (97.5%) of their maximum total system rates when the number of RIS boards is equal to 4 (20), respectively.

C. COMPLEXITY ANALYSIS
The time complexity of the RIS-user association algorithm comes from two sources: the first is the mmWave BT process used to select the associating RIS board, while the second is the computational complexity of the algorithm. The first source is considered the primary one, as the BT process between mmWave TX and RX to find the best beam pair may take up to T BT = 50 ms, as given in [40]. In the proposed MP-MAB algorithms and random association, the user associates with only one RIS board at a time, with a time complexity of T BT. However, in MRP-based user association, the user should do BT with all deployed RIS boards and then associate with the one offering maximum received power, with a total complexity of QT BT. Thus, the proposed schemes have Q times lower BT time complexity than MRP-based association while achieving much better performance, as shown by the previous numerical results.

The second source of time complexity is negligible compared to the first one, as it only depends on instruction execution time, which is machine-dependent. For the proposed MOSS-LB, KLUCB-LB, and UCB1-LB algorithms, the computational complexity comes from selecting the associating RIS board using their designated equations and updating its related parameters, with a computational complexity of O(Q + 1). The computational complexity of MRP comes from selecting the RIS board with the highest received power, with a computational complexity of O(Q). For random selection, the computational complexity comes from generating a random number and then selecting the associating RIS board based on it, with a computational complexity of O(1). Table 3 summarizes the time complexities of the compared schemes. We can conclude from this complexity analysis that the proposed MP-MAB algorithms with arms' load balancing combine much better performance than MRP and random association with low time complexity.
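The BT overhead comparison above reduces to simple arithmetic, sketched below in Python. The 50 ms figure is the upper bound quoted from [40]; the function name and the scheme labels are hypothetical and used only for this illustration.

```python
T_BT_MS = 50.0  # upper bound on one beam training (BT) run, as quoted from [40]


def bt_time_ms(scheme, num_boards, t_bt_ms=T_BT_MS):
    """Per-association beam training time for a given scheme.

    scheme: "mab" or "random" (one BT run) or "mrp" (one BT run per board).
    The scheme labels are illustrative names, not identifiers from the paper.
    """
    if scheme == "mrp":
        # MRP must train with every deployed RIS board before choosing.
        return num_boards * t_bt_ms
    if scheme in ("mab", "random"):
        # Only the selected board requires beam training.
        return t_bt_ms
    raise ValueError(f"unknown scheme: {scheme}")


print(bt_time_ms("mrp", 4))  # 200.0 ms with Q = 4 boards
print(bt_time_ms("mab", 4))  # 50.0 ms, i.e. Q times lower
```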

VI. CONCLUSION
In this paper, the problem of optimal mmWave RIS-user association constrained by RIS load balancing was considered, and its optimization problem was formulated. This problem is NP-hard, and it is difficult to obtain the optimal solution, especially when considering the constraint of RIS load balancing. Instead of associating users with RIS boards using the conventional MRP, the problem was treated as a centralized MP-MAB game with arms' load balancing. It was reformulated as a sequential time optimization problem that can be solved using MAB algorithms. To that end, three MP-MAB algorithms with arms' load balancing were proposed, namely UCB1-LB, KLUCB-LB, and MOSS-LB, all of which come from the well-known UCB family. The performance of the three algorithms was extensively investigated under different scenarios, where MOSS-LB showed the best performance among them. Moreover, all proposed schemes outperformed MRP and random association in terms of average total system rate, average user rate, and RIS load fairness index. Also, MOSS-LB showed the best total system rate convergence among the proposed MP-MAB schemes. Furthermore, the proposed algorithms showed low complexity, comparable to that of random selection and lower than that of MRP. The study presented in this paper encourages the wider application of online learning to address several challenges in RIS-aided wireless communications.
KOHEI HATANO received the Ph.D. degree from the Tokyo Institute of Technology, in 2005. Currently, he is an Associate Professor with the Faculty of Arts and Science, Kyushu University. He is the Leader of the Computational Learning Theory Team, RIKEN Center for Advanced Intelligence Project (AIP). His research interests include machine learning, computational learning theory, online learning, and their applications.
EIJI TAKIMOTO received the Ph.D. degree from Tohoku University, in 1991. Currently, he is a Professor with the Department of Informatics, Kyushu University. His research interests include computational learning theory, online learning, and computational complexity.
MOHAMED ABDEL-NASSER (Senior Member, IEEE) received the master's degree from Aswan University, in 2013, and the Ph.D. degree (cum laude) from the University of Rovira i Virgili, Spain, in July 2016. His master's thesis was focused on developing medical image registration methods based on an artificial immune system and the doctoral thesis was on developing efficient artificial intelligence-based breast cancer image analysis approaches. Currently, he is the Director of Research with the University of Rovira i Virgili, leading research projects in biomedical and bioinformatics engineering, where he is also supervising nine Ph.D. students. He was the technical leader of many research and development projects in biomedical and bioinformatics engineering, funded by the European Commission and local funding agencies. He is an Electronics and Communications Associate Professor with Aswan University, Egypt. He is the Principal Investigator of the European Project Ensenyo Project (AI-Powered online education platform). He is a coauthor of more than 100 scientific papers in international journals and conferences. His research interests include biomedical and bioinformatics engineering and applied artificial intelligence. In 2020, he was listed among World's Top 2% Scientists. In 2017, he received the Marc Esteva Vivanco Prize for the best Ph.D. dissertation on artificial intelligence, Spain. In 2022, he received the State Encouragement Award from the Egyptian Academy of Science and Technology for his excellent research track. He serves as a guest editor and a reviewer for many special issues in indexed journals and conferences.