A Multi-Armed Bandit Algorithm for IRS-Aided VLC System Design With Device-to-Device Relays

This paper presents a communications framework to overcome the connectivity constraints due to the nonavailability of the line-of-sight transmissions in indoor optical communication systems. This nonavailability can arise for various reasons, such as blockages due to physical objects, unfavorable device orientations or large distances between the transmitter and the receiving devices. The proposed system utilizes multiple intelligent reflecting surface (IRS) arrays and device-to-device (D2D) communications. The D2D communication is realized using infrared (IR) light-emitting diodes (LEDs) with limited output power for eye safety. The performance of this system depends significantly on the assignment of the mirrors in the IRS arrays to the appropriate user links and a direct combinatorial assignment search is too complex to implement. The proposed approach identifies the assignment of each mirror in the IRS arrays as a multi-armed bandit (MAB) problem, and the assignment of all the mirrors together as a combinatorial MAB (CMAB) problem. Since a simultaneous movement of all the IRS mirrors during the implementation of the CMAB algorithm could cause frequent link disruptions, a CMAB algorithm with low disruptions (CMAB-LD) is proposed to obtain the best mirror assignment with low link disruptions. Simulation results demonstrate that the proposed algorithm can provide significant improvement in reward performance and the total reward increases by more than 100% over random mirror assignments when the channels are blocked with high probabilities. In small size problems, the proposed CMAB-LD is found to achieve the global optimal solution in just a few rounds of full arm explore operations.


I. INTRODUCTION
Due to the constant demand for higher data rates, lower latency, better connectivity, and higher user device densities, the global discussion on the sixth generation (6G) wireless communication is well under way [1]. As the radio frequency (RF) spectrum has become increasingly crowded, there has been research focus on other underutilized frequency bands and their incorporation into existing technologies. This includes the visible light communication (VLC) systems [2], [3], [4], [5], [6], [7]. VLC provides many advantages, such as large and unlicensed bandwidths, no interference with existing RF technology, and potential applications in scenarios where RF is prohibited for health reasons [8]. Additionally, since light cannot penetrate walls, the large VLC bandwidth is reusable in each room and has less need for physical layer security than RF systems, which require advanced physical layer security techniques [9]. However, the use of VLC involves problems of its own, namely the high reliance on line-of-sight (LoS) paths, which may be blocked due to the presence of objects in the communication pathway [10]. These blockages can cause severe outages, and thus degrade the user experience significantly.
The associate editor coordinating the review of this manuscript and approving it for publication was Barbara Masini.
Although RF, in general, is less reliant on the LoS path for communication, channel degradation still occurs in RF. Recently, the use of intelligent reflective surface (IRS) has received high attention in the RF literature due to its ability to actively manipulate the channel [11], [12], [13]. The use of IRS in RF has been shown to provide an improvement in the capacity and the coverage range of the base station [14]. Following the success of IRS in RF systems, the use of IRS in VLC has also garnered increasing interest in recent times [15], [16], [17], [18]. To facilitate rotatable mirror IRS in VLC, the authors in [15] develop an additive IRS model under the assumption that the light source is small with respect to the source-reflector distance. Similar to RF, the VLC IRS can also improve capacity and coverage range. However, more importantly, it aids in the handling of the LoS blockages that commonly occur in VLC [19]. Additionally, even in unblocked scenarios, the VLC IRS can potentially serve users that have the VLC access point (AP) out of their field-of-view (FoV) due to incorrect user orientations. In all cases, the mirror angles must be appropriately configured to provide large benefits. The authors in [16] formulate a nonlinear optimization problem to optimize each mirror's angles to maximize the sum rate and present an iterative algorithm. In [17], the authors formulate the mirror assignment problem as a convex optimization problem by relaxing the binary mirror assignment variable. The sum rate is maximized subject to each user having its minimum rate met. The authors in [18] use the model developed in [15] to establish a binary association for each mirror. Then the sum rate is maximized by jointly optimizing the binary mirror associations and resource allocation using the minorization-maximization algorithm.
Another method to overcome channel blockage is to utilize the device-to-device (D2D) communication [20], [21], [22], [23], [24], [25], [26]. The D2D communication can aid in serving users that are otherwise outside of the direct coverage of the AP by having some users act either as an information source or as a relay for other users. In VLC, a user may be out of the coverage of the AP due to reasons such as: 1) the VLC AP is out of the FoV of the user, 2) the user-AP distance is too large to facilitate an adequate data rate for the user, or 3) the potential communication paths between the VLC AP and the user are blocked. In [20] and [21], the authors consider a network that uses VLC downlink and RF D2D. In [22], the authors consider the problem of optimizing the selection of RF or VLC bands for D2D. The authors in [23] and [24] develop an all light fidelity (LiFi) downlink and D2D system for industrial scenarios. The use of an all VLC downlink and D2D system poses problems for scenarios with humans present, as the D2D light-emitting diode (LED) emissions may cause stress to human eyes. However, infrared (IR) communication can be used to achieve D2D and uplink communication without disturbing human eyes. The use of IR for D2D is discussed in [26], [27], and [28]. The articles [29] and [30] discuss IR uplinks, which are also fundamentally similar to D2D. A variety of discussions regarding LoS, reflected, and diffuse IR channels are given in [28] and [29]. When using IR LEDs, it is important to carefully limit the output optical power of the LED, as high power IR can damage human eyes. If the optical IR power is properly restricted, IR LEDs provide a method to facilitate D2D communication without irritating human eyes. In [25], an RF system is considered that simultaneously utilizes the IRS and D2D techniques. To the best of our knowledge, no existing work combines the IRS and D2D techniques in an all optical indoor system.
In this paper, we consider an IRS-aided downlink VLC system with IR D2D relays. A user that is served directly by the AP may act as a relay for another user, in which case it transmits using the full-duplex amplify-and-forward method. Each mirror has the freedom to be assigned to either VLC links or IR D2D links. One of the leading challenges with an all optical VLC downlink and IR D2D system is the overreliance on the LoS path for communication, as the orientations of the users may frequently prevent them from receiving data from the AP or the D2D relay. Through the simultaneous use of the IRS and D2D techniques, we aim to improve the performance and vastly extend the coverage of the VLC AP by circumventing blockages and providing many potential alternative paths. Hence, in our work we consider multiple IRS arrays, which may be on the ceiling or walls. Note that in a strictly VLC downlink system, an IRS array on the ceiling is of little to no practical use as the VLC AP mostly radiates in the downward direction. In the case of the IR D2D, the user that acts as a relay may be in a much better position to take advantage of the ceiling IRS array. We are interested in the appropriate assignment of all the IRS mirrors to the VLC and IR D2D links.
Our contributions are as follows. First, we present a novel framework of an IRS-aided VLC downlink system with IR D2D communications. The framework incorporates an IR optical power limit for eye safety purposes and allows IRS mirrors to be located at the ceiling. Mirrors in an array can be allocated to VLC or IR links. Unlike other works in this area, we include imperfect mirror pointing and user location uncertainties. Second, we develop a multi-armed bandit (MAB) formulation of mirror assignments where the assignment of a mirror to a link forms a MAB. We solve it as a combinatorial multi-armed bandit (CMAB) problem similar to [31] but with important differences. We use reward-based probabilities during the explore phase, unlike the uniform probabilities used in [31]. Further, we ensure that the mirror movements cause low disruptions during both explore and exploit phases. During the explore phase, we move only one mirror at a time, allowing more mirror combinations to be searched. Our timing aspects also differ from [31] in the sense that, during each round, we study the performance over a longer time after all the mirrors are moved. The proposed algorithm is called CMAB with low disruption (CMAB-LD). Third, we present a detailed analysis on the signal-to-interference-plus-noise ratio (SINR) of the system. Bounds on the regret and convergence are given. Finally, numerical results are presented to show excellent performance of CMAB-LD. For the examples considered, the CMAB-LD achieves a typical gain of nearly 60% in reward over a random mirror assignment approach, and this gain increases to more than 100% for higher channel blocking probabilities. The CMAB-LD also provides a fairly uniform reward distribution among the users. In a simple scenario considered, the CMAB-LD achieves the globally optimal full arm in less than 6 full arm rounds.
The rest of the paper is organized as follows. The IRS-aided hybrid VLC downlink and IR D2D system model is given in Section II. The CMAB-LD algorithm is described in Section III. We present SINR, regret and convergence analysis in Section IV. Numerical results and discussion are given in Section V. Finally, we conclude in Section VI.
Notation: Bold-face lowercase letters represent vectors. We use $|\cdot|$, $(\cdot)^T$, $\|\cdot\|$, $\max(\cdot)$, and $\mathbb{E}[\cdot]$ to denote set cardinality, transpose, Euclidean distance, maximum, and expectation functions, respectively. $\mathcal{N}(m, \sigma^2)$ denotes the Gaussian distribution with mean $m$ and variance $\sigma^2$.

II. SYSTEM MODEL
We consider a VLC downlink system that simultaneously utilizes D2D and multiple mirror IRS arrays as shown in Fig. 1. Without loss of generality, the D2D operations occur in the full duplex mode. The D2D links use IR and transmit information using the amplify-and-forward method [32]. We assume that each user device is equipped with a VLC photodiode (PD), an IR PD, and an IR LED. In order to eliminate interference between VLC and IR, we assume that each visible light (VL) PD is equipped with a VL-pass filter, and each IR PD is equipped with an IR-pass filter [29]. We assume that the AP employs a multiple access technique such as the time-division multiple access (TDMA).
There are a total of $J$ IRS mirror arrays distributed around the room. Each array consists of $Q$ mirrors so that the total number of mirrors is $M = JQ$. The $q$-th mirror in the $j$-th IRS array is referred to as the $m$-th mirror, where $m = (j - 1) \times Q + q$. Each IRS mirror can be separately configured to aid a user's communication regardless of whether the user is served by the VLC downlink or an IR D2D relay. We assume that each mirror may only be assigned to a single user. The system is managed by a central unit (CU), which is responsible for coordinating the overall configuration of the IRS mirrors.
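The array-to-global index mapping above can be sketched in a few lines of Python. This is only an illustration of the stated formula $m = (j - 1) \times Q + q$ with 1-based indices; the function names `mirror_index` and `array_and_slot` are hypothetical and do not appear in the paper.

```python
from typing import Tuple


def mirror_index(j: int, q: int, Q: int) -> int:
    """Global 1-based index m of the q-th mirror in the j-th IRS array."""
    if j < 1 or not (1 <= q <= Q):
        raise ValueError("expected j >= 1 and 1 <= q <= Q")
    return (j - 1) * Q + q


def array_and_slot(m: int, Q: int) -> Tuple[int, int]:
    """Inverse mapping: recover (j, q) from the global mirror index m."""
    if m < 1:
        raise ValueError("expected m >= 1")
    j0, q0 = divmod(m - 1, Q)
    return j0 + 1, q0 + 1


if __name__ == "__main__":
    # J = 6 arrays of Q = 4 mirrors give M = JQ = 24 mirrors in total.
    print(mirror_index(6, 4, 4))   # last mirror of the last array
    print(array_and_slot(24, 4))
```

With $J = 6$ and $Q = 4$, the last mirror of the last array maps to $m = 24 = M$, and the inverse mapping returns $(6, 4)$.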
The users in the system are divided into $K_1$ direct users (DU) and $K_2$ indirect users (IU) so that the total number of users is $K = K_1 + K_2$. The $k$-th user, $k = 1, \cdots, K_1$, is a DU if it has at least one nonzero LoS or mirror sublink (or channel) connecting it to the AP. The $k$-th user, $k = K_1 + 1, \cdots, K$, is an IU if there is no nonzero LoS or mirror channel connecting it to the AP while there exists at least one nonzero LoS or mirror sublink connecting it to a serving relay DU. The classification of DU and IU may be based on several factors, such as the distance from the AP, user orientation, user FoV and the minimum signal-to-noise ratio (SNR) required at the user. Only DUs may act as a D2D relay for an IU, hence $K_1 \ge K_2$. Further, we restrict each DU to be able to serve at most one IU, and each IU is restricted to be served by at most one DU. Whereas any of the users, $k = 1, 2, \cdots, K$, can be a possible transmitter or a receiver, the VLC AP can only work as a transmitter and is referred to as the 0-th transmitter. The link between the receiver $k$ and the transmitter $i$ is referred to as link $l$, with $l \in \{1, \cdots, K\}$. Each link may consist of multiple sublinks, each denoted by $(m, l)$, where $m$ denotes the mirror number and $l$ is the link. The notation $(0, l)$ denotes the LoS sublink. In Fig. 1, the link 1 between DU 1 and the AP consists of two sublinks: sublink $(0, 1)$ directly from the AP and sublink $(m, 1)$ via mirror $m$. The mirror sublink can provide an alternative communication path if the LoS sublink $(0, 1)$ is blocked, thus enhancing connectivity. If the LoS sublink is not blocked, the mirror sublink can still provide SNR enhancement for link 1. Let $h_t^{(m,l)}$ represent the channel corresponding to sublink $(m, l)$ at time $t$. Any sublink may be blocked with a probability $P_b$ due to objects or users in the environment. If a sublink is blocked, the corresponding channel becomes $h_t^{(m,l)} = 0$.
We assume that each user has at least one unblocked sublink and each mirror is unblocked for at least one link. A summary of system variables is given in Table 1.
Let the optical signal transmitted from the VLC AP at time $t$ be $x_t$. The VLC AP is driven by the electrical signal $x'_t$, where $x_t = \eta x'_t$ and $\eta$ is the LED gain (W/A) of the VLC AP [33]. Note that the nonlinear response of the LED can be compensated for by predistortion [34].

1) VL MODEL
We denote the total signal received for DU link $l$ at time $t$ as in (1), where $\mathbf{h}_t^{(l)} = [h_t^{(1,l)}, \ldots, h_t^{(M,l)}]$ is the $(1 \times M)$ row vector of mirror channels at time $t$ for the link $l$, $\mathbf{b}_t^{(l)} = [b_t^{(1,l)}, \ldots, b_t^{(M,l)}]$ is the row vector of binary assignment variables, $b_t^{(m,l)} \in \{0, 1\}$, at time $t$ for the link $l$, and $w_t^{(l)} \sim \mathcal{N}(0, \sigma_l^2)$. The first term in (1) corresponds to the desired signal, whereas the second term represents noise. The model for the channel coefficients $h_t^{(m,l)}$ for a given physical setup is presented in the Appendix. The SNR for DU link $l$ at time $t$ is calculated as in (2), where $P$ is the average electrical power of $x'_t$.
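To make the roles of the LoS channel, the mirror channel vector, and the binary assignment vector concrete, the following is a minimal Python sketch. It assumes (this form is not confirmed by the text available here) that the composite channel of a DU link is the LoS channel plus the sum of the assigned mirror channels, and that the SNR is $\eta^2 P \bar{h}^2 / \sigma_l^2$. The function names and the numerical values are hypothetical.

```python
from typing import Sequence


def composite_channel(h_los: float, h_mirrors: Sequence[float], b: Sequence[int]) -> float:
    """Assumed additive model: LoS channel plus channels of assigned mirrors."""
    if len(h_mirrors) != len(b):
        raise ValueError("h_mirrors and b must have the same length")
    return h_los + sum(h * bit for h, bit in zip(h_mirrors, b))


def du_snr(h_los: float, h_mirrors: Sequence[float], b: Sequence[int],
           eta: float, P: float, sigma2: float) -> float:
    """Assumed SNR form: eta^2 * P * h_bar^2 / sigma^2 (illustrative only)."""
    h_bar = composite_channel(h_los, h_mirrors, b)
    return (eta ** 2) * P * h_bar ** 2 / sigma2


if __name__ == "__main__":
    # Blocked LoS (h_los = 0): the link only works if a mirror is assigned.
    print(du_snr(0.0, [2.0, 3.0], [0, 0], eta=1.0, P=1.0, sigma2=4.0))
    print(du_snr(0.0, [2.0, 3.0], [1, 0], eta=1.0, P=1.0, sigma2=4.0))
    print(du_snr(0.0, [2.0, 3.0], [1, 1], eta=1.0, P=1.0, sigma2=4.0))
```

Under these assumptions, a link with a blocked LoS sublink has zero SNR until at least one unblocked mirror is assigned to it, which is the connectivity role of the IRS described above.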
2) IR MODEL
Next, we consider the signal received for IU link $l$. Let $\mathcal{L}^{(i)}$ be the set of active IU links. We assume that the IU link $l$ is served by a parent DU link $s$. At time $t$, the DU link $s$ receives the electrical signal $\bar{y}_t^{(s)}$ and amplifies it by a factor $\alpha_t^{(s)}$ for transmission to IU link $l$. The amplifying factor $\alpha_t^{(s)}$ will be discussed in Section II-3. The IR optical power signal sent from the transmitter of IU link $l$, $x_t^{(l)}$, is expressed as in (3), where $\eta$ is the IR LED gain and $\bar{y}_t^{(s)}$ can be found using (1). The signal received for IU link $l$ is calculated as in (4), where $w_t^{(l)} \sim \mathcal{N}(0, \sigma_l^2)$. The second term in (4) represents the LoS interference from active relays other than the relay serving link $l$. Note that a mirror interference term from relays other than the relay serving IU link $l$ is absent in (4) because the probability of interference through the IRS is very low, as discussed in [18]. The SINR can be found for IUs by substituting (3) into (4). The first term in (4), which contains $x_t^{(l)}$, is expanded in (5) to identify the desired signal. The second and third terms in (4) represent interference and noise, and their impact can be summarized using (6). Thus, the SINR at the receiver of IU link $l$, served by parent DU link $s$, can be calculated from (4), where $s'$ denotes the DU link serving another IU link $l'$.

3) IR POWER SCALING
The output IR power of the LED of IU link $l$ has to satisfy $\max(x_t^{(l)}) \le P_{\max}$, where $P_{\max}$ is the eye-safe IR power limit. Hence, the amplifying factor $\alpha_t^{(s)}$ must be bounded accordingly. Observe that the noise $w_t^{(s)}$ is usually very small with respect to the desired signal $x_t \bar{h}_t^{(s)}$ for DUs. Hence, $w_t^{(s)}$ is omitted in the derivation of $\alpha_t^{(s)}$. We assume $\max(x'_t)$ is proportional to $P$ as in [35] and we define $\max(x'_t) = \xi P$, where $\xi$ is a constant. This yields an upper bound on $\alpha_t^{(s)}$. In a real scenario, $\alpha_t^{(s)}$ may be truncated to ensure that the IR optical power never exceeds safety limits.
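The power-scaling argument can be illustrated with a short Python sketch. It rests on assumptions that are not confirmed by the text available here: that the relay emits an optical signal equal to the IR LED gain times $\alpha$ times the received electrical signal, and that, with noise neglected, the peak received signal is $\eta \, \xi P \, \bar{h}$. The names `alpha_upper_bound`, `relay_output`, and `eta_ir` are hypothetical.

```python
def alpha_upper_bound(p_max: float, eta_ir: float, eta: float,
                      xi: float, P: float, h_bar: float) -> float:
    """Largest amplification keeping the assumed peak IR output at or below p_max.

    Assumed peak relay input (noise neglected): eta * xi * P * h_bar.
    Assumed relay output: eta_ir * alpha * input.
    """
    peak_input = eta * xi * P * h_bar
    if peak_input <= 0:
        raise ValueError("the serving DU link must have a nonzero channel")
    return p_max / (eta_ir * peak_input)


def relay_output(alpha: float, eta_ir: float, y_bar: float, p_max: float) -> float:
    """Amplify-and-forward output, truncated at the eye-safety limit."""
    return min(eta_ir * alpha * y_bar, p_max)


if __name__ == "__main__":
    a_max = alpha_upper_bound(p_max=1.0, eta_ir=1.0, eta=1.0, xi=2.0, P=1.0, h_bar=0.5)
    print(a_max)
    # A noisy sample above the assumed peak is clipped to p_max.
    print(relay_output(alpha=a_max, eta_ir=1.0, y_bar=1.3, p_max=1.0))
```

The truncation in `relay_output` mirrors the remark that, in a real scenario, the amplified signal may be clipped so that the optical power never exceeds the safety limit even when noise pushes the input above its nominal peak.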

III. CMAB-LD ALGORITHM
In this section, we present the CMAB-LD algorithm to assign mirrors to various links. Note that an exhaustive search for the optimal mirror assignment will require searching over $K^M$ assignments. To simplify this, we identify the assignment of a single mirror to a link as a MAB, and the assignment of all the $M$ mirrors as a CMAB problem.
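To give a sense of scale, the short Python sketch below compares the size of the exhaustive search space, $K^M$, with the number of partial arms, $MK$, for which per-mirror statistics are kept. The values $K = 4$ and $M = 24$ are taken from the example scenario used later in the analysis; the function names are hypothetical.

```python
def num_full_assignments(K: int, M: int) -> int:
    """Number of full arms: each of the M mirrors picks one of K links."""
    return K ** M


def num_partial_arms(K: int, M: int) -> int:
    """Number of (mirror, link) pairs tracked by the per-mirror MABs."""
    return M * K


if __name__ == "__main__":
    K, M = 4, 24
    print(num_full_assignments(K, M))  # 281474976710656
    print(num_partial_arms(K, M))      # 96
```

Even this small scenario has roughly $2.8 \times 10^{14}$ full assignments, whereas only 96 partial arms need to be tracked, which is the motivation for decomposing the problem into per-mirror MABs.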
In a classical MAB problem [36], [37], an agent has to choose an action out of several available options to maximize the cumulative reward over time. Each action is called an arm. In our situation, each mirror requires an action of assignment to one of the links. Thus, there are $K$ arms for each mirror. Since there are $M$ mirrors, each requiring an action, they together form a CMAB problem. To proceed further, let $s_t^{(m)}$ denote the assignment of mirror $m$ at time $t$, and thus $s_t^{(m)} = l$ means that the $m$-th mirror is assigned to link $l$ at time $t$. The single mirror assignment will be referred to as the partial arm. On the other hand, using $\mathbf{s}_t^{(i)} = [s_t^{(1)}, \ldots, s_t^{(M)}]$ to denote the assignment of all the mirrors, we obtain the full arm $i$ played at time $t$.

A. OVERVIEW
When a mirror is assigned to a link, the implementation takes place by orienting the mirror to direct the signal at the receiving user's device of the link. This relies on the estimated 3-D user device position, which has a typical error on the order of a few cm. For example, the use of the fingerprinting method in VLC is investigated experimentally in [38], where the authors are able to obtain an average 3-D positioning error of 3.65 cm in an average time of 1.77 ms. In addition, the motor arms that point each mirror may have small errors. Therefore, there will be differences in the channel responses from one time epoch to the next when a mirror moves. This randomness is incorporated in our work.
We use a non-negative reward function for link $l$ at time $t$, given in (10), where $\gamma_t^{(l)}$ denotes the SNR of link $l$. The use of the logarithmic function for each link ensures proportional fairness in throughput-based problems [39]. This reward function aims to optimize each user's SNR by manipulating the IRS elements [40]. Further justification for the reward function is given in Section IV. Finally, the full arm reward function is given in (11), where $\mathcal{L}$ is the set of active links. We assume that each IU is paired with the closest available DU before the CMAB-LD begins. Note that it would be possible to simultaneously consider active links, user pairing, and mirror assignment in the CMAB formulation. However, from a practical perspective, changing D2D pairing in the CMAB formulation would result in excessive handover overhead. Therefore, this approach is not explored. Further, when a mirror $m$ is blocked for a link $l$, it remains blocked for that link for the full duration of the algorithm.
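As an illustration of a non-negative logarithmic link reward and a full arm reward obtained by summing over the active links, consider the following Python sketch. The specific form $\log_2(1 + \gamma)$ and the base of the logarithm are assumptions made for this example; the function names are hypothetical.

```python
import math
from typing import Dict


def link_reward(snr: float) -> float:
    """Assumed link reward: log2(1 + SNR), which is zero for a dead link."""
    if snr < 0:
        raise ValueError("SNR must be non-negative")
    return math.log2(1.0 + snr)


def full_arm_reward(snr_per_link: Dict[int, float]) -> float:
    """Sum of link rewards over the set of active links."""
    return sum(link_reward(g) for g in snr_per_link.values())


if __name__ == "__main__":
    # Link 2 is fully blocked and contributes nothing to the total.
    snrs = {1: 3.0, 2: 0.0, 3: 1.0}
    print(full_arm_reward(snrs))  # 2 + 0 + 1 = 3.0
```

Because the logarithm grows slowly for large SNR, raising a weak link from zero contributes more to the sum than a comparable SNR increase on an already strong link, which is the fairness property discussed in Section IV.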
The CMAB-LD algorithm builds on the $\epsilon$-greedy naïve sampling (NS) algorithm proposed in [31]. In that work, the author demonstrates its superiority against multiple variations. The CMAB-LD algorithm can be summarized as follows.
1) Full arm explore with probability $\epsilon_0$, or full arm exploit with probability $(1 - \epsilon_0)$.
2) If full arm explore is selected: an $\epsilon_1$-greedy MAB is performed for each $s_t^{(m)}$ independently. Then the selected full arm $\mathbf{s}_t^{(i)} = [s_t^{(1)}, \ldots, s_t^{(M)}]$ is added to the set $\mathcal{S}$ of previously played full arms.
3) If full arm exploit is selected: choose the highest mean reward full arm in the previously played full arm set, $\mathcal{S}$.
The CMAB-LD algorithm begins by creating a random initial assignment by selecting $s_t^{(m)} \in \mathcal{L}$ uniformly for each $m$ at $t = 1$ to determine the initial full arm, $\mathbf{s}_1^{(1)}$. It is possible that during mirror assignments, a certain link with no LoS path gets only a single mirror. This link is referred to as a lifeline link (LL) since the removal of the single mirror from this link disrupts communication in that link. A mirror that serves an LL is referred to as a lifeline mirror (LM). The CMAB-LD algorithm emphasizes reducing disruptions by disallowing the removal of any LM during the exploration phase and removing it only for the shortest possible duration during the exploit phase. Various sets of links and mirrors needed to understand the algorithm are defined as follows. The set of links that have the LoS path unavailable is defined as $\bar{\mathcal{L}}^{(0)} = \{l \mid h_0^{(0,l)} = 0\}$. Similarly, the set of mirrors assigned to link $l$ at time $t$ is $\mathcal{M}^{(l)}$. We define the set of LLs as in (12) and the set of LMs as in (13). Next, we define $\bar{\mathcal{L}}^{(m)}$ as the set of links that have been found to obtain no reward increase in a previous round when the mirror $m$ was assigned to them.
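The lifeline definitions above can be sketched as follows. This Python example identifies LLs as links without a LoS path that are served by exactly one mirror, and LMs as the mirrors serving them. It makes the simplifying assumption that every assigned mirror provides a nonzero channel to its link; the function name `lifeline_sets` and the data layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def lifeline_sets(assignment: Dict[int, int],
                  no_los_links: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Return (lifeline links, lifeline mirrors).

    assignment maps mirror index -> link index.
    no_los_links is the set of links whose LoS sublink is unavailable.
    Assumes each assigned mirror gives its link a nonzero channel.
    """
    mirrors_per_link: Dict[int, List[int]] = {}
    for mirror, link in assignment.items():
        mirrors_per_link.setdefault(link, []).append(mirror)

    # A lifeline link has no LoS path and exactly one serving mirror.
    ll = {l for l in no_los_links if len(mirrors_per_link.get(l, [])) == 1}
    # A lifeline mirror is the single mirror serving a lifeline link.
    lm = {mirrors_per_link[l][0] for l in ll}
    return ll, lm


if __name__ == "__main__":
    assignment = {1: 1, 2: 1, 3: 2, 4: 3}
    print(lifeline_sets(assignment, no_los_links={2, 3, 4}))
```

In this example, links 2 and 3 each depend on a single mirror (mirrors 3 and 4), so those mirrors would not be moved during a full arm explore, while link 4 has no mirror at all and therefore is not a lifeline link.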

B. GAIN OR LOSS IN REWARD
When a mirror $m$, assigned to a link $j$ at time $t - 1$, is moved and assigned to a link $l$ at time $t$, the reward value of link $j$ may decrease and the reward value of link $l$ may increase at time $t$. It is important to note that links $l$ and $j$ can potentially be DUs. If these DUs are serving as relays for IUs, then the rewards of the corresponding IUs are also impacted by the movement of mirror $m$. We denote $n$ as the potential IU link that relies on the parent link $l$ for its relay operation, and $p$ as the potential IU link that uses parent link $j$ for relay. Hence, we can model the change in the reward function based on the decrease in reward values of links $j$ and $p$ and the increase in the reward values of links $l$ and $n$. The $m$-th partial arm reward function is the sum of two terms: the $(+m,l)$ term, which is the gain in reward for adding mirror $m$ to link $l$, and the $(-m,j)$ term, which is the gain (loss in real terms) in reward for removing mirror $m$ from link $j$. Each term combines direct and indirect contributions, where $F_t^{(+m,l,l)}$ and $F_t^{(-m,j,j)}$ are the direct reward gains due to the movement of mirror $m$, $F_t^{(+m,l,n)}$ and $F_t^{(-m,j,p)}$ are the indirect reward gains due to the movement of mirror $m$, and $\lambda_{u,v} = 1$ if link $u$ acts as a relay for link $v$ and $\lambda_{u,v} = 0$ otherwise.
The algorithm requires that we maintain the number of times played and the mean reward of each full arm $i$, $N^{(i)}$ and $\mu_n^{(i)}$ respectively, for $i = 1, \ldots, |\mathcal{S}|$. The mean after the $n$-th time full arm $i$ is played, $\mu_n^{(i)}$, is defined recursively. In addition, we must maintain the number of times the partial arm $(m, l)$ has been played, $N^{(m,l)}$, and the mean reward for that partial arm, $\mu_n^{(m,l)}$, for $m = 1, \ldots, M$ and $l = 1, \ldots, K$. The mean after the $n$-th time partial arm $(m, l)$ is played, $\mu_n^{(m,l)}$, is likewise defined recursively. In the algorithm box, we refer to $\mu_n^{(i)}$ and $\mu_n^{(m,l)}$ without the $n$ index for simplicity. Note that at time $t = 0$, all mirrors $m$ are unassigned. Hence, in the initialization, the removal term in (14) is zero for all mirrors in the initial round, since no mirror $m$ is assigned to a link $j$ at time $t = 0$.
VOLUME 12, 2024
We denote the best full arm by $\mathbf{s}^*$. Further, we denote $s^{(m)*}$ as the $m$-th element of $\mathbf{s}^*$. A summary of the algorithm variables is given in Table 2.
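The bookkeeping of play counts and recursively updated mean rewards can be sketched with the standard incremental-mean update, $\mu_n = \mu_{n-1} + (r_n - \mu_{n-1})/n$. This Python example assumes that the recursive definitions referred to above take this standard form; the class name `ArmStats` is hypothetical and the same structure is used here for both full arms and partial arms.

```python
class ArmStats:
    """Play count and running mean reward of one (full or partial) arm."""

    def __init__(self) -> None:
        self.count = 0    # number of times the arm has been played
        self.mean = 0.0   # mean reward over those plays

    def update(self, reward: float) -> float:
        """Fold one new reward into the running mean without storing history."""
        self.count += 1
        self.mean += (reward - self.mean) / self.count
        return self.mean


if __name__ == "__main__":
    # One statistics object per partial arm (mirror m, link l).
    M, K = 2, 3
    partial = {(m, l): ArmStats() for m in range(1, M + 1) for l in range(1, K + 1)}
    for r in (1.0, 2.0, 3.0):
        partial[(1, 2)].update(r)
    print(partial[(1, 2)].count, partial[(1, 2)].mean)
```

The incremental form needs only the previous mean and the play count, so the memory cost per arm stays constant regardless of the number of rounds.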

C. ALGORITHM DESCRIPTION
The CMAB-LD algorithm begins with the initialization steps performed by Lines 1 through 19 given in the algorithm box labeled as Algorithm 1. Note that each mirror assignment in Algorithm 1, Function 1 or Function 2 produces a channel response that includes pointing error uncertainty as described at the end of the Appendix. Before the mirrors are assigned, Lines 1-5 find the initial rewards for the links and produce the set of links $\bar{\mathcal{L}}^{(0)}$ that do not have LoS paths. Lines 6-16 assign each mirror randomly and calculate the partial arm reward function. The counters to keep track of time as well as the number of arms played, the mean partial arm rewards, and the mirror set for each link are updated. The set of links $\bar{\mathcal{L}}^{(m)}$ that see no reward benefit from using mirror $m$ is updated for each $m$. Otherwise, $\bar{\mathcal{L}}^{(m)}$ is initialized as empty. Lines 17-19 set the full arm and calculate the reward function. The counter, mean, and full arm set $\mathcal{S}$ are updated. The LL and LM sets are determined. In Lines 20-25, the algorithm performs $H$ rounds of repetitions between full arm explore with a probability of $\epsilon_0$ or full arm exploit with a probability of $1 - \epsilon_0$. Accordingly, the explore or the exploit function given in the boxes is activated. During each round, the best full arm $\mathbf{s}^*$ is updated.
In the full arm explore function (Function 1), each mirror $m$ is handled sequentially. Line 3 checks if the mirror $m$ is an LM, in which case the assignment of mirror $m$ is not changed. Otherwise, a decision is made to explore MAB $m$ (Line 6) or exploit MAB $m$ (Line 9). It is important to note that until mirror $m$ has found a nonzero reward from a previous link assignment, there is no point in performing a partial arm exploit for mirror $m$. Hence, through Lines 5 and 6, the partial arm explore probability depends on an indicator $a$, where $a = 1$ if $\mu^{(m,l)} = 0$ for all $l \in \mathcal{L}$ and $a = 0$ otherwise. If MAB $m$ is explored, then Line 7 finds the set of priority links for the next assignment and Line 8 assigns the mirror to a link $l$ with a probability $p_t^{(m,l)}$ at time $t$, such that $\sum_{l \in \mathcal{L}} p_t^{(m,l)} = 1$. The priority set of links is defined as the set of links that are available for mirror $m$ that also have zero reward, and thus need immediate attention. The assignment probability $p_t^{(m,l)}$ is calculated as in (23). This ensures that users with lower rates are prioritized for mirror assignment when explore is chosen for MAB $m$. This differs from the uniform probability based selection of [31]. Lines 13-17 check if the created full arm $\mathbf{s}^{(i)}$ is already in the full arm set $\mathcal{S}$. If it is, the algorithm finds the index of the full arm. If not, the algorithm creates a new index for the new full arm, initializes the counter and mean variables, then adds it to the set $\mathcal{S}$. Lines 18-20 calculate the full arm reward function and update the counter and mean variables. Lines 21-23 check if there was no gain from the assignment of mirror $m$, in which case the algorithm adds the mirror assignment to $\bar{\mathcal{L}}^{(m)}$. Lines 24 and 25 update the link and mirror sets. Line 27 ensures that after the last mirror is allocated, the mirror assignment is maintained for the remainder of the scheduling period.
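The idea of reward-based exploration, as opposed to uniform sampling, can be illustrated with the Python sketch below. It is not the paper's expression (23), which is not reproduced here. Instead it assumes one plausible rule: zero-reward candidate links form the priority set and share the probability mass equally; otherwise each candidate link is weighted by the inverse of its current reward. The function name and the weighting rule are hypothetical.

```python
from typing import Dict, Set


def explore_probabilities(link_rewards: Dict[int, float],
                          excluded: Set[int]) -> Dict[int, float]:
    """Assignment probabilities for one mirror during a partial arm explore.

    link_rewards: current reward of each link.
    excluded: links already known to gain nothing from this mirror.
    Assumed rule (illustrative only): zero-reward links get priority;
    otherwise the probability is proportional to 1 / reward.
    """
    candidates = {l: r for l, r in link_rewards.items() if l not in excluded}
    if not candidates:
        raise ValueError("no candidate link available for this mirror")

    priority = [l for l, r in candidates.items() if r == 0.0]
    if priority:
        return {l: 1.0 / len(priority) for l in priority}

    weights = {l: 1.0 / r for l, r in candidates.items()}
    total = sum(weights.values())
    return {l: w / total for l, w in weights.items()}


if __name__ == "__main__":
    print(explore_probabilities({1: 4.0, 2: 1.0, 3: 0.0}, excluded=set()))
    print(explore_probabilities({1: 4.0, 2: 1.0, 3: 0.0}, excluded={3}))
```

Whatever the exact weighting, the two properties stated above are preserved in this sketch: the probabilities sum to one, and links with lower rewards are more likely to receive the mirror.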
In the full arm exploit function (Function 2), Lines 1-13 sequentially move each mirror to the optimal assignment while ensuring that if a mirror is taken from an LL then that LL immediately gets its own optimal mirror.This is repeated until all LLs are taken care of in that cycle.Lines 14-17 calculate the full arm reward function, update the mirror sets for each link, and determine the LL and LM sets.Line 18 ensures that after the last mirror is moved, the mirror assignment is maintained for the remainder of the scheduling period T.
In both full arm explore and exploit functions, the mirrors are moved sequentially. In the case of a full arm explore, we assume that each mirror assignment can be completed in a time of $\tau_1$. Therefore, the total time required to assign all the mirrors is $M\tau_1$. This duration is referred to as the change period (CP) as shown in Fig. 2. After the last mirror is assigned, the algorithm maintains the assignments for the remaining duration, $T - M\tau_1$, which is called the dwell period (DP). Note that the algorithm does not allow moving any LM during a full arm explore. Therefore, there is no disruption in any link during a full arm explore. In a full arm exploit, it is assumed that each mirror movement can be completed in a time duration of $\tau_2$. Since an exploit operation can be completed with much lower computations than an explore operation, we have $\tau_2 \ll \tau_1$. The CP duration is $M\tau_2$, and after the last mirror is assigned, the full arm exploit maintains the assignment of the mirrors for the full duration of the DP, $T - M\tau_2$. It is possible that a link may undergo disruption for at most a duration $\tau_2$ during the full arm exploit phase. This will occur if an LM is removed but the algorithm immediately provides a replacement mirror. Since $\tau_2$ is small, its effect is negligible and its negative impact can also be mitigated using error correction codes.

FIGURE 2. A depiction of the mirror movement timing for the full arm explore and full arm exploit functions. $\Delta\tau_1$ and $\Delta\tau_2$ are the times it takes to move each mirror in the explore and exploit full arm functions, respectively. Note that because the explore full arm function involves additional decision making by the CU, $\Delta\tau_1 \gg \Delta\tau_2$.

Algorithm 1 (excerpt): Set $t \leftarrow t + 1$. 8: Assign mirror $m$ to random link $l \in \mathcal{L}$ so that $s_t^{(m)} = l$.
Function 1 (excerpt): Set $t \leftarrow t + 1$. If $\mu^{(m,l)} = 0$ for all $l \in \mathcal{L}$, set $a = 1$; else $a = 0$.
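The split of a scheduling period into a change period and a dwell period can be checked with a small Python sketch. The numerical values of the per-mirror times and of $T$ below are hypothetical and chosen only to make the arithmetic easy to follow; the function name `round_timing` is also hypothetical.

```python
from typing import Tuple


def round_timing(M: int, tau: float, T: float) -> Tuple[float, float]:
    """Return (change period, dwell period) for one full arm round.

    M: number of mirrors, moved one at a time.
    tau: time to move one mirror (tau_1 for explore, tau_2 for exploit).
    T: total scheduling period of the round.
    """
    change_period = M * tau
    if change_period > T:
        raise ValueError("scheduling period too short to move all mirrors")
    return change_period, T - change_period


if __name__ == "__main__":
    M, T = 24, 20.0
    print(round_timing(M, 0.5, T))     # explore: slower per-mirror decisions
    print(round_timing(M, 0.0625, T))  # exploit: much shorter change period
```

With these illustrative numbers the explore round spends 12 of 20 time units moving mirrors, while the exploit round spends only 1.5, leaving most of the period for dwelling on the chosen assignment.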

IV. CMAB-LD ANALYSIS
In this section, we will show that a direct optimization of each active link's SNR results in a greedy solution, where the users with the best channel conditions are favored highly.The logarithmic reward function given in (10) provides a fairer reward distribution.We will then present bounds on the regret and convergence of CMAB-LD.

Function 2 (excerpt):
end repeat.
12: end if.
13: end for.
14: Calculate $r(\mathbf{s}^{(i)})$ using (11).
15: Update $\mathcal{M}^{(l)}$ based on $\mathbf{s}^*$ for $l = 1, \ldots, K$.
16: Set $t \leftarrow t + M$.
17: Find the LL and LM sets using (12) and (13).
18: Maintain assignment $\mathbf{s}^{(i)}$ for the remainder of the scheduling period.

First recall that $F_t^{(+m,l,l)}$ is the direct gain in reward for link $l$ due to mirror $m$ being assigned to link $l$ at time $t$. If link $l$ serves as a parent link to relay data for another link $n$, then $F_t^{(+m,l,n)}$ is the indirect gain in reward for link $n$ due to mirror $m$ being assigned to link $l$. Similarly, the direct gain (which is a loss) in reward for unassigning mirror $m$ from link $j$ is expressed as $F_t^{(-m,j,j)}$. If link $j$ also acts as a relay for link $p$, we express the indirect loss in reward for unassigning mirror $m$ from link $j$ as $F_t^{(-m,j,p)}$. An alternative expression for $F_t^{(+m,l,l)}$ can be given as in (24), where $\gamma_t^{(+m,l,l)}$ is the SNR gain for link $l$ when mirror $m$ is assigned to link $l$. The remaining gains $F_t^{(+m,l,n)}$, $F_t^{(-m,j,j)}$, and $F_t^{(-m,j,p)}$ follow by substituting the corresponding SNR gains in (24); for instance, $\gamma_t^{(-m,j,p)}$ is the SNR gain for link $p$ when mirror $m$ is unassigned from link $j$. We express $\gamma_t^{(+m,l,l)}$ and $\gamma_t^{(-m,j,j)}$ in a single variable as $\gamma_t^{(\pm m,l,l)}$. For $l \le K_1$, or DU links, $\gamma_t^{(\pm m,l,l)}$ is given in (25), where $g$ denotes the non-line-of-sight (NLoS) mirror channel $h_t^{(m,l)}$ for sublink $m$ of link $l$ at time $t$, and link $l$ is replaced with link $j$ in the $(-m)$ case. For $l > K_1$, $\gamma_t^{(\pm m,l,l)}$ is given in (26), where link $l$ is replaced with link $j$ in the $(-m)$ case and link $s$ serves as the relay link for link $l$ or $j$. We write $\gamma_t^{(+m,l,n)}$ and $\gamma_t^{(-m,j,p)}$ as a single variable as in (27), where the amplifying factor $\alpha$ is calculated with link $n$ replaced with link $p$ and the serving link $l$ replaced with link $j$ in the $(-m)$ case.
When moving mirror $m$, we seek to maximize the gain $\gamma_t^{(+m,l,l)}$ while minimizing the loss $\gamma_t^{(-m,j,j)}$. The term $(g^2 \pm 2\bar{h}_{t-1}^{(l)} g)$ appears in (25), (26), and (27), and in the case of link $l$, we seek to maximize $(g^2 + 2\bar{h}_{t-1}^{(l)} g)$. Even though the term $(\bar{h}_{t-1}^{(l)} \pm g)^2$ appears in the denominators of (26) and (27), which includes the term $(g^2 \pm 2\bar{h}_{t-1}^{(l)} g)$, it can be shown that the numerator in (26) and (27) increases more quickly with $(g^2 \pm 2\bar{h}_{t-1}^{(l)} g)$. It can be observed from the $2\bar{h}_{t-1}^{(l)} g$ term that mirrors with strong channels are best paired with links that have strong channels already, where $\bar{h}_{t-1}^{(l)}$ is the composite channel for link $l$ at time $t - 1$. In the case of link $j$, we seek to minimize the magnitude of $(g^2 - 2\bar{h}_{t-1}^{(j)} g)$. This can be accomplished by taking mirror $m$ from links where $\bar{h}_{t-1}^{(j)}$ is low, or weak links. In the case that the log reward function is used, we seek to maximize $F_t^{(+m,l,l)}$ while minimizing the loss $F_t^{(-m,j,j)}$. It can be observed from (24) that for link $l$, $F_t^{(+m,l,l)}$ is maximized when $\gamma_t^{(+m,l,l)} / (1 + \gamma_{t-1}^{(l)})$ is large. Since $\gamma_{t-1}^{(l)} \gg 1$ typically, we write that $F_t^{(+m,l,l)}$ is maximized when $\gamma_t^{(+m,l,l)} / \gamma_{t-1}^{(l)}$ is large. Similarly, the loss $F_t^{(-m,j,j)}$ is minimized when $\gamma_t^{(-m,j,j)} / \gamma_{t-1}^{(j)}$ is small. This indicates that the largest increase in reward is obtained when low impact mirrors are taken from strong links and given to weak links. It follows that using (10) as a reward function will encourage a much fairer mirror allocation than using $\gamma_t^{(l)}$ as a reward function directly.
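The contrast between the linear and logarithmic objectives can be reproduced numerically. The Python sketch below assumes, for DU links only, that the SNR is proportional to the squared composite channel, so that adding a mirror channel $g$ changes the SNR by an amount proportional to $g^2 + 2\bar{h}g$, and that the link reward is $\log_2(1 + \gamma)$. The proportionality constant `scale`, the function names, and the channel values are hypothetical.

```python
import math


def snr_gain(h_bar: float, g: float, scale: float = 1.0) -> float:
    """SNR increase when a mirror channel g is added to composite channel h_bar.

    Uses (h_bar + g)^2 - h_bar^2 = g^2 + 2 * h_bar * g.
    """
    return scale * (g * g + 2.0 * h_bar * g)


def log_reward_gain(h_bar: float, g: float, scale: float = 1.0) -> float:
    """Increase in log2(1 + SNR), written as log2(1 + gain / (1 + previous SNR))."""
    previous = scale * h_bar ** 2
    return math.log2(1.0 + snr_gain(h_bar, g, scale) / (1.0 + previous))


if __name__ == "__main__":
    g = 1.0
    for name, h_bar in (("strong link", 10.0), ("weak link", 1.0)):
        print(name, snr_gain(h_bar, g), round(log_reward_gain(h_bar, g), 4))
```

For the same mirror, the linear SNR gain is larger on the strong link (21 versus 3), whereas the log reward gain is larger on the weak link (about 1.32 versus 0.27). Under these assumptions, a linear objective keeps feeding strong links and a logarithmic objective steers mirrors toward weak links.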
The previous discussion is elaborated in Fig. 3 using numerical examples for both the linear and the log-based SNR optimization approaches. We consider an example scenario with $K_1 = 2$ DUs, $K_2 = 2$ IUs, $Q = 4$ mirrors per IRS array, and $M = 24$ mirrors in total. The rest of the parameters will be discussed in Section V and are given in Table 3. In this scenario, mirrors are sequentially assigned to the link that provides the largest increase in the sum linear SNR or the sum log SNR. Note that the final results are displayed using (10) for both the linear and the log-based methods. Observe that optimizing the linear SNR reward function results in Link 3 being abandoned entirely. Instead, the linear SNR method results in the improvement of Link 1, which has the highest SNR. In contrast, optimizing the log SNR reward function results in Link 3 being well connected. Additionally, the log SNR method only results in a small decrease for Link 1 in comparison to the linear SNR method.
It is hard to preemptively know $p^{(m,*)}_t$ and p(s ). We define the set of useful link assignments for mirror $m$ as L (m) = L − L(m), omitting the $t$ subscript for simplicity. Hence, for sufficiently large $|S|$, we bound (35) as $P^{(m)} \ge P^{(m)}_{\min}$. Note that $\min_{l \in L^{(m)}} p^{(m,l)}_t > 0$ via (23). It follows that $P'$ can be bounded as in (38). After each mirror has moved during a full arm explore, the probability of not selecting $s^*$ is $(1 - P')$. Since the probability of selecting a full arm explore is $\epsilon_0$, the portion of regret $R_1$ incurred due to a full arm explore using (32) is given by (39). Under the worst-case assumption that $P' = 0$, we can bound $R_1$ as in (40). By adding (34) and (40), the average regret per round can be expressed as in (41). Thus, the cumulative regret over $H$ rounds is given by (42).

C. CONVERGENCE
The probability of selecting $s^*$ in a full arm explore round is guaranteed to be more than $P'$, since the individual mirror assignments performed sequentially during a full arm explore produce multiple full arms and each one can potentially be the optimal arm (see Line 16 in the full arm explore function). Let $P^*_r$ be the probability of selecting $s^*$ in a full arm explore round, so that $P^*_r > P'$. The probability $P'$ is bounded in (38). It is expected that after $t'$ full arm rounds, $\epsilon_0 t'$ rounds have been full arm explore rounds. The probability $P^*$ that $s^*$ is not present in the set $S$ after $t'$ full arm rounds is given by (43). Since $P' > 0$, it follows that $P^*$ tends to 0 as $t'$ increases. Although this bound closely follows [31], the convergence of the CMAB-LD will be much faster than this bound suggests since $P^*_r > P'$.
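To get a feel for how quickly such a bound decays, the sketch below evaluates one form that is consistent with the argument above. The expression for the miss probability is not reproduced in this text, so the form $(1 - P')^{\epsilon_0 t'}$ used here is an assumption: roughly $\epsilon_0 t'$ of the rounds are explore rounds, and each misses $s^*$ with probability at most $(1 - P')$. The numerical values and function names are hypothetical.

```python
import math

def miss_probability_bound(p_prime, eps0, t_rounds):
    """Assumed bound on the probability that s* has not been found."""
    return (1.0 - p_prime) ** (eps0 * t_rounds)

def rounds_for_target(p_prime, eps0, target):
    """Smallest t' for which the assumed bound drops to `target` or below."""
    return math.ceil(math.log(target) / (eps0 * math.log(1.0 - p_prime)))

# Hypothetical values: P' = 0.05, eps0 = 0.8, target miss probability 1%.
p_prime, eps0, target = 0.05, 0.8, 0.01
t_needed = rounds_for_target(p_prime, eps0, target)

for t in (10, 50, t_needed):
    print(t, miss_probability_bound(p_prime, eps0, t))
```

Because $P^*_r > P'$, the number of rounds computed this way is pessimistic relative to the behavior described for CMAB-LD.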

V. SIMULATION RESULTS
Unless otherwise stated for a specific figure, we use the parameters defined in Table 3, which are taken mostly from [17], [18], and [29]. Without loss of generality, we assume $P = 1$. Each channel value is produced based on the locations of the users, their orientations, and the IRS as described in the Appendix. While modeling each channel, we also generate a binary blocking variable that is 0 with probability $P_b$ and 1 with probability $(1 - P_b)$. Each channel value is then multiplied by its binary blocking variable. Each time a mirror is given a new link assignment, it points optimally based on the estimated positions of the receiver and the transmitter in the link. To model pointing uncertainties, a Gaussian noise sample is added to the true position of the user each time a mirror is moved to assist the user's link, as described in the Appendix, with $\sigma_p = 0.01$. We use the 10 cm × 10 cm mirror size used in other VLC IRS works [16], [17], [18]. Our IRS arrays are placed centered on each of the four walls, with two IRS arrays located on the ceiling. Unless stated otherwise, our results use an ensemble averaging over 300 iterations, where we take the average of the best full arm reward. Note that we do not include the effect of the DP shown in Fig. 2 in our results.
In Fig. 4, we investigate the impact of $\epsilon_0$ and $\epsilon_1$ on the performance of the CMAB-LD algorithm. Of the cases shown, the case with $(\epsilon_0 = 0.8, \epsilon_1 = 0.2)$ converges the quickest. It is worth noting that a higher $\epsilon_0$ will typically result in faster convergence, but the cumulative regret will also increase in the long run for higher $\epsilon_0$ values, as shown in (42). For the same value of $\epsilon_0 = 0.8$, the case with $\epsilon_1 = 0.2$ provides faster convergence than the case with $\epsilon_1 = 0.8$, which suggests that a smaller $\epsilon_1$ is preferable. The best method for fast convergence is therefore to full arm explore frequently, while exploring just a few mirrors at a time and exploiting most of them. This allows the CMAB-LD algorithm the flexibility to find better mirror assignments, while primarily exploiting assignments that the algorithm has found to provide high reward in the past.
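The blocking and pointing-noise models described in the simulation setup at the start of this section can be sketched as follows. This is an illustrative sketch only: the function names, the placeholder channel gains, and the example user position are hypothetical, and only $P_b$ and $\sigma_p = 0.01$ come from the text.

```python
import random

def blocking_variable(p_block, rng):
    """Return 0 (blocked) with probability p_block, otherwise 1."""
    return 0 if rng.random() < p_block else 1

def apply_blocking(channel_gains, p_block, rng):
    """Multiply each channel value by its own binary blocking variable."""
    return [h * blocking_variable(p_block, rng) for h in channel_gains]

def noisy_position(true_position, sigma_p, rng):
    """Add an independent N(0, sigma_p^2) sample to each coordinate."""
    return [x + rng.gauss(0.0, sigma_p) for x in true_position]

rng = random.Random(0)

# Hypothetical unblocked channel gains, all equal for illustration.
gains = [1e-6] * 10000
blocked = apply_blocking(gains, p_block=0.8, rng=rng)
fraction_blocked = sum(1 for h in blocked if h == 0) / len(blocked)

# Hypothetical true user position in metres; the noise is redrawn on each call,
# mirroring a new sample every time a mirror is moved to assist the user.
true_position = [1.0, 2.0, 0.85]
estimate = noisy_position(true_position, sigma_p=0.01, rng=rng)

print(fraction_blocked, estimate)
```

Drawing a fresh noise sample on every call is what makes the resulting mirror channel differ each time the same mirror is reassigned to the same link.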
In Fig. 5, we display the rewards of the individual users for three cases: none, random, and CMAB-LD. The 'none' case represents user rewards when the mirrors have not yet been assigned, the 'random' case represents rewards after the mirrors have been assigned randomly, and the 'CMAB-LD' case represents rewards after the CMAB-LD algorithm. Some users are unconnected initially due to blockages in the physical environment and poor receiver orientations with respect to the AP. Intelligent assignment of the mirrors can alleviate these problems and get users connected by taking advantage of unblocked paths. In this case, users 1-5 are DUs and users 6-10 are IUs. Without using the mirrors, only users 2, 4, and 5 could communicate, and none of the IUs were connected, as their SNR values are zero. After the random mirror assignment, one DU and two IUs are still left unconnected. The CMAB-LD algorithm ensured connections to all users in this case, with the lowest reward being 13.94. The rewards are also fairly distributed among the users. The total full arm reward is 58.95 when no mirrors are assigned, and it increases to 116.38 after mirrors are initialized randomly. The CMAB-LD raises the final reward to 182.28 with all users being provided connectivity.
FIGURE 6. Comparison of reward-based exploration probabilities and uniform exploration probabilities. The reward-based exploration probabilities are calculated using (23).
Figure 6 shows the CMAB-LD performance comparison between the reward-based mirror allocation and the uniform-probability mirror allocation. The reward-based mirror allocation probabilities are calculated using $p^{(m,l)}_t$ in (23), and the uniform mirror allocation method uses $p^{(m,l)}_t = 1/K$. We show results for blocking probabilities $P_b = 0.2$, 0.5, 0.8, and 0.9. The reward-based mirror allocation performs much better than the uniform-probability allocation for all blocking probabilities $P_b$, and the performance gap is higher for larger blocking probabilities. This is because the reward-based allocation targets mirrors at users that are unconnected. When an unconnected user gets connected due to the assignment of a mirror, a large full arm reward is obtained. The uniform allocation method does not prioritize unconnected users. Also, the CMAB-LD converges rather slowly for higher blocking probabilities. This is due to more NLoS mirror paths getting blocked, resulting in more failed attempts by CMAB-LD to get the users connected. The figure also shows a vertical line corresponding to $t = M = 96$. This is an important marker, since at this time all the $M = 96$ mirrors have been allocated based on the random assignment of Line 8 in Algorithm 1. Therefore, the total reward at this time acts as a baseline reference to characterize the gain in reward achievable from the CMAB-LD algorithm over randomly assigned mirrors. For the high $P_b$ cases of 0.8 and 0.9, the full arm reward values at $t = 96$ are 67 and 23 respectively. It can be seen that the CMAB-LD algorithm provides an increase of more than 100% in total reward over the random mirror assignment rewards in these cases. It is also evident that for higher blocking probabilities, the maximum obtainable full arm reward decreases as the available solution space gets reduced.

FIGURE 8. The CMAB-LD is compared against the UCB-1 strategy in a scenario where the globally optimal mirror assignment can be feasibly known.
for a total of $M = 54$, 96, 150, and 216 mirrors respectively. Figure 7 shows that the maximum obtainable full arm reward increases with $M$. However, the gain obtained from increasing $M$ decreases as $M$ grows. The reason is as follows. Initially, if a mirror is assigned to a link $l$ that has a low SNR $\gamma^{(l)}_{t-1}$, there will be a high gain in reward. This can be seen from (24), as the denominator in the second term within the log function is small. Once a mirror is assigned to a low SNR link, the SNR of the link rises and there is a diminishing return in reward from assigning additional mirrors to that link. This diminishing return impacts the maximum achievable full arm reward, the gap between curves for different $M$ values, and the convergence speed of the algorithm. The early convergence of the CMAB-LD is roughly the same regardless of the number of mirrors $M$ used.
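The diminishing return described above can be illustrated with a small numerical sketch. It is not taken from the paper's simulation: it assumes a reward of the form $\log_2(1 + c\,h^2)$, where each added mirror increases the composite channel $h$ by the same amount `g`. The constant `c`, the value of `g`, and the function names are hypothetical.

```python
import math

def log_reward(h, c=1000.0):
    """Assumed log reward for a link with composite channel h."""
    return math.log2(1.0 + c * h ** 2)

def marginal_gains(h0, g, n_mirrors, c=1000.0):
    """Reward increase from each successive mirror added to the same link."""
    gains = []
    h = h0
    for _ in range(n_mirrors):
        gains.append(log_reward(h + g, c) - log_reward(h, c))
        h += g
    return gains

# Hypothetical unconnected link (h0 = 0) receiving five identical mirrors.
gains = marginal_gains(h0=0.0, g=0.1, n_mirrors=5)
for n, gain in enumerate(gains, start=1):
    print(f"mirror {n}: reward increase {gain:.3f}")
```

For the values assumed here, each additional mirror adds less reward than the previous one. With a very small per-mirror SNR, the first few increments under this quadratic model need not decrease, so the sketch should be read as an illustration of the regime discussed in the text rather than a general proof.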
To provide a benchmark comparison, we consider a simple case of $K_1 = 2$ DUs, $K_2 = 1$ IU, $J = 6$ IRS panels, and $Q = 1$ mirror per IRS panel. This scenario has $K^{JQ} = 3^6 = 729$ total full arms to choose from. Therefore, the globally optimal full arm is found by exhaustively calculating the full arm reward for each full arm and choosing the best. We also use the upper confidence bound 1 (UCB-1) method, which uses the decision metric in (44), where $i_{\text{UCB-1}}$ is the full arm selected by UCB-1. Note that we do not restrict the UCB-1 to single mirror movements. Instead, for each single mirror movement in CMAB-LD, we allow the UCB-1 to move all the mirrors in search of the optimal full arm, thus giving UCB-1 an additional benefit. The results are shown in Fig. 8. The figure shows that the CMAB-LD converges to the optimal full arm in roughly 35 mirror movements, which is about 6 full arm rounds in this scenario. The UCB-1 method does not converge to the optimal full arm in the figure, as the UCB-1 method must first try each full arm at least once. In this figure, UCB-1 is equivalent to trying each full arm sequentially. This can additionally be observed from (44). Note that if given enough time, UCB-1 would strike a good balance between exploration and exploitation. However, the use of UCB-1 becomes increasingly infeasible for larger scenarios, such as the scenarios used in the other figures with $10^{96}$ full arms.
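The reason UCB-1 behaves like a sequential sweep in this benchmark can be seen from a short sketch. The decision metric of the paper is not reproduced in this text, so the code below assumes the standard UCB1 index (empirical mean plus $\sqrt{2\ln t / n_i}$) with an initial pass over untried arms. The reward values are random placeholders, not the full arm rewards of the system model, and all function names are hypothetical.

```python
import itertools
import math
import random

K, J, Q = 3, 6, 1  # links, IRS panels, mirrors per panel (benchmark scenario)

# Each full arm assigns one of the K links to each of the J*Q mirrors.
full_arms = list(itertools.product(range(K), repeat=J * Q))

def ucb1_select(counts, means, t):
    """Standard UCB1 rule: try untested arms first, then maximize the index."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(range(len(counts)),
               key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]))

def run_ucb1(reward_fn, n_arms, rounds):
    """Run UCB1 and return the sequence of selected arm indices."""
    counts, means, picks = [0] * n_arms, [0.0] * n_arms, []
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
        picks.append(arm)
    return picks

rng = random.Random(1)
placeholder_reward = [rng.random() for _ in full_arms]  # hypothetical rewards
picks = run_ucb1(lambda a: placeholder_reward[a], len(full_arms), rounds=800)

print(len(full_arms), len(set(picks[:len(full_arms)])))
```

With 729 full arms, the first 729 selections are spent trying each arm once, which is far more than the roughly 35 mirror movements reported for CMAB-LD in this scenario.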

VI. CONCLUSION
A novel all-optical indoor wireless communication system using IRS arrays with D2D relays is presented to provide user connectivity in the presence of channel blocking. The system allows the use of multiple IRS arrays located at multiple places, including the ceiling, and can handle user location uncertainties. A novel CMAB-LD algorithm is presented to determine the best mirror assignment to provide improved user connectivity with fairness. A detailed analysis of the reward selection, regret, and convergence bounds is given. The cumulative regret upper bound is shown to depend linearly on the number of full arm rounds. Simulation results demonstrate that the best method for fast convergence of the CMAB-LD algorithm is to full arm explore frequently, while exploring just a few mirrors at a time and exploiting most of them. The numerical results also show that the proposed algorithm can provide significant improvement in reward performance, and the total reward may increase by more than 100% over random mirror assignments when the channel blocking probabilities are high. In small size problems, the proposed CMAB-LD algorithm was found to achieve the global optimal solution in just a few rounds of full arm explore operations, and it outperforms the UCB-1 method, which requires too many operations to explore the options. Possible future directions of research include adaptations of the algorithm to a dynamic scenario with frequent user movements.

APPENDIX
At time $t$, the LoS channel gain $h^{(0,l)}_t$ for sublink $(0, l)$ is given by (45) when the transmitter of link $l$ is within the FoV of the PD, and $h^{(0,l)}_t = 0$ otherwise. In (45), $\rho$ is the responsivity of the PD, $A = A_r C F_0$ is a constant, $A_r$ is the area of the PD, $C$ is the optical concentrator gain, $F_0$ is the optical filter gain, $\nu$ is the Lambertian index of the LED, $d_l$ is the receiver-transmitter distance for link $l$, $\phi_l$ is the angle of irradiance for the transmitter of link $l$, and $\theta_l$ is the angle of incidence for the receiver of link $l$. The time index $t$ is omitted from the right-hand-side (RHS) variables $d_l$, $\phi_l$, and $\theta_l$ in (45) for simplicity. We can calculate the cosine terms in (45) using the normal vectors of the transmitter and receiver. First, we define $\mathbf{v}_k$ and $\mathbf{v}_i$ as unit vectors that point outward along the axis of receiver $k$ and transmitter $i$ respectively. They are also referred to as the receiver and transmitter orientation vectors. We express the orientation of the $k$-th user, $\mathbf{v}_k$, in terms of the polar and azimuth angles of the receiver as in (46), where $\omega_k$ and $\beta_k$ are the polar and azimuth angles of the receiver respectively. Recall that link $l$ is composed of the receiver and transmitter pair $(k, i)$, and define $\mathbf{d}_l$ as a unit column vector pointing outward from the PD of receiver $k$ towards the LED of transmitter $i$. Therefore, the corresponding vector from the transmitter will be $-\mathbf{d}_l$. Then $\cos\phi_l = -\mathbf{v}_i^T \mathbf{d}_l$ and $\cos\theta_l = \mathbf{v}_k^T \mathbf{d}_l$. All angles and unit vectors are shown in Fig. 9.
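The LoS computation described above can be sketched in code. The exact expressions of the paper are not reproduced in this text, so the sketch assumes the commonly used Lambertian LoS model, $h = \rho A (\nu + 1) \cos^{\nu}\phi \, \cos\theta / (2\pi d^2)$, and the usual spherical-coordinate form of the orientation vector. The FoV parameter `fov`, the check that the receiver lies in front of the LED, the numerical values, and all function names are hypothetical.

```python
import math

def orientation_vector(polar, azimuth):
    """Unit orientation vector from polar and azimuth angles (radians)."""
    return (math.sin(polar) * math.cos(azimuth),
            math.sin(polar) * math.sin(azimuth),
            math.cos(polar))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def los_gain(p_tx, p_rx, v_tx, v_rx, rho, area_const, nu, fov):
    """Assumed Lambertian LoS gain; 0 outside the FoV or behind the LED."""
    diff = [a - b for a, b in zip(p_tx, p_rx)]
    dist = math.sqrt(dot(diff, diff))
    d_unit = [x / dist for x in diff]   # from the receiver PD towards the LED
    cos_phi = -dot(v_tx, d_unit)        # cosine of the irradiance angle
    cos_theta = dot(v_rx, d_unit)       # cosine of the incidence angle
    if cos_phi <= 0 or cos_theta < math.cos(fov):
        return 0.0
    return (rho * area_const * (nu + 1) / (2 * math.pi * dist ** 2)
            * cos_phi ** nu * cos_theta)

# Hypothetical geometry: receiver at the origin facing up, LED 2 m above facing down.
v_rx = orientation_vector(0.0, 0.0)
v_tx = orientation_vector(math.pi, 0.0)
gain = los_gain((0.0, 0.0, 2.0), (0.0, 0.0, 0.0), v_tx, v_rx,
                rho=1.0, area_const=1e-4, nu=1, fov=math.radians(60))

# Same geometry with the receiver facing down: the LED is outside the FoV.
gain_outside = los_gain((0.0, 0.0, 2.0), (0.0, 0.0, 0.0), v_tx,
                        orientation_vector(math.pi, 0.0),
                        rho=1.0, area_const=1e-4, nu=1, fov=math.radians(60))
print(gain, gain_outside)
```

The sign conventions follow the text: the irradiance cosine uses $-\mathbf{d}_l$ because $\mathbf{d}_l$ points from the receiver towards the transmitter.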
In [15], the authors develop an approximate model for the NLoS channel under the assumption of a point source. The model is valid for cases where the source is small and sufficiently far from the mirror. At time $t$, the NLoS mirror channel gain $h^{(m,l)}_t$ for sublink $m$ of link $l$ is given by (47) when the transmitter of link $l$ is within the FoV of the PD, and $h^{(m,l)}_t = 0$ otherwise. In (47), $\theta_{m,l}$ is the angle of incidence between the receiver of link $l$ and mirror $m$, $\phi_{m,l}$ is the angle of irradiance between the transmitter of link $l$ and mirror $m$, $d^{(1)}_{m,l}$ is the distance between the transmitter of link $l$ and mirror $m$, $d^{(2)}_{m,l}$ is the distance between the receiver of link $l$ and mirror $m$, and $\kappa$ is the reflective coefficient of the IRS array. The time index $t$ is omitted from the RHS variables $d^{(1)}_{m,l}$, $d^{(2)}_{m,l}$, $\phi_{m,l}$, and $\theta_{m,l}$ in (47) for simplicity. We can calculate the cosine terms in (47) as $\cos\phi_{m,l} = \mathbf{v}_i^T \mathbf{d}^{(1)}_{m,l}$ and $\cos\theta_{m,l} = \mathbf{v}_k^T \mathbf{d}^{(2)}_{m,l}$, where $\mathbf{d}^{(1)}_{m,l}$ is a unit column vector pointing outward from the transmitter of link $l$ to mirror $m$ and $\mathbf{d}^{(2)}_{m,l}$ is a unit column vector pointing outward from the receiver of link $l$ to mirror $m$. In (47), we assume that the mirror orientation vector $\mathbf{v}_m$, for mirror $m$, is set such that the vector $\mathbf{d}^{(1)}_{m,l}$ represents the incident direction and $-\mathbf{d}^{(2)}_{m,l}$ represents the direction of the reflection [15]. Paired with a binary assignment variable $b$, we can represent the assignment of mirrors without the need to optimize mirror angles, which would otherwise form a nonlinear optimization problem [16].
Finally, our model considers mirror pointing uncertainty due to positioning errors and pointing imperfections. To express this mathematically, we first define $\mathbf{p}^{(l)}_1$ and $\mathbf{p}^{(l)}_2$ as the true positions of the transmitter and the receiver in link $l$ respectively. Let $\mathbf{p}^{(m)}$ be the position of mirror $m$. For the NLoS mirror sublinks, the unit vectors are $\mathbf{d}^{(1)}_{m,l} = (\mathbf{p}^{(m)} - \mathbf{p}^{(l)}_1)/\lVert \mathbf{p}^{(m)} - \mathbf{p}^{(l)}_1 \rVert_2$ and $\mathbf{d}^{(2)}_{m,l} = (\mathbf{p}^{(m)} - \mathbf{p}^{(l)}_2)/\lVert \mathbf{p}^{(m)} - \mathbf{p}^{(l)}_2 \rVert_2$. Note that all distances, irradiance angles, and incidence angles in (47) depend on the position vectors $\mathbf{p}^{(l)}_1$, $\mathbf{p}^{(l)}_2$, and $\mathbf{p}^{(m)}$. To model the pointing errors, we add a 3-D noise vector $\mathbf{n}$ to $\mathbf{p}^{(l)}_2$ prior to generating the mirror channel in simulation. Each component of the noise vector $\mathbf{n}$ is distributed as $\mathcal{N}(0, \sigma_p^2)$, where $\sigma_p^2$ is the positional noise variance. When mirror $m$ is assigned to link $l$, we calculate $h^{(m,l)}_t$ using the resulting noisy position. The noise vector $\mathbf{n}$ ensures that the NLoS channel $h^{(m,l)}_t$ is different each time mirror $m$ is assigned to link $l$, which captures the impact of random pointing errors.

FIGURE 1. An example of the IRS-aided VLC downlink system with IR D2D. There are a total of $K = 5$ users that include $K_1 = 3$ DUs and $K_2 = 2$ IUs.

is the indirect SNR gain for link $n$ when mirror $m$ is assigned to link $l$. For link $j$, we calculate $F^{(-m,j,j)}_t$ is the direct SNR loss when mirror $m$ is unassigned from link $j$. Similarly, we calculate $F^{(-m,j,p)}_t$

FIGURE 4. Full arm reward of the best full arm versus discrete mirror movement $t$ is shown for a variety of full arm and partial arm explore probabilities $\epsilon_0$ and $\epsilon_1$.

FIGURE 5. A comparison of user rewards when no mirrors are assigned, after mirrors are assigned randomly, and after the CMAB-LD has assigned mirrors.

FIGURE 7. The performance of the CMAB-LD is compared for various values of $M$, the total number of mirrors.

FIGURE 9. A depiction of the unit vectors and angles used to calculate (45) and (47).

TABLE 1. Frequently used system variables.