Accelerated Deep Reinforcement Learning for Uplink Power Control in a Dynamic Cell-Free Massive MIMO Network

We investigate the deep reinforcement learning (DRL) framework for uplink power control in a cell-free massive multiple-input, multiple-output (MIMO) network. Although DRL does not require prior sets of training data as opposed to supervised or unsupervised machine learning approaches, existing methods suffer from substantial convergence time, which is prohibitive in a highly dynamic or large-scale mobile environment. To address this crucial issue, we propose a DRL framework that capitalizes on prioritized sampling to speed up the learning process, thereby enabling rapid adaptation to the variations of the wireless environment. The proposed method is not only tailored to user mobility, but also to network variations due to device activation and deactivation. Numerical results demonstrate the effectiveness of our proposed algorithm, as it exhibits near-optimal performance, outperforming the benchmark schemes in terms of the guaranteed rate and total power consumption, with much faster convergence.

recently gained popularity with its ability to solve various resource allocation problems with much lower computational complexity while achieving good performance [5], [6]. In particular, deep reinforcement learning (DRL) has garnered much interest, as it does not require any prior knowledge in the form of training datasets, which are not always available; instead, it relies on reward feedback that is usually inherent in the target wireless systems [7]. In [8], it was employed for power assignment in uplink cell-free massive MIMO considering both sum rate and user fairness. A DRL-based algorithm for downlink power allocation was presented in [9] assuming different optimization objectives, and was shown to outperform deep learning and conventional optimization techniques. However, these works only considered a static scenario, which is rather unrealistic and does not take advantage of the full potential of DRL. In [10], DRL was utilized for uplink power control to maximize the network sum rate while satisfying individual user rate constraints. While both static and mobile users were considered, it did not tackle one of the challenges of DRL, namely convergence speed. Specifically, the DRL system must be retrained whenever the wireless environment changes. Slow convergence implies that by the time this retraining is finished, the environment is likely to have changed again, rendering the decisions outdated. To speed up the training process, prioritized sampling was used in [11], where, in contrast to uniform sampling, experiences that are regarded as more important are selected more often. This technique was utilized in [12] to solve a power allocation problem balancing SINR maximization and power minimization. However, that investigation only considered a fully static, point-to-point scenario.
In this letter, we design a DRL-based framework for uplink power control in cell-free massive MIMO. Unlike previous studies, it considers not only static and mobile users, but also another layer of dynamicity: the activation and deactivation of devices over time, which allows us to integrate more realistic scenarios. For instance, Internet of Things (IoT) devices have limited battery power and thus go into a sleep mode to prolong their battery life. However, existing algorithms require precise knowledge of the UEs' ON/OFF patterns, which are difficult to predict, as they depend on the battery state of the individual devices. This motivates us to design a method that is capable of learning such dynamic behaviors on the fly. Our proposed model-free DRL method is specially crafted such that the power control decision is based solely on the user rate feedback; hence, neither the CSI nor the device activation pattern needs to be known in advance. Additionally, we augment the vanilla deep deterministic policy gradient (DDPG) DRL algorithm [13] with prioritized sampling to ensure that the system can quickly adapt to changes in the wireless environment, as in the case of a live network. The prioritization is dictated by the temporal-difference (TD) error [7], which is automatically calculated when updating DDPG, and therefore does not incur additional computational complexity. Our main contributions are summarized as follows.
1) We first provide the mathematical formulation of the target uplink power control problem, aiming at maximizing the guaranteed rate of a cell-free massive MIMO network.
2) We describe the proposed DDPG-based method, which, unlike existing schemes, is specifically designed to adapt in an online fashion to various dynamics of the wireless environment, such as user mobility and variations of UE activation patterns over time. This is realized by capitalizing on TD-error-based prioritized sampling.
3) The effectiveness of the proposed method is shown through simulation results under different scenarios, where it outperforms the benchmark schemes in terms of convergence speed, guaranteed rate, and power consumption.
II. SYSTEM MODEL
We consider an uplink cell-free massive MIMO network with $M$ single-antenna APs and $K$ single-antenna UEs, such that $M \gg K$. The set of all APs is denoted by $\mathcal{M} = \{1, \ldots, M\}$, and the set of all UEs by $\mathcal{K} = \{1, \ldots, K\}$. The channel between AP $m$ and UE $k$ is given by $h_{k,m} = \sqrt{g_{k,m}}\,\tilde{h}_{k,m}$. The large-scale fading coefficient $g_{k,m}$ follows a distance-dependent path loss model with lognormally distributed shadow fading, and the small-scale fading coefficients $\tilde{h}_{k,m} \sim \mathcal{CN}(0, 1)$ are independent and identically distributed (i.i.d.). We consider the block-fading channel model. Each coherence block contains $\tau_c = \tau_p + \tau_u + \tau_d$ samples, of which $\tau_p$ are for uplink training, $\tau_u$ are for uplink data, and $\tau_d$ are for downlink data transmission.
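The channel model above can be illustrated with a minimal sketch that draws one coefficient as the product of a large-scale gain and a unit-variance complex Gaussian term. The function name and the path-loss constants (`pl_exponent`, `ref_loss_db`, `shadow_std_db`) are placeholders chosen for illustration only; the text does not specify the path-loss model parameters.

```python
import math
import random


def sample_channel(distance_m, rng, pl_exponent=3.7, ref_loss_db=30.0, shadow_std_db=8.0):
    """Draw (g, h) with h = sqrt(g) * h_small for one AP-UE pair.

    The path-loss constants are illustrative assumptions, not the paper's values.
    """
    # Large-scale fading in dB: distance-dependent path loss plus lognormal shadowing.
    path_loss_db = ref_loss_db + 10.0 * pl_exponent * math.log10(max(distance_m, 1.0))
    shadowing_db = rng.gauss(0.0, shadow_std_db)
    g = 10.0 ** (-(path_loss_db + shadowing_db) / 10.0)
    # Small-scale fading ~ CN(0, 1): real and imaginary parts each have variance 1/2.
    sigma = math.sqrt(0.5)
    h_small = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
    return g, math.sqrt(g) * h_small
```

Averaged over many draws, the ratio |h|^2 / g should be close to 1, which reflects the unit variance of the small-scale term.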
All UEs first transmit their pilot sequences simultaneously. The pilot sequence of UE $k$ is denoted by $\varphi_k \in \mathbb{C}^{\tau_p \times 1}$. The received $(\tau_p \times 1)$ pilot signal at AP $m$ is, where $\rho_p$ is the pilot transmit power, and $n_m$ is the noise vector with i.i.d. elements following $\mathcal{CN}(0, \sigma_n^2)$. AP $m$ estimates the channel by projecting the received pilot signal onto where $\mathcal{V}_k$ denotes the set of users utilizing the same pilot sequence $\varphi_k$. The minimum mean-squared error (MMSE) channel estimate is then [1], [10],
During uplink data transmission, all UEs transmit their data simultaneously. UE $k$ sends its symbol where $\rho_k$ is the uplink transmit power of UE $k$, and $n_m^{(u)}$ is the noise at AP $m$. Each AP utilizes its local channel estimate $\hat{h}_{k,m}$ to get $\hat{h}_{k,m}^{*} y_m$, which is then sent to the CPU for data detection. The received signal at the CPU is, The signal-to-interference-plus-noise ratio (SINR) of UE $k$ in (6), shown at the bottom of the page, is obtained through the power of the three terms in (5), as in [1], [10]. The rate of UE $k$ in bits per second (bps) is, where $B$ is the system bandwidth in Hertz (Hz).

III. PROBLEM FORMULATION
We focus on determining the uplink transmit power for all UEs, with the goal of maximizing the minimum user rate over the cell-free massive MIMO network. The optimization problem is formulated as, where (8a) represents our objective, and (8b) defines the valid range for the power values, with $\rho_{\max}$ being the maximum transmit power. The problem is equivalent to maximizing the minimum user SINR, which, in principle, can be solved using conventional optimization algorithms, including those described in [1], [14]. However, as argued in Section I, such approaches require exact knowledge of which devices are active and which are deactivated, which is impractical in IoT applications, especially if these changes occur rapidly. Therefore, we next design a DRL scheme that relies only on the user performance.
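For reference, a form of Problem (8) that is consistent with the description in this section can be sketched as follows. The rate symbol $R_k$ and the lower bound of zero in the power constraint are assumptions made here, since the displayed problem is not reproduced in this text.

```latex
\begin{align}
\max_{\{\rho_k\}} \quad & \min_{k \in \mathcal{K}} \; R_k \tag{8a} \\
\text{s.t.} \quad & 0 \le \rho_k \le \rho_{\max}, \quad \forall k \in \mathcal{K}. \tag{8b}
\end{align}
```

If the rate is a monotonically increasing function of the SINR, as is the case for the usual logarithmic rate expression, maximizing the minimum rate and maximizing the minimum SINR select the same power vector, which is the equivalence invoked above.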
IV. PROPOSED DRL-BASED UPLINK POWER CONTROL WITH PRIORITIZED SAMPLING
In this section, we present our proposed DRL framework for uplink power control, designed to cope with device activation and deactivation, and to accelerate convergence. The CPU serves as the DRL agent, and the environment consists of the users and APs, as depicted in Fig. 1.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
A. DRL Components
1) State: The actual ON/OFF state of UE $k$ at time $t$ is denoted by $d_k^{(t)} \in \{0, 1\}$. The UEs may switch from active to inactive mode (and vice versa), which we refer to as UE toggling, with parameters $T_{\mathrm{tog}}$ and $K_{\mathrm{tog}}$. The UE toggling period $T_{\mathrm{tog}}$ specifies the number of episodes over which the ON/OFF state of the UEs is assumed to be constant. The number of UEs that switch every $T_{\mathrm{tog}}$ episodes is indicated by $K_{\mathrm{tog}}$. The activation pattern is, however, unknown at the CPU, which estimates it based on whether the APs have received any signal from a specific device. The CPU considers a UE inactive if no associated signal, such as UE rate feedback, has been detected. This is reflected in the state vector describing the environment, where $b_k \in \{0, 1\}$ indicates whether UE $k$ is estimated to be active or not at time $t$, such that $b_k$ ..., and $u$ ...
2) Action: Based on the current state observation, the agent decides the uplink transmit power of all $K$ UEs denoted by, It assigns zero power to UEs estimated to be inactive, while allocating non-zero power within the range $\xi \le \rho_k \le \rho_{\max}$ to the active users, with $0 < \xi \ll 1$.
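The activity estimate and the action constraint described above can be sketched in a few lines. The function names are hypothetical, and representing missing rate feedback as `None` is an assumption made purely for illustration.

```python
def estimate_activity(rate_feedback):
    """Return b_k = 1 if rate feedback from UE k was detected, else 0.

    Missing feedback is modeled as None (an illustrative assumption).
    """
    return [0 if r is None else 1 for r in rate_feedback]


def project_action(raw_powers, b, xi, rho_max):
    """Assign zero power to UEs estimated inactive; clip the rest into [xi, rho_max]."""
    return [min(max(p, xi), rho_max) if active else 0.0
            for p, active in zip(raw_powers, b)]
```

For example, with feedback `[1.2e6, None, 3.0e6]` the second UE is flagged inactive and receives zero power, while the remaining raw actor outputs are clipped into the admissible range.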

3) Reward:
The agent receives a corresponding reward, where is the set of active UEs.
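A reward of this kind can be sketched as the minimum rate over the UEs flagged as active, which matches the guaranteed-rate objective of Problem (8). The function name and the convention of returning 0.0 when no UE is active are assumptions for illustration.

```python
def guaranteed_rate_reward(rates, active):
    """Minimum rate over the UEs flagged active.

    Returns 0.0 when no UE is active (an assumed convention).
    """
    active_rates = [r for r, a in zip(rates, active) if a]
    return min(active_rates) if active_rates else 0.0
```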

B. Proposed Algorithm
Algorithm 1 outlines the procedure of the proposed method, which we detail below. We leverage the uniform sampling-based DDPG DRL algorithm in [13], as it was shown to best handle the continuous state and action spaces involved.
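Step 2: The CPU observes the current ON/OFF UE states and user performance. It decides the uplink transmit powers, and consequently, obtains a reward that is the guaranteed rate that we aim to maximize. It then estimates the latest activation state of the UEs for the next time step, as described in Section IV-A.
Algorithm 1 (excerpt)
Initialize a random process $\mathcal{N}$ for action exploration.
Observe the current state $s^{(t)}$.
Select and apply action $a^{(t)} \leftarrow \mu(s^{(t)} \mid \theta^{\mu}) + \mathcal{N}^{(t)}$.
9: Observe the reward $r^{(t+1)}$ and next state $s^{(t+1)}$.
10: Store experience $(s^{(t)}, a^{(t)}, r^{(t+1)}, s^{(t+1)}, p_{\max})$ in $\mathcal{B}$.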
Step 3: Unlike the vanilla DDPG algorithm, which uniformly samples the experiences used to update its DNN parameters, we propose to augment it with prioritized sampling, also known as prioritized experience replay (pER) [11]. When the agent saves a new experience $i$ in its replay buffer $\mathcal{B}$, we attach a priority value $p_i$, forming the modified experience tuple $(s_i, a_i, r_i, s'_i, p_i)$. A new experience is always assigned the current maximum priority $p_{\max}$ so that it is sampled at least once (Line 10).
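The storage rule of Step 3 can be sketched with a small list-based buffer. The class name, the first-in-first-out eviction, and the initial value of `p_max` are assumptions for illustration; practical implementations often use a sum-tree for efficiency.

```python
class PrioritizedBuffer:
    """Replay buffer in which every new experience receives the current maximum priority."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.experiences = []  # tuples (s, a, r, s_next)
        self.priorities = []   # p_i, aligned with self.experiences
        self.p_max = 1.0       # assumed initial maximum priority

    def store(self, s, a, r, s_next):
        # Evict the oldest experience once the buffer is full (assumed FIFO policy).
        if len(self.experiences) >= self.capacity:
            self.experiences.pop(0)
            self.priorities.pop(0)
        self.experiences.append((s, a, r, s_next))
        self.priorities.append(self.p_max)
```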
Step 4: The priority values are converted to probabilities $P(i) \in [0, 1]$ as, where $\mathrm{len}(\mathcal{B})$ denotes the current length of the buffer (Line 11). The prioritization factor $\alpha \in [0, 1]$ controls how much we rely on the prioritization, with $\alpha = 0$ corresponding to uniform sampling.
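A common way to realize Step 4, assumed here to follow the proportional variant of prioritized experience replay in [11], is to normalize the priorities raised to the power $\alpha$ and then draw indices according to the resulting distribution. The function names are hypothetical, and sampling with replacement is an assumption.

```python
import random


def sampling_probabilities(priorities, alpha):
    """P(i) = p_i^alpha / sum_j p_j^alpha over the current buffer contents."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_minibatch(probs, batch_size, rng):
    """Draw batch_size buffer indices according to probs (with replacement)."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)
```

With priorities `[1.0, 4.0]` and `alpha = 0.5`, the probabilities are 1/3 and 2/3, whereas `alpha = 0` yields the uniform distribution.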
Step 5: The pER mechanism introduces a distribution bias, which we correct by assigning importance-sampling weights to the samples, We define a correction parameter β ∈ [0, 1], such that β = 0 corresponds to the case where no correction is made.The experiences with high priority are likely to be oversampled by pER.We counter this by assigning them lower weights to lessen their impact (Line 12).
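The bias correction of Step 5 can be sketched as follows, assuming the importance-sampling weights take the form used in [11], namely $(\mathrm{len}(\mathcal{B}) \cdot P(i))^{-\beta}$ normalized by the largest weight. The normalization and the function name are assumptions, not details confirmed by the text.

```python
def importance_weights(probs, buffer_len, beta):
    """w_i = (buffer_len * P(i))^(-beta), scaled so that the largest weight equals 1."""
    raw = [(buffer_len * p) ** (-beta) for p in probs]
    w_max = max(raw)
    return [w / w_max for w in raw]
```

Experiences with a larger sampling probability receive a smaller weight, and `beta = 0` makes all weights equal to 1, that is, no correction.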
Step 6: We sample a mini-batch of $X$ experiences based on the calculated probabilities. For each sample, we compute the TD error $\delta_i = y_{\mathrm{Targ},i} - Q(s_i, a_i \mid \theta^{Q})$. The target networks are used to calculate the updated Q-value $y_{\mathrm{Targ}}^{(t)} = r^{(t+1)} + \gamma Q'(s^{(t+1)}, \mu'(s^{(t+1)} \mid \theta^{\mu'}) \mid \theta^{Q'})$ based on the Bellman equation [7]. The primary critic network is updated by minimizing the weighted loss between the updated and current Q-values, over the $X$ sampled experiences using gradient descent (Lines 13 to 14).
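The quantities in Step 6 can be sketched with the networks replaced by plain callables. The weighted mean-squared form of the loss and the function name are assumptions, since the loss expression itself is not reproduced in this text.

```python
def td_errors_and_loss(batch, weights, q, q_target, mu_target, gamma):
    """Compute TD errors and an importance-weighted squared loss.

    batch: list of (s, a, r, s_next); q, q_target, mu_target: callables standing in
    for the primary critic, target critic, and target actor networks.
    """
    deltas = []
    for s, a, r, s_next in batch:
        # Bellman target computed with the target actor and target critic.
        y = r + gamma * q_target(s_next, mu_target(s_next))
        deltas.append(y - q(s, a))
    # Weighted mean of squared TD errors (assumed form of the critic loss).
    loss = sum(w * d * d for w, d in zip(weights, deltas)) / len(batch)
    return deltas, loss
```

In a full implementation the loss would be minimized with respect to the primary critic parameters by an automatic-differentiation framework; the sketch only shows how the target, the TD error, and the weights combine.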
Step 7: The priorities are calculated based on the TD errors as, where $\epsilon > 0$ is a small number to avoid dividing by 0 in (12). Experiences with larger $|\delta_i|$ are assigned a higher priority, so that they are more likely to be sampled, resulting in more chances of minimizing the TD error. Based on the newly calculated TD errors, we update the priorities of the sampled experiences, and, subsequently, the current $p_{\max}$ (Lines 15 to 16).
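The priority update of Step 7 can be sketched as below, assuming the proportional rule of [11] in which each priority is the absolute TD error plus a small constant. The function name and the default value of `eps` are hypothetical.

```python
def update_priorities(priorities, sampled_indices, deltas, p_max, eps=1e-6):
    """Set p_i = |delta_i| + eps for the sampled experiences and return the new p_max."""
    for i, d in zip(sampled_indices, deltas):
        priorities[i] = abs(d) + eps
        p_max = max(p_max, priorities[i])
    return p_max
```

The constant `eps` keeps every priority strictly positive, so the normalization in the sampling probabilities never divides by zero, even when all TD errors vanish.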
Step 8: The primary actor network is updated by taking the following derivative and applying gradient ascent [13], The target networks are updated using Polyak averaging (Lines 17 to 18).
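Polyak averaging of the target networks can be sketched on flat parameter lists. The mixing coefficient `tau` is a hypothetical name, and its value is not given in this text.

```python
def polyak_update(target_params, primary_params, tau):
    """Soft update: theta_target <- tau * theta_primary + (1 - tau) * theta_target."""
    return [tau * p + (1.0 - tau) * t
            for t, p in zip(target_params, primary_params)]
```

A small `tau` makes the target networks track the primary networks slowly, which is the usual way to stabilize the Bellman targets in DDPG.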
Step 9: We anneal $\beta$ during training as, We initialize $\beta$ to $\beta_{\mathrm{start}}$, and increase its value until it reaches $\beta_{\mathrm{end}}$ over $N_{\mathrm{ts}}$ time steps (Line 19).
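One simple schedule that matches this description is a linear ramp, sketched below. The linear shape and the function name are assumptions, since the annealing expression itself is not reproduced in this text.

```python
def anneal_beta(step, beta_start, beta_end, n_ts):
    """Linearly increase beta from beta_start to beta_end over n_ts steps, then hold."""
    frac = min(step / n_ts, 1.0)
    return beta_start + frac * (beta_end - beta_start)
```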

V. NUMERICAL EVALUATIONS
A. Simulation Scenario
We consider an uplink cell-free massive MIMO network with $M = 30$ single-antenna APs and $K = 10$ single-antenna UEs, all of which are uniformly distributed over a $500 \times 500$ m$^2$ area. The simulation parameters are listed in Table I. For both the actor and critic networks, we employ a fully connected DNN with two hidden layers having 64 neurons each. We use three benchmark schemes to evaluate the performance of our proposed framework: (a) Uniform DDPG: the vanilla DDPG algorithm utilizing uniform sampling, (b) Full power: each UE transmits with $\rho_{\max}$, and (c) Max-min: Problem (8) is solved as in [1]. Moreover, we test our system under different mobility scenarios, where we combine static and mobile users with dynamic UE activation patterns. In the mobile case, each user randomly selects a direction (left, right, up, or down) and a speed drawn uniformly from 0 to 1 m/s at each time step. For our proposed framework, DDPG + pER, we experimented with different values of the prioritization factor $\alpha$ to determine the most appropriate one for our problem. We achieved the best performance for $\alpha = 0.5$, and thus use this value in the sequel.
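The mobility model described above can be sketched as a single position update. The time-step duration `dt`, the clipping of positions to the simulation area, and the function name are assumptions for illustration.

```python
import random

DIRECTIONS = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "up": (0.0, 1.0),
    "down": (0.0, -1.0),
}


def move_user(x, y, rng, dt=1.0, area=500.0, v_max=1.0):
    """Move one user along a random axis direction at a uniformly drawn speed."""
    dx, dy = DIRECTIONS[rng.choice(sorted(DIRECTIONS))]
    speed = rng.uniform(0.0, v_max)  # speed in m/s
    # Keep the user inside the square area (assumed boundary handling).
    x_new = min(max(x + dx * speed * dt, 0.0), area)
    y_new = min(max(y + dy * speed * dt, 0.0), area)
    return x_new, y_new
```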

B. Fully Static / No UE Toggling
We first consider a fully static scenario, where the ON/OFF states of the non-mobile users stay constant throughout the learning process.Fig. 2(a) shows the evolution of the minimum user rate.The proposed scheme converges to 15 Mbps, while Uniform DDPG achieves the same rate 300 episodes later.Compared to Max-min that provides the upper bound, our framework behaves close to optimal, with a difference of only 0.9 Mbps or 5.66%.Note that Max-min requires the UE activation pattern to be known in advance for solving Problem (8) with traditional optimization techniques.However, in practice, this information is not available to the CPU, and is also hard to predict.We highlight that, by estimating the ON/OFF states and relying solely on the UE rate feedback, our system is able to adapt to the current state of the environment without this knowledge, as shown in Fig. 2. The Full power baseline performs worst due to the increased inter-user interference.
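As a quick check of the quoted gap, assume that the percentage is expressed relative to the Max-min rate and that the proposed scheme sits at exactly 15 Mbps, so that Max-min is at about 15.9 Mbps. That last value is inferred here from the two quoted numbers rather than stated in the text.

```latex
\[
\frac{R_{\text{Max-min}} - R_{\text{proposed}}}{R_{\text{Max-min}}}
= \frac{0.9}{15 + 0.9} = \frac{0.9}{15.9} \approx 0.0566 = 5.66\%.
\]
```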
The total transmit power consumption is depicted in Fig. 2(b).For both sampling configurations, we observe that power consumption reduces as training progresses.This suggests that the agent is able to recognize the ON/OFF patterns of UEs, allowing it to select better actions or power values as it further interacts with the environment.It is worth noting that power reduction is implicitly accounted for in the reward definition, as maximizing the guaranteed rate requires minimizing inter-user interference.With prioritized sampling, power consumption noticeably reaches that of Max-min faster compared to Uniform DDPG.The Full power benchmark consumes the most power.

C. Static Users With UE Toggling
We next consider the case of non-mobile UEs that switch from active to inactive mode (and vice versa) with $T_{\mathrm{tog}} = 500$ and $K_{\mathrm{tog}} = 0.1K$. The UE toggling then happens at episode 500, which explains the sudden "activity" around this point in Fig. 3. Afterwards, we observe that the prioritization helps the DRL system to recover from the environment change faster, with the agent already finding a solution at episode 600 in Fig. 3(a). In contrast, Uniform DDPG is only able to do so 200 episodes later. Compared to the fully static scenario, we achieve not only accelerated convergence, but also better performance (7.4% rate increase) with prioritized sampling. Similarly, our framework consumes less power than Uniform DDPG in Fig. 3(b), while performing close to Max-min.
We now allow the UE toggling to happen more frequently by setting $T_{\mathrm{tog}}$ to 350 in Fig. 4. When the first toggle happens at episode 350, the proposed scheme is able to quickly reach its newly converged value of 14.5 Mbps at episode 450 in Fig. 4(a). On the other hand, Uniform DDPG settles for a rate 11.72% lower only 100 episodes later. The second toggle occurs at episode 700, in which case the minimum user rate is expected to increase as indicated by the Max-min and Full power baselines. With prioritization, the agent is able to adapt to this environment change, characterized by the corresponding increase in the guaranteed rate. In contrast, Uniform DDPG likely needs more time to do so.
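The toggling schedule used in these experiments can be sketched as follows. With $K = 10$ and $K_{\mathrm{tog}} = 0.1K$, one UE flips its state at every multiple of $T_{\mathrm{tog}}$. Choosing the toggled UEs uniformly at random and the function name are assumptions.

```python
import random


def toggle_ues(states, episode, t_tog, k_tog, rng):
    """Flip the ON/OFF state of k_tog randomly chosen UEs every t_tog episodes."""
    if episode > 0 and episode % t_tog == 0:
        for k in rng.sample(range(len(states)), k_tog):
            states[k] = 1 - states[k]
    return states
```

For `t_tog = 350`, the states change at episodes 350 and 700 within a 1000-episode run, which matches the two toggles discussed for Fig. 4.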

D. Mobile Users With UE Toggling
We now consider the case of mobile UEs with $T_{\mathrm{tog}} = 500$ and $K_{\mathrm{tog}} = 0.1K$ in Fig. 5. After the toggle at episode 500, we observe that it now takes more time for both DDPG-based systems to reach convergence. Specifically, compared to the static scenario in Section V-C, this happens 100 episodes later for our proposed scheme, while Uniform DDPG has yet to do so even at episode 1000. In this case, the agent has to deal with the additional user mobility that impacts the power or action selection, on top of having to detect the environment change caused by the UE toggling. Nevertheless, we still benefit from the prioritized sampling, which achieves performance gains in terms of convergence speed, rate, and power consumption, while approaching the optimal performance of Max-min.

VI. CONCLUSION
We have proposed a novel DRL framework for power control in uplink cell-free massive MIMO, designed to handle device activation and deactivation combined with user mobility, without requiring any prior knowledge at the CPU. To ensure that the system can quickly adapt to the dynamics of a practical wireless environment, we have exploited a TD-error-based prioritization that accelerates the learning process. Numerical results have shown that the proposed algorithm achieves faster convergence and enhanced performance compared to the baseline schemes. As future work, we aim to investigate different variants of prioritized sampling, and to extend the proposed framework towards a multi-agent system for distributed learning.

Fig. 1. Agent-environment scenario of the DDPG-based framework with prioritized sampling.
A. DRL Components 1) State: The actual ON/OFF state of UE k at time t is denoted by d (t)

13: Sample a random mini-batch of $X$ experiences from $\mathcal{B}$ based on the calculated probabilities.
and consequently, obtains a reward that is the guaranteed rate that we aim to maximize.It then estimates the latest activation state of the UEs for the next time step, as described in Section IV-A.If UE k is estimated to be inactive, the corresponding elements b

TABLE I
SIMULATION PARAMETERS
Fig. 2. Performance comparison for the fully static scenario.