Deep Reinforcement Learning-Based Sum Rate Fairness Trade-Off for Cell-Free mMIMO

The uplink of a cell-free massive multiple-input multiple-output (mMIMO) system with maximum-ratio combining (MRC) and zero-forcing (ZF) receivers is investigated. A power allocation optimization problem is considered, in which two conflicting metrics, namely the sum rate and fairness, are jointly optimized. As there is no closed-form expression for the achievable rate in terms of the large-scale fading (LSF) components, the sum rate fairness trade-off optimization problem cannot be solved with known convex optimization methods. To alleviate this problem, we propose two new approaches. In the first approach, a use-and-then-forget scheme is utilized to derive a closed-form expression for the achievable rate, and the resulting sum rate fairness trade-off problem is then solved iteratively through the proposed sequential convex approximation (SCA) scheme. In the second approach, we exploit the LSF coefficients as inputs of a twin delayed deep deterministic policy gradient (TD3) agent, which efficiently solves the non-convex sum rate fairness trade-off optimization problem. Next, the complexity and convergence properties of the proposed schemes are analyzed. Numerical results demonstrate the superiority of the proposed approaches over conventional power control algorithms in terms of the sum rate and minimum user rate for both the ZF and MRC receivers. Moreover, the proposed TD3-based power control achieves better performance than both the proposed SCA-based approach and the fractional power scheme.

Next-generation mobile networks face increasing demands for much higher data rates. Cell-free massive multiple-input multiple-output (mMIMO) is a key enabling wireless network technology, as it greatly increases the coverage probability and the data rate [2]. In cell-free mMIMO, a large number of access points (APs), each equipped with a few antennas, are randomly distributed over the coverage area [2], [3]. We consider an uplink cell-free mMIMO system with error-free fronthaul links and two different linear detection techniques, namely maximum-ratio combining (MRC) and zero-forcing (ZF). This paper investigates the fundamental challenge of the sum rate fairness trade-off, which is a multi-objective optimization (MOO) problem in the cell-free mMIMO system. Following [4], a weighted-sum single-objective optimization (SOO) function can be exploited to represent the MOO functions. In order to overcome the intractability of the sum rate fairness trade-off MOO problem, an equivalent SOO problem is required [4]. On the other hand, the sum rate fairness trade-off problem needs to be solved for each coherence interval of the large-scale fading (LSF). Unfortunately, the achievable rate of the system does not have a closed-form expression in terms of the LSF coefficients. As a result, the sum rate fairness trade-off problem cannot be solved with known convex optimization methods. In this paper, we use the reinforcement learning (RL) technique to handle this non-convexity issue.

A. Motivation and Contribution
Recently, machine learning techniques have shown promising potential for various problems related to power control and allocation in wireless communication networks [5]. Some recent works in this area focus on data-driven supervised learning. In [6], the theoretical and practical aspects of deep neural network (DNN)-based power allocation are investigated, and, for the first time, a centralized supervised learning approach is proposed to train an accelerated DNN that directly optimizes the power allocation. The authors of [6] show that the supervised learning algorithm can achieve approximately 90% of the sum rate obtained by the weighted minimum mean square error (WMMSE) approach. Furthermore, in [7], a supervised learning algorithm is utilized to optimize the power allocation for sum rate maximization in the uplink of a cell-free mMIMO system. This is accomplished by designing a convolutional DNN that learns a mapping from a set of input data (LSF coefficients) to the optimal power using the quantized channel.
The main disadvantages of the aforementioned DNN-based optimization algorithms are as follows: (i) they require a massive dataset of desired outputs of the optimization problem for training the DNN, which takes an enormous amount of time to produce, since the computational complexity of solving the optimization problem is high and the training procedure itself is time-consuming; (ii) the dataset must be regenerated if there is even a slight change in the channel model or the optimization objective function; (iii) DNN-based algorithms are applicable only if there is an optimal or suboptimal solution to the resource allocation optimization problem, and the performance of the learned DNN is bounded by the target solution. These disadvantages pose severe limitations for practical implementation. Due to the above-mentioned bottlenecks of supervised learning-based optimization methods, deep reinforcement learning (DRL)-based optimization algorithms have been developed and have received widespread attention. In DRL-based power control, the agent (for instance, the user or the AP) observes the environment states (for instance, the channel state information (CSI) or the users' signal-to-interference-plus-noise ratios (SINRs)) and determines, through a trial-and-error strategy, which actions (for instance, power control coefficients) ultimately result in the best cumulative reward (for instance, sum rate or fairness index); in this way, a policy mapping states to actions is obtained. Furthermore, DRL aims to dynamically update the parameters to find a near-optimal decision policy, maximizing the system's long-term performance via regular interactions. DRL-based power control algorithms can be classified in many ways depending on their architecture. A comprehensive taxonomy classifies them according to the type and number of agents, the type of objective, and the definition of the Markov decision process (MDP) model involved in the RL algorithm. In the first category, approaches are distinguished by the number of agents. In [8], [9], [10], [11], the power allocation problem is modeled based on single-agent DRL. In [12], a distributed dynamic power allocation algorithm is modeled based on multi-agent DRL. Similarly, the works in [13], [14], [15] investigate multi-agent DRL for power allocation problems. The type of RL method also falls in this category: for instance, [9] uses Q-learning, [16], [17] use a deep Q-network (DQN) agent, and [14], [15], [18] use deep deterministic policy gradient (DDPG) for solving the power allocation problem.
In the second category, approaches are classified by the type of objective, which can be of three kinds: sum rate maximization [15], [16], [19], energy saving [12], and energy efficiency [14], [20]. In the third category, techniques are classified by the definition of the MDP model. The MDP model is crucial to the success of DRL-based optimization algorithms and helps the agent learn to take more beneficial actions. An MDP consists of states, actions, and a reward function. In [21], the state of the MDP model includes the power-coefficient values, the users' data rates, and vectors indicating which of the power coefficients can be increased or decreased. In [14], the channel conditions and the transmission power are considered as the state and action, respectively, and both have continuous spaces. In [20], the state is defined as the user data rate in each time step, and the action is defined as the sets of user associations and power allocations. In [22], [23], the users' SINRs are considered as the state space, and the beamforming matrix as the action space. In [11], the state space consists of the current reference signal received power and the last transmitted power, which reduces link overhead and improves system performance. In contrast, the scheme proposed in this paper considers a single twin delayed deep deterministic policy gradient (TD3) agent. In addition, compared to the third-category methods, the state space of the proposed algorithm is defined by the users' SINRs, the users' transmitted powers, and the objective function gradient. In summary, we propose a sum rate fairness trade-off optimization problem considering the per-user power constraints, the channel estimation error, and the pilot contamination effect. The original MOO problem is then recast as an SOO problem by exploiting the weighted-sum technique. Given the requirements of various use cases, the mobile operator assigns different weights to the conflicting objectives, i.e., sum rate and fairness; hence, we jointly optimize these conflicting objectives. To the best of our knowledge, this design is entirely novel, with much improved performance, as evidenced by the numerical results. It has also been shown that the proposed DRL-based power control has comparatively low computational complexity, as it reduces to scalar multiplications. Furthermore, the TD3 agent trains and converges faster than the DDPG agent. The contributions of the paper are summarized as follows: 1) To overcome the non-convexity of the SOO problem, we propose a TD3-based power control approach for power allocation that exploits only the LSF coefficients as inputs.
We utilize the TD3 algorithm to guide the CPU to dynamically adjust the parameters to maximize the sum rate fairness trade-off objective function in the cell-free mMIMO system. 2) We utilize the use-and-then-forget (UatF) bounding technique to derive a closed-form expression for the achievable rate of the system. With the UatF technique, the resulting SOO problem is still non-convex; therefore, we propose a sequential convex approximation (SCA) approach to overcome the non-convexity issue. 3) Finally, the computational complexity and convergence of the proposed approaches are analyzed. There are three main differences between the proposed DRL-based resource allocation algorithm and state-of-the-art DRL-based resource allocation techniques, which can be summarized as follows: (i) In existing works [15], [16], [20], DRL agents were utilized to solve the optimization problem by finding the map between the CSI and the desired power coefficients. In the proposed DRL approach, by contrast, we seek to learn an optimized solution procedure itself, similar to the scheme presented in [24]. This concept of solving optimization problems in wireless communication networks is the first of its kind and can yield a better algorithm in a dynamic system; (ii) The works presented in [11], [20], [22] use the SINRs of the users as the states in the RL model, where the SINR varies as a function of the instantaneous CSI matrix. We instead propose to include the users' SINRs (which vary as a function of the LSF), the users' transmitted powers, and the objective function gradient in the definition of the states. The philosophy behind our state-space definition is that, according to the Markov property, the current state should include all information such that the future state is independent of the previous states given the present state; in other words, the future state is estimated from the current state alone, without the historical sequence of states. Moreover, we define the state space such that it does not change with the small-scale fading (SSF), which makes the algorithm more suitable for practical implementation in fast-varying channels; (iii) In [23], [25], a DDPG-based agent was exploited to determine the optimal policy. However, it has been shown that the stability of the DDPG algorithm can be further enhanced [26]. In light of this finding, we propose to use the TD3 agent to learn the power control policy. The simulation results confirm that, compared with the DDPG algorithm, our proposed TD3-based algorithm effectively enhances convergence. Furthermore, as the hyperparameters can impact the convergence of the proposed algorithm, a series of parameter optimizations is performed to improve the algorithm's performance.
Outline: The rest of the paper is organized as follows. Section II provides the system model and the performance analysis. The sum rate fairness trade-off framework is studied in Section III. The proposed SCA-based power control scheme is discussed in Section IV, while Sections V and VI introduce the proposed DRL model and the proposed DRL-based power control scheme, respectively. The complexity analysis is presented in Section VII. Numerical results and discussion are presented in Section VIII, and finally Section IX concludes the paper.

II. SYSTEM MODEL
We consider a cell-free mMIMO system where M APs, each equipped with N antennas, simultaneously serve K single-antenna users in the same time-frequency resource. All APs and users are randomly distributed over a large service area. Moreover, the APs connect to a central processing unit (CPU) via fronthaul links. The N-dimensional channel vector between the kth user and the mth AP is defined as
$$\mathbf{g}_{mk} = \sqrt{\beta_{mk}}\,\mathbf{h}_{mk},$$
where $\beta_{mk}$ denotes the LSF and the elements of $\mathbf{h}_{mk}$ are independent and identically distributed (i.i.d.) $\mathcal{CN}(0,1)$ random variables, representing the SSF [2]. During the pilot training phase, all K users simultaneously send pilot sequences to the APs. Let $\sqrt{\tau_p}\,\boldsymbol{\varphi}_k \in \mathbb{C}^{\tau_p \times 1}$ be the pilot sequence transmitted by the kth user, where $\|\boldsymbol{\varphi}_k\| = 1$, for $k = 1, 2, \ldots, K$. Upon performing the de-spreading operation, the minimum mean square error (MMSE) estimate of the channel between the kth user and the mth AP is given by [2]
$$\hat{\mathbf{g}}_{mk} = c_{mk}\left(\sqrt{\tau_p p_p}\sum_{k'=1}^{K}\mathbf{g}_{mk'}\boldsymbol{\varphi}_{k'}^H\boldsymbol{\varphi}_k + \mathbf{W}_{p,m}\boldsymbol{\varphi}_k\right),$$
where
$$c_{mk} = \frac{\sqrt{\tau_p p_p}\,\beta_{mk}}{\tau_p p_p \sum_{k'=1}^{K}\beta_{mk'}\left|\boldsymbol{\varphi}_{k'}^H\boldsymbol{\varphi}_k\right|^2 + 1},$$
$\mathbf{W}_{p,m} \in \mathbb{C}^{N \times \tau_p}$ denotes the noise at the mth AP, whose elements are i.i.d. $\mathcal{CN}(0,1)$, and $p_p$ represents the normalized signal-to-noise ratio (SNR) of each pilot symbol. Note that $\tilde{\mathbf{g}}_{mk} = \mathbf{g}_{mk} - \hat{\mathbf{g}}_{mk}$ is the channel estimation error; its elements are i.i.d. $\mathcal{CN}(0, \beta_{mk} - \gamma_{mk})$ random variables, independent of $\hat{\mathbf{g}}_{mk}$, where $\gamma_{mk} = \sqrt{\tau_p p_p}\,\beta_{mk} c_{mk}$. Let the transmitted signal from the kth user be $x_k = \sqrt{q_k}\,s_k$, where $s_k$ (with $\mathbb{E}\{|s_k|^2\} = 1$) and $q_k$ denote the transmitted symbol and the transmit power of the kth user, respectively. Then the signal received at AP m is given by
$$\mathbf{y}_m = \sqrt{\rho}\sum_{k=1}^{K}\sqrt{q_k}\,\mathbf{g}_{mk}s_k + \mathbf{n}_m \in \mathbb{C}^{N \times 1},$$
where $\mathbf{n}_m \in \mathbb{C}^{N \times 1}$, whose elements are i.i.d. $\mathcal{CN}(0,1)$, is the noise at AP m, and $\rho q_k$ is the normalized uplink SNR corresponding to the kth user.
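For concreteness, the channel model and MMSE estimation step above can be sketched numerically as follows. This is a minimal sketch, assuming uncorrelated Rayleigh fading; the function name mmse_channel_estimates and the input conventions (beta as an M x K LSF matrix, phi as a tau_p x K pilot matrix with unit-norm columns) are ours, chosen for illustration.

```python
import numpy as np

# Minimal sketch of the channel model and MMSE channel estimation above,
# assuming uncorrelated Rayleigh fading. beta: M x K LSF matrix; phi:
# tau_p x K pilot matrix with unit-norm columns; p_p: normalized pilot SNR.
def mmse_channel_estimates(beta, phi, p_p, N=2, rng=np.random.default_rng(0)):
    M, K = beta.shape
    tau_p = phi.shape[0]
    crandn = lambda *s: (rng.standard_normal(s) + 1j * rng.standard_normal(s)) / np.sqrt(2)
    g = np.sqrt(beta)[:, None, :] * crandn(M, N, K)        # g_mk = sqrt(beta_mk) h_mk
    xcorr = np.abs(phi.conj().T @ phi) ** 2                # |phi_k'^H phi_k|^2, K x K
    c = np.sqrt(tau_p * p_p) * beta / (tau_p * p_p * beta @ xcorr + 1.0)  # c_mk
    g_hat = np.empty_like(g)
    for m in range(M):
        # pilot observation Y_p,m = sqrt(tau_p p_p) sum_k g_mk phi_k^H + W_p,m
        Yp = np.sqrt(tau_p * p_p) * g[m] @ phi.conj().T + crandn(N, tau_p)
        g_hat[m] = (Yp @ phi) * c[m][None, :]              # de-spread, then scale by c_mk
    return g, g_hat
```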

A. Performance Analysis
In this section, we summarize the achievable rate of the linear receiver, based on the analysis in [27]. Let $\mathbf{V} \in \mathbb{C}^{MN \times K}$ be the linear detector matrix depending on the side information at the receiver, $\hat{\mathbf{g}}_{mk}, \forall m, k$.
The vector $\mathbf{v}_k = [\mathbf{v}_{1k}^T, \ldots, \mathbf{v}_{Mk}^T]^T$ refers to the kth column of the detector matrix $\mathbf{V}$, where $\mathbf{v}_{mk} \in \mathbb{C}^N$. At the CPU, the received uplink data signals at the serving APs are combined to compute an estimate $\hat{s}_k$ of the signal $s_k$ transmitted by user k. This is obtained by adding the inner products of the combining vectors $\mathbf{v}_{mk}$ and $\mathbf{y}_m$, which yields
$$\hat{s}_k = \sum_{m=1}^{M}\mathbf{v}_{mk}^H\mathbf{y}_m = \underbrace{\sqrt{\rho q_k}\,\mathbf{v}_k^H\hat{\mathbf{g}}_k s_k}_{\mathrm{DS}_k} + \underbrace{\sum_{k'\neq k}\sqrt{\rho q_{k'}}\,\mathbf{v}_k^H\hat{\mathbf{g}}_{k'}s_{k'}}_{\mathrm{IUI}_{kk'}} + \underbrace{\sum_{k'=1}^{K}\sqrt{\rho q_{k'}}\,\mathbf{v}_k^H\tilde{\mathbf{g}}_{k'}s_{k'}}_{\mathrm{TEE}_{kk'}} + \underbrace{\mathbf{v}_k^H\mathbf{n}}_{\mathrm{TN}_k}, \quad (4)$$
where $\mathbf{g}_k$, $\hat{\mathbf{g}}_k$, and $\tilde{\mathbf{g}}_k$ stack the corresponding per-AP vectors and $\mathbf{n}$ stacks the noise vectors $\mathbf{n}_m$. Here, $\mathrm{DS}_k$, $\mathrm{IUI}_{kk'}$, and $\mathrm{TEE}_{kk'}$ represent the desired signal (DS), the inter-user interference over the estimated channel, and the total estimation error (TEE), respectively. Moreover, $\mathrm{TN}_k$ accounts for the total noise (TN).

B. Achievable Rate with the Estimated Channel as Side Information at the CPU
When the CPU has access to the estimated channels $\hat{\mathbf{G}} = [\hat{\mathbf{g}}_1, \ldots, \hat{\mathbf{g}}_K]$ as side information, the instantaneous effective SINR of cell-free mMIMO is given by (5) (shown at the bottom of the page), where the expectation $\mathbb{E}$ is taken over the noise and the channel estimation error. Then the achievable rate can be obtained as [28, Sec. 2.3.5]
$$R_k = \mathbb{E}_{\mathrm{SSF}}\left\{\log_2\left(1 + \mathrm{SINR}_k\right)\right\}, \quad (6)$$
where $\mathbb{E}_{\mathrm{SSF}}$ is the expectation with respect to the channel estimates, taken over the SSF coefficients. Due to the expectation operation in front of the logarithm in (6), it is not possible to obtain the achievable rate in closed form. In the following, we compute the achievable rate for the ZF and MRC receivers, respectively. A simple choice for the AP is the MRC receiver, which aims to maximize each user's signal-to-noise ratio while disregarding the interference produced by other users. The combining matrix for the MRC receiver is $\mathbf{V} = \hat{\mathbf{G}}$, and $\mathrm{SINR}_k^{\mathrm{MRC}}$ is given by (7) (shown at the bottom of the page). On the other hand, the ZF receiver seeks to eliminate the mutual interference using full CSI at the CPU [29]. The ZF decoding matrix is $\mathbf{V} = \hat{\mathbf{G}}\left(\hat{\mathbf{G}}^H\hat{\mathbf{G}}\right)^{-1}$. Then, the instantaneous SINR in (5) simplifies to (9). By substituting (9) and (7) into (6), the achievable rate for the ZF and MRC receivers is obtained.
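Because the expectation in (6) sits outside the logarithm, the ergodic rate is typically evaluated by Monte Carlo averaging. A minimal sketch, where sinr_fn stands in for the receiver-specific SINR expression ((7) or (9)) and draw_channels is a hypothetical generator of SSF realizations:

```python
import numpy as np

# Sample-average approximation of (6): average log2(1 + SINR) over many
# SSF realizations. sinr_fn and draw_channels are placeholders for the
# receiver-specific SINR ((7) or (9)) and the channel generator.
def ergodic_rate(sinr_fn, draw_channels, n_realizations=1000):
    rates = [np.log2(1.0 + sinr_fn(*draw_channels())) for _ in range(n_realizations)]
    return np.mean(rates, axis=0)  # per-user achievable rates (bits/s/Hz)
```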

C. Achievable Rate With Use-and-then-Forget CSI
Deriving the achievable rate defined in (6) is challenging because of the expectation outside the logarithm and the lack of access to the channel estimates at the CPU. An alternative achievable rate expression is obtained from the UatF bounding technique, which results in closed-form expressions and is widely used in massive MIMO [30, Theorem 4.4] and in cell-free massive MIMO [31, Theorem 5.2]. This achievable rate expression depends only on the LSF coefficients and can be obtained under the assumption that the CPU uses only statistical knowledge of the channels between the users and the APs when performing the detection [2]. Then, by adding and subtracting the term $\sqrt{\rho q_k}\,\mathbb{E}\{\mathbf{v}_k^H\mathbf{g}_k\}s_k$, the received signal in (4) can be rewritten as
$$\hat{s}_k = \underbrace{\sqrt{\rho q_k}\,\mathbb{E}\{\mathbf{v}_k^H\mathbf{g}_k\}s_k}_{\mathrm{DS}_k^{\mathrm{UatF}}} + \underbrace{\sqrt{\rho q_k}\left(\mathbf{v}_k^H\mathbf{g}_k - \mathbb{E}\{\mathbf{v}_k^H\mathbf{g}_k\}\right)s_k}_{\mathrm{BU}_k^{\mathrm{UatF}}} + \underbrace{\sum_{k'\neq k}\sqrt{\rho q_{k'}}\,\mathbf{v}_k^H\mathbf{g}_{k'}s_{k'}}_{\mathrm{IUI}_{kk'}^{\mathrm{UatF}}} + \mathrm{TN}_k, \quad (10)$$
where $\mathrm{DS}_k^{\mathrm{UatF}}$, $\mathrm{BU}_k^{\mathrm{UatF}}$, and $\mathrm{IUI}_{kk'}^{\mathrm{UatF}}$ represent the strength of the desired signal (which is deterministic), the beamforming gain uncertainty, and the inter-user interference, respectively. Hence, (10) describes a deterministic effective channel with the additive interference-plus-noise term
$$\mathrm{IpN}_k = \mathrm{BU}_k^{\mathrm{UatF}} + \mathrm{IUI}_{kk'}^{\mathrm{UatF}} + \mathrm{TN}_k.$$
Since $s_k$ for $k = 1, \ldots, K$ and $\mathbf{n}_m$ for $m = 1, \ldots, M$ have zero mean, $\mathrm{IpN}_k$ has zero mean. Furthermore, noting that different users' signals are independent and the noise at different APs is independent, the terms in (10) are mutually uncorrelated.
Additionally, although $\mathrm{IpN}_k$ includes the desired signal $s_k$, it is uncorrelated with it, since $\mathrm{BU}_k^{\mathrm{UatF}}$ has zero mean. Therefore, [31, Lemma 3.3] can be applied to obtain the achievable rate
$$R_k^{\mathrm{UatF}} = \log_2\left(1 + \mathrm{SINR}_k^{\mathrm{UatF}}\right), \quad (13)$$
where $\mathrm{SINR}_k^{\mathrm{UatF}}$ can be interpreted as an effective SINR [7] and is defined in (14) (shown at the bottom of the page), with the expectation $\mathbb{E}$ taken with respect to all sources of randomness. Notably, there is no expectation in front of the logarithm, since the UatF bounding technique treats the effective channel as deterministic.
Theorem 1: An effective SINR of the kth user in the cell-free massive MIMO system with the ZF decoding matrix, for any K, M, and N, is given by (15), shown at the bottom of the page, where the expectation is approximated by the sample average over a large number of random realizations.
Proof: See Appendix A.
Theorem 2: An effective SINR of the kth user in the cell-free massive MIMO system with the MRC decoding matrix ($\mathbf{V} = \hat{\mathbf{G}}$), obtained via the UatF bounding technique, for any K, M, and N, is given by (18), shown at the bottom of the page.
Proof: See Appendix B.

III. SUM RATE FAIRNESS TRADE-OFF FRAMEWORK
The CPU should be able to intelligently assess whether to optimize the sum rate or the user fairness, or to strike a fair compromise between them [32]. In this section, we first define the system's fairness index (FI), which quantifies the fairness among the users' achievable rates. Next, a new optimization problem that simultaneously considers the sum rate and the user fairness is formulated.

A. Fairness Index (FI)
Similar to the methodology in [4], we exploit the FI to compute the fairness among the users with respect to their achievable rates. The system FI is defined as
$$\mathrm{FI} = \frac{\left(\sum_{k=1}^{K} R_k\right)^2}{K\sum_{k=1}^{K} R_k^2}. \quad (19)$$
Note that the best fairness is attained when the data rates of all users are equal, in which case the FI equals one. Furthermore, the FI and the sum rate are conflicting performance metrics: increasing the sum rate can decrease the FI and vice versa, especially when the cell-free system serves users with considerably varied channel strengths.
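A minimal implementation of this index (Jain's fairness index) over the per-user rates; the function name is ours:

```python
import numpy as np

# Jain's fairness index over per-user rates: equals 1 when all rates are
# equal and approaches 1/K when a single user dominates.
def fairness_index(rates):
    rates = np.asarray(rates, dtype=float)
    return rates.sum() ** 2 / (len(rates) * (rates ** 2).sum())

print(fairness_index([2.0, 2.0, 2.0]))  # 1.0
print(fairness_index([6.0, 0.0, 0.0]))  # ~0.33
```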

B. Optimization Problem Formulation
In this paper, we seek to design the power coefficients of the users by simultaneously considering both conflicting performance metrics, i.e., the sum rate and the FI, in the cell-free mMIMO system. Similar to the methodology in [4], the sum rate fairness trade-off problem can be formulated as the following multi-objective optimization (MOO) framework:
$$\mathrm{P}_1: \quad \max_{\mathbf{q}}\;\; \mathbf{f}(\mathbf{q}) = \left[f_1(\mathbf{q}),\, f_2(\mathbf{q})\right] \quad \text{s.t.} \quad 0 \leq q_k \leq p_{\max}^{(k)},\; \forall k,$$
where $p_{\max}^{(k)}$ is the maximum transmit power available at the kth user. Moreover, $\mathbf{f}$ denotes the vector containing both objective functions, $f_1(\mathbf{q}) = \sum_{k=1}^{K} R_k$ and $f_2(\mathbf{q}) = \mathrm{FI}$. For an MOO problem with conflicting objective functions, there is no single globally optimal solution that jointly maximizes $f_1(\mathbf{q})$ and $f_2(\mathbf{q})$. Pareto-based methods and scalarization are two basic MOO approaches that do not require complicated mathematical machinery. We utilize the weighted-sum approach as a scalarization method to reformulate the MOO Problem $\mathrm{P}_1$ into a tractable SOO problem as follows:
$$\mathrm{P}_2: \quad \max_{\mathbf{q}}\;\; f_{\mathrm{SOO}}(\mathbf{q}) = \omega_1 \bar{f}_1(\mathbf{q}) + \omega_2 \bar{f}_2(\mathbf{q}) \quad \text{s.t.} \quad 0 \leq q_k \leq p_{\max}^{(k)},\; \forall k, \quad (21)$$
where $\bar{f}_i(\mathbf{q}) = f_i(\mathbf{q})/\max(f_i(\mathbf{q}))$ is the normalized objective function and $\max(f_i(\mathbf{q}))$ denotes the maximum value of $f_i(\mathbf{q})$, for $i = \{1, 2\}$. In addition, $\omega_1 + \omega_2 = 1$, where $\omega_1$ and $\omega_2$ indicate the significance of the performance metrics $f_1(\mathbf{q})$ and $f_2(\mathbf{q})$ in the initial MOO problem. Theoretically, $q_k, \forall k$ should be optimally computed for each channel realization (SSF coefficients) to obtain the most accurate solution to Problem $\mathrm{P}_2$. However, SSF-based power control approaches are not practicable in realistic systems [31, Sec. 7.1.1]; consequently, LSF-based power control should be considered. This article proposes two completely different approaches for solving Problem $\mathrm{P}_2$ based on the LSF components, which are explained in the following sections.
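The scalarized objective of Problem P2 can be written compactly as below; a sketch, assuming the normalization constants f1_max and f2_max (the individual maxima of each objective) are available, e.g., from solving each single-objective problem separately:

```python
# Weighted-sum scalarization of Problem P1 into P2, with omega_1 = 1 - omega
# and omega_2 = omega as used in Section IV. f1/f2 map a power vector q to
# the sum rate and the fairness index; f1_max/f2_max normalize each objective.
def f_soo(q, f1, f2, f1_max, f2_max, omega):
    return (1.0 - omega) * f1(q) / f1_max + omega * f2(q) / f2_max
```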

IV. PROPOSED SCA-BASED POWER CONTROL SCHEME
In the first approach, we utilize the UatF bounding technique to obtain the achievable rate. Recall that the achievable rate is obtained in (13) together with (15) and (18); therefore, the power control problem defined in Problem $\mathrm{P}_2$ is only a function of the LSF coefficients. However, Problem $\mathrm{P}_2$ is not convex; in particular, the objective function in (21a) is non-convex. To overcome this non-convexity, we adopt an SCA approach that tackles the problem iteratively by solving a convex problem at each iteration. This iterative approach is guaranteed to converge, since the objective function is upper bounded and improves at each iteration. In this regard, Problem $\mathrm{P}_2$ can be rewritten as the optimization problem $\mathrm{P}_3$ in (22), where $\nu$, $\nu_1$, and $\nu_2$ are slack variables, $\omega_1 = 1 - \omega$, and $\omega_2 = \omega$. Problem $\mathrm{P}_3$ is clearly not convex due to its non-convex constraints; hence, it cannot be directly solved with existing convex optimization software. Therefore, we propose to approximate the non-convex constraints with convex ones, which enables us to solve the optimization problem iteratively. The constraint (22c) can be reformulated as (23), where $\mu_k, \forall k$ are new slack variables. Using (13) and (15), $R_k$ in the constraint (23b) can be rewritten as (24). Then, constraint (24) can be reformulated as (26), where $\zeta_k, \forall k$ are new slack variables. The constraint (26a) can be written as (27). By defining the slack variable $\tilde{q}_k = \sqrt{q_k}, \forall k$, the constraint (27a) can be written as the second-order cone (SOC) constraint (28), whose coefficients depend on $\mathbb{E}\{\|\mathbf{v}_{mk}\|^2\}$. Next, using a first-order Taylor approximation [33], the constraint (27b) is approximated by the linear inequality constraint (29), where $\zeta_k^{(i-1)}$ and $\varsigma_k^{(i-1)}$ refer to the approximations of $\zeta_k$ and $\varsigma_k$ at iteration $(i-1)$, respectively. Next, using (19), the constraint (22d) is equivalent to (30), which can be approximated as (31), where two additional slack variables are introduced. Exploiting the Taylor series approximation, the non-convexity in (31a) can be tackled as in (32), and (31b) is reformulated as the SOC constraint (33). Finally, Problem $\mathrm{P}_3$ defined in (22) is rewritten as Problem $\mathrm{P}_4$ in (34), where $\Psi \triangleq \{q_k, \nu, \nu_1, \nu_2, \mu_k, \zeta_k, \varsigma_k, \cdot, \cdot\}_{k=1}^{K}$ includes all the optimization parameters (the last two entries being the slack variables introduced in (31)). Note that solving Problem $\mathrm{P}_4$ requires an initialization set $\Psi^{(0)}$, which can be obtained by determining a feasible power control vector $\mathbf{q}$ that satisfies all constraints in Problem $\mathrm{P}_4$; all necessary slack variables in Problem $\mathrm{P}_4$ are then chosen based on this initial power control $\mathbf{q}$. Finally, Problem $\mathrm{P}_4$ is solved iteratively until the required accuracy is reached, i.e., $|\nu^{(i)} - \nu^{(i-1)}| \leq \varepsilon$, where $\varepsilon$ is a predetermined threshold. The proposed procedure is summarized in Algorithm 1 (Proposed algorithm to solve Problem $\mathrm{P}_4$).
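The iteration pattern of Algorithm 1 (linearize the non-convex parts at the previous iterate, solve the resulting convex problem, and stop when $|\nu^{(i)} - \nu^{(i-1)}| \leq \varepsilon$) can be illustrated on a toy problem. The sketch below uses cvxpy and a two-variable example in place of the actual constraints (22)-(33):

```python
import cvxpy as cp
import numpy as np

# Toy SCA loop illustrating Algorithm 1's structure: maximize q1*q2 subject
# to q1 + q2 <= 2, q >= 0 (optimum q1 = q2 = 1). Writing q1*q2 =
# ((q1+q2)^2 - q1^2 - q2^2)/2 and replacing the convex term (q1+q2)^2 by its
# first-order Taylor lower bound at the previous iterate gives a concave
# surrogate, tight at the iterate, that is maximized at each iteration.
q = cp.Variable(2, nonneg=True)
q_prev = np.array([0.2, 1.5])            # feasible initialization
prev_val, eps = -np.inf, 1e-6
for i in range(50):
    s = q_prev.sum()
    surrogate = (2 * s * cp.sum(q) - s ** 2 - cp.sum_squares(q)) / 2
    prob = cp.Problem(cp.Maximize(surrogate), [cp.sum(q) <= 2])
    prob.solve()
    if prob.value - prev_val <= eps:     # stop when |nu_i - nu_{i-1}| <= eps
        break
    prev_val, q_prev = prob.value, q.value
print(q_prev)                            # ~[1.0, 1.0]
```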
Remark 1: We refer to the solution obtained by solving Problem $\mathrm{P}_4$ as the SCA-based power control.

A. Convergence Analysis
The convergence analysis of Algorithm 1 follows the standard arguments for the general SCA framework [34]. Specifically, Algorithm 1 is guaranteed to have the following properties.
1) A feasible solution to Problem $\mathrm{P}_{4,i}$ is also feasible to $\mathrm{P}_3$, where $\mathrm{P}_{4,i}$ denotes Problem $\mathrm{P}_4$ at the ith iteration. 2) An optimal solution to $\mathrm{P}_{4,i}$ is also feasible to $\mathrm{P}_{4,i+1}$.
3) Algorithm 1 produces a monotonically increasing sequence of objective values. To verify these claims, suppose $\Psi^{(i)}$ is feasible to $\mathrm{P}_{4,i}$. Constraints (22b) and (22e) are identical in both problems. Since the first-order approximations employed in (32) and (29) are greater than or equal to the actual values, combining the constraints in (33), (32), and (31c) yields the constraint (22d), and combining the constraints in (28), (29), and (26b) yields the constraint (22c). This implies that $\Psi^{(i)}$ is also feasible to $\mathrm{P}_3$. Further, as $\Psi^{(i)}$ is an optimal solution to $\mathrm{P}_{4,i}$, it is obviously feasible to $\mathrm{P}_{4,i+1}$. Clearly, the optimal value of an optimization problem is always greater than or equal to the objective value of any feasible solution, which proves the monotonic increase of the objective returned by Algorithm 1. In addition, the power constraint puts an upper bound on the objective. As a result, the objective of Algorithm 1 is guaranteed to converge. In our simulations, Algorithm 1 converges rapidly, after around 10 iterations.

V. PROPOSED DRL MODEL

A. Markov Decision Process
The MDP model consists of the tuple $(\mathcal{S}, \mathcal{A}, t, P_a, r_a, \eta)$, where $\mathcal{S}$ and $\mathcal{A}$ are the set of possible states ($s_t$) and the set of possible actions ($a$), called the state space and the action space, respectively. In addition, $t$ denotes the decision time points (or time steps), $P_{a,s_t,s_{t+1}}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$ is the probability that action $a$ in state $s_t$ leads to state $s_{t+1}$, $r_{a,s_t,s_{t+1}}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ is the expected immediate reward received at time $t$ after taking action $a$ and transitioning from state $s_t$ to state $s_{t+1}$, and $\eta$ is the discount factor [35].

B. Deep Deterministic Policy Gradient
DDPG, which is based on the actor-critic method, was developed to extend the stochastic policy gradient method to problems with continuous action spaces [36]. The replay buffer ($\mathcal{R}$) and target-network strategies are the two main concepts used in DDPG to improve learning stability. The structure of DDPG consists of an actor DNN $\mu(s, \theta^\mu)$ and a critic DNN $Q(s, a, \theta^Q)$.

C. Twin Delayed Deep Deterministic Policy Gradient
DDPG heavily depends on finding accurate hyperparameters, which can cause instability in the critic training. Furthermore, it has been shown that the overestimation bias and the accumulation of error known from temporal-difference methods persist in the DDPG setting. TD3 addresses these issues by introducing clipped double-Q learning, delayed policy updates, and target policy smoothing in its structure, and dramatically outperforms the previous state-of-the-art methods [26]. TD3's structure is depicted in Fig. 2. TD3 uses two similar critic DNNs, and their minimum is used to approximate the target Q-value. In the following, $\theta_1$ and $\theta_2$ represent the parameters of the two critic DNNs, whereas $\theta'_1$ and $\theta'_2$ represent the parameters of the corresponding target DNNs. As the cost function for training the ith critic DNN, the following error function is minimized:
$$L(\theta_i) = \mathbb{E}\left[\left(y^{\mathrm{target}} - Q_{\theta_i}(s_t, a_t)\right)^2\right], \quad (35)$$
where $y^{\mathrm{target}} = r_{a,s_t,s_{t+1}} + \eta \min_{i=1,2} Q_{\theta'_i}(s_{t+1}, \tilde{a})$ is the modified target value and $\tilde{a} = \mu_{\phi'}(s_{t+1}) + \epsilon$ is the target action, where $\epsilon$ is clipped normally distributed noise that keeps the target action close to the original action. The value function from the critic and the mini-batch from the memory are the inputs of the actor DNN, and its output is the action.
Specifically, the final deterministic and continuous action is determined by $a_t = \mu_\phi(s_t) + \epsilon$, where $\phi$ is the parameter of the actor DNN and $\epsilon$ is additive normally distributed noise used for exploration. The parameter $\phi$ is updated by minimizing the cost function
$$L_\mu(\phi) = -\mathbb{E}_{s \sim d_\mu}\left[Q_{\theta_1}\left(s, \mu_\phi(s)\right)\right],$$
where $d_\mu(s)$ is the state distribution. It is worth noting that the gradient of $L_\mu(\phi)$ is employed to update the parameter $\phi$, which is the reason the method is called a policy gradient. The common target value in (35) is utilized to update the two critic DNNs based on the two outputs of the two critic target DNNs. In order to prevent premature convergence, TD3 updates the actor DNN and the three target DNNs every $d$ steps. It should be noted that the policy $\mu_\phi(s_t)$ is only updated with respect to $Q_{\theta_1}(s, \mu_\phi(s_t))$. The critic target DNNs' parameters and the actor target DNN's parameters are updated according to $\theta' = \tau\theta + (1-\tau)\theta'$ every $d$ steps, which not only keeps the temporal-difference error small but also lets the target DNNs' parameters change gradually. Since RL agents are trained from samples, having valuable samples is critical, meaning that the action-state pairs should boost the action-value function. On the other hand, because many episodes finish without achieving the goal, our MDP generates many useless samples (i.e., the reward is sparse). In this study, hindsight experience replay (HER) is used to increase sampling efficiency. For an episode $e = [(s_0, a_0), (s_1, a_1), \ldots, (s_T, a_T)]$ in the memory whose final state $s_T$ is not the goal state, HER resets $s_T$ as $s^{\mathrm{goal}}$. This implies that, although the episode does not finish in the goal state, it becomes an episode that ends at the goal after HER modifies it.
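For clarity, the clipped double-Q target with target policy smoothing can be sketched as follows (PyTorch). The module names actor_target, critic1_target, and critic2_target are ours, and the action clipping range follows the (-1, 1) convention used later in Section VIII:

```python
import torch

# Sketch of the TD3 target in (35): clipped double-Q with target policy
# smoothing. actor_target / critic*_target are illustrative torch modules;
# eta, sigma, c are the discount factor, smoothing noise std, and noise clip.
def td3_target(r, s_next, actor_target, critic1_target, critic2_target,
               eta=0.9, sigma=0.2, c=0.5):
    with torch.no_grad():
        a = actor_target(s_next)
        noise = (torch.randn_like(a) * sigma).clamp(-c, c)   # clipped normal noise
        a_tilde = (a + noise).clamp(-1.0, 1.0)               # smoothed target action
        y = r + eta * torch.min(critic1_target(s_next, a_tilde),
                                critic2_target(s_next, a_tilde))
    return y
```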

VI. PROPOSED DRL-BASED POWER CONTROL SCHEME
In the second approach, we seek to solve Problem $\mathrm{P}_2$ by considering the achievable rate obtained from (6), which renders Problem $\mathrm{P}_2$ a function of the SSF. Since there is no closed-form expression for the achievable rate, Problem $\mathrm{P}_2$ cannot be solved with known convex optimization methods. On the other hand, the LSF varies much more slowly than the SSF, which makes it possible to find a mapping between the optimum power control coefficients and the LSF coefficients; however, this is only possible through machine learning approaches [7]. In this section, a DRL-based algorithm is developed in which the power control policy is optimized in a trial-and-error manner, considering only the LSF coefficients as inputs. In this case, Problem $\mathrm{P}_2$ is modeled as an RL task consisting of an agent and an environment that interact with each other. Fig. 2 shows the complete design of the proposed scheme in detail: the power control algorithm is the agent, and the entire cell-free mMIMO system is the environment. The action is the step vector used to update the current transmit power of each user, and it is computed at the CPU based on the state; in other words, in each training step, the CPU computes the power change for each individual user based on the state information and the power allocation policy. The agent receives a reward for its action in the next iteration, and the environment transitions into a new state. In each time step of the training phase, the TD3 agent updates the actor and critic weights using a mini-batch of experiences, randomly sampled from the replay buffer. A mini-batch stochastic gradient descent algorithm is used to train the parameters, and past experiences are stored in a circular experience buffer.

A. State, Action and Reward Function
The MDP model is crucial to the success of the DRL method and helps the agent explore its environment better and perform more rewarding actions. In the proposed approach, we seek to learn an optimization algorithm automatically by learning the optimal policy for solving Problem $\mathrm{P}_2$. In this view, the execution of an optimization algorithm can be regarded as following a fixed policy in an MDP, in which the state consists of the current users' SINRs, the current power coefficients, and the objective gradient evaluated at the current power [24]. Here, to ensure that the state is only a function of the LSF coefficients, the effective SINRs are obtained from the UatF bounding technique, i.e., (15) and (18). With K users in the cell-free mMIMO system, the dimension of the environment state equals 3K. The environment state at the tth time step is defined as
$$s_t = \left[\mathrm{SINR}_1^{\mathrm{UatF}}, \ldots, \mathrm{SINR}_K^{\mathrm{UatF}},\; q_1^{(t)}, \ldots, q_K^{(t)},\; \nabla_{\mathbf{q}} f_{\mathrm{SOO}}\big(\mathbf{q}^{(t)}\big)\right] \in \mathbb{R}^{3K}.$$
Since the CPU always has access to the power control coefficients and the LSF components, the environment state $s_t$ is available to the agent. The action is the change in the transmit powers of the users, i.e., $a_t = \Delta\mathbf{q} \in \mathbb{R}^K$, so that the transmit power of user k at the (t+1)th time step is
$$q_k^{(t+1)} = q_k^{(t)} + \Delta q_k.$$
Finally, it is crucial to define the reward function correctly. The reward is a scalar that quantifies the appropriateness of an action in the current state. We propose to define the reward as the change of the objective function after performing the action:
$$r_t = f_{\mathrm{SOO}}\big(\mathbf{q}^{(t+1)}\big) - f_{\mathrm{SOO}}\big(\mathbf{q}^{(t)}\big),$$
where $f_{\mathrm{SOO}}$ is defined in Problem $\mathrm{P}_2$, given in (21).
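A minimal sketch of this MDP interface follows. Here f_soo, uatf_sinr, and grad_f_soo are placeholders for the scalarized objective (21), the UatF effective SINRs in (15)/(18), and the objective gradient; clipping the powers to [0, p_max] (to enforce the per-user power constraint) is our assumption:

```python
import numpy as np

# Minimal environment sketch for Section VI-A: state s_t = [SINRs, powers,
# objective gradient], action a_t = Delta q, reward r_t = change in f_SOO.
# f_soo / uatf_sinr / grad_f_soo are placeholders for (21), (15)/(18), and
# the gradient of (21); clipping to [0, p_max] is an assumption.
class CellFreeEnv:
    def __init__(self, K, p_max, f_soo, uatf_sinr, grad_f_soo):
        self.K, self.p_max = K, p_max
        self.f_soo, self.sinr, self.grad = f_soo, uatf_sinr, grad_f_soo
        self.q = np.full(K, 0.5 * p_max)                  # initial powers

    def state(self):
        return np.concatenate([self.sinr(self.q), self.q, self.grad(self.q)])

    def step(self, delta_q):
        f_old = self.f_soo(self.q)
        self.q = np.clip(self.q + delta_q, 0.0, self.p_max)
        reward = self.f_soo(self.q) - f_old               # change in f_SOO
        return self.state(), reward
```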

B. TD3-Based Power Allocation Design
The optimization variables $q_k, \forall k$ are continuous. Note that the TD3 agent can handle problems with continuous state and action spaces; thus, the TD3 algorithm can be exploited to determine the optimal variables $q_k, \forall k$. As shown in Fig. 2, the TD3 agent maintains six neural-network function approximators to estimate the policy and the value function; the details of the actor and critic networks are depicted in Fig. 3.
The online policy network, or deterministic actor $\mu_\phi(s_t)$, receives the state and returns the action that maximizes the long-term reward. The target policy network $\mu_{\phi'}(s_t)$ is designed to improve the stability of the optimization; the online and target policy networks have the same structure and parameterization. Algorithm 2 summarizes the TD3 learning process to solve Problem $\mathrm{P}_2$: in each step, the critic networks are updated by minimizing the loss defined in (35); every $d$ steps (delayed update), the actor policy is updated by the deterministic policy gradient and the target networks are updated; and, for HER, if $s_T \neq s^{\mathrm{goal}}$, the additional goal $s^{\mathrm{goal}} \leftarrow s_T$ is set, the rewards of the episode's transitions are updated based on the new $s^{\mathrm{goal}}$, and the modified transitions are stored in $\mathcal{R}$.
Remark 2: We refer to the solution to Problem $\mathrm{P}_2$ (obtained by our proposed TD3 algorithm) as the TD3-based power control.

VII. COMPLEXITY ANALYSIS
In this section, the computational complexity analysis for the TD3-based power control is presented. The training of the TD3 model is performed offline. After training, only the policy network is required for making control actions. Therefore, the computational complexity of the TD3-based power control is determined by the number of floating-point operations per second (FLOPS) of the policy network. The number of FLOPS during inference is mainly determined by the matrix multiplications of the policy network, which has five layers of sizes 3K, 512, 128, 64, and K. Therefore, the number of FLOPS for the policy network during inference is computed as [25]
$$\mathrm{FLOPS} = 512(3K + 1) + 128(512 + 1) + 64(128 + 1) + K(64 + 1).$$
As a result, the complexity of the proposed TD3-based power control is very low, as it consists of scalar multiplications. Next, we calculate the computational complexity of solving Problem $\mathrm{P}_4$, given in (34), which includes some SOC and linear constraints. The complexity of an SOCP is $\mathcal{O}(N_1^2 N_2)$, where $N_1$ and $N_2$ are the number of optimization variables and the total dimension of the SOCP problem, respectively [37]. As a result, Problem $\mathrm{P}_4$ can be solved with complexity $\mathcal{O}(N_{\mathrm{iter}}(2K^3 + 2K^2 + K))$, where $N_{\mathrm{iter}}$ is the total number of iterations needed to solve Problem $\mathrm{P}_4$ to the required accuracy.
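The per-layer count out x (in + 1) underlying the expression above can be reproduced directly; a small check, with K = 20 and K = 30 corresponding to system setups 2 and 1:

```python
# FLOPS of the five-layer policy network (3K -> 512 -> 128 -> 64 -> K),
# counting out*(in + 1) operations per dense layer (weights plus bias).
def policy_flops(K):
    sizes = [3 * K, 512, 128, 64, K]
    return sum(n_out * (n_in + 1) for n_in, n_out in zip(sizes, sizes[1:]))

print(policy_flops(20), policy_flops(30))  # system setups 2 and 1
```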

VIII. NUMERICAL RESULTS AND DISCUSSION

A. Simulation Parameters
1) Cell-Free mMIMO:
The performance of the proposed schemes is evaluated with two different system setups. In system setup 1, we consider a cell-free mMIMO system with 100 APs with two antennas each (M = 100 and N = 2), 30 users (K = 30), and pilot sequences of length $\tau_p = 20$. In system setup 2, a cell-free mMIMO system is considered with 60 single-antenna APs (M = 60, N = 1) and 20 users (K = 20) with orthogonal pilots ($\tau_p = K$). In both cases, the users are uniformly distributed at random over a simulation area of size $1 \times 1$ km$^2$, and the wrap-around technique is used to avoid boundary effects. The channel coefficients between users and APs and the noise power are modeled as in [2]. Let $\bar{p}_p$ and $\bar{\rho}$ denote the power of the pilot sequence and of the uplink data, respectively; then $p_p = \bar{p}_p/p_n$ and $\rho = \bar{\rho}/p_n$ are the normalized transmit SNRs, where $p_n$ refers to the noise power [2]. In the simulations, we set $\bar{p}_p = 100$ mW and $\bar{\rho} = 1$ W.
2) Benchmark Algorithms: In this section, simulation results are provided to evaluate the performance of the proposed schemes. For comparison, we consider three benchmark algorithms, namely the random power (RP) algorithm, the full power (FP) algorithm, and the fractional power (FrP) algorithm [38]. In particular, in RP we set $q_k = \mathrm{rand} \times \bar{\rho}$, in FP we set $q_k = \bar{\rho}$, and in FrP, $q_k$ is obtained by fractional power control as in [38], where $\vartheta \in [0, 1]$ determines how strongly the range of power coefficients is compressed. Note that when $\vartheta = 0$, the FrP algorithm coincides with the FP algorithm. We also set $\vartheta = \omega$ in our simulations; a more precise connection between $\vartheta$ and $\omega$ can be investigated in future work.
3) Proposed TD3 Network: As shown in Fig. 3, the policy (actor) and value (critic) networks have a five-layer structure. Specifically, the policy network has 3K input neurons, three hidden layers with 512, 128, and 64 neurons and a ReLU activation in each layer, and K output neurons with a sigmoid activation. The value network has 4K input neurons and one output neuron with a tanh activation, and its hidden layers are the same as those of the policy network. The mean squared error and the gradient of the Q-value are used as the loss functions for the value network and the policy network updates, respectively. Adam is applied as the optimizer of both the value and policy networks. The learning rate of both networks is 0.0005, and the mini-batch size B is 256. Target policy smoothing is performed by adding noise $\epsilon \sim \mathcal{N}(0, 0.2)$ to the actions chosen by the target actor network, clipped to $(-1, 1)$. Finally, we set the Polyak averaging factor $\tau_{\mathrm{Polyak}} = 0.01$, the discount factor $\eta = 0.9$, the size of the replay buffer $\mathcal{R} = 10^5$, the maximum number of episodes to 1000, and the maximum number of steps per episode to 100.
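The described actor and critic architectures can be written down directly in PyTorch; a sketch, assuming the critic takes the concatenated state-action pair as its 4K-dimensional input:

```python
import torch.nn as nn

# Actor (policy) and critic (value) networks as described: hidden layers of
# 512/128/64 with ReLU, sigmoid actor output, tanh critic output. Feeding
# the critic the concatenated state (3K) and action (K) is an assumption
# consistent with its stated 4K input size.
def make_actor(K):
    return nn.Sequential(nn.Linear(3 * K, 512), nn.ReLU(),
                         nn.Linear(512, 128), nn.ReLU(),
                         nn.Linear(128, 64), nn.ReLU(),
                         nn.Linear(64, K), nn.Sigmoid())

def make_critic(K):
    return nn.Sequential(nn.Linear(4 * K, 512), nn.ReLU(),
                         nn.Linear(512, 128), nn.ReLU(),
                         nn.Linear(128, 64), nn.ReLU(),
                         nn.Linear(64, 1), nn.Tanh())
```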

B. How to Run the Proposed TD3 Algorithm?
After 10,000 steps (100 episodes) of data collection in each simulation, the training process runs for 90,000 steps (900 episodes) to obtain the final policy. We use a PC with an Intel Core(TM) i7 CPU @ 4 GHz and 32 GB of RAM to run the simulation setup.

1) Cumulative Distribution Function (CDF) of the Achievable Per-User Rate: Figs. 4 and 5 show the CDF of the achievable uplink per-user rates for three different values of ω for system setup 1 with the ZF and MRC receivers, respectively. We compare the proposed schemes with three benchmarks. The first observation is that the proposed SCA-based power control performs better than the benchmarks. Although the improvement lessens with increasing ω, the proposed SCA-based method slightly outperforms the FrP algorithm in both the median and the 95%-likely performance, even at ω = 0.9. Comparing the two proposed schemes, the proposed TD3-based approach performs significantly better than the SCA-based method, and this improvement is more prominent for the ZF receiver at ω = 0.9. This stems from the fact that the power coefficients in this case have more degrees of freedom for altering the objective function; therefore, a reinforcement learning approach can obtain more nearly optimal power coefficients. Finally, it is worth mentioning that increasing ω shifts the priority of the objective function in Problem P2 toward the fairness index; therefore, for both proposed methods and the FrP algorithm, the achievable per-user rate is much more concentrated around its median than with FP and RP transmission.

2) Cumulative Distribution Function (CDF) of the Achievable Sum Rate: Fig. 6 shows the CDF of the achievable uplink sum rates for three different values of ω for system setup 2 with the ZF receiver. The sum rate obtained by the proposed power control approaches is always superior to that achieved by the benchmark algorithms. Specifically, Fig. 6(a)-(c) show that, with the proposed TD3-based power control, the median of the sum rate is approximately 93%, 65%, and 23% higher than with the FP algorithm; 93%, 18%, and 16% higher than with the FrP algorithm; and 8%, 9%, and 13% higher than with the proposed SCA-based algorithm for ω = 0.01, ω = 0.05, and ω = 0.99, respectively. Since Problem P2 at ω = 0.01 essentially becomes the sum rate maximization problem, both proposed approaches significantly outperform the benchmarks (Fig. 6(a)). In addition, as ω increases, the sum rate is given less weight in the optimization; therefore, the proposed approaches show the smallest improvement over the benchmarks at ω = 0.99.
3) CDF of the Achievable Minimum-User Rate: The achievable minimum user-rate performance of cell-free mMIMO with system setup 2 and the ZF receiver is investigated next. Fig. 7 compares the CDF of the minimum user rate of the proposed schemes with the benchmark algorithms for ω = {0.01, 0.05, 0.99}. As can be seen from the figures, the proposed TD3-based power control notably outperforms the three benchmarks and the proposed SCA-based algorithm for all three values of ω. In particular, Fig. 7(a)-(c) indicate that, with the proposed TD3-based power control, the 95%-likely performance of the minimum user rate is about 80%, 99%, and 118% greater than with the FP algorithm; 80%, 28%, and 25% greater than with the FrP algorithm; and 12%, 7%, and 20% greater than with the proposed SCA-based algorithm for ω = 0.01, ω = 0.05, and ω = 0.99, respectively. Since Problem P2 at ω = 0.99 essentially becomes the fairness problem, both proposed approaches significantly outperform the FP and RP algorithms (Fig. 7(c)). In addition, since FrP manages to address the fairness problem at ω = 0.99 [38], it achieves performance comparable to the proposed SCA-based power control.

4) Performance Versus Different Weight Factors:
In this section, we investigate the effectiveness of the proposed power control algorithms for the sum rate fairness trade-off. To this end, we simultaneously examine the effect of the weight factor ω on the average sum rate and the average minimum-user rate. Fig. 8(a) and (b) present the descending and ascending curves of the average sum rate and the average minimum-user rate, respectively, for both proposed approaches as functions of the weight factor ω for system setup 2 with the ZF and MRC receivers. They demonstrate that both proposed approaches strike a good trade-off between sum rate and fairness as the weighting factor changes: the average sum rate improves at the cost of a degraded average minimum-user rate, and vice versa. Comparing the two agents in the DRL-based approaches, TD3 performs slightly better than DDPG in terms of both the average sum rate and the average minimum-user rate. Finally, it is worth mentioning that the DRL-based schemes always outperform the other approaches, because they arrive at a more optimized solution by observing their own execution on a trial-and-error basis in a dynamic environment. The constant values of the average minimum-user rate and the average sum rate obtained by FP transmission are also shown in these two figures.

5) Convergence:
The uplink average sum rate obtained by Algorithm 1 versus the number of iterations for system setup 1 with the ZF receiver at ω = 0.01 is shown in Fig. 9. This figure shows that the proposed SCA-based approach converges rapidly, after around 10 iterations. Finally, we analyze the convergence of the proposed DRL-based power control approach with two different agents, namely TD3 and DDPG. Fig. 10 shows the convergence behavior for system setup 1 with the ZF receiver. In general, no major discrepancies in convergence have been observed between the different setups, so this scenario is used as an example. It can be seen that the power control algorithm with a DDPG agent does not converge even after 500 episodes, and the reward it obtains is still low (the reward fluctuates markedly). By contrast, with a TD3 agent, favorable results are obtained even after 50 episodes (a reward of up to 4 is obtained), and full convergence is achieved after 200 episodes. Simply put, the TD3 agent enhances the performance of DDPG such that it exhibits suitable convergence properties.

IX. CONCLUSION
In this paper, we proposed TD3-based power control methods for cell-free mMIMO systems, with the objective of optimizing the sum rate fairness trade-off in the uplink. The weighted-sum technique was used to recast the MOO problem into an SOO problem. In addition, imperfect CSI and uplink pilot contamination were considered in our models. Our TD3-based methods were trained by interacting with the environment, without the need to generate any dataset. The well-trained TD3-based approaches can dynamically adjust the parameters based only on the LSF coefficients to optimize the conflicting objectives. Next, an SCA-based suboptimal scheme was investigated, in which an SCA technique was proposed to tackle the non-convexity of the sum rate fairness trade-off optimization problem. In addition, the complexity and convergence of the proposed schemes were presented. Simulation results confirmed the superiority of the proposed approaches over conventional power control algorithms, e.g., the FrP and FP algorithms, in terms of the average sum rate and the average minimum user rate. Moreover, the strong capability of the proposed TD3-based power control to achieve fairness in cell-free mMIMO without degrading the sum rate maximization performance was demonstrated. Although we mainly focused on the sum rate fairness trade-off optimization problem, it is straightforward to consider other criteria, such as energy efficiency and spectral efficiency, and to generalize the framework by considering hardware impairments and limited fronthaul capacity. Furthermore, it is possible to design a multi-objective DRL approach to compute all true Pareto-optimal solutions without using a scalarization technique.

APPENDIX
A. Proof of Theorem 1
To derive the closed-form expression for the effective SINR with the ZF decoding matrix, we need to compute $\mathrm{DS}_k^{\mathrm{UatF}}$, $\mathbb{E}\{|\mathrm{BU}_k^{\mathrm{UatF}}|^2\}$, $\mathbb{E}\{|\mathrm{IUI}_{kk'}^{\mathrm{UatF}}|^2\}$, and $\mathbb{E}\{|\mathrm{TN}_k|^2\}$.