Deep Reinforcement Learning for Energy Efficiency Maximization in Cache-Enabled Cell-Free Massive MIMO Networks: Single- and Multi-Agent Approaches

Cell-free massive multiple-input multiple-output (CF-mMIMO) is an emerging beyond-fifth-generation (5G) technology that improves energy efficiency (EE) and removes the cell-structure limitation by using multiple access points (APs). This study investigates the EE maximization problem. Forming proper cooperation clusters is crucial when optimizing EE, and it is often done by selecting AP–user pairs with good channel quality or by aligning AP cache contents with user requests. However, the result can be suboptimal if the clusters are determined based solely on either aspect. This motivates our joint design of user association and content caching. Without knowing the user content preferences in advance, two deep reinforcement learning (DRL) approaches, i.e., single-agent reinforcement learning (SARL) and multi-agent reinforcement learning (MARL), are proposed for different scenarios. The SARL approach operates in a centralized manner and imposes lower computational requirements on edge devices. The MARL approach requires more computational resources at the edge devices but enables parallel computing to reduce the computation time, and therefore scales better than the SARL approach. The numerical analysis shows that the proposed approaches outperformed benchmark algorithms in terms of network EE in a small network. In a large network, MARL yielded the best EE performance, and its computation time was reduced significantly by parallel computing.


I. INTRODUCTION
With the growing number of mobile devices, mobile wireless communication needs high throughput and improved energy efficiency (EE) to satisfy user requests. The massive multiple-input multiple-output (mMIMO) technology, where each access point (AP) is equipped with multiple antennas to provide high beamforming gain and reliability, has been proposed to address this need [1], [2]. However, mMIMO suffers from the restrictions of the cell structure and the efficiency degradation of cell-edge users [3], [4]. Therefore, a beyond-5G technology named cell-free mMIMO (CF-mMIMO) has recently emerged [5], [6]. In a CF-mMIMO network, cell boundaries no longer exist and users can decide which APs to connect to in a distributed manner. As a result, a cell-free network inherits the favorable propagation properties and enhanced EE of a mMIMO network while avoiding drawbacks such as inter-cell interference [6], [7], [8]. In CF-mMIMO, the APs are connected to a central processing unit (CPU) by high-speed fiber links. The APs can therefore be equipped with simple circuit designs that consume less power, since they can offload complex calculations to the CPU. Consequently, the further enhanced EE renders CF-mMIMO networks suitable for applications in green communications [9], [10].

A. Related Work
User association is one promising technique employed in mMIMO networks [11]. Active users can be properly associated with a specific AP in order to enhance the EE and lower the latency. This technique has been adopted by various types of mMIMO networks, such as cloud radio access networks [12] and heterogeneous networks [13]. Parsaeefard et al. [12] jointly assigned radio and cloud resources to users to maximize the total throughput. Xu and Mao [13] formulated distributed user association as repeated games that modeled the interactions between service providers and users. In their setting, service providers priced themselves and let the users bid for services. User association can also be jointly optimized with beamforming [14], [15]. For example, Dong et al. [14] considered jointly optimizing cooperative beamforming and user association to maximize the downlink capacity based on statistical channel state information (CSI). Wang et al. [15] proposed to jointly consider user association and a beamforming design to cope with the significant path loss in a millimeter wave (mmWave) CF-mMIMO network.
Apart from user association, content caching is another popular way for mMIMO networks to improve the communication quality of user equipments [16], [17]. Rather than constantly acquiring data from a central processing unit (CPU), the AP temporarily stores popular or important data to reduce backbone data requests. In a cellular network, this cache-enabled technique has been applied to improve the system performance [18], [19]. For instance, Wei et al. [18] minimized the network transmission delays considering a joint user scheduling and caching strategy based on a deep reinforcement learning (DRL) algorithm. Sadeghi et al. [19] minimized the no-hit cost by introducing a deep-Q-network (DQN) based algorithm that optimized the caching policy.
To further enhance the efficiency of cache-enabled networks, user association and caching policy have been investigated simultaneously [20], [21], [22], [23], [24], [25]. Golrezaei et al. [20] and Shanmugam et al. [21] developed the idea of small-cell APs with large storage capacity and provided multi-step solutions for the caching policy while taking a given user association into account. Pantisano et al. [22] proposed a distributed user association algorithm based on a given caching policy. Darzanos et al. [23] solved the association and caching problems using multi-phase heuristic methods. Li et al. [24] proposed a DQN-based caching strategy that maximized the EE of a network involving hierarchical base stations (BSs). Rather than hierarchical BS networks, Liu and Yang [25] considered a heterogeneous network composed of users, relays, and device-to-device communication devices, and investigated the EE improvement after introducing a caching mechanism. These aforementioned methods addressed user association and caching policy in a sequential manner, leading to suboptimal solutions.
Recently, joint optimization of user association and caching policy has been considered [26], [27], [28]. Poularakis et al. [26] proposed an approximation algorithm that minimized the number of requests from the backbone. Jing et al. [27] proposed an iterative algorithm to address caching and association. Lin et al. [28] studied a learning-based caching policy on coordinated multi-point joint transmission cellular networks, where neighboring APs can be accessed by users. While throughputs were improved, existing approaches addressed at most one connecting AP or assumed prior knowledge of the users' content preferences.
In general, user association in a CF-mMIMO network is more complex than in a cellular network, since removing the confinement of the cell structure enables users to be served by any AP. The increased number of AP–user choices complicates the design of user association. Thus, the user association techniques used in mMIMO networks are not suitable for CF-mMIMO networks, and user association in a CF-mMIMO network needs further research [8], [29], [30], [31], [32], [33]. Ngo et al. [8] proposed a downlink power allocation algorithm that maximizes EE with an enhanced AP selection scheme reducing power consumption. D'Andrea et al. [29] proposed a Hungarian-algorithm-based method that aimed at sum-rate maximization. Shakya and Ali [30] proposed an iterative AP selection method implemented on a single CPU to reduce multi-user interference.
Meanwhile, K-means clustering algorithms have been utilized for AP selection in CF-mMIMO networks due to their ease of implementation [31], [32]. Biswas and Vijayakumar [31] introduced a K-means based AP selection aiming to reduce the computation workload and pilot contamination. Riera-Palou et al. [32] developed a user association method based on K-means with the objective of minimizing pilot contamination. Le et al. [34] addressed a sum spectral efficiency maximization problem of non-orthogonal multiple access systems using a K-means based method. Björnson and Sanguinetti [33] demonstrated the scalability of the CF-mMIMO network by proposing a user association framework. Although the user association problem in CF-mMIMO networks has attracted much attention, the aforementioned association methods did not consider the use of cache memory.
In the literature, the cache-enabled CF-mMIMO network with user association has not been well investigated. In [35], a greedy caching strategy was proposed to minimize energy consumption in a CF-mMIMO network. Nevertheless, the content popularity and the number of requests at each AP were assumed to be known. Chang et al. [36] provided a DRL framework using a single agent to address the joint optimization of user association and caching in CF-mMIMO under unknown content popularity.

B. Motivations and Contributions
The above work triggers an intriguing observation: joint user association and caching policy in a CF-mMIMO network lead to a tradeoff. The purpose of user association is to cluster APs and users in order to improve throughput on the basis of channel state information. However, an association with higher throughput may decrease the cache hit rate, thereby increasing the power consumption. Conversely, when the focus is on precise content caching, the power consumption reduction can harm the throughput. Simply combining two independent optimization methods for association and caching may thus degrade the system performance.
This study investigates CF-mMIMO EE optimization with joint user association and edge caching. A joint design in CF-mMIMO networks can reduce backhaul loads by avoiding fetching duplicate trending contents, while achieving cooperation gains and enhanced throughput at the same time. As a result, applications such as wireless video streaming that require low latency and high throughput can be enabled [28], [37]. As the user requests are unknown in the real world, we employed DRL to learn the patterns of user requests. By extending our previous work in [36], we proposed single-agent DRL based and multi-agent DRL based approaches to determine user association and content caching in consideration of different edge-device computational requirements. The single-agent RL (SARL) approach works in a centralized manner; it is implemented on a CPU, requires little computation at the edge devices, and generally employs a large neural network to make decisions, which can be robust against noise interference. For the multi-agent RL (MARL) based approach, neural networks are implemented on each edge device; it is more scalable and can attain better EE performance than the SARL approach, while requiring heavier computational resources at the edge devices. Moreover, the MARL approach is suitable for parallel computing, which significantly reduces the computation time.
Our numerical analysis shows promising results for the proposed methodology. In a smaller network containing 8 devices, both the SARL and MARL approaches outperformed the benchmarks, namely the SNR-based method, First In First Out (FIFO), Least Recently Used (LRU), and Least Frequently Used (LFU), in terms of EE, throughput, and power consumption. In a larger network containing 20 devices, the MARL approach achieved better EE performance than the benchmarks. Moreover, the computational costs of the proposed approaches, such as the number of operations, the total number of needed parameters, and the inference time, were examined.
The main contributions of this paper are summarized as follows:
• We addressed user association and content caching jointly for the EE maximization problem in a CF-mMIMO network. To the best of our knowledge, this problem has not been investigated in the literature.
• We proposed two approaches for optimizing user association and content caching jointly. The SARL approach computes mostly on the CPU, requiring few computational resources on edge devices. The MARL approach has better scalability and EE performance than the SARL approach. Since the computation of MARL is performed in parallel by all agents, the processing time is reduced at the cost of high-end hardware at the edge.
The rest of the paper is organized as follows: Section II describes the system model of caching-aided CF-mMIMO and the problem formulation. Section III presents the proposed approaches. Section IV presents the simulation results and discussion. Finally, Section V concludes this paper.

II. SYSTEM MODEL
This section presents the signal model, caching model, and power consumption model of a cache-enabled CF-mMIMO network, and formulates an energy efficiency optimization problem. In the signal model, the transmitted signals in the downlink channels of a CF-mMIMO network are described. In the caching model, a content caching mechanism improving the energy efficiency of the network is described. The power consumption model involves the transmit power among the APs, users, and CPU. The formulated problem reveals the tradeoff between user association and content caching.

A. Signal Model
We consider a downlink CF-mMIMO network with M APs and K users (or user equipment, UE) [6]. Let M and K denote the sets of all APs and users, respectively. Fig. 1 depicts a network topology. The APs and users are each equipped with a single antenna [3], [38], [39]. Each AP is connected to the geographically closest central processing unit (CPU) via a physical fiber link. The orange and blue regions in Fig. 1 represent two subsets of APs, where the APs selected in a subset jointly serve the users within the subset using the same time-frequency resource under time-division duplex (TDD) operation [8]. The number in the box above a UE is the file that the UE requests; the numbers in the black box above an AP are the files stored in its cache.

Fig. 1. An example of a CF-mMIMO network topology, depicting dynamic user association (sets of serving APs, i.e., the blue region and the orange region) for users, as well as AP content caching status and user requests. Matched requests and cached files are marked using the same color. For instance, because UE 1 requests file 6, all cached copies of file 6 are colored orange. The cache capacity is assumed to be two for all APs in this example.
The network operates in a fixed-size slotted fashion. In each time slot t ∈ {0, 1, 2, . . .}, downlink channel data are generated and transmitted. The channel between the mth AP and the kth user at time t can be expressed as

g_{m,k}(t) = \sqrt{d_{m,k}^{-\alpha}} \, h_{m,k}(t),

where d_{m,k} is the distance between the mth AP and the kth user, α is the path-loss exponent, and h_{m,k}(t) is the small-scale fading coefficient. Let S_k(t) be the set of serving APs for the kth user, and C_m(t) = {k ∈ K : m ∈ S_k(t)} the set of users served by the mth AP. Suppose that all users are guaranteed to be served, i.e., 0 < |S_k(t)| ≤ M, ∀k, t, but not all APs are necessarily serving users, i.e., ∪_{k∈K} S_k(t) ⊆ M. The set of all serving (active) APs at time t can thus be expressed as

S(t) = ∪_{k∈K} S_k(t).

Let q_k(t) be the symbol at time t intended for the kth user from this user's serving APs, where E[|q_k(t)|^2] = 1 and E[q_k(t)] = 0, ∀k, t, and E[q_k(t) q_l^*(t)] = 0, ∀k ≠ l, t, i.e., symbols intended for different users are uncorrelated. The transmit signal from the mth AP using conjugate beamforming with perfect channel state information can be expressed as [6]

x_m(t) = Σ_{k∈C_m(t)} \sqrt{ρ_{m,k}(t)} \, \hat{g}_{m,k}^*(t) \, q_k(t),

where ρ_{m,k}(t) is the power allocated to the kth user and \hat{g}_{m,k}(t) is the estimate of the channel gain g_{m,k}(t) at the mth AP at time t; the allocated powers are subject to the power constraint Σ_{k∈C_m(t)} ρ_{m,k}(t) ≤ P_m at the mth AP. The received signal r_k(t) at the kth user at time t is given by

r_k(t) = Σ_{m∈S_k(t)} g_{m,k}(t) x_m(t) + Σ_{m∈S_k^c(t)} g_{m,k}(t) x_m(t) + w_k(t),

where w_k(t) is the noise at the kth user and S_k^c(t) = S(t) \ S_k(t). The achievable rate of the kth user is R_k(t) = log_2(1 + SINR_k(t)), where SINR_k(t) is the signal-to-interference-plus-noise ratio implied by the received signal above. The achievable sum rate of the network is given by

R_sum(t) = Σ_{k∈K} R_k(t).    (5)
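The rate computation above can be sketched numerically. The following toy example (all values, including the power levels, noise variance, and network sizes, are illustrative assumptions, not values from the paper) computes the conjugate-beamforming sum rate for a small network with perfect CSI:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 4, 3                      # APs, users (toy sizes)
sigma2 = 1e-3                    # noise power (toy value)
# Rayleigh small-scale channels g_{m,k}
g = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
rho = np.full((M, K), 0.01)      # equal power allocation (toy)
# serve[m, k] = True if AP m is in S_k(t); here every AP serves every user
serve = np.ones((M, K), dtype=bool)

def sum_rate(g, rho, serve, sigma2):
    """Achievable sum rate under conjugate beamforming with perfect CSI."""
    K = g.shape[1]
    rates = []
    for k in range(K):
        # effective coefficient of symbol q_l as seen at user k
        coeff = np.zeros(K, dtype=complex)
        for l in range(K):
            m_idx = serve[:, l]  # APs serving user l transmit q_l
            coeff[l] = np.sum(np.sqrt(rho[m_idx, l]) * g[m_idx, k]
                              * np.conj(g[m_idx, l]))
        desired = np.abs(coeff[k]) ** 2
        interference = np.sum(np.abs(np.delete(coeff, k)) ** 2)
        rates.append(np.log2(1 + desired / (interference + sigma2)))
    return float(np.sum(rates))

print(sum_rate(g, rho, serve, sigma2))
```

For l = k the coefficient reduces to Σ_m √ρ_{m,k} |g_{m,k}|², i.e., the coherent beamforming gain; the other coefficients contribute multi-user interference.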

B. Caching Model
Suppose that there are F content files of equal size in the file library. Let F = {1, 2, . . . , F} be the set of all content files and F_m(t) be the set of content files cached at the mth AP at time t; F_m(t) is time-varying due to possible file addition or replacement at the mth AP. Each user independently requests one content file. Let f_k(t) ∈ F denote the content file requested by the kth user at time t; f_k(t) is determined on the basis of the kth user's content preference vector, modeled by all content files ranked in descending order of preference, and the content popularity, which can be modeled by a Zipf distribution [40]. Specifically, the probability that f_k(t) is equal to the ith ranked content file in the kth user's content preference vector is given by

Pr{f_k(t) = ith ranked file} = i^{-β} / Σ_{j=1}^{F} j^{-β},

where β is the Zipf factor; typically, β = 0.5, 1, or 2. The content preference vectors of all users are assumed to be distinct, independent, and time-invariant.
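The request model can be sketched as follows; the function names and toy sizes are illustrative, not from the paper:

```python
import numpy as np

def zipf_pmf(F, beta):
    """Probability of requesting the i-th ranked file (Zipf popularity)."""
    ranks = np.arange(1, F + 1)
    w = ranks ** (-float(beta))
    return w / w.sum()

def sample_requests(pref_vectors, beta, rng):
    """pref_vectors[k] ranks the F files in descending preference for user k."""
    K, F = pref_vectors.shape
    pmf = zipf_pmf(F, beta)
    ranks = rng.choice(F, size=K, p=pmf)      # sampled rank index per user
    return pref_vectors[np.arange(K), ranks]  # f_k(t): requested file ids

rng = np.random.default_rng(1)
K, F = 5, 10
# distinct, time-invariant preference vectors: independent permutations
prefs = np.stack([rng.permutation(F) for _ in range(K)])
print(sample_requests(prefs, beta=1.0, rng=rng))
```

Each user draws a rank from the shared Zipf popularity law, then maps that rank through its own preference vector, so the per-user request distributions differ even though the popularity law is common.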
The hit event for the kth user at time t, denoted by H_k(t), is defined as the event that the kth user's requested file is cached at all the serving APs of the kth user at time t, i.e., f_k(t) ∈ F_m(t), ∀m ∈ S_k(t). In the event of no-hit, there exist some APs m ∈ S_k(t) for which f_k(t) ∉ F_m(t); these APs retrieve the content file f_k(t) from the backbone for joint AP transmissions. The hit rate of the network is defined by

HR(t) = (1/K) Σ_{k∈K} 1{H_k(t)},

where 1{H_k(t)} is the indicator function, which is equal to 1 if event H_k(t) happens and 0 otherwise.
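A minimal sketch of the hit-rate computation, using hypothetical toy caches and associations:

```python
def hit_rate(requests, serving, caches):
    """Fraction of users whose requested file is cached at ALL serving APs.

    requests: {user: file}, serving: {user: set of AP ids},
    caches:   {ap: set of cached files}.
    """
    hits = sum(
        all(requests[k] in caches[m] for m in serving[k])
        for k in requests
    )
    return hits / len(requests)

requests = {0: 6, 1: 3, 2: 6}
serving = {0: {0, 1}, 1: {1}, 2: {2}}
caches = {0: {6, 2}, 1: {6, 3}, 2: {1, 4}}
print(hit_rate(requests, serving, caches))  # user 2 misses -> 2/3
```

Note the all-serving-APs condition: user 0 counts as a hit only because file 6 is cached at both AP 0 and AP 1.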

C. Power Consumption Model
The total network power consumption comprises three parts [25]: i) the transmit power of all serving APs, which accommodates the power required to provide contents from the APs' caches directly to the users; ii) the power associated with content file transport between the APs and the CPU; and iii) the power associated with content file transport between the CPU and the backbone.

For i), the sum transmit power of all serving APs is given by

P_tx(t) = Σ_{m∈S(t)} Σ_{k∈C_m(t)} ρ_{m,k}(t).

For ii), the power is a sum of the power associated with cache updates due to file addition or replacement, and the power associated with missing content retrieval. For the former, the power is proportional to the number of newly cached content files, i.e., |F_m(t) \ F_m(t−1)|. For the latter, the power is proportional to the number of distinct missing content files collectively requested by the users served by the mth AP but not cached at the mth AP, i.e., F_m^miss(t) = {f_k(t) : k ∈ C_m(t), f_k(t) ∉ F_m(t)}. When multiple users associated with an AP request the same content file that is not cached at this AP, the AP retrieves this missing content file from the CPU only once. Let P_backhaul be the unit power associated with one content file change or retrieval, which includes backhaul power (the power needed to fetch the content over the AP–CPU link) and circuit power. The total power consumption for file transport between the APs and the CPU at time t is given by

P_fronthaul(t) = P_backhaul Σ_{m∈S(t)} ( |F_m(t) \ F_m(t−1)| + σ_penalty |F_m^miss(t)| ),

where σ_penalty represents a power consumption factor for a missing content file retrieval at an AP.

For iii), we consider that the CPU is equipped with a cache whose storage space is sufficiently large to hold the missing contents, but insufficient to accommodate all contents in F. Thus, if two users associated with different APs request the same content that is not cached at their respective serving APs, the CPU downloads this content file from the backbone only once. Let P_backbone be the unit power associated with the CPU downloading a content file from the backbone. Then, the total power consumption for file transport between the CPU and the backbone at time t is

P_bb(t) = P_backbone | ∪_{m∈S(t)} F_m^miss(t) |.

Summing up i)–iii), the total network power consumption at time t is given by

P_total(t) = P_tx(t) + P_fronthaul(t) + P_bb(t).    (9)
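The three-part power model can be sketched as below; the helper name `total_power` and all numeric values are illustrative assumptions:

```python
def total_power(rho, caches_now, caches_prev, requests, serving,
                p_backhaul=0.05, p_backbone=0.05, sigma_penalty=1.0):
    """Toy total network power per the three-part model.

    rho: {(ap, user): allocated transmit power} for active links.
    """
    # i) sum transmit power of all serving APs
    p_tx = sum(rho.values())

    # ii) AP<->CPU: cache updates + distinct missing files per AP
    p_fronthaul = 0.0
    missing_union = set()
    for m, cache in caches_now.items():
        updates = len(cache - caches_prev.get(m, set()))
        # distinct files requested by users served by AP m but not cached there
        miss = {requests[k] for k in requests if m in serving[k]} - cache
        p_fronthaul += p_backhaul * (updates + sigma_penalty * len(miss))
        missing_union |= miss

    # iii) CPU<->backbone: each distinct missing file downloaded once
    p_bb = p_backbone * len(missing_union)
    return p_tx + p_fronthaul + p_bb

rho = {(0, 0): 0.01, (1, 0): 0.01, (1, 1): 0.01}
caches_now = {0: {6, 2}, 1: {6, 3}}
caches_prev = {0: {2, 5}, 1: {6, 3}}
requests = {0: 6, 1: 4}
serving = {0: {0, 1}, 1: {1}}
print(total_power(rho, caches_now, caches_prev, requests, serving))
```

The set operations mirror the "retrieve once" rules: per-AP sets deduplicate requests within an AP, and the union across APs deduplicates backbone downloads at the CPU.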

D. EE Maximization in Cache-Enabled CF-MMIMO Networks
The goal is to find a policy that determines the user association S_1(t), S_2(t), . . . , S_K(t) and the content caching F_1(t), F_2(t), . . . , F_M(t) such that the average network energy efficiency (EE), defined by R_sum(t)/P_total(t), is maximized. The problem can be formulated as

max_{ {S_k(t)}, {F_m(t)} }  lim_{T→∞} (1/T) Σ_{t=1}^{T} R_sum(t)/P_total(t)    (10a)
s.t.  0 < |S_k(t)| ≤ L, ∀k, t,    (10b)
      |F_m(t)| ≤ N, ∀m, t,    (10c)

where L is the maximum number of serving APs per user and N is the cache capacity of each AP. The objective in (10) represents the average energy efficiency. Because the randomness of the user requests, due to the unknown user preferences over time, dynamically affects the user association and content caching, conventional optimization methods are not suitable for solving this problem. Moreover, the problem becomes intractable for large networks with large numbers of APs and users. It is worth mentioning that design trade-offs exist when the clustering and content caching are determined. A design based solely on S_1(t), S_2(t), . . . , S_K(t) may favor the association of the kth user with the AP subset S_k(t) that provides the best channel conditions to increase R_k(t) and consequently R_sum(t). In contrast, a design based solely on F_1(t), F_2(t), . . . , F_M(t) may favor the association of the kth user with the AP subset S_k(t) that best aligns the content caching status with the user requests to increase hit events and consequently decrease P_total(t). This motivates our joint design in (10).

III. PROPOSED APPROACHES
This section develops user association and content caching strategies based on RL due to its continual, interactive learning procedure. In addition, the average reward setting [41] in RL is utilized since our objective focuses on average performance. The approaches can be utilized by all participants of CF-mMIMO, such as APs, users, and the CPU, to collaboratively maximize the EE over the whole network. We propose two RL approaches, namely SARL and MARL, to cope with different requirements on edge devices. In the SARL approach, most computation is allocated to the CPU in a centralized manner. This approach has low scalability but is suitable for edge devices (APs and users) that have limited memory storage and limited computational units. In the MARL approach, most computation is allocated to the APs and users in a distributed way. The MARL approach is more scalable than the SARL approach but may require high-end hardware at the edge. Fig. 2 depicts the structures of the SARL and MARL solutions. An agent is composed of an actor and a critic, represented by blue and yellow rectangles, respectively. The orange rectangles represent the observation of a specific agent (marked by "o") and its reaction to the environment (marked by "a"). Fig. 2(a) illustrates the workflow of the proposed SARL solution. During the training phase, the CPU actor takes an action based on an observation, and the action is passed to the CPU critic, which evaluates the benefit of conducting that action given the observation. When the SARL is deployed, only the CPU actor is required to determine the current clusters and file replacements. On the other hand, all the APs and UEs are considered agents in our MARL method. The APs' actors learn an optimal cache replacement policy, while the actors of the UEs are responsible for deciding the cooperation clustering. As a result, in Fig. 2(b), there are M + K agents which take actions based on their own observations.
During the training phase, in order to improve the overall EE, the critics are designed to evaluate their actors' actions based on the actions and observations of all the agents, which prevents the critics from guiding the actors to focus solely on optimizing their own EE performance.

A. SARL for Determining EE of CF-MMIMO
In our SARL approach, the agent selects an action a_t pertaining to clustering and caching that maximizes EE. We propose a score-based method that assists in evaluating the benefits of users choosing a specific AP and of APs storing a certain file. More specifically, our agent predicts an AP scoring matrix S_AP ∈ R^{|K|×|M|} and a file scoring matrix S_f ∈ R^{|M|×|F|}; the kth row of the AP scoring matrix corresponds to the scores of the APs for user k, and the mth row of the file scoring matrix represents the scores of the files for the mth AP. Note that the scores are continuous. To select beneficial user association pairs and satisfy the constraint in (10b), a user is allowed to pick up to L APs having positive scores. Given the limited cache size established in (10c), each AP caches its N top-scoring files. We assume that information about user requests and cached files at the APs is available before user association. This assumption was also made in [28], where the information about user requests and BS caching was given before a coordinated multipoint method was applied opportunistically.

The state at time t is defined as the vector of channel gains g_t = [g_{m,k}(t)]_{m∈M,k∈K}, the clustering and caching actions of the previous time slot a_{t−1}, and each user's file request history e_t. The file request history is defined as e_t = [e_{k,f,t}]_{k∈K,f∈F}, where e_{k,f,t} is the number of downloads of the fth file that the kth user has requested up to time t. The state allows the agent to jointly consider user association and content caching based on the channel quality, the up-to-date file requests, and the current caching status of each AP. The state is thus expressed as

s_t = (g_t, e_t, a_{t−1}).    (11)

According to (11), the input state dimension of SARL is |M||K| + |K||F| + (|M||K| + |M||F|). Referring to the objective in (10), we define the reward function at time t as

r(s_t, a_t) = R_sum(t) / P_total(t),    (12)

where R_sum(t) and P_total(t) are given in (5) and (9), respectively. Note that the reward r(s_t, a_t) depends on g_t, e_t, and a_{t−1} in (11).
This can be seen from the fact that the sum rate R_sum(t) depends on the channel condition g_t and the clustering result a_t^cl, which is determined based on the clustering result of the previous time slot embedded in a_{t−1}. Likewise, the total power P_total(t) depends on the caching result a_t^ca, which is determined by the caching result of the previous time slot embedded in a_{t−1}, as well as by the file preferences of each user embedded in e_t.
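The score-to-action mapping described above (up to L positive-score APs per user, top-N files per AP) can be sketched as follows; the fallback to the single best AP when no score is positive is our assumption, added so that every user stays served as required by (10b):

```python
import numpy as np

def scores_to_actions(ap_scores, file_scores, L, N):
    """Map continuous scores to discrete clustering/caching decisions.

    ap_scores:   (K, M) scores of APs for each user; a user picks up to L
                 APs with positive scores (constraint (10b)).
    file_scores: (M, F) scores of files for each AP; an AP caches its
                 N top-scoring files (constraint (10c)).
    """
    K, M = ap_scores.shape
    serving = []
    for k in range(K):
        order = np.argsort(-ap_scores[k])          # best APs first
        picked = [int(m) for m in order[:L] if ap_scores[k, m] > 0]
        # assumption: fall back to the single best AP so |S_k| > 0
        serving.append(set(picked) or {int(order[0])})
    caches = [set(np.argsort(-file_scores[m])[:N].tolist())
              for m in range(file_scores.shape[0])]
    return serving, caches

rng = np.random.default_rng(2)
serving, caches = scores_to_actions(rng.standard_normal((3, 4)),
                                    rng.standard_normal((4, 6)), L=2, N=2)
print(serving, caches)
```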
The action space of our SARL approach is discrete. However, the intermediate score-based method requires the agent to predict scores in a continuous space. Specifically, the goal of the agent is to learn to assess the quality of AP–UE pairs and the benefit of storing files in the cache; the agent thus considers a continuous action space whose output is then transformed into a discrete representation. In this paper, the deep deterministic policy gradient (DDPG) algorithm, which enables training in a continuous space, is employed, and the state, action, and reward are carefully designed to realize fast and stable learning [42]. Essentially, the SARL network follows the actor-critic approach, combining an additional target actor network θ^{μ′} and target critic network θ^{Q′} with the original evaluation actor network θ^μ and evaluation critic network θ^Q for an improved convergence rate and stability.
The actor aims to produce an action in each time slot using a deterministic policy μ(s_t|θ^μ) learned by a deep neural network (DNN) represented by a weight vector θ^μ. The weights θ^μ are updated in order to find the best deterministic policy μ(s_t|θ^μ) based on the action-value function. The expected long-term reward can be approximated by

E[r(s_t, a_t) | s_t, a_t(θ^μ)] ≈ Q(s_t, a_t),    (13)

where Q(s_t, a_t) represents the action value at the pair (s_t, a_t). The objective can be defined as

J(θ^μ) = E[Q(s_t, μ(s_t|θ^μ))],    (14)

with the expectation taken with respect to s_t. Typically, θ^μ can be updated using a gradient ascent method

θ^μ ← θ^μ + α_μ ∇_{θ^μ} J(θ^μ),    (15)

where α_μ is the learning rate. In the average reward setting, the differential return is R_t = Σ_{i=t}^{∞} (r(s_i, a_i) − r̄), where r̄ denotes the average reward per time step. Thus, the action-value function in (13) can be updated by the recursive relation

Q(s_t, a_t) = r(s_t, a_t) − r̄ + Q(s_{t+1}, μ(s_{t+1}|θ^μ)),    (16)

used by the critic to determine Q(s_t, μ(s_t|θ^μ)) in (14). More specifically, the critic evaluates the action-value function Q(s_t, μ(s_t|θ^μ)|θ^Q) using a separate DNN with weights θ^Q. Typically, the weights θ^Q are updated using

θ^Q ← θ^Q − α_Q ∇_{θ^Q} L(θ^Q),    (17)

where α_Q is the learning rate and

L(θ^Q) = E[(y_t − Q(s_t, a_t|θ^Q))^2]    (18)

is the mean-squared Bellman error function, with the target y_t adapted from the recursive relation in (16). In contrast, Q(s_t, a_t|θ^Q) in (18) is provided by the original evaluation-network. The combination of the evaluation-network and the target-network stabilizes the SARL algorithm [42]. Because of the difficulty of knowing the precise probability distribution of s_t, the expectations in (14) and (18) are approximated by sample averages over a mini-batch B sampled from a replay buffer D, which stores the transitions b_t = (s_t, a_t, r_t, s_{t+1}), t = 1, 2, . . . , t, with finite buffer size |D| for caching the most recent actions and the corresponding states and rewards. The targets are evaluated by

y_j = r_j − r̄ + Q′(s′_j, μ′(s′_j|θ^{μ′})|θ^{Q′}).    (19)

Finally, soft updates are performed to further stabilize the target critic network by

θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′}    (20)

and the target actor network by

θ^{μ′} ← τ θ^μ + (1 − τ) θ^{μ′},    (21)

where τ ≪ 1. The complete SARL procedure for determining the EE of CF-mMIMO is summarized in Algorithm 1.

Algorithm 1: SARL for Determining EE of CF-mMIMO.
1: Initialize θ^μ and θ^Q in the evaluation-network
2: Set θ^{μ′} ← θ^μ and θ^{Q′} ← θ^Q in the target-network
3: Initialize the replay buffer D
4: for episode = 1 to EP do
5:   Initialize exploration noise N_t at time t = 0
6:   for t = 1 to T do
7:     The evaluation-network actor on the CPU determines the clustering and caching a_t = μ(s_t|θ^μ) + N_t
8:     Execute the clustering of users and the caching at APs according to a_t
9:     Obtain the EE reward r_t and next state s_{t+1} from each AP and user
10:    Send the state information back to the CPU
11:    Store b_t = (s_t, a_t, r_t, s_{t+1}) in the replay buffer D
12:    Sample a mini-batch B = {(s_j, a_j, r_j, s′_j)} from D
13:    Evaluate the targets y_j using (19), where (s_j, a_j, r_j, s′_j) ∈ B
14:    Determine the gradient of the critic in (17)
15:    Update the weights of the critic in the evaluation-network
16:    Determine the gradient of the actor in (15)
17:    Update the weights of the actor in the evaluation-network
18:    Update the weights in the target-network using (20) and (21)
19:   end for
20: end for

Algorithm 2: MARL for M AP Agents and K User Agents Maximizing EE.
1: for agent i = 1 to M + K do
2:   Initialize θ^{μ_i} and θ^{Q_i} in the evaluation-network
3:   Set θ^{μ′_i} ← θ^{μ_i} and θ^{Q′_i} ← θ^{Q_i} in the target-network
4: end for
5: Initialize the replay buffer D
6: for episode = 1 to EP do
7:   Initialize exploration noise N_t at time t = 0
8:   for t = 1 to T do
9:     Obtain exploration noise N_t at time t
10:    for AP agent i = 1 to M do
11:      The evaluation-network actor on the ith AP caches files according to a_{i,t} = μ_i(s_{i,t}|θ^{μ_i}) + N_{i,t}
12:    end for
13:    for user agent i = M + 1 to M + K do
14:      The evaluation-network actor on the ith agent determines user association according to a_{i,t} = μ_i(s_{i,t}|θ^{μ_i}) + N_{i,t}
15:    end for
16:    Obtain the reward r_t and next state s_{t+1} by executing a_t = (a_{1,t}, . . . , a_{M+K,t})
17:    Store (s_t, a_t, r_t, s_{t+1}) in the replay buffer D
18:    for agent i = 1 to M + K do
19:      Sample a mini-batch B = {(s_j, a_j, r_j, s′_j)} from D randomly
20:      Evaluate the targets y_j using (27), where (s_j, a_j, r_j, s′_j) ∈ B
21:      Determine the gradient of the critic
22:      Update the weights of the critic in the evaluation-network
23:      Determine the gradient of the actor
24:      Update the weights of the actor in the evaluation-network
25:    end for
26:    for agent i = 1 to M + K do
27:      θ^{Q′_i} ← τ θ^{Q_i} + (1 − τ) θ^{Q′_i}
28:      θ^{μ′_i} ← τ θ^{μ_i} + (1 − τ) θ^{μ′_i}
29:    end for
30:   end for
31: end for

B. MARL for Determining EE of CF-MMIMO
This section elaborates a MARL approach based on multi-agent deep deterministic policy gradient (MADDPG) to increase the EE. The MARL approach is more scalable but incurs a higher computational cost in parameter storage and variable operations. Two types of agents, namely AP agents and user agents, are considered in the communication network. The AP agents determine the caching, and the user agents determine the AP association.
To maximize the EE, a score-based method similar to that used by our SARL approach is employed by the AP agents. The AP agents predict a file scoring matrix to help evaluate the value of caching a specific file. Note that the AP agents are responsible for predicting the file scoring matrices, but not the AP scoring matrices. Based on the top N scores of the matrix, the mth AP's action is expressed as a binary caching indicator vector

a^AP_{m,t} = [a^AP_{m,f,t}]_{f∈F},    (22)

where a^AP_{m,f,t} = 1 if the fth file is cached at the mth AP at time t, and 0 otherwise. According to (22) and the associated state design, the dimension of the input state of each AP is |F| + |F| + |K||F|.
The reward function of the mth AP agent at time t, defined in (23), rewards requests served from the cache and penalizes cache updates; here ⟨·, ·⟩ represents the inner product, and ¬a^AP_{m,t−1} is obtained by applying the logical NOT operator to all the elements of a^AP_{m,t−1}, so that ⟨a^AP_{m,t}, ¬a^AP_{m,t−1}⟩ counts the newly cached files.
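A toy sketch of an AP-local caching reward of this shape, with a hit term and an update penalty counted via the inner product with the negated previous action; the weighting `update_cost` is a hypothetical parameter, not a value from the paper:

```python
import numpy as np

def ap_reward(cache_now, cache_prev, request_counts, update_cost=0.1):
    """Toy AP-local caching reward: reward requests served from the cache,
    penalize newly cached files counted by <a_t, NOT a_{t-1}>.

    cache_now, cache_prev: binary vectors over the file library.
    request_counts: per-file request counts from the AP's associated users.
    """
    hits = float(np.dot(cache_now, request_counts))
    # inner product with the logically negated previous action
    newly_cached = float(np.dot(cache_now, 1 - cache_prev))
    return hits - update_cost * newly_cached

cache_now = np.array([1, 1, 0, 0])
cache_prev = np.array([0, 1, 0, 1])
request_counts = np.array([3, 0, 1, 0])
print(ap_reward(cache_now, cache_prev, request_counts))
```

Here file 0 is newly cached (it was absent at t−1), so the inner product with the negated previous action evaluates to 1.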
The kth user agent's action a^UE_{k,t} involves clustering. User agents use a scoring method to determine which APs serve them: the kth user agent predicts an AP scoring matrix used to assess the quality of user association pairs. Let a^UE_{k,t} = [a^UE_{k,m,t}]_{m∈M} denote the kth user agent's action, where the indicator a^UE_{k,m,t} ∈ {0, 1} denotes the kth user–mth AP association status; a^UE_{k,m,t} = 1 represents a successful association, and a^UE_{k,m,t} = 0 otherwise. The action a^UE_{k,t} uniquely determines the sets S_k(t) = {m ∈ M : a^UE_{k,m,t} = 1} and C_m(t) = {k ∈ K : a^UE_{k,m,t} = 1}. For the users, the state should be collectable information that can be used to calculate the reward function. The state is designed as the set of channel gains, i.e.,

s^UE_{k,t} = g_t = [g_{m,k}(t)]_{m∈M,k∈K}.    (24)

According to (24), the dimension of the input state of each user is |M||K|.
The reward function at time t is designed in terms of R_sum(t) and P_total(t), given in (5) and (9), respectively. MARL also follows the actor-critic structure. The actor of agent i (an AP or a user) outputs the action a_{i,t} based on the local observation s_{i,t}; note that s_{i,t} contains only local information observed by agent i rather than global information. The corresponding critic computes a centralized action-value function Q(s^j_1, . . . , s^j_{M+K}, a^j_1, . . . , a^j_{M+K} | θ^Q_i) whose inputs are the joint actions and local observations of all agents.
While the users determine the clustering and the APs determine the caches, the centralized critics are discarded and only the actors act in a decentralized manner. Algorithm 2 presents the pseudocode of the proposed decentralized approach in a CF-mMIMO network; the flow is similar to that of the SARL approach. In lines 10-15, the actor network of agent i outputs an action, which is caching files for an AP agent or deciding the clustering for a user agent. The target for the mean-squared Bellman error is evaluated in line 20. The gradients of the actor and critic, defined in lines 21 and 23, are used to update the weights of the actor and critic networks in lines 22 and 24, respectively. In lines 26-30, soft updates are performed on the target networks of each agent.
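The soft (Polyak) target-network update in lines 26-30 can be sketched as follows; the function name and the rate τ = 0.01 are illustrative assumptions, not values from Algorithm 2:

```python
import numpy as np

def soft_update(online_params, target_params, tau=0.01):
    """Polyak soft update applied to each agent's target networks:
    theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_params, target_params)]
```

With a small τ the target network tracks the online network slowly, which stabilizes the bootstrapped Bellman targets.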

IV. SIMULATION RESULTS
The following settings were adopted throughout this section. Users and APs were uniformly distributed in a 1 km² area, with one AP anchored at (0, 0). A content preference vector was randomly generated for each user, and in each time slot the user requests were generated from the content preference vector with Zipf factor β = 1 [40], [43], [44]. The path-loss exponent was α = 2, and the time-varying small-scale fading model of [45] was adopted for h_{m,k}, where n_{m,k}(t) ∼ CN(0, 1) and the channel variation coefficient was set to 0.01. The AP transmit power was P_m = 10 mW for all APs, and P_backhaul = P_backbone = 50 mW. The thermal noise power at each user followed [6], with a bandwidth of 20 MHz, k_B = 1.381 × 10^{−23} J/K, T_0 = 300 K, and a noise figure of 9 dB, yielding σ²_w = 7.457 × 10^{−13} W. The actor networks were DNNs with two hidden layers and the tanh activation function; the critic networks were DNNs with two hidden layers and the ReLU activation function.
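The Zipf-distributed content popularity with factor β = 1 can be generated as in the following sketch; the function name and interface are our own illustration:

```python
import numpy as np

def zipf_popularity(num_files, beta=1.0):
    """Zipf popularity over files ranked 1..num_files:
    p_f is proportional to rank^(-beta), normalized to sum to 1."""
    ranks = np.arange(1, num_files + 1)
    p = ranks ** (-beta)
    return p / p.sum()
```

Per-slot requests can then be drawn with `np.random.default_rng().choice(num_files, p=zipf_popularity(num_files))`; with β = 1 the most popular file is requested far more often than the tail.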
We examined the three scenarios listed in Table I. In the following subsections, we first justify our motivation for the joint design in Scenario 1; Scenarios 2 and 3 then show the scalability of the proposed methods in comparison with common benchmarks. For user association, we considered an SNR-based clustering policy (the kth user connects to the l ≤ L APs with the highest |g_{m,k}|² among all APs). The cache replacement policies include well-known fixed strategies such as First In First Out (FIFO) [46], Least Recently Used (LRU) [47], and Least Frequently Used (LFU) [48]. FIFO replaces the oldest file with the new data; LRU discards the file that has not been hit for the longest time; LFU records the number of times each file has been hit and replaces the file with the fewest hits.
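As an illustration of these fixed policies, the LRU benchmark can be sketched as follows (the helper name `lru_cache_trace` is hypothetical, not from the paper):

```python
from collections import OrderedDict

def lru_cache_trace(capacity, requests):
    """Minimal LRU cache: a hit refreshes the file's recency;
    a miss evicts the least-recently-used file when full."""
    cache = OrderedDict()
    hits = 0
    for f in requests:
        if f in cache:
            hits += 1
            cache.move_to_end(f)         # refresh recency on a hit
        else:
            if len(cache) >= capacity:   # evict least recently used
                cache.popitem(last=False)
            cache[f] = True
    return hits, list(cache)
```

For the request stream a, b, a, c, b with capacity 2, only the second request for a is a hit, and the cache ends holding {c, b}; FIFO and LFU differ only in the eviction rule (insertion age and hit count, respectively).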

A. Scenario 1: Motivation Justification
To examine the benefit of jointly optimizing user association and cache policy design, Fig. 3(a) shows the performance of the proposed SARL and MARL approaches and three benchmarks: 1) BF: find the optimal EE by a brute-force search over all possible combinations of user clusterings and file replacements; 2) BF(user association)+FIFO: FIFO caching combined with an exhaustive search for the user association design that yields the greatest EE; 3) SNR+BF(cache replacement): SNR-based clustering combined with an exhaustive search for the file replacement design that yields the greatest EE. BF(user association)+FIFO and SNR+BF(cache replacement) represent solely optimizing the user association and the caching policy, respectively. Fig. 3(a) shows that our SARL and MARL outperformed BF(user association)+FIFO and SNR+BF(cache replacement), demonstrating the merit of a joint user association and caching design. Fig. 3(b) shows the convergence of SARL and MARL in terms of EE versus training episodes; both methods converged after 200 episodes.

B. Scenario 2: Small Network
In Table I, the number of combinations of user clusters and cached files escalates abruptly when the number of APs, the number of files, or the cache size increases. Therefore, the brute-force search was excluded from this scenario, and the EE results are shown in Fig. 4(a). In Figs. 4(b) and 4(c), the throughputs and the power consumption of the proposed approaches were both better than those of the benchmarks: SARL and MARL can learn the users' preferences and cache the required files in advance to reduce the power consumption. Fig. 4(c) shows that the power consumption of SARL and MARL was lower than that of the benchmarks. It is noteworthy that even though SARL achieved a higher throughput than MARL, it also consumed more power on requesting backbone data. In other words, only when the user association and the content caching are jointly considered can the optimal EE be achieved. Meanwhile, benchmarks allowing more links outperformed benchmarks with fewer links in terms of EE. Benchmarks with more links may have a lower hit rate because they connect to more APs; in return, however, a better channel quality was achieved, which resulted in higher throughputs and better EE. Fig. 5(a) shows the cumulative distribution function (CDF) of EE. Given time t and the small-scale fading coefficient h_{m,k}(t), the CDF was obtained by repeatedly realizing h_{m,k}(t + 1) using (28) conditioned on h_{m,k}(t). SARL and MARL outperformed the benchmarks in Fig. 5(a), which aligns with the previous results. Fig. 6 examines an imperfect CSI case, where the imperfect channel estimate can be expressed as ĝ_{m,k}(t) = g_{m,k}(t) + e_{m,k}(t), with e_{m,k}(t) ∼ CN(0, σ²_error) being the estimation error. The proposed methods outperformed the benchmarks for σ²_error = 0.1 and 0.3. In Fig. 6(a), as in the perfect CSI case, MARL achieved greater EE than SARL; however, benchmarks with fewer links performed better than those with more links.
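The additive estimation-error model for imperfect CSI can be sketched as follows; the function name is our own, and we assume the error power σ²_error is split equally between the real and imaginary parts, as is standard for CN(0, σ²) noise:

```python
import numpy as np

def imperfect_csi(g, sigma2_error, rng):
    """Corrupt the true channel gains g with circularly-symmetric
    complex Gaussian estimation error: g_hat = g + e, e ~ CN(0, sigma2_error)."""
    std = np.sqrt(sigma2_error / 2.0)  # per-component standard deviation
    e = rng.normal(0.0, std, g.shape) + 1j * rng.normal(0.0, std, g.shape)
    return g + e
```

Averaged over many realizations, the error power E[|ĝ − g|²] equals σ²_error.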
The imperfect CSI was harmful to benchmarks with more links in terms of throughputs; their already greater power consumption became an even heavier burden, which resulted in a more significant degradation in EE. In Fig. 6(b), the EE of MARL was on a par with that of SARL. MARL determined its user association and file replacement using multiple small neural networks, while SARL did so with a single large neural network; the result may be attributed to the fact that a larger neural network is more robust against noise. The computational costs of the DRL-based approaches were evaluated by multiply-accumulate operations (MACs), the number of parameters (Params), and the inference time. One MAC represents a pair of one multiplication and one addition. Because neurons, the basic units of neural networks, are computed by multiplications and additions, MACs are a common time-efficiency metric for deep learning inference [49]. Params is the total number of required parameters, such as the weights and biases of the neural networks, and indicates the memory efficiency of the networks. In our analysis, other arithmetic computations and parameters, such as hyperparameters including the learning rate, batch size, and buffer size, were neglected because of their small ratio compared with the neural network computation. For a fair comparison, all calculations were completed with THOP [50], an operation counter based on PyTorch. The inference time was measured on an NVIDIA Tesla V100 GPU. Table II shows the MACs, Params, and inference time of the proposed SARL and MARL. For MARL, each agent in the network computed in parallel, and the time consumption of the slowest agent was taken as the inference time. The critic networks were not used during inference and thus incurred no time consumption.
For the whole network, the MACs and Params of MARL were larger than those of SARL since each AP and user had its own network in MARL (i.e., the overall MACs of MARL were 4 × (44104 + 22260) = 265456, and the overall Params, obtained similarly, was 266604). Furthermore, SARL used one neural network located at the CPU to compute, thereby reducing the computational burden of the edge devices. Considering parallel computation, we can observe that the inference time of MARL was less than that of SARL.

C. Scenario 3: Large Network
In Figs. 7(a) and 5(b), MARL outperformed SARL and the other benchmarks in EE. As improving throughputs and reducing power consumption can be conflicting goals, MARL sacrificed the cache hit rate slightly to increase the throughputs while jointly considering user association and caching. On the other hand, SARL increased the throughputs by increasing its power consumption, which resulted in degraded EE performance. This is because, in an environment with larger numbers of APs and users, the single network of SARL may have insufficient capacity to address an exponentially growing action space. It can be seen that MARL had the advantage of scalability over SARL in Scenarios 2 and 3. In Fig. 7(b), benchmarks with more links still outperformed benchmarks with fewer links in terms of throughputs. However, in Fig. 7(c), there was a gap between the power consumption of benchmarks with L = 4 and those with L = 1. As a result, benchmarks using more links performed better in terms of EE. In Fig. 5(b), MARL still outperformed the others, and all four methods were closely centered around the median. In Fig. 7(a) and Fig. 5(b), SNR+LRU slightly outperformed the other benchmarks: the LRU method reduced the power cost associated with transporting content files between the APs and the CPU, causing less power consumption. The MACs and Params of the proposed SARL were significantly lower than those of MARL, as indicated in Table II, which shows the computational efficiency of SARL.
Remark 1: The MACs and Params of the proposed approaches in the same scenario are similar. This is because our networks mostly consist of linear layers, and for one linear layer, Params exceeds MACs by exactly the layer's output size. For instance, a fully-connected linear layer with input size l and output size k requires l × k weights and k biases, i.e., l × k + k parameters. Since the layer is fully connected, each output neuron is connected to all input neurons, which leads to l multiplications and l additions (i.e., l MACs) per output neuron; k output neurons thus lead to l × k MACs in total. Hence, for a fully-connected linear layer with input size l and output size k, the number of parameters (l × k + k) is larger than the MACs (l × k) by k. MARL enabled parallel computation to reduce the computation time, and the reduction was significant in all scenarios. This is because the whole MARL algorithm can be roughly divided into two parts: the neural network part (lines 2, 3, 7, 9, 11, 14, 19, 21, 22, 23, 24, 28, and 29 in Algorithm 2), which can be accelerated by parallel computation, and the other variable computation part (lines 5, 16, 17, and 20 in Algorithm 2), which cannot. For instance, the multiple evaluation-network actors at different users/APs independently computing their actions can be accelerated by parallel computation, whereas executing the joint action a_t = (a_{1,t}, . . . , a_{M+K,t}) requires collecting the actions from all APs and users, which cannot be parallelized. The computation time of the neural networks is a large portion of the whole algorithm's computation time; if a scenario involves more participants, the time reduction becomes more significant.
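The counting argument in Remark 1 can be checked with a few lines (the helper name is our own illustration):

```python
def linear_layer_cost(l, k):
    """MACs and parameter count of one fully-connected linear layer
    with input size l and output size k."""
    macs = l * k          # one multiply-accumulate per weight
    params = l * k + k    # weights plus one bias per output neuron
    return macs, params
```

For any l and k, `params - macs == k`, i.e., Params exceeds MACs by the output size, which is why the two metrics are close for networks built from linear layers.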

V. CONCLUSION
This paper has investigated the EE maximization of a CF-mMIMO network through the joint optimization of user association and content caching. Two learning algorithms, i.e., the SARL and MARL approaches, have been proposed. SARL allocates most of the computation to the CPU, which suits edge devices with limited computational capacity; its major drawback is low scalability. By contrast, MARL requires more computational resources at the edge devices but enables parallel computing to reduce the computation time. As a result, the MARL approach scales well for a large network. Our simulations show that both approaches outperformed the benchmarks in a smaller network containing 8 devices. For a larger network containing 20 devices, the MARL approach scaled well and yielded the best EE performance.