Federated Learning for Precoding Design in Cell-Free Massive MIMO Systems

Cell-free massive MIMO precoding leverages the large number of antennas and dense access point (AP) deployment to concurrently serve multiple users over a wide coverage area, thus promising efficient interference mitigation and significant network capacity enhancement. Most existing centralized precoding methods for cell-free massive MIMO systems, whether optimization-based or learning-based, suffer from high computational complexity or high communication overhead, introducing additional latency to the network in practical applications. To address these issues, in this paper we propose two decentralized precoding methods based on the horizontal federated learning (HFL) and vertical federated learning (VFL) frameworks, respectively. In the HFL-based precoding method, we design a low-cost residual global channel state information (CSI) feature acquisition mechanism, called RFAM, at each AP to create local datasets. RFAM eliminates the need for point-to-point CSI exchange between APs, resulting in reduced communication overhead. In the VFL-based precoding method, each AP utilizes its own CSI for precoding design, thereby eliminating the communication overhead associated with obtaining CSI from other APs. Furthermore, the computational complexity is considerably reduced due to the low cost of model inference. Experimental results in various channel environments show that the proposed HFL-based method achieves a faster convergence rate and outperforms traditional decentralized methods in terms of sum-rate performance. The results also show that the proposed VFL-based method achieves sum-rate performance similar to that of centralized schemes.


I. INTRODUCTION
Cell-free massive multiple-input multiple-output (MIMO) is an emerging technique that employs a large number of distributed access points (APs) connected to a central processing unit (CPU) to cooperatively serve a small number of users over a wide area [1]. It combines the ideas of small cells, massive MIMO, and user-centric cooperative beamforming, and therefore enjoys the advantages of high energy efficiency, cost-efficient deployment, channel hardening, favorable propagation conditions, etc. [2], [3], [4]. Cell-free massive MIMO is currently being considered as a potential physical-layer paradigm for 6G systems [5], [6].
One of the key elements in the downlink transmission of cell-free massive MIMO systems is precoding. In traditional centralized precoding schemes, e.g., zero-forcing (ZF) [7] and weighted minimum mean-square-error (WMMSE) [8], the APs need to transmit their locally obtained channel state information (CSI) to the CPU via backhaul links, and the CPU subsequently sends the optimized precoding vectors back to the APs. As a result, a significant amount of signaling overhead is required, imposing a substantial burden on the backhaul links. A cooperative team minimum mean-square error (TMMSE) precoding method based on transmitter-specific CSI is introduced in [9]; it generalizes traditional centralized MMSE precoding to decentralized operation. An iterative decentralized approach is proposed in [10] to optimize the precoding vectors locally at each AP using bidirectional training between the users and the APs as well as periodic cross-term information sharing among the APs. However, the aforementioned decentralized methods are computationally inefficient due to the necessity of iterative optimization among APs, rendering them unsuitable for applications demanding prompt responses. There are also some simple and direct non-cooperative precoding methods, such as matched filtering and local MMSE, which only need locally acquired CSI at each AP without requiring CSI exchange via backhaul links. A decentralized linear precoding method called maximum ratio transmission (MRT) is introduced in [11], which aims to maximize the signal gain at a specific receive terminal. However, the diminished prominence of the channel hardening effect in cell-free massive MIMO systems, compared to cellular massive MIMO systems [12], indicates the limitation of relying solely on local CSI.
In addition, the majority of existing methods are based on convex optimization techniques [13], which are limited in their ability to tackle non-convex problems and perform poorly as the network size grows. While it is possible to develop problem-specific algorithms, doing so involves significant time cost and necessitates substantial problem-specific knowledge.
Recently, deep learning (DL) has achieved great success in the physical-layer design and optimization of wireless communication systems [14], [15], e.g., channel estimation [16], joint channel estimation and signal detection [17], precoding [18], and semantic communication [19]. Inspired by the successes in these domains, researchers have attempted to employ DL-based algorithms, namely learning-to-optimize approaches, to overcome the high computational cost of traditional precoding design methodologies. Supervised DL-based approaches for coordinated beamforming are proposed in [18], [20], [21]. However, these approaches only enable CPU-centric precoding vector design and involve extensive CSI exchange at the inference stage. Furthermore, because the precoding vectors must be downloaded from the CPU, these centralized learning-based techniques incur significant system latency.
Decentralized learning is also attracting academic interest in the domain of precoding design, since it is closer to the edge and does not require a large amount of signaling exchange to achieve comparable performance. Federated learning (FL) is a prominent decentralized learning framework that spreads the learning operations among many local users while guaranteeing privacy [22]. The two common FL frameworks are horizontal federated learning (HFL) and vertical federated learning (VFL). VFL is concerned with the feature-partition scenario, in which the edge devices have distinct feature spaces (with potential partial overlap) but share the same sample space, whereas HFL is concerned with the sample-partition scenario, in which different edge devices share the same feature space but have different sample spaces [23]. As a technology that enables learning capability on terminals, FL has the advantages of fast response and low cost compared to standard centralized learning, particularly in the field of wireless communications. To reduce the communication overhead of the system and increase its response speed, research efforts are now applying FL in massive MIMO systems for designing precoding vectors [24], [25], [26]. However, these approaches rely on pre-existing knowledge of the global CSI at the AP, which poses challenges in directly scaling them to cell-free massive MIMO systems. The work in [27] uses FL in cell-free massive MIMO systems for decentralized precoding design. This scheme, however, requires each AP to send additional synchronization signal bursts (SSBs) to all users in turn in order to obtain local model input features. At this stage, there is very little research on the design of FL-based precoding schemes in cell-free massive MIMO systems. The main difficulty is that the traditional FL-based collaborative training method cannot be directly applied to cell-free massive MIMO systems.
This is because the trained local model can only use its local data to complete the precoding design, and this will result in the decision-conflict problem and severe inter-cell interference.
As motivated above, in this paper we focus on developing precoding schemes employing FL in cell-free massive MIMO systems. Our goal is to take advantage of FL to design precoding schemes in a decentralized manner with low communication cost, as well as to tackle the problems of slow response and high computing cost that exist in classic optimization-based or learning-based methods. The main contributions and results of this work are summarized as follows:

• We propose an HFL-based decentralized precoding design method, which includes a low-cost residual feature acquisition mechanism (RFAM) of the global CSI across APs to form the local dataset. During the training and inference stages, each AP employs RFAM to acquire the input features for its respective local model. The local precoding vector is then generated as the output of the individual AP's model. The proposed RFAM merely requires all APs to broadcast a uniform pilot sequence to all users and collect the received-signal feedback from the users to solve for the residual features. This approach eliminates the need for the point-to-point signaling exchanges used in traditional solutions, resulting in a substantial reduction in communication overhead. By integrating the information about the global CSI acquired through the proposed RFAM into each local input feature, the performance of the precoding scheme is guaranteed. Meanwhile, the low computational cost of model inference decreases the computational overhead of the precoding scheme design significantly.

• We propose a VFL-based precoding design method that can be regarded as splitting the complete global model and the global input features into several local models and local input features, which enables collaborative training among the local components. During the training stage, each AP utilizes its local CSI as the input feature for its local model. The AP subsequently transmits the output of its local model (the local precoding vector) to the CPU. The CPU then leverages the outputs of all local models to calculate the global loss. As a result, the need for inter-AP CSI exchange is obviated, enabling the VFL-based method to offer fully decentralized inference. This effectively eliminates the communication cost associated with exchanging local CSI during the inference stage. Similar to the HFL-based approach, model inference in the VFL-based method incurs a low computational cost, which significantly alleviates the computational burden of the precoding scheme design.

• We conduct comprehensive experiments to assess the performance of the proposed HFL- and VFL-based precoding design methods. The experimental results demonstrate that the HFL-based method exhibits a faster convergence rate across various channel models and outperforms traditional decentralized methods in terms of sum-rate performance. The results also indicate that the VFL-based method achieves sum-rate performance similar to that of centralized schemes across different channel models.

A. CELL-FREE MASSIVE MIMO PRECODING
As shown in Fig. 1, we consider a downlink cell-free massive MIMO system in which a collection of APs $\mathcal{B} = \{1, 2, \ldots, B\}$, each equipped with $M$ antennas, serves a collection of user equipments (UEs) $\mathcal{K} = \{1, 2, \ldots, K\}$, each equipped with a single antenna, and these UEs are randomly distributed throughout a large area. It is typically assumed that $B \times M \gg K$. All APs are connected to a CPU through backhaul links to exchange information. We assume a time division duplex (TDD) scenario and a single data stream per UE for simplicity.

Let $\mathbf{h}_{b,k} \in \mathbb{C}^{M \times 1}$ denote the channel vector between AP $b \in \mathcal{B}$ and UE $k \in \mathcal{K}$, let $\mathbf{p}_{b,k} \in \mathbb{C}^{M \times 1}$ denote the precoding vector used by AP $b$ for UE $k$, let $\mathbf{p}_b = [\mathbf{p}_{b,1}^T, \mathbf{p}_{b,2}^T, \ldots, \mathbf{p}_{b,K}^T]^T \in \mathbb{C}^{MK \times 1}$ collect the precoding vectors of AP $b$ for all UEs, and let $\mathbf{P} = [\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_B] \in \mathbb{C}^{MK \times B}$ be the global precoding matrix. To meet the transmit power constraint of each AP, we assume $\sum_{k \in \mathcal{K}} \|\mathbf{p}_{b,k}\|^2 \leq \rho, \forall b \in \mathcal{B}$, where $\rho$ is the maximum transmit power at each AP. The received signal of UE $k$ is given by

$$y_k = \sum_{b \in \mathcal{B}} \mathbf{h}_{b,k}^H \mathbf{p}_{b,k} d_k + \sum_{j \in \mathcal{K}, j \neq k} \sum_{b \in \mathcal{B}} \mathbf{h}_{b,k}^H \mathbf{p}_{b,j} d_j + n_k, \qquad (1)$$

where $d_k$ is the transmit data symbol for UE $k$ with unit energy, and $n_k$ is the additive white Gaussian noise (AWGN) at UE $k$ following the distribution $\mathcal{CN}(0, \sigma_k^2)$. The corresponding signal-to-interference-plus-noise ratio (SINR) of UE $k$ can be expressed as

$$\mathrm{SINR}_k = \frac{\big|\sum_{b \in \mathcal{B}} \mathbf{h}_{b,k}^H \mathbf{p}_{b,k}\big|^2}{\sum_{j \in \mathcal{K}, j \neq k} \big|\sum_{b \in \mathcal{B}} \mathbf{h}_{b,k}^H \mathbf{p}_{b,j}\big|^2 + \sigma_k^2}. \qquad (2)$$

In (2), the first term in the denominator represents the interference signal; thus, an important role of precoding is to eliminate the impact of interference on the network performance. Finally, the downlink achievable sum-rate (measured in bps/Hz) is given by

$$R = \sum_{k \in \mathcal{K}} \log_2(1 + \mathrm{SINR}_k). \qquad (3)$$

In this paper, we focus on cell-free massive MIMO precoding design to maximize the system sum-rate, which can be formulated as the following optimization problem:

$$\max_{\mathbf{P}} \; \sum_{k \in \mathcal{K}} \log_2(1 + \mathrm{SINR}_k) \quad \text{s.t.} \quad \sum_{k \in \mathcal{K}} \|\mathbf{p}_{b,k}\|^2 \leq \rho, \; \forall b \in \mathcal{B}. \qquad (4)$$

In traditional learning-based design approaches, the server typically employs the global CSI obtained from all APs as the input feature for the learning model. Consequently, during the precoding vector design stage (model inference), the server needs to obtain the local CSI from all APs as the input feature of the learning model.
However, this process introduces additional system latency and communication overhead.
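To make the objective concrete, the following NumPy sketch evaluates the achievable sum-rate in (3) for given channels and precoders; the array layout, dimensions, and noise variance are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sum_rate(H, P, noise_var=1.0):
    """Achievable downlink sum-rate (bps/Hz) of a cell-free massive MIMO system.

    H: (B, K, M) complex array, H[b, k] is the channel vector h_{b,k}.
    P: (B, K, M) complex array, P[b, k] is the precoding vector p_{b,k}.
    noise_var: noise variance sigma_k^2 (assumed equal for all UEs here).
    """
    # Effective scalar channel seen by UE k for the stream of UE j:
    # g[k, j] = sum_b h_{b,k}^H p_{b,j}
    g = np.einsum('bkm,bjm->kj', H.conj(), P)
    K = H.shape[1]
    rate = 0.0
    for k in range(K):
        signal = np.abs(g[k, k]) ** 2
        interference = np.sum(np.abs(g[k, :]) ** 2) - signal
        rate += np.log2(1.0 + signal / (interference + noise_var))
    return float(rate)
```

For example, with a single AP, a single single-antenna UE, unit channel, unit precoder, and unit noise variance, the SINR is 1 and the sum-rate is 1 bps/Hz.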

B. OVERVIEW OF FEDERATED LEARNING
In a typical FL framework, multiple decentralized edge devices (EDs) (e.g., APs in our considered system) work together to train a shared learning model, denoted as w, using their own individual datasets under the coordination of an edge server (e.g., the CPU in our considered system). The edge server updates the global model by aggregating the uploaded local gradients and disseminating the latest global model to all EDs. Following [23], let D n denote the data owned by each ED n. Let X denote the feature space, Y denote the label space, and I denote the sample identity (ID) space. Then, the entire training dataset can be represented by the tuple {I, X , Y}. Based on how data is distributed among multiple EDs in the feature space and sample ID space, FL can be classified into HFL and VFL.
• Horizontal federated learning: HFL, also known as sample-based FL, applies when datasets share the same feature space but have different sample spaces. For example, the customers of two regional banks represent significantly different user groups from different areas, so their sample spaces hardly overlap; however, their businesses are very similar, so the feature spaces are the same. As mentioned above, HFL can be summarized as

$$\mathcal{X}_i = \mathcal{X}_j, \; \mathcal{Y}_i = \mathcal{Y}_j, \; \mathcal{I}_i \neq \mathcal{I}_j, \quad \forall \mathcal{D}_i, \mathcal{D}_j, \; i \neq j.$$

HFL aims to train a globally optimal single model by leveraging the combined knowledge of all EDs while maintaining data privacy. The classical global model update method [22] is to weight and average all local models.

• Vertical federated learning: VFL, also known as feature-based FL, is suitable for situations in which datasets share the same sample ID space but have different feature spaces. For example, in healthcare applications, one device may hold patient demographic information while another holds medical test results. VFL merges these separate feature sets and computes the training loss and gradients in a privacy-preserving manner in order to construct a model from the data of both sides collectively. In summary, VFL can be described as

$$\mathcal{X}_i \neq \mathcal{X}_j, \; \mathcal{Y}_i \neq \mathcal{Y}_j, \; \mathcal{I}_i = \mathcal{I}_j, \quad \forall \mathcal{D}_i, \mathcal{D}_j, \; i \neq j.$$

In HFL, each local model requires the complete input features of a training data sample (in the case of cell-free massive MIMO precoding design, the complete features refer to the global CSI $\mathbf{H}$). The complete local models or gradient parameters are then transmitted to the edge server for global model updates.
On the other hand, VFL can be interpreted as having local training data samples with partial features, where the complete feature set of an entire training data sample is formed by aggregating these partial features. The feature-partitioned nature of VFL implies that it no longer transmits complete model or gradient parameters to the edge server. Instead, each ED transmits the intermediate results of its local forward propagation, and the edge server then calculates the training outcome.
Consequently, in the context of precoding design for cell-free massive MIMO systems, HFL can reduce training and communication costs since it performs multiple rounds of iteration at each AP, whereas VFL has the potential for fully decentralized learning since it does not require input features from other APs.
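The distinction between sample partitioning (HFL) and feature partitioning (VFL) can be illustrated with a toy NumPy example; the dataset size and the number of EDs are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))   # a hypothetical dataset: 100 samples, 8 features

# HFL (sample partition): each of 4 EDs holds different samples,
# but every ED sees the full 8-dimensional feature space.
hfl_parts = np.array_split(X, 4, axis=0)   # four (25, 8) blocks

# VFL (feature partition): each ED holds a different slice of the features
# of the SAME 100 samples.
vfl_parts = np.array_split(X, 4, axis=1)   # four (100, 2) blocks
```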

C. THE DECISION-CONFLICT PROBLEM OF FULLY DECENTRALIZED LEARNING
When fully decentralized learning approaches are used in cell-free massive MIMO systems to design precoding methods, each AP can only use its own CSI to make decisions, which leads to the decision-conflict problem. That is, due to the lack of global information, the local decisions made by each AP may conflict with one another, and thus the system performance can deviate significantly from the global optimum. At the local model inference stage, each AP only has a subset of the complete features of the input data sample with which to infer its local precoding vectors. Obviously, such a precoding method is not globally optimal and will result in a substantial decision-conflict problem.
To further illustrate the above decision-conflict problem, we use the PyTorch open-source framework to provide some numerical analysis. We consider four benchmark precoding methods, namely ZF, WMMSE, MRT, and centralized learning (see Section V-A for the details and setup of these four methods). We compare them with the fully decentralized learning method, in which each AP only uses its own CSI to design the precoding. Note that the ZF, WMMSE, and centralized learning techniques are all centralized schemes, as they require each AP to transmit its local CSI to the CPU for centralized processing, while the MRT method is a distributed scheme since it allows each AP to design the precoding individually using its own CSI. We adopt the supervised learning method to train the local model of each AP $b$: the input features are the CSI $\{\mathbf{h}_{b,k}\}_{k \in \mathcal{K}}$ of AP $b$, and the corresponding labels are the precoding vectors obtained by using the WMMSE algorithm. Fig. 2 and Fig. 3 illustrate the system sum-rate performance comparison under the Rayleigh fading channel model and the DeepMIMO channel model [28], respectively. It is observed that the fully decentralized learning method is significantly inferior to all the centralized schemes (WMMSE, ZF, and centralized learning) at all transmit power levels and in both channel models. Its gain at high transmit power arises because the decentralized learning scheme retains some ability to mitigate inter-cell interference with the help of the precoding vectors obtained in advance by the WMMSE algorithm.
Remark 1: The learning-based precoding design consists of two stages: training and inference. Hence, it is necessary to evaluate computational and communication costs separately. As with most existing learning-based precoding designs in the literature, such as [29], [30], the training stage of our proposed scheme is conducted offline. The offline training is typically carried out when the devices are idle. As such, the communication overhead has no impact on the system operation. Once the model training is completed, there will be no additional communication costs in the inference stage besides the one we shall discuss in Section V-D. Consequently, when conducting performance comparisons, this work solely considers the computational and communication costs associated with the model inference stage.

III. HFL-BASED DESIGN
In this section, we introduce the precoding method for the cell-free massive MIMO system based on the HFL framework. Recall that how features are distributed among multiple EDs is crucial to distinguishing HFL from VFL. In this paper, we define the complete input features of a training data sample as the global CSI $\mathbf{H}$. Initially, each AP $b$ only has the partial features $\mathbf{H}_b$ of this training data sample, and the residual features $\{\mathbf{H}_{b'}\}_{b' \neq b}$ are unknown. Note that the HFL framework needs the complete input features of the global CSI for local training at each AP, but only partial features are available in reality. To solve this problem, we propose RFAM, a low-cost mechanism that acquires the residual features to generate the complete input features for local training, which reduces the signaling exchange between the CPU and each of the APs. In the rest of this section, we first outline the problem formulation and the architecture of the HFL-based method, and then elaborate the detailed procedure of the RFAM method.

A. ARCHITECTURE AND PROBLEM FORMULATION
The architecture of the proposed HFL framework for cell-free massive MIMO precoding is shown in Fig. 4. Let $\mathbf{H}_b^i$ denote the input features of the $i$-th training data sample at AP $b$, which is generated by the proposed RFAM method as shown in the following subsection, and let $Y_b^i$ be the corresponding label, defined as the precoding vector $\mathbf{p}_b$ designed by AP $b$, which can be obtained by some benchmark algorithm such as WMMSE when the global CSI is available. Let the total number of training data samples be denoted as $D = \sum_{b \in \mathcal{B}} D_b$, where $D_b = |\mathcal{D}_b|$ is the size of the local dataset at AP $b$. The function $\alpha(\cdot)$ at the CPU aggregates the local models from all APs to generate a global model $\mathbf{w}$ for all APs; a typical aggregation function is weighted averaging [31]. In this scheme, the objective is to learn a global model that fits all data across the APs. In particular, we aim to solve

$$\min_{\mathbf{w}} \; \frac{1}{D} \sum_{b \in \mathcal{B}} \sum_{i=1}^{D_b} \ell(\mathbf{w}; \mathbf{H}_b^i, Y_b^i),$$

where $\ell(\mathbf{w}; \mathbf{H}_b^i, Y_b^i)$ is the loss function for the $i$-th training data sample at AP $b$.
The training stage involves multiple iterations in offline mode, each of which includes $\tau$ local updates, local model uploading, global model aggregation, and global model broadcast. Before starting local training, each AP uses RFAM to obtain the input features for its local model, and the CPU adopts the centralized WMMSE algorithm to provide corresponding labels for each local model. Here, we describe one round (say the $t$-th) of the training stage.
• Global Model Broadcast: First, the CPU broadcasts the newest global model, $\mathbf{w}^t$, to all APs through the backhaul links.

• Local Updates: Upon receiving the global model $\mathbf{w}^t$ from the CPU, each AP $b$ sets its local model as $\mathbf{w}_b^{t,0} = \mathbf{w}^t$. Then, it adopts a specific learning algorithm, such as stochastic gradient descent (SGD), and performs $\tau$ ($\geq 1$) rounds of local updates based on its local dataset:

$$\mathbf{w}_b^{t,r} = \mathbf{w}_b^{t,r-1} - \eta_r^t \nabla \ell\big(\mathbf{w}_b^{t,r-1}; \mathcal{D}_{b,r}^t\big), \quad r = 1, \ldots, \tau,$$

where $\nabla$ denotes the gradient operation, $\eta_r^t$ is the learning rate (also known as the step size), and $\mathcal{D}_{b,r}^t$ is a sample randomly chosen from the local dataset $\mathcal{D}_b$ in the $r$-th local training iteration of the $t$-th round. Here, $\ell(\cdot)$ denotes the prediction error between the predicted label $\hat{Y}_b^i$, generated from the input features $\mathbf{H}_b^i$ with model $\mathbf{w}^t$, and the ground-truth label $Y_b^i$. In a neural network, the prediction can be written as

$$\hat{Y}_b^i = \varphi\Big(\sum_{j=1}^{J} v_j x_j\Big),$$

where $J$ denotes the number of neurons, $v_j$ is the weight connecting different neurons, $x_j$ is the corresponding neuron input, and $\varphi(\cdot)$ denotes the activation function.

• Local Model Uploading: Each AP $b$ transmits its locally updated model $\mathbf{w}_b^{t,\tau}$ to the CPU through the backhaul link.

• Global Model Aggregation: After receiving all the local models, the CPU performs global model aggregation to update the global model $\mathbf{w}^{t+1}$. In this work, we consider the classic FL model aggregation scheme in [22], given by

$$\mathbf{w}^{t+1} = \sum_{b \in \mathcal{B}} \frac{D_b}{D} \mathbf{w}_b^{t,\tau}.$$

After the training of the global model $\mathbf{w}$ is finished, in the inference stage, each AP $b$ first uses the proposed RFAM method to acquire the input features $\mathbf{H}_b^i$, and then uses the trained global model $\mathbf{w}$ to generate the precoding vector $\mathbf{p}_b$ for all users in online mode.
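The training round described above can be sketched as follows. This toy NumPy implementation uses linear local models and a squared loss purely for illustration; the paper's actual local models are neural networks, and all dimensions here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration): B APs, model dimension d,
# n_b local samples per AP. Local "models" are linear and the loss is squared
# error, so each local SGD step has a closed-form gradient.
B, d, n_b = 4, 10, 32
w_true = rng.standard_normal(d)                    # ground truth to be learned
local_data = []
for _ in range(B):
    X = rng.standard_normal((n_b, d))
    local_data.append((X, X @ w_true))             # noiseless labels

w = np.zeros(d)                                    # global model at the CPU
tau, eta, T = 5, 0.01, 50                          # local steps, step size, rounds
for t in range(T):
    local_models = []
    for X, y in local_data:                        # each AP starts from w^t
        w_b = w.copy()
        for r in range(tau):                       # tau rounds of local SGD
            i = rng.integers(n_b)
            grad = 2.0 * (X[i] @ w_b - y[i]) * X[i]
            w_b -= eta * grad
        local_models.append(w_b)                   # local model uploading
    # Global aggregation: weighted average (equal dataset sizes here).
    w = np.mean(local_models, axis=0)
```

After the rounds above, the aggregated global model `w` approaches `w_true`, mirroring how the global model fits the data of all APs.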

B. RFAM METHOD
In this paper, we aim to maximize the downlink sum-rate in (3) of the cell-free massive MIMO system. According to (2), the precoding method of each AP is related to the global CSI and the channel noise between the UEs and each AP. Therefore, we use the global CSI $\mathbf{H}$ as the input features of the HFL-based local model for precoding, as elaborated above. We assume that each AP $b$ can perfectly obtain its local CSI with all UEs, $\{\mathbf{h}_{b,k}\}_{k \in \mathcal{K}}$, but does not know the CSI from other APs to the UEs (this assumption is consistent with most existing learning-based precoding design methods, such as [21], [27]). In the following, we propose a three-stage approach, called RFAM, to acquire the residual features of the global CSI at each AP and generate the input features for each local model. The overall procedure is illustrated in Fig. 5: in stage 1, each AP transmits a common pilot sequence simultaneously; in stage 2, each UE returns its received signal to each AP; and in stage 3, each AP extracts the desired residual features. The details are as follows. Let $\mathbf{S} \in \mathbb{C}^{M \times M}$ denote the common pilot matrix for all UEs. In the first stage, each AP broadcasts the pilot matrix $\mathbf{S}$ to all UEs simultaneously. The received signal at UE $k$, denoted as $\mathbf{y}_k^{\mathrm{dl}} \in \mathbb{C}^{1 \times M}$, is given by

$$\mathbf{y}_k^{\mathrm{dl}} = \beta^{\mathrm{dl}} \sum_{b \in \mathcal{B}} \mathbf{h}_{b,k}^H \mathbf{S} + \mathbf{n}_k^{\mathrm{dl}},$$

where $\beta^{\mathrm{dl}}$ is a constant that meets the transmit power constraint and is the same for all APs, and $\mathbf{n}_k^{\mathrm{dl}}$ is the AWGN term at UE $k$ with elements distributed as $\mathcal{CN}(0, \sigma_k^2)$. In the second stage, each UE $k$ sends its received signal $\mathbf{y}_k^{\mathrm{dl}}$ back to all APs through an ideal error-free channel. Let $\mathbf{Y}^{\mathrm{dl}} = \{\mathbf{y}_k^{\mathrm{dl}}\}_{k \in \mathcal{K}}$ denote the collection of feedback signals from all users at each AP. In the last stage, based on $\mathbf{Y}^{\mathrm{dl}}$ and $\mathbf{S}$, each AP $b$ estimates the composite CSI $\sum_{b' \in \mathcal{B}, b' \neq b} \mathbf{h}_{b',k}$ from all other APs to each UE by treating its own CSI $\{\mathbf{h}_{b,k}\}$ as side information.
Specifically, by means of this new uplink signaling resource, each AP $b$ obtains

$$\bar{\mathbf{h}}_{b,k}^H = \frac{1}{\beta^{\mathrm{dl}}} \mathbf{y}_k^{\mathrm{dl}} \mathbf{S}^{-1} - \mathbf{h}_{b,k}^H = \sum_{b' \in \mathcal{B}, b' \neq b} \mathbf{h}_{b',k}^H + \frac{1}{\beta^{\mathrm{dl}}} \mathbf{n}_k^{\mathrm{dl}} \mathbf{S}^{-1}. \qquad (12)$$

Remark 2: The proposed HFL-based method demonstrates robustness in the presence of imperfect local CSI acquired at each AP. First, the utilization of imperfect local CSI as the input feature for both the training and inference stages reduces the impact of imperfect local CSI on performance during the local model inference stage. Second, the proposed HFL-based method employs a supervised learning approach, necessitating the generation of labels in advance for each local input feature. Building upon the robust WMMSE precoder for massive MIMO systems presented in [32], even when an AP obtains imperfect local CSI, the approach in [32] can facilitate the construction of accurate labels for each local input feature. This enables the mitigation of the effects caused by imperfect local CSI. Therefore, for the sake of simplicity, the proposed HFL-based method assumes that each AP $b$ can perfectly acquire its local CSI $\{\mathbf{h}_{b,k}\}_{k \in \mathcal{K}}$ from all UEs.
Remark 3: Here we assume that the received signal at each UE $k$, $\mathbf{y}_k^{\mathrm{dl}}$, can be transmitted without error to each AP. In practical scenarios, this assumption may not hold due to limited channel resources. However, it is worth noting that neural networks exhibit strong robustness to perturbations in their input features. Additionally, the data samples used in the local training and inference stages of the proposed HFL-based scheme are independent and identically distributed (IID). As a result, even if the feedback of $\mathbf{y}_k^{\mathrm{dl}}$ is not ideal, the performance of the proposed HFL-based scheme will not be significantly affected. This assertion will be validated through the simulation results presented in Section V-C. To facilitate the analysis, we assume that stage 2 of RFAM employs an ideal, error-free feedback channel.
From (12), $\bar{\mathbf{h}}_{b,k}^H$ denotes the sum of the channel vectors between all APs except AP $b$ and user $k$. Perfect estimation of $\bar{\mathbf{h}}_{b,k}^H$ requires that the transmit SNR at each AP is high enough for the channel noise to be neglected, and that the pilot matrix $\mathbf{S}$ is full rank. Let $\mathbf{H}_b^i = \{\mathbf{h}_{b,k}, \bar{\mathbf{h}}_{b,k}\}_{k \in \mathcal{K}}$ denote the input feature matrix generated for the local model $\mathbf{w}_b$, which contains the complete feature information about the global CSI.
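The three RFAM stages can be sketched numerically as follows (noiseless case with ideal feedback). The pilot choice, a scaled DFT matrix, and all dimensions are illustrative assumptions; the sketch checks that removing the locally known CSI from the inverted feedback signal recovers the composite channel of all other APs, as in (12).

```python
import numpy as np

rng = np.random.default_rng(1)
B, K, M = 4, 2, 8                                  # illustrative dimensions
beta_dl = 1.0
H = (rng.standard_normal((B, K, M))
     + 1j * rng.standard_normal((B, K, M))) / np.sqrt(2)

# Stage 1: every AP broadcasts the same full-rank pilot matrix S
# (a unitary DFT matrix is one convenient choice).
S = np.fft.fft(np.eye(M)) / np.sqrt(M)

# Noiseless received pilot signal at UE k: y_k = beta * sum_b h_{b,k}^H S.
Y = beta_dl * np.einsum('bkm,mn->kn', H.conj(), S)

# Stage 2: each UE feeds y_k back to every AP (ideal feedback assumed).
# Stage 3: AP b inverts the pilot and subtracts its own known channel,
# leaving the composite channel of all other APs (the residual features).
S_inv = np.linalg.inv(S)
residual = np.empty((B, K, M), dtype=complex)
for b in range(B):
    for k in range(K):
        composite = (Y[k] @ S_inv) / beta_dl       # sum_b h_{b,k}^H
        residual[b, k] = composite - H[b, k].conj()
```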
Remark 4: Although $\mathbf{H}_b$ does not precisely encompass the complete CSI represented by $\mathbf{H}$, the learning-based method possesses an advantage over the optimization-based method due to its feature-to-output mapping nature, which lacks a closed-form expression. In contrast, the optimization-based scheme relies on a correct closed-form expression for the design of the precoding method, and each variable within this expression must take an accurate value to achieve the optimal design. Moreover, the superposition of all features of a specific global CSI provides a representation of the overall features to a certain extent. For instance, when identifying a cat picture, the aggregation of multiple feature elements, such as eyes and ears, allows for accurate recognition as a cat, albeit potentially making the identification process more challenging. Therefore, employing the superposition of all features as input is a reasonable approach.

C. COMPLEXITY AND CONVERGENCE ANALYSIS
As mentioned in Remark 1, the learning-based precoding design entails two stages: training and inference, thereby necessitating a separate analysis of the associated computation and communication costs. In this subsection, we discuss the computation and communication costs of the training stage of the proposed HFL-based method. Section V-D will address the computation and communication costs of the proposed HFL-based method specifically during the inference stage.
The computational cost associated with the HFL-based method primarily arises from training the local model at each AP. Here, without loss of generality, we assume that each AP utilizes a convolutional neural network (CNN) model. Based on a thorough literature review, the computational complexity of forward propagation in a CNN is generally represented as $O(M_a^2 K_e^2 C_{\mathrm{in}} C_{\mathrm{out}})$ per convolutional layer, where $M_a$ is the edge length of the output feature map generated by each convolutional kernel, $K_e$ is the edge length of each convolutional kernel, $C_{\mathrm{in}}$ is the number of input channels of each layer of the CNN, and $C_{\mathrm{out}}$ is the number of output channels of each layer of the CNN. The computational complexity of backpropagation in a CNN can be represented as $O(\lambda M_a^2 K_e^2 C_{\mathrm{in}} C_{\mathrm{out}})$, where $\lambda$ denotes the multiple of the computational complexity compared to forward propagation. Based on this, for each AP $b$, the total computational complexity during the training stage can be approximated as $O((1+\lambda)\tau D_b T M_a^2 K_e^2 C_{\mathrm{in}} C_{\mathrm{out}})$, where $T$ is the total number of training rounds.
Let $Q$ denote the number of parameters of the local model at each AP (in the HFL framework, each local model has an equal number of parameters). Within each training round, the communication cost of the HFL scheme stems from two aspects. First, there is the communication cost of obtaining the local model input features using the RFAM method, which amounts to $M^2 + KM$. Second, there is the communication cost of transmitting the local model parameters, which is quantified as $BQ$. Consequently, the cumulative communication cost within each training round is $BQ + M^2 + KM$.
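The per-layer forward-propagation complexity and the per-round communication cost discussed above can be expressed with small helper functions; the numerical arguments below are purely illustrative.

```python
def cnn_layer_forward_flops(M_a, K_e, C_in, C_out):
    """Forward-propagation cost of one conv layer: O(M_a^2 K_e^2 C_in C_out)."""
    return M_a**2 * K_e**2 * C_in * C_out

def hfl_round_comm_cost(B, Q, M, K):
    """Per-round HFL communication: B models of Q parameters + RFAM (M^2 + K*M)."""
    return B * Q + M**2 + K * M

print(cnn_layer_forward_flops(M_a=28, K_e=3, C_in=16, C_out=32))  # 3612672
print(hfl_round_comm_cost(B=16, Q=10_000, M=8, K=4))              # 160096
```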
The HFL-based method adopts the classic FedAvg training method, and according to [33, Th. 1], the number of training rounds needed to attain a fixed precision $\varepsilon$ is on the order of $O\big((G^2 + \Gamma)/\varepsilon\big)$, where $G^2$ is the upper bound of the expected squared norm of all local gradients and $\Gamma$ is a problem-related constant.
The proposed HFL method solves the decision-conflict problem described in Section II-C by acquiring additional input features for each local model through the proposed RFAM method. Importantly, RFAM obviates the need for inter-AP exchange of local CSI during both the training and inference stages, thereby yielding a substantial reduction in communication costs.

IV. VFL-BASED DESIGN
In the HFL framework proposed in the previous section, each AP $b$ still requires additional communication overhead during the inference stage to obtain the input features $\mathbf{H}_b^i$ introduced in Section III-B. As such, it is a partially decentralized scheme. To address this limitation, we propose a fully decentralized precoding design framework for the cell-free massive MIMO system based on the VFL framework. As described in Section II-B, the VFL-based framework offers the advantage of enabling global collaborative training without requiring access to the global CSI features. Therefore, compared with the HFL-based framework, the VFL framework can eliminate the pilot transmission overhead. This section presents the details of the proposed VFL framework, which decreases the communication overhead (by eliminating the pilot transmission overhead) during the inference stage through a fully decentralized mode while incorporating collaborative training across all APs.

FIGURE 6. VFL system framework.

A. ARCHITECTURE AND PROBLEM FORMULATION
The proposed VFL framework for cell-free massive MIMO precoding is shown in Fig. 6. Let v_b denote the local model owned by AP b, and define v_AP = {v_b}_{b∈B} as the collection of all local models. Let H^i = {H_b^i}_{b∈B} denote the collection of the i-th local input features from all APs. As described in Section II-B, in VFL all APs share the same sample IDs but have separate feature spaces. Therefore, the objective of VFL is to train a favorable model v_AP, where each constituent part v_b is trained by AP b using its local input features X_b.
The proposed VFL framework can be divided into the AP side and the CPU side. On the AP side, each AP b maps its input feature H_b^i into an intermediate vector p_b^i. Each AP b then transmits p_b^i to the CPU, which calculates the global loss L. After that, the CPU returns L to each AP b for updating v_b.
Specifically, in the proposed VFL scheme, the objective is to learn all the local models v_b so as to fit all data across the APs. In particular, the objective function is

min_{v_AP} (1/X) Σ_{i=1}^{X} ℓ({v_b, p_b^i}; H^i),   (13)

where ℓ({v_b, p_b^i}; H^i) is the loss function for input features H^i, and X ≤ X_b, ∀b ∈ B, is the number of complete input features (the value of X depends on how many aligned input features exist across all APs).

B. TRAINING STAGE
The distributed SGD algorithm is designed to solve problem (13) in the VFL framework. The training process involves multiple iterations in offline mode, each of which includes a forward propagation for loss function evaluation and a backward propagation for calculating the gradient of each local model. Unlike centralized learning or HFL, where a complete copy of the training model is held at a single node, in VFL each AP keeps only a portion of the training model v_AP, and intermediate results must be exchanged between each AP b and the CPU in the course of forward propagation and back propagation.
• Forward propagation: In round t, each AP b maps its input features of the i-th data sample, H_b^i, to a predicted precoding vector p_b^{t,i}, where {p_{b,k}^{t,i}}_k denotes the specific predicted precoding vectors for the users k served by AP b. Then, each AP b ∈ B transmits its output precoding vector p_b^{t,i} to the CPU via the backhaul link (in the first training round, each AP b also needs to transmit its local CSI H_b^i to the CPU for calculating the loss). After receiving all predicted precoding vectors {p_b^{t,i}}_{b∈B}, the CPU calculates the loss for the i-th training sample as ℓ({p_b^{t,i}}_{b∈B}), where ℓ(·) is the error function. We adopt an unsupervised learning approach and define the error function as the negative of the achievable sum-rate of the system in (3). To sum up, the loss for the i-th training data sample is given by (14), shown at the bottom of the page.

• Back propagation: After the loss calculation is complete, the gradients are back-propagated through each local model to update the current model parameters. Specifically, the CPU sends ∇ℓ({p_b^{t,i}}_{b∈B}) to each AP via the backhaul link. Then, each AP performs its respective back propagation to compute the stochastic gradient of its local model and updates it as v_b^{t+1} = v_b^t − η_b^t ∇_{v_b} ℓ({p_b^{t,i}}_{b∈B}), where η_b^t is the learning rate for AP b in round t.

Following the completion of the training stage, the well-trained local models are deployed for inference: each AP uses its corresponding well-trained local model and outputs local precoding vectors through model inference in online mode.
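The forward-propagation step can be sketched as follows. This is a toy illustration, not the paper's implementation: the CNN local models v_b are replaced by a hypothetical MRT-style map, the dimensions are illustrative, and only the loss evaluation at the CPU (the negative sum-rate used as the error function) is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
B, M, K, sigma2 = 4, 2, 3, 1e-2  # illustrative dimensions, not the paper's setup

# Local CSI at AP b: H[b] has shape (K, M), one row per user
H = (rng.standard_normal((B, K, M)) + 1j * rng.standard_normal((B, K, M))) / np.sqrt(2)

def local_model(h_b):
    """Stand-in for the trained local model v_b: a normalized MRT-style map
    from local CSI to per-user precoding vectors (shape (K, M))."""
    w = h_b.conj()
    return w / np.linalg.norm(w)

def sum_rate(H, W, sigma2):
    """Achievable sum-rate for the stacked precoders W[b], b = 0..B-1."""
    rate = 0.0
    for k in range(K):
        # Effective gain from user k's channel to the precoder of each user j
        gains = np.array([sum(H[b, k] @ W[b, j] for b in range(B)) for j in range(K)])
        signal = abs(gains[k]) ** 2
        interference = (np.abs(gains) ** 2).sum() - signal
        rate += np.log2(1 + signal / (interference + sigma2))
    return rate

# Forward propagation: each AP maps its CSI to precoders and uploads them;
# the CPU evaluates the error function (the negative sum-rate) as the loss.
W = np.stack([local_model(H[b]) for b in range(B)])
loss = -sum_rate(H, W, sigma2)
```

In an actual VFL round the CPU would then return the gradient of this loss with respect to each uploaded precoder, and each AP would back-propagate it through its own model.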

C. COMPLEXITY AND CONVERGENCE ANALYSIS
Similar to Section III-C, this subsection only focuses on the computation and communication costs associated with the training stage of the proposed VFL-based method.
The computational cost of the VFL-based method primarily stems from training the local model at each AP. Here, without loss of generality, we assume that each AP utilizes a CNN model. Notably, since the VFL-based scheme does not involve multiple local iterations, the computational complexity during the training phase at each AP can be approximated as O((1 + λ)·F_CNN), where F_CNN denotes the complexity of one forward pass of the CNN and λ accounts for the relative cost of the corresponding backward pass.

The communication cost associated with the VFL scheme arises from two aspects. First, each AP transmits its local CSI to the CPU in the initial stage, resulting in a communication cost of BKM. Second, across the T training rounds, there is a communication cost of TKM for transmitting the local model outputs to the CPU. Consequently, the total communication cost is (B + T)KM.
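The total cost (B + T)KM can be checked numerically; in this minimal sketch, the values of B, K, M, and T are illustrative rather than taken from the paper's experiments:

```python
def vfl_comm_cost(B, K, M, T):
    """Total VFL communication cost: one-off local CSI upload (B*K*M)
    plus model-output uploads over T training rounds (T*K*M)."""
    return B * K * M + T * K * M  # = (B + T) * K * M

# Example: 16 APs, 6 users, 4 antennas, 100 training rounds
print(vfl_comm_cost(B=16, K=6, M=4, T=100))  # (16 + 100) * 24 = 2784
```

Unlike the HFL cost, this expression contains no model-size term, which is why the VFL scheme remains lightweight even for large local models.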
The VFL-based scheme adopts the classic SGD method; according to [34], attaining a fixed precision ε requires L‖v^0 − v^*‖^2/(2ε) training rounds, where L indicates that ℓ(·) satisfies the L-Lipschitz condition, v^0 is the initial model of v_AP, and v^* is the optimal model. The proposed VFL framework solves the decision-conflict problem described in Section II-C by calculating the global loss on the CPU side to guide the local model updates. Meanwhile, once all local models are well trained, each AP can complete its precoding vector design using only its local CSI, which greatly reduces the communication overhead.
Remark 5: HFL-based solutions are partially decentralized, whereas VFL-based solutions are fully decentralized. This is because HFL-based solutions need additional signal exchange with UEs during the precoding design phase, which is not fully decentralized. In contrast to the VFL-based solution (which must transmit intermediate parameters of the local update in each round), the HFL-based solution offers the benefit of lightweight training, since local updates can be batched and iterated over multiple rounds locally. Conversely, the VFL-based solution is better suited to communication-efficient scenarios, since it requires no additional signal exchange during the precoding design phase.

V. PERFORMANCE EVALUATION

A. SIMULATION SETUP
The simulation setup consists of B = 16 APs (if not specified otherwise), each equipped with M = 4 antennas (if not specified otherwise) and arranged in a square grid with an inter-site distance of 100 m and a height of 10 m. There are in total K = 6 single-antenna users (if not specified otherwise), randomly dropped in the considered coverage area. We verify the performance of our proposed methods in both the Rayleigh fading channel model and the deepMIMO ray-tracing channel model [28]. The Rayleigh fading channel represents an ideal IID channel environment, while the deepMIMO channel represents a more practical non-IID channel environment. As in [35], [36], the Rayleigh fading channel model includes IID Rayleigh fading and power-law pathloss, with each channel generated as h_b,k ∼ CN(0, γ_b,k I_M), where γ_b,k [dB] = −30.5 − 36.7 log10(d_b,k), with d_b,k being the distance between AP b and user k. The AWGN powers at the users are fixed to {σ_k^2 = −90 dBm}_k∈K. In the deepMIMO channel model, we consider an outdoor scenario of two streets and one intersection, with parameters active_BS = {1:12}+{15:18} and the users distributed across all rows. In this simulation, we adopt the prevalent CNN learning model. The CNN has 4 layers: a 2 × 2 × 32 convolutional layer, a 2 × 2 × 64 convolutional layer, and two fully connected layers. The first fully connected layer has z × 128 units and the second has 128 × 64 units, where z is the output dimension of the second convolutional layer. The mini-batch size θ is set to 100, the learning rate is fixed to 0.001, and the weight decay is set to 10^−8.
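The Rayleigh channel model above can be sketched in a few lines; the sketch below draws a single AP-user channel vector and is not the paper's data-generation code:

```python
import numpy as np

rng = np.random.default_rng(42)

def rayleigh_channel(d_bk, M, rng):
    """Draw h_{b,k} ~ CN(0, gamma_{b,k} I_M) with power-law pathloss
    gamma_{b,k}[dB] = -30.5 - 36.7 * log10(d_{b,k})."""
    gamma_db = -30.5 - 36.7 * np.log10(d_bk)
    gamma = 10 ** (gamma_db / 10)  # linear-scale channel variance
    # Each complex entry has variance gamma (gamma/2 per real dimension)
    h = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) * np.sqrt(gamma / 2)
    return h

# One channel vector for an AP at 100 m with M = 4 antennas
h = rayleigh_channel(d_bk=100.0, M=4, rng=rng)
print(h.shape)  # (4,)
```

At d = 100 m the pathloss is about −103.9 dB, so the generated channel entries are correspondingly tiny; the noise power {σ_k^2 = −90 dBm} then sets the operating SNR together with the transmit power ρ.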

B. DATA GENERATION AND BENCHMARK METHODS
The size of each local dataset is 10,000 samples, of which 80% are used for training and the remaining 20% for performance evaluation. In the Rayleigh fading channel model, we generate 10,000 channel realizations by randomly drawing user positions in the user distribution region. In the deepMIMO channel model, we randomly select 60,000 users across all rows to generate channel data, and, after shuffling, use the channel data of each set of 6 users as one data sample.
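The 80/20 split can be sketched as follows; the shared permutation of sample IDs mirrors the sample alignment assumed across APs, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 10_000                       # local dataset size per AP

idx = rng.permutation(N)         # shuffle sample IDs once, shared by all APs
train_idx, test_idx = idx[:8_000], idx[8_000:]  # 80% train / 20% evaluation
print(len(train_idx), len(test_idx))  # 8000 2000
```

Using one shared permutation ensures that the i-th training sample refers to the same user drop at every AP, which is required for the aligned features in both FL schemes.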
We consider the existing centralized precoding schemes WMMSE [8] and ZF [7], as well as the decentralized precoding scheme MRT [11], as benchmark methods. In addition, we adopt centralized learning and decentralized learning methods to verify the effectiveness of our proposed FL-based schemes. In centralized learning, we use an unsupervised training method with the loss function being the negative sum-rate in (3). A large-scale DNN model, whose dimensions equal the sum of the dimensions of all APs' DNN models, is deployed at the CPU, and all APs transmit their CSI to the CPU to train this model. In the precoding design phase, the CPU collects the global CSI as model input and returns the corresponding precoding vectors to each AP through the backhaul link. In decentralized learning, we deploy a DNN model at each AP trained with supervised learning, where the precoding labels are obtained by the WMMSE method and the input feature of each local model is the local CSI H_b at AP b. Then, in the precoding design phase, each AP b uses its own CSI and local model to infer the corresponding precoding vector p_b. To illustrate the performance advantage of the proposed VFL, we also adopt the fully decentralized precoding (FDP) method in [27], where the number of SSBs is 16 and the received signal strength indicator (RSSI) feedback values are quantized with 8 bits.

1) ACHIEVABLE SUM-RATE VERSUS TRAINING EPOCHS

Fig. 7 illustrates the achievable sum-rate versus the number of training epochs with ρ = 5 dBm at each AP in the Rayleigh fading channel model. Several key observations can be made. First, the proposed VFL precoding method achieves performance close to the ZF method after convergence, and achieves performance gains over the MRT method of about 45% and 41% after 10 epochs, respectively; compared with the decentralized learning method, the performance is improved by about 42% and 38% after convergence, respectively. In addition, it nearly achieves the performance of centralized learning within 15 epochs. Second, the proposed HFL precoding method also yields a significant performance improvement of at least 30% over the two decentralized benchmark methods (MRT and decentralized learning). Third, comparing the FDP method with the proposed VFL scheme, in the small-scale system (B = 8, K = 4, M = 4) the proposed VFL method outperforms FDP by 3% in terms of sum-rate performance, while in the large-scale system (B = 16, K = 6, M = 8) this gap increases to 9%. The proposed VFL method is thus better suited to large-scale systems, because in such systems the local CSI features cannot be fully characterized by RSSI alone. Fourth, comparing the proposed VFL and HFL methods, VFL achieves higher sum-rate performance: the input H_b, b ∈ B, to the global model w in HFL cannot perfectly represent the features of the global CSI, whereas the VFL method always provides the complete global CSI features H to the global model v_AP. Although the sum-rate performance of the HFL-based scheme is lower than that of the VFL-based scheme, the HFL-based scheme can perform multiple local training rounds within one training epoch and therefore converges faster. Fig. 8 illustrates the results in the deepMIMO channel model. Performance trends similar to those in the Rayleigh fading channel model can be observed. The difference is that the proposed approaches suffer performance penalties of various degrees (at least 5%), because model training becomes more difficult in the non-IID channel model. Nonetheless, compared to the decentralized learning approach, the sum-rate performance of the two proposed methods increases by at least 15%.

2) CDF OF THE ACHIEVABLE PER-USER RATE

Fig. 9 and Fig. 10 show the cumulative distribution function (CDF) of the achievable per-user rate after convergence with ρ = 5 dBm in the two channel models. It is observed that, for both proposed FL-based methods, each UE has a more than 90% probability of being served at a rate higher than 5 bit/s/Hz in the Rayleigh fading channel scenarios; the probability decreases to 70% with the FDP method and does not exceed 50% with the MRT method and the decentralized learning method. In addition, compared with the FDP method, a user served by the VFL method has a higher probability of being served at a higher rate.
In the deepMIMO channel model, although the probability of a user being served at a stable rate is reduced, the two proposed methods still maintain a higher quality of service than the MRT method and the decentralized learning method. The proposed methods remain weaker than the WMMSE approach in terms of throughput performance; however, as described earlier, both the HFL and VFL methods are superior in saving communication overhead and improving system responsiveness.

3) ACHIEVABLE SUM-RATE VERSUS DIFFERENT TRANSMIT POWERS
In Fig. 11 and Fig. 12, we evaluate the sum-rate performance of the cell-free massive MIMO system at different transmit powers in the two channel models. The results show that the proposed HFL and VFL methods provide better sum-rate performance than the MRT method and the decentralized learning method across all transmit powers. In particular, at high transmit power MRT performs poorly because inter-cell interference dominates, while the decentralized learning technique suffers under all channel conditions due to the decision-conflict problem. The ZF method performs poorly at low transmit power owing to its weak resistance to channel noise, whereas the WMMSE and centralized learning methods consistently perform well in both the Rayleigh and the deepMIMO channel models. As described in Section V-C.1, the sum-rate performance of the proposed HFL-based method and the FDP method decreases compared with the proposed VFL-based method in large-scale systems. All learning-based approaches suffer performance penalties of various degrees in the deepMIMO channel model relative to the Rayleigh fading channel model. However, in terms of system throughput, the two proposed FL-based decentralized precoding approaches outperform the MRT and decentralized learning schemes by at least 20% when the transmit power exceeds 5 dBm. The simulation results also show that the two proposed methods cope better with channel noise than the ZF method at a transmit power of −10 dBm in the Rayleigh fading channel model. At high transmit power, the performance of the proposed VFL-based method is slightly below that of the centralized method, while the proposed HFL-based method improves performance by at least 20% over the decentralized learning method.

4) THE IMPACT OF ESTIMATION ERRORS FOR HFL
In the proposed HFL scheme in Section III-B, we assume an ideal error-free channel for each UE to feed back its received signal to each AP. In practice, the transmission of the received signal y_dl_k is affected by the wireless channel, i.e., by channel noise. To illustrate the robustness of the proposed HFL-based framework, we simulate the different channel models under different estimation errors. Let y_dl_b,k denote the estimated received signal at AP b; we then model the estimation error, i.e., y_dl_b,k − y_dl_k, as Gaussian with elements distributed as CN(0, δ^2) (we set the same variance at all APs), and the achievable sum-rate under different estimation errors is shown in Fig. 13. The results indicate that when the estimation error is large (δ^2 = 0.1), the sum-rate of the system decreases slightly. Overall, despite the presence of estimation error, its influence on the system's performance is limited. It is worth noting that, in the Rayleigh fading channel with ρ = 10 dBm and δ^2 = 0.01, the sum-rate performance is slightly higher than that with perfect transmission. On the one hand, the neural network is highly robust to perturbations of its input features; on the other hand, because both the training and inference phases employ RFAM with estimation errors to acquire model input features, the impact of estimation errors on the system performance is mitigated. In general, the proposed HFL-based method is robust to feedback errors.

Table 1 compares the amount of signaling exchange and the computational order of the proposed methods with those of the existing approaches (for learning-based methods, the amount of signaling exchange only considers the inference stage) with ρ = −5 dBm at each AP. The volume of signal exchange is determined by counting the total number of real matrix coefficients exchanged between the APs and the CPU, as well as between the APs and the UEs.
It is worth noting that during the precoding vector design phase, methods such as MRT, decentralized learning, VFL, and FDP do not involve signal exchange; consequently, the amount of signal exchange associated with these methods is disregarded. It is observed that the proposed HFL method significantly reduces the number of signaling exchanges compared to centralized methods (in general, BM ≫ K and B > M in cell-free massive MIMO systems). To analyze the computational complexity, we consider the computational order, i.e., the number of real multiplications involved in each matrix multiplication and inversion. Specifically, multiplying two matrices with dimensions C^{M×N} and C^{N×M} has a computational order of O(M^2 N), while inverting a matrix with dimensions C^{M×M} has a computational order of O(M^3). Table 1 indicates the computational complexity of all involved algorithms, where ξ is the number of iterations of the WMMSE algorithm and q denotes the use of q-bit quantization in the FDP method. Additionally, given that M_a < K × M and K_e < min(K, M), the computational complexity of the proposed HFL and VFL methods is substantially lower than that of the centralized methods.
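The computational-order conventions above can be made concrete with a small helper; the function names here are hypothetical, and constant factors (e.g., the four real multiplications per complex multiplication) are dropped, as in the table:

```python
def matmul_order(M, N):
    """Order of multiplying C^{M x N} by C^{N x M}: O(M^2 * N),
    i.e., M*M output entries, each an N-term inner product."""
    return M * M * N

def inverse_order(M):
    """Order of inverting a C^{M x M} matrix: O(M^3)."""
    return M ** 3

# Example with K = 6 users and M = 4 antennas: forming a 6x6 Gram matrix
# H H^H from a 6x4 channel matrix, then inverting it (as in ZF precoding)
print(matmul_order(6, 4), inverse_order(6))  # 144 216
```

Counting orders this way makes the Table 1 comparison mechanical: each algorithm's cost is the sum of the orders of its constituent matrix products and inversions.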

VI. CONCLUSION
In this paper, we adopt the FL framework to address the high computational complexity and expensive communication overhead associated with precoding design in cell-free massive MIMO systems. We propose partially and fully decentralized precoding design methods based on the HFL and VFL frameworks, respectively. Since the HFL method requires all features of each training data sample, we design a low-cost mechanism for acquiring the residual features. Using the VFL framework, we design a collaborative training mechanism based on unsupervised learning, which solves the decision-conflict problem of traditional decentralized methods and realizes a fully decentralized precoding design. The two proposed methods suit different scenarios: training-cost-sensitive (HFL) and inference-cost-sensitive (VFL). Experimental results show that the throughput performance of our proposed methods can match or exceed that of traditional precoding methods at a lower cost. In addition, system performance degrades when learning-based precoding methods are employed in complicated channel models. To address this bottleneck, in future work we will investigate methods (e.g., model structure design) to improve the effectiveness of models employed under complicated channel conditions for precoding scheme design in cell-free massive MIMO systems.