Pilot-Aided Distributed Multi-Group Multicast Precoding Design for Cell-Free Massive MIMO

We propose fully distributed multi-group multicast precoding designs for cell-free massive multiple-input multiple-output (MIMO) systems with modest training overhead. We target the minimization of the sum of the maximum mean squared errors (MSEs) over the multicast groups, which is then approximated with a weighted sum MSE minimization to simplify the computation and signaling. To design the joint network-wide multi-group multicast precoders at the base stations (BSs) and the combiners at the user equipments (UEs) in a fully distributed fashion, we adopt an iterative bi-directional training scheme with UE- and/or group-specific precoded uplink pilots and group-specific precoded downlink pilots. To this end, we introduce a new group-specific over-the-air uplink training resource that entirely eliminates the need for backhaul signaling for the channel state information (CSI) exchange. The precoders are optimized locally at each BS by means of either best-response or gradient-based updates, and the convergence of the two approaches is analyzed with respect to the centralized implementation with perfect CSI. Finally, numerical results show that the proposed distributed methods greatly outperform conventional cell-free massive MIMO precoding designs that rely solely on local CSI.


I. INTRODUCTION
Emerging shared wireless applications, such as video streaming, vehicular communications, augmented/mixed reality, and wireless coded caching, considerably increase the demand for multicasting services [2]. The multicast precoding framework was initially developed to transmit a single data stream to a group of user equipments (UEs) [3]. This was subsequently extended in [4] to serve several multicast groups with parallel data streams, each transmitted using a group-specific precoder under a rate constraint imposed by the worst UE in the multicast group. The conventional objective considered for the multi-group multicast precoding design is the max-min fairness, according to which the minimum signal-to-interference-plus-noise ratio (SINR) in each multicast group is maximized under a transmit power constraint [3], [4]. For this objective, [5], [6] proposed low-complexity methods to design the optimal multi-group multicast precoders. Such precoders have a similar structure to the weighted minimum mean squared error (MMSE) precoder, where the matched filtering (MF) front-end is given by a weighted sum of the effective channels in the multicast group [5].

The authors are with the Centre for Wireless Communications, University of Oulu, Finland (e-mail: {bikshapathi.gouda,italo.atzeni,antti.tolli}@oulu.fi). This work is supported by the Research Council of Finland (318927 6G Flagship, 336449 Profi6, and 348396 HIGH-6G) and by the European Commission (101095759 Hexa-X-II). Part of this work was presented at IEEE GLOBECOM 2022 [1].
The aforementioned works assume perfect channel state information (CSI) at the transmitter. However, in practice, the UE-specific channels need to be estimated. In time division duplexing (TDD) systems with channel reciprocity, this can be done via reverse link measurements, which usually require as many orthogonal pilots as the number of UEs to avoid pilot contamination. The number of orthogonal pilots can be substantially reduced by assigning a common pilot to all the UEs in a multicast group [7]. Hence, considering the resulting training overhead, using group-specific rather than UE-specific pilots for the multi-group multicast precoding design has the potential to increase the effective rate. The effective performance of multi-group multicasting in massive multiple-input multiple-output (MIMO) systems was analyzed in [8] under different precoding and pilot assignment strategies. This study was extended in [9] to include coexisting unicast and multi-group multicast transmissions. The multi-group multicast precoding design in a coordinated multi-cell scenario was considered in [10]-[12], where the CSI is assumed to be exchanged among the base stations (BSs) via backhaul signaling.
Cell-free massive MIMO is an extension of joint transmission coordinated multi-point to a UE-centric approach, where all the BSs jointly serve all the UEs to eliminate the inter-cell interference [2], [13], [14]. To facilitate the UE-centric joint processing, the BSs are connected to a central processing unit (CPU) via backhaul links to exchange the UE-specific data and CSI. Most works on cell-free massive MIMO consider simple local precoding strategies, such as MF, local (regularized) zero forcing, and local MMSE precoding [14]-[16], to circumvent the prohibitive complexity and backhaul signaling of large-scale centralized precoding designs. However, allowing (limited) coordination among the BSs to enable more advanced precoding strategies can provide significant performance gains [16]-[19]. In our previous work [20]-[22], we considered a cell-free massive MIMO unicasting scenario and proposed a fully distributed method based on iterative bi-directional training [23] to design the joint network-wide MMSE precoders locally at each BS. This scheme eliminates the need for backhaul signaling for the CSI exchange altogether and yields a performance close to that of the centralized implementation with perfect CSI.
Cell-free massive MIMO is especially suited for multicasting applications as it improves the rate of the cell-edge UEs and thus reduces the impact of the worst UE in each multicast group. Multi-group multicasting in cell-free massive MIMO systems has been considered, for example, in [24]-[26], where MF precoding is used for the data transmission. Equal power allocation among the multicast precoders at each BS was assumed in [24] to eliminate the need for backhaul signaling for the CSI exchange, whereas the optimal power allocation among the multicast groups was carried out in [25], [26] while assuming limited backhaul signaling.

A. Contribution
Most works on cell-free massive MIMO multi-group multicasting assume MF precoding to avoid the complexity and backhaul signaling issues associated with the centralized precoding design [24]-[26]. In this paper, we propose a distributed framework to design the multi-group multicast precoders with low complexity and without any backhaul signaling for the CSI exchange.
We begin by targeting the minimization of the sum of the maximum mean squared errors (MSEs) over the multicast groups, which is referred to in the following as the sum-group MSE. This approach achieves absolute MSE fairness within each multicast group, which is dictated by slowly varying dual variables that would need to be exchanged among the BSs via backhaul signaling in the distributed precoding designs. To avoid the resulting backhaul signaling overhead, we approximate the sum-group MSE minimization with a weighted sum MSE minimization, which greatly simplifies the distributed precoding design while only slightly relaxing the MSE fairness requirement. In this regard, we show that the in-built MSE fairness of the weighted sum MSE metric provides a good approximation for the original sum-group MSE metric, especially at high signal-to-noise ratio (SNR). Based on the reformulated problem, we propose a novel framework to design the joint network-wide multi-group multicast precoders at the BSs and the combiners at the UEs in a fully distributed fashion. To this end, we adopt an iterative bi-directional training mechanism [23] with UE- and/or group-specific precoded uplink pilots and group-specific precoded downlink pilots. The iterative optimization of the precoders is carried out via either best-response or gradient-based updates, and the convergence of the two approaches is analyzed with respect to the centralized implementation with perfect CSI. In our previous work on distributed precoding design for cell-free massive MIMO unicasting [20], we introduced a UE-specific over-the-air (OTA) uplink training resource to facilitate the distributed precoding design. In this paper, we propose a new group-specific OTA uplink training resource tailored for the multi-group multicasting scenario, which entirely eliminates the need for backhaul signaling for the CSI exchange and enables the proposed distributed precoding designs with modest training overhead. Moreover, the proposed framework can straightforwardly handle the coexistence of multicasting and unicasting by simply considering individual UEs as separate multicast groups. Numerical results show that the proposed distributed methods bring substantial gains over conventional cell-free massive MIMO precoding designs that rely solely on local CSI. Among the proposed distributed methods, the ones based on group-specific pilots always yield the best effective performance.
The contributions of this paper are summarized as follows.
• We establish that, with perfect CSI, the distributed precoding design with gradient-based updates converges to the same solution as its centralized implementation.
Part of this work is included in our conference paper [1], which presents the distributed multi-group multicast precoding design with best-response updates.

B. Outline
The rest of the paper is structured as follows. Section II introduces the system model for cell-free massive MIMO multi-group multicasting along with the iterative bi-directional training and channel estimation. Section III describes the sum-group MSE minimization and the approximation with a weighted sum MSE minimization with reference to the centralized implementation with perfect CSI. The proposed distributed multi-group multicast precoding designs with best-response and gradient-based updates are presented in Sections IV and V for perfect and imperfect CSI, respectively. Finally, Sections VI and VII provide the numerical results and the concluding remarks, respectively.

C. Notation
Lowercase and uppercase boldface letters denote vectors and matrices, respectively. $(\cdot)^{\mathrm{T}}$ and $(\cdot)^{\mathrm{H}}$ are the transpose and Hermitian transpose operators, respectively. $\|\cdot\|$ and $\|\cdot\|_{\mathrm{F}}$ represent the Euclidean norm for vectors and the Frobenius norm for matrices, respectively. $\mathrm{Re}[\cdot]$ and $\mathbb{E}[\cdot]$ are the real part and expectation operators, respectively. $\mathbf{I}_L$ denotes the $L$-dimensional identity matrix and $\mathbf{0}$ represents a zero vector with proper dimension. $\mathrm{Diag}(\cdot)$ and $\mathrm{blkdiag}(\cdot)$ represent diagonal and block-diagonal matrices, respectively. $[a_1, \ldots, a_L]$ denotes horizontal concatenation, whereas $\{a_1, \ldots, a_L\}$ and $\{a_\ell\}_{\ell \in \mathcal{L}}$ represent sets; the latter notation is occasionally relaxed as $\{a_\ell\}$ for brevity. $\mathcal{CN}(0, \sigma^2)$ is the complex normal distribution with zero mean and variance $\sigma^2$. Lastly, $\nabla_{\mathbf{x}}(\cdot)$ denotes the gradient with respect to $\mathbf{x}$, whereas $\mathcal{L}_{(\mathrm{P})}(\cdot)$ represents the Lagrangian of optimization problem (P).

II. SYSTEM MODEL
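Consider a cell-free massive MIMO system where a set of BSs $\mathcal{B} \triangleq \{1, \ldots, B\}$ serves a set of UEs $\mathcal{K} \triangleq \{1, \ldots, K\}$ in the downlink. Each BS is equipped with $M$ antennas and each UE with $N$ antennas. The UEs are divided into a set of non-overlapping multicast groups $\mathcal{G} \triangleq \{1, \ldots, G\}$, with $\mathcal{K}_g$ denoting the set of UEs in group $g \in \mathcal{G}$. In the following, we use $g_k$ as the index of the multicast group that contains UE $k$. The BSs transmit a single data stream to each multicast group, i.e., all the UEs $k \in \mathcal{K}_g$ are intended to receive the same data symbol $d_g$. Let $\mathbf{H}_{b,k} \in \mathbb{C}^{M \times N}$ be the uplink channel matrix between UE $k \in \mathcal{K}$ and BS $b \in \mathcal{B}$, and let $\mathbf{w}_{b,g} \in \mathbb{C}^{M \times 1}$ be the BS-specific precoder used by BS $b$ for group $g$. We use $\mathbf{H}_k \triangleq [\mathbf{H}_{1,k}^{\mathrm{T}}, \ldots, \mathbf{H}_{B,k}^{\mathrm{T}}]^{\mathrm{T}} \in \mathbb{C}^{BM \times N}$ and $\mathbf{w}_g \triangleq [\mathbf{w}_{1,g}^{\mathrm{T}}, \ldots, \mathbf{w}_{B,g}^{\mathrm{T}}]^{\mathrm{T}} \in \mathbb{C}^{BM \times 1}$ to denote the aggregated uplink channel matrix of UE $k$ and the aggregated precoder used for group $g$, respectively, which imply $\mathbf{H}_k^{\mathrm{H}} \mathbf{w}_g = \sum_{b \in \mathcal{B}} \mathbf{H}_{b,k}^{\mathrm{H}} \mathbf{w}_{b,g}$. We assume the per-BS transmit power constraints $\sum_{g \in \mathcal{G}} \|\mathbf{w}_{b,g}\|^2 \leq \rho_{\mathrm{BS}}$, $\forall b \in \mathcal{B}$, where $\rho_{\mathrm{BS}}$ denotes the maximum transmit power at each BS. Hence, the signal received at UE $k$ is given by

$$\mathbf{y}_k \triangleq \sum_{b \in \mathcal{B}} \sum_{g \in \mathcal{G}} \mathbf{H}_{b,k}^{\mathrm{H}} \mathbf{w}_{b,g} d_g + \mathbf{z}_k = \mathbf{H}_k^{\mathrm{H}} \mathbf{w}_{g_k} d_{g_k} + \sum_{\bar{g} \neq g_k} \mathbf{H}_k^{\mathrm{H}} \mathbf{w}_{\bar{g}} d_{\bar{g}} + \mathbf{z}_k \in \mathbb{C}^{N \times 1}, \tag{1}$$

where $d_{g_k}$ represents the data symbol intended for the group that contains UE $k$ and $\mathbf{z}_k \in \mathbb{C}^{N \times 1}$ is the additive white Gaussian noise (AWGN) with i.i.d. $\mathcal{CN}(0, \sigma_{\mathrm{UE}}^2)$ elements. Upon receiving $\mathbf{y}_k$, UE $k$ obtains a soft estimate of $d_{g_k}$ by applying the combiner $\mathbf{v}_k \in \mathbb{C}^{N \times 1}$ and the resulting SINR can be expressed as

$$\mathrm{SINR}_k \triangleq \frac{|\mathbf{v}_k^{\mathrm{H}} \mathbf{H}_k^{\mathrm{H}} \mathbf{w}_{g_k}|^2}{\sum_{\bar{g} \neq g_k} |\mathbf{v}_k^{\mathrm{H}} \mathbf{H}_k^{\mathrm{H}} \mathbf{w}_{\bar{g}}|^2 + \sigma_{\mathrm{UE}}^2 \|\mathbf{v}_k\|^2}. \tag{2}$$

Finally, the sum of the rates over the multicast groups, which is referred to in the following as the sum-group rate, is given by $R \triangleq \sum_{g \in \mathcal{G}} R_g$, where $R_g$ is the rate of group $g$ defined as

$$R_g \triangleq \min_{k \in \mathcal{K}_g} \log_2(1 + \mathrm{SINR}_k). \tag{3}$$

Note that (3), which is based on the SINR expression in (2), represents an upper bound on the system performance that assumes perfectly estimated SINRs for given precoders and combiners. In Section VI, we use this metric to evaluate the proposed distributed multi-group multicast precoding designs.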
In this paper, we aim to design the joint network-wide multi-group multicast precoders at the BSs and the combiners at the UEs in a fully distributed fashion assuming an ideal TDD setting with channel reciprocity between uplink and downlink.To this end, we adopt an iterative bi-directional training scheme that relies on estimating the effective uplink and downlink channels via precoded pilots, as discussed in detail in the following section.
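As a minimal numerical sketch of the system model above, the SINR in (2) and the sum-group rate in (3) can be evaluated for given precoders and combiners as follows. The function name `sum_group_rate` and the array layout (aggregated channels stored as $BM \times N$ arrays, aggregated precoders stacked column-wise) are illustrative assumptions and not part of the original formulation.

```python
import numpy as np

def sum_group_rate(H, W, V, groups, sigma2_ue):
    """Evaluate the sum-group rate for given precoders and combiners.

    H: list of K aggregated uplink channels H_k, each of shape (BM, N)
    W: aggregated precoders, shape (BM, G), column g is w_g
    V: list of K combiners v_k, each of shape (N,)
    groups: list of G lists with the UE indices of each multicast group
    """
    rates = []
    for g, members in enumerate(groups):
        sinrs = []
        for k in members:
            # |v_k^H H_k^H w_g|^2 for every group g
            gains = np.abs(V[k].conj() @ H[k].conj().T @ W) ** 2
            signal = gains[g]
            interference = gains.sum() - signal
            noise = sigma2_ue * np.linalg.norm(V[k]) ** 2
            sinrs.append(signal / (interference + noise))
        # The group rate is dictated by the worst UE in the group
        rates.append(np.log2(1 + min(sinrs)))
    return sum(rates), rates
```

The inner minimum over the UEs of a group reflects the multicast rate constraint: all the UEs in the group must be able to decode the same data stream.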

A. Pilot-Aided Channel Estimation and Iterative Bi-Directional Training
The centralized precoding design (considered as reference scheme and described in Section III-C) involves the transmission of antenna-specific uplink pilots, by which each BS estimates the antenna-specific uplink channels.
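Antenna-specific uplink channel estimation (UL). The estimation of the uplink channel $\mathbf{H}_{b,k}$ involves $N$ antenna-specific uplink pilots for UE $k$. In this context, let $\mathbf{P}_k^{\mathrm{UL}} \in \mathbb{C}^{\tau^{\mathrm{UL}} \times N}$ be the uplink pilot matrix of UE $k$, with $\|\mathbf{P}_k^{\mathrm{UL}}\|_{\mathrm{F}}^2 = \tau^{\mathrm{UL}} N$, $\forall k \in \mathcal{K}$. Moreover, let $\rho_{\mathrm{UE}}$ denote the maximum transmit power at each UE. Each UE $k$ synchronously transmits its pilot matrix $\mathbf{P}_k^{\mathrm{UL}}$, i.e.,

$$\mathbf{X}_k^{\mathrm{UL}} \triangleq \sqrt{\beta^{\mathrm{UL}}} (\mathbf{P}_k^{\mathrm{UL}})^{\mathrm{H}} \in \mathbb{C}^{N \times \tau^{\mathrm{UL}}},$$

where the power scaling factor $\beta^{\mathrm{UL}} \triangleq \rho_{\mathrm{UE}} / N$ (equal for all the UEs) ensures that $\mathbf{X}_k^{\mathrm{UL}}$ complies with the per-UE transmit power constraint. Then, the signal received at BS $b$ is given by

$$\mathbf{Y}_b^{\mathrm{UL}} \triangleq \sum_{k \in \mathcal{K}} \mathbf{H}_{b,k} \mathbf{X}_k^{\mathrm{UL}} + \mathbf{Z}_b^{\mathrm{UL}} \in \mathbb{C}^{M \times \tau^{\mathrm{UL}}},$$

where $\mathbf{Z}_b^{\mathrm{UL}}$ is the AWGN at BS $b$. Correlating $\mathbf{Y}_b^{\mathrm{UL}}$ with the pilot matrix of UE $k$ yields the estimate

$$\hat{\mathbf{H}}_{b,k} \triangleq \frac{1}{\tau^{\mathrm{UL}} \sqrt{\beta^{\mathrm{UL}}}} \mathbf{Y}_b^{\mathrm{UL}} \mathbf{P}_k^{\mathrm{UL}} = \mathbf{H}_{b,k} + \frac{1}{\tau^{\mathrm{UL}}} \sum_{\bar{k} \neq k} \mathbf{H}_{b,\bar{k}} (\mathbf{P}_{\bar{k}}^{\mathrm{UL}})^{\mathrm{H}} \mathbf{P}_k^{\mathrm{UL}} + \frac{1}{\tau^{\mathrm{UL}} \sqrt{\beta^{\mathrm{UL}}}} \mathbf{Z}_b^{\mathrm{UL}} \mathbf{P}_k^{\mathrm{UL}}, \tag{7}$$

where the last equality holds if $(\mathbf{P}_k^{\mathrm{UL}})^{\mathrm{H}} \mathbf{P}_k^{\mathrm{UL}} = \tau^{\mathrm{UL}} \mathbf{I}_N$, i.e., if there is no pilot contamination among the antennas of UE $k$.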
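The antenna-specific estimation step can be illustrated with the following sketch, which correlates the received uplink signal with each UE's pilot matrix. The helper names `dft_pilots` and `ls_channel_estimates`, as well as the choice of DFT columns as mutually orthogonal pilots, are assumptions made only for this example; the scheme itself admits arbitrary pilots.

```python
import numpy as np

def dft_pilots(K, N, tau):
    """Assumed pilot choice: disjoint columns of a tau x tau DFT matrix (tau >= K*N)."""
    F = np.fft.fft(np.eye(tau))  # unnormalized DFT, so F^H F = tau * I
    return [F[:, k * N:(k + 1) * N] for k in range(K)]

def ls_channel_estimates(H, P, rho_ue, noise=None):
    """Estimate each H_{b,k} at one BS from the superimposed uplink pilots.

    H: list of K channels of shape (M, N); P: list of K pilot matrices (tau, N).
    """
    N = H[0].shape[1]
    tau = P[0].shape[0]
    beta = rho_ue / N  # power scaling factor, equal for all the UEs
    # Received signal: sum_k H_k X_k with X_k = sqrt(beta) * P_k^H
    Y = sum(Hk @ (np.sqrt(beta) * Pk.conj().T) for Hk, Pk in zip(H, P))
    if noise is not None:
        Y = Y + noise
    # Correlate with each pilot matrix and undo the scaling
    return [Y @ Pk / (tau * np.sqrt(beta)) for Pk in P]
```

With mutually orthogonal pilots and no noise, the estimates coincide with the true channels; with non-orthogonal pilots, the cross-correlation terms appear as pilot contamination.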
On the other hand, the proposed distributed precoding designs and the local precoding designs (also considered as reference schemes and described in Appendix III) are based on iterative bi-directional training, whereby the precoders at the BSs and the combiners at the UEs are updated iteratively by means of uplink and downlink pilot-aided channel estimation [19], [23], [28]. Specifically, each bi-directional training iteration involves: i) the transmission of UE- and/or group-specific precoded uplink pilots from all the UEs, by which each BS estimates the UE- and/or group-specific effective uplink channels and updates its precoders; ii) the transmission of precoded downlink pilots from all the BSs, by which each UE estimates its effective downlink channel and updates its combiner. Iterative bi-directional training can reduce the training overhead compared with antenna-specific uplink channel estimation for multi-antenna UEs. More importantly, it eliminates the need for centralized precoding design since each BS (resp. UE) can update its precoder (resp. combiner) based on the effective uplink (resp. downlink) channel estimation. A schematic representation of iterative bi-directional training in a single-UE, single-BS setting is provided in Figure 1. In the following, we describe the different existing types of pilot-aided channel estimation that are adopted within the iterative bi-directional training, which will be heavily utilized in Sections III-C and V as well as in Appendix III. In Section V, we further introduce a new group-specific OTA uplink training resource tailored for the multi-group multicasting scenario, which entirely eliminates the need for backhaul signaling for the CSI exchange and enables the proposed distributed precoding designs with modest training overhead.

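UE-specific effective uplink channel estimation (UL-1). Let $\mathbf{h}_{b,k} \triangleq \mathbf{H}_{b,k} \mathbf{v}_k \in \mathbb{C}^{M \times 1}$ be the effective uplink channel between UE $k$ and BS $b$, and let $\mathbf{p}_k^{\mathrm{UL-1}} \in \mathbb{C}^{\tau^{\mathrm{UL-1}} \times 1}$ denote the uplink pilot of UE $k$, with $\|\mathbf{p}_k^{\mathrm{UL-1}}\|^2 = \tau^{\mathrm{UL-1}}$, $\forall k \in \mathcal{K}$. Each UE $k$ synchronously transmits its pilot $\mathbf{p}_k^{\mathrm{UL-1}}$ using its scaled combiner $\mathbf{v}_k$ as precoder, i.e.,

$$\mathbf{X}_k^{\mathrm{UL-1}} \triangleq \sqrt{\beta^{\mathrm{UL-1}}} \mathbf{v}_k (\mathbf{p}_k^{\mathrm{UL-1}})^{\mathrm{H}} \in \mathbb{C}^{N \times \tau^{\mathrm{UL-1}}},$$

where the power scaling factor $\beta^{\mathrm{UL-1}}$ (equal for all the UEs) ensures that $\mathbf{X}_k^{\mathrm{UL-1}}$ complies with the per-UE transmit power constraint. Then, the signal received at BS $b$ is given by

$$\mathbf{Y}_b^{\mathrm{UL-1}} \triangleq \sum_{k \in \mathcal{K}} \mathbf{H}_{b,k} \mathbf{X}_k^{\mathrm{UL-1}} + \mathbf{Z}_b^{\mathrm{UL-1}} = \sqrt{\beta^{\mathrm{UL-1}}} \sum_{k \in \mathcal{K}} \mathbf{h}_{b,k} (\mathbf{p}_k^{\mathrm{UL-1}})^{\mathrm{H}} + \mathbf{Z}_b^{\mathrm{UL-1}} \in \mathbb{C}^{M \times \tau^{\mathrm{UL-1}}},$$

where $\mathbf{Z}_b^{\mathrm{UL-1}}$ is the AWGN at BS $b$.

Group-specific effective uplink channel estimation (UL-2). In the antenna-specific and UE-specific channel estimations described above, the BSs may apply UE-specific weights to the channel estimates to promote fairness among the UEs in a multicast group. On the contrary, in the group-specific channel estimation, any UE-specific weights must be already incorporated during the pilot transmission. Accordingly, let $\omega_k$ be the weight of UE $k$ and let $\mathbf{f}_{b,g} \triangleq \sum_{k \in \mathcal{K}_g} \omega_k \mathbf{H}_{b,k} \mathbf{v}_k \in \mathbb{C}^{M \times 1}$ denote the effective uplink channel between $\mathcal{K}_g$ and BS $b$. Furthermore, let $\mathbf{p}_g^{\mathrm{UL-2}} \in \mathbb{C}^{\tau^{\mathrm{UL-2}} \times 1}$ be the uplink pilot of group $g$, with $\|\mathbf{p}_g^{\mathrm{UL-2}}\|^2 = \tau^{\mathrm{UL-2}}$, $\forall g \in \mathcal{G}$. Each UE $k$ synchronously transmits its pilot $\mathbf{p}_{g_k}^{\mathrm{UL-2}}$ using its scaled combiner $\mathbf{v}_k$ as precoder, i.e.,

$$\mathbf{X}_k^{\mathrm{UL-2}} \triangleq \sqrt{\beta^{\mathrm{UL-2}}} \omega_k \mathbf{v}_k (\mathbf{p}_{g_k}^{\mathrm{UL-2}})^{\mathrm{H}} \in \mathbb{C}^{N \times \tau^{\mathrm{UL-2}}},$$

where the power scaling factor $\beta^{\mathrm{UL-2}}$ (equal for all the UEs) ensures that $\mathbf{X}_k^{\mathrm{UL-2}}$ complies with the per-UE transmit power constraint. Then, the signal received at BS $b$ is given by

$$\mathbf{Y}_b^{\mathrm{UL-2}} \triangleq \sum_{k \in \mathcal{K}} \mathbf{H}_{b,k} \mathbf{X}_k^{\mathrm{UL-2}} + \mathbf{Z}_b^{\mathrm{UL-2}} = \sqrt{\beta^{\mathrm{UL-2}}} \sum_{g \in \mathcal{G}} \mathbf{f}_{b,g} (\mathbf{p}_g^{\mathrm{UL-2}})^{\mathrm{H}} + \mathbf{Z}_b^{\mathrm{UL-2}} \in \mathbb{C}^{M \times \tau^{\mathrm{UL-2}}},$$

where $\mathbf{Z}_b^{\mathrm{UL-2}}$ is the AWGN at BS $b$.

Effective downlink channel estimation (DL). Let $\mathbf{g}_k \triangleq \sum_{b \in \mathcal{B}} \mathbf{H}_{b,k}^{\mathrm{H}} \mathbf{w}_{b,g_k} \in \mathbb{C}^{N \times 1}$ be the effective downlink channel between all the BSs and UE $k$. Moreover, let $\mathbf{p}_g^{\mathrm{DL}} \in \mathbb{C}^{\tau^{\mathrm{DL}} \times 1}$ denote the downlink pilot of group $g$, with $\|\mathbf{p}_g^{\mathrm{DL}}\|^2 = \tau^{\mathrm{DL}}$, $\forall g \in \mathcal{G}$. Each BS $b$ synchronously transmits a superposition of the pilots $\{\mathbf{p}_g^{\mathrm{DL}}\}_{g \in \mathcal{G}}$ after precoding them with the corresponding precoders $\{\mathbf{w}_{b,g}\}_{g \in \mathcal{G}}$, i.e.,

$$\mathbf{X}_b^{\mathrm{DL}} \triangleq \sum_{g \in \mathcal{G}} \mathbf{w}_{b,g} (\mathbf{p}_g^{\mathrm{DL}})^{\mathrm{H}} \in \mathbb{C}^{M \times \tau^{\mathrm{DL}}}.$$

Then, the signal received at UE $k$ is given by

$$\mathbf{Y}_k^{\mathrm{DL}} \triangleq \sum_{b \in \mathcal{B}} \mathbf{H}_{b,k}^{\mathrm{H}} \mathbf{X}_b^{\mathrm{DL}} + \mathbf{Z}_k^{\mathrm{DL}} = \sum_{g \in \mathcal{G}} \mathbf{H}_k^{\mathrm{H}} \mathbf{w}_g (\mathbf{p}_g^{\mathrm{DL}})^{\mathrm{H}} + \mathbf{Z}_k^{\mathrm{DL}} \in \mathbb{C}^{N \times \tau^{\mathrm{DL}}},$$

where $\mathbf{Z}_k^{\mathrm{DL}}$ is the AWGN at UE $k$ with i.i.d. $\mathcal{CN}(0, \sigma_{\mathrm{UE}}^2)$ elements.

Note that all the above pilot-aided channel estimation schemes can be implemented with arbitrary pilots and, hence, any possible pilot contamination is implicitly accounted for.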
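To illustrate why group-specific pilots reduce the training overhead, the sketch below simulates the UL-2 transmission at a single BS and recovers the weighted group channels $\sum_{k \in \mathcal{K}_g} \omega_k \mathbf{H}_{b,k} \mathbf{v}_k$ using only $G$ pilots. The function name `group_effective_channels`, the correlation-based estimator, and the assumption of mutually orthogonal group pilots are illustrative choices for this example.

```python
import numpy as np

def group_effective_channels(H_b, V, weights, groups, pilots, beta=1.0, noise=None):
    """Estimate the group-specific effective uplink channels at one BS.

    H_b: list of K local channels H_{b,k}, each of shape (M, N)
    V: list of K combiners v_k, each of shape (N,)
    weights: UE-specific weights omega_k, applied at the UE side
    pilots: array of shape (tau, G), column g is the pilot of group g
            (assumed orthogonal with squared norm tau)
    """
    tau = pilots.shape[0]
    Y = 0
    for g, members in enumerate(groups):
        for k in members:
            # Weighted, precoded pilot of UE k: sqrt(beta) * omega_k * v_k p_g^H
            X_k = np.sqrt(beta) * weights[k] * np.outer(V[k], pilots[:, g].conj())
            Y = Y + H_b[k] @ X_k
    if noise is not None:
        Y = Y + noise
    # Column g of the result estimates the effective channel of group g
    return Y @ pilots / (tau * np.sqrt(beta))
```

Since the UEs of a group share one pilot, the BS observes only the weighted superposition of their effective channels, which is why the weights must be applied before transmission.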

III. PROBLEM FORMULATION
The goal of this paper is to propose fully distributed multi-group multicast precoding designs for cell-free massive MIMO systems based on the MMSE criterion. In this section, we establish the basis for the distributed precoding design by considering the centralized implementation with perfect CSI. First, in Section III-A, we focus on the sum-group MSE minimization and identify several practical challenges with its distributed implementation. Then, in Section III-B, we approximate the sum-group MSE minimization with a weighted sum MSE minimization, based on which we develop the proposed distributed precoding designs presented in Sections IV and V with perfect and imperfect CSI, respectively.

A. Sum-Group MSE Minimization
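The sum-group MSE minimization achieves absolute MSE fairness within each multicast group through the min-max MSE criterion subject to the per-BS transmit power constraints. Accordingly, the precoders and combiners are optimized by solving

$$\min_{\{\mathbf{w}_g\}, \{\mathbf{v}_k\}} \ \sum_{g \in \mathcal{G}} \max_{k \in \mathcal{K}_g} \mathrm{MSE}_k \quad \text{s.t.} \quad \sum_{g \in \mathcal{G}} \|\mathbf{E}_b \mathbf{w}_g\|^2 \leq \rho_{\mathrm{BS}}, \ \forall b \in \mathcal{B}, \tag{24}$$

where $\mathrm{MSE}_k$ is the MSE of UE $k$, which for independent, unit-power data symbols is defined as

$$\mathrm{MSE}_k \triangleq \mathbb{E}\big[|\mathbf{v}_k^{\mathrm{H}} \mathbf{y}_k - d_{g_k}|^2\big] = \sum_{\bar{g} \in \mathcal{G}} |\mathbf{v}_k^{\mathrm{H}} \mathbf{H}_k^{\mathrm{H}} \mathbf{w}_{\bar{g}}|^2 - 2 \mathrm{Re}[\mathbf{v}_k^{\mathrm{H}} \mathbf{H}_k^{\mathrm{H}} \mathbf{w}_{g_k}] + \sigma_{\mathrm{UE}}^2 \|\mathbf{v}_k\|^2 + 1 \tag{25}$$

and $\mathbf{E}_b \in \mathbb{R}^{M \times BM}$ is a selection matrix such that $\mathbf{E}_b \mathbf{w}_g = \mathbf{w}_{b,g}$. The problem in (24) is convex with respect to either the precoders or the combiners but not jointly convex with respect to both. Hence, we use alternating optimization, whereby the precoders are optimized for fixed combiners and vice versa in an iterative best-response fashion. Before describing each step of the alternating optimization, let us define $t_g \triangleq \max_{k \in \mathcal{K}_g} \mathrm{MSE}_k$ and rewrite (24) in epigraph form as

$$\min_{\{\mathbf{w}_g\}, \{\mathbf{v}_k\}, \{t_g\}} \ \sum_{g \in \mathcal{G}} t_g \quad \text{s.t.} \quad \mathrm{MSE}_k \leq t_{g_k}, \ \forall k \in \mathcal{K}, \quad \sum_{g \in \mathcal{G}} \|\mathbf{E}_b \mathbf{w}_g\|^2 \leq \rho_{\mathrm{BS}}, \ \forall b \in \mathcal{B}. \tag{27}$$

Optimization of the combiners. For a fixed set of precoders $\{\mathbf{w}_g\}_{g \in \mathcal{G}}$, the combiners $\{\mathbf{v}_k\}_{k \in \mathcal{K}}$ are optimized by solving the following convex problem:

$$\min_{\{\mathbf{v}_k\}, \{t_g\}} \ \sum_{g \in \mathcal{G}} t_g \quad \text{s.t.} \quad \mathrm{MSE}_k \leq t_{g_k}, \ \forall k \in \mathcal{K}. \tag{28}$$

The Lagrangian of (28) can be written as

$$\mathcal{L}_{(28)}(\{\mathbf{v}_k, t_g, \mu_k\}) \triangleq \sum_{g \in \mathcal{G}} t_g + \sum_{k \in \mathcal{K}} \mu_k (\mathrm{MSE}_k - t_{g_k}), \tag{29}$$

where $\mu_k$ is the dual variable corresponding to each per-UE MSE constraint in (27). Note that the optimal $\{\mu_k\}_{k \in \mathcal{K}}$ are such that the MSE objectives of the UEs in a multicast group are equal. For example, if UE $k$ is subject to poor channel conditions, the optimal $\mu_k$ will be large to force the reduction of its MSE objective. Then, the optimal $\mathbf{v}_k$ is obtained by setting $\nabla_{\mathbf{v}_k} \mathcal{L}_{(28)}(\{\mathbf{v}_k, t_g, \mu_k\}) = \mathbf{0}$, which yields

$$\mathbf{v}_k = \bigg( \sum_{\bar{g} \in \mathcal{G}} \mathbf{H}_k^{\mathrm{H}} \mathbf{w}_{\bar{g}} \mathbf{w}_{\bar{g}}^{\mathrm{H}} \mathbf{H}_k + \sigma_{\mathrm{UE}}^2 \mathbf{I}_N \bigg)^{-1} \mathbf{H}_k^{\mathrm{H}} \mathbf{w}_{g_k}. \tag{30}$$

Optimization of the precoders. For a fixed set of combiners $\{\mathbf{v}_k\}_{k \in \mathcal{K}}$, the precoders $\{\mathbf{w}_g\}_{g \in \mathcal{G}}$ are optimized by solving the following convex problem:

$$\min_{\{\mathbf{w}_g\}, \{t_g\}} \ \sum_{g \in \mathcal{G}} t_g \quad \text{s.t.} \quad \mathrm{MSE}_k \leq t_{g_k}, \ \forall k \in \mathcal{K}, \quad \sum_{g \in \mathcal{G}} \|\mathbf{E}_b \mathbf{w}_g\|^2 \leq \rho_{\mathrm{BS}}, \ \forall b \in \mathcal{B}, \tag{31}$$

which can be solved, e.g., via CVX [29]. Alternatively, one can resort to the Karush-Kuhn-Tucker (KKT) conditions, which also conveniently reveal the optimal multi-group multicast precoding structure. In this regard, the Lagrangian of (31) can be written as

$$\mathcal{L}_{(31)}(\{\mathbf{w}_g, t_g, \mu_k, \lambda_b\}) \triangleq \sum_{g \in \mathcal{G}} t_g + \sum_{k \in \mathcal{K}} \mu_k (\mathrm{MSE}_k - t_{g_k}) + \sum_{b \in \mathcal{B}} \lambda_b \bigg( \sum_{g \in \mathcal{G}} \|\mathbf{E}_b \mathbf{w}_g\|^2 - \rho_{\mathrm{BS}} \bigg), \tag{32}$$

where $\lambda_b$ is the dual variable corresponding to each per-BS transmit power constraint in (27). Then, the optimal $\mathbf{w}_g$ is obtained by setting $\nabla_{\mathbf{w}_g} \mathcal{L}_{(31)}(\{\mathbf{w}_g, t_g, \mu_k, \lambda_b\}) = \mathbf{0}$, which yields

$$\mathbf{w}_g = \bigg( \sum_{k \in \mathcal{K}} \mu_k \mathbf{H}_k \mathbf{v}_k \mathbf{v}_k^{\mathrm{H}} \mathbf{H}_k^{\mathrm{H}} + \sum_{b \in \mathcal{B}} \lambda_b \mathbf{E}_b^{\mathrm{T}} \mathbf{E}_b \bigg)^{-1} \sum_{k \in \mathcal{K}_g} \mu_k \mathbf{H}_k \mathbf{v}_k. \tag{33}$$

The above expression of $\mathbf{w}_g$ depends on the dual variables $\{\mu_k\}_{k \in \mathcal{K}}$ and $\{\lambda_b\}_{b \in \mathcal{B}}$. Such dual variables can be updated iteratively using the sub-gradient method as detailed in Appendix I [6], [30], and their values after convergence are finally used in (33).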
From the expression of the aggregated precoder $\mathbf{w}_g$ in (33), it is evident that the BS-specific precoders $\{\mathbf{w}_{b,g}\}_{b \in \mathcal{B}}$ also rely on the dual variables $\{\mu_k\}_{k \in \mathcal{K}}$. To compute each $\mathbf{w}_{b,g}$ locally at BS $b$, extensive backhaul signaling is required to iteratively update the dual variables $\{\mu_k\}_{k \in \mathcal{K}}$ either at the CPU or at each BS in parallel. To simplify the distributed precoding design, we propose to relax the absolute MSE fairness requirement within each multicast group, which leads to a weighted sum MSE minimization. In the following section, we describe the reformulated problem and the corresponding centralized precoding design with perfect CSI.
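As a numerical sanity check of the combiner step, the following sketch computes the MMSE combiner for one UE and verifies that it minimizes the per-UE MSE for fixed precoders. The function names `mmse_combiner` and `mse` and the array layout are assumptions made for this example; the MSE expression assumes independent, unit-power data symbols.

```python
import numpy as np

def mmse_combiner(Hk, W, g_k, sigma2_ue):
    """MMSE combiner of a UE with aggregated channel Hk (BM, N) in group g_k.

    W: aggregated precoders, shape (BM, G).
    """
    G_eff = Hk.conj().T @ W  # (N, G), column g is H_k^H w_g
    R = G_eff @ G_eff.conj().T + sigma2_ue * np.eye(Hk.shape[1])
    return np.linalg.solve(R, G_eff[:, g_k])

def mse(Hk, W, v, g_k, sigma2_ue):
    """Per-UE MSE for combiner v, assuming unit-power independent symbols."""
    a = v.conj() @ Hk.conj().T @ W  # (G,), entries v^H H_k^H w_g
    return float(np.sum(np.abs(a) ** 2) - 2 * np.real(a[g_k])
                 + sigma2_ue * np.linalg.norm(v) ** 2 + 1)
```

Because the MSE is a strictly convex quadratic function of the combiner, any perturbation of the MMSE combiner increases the MSE, regardless of the value of the dual variable of that UE.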

B. Weighted Sum MSE Minimization
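To circumvent the shortcomings of the original problem formulation described in Section III-A, we approximate the sum-group MSE objective in (24) with a weighted sum MSE objective. Accordingly, the precoders and combiners are optimized by solving

$$\min_{\{\mathbf{w}_g\}, \{\mathbf{v}_k\}} \ \sum_{k \in \mathcal{K}} \omega_k \mathrm{MSE}_k \quad \text{s.t.} \quad \sum_{g \in \mathcal{G}} \|\mathbf{E}_b \mathbf{w}_g\|^2 \leq \rho_{\mathrm{BS}}, \ \forall b \in \mathcal{B}, \tag{34}$$

where we recall that $\omega_k$ is the weight of UE $k$. This choice stems from the fact that the weighted sum MSE metric provides some in-built MSE fairness among all the UEs. Since the problem in (34) is convex with respect to either the precoders or the combiners but not jointly convex with respect to both, we use alternating optimization as in the previous section. For a fixed set of combiners $\{\mathbf{v}_k\}_{k \in \mathcal{K}}$, the precoders $\{\mathbf{w}_g\}_{g \in \mathcal{G}}$ can be optimized, e.g., via CVX [29] or by resorting to the KKT conditions. In this regard, the Lagrangian of (34) can be written as

$$\mathcal{L}_{(34)}(\{\mathbf{w}_g, \lambda_b\}) \triangleq \sum_{k \in \mathcal{K}} \omega_k \mathrm{MSE}_k + \sum_{b \in \mathcal{B}} \lambda_b \bigg( \sum_{g \in \mathcal{G}} \|\mathbf{E}_b \mathbf{w}_g\|^2 - \rho_{\mathrm{BS}} \bigg). \tag{35}$$

Then, the optimal $\mathbf{w}_g$ is obtained by setting $\nabla_{\mathbf{w}_g} \mathcal{L}_{(34)}(\{\mathbf{w}_g, \lambda_b\}) = \mathbf{0}$, which yields

$$\mathbf{w}_g = \bigg( \sum_{k \in \mathcal{K}} \omega_k \mathbf{H}_k \mathbf{v}_k \mathbf{v}_k^{\mathrm{H}} \mathbf{H}_k^{\mathrm{H}} + \sum_{b \in \mathcal{B}} \lambda_b \mathbf{E}_b^{\mathrm{T}} \mathbf{E}_b \bigg)^{-1} \sum_{k \in \mathcal{K}_g} \omega_k \mathbf{H}_k \mathbf{v}_k. \tag{36}$$

It is straightforward to notice the resemblance between (36) and (33). If the optimal dual variables $\{\mu_k\}_{k \in \mathcal{K}}$ of the sum-group MSE minimization were known in advance, one could replace the weights $\{\omega_k\}_{k \in \mathcal{K}}$ in (36) with the optimal $\{\mu_k\}_{k \in \mathcal{K}}$ at each alternating optimization iteration, which would lead to the same solution as (33). However, the optimal $\{\mu_k\}_{k \in \mathcal{K}}$ cannot be known in advance. Moreover, tuning the weights to match the dual variables at each alternating optimization iteration would generate the same complexity and backhaul signaling overhead as the original sum-group MSE minimization.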
To simplify the distributed precoding design, we consider the sum MSE minimization with fixed UE-specific weights, which can be assigned to promote fairness or priority within each multicast group based on prior information, e.g., about their channel conditions. Without loss of generality, we fix equal weights for all the UEs, i.e., $\omega_k = \omega$, $\forall k \in \mathcal{K}$, a choice justified by the uniform service provisioning of cell-free massive MIMO systems. Hence, in the following, we refer to (34) simply as sum MSE minimization. Though slightly suboptimal, as demonstrated later, this approach leads to much simpler computation and signaling, and is characterized by faster convergence. Note that, especially at high SNR, the UE-specific rates derived from the sum MSE minimization are close to those obtained with the sum-group MSE minimization. This is formalized in Proposition 1. Furthermore, for a fixed set of precoders $\{\mathbf{w}_g\}_{g \in \mathcal{G}}$, the optimal combiners $\{\mathbf{v}_k\}_{k \in \mathcal{K}}$ for (34) are again obtained as in (30) based on the effective downlink channel estimation described in Section II-A.
Proposition 1. As $\rho_{\mathrm{BS}} \to \infty$, i.e., at high SNR, the UE-specific rates obtained with the sum MSE minimization in (34) asymptotically approximate the ones resulting from the sum-group MSE minimization in (24).
Proof: Without loss of generality, let us consider a single BS and let us define . Assuming that UE $k$ adopts the MMSE combiner in (30), its MSE can be expressed as ). As $\rho_{\mathrm{BS}} \to \infty$, the precoder in (36) approaches a solution similar to zero forcing, i.e., $\mathbf{w}_g$ lies in the nullspace of the effective uplink channels of the UEs $\bar{k} \notin \mathcal{K}_g$ and is matched towards the superposition of the effective uplink channels of the UEs $k \in \mathcal{K}_g$. Thus, considering UE $\bar{k} \notin \mathcal{K}_{g_k}$, the inner product between $\mathbf{w}_{g_k}$ and the effective uplink channel of UE $\bar{k}$ tends to zero, which leads to $c_{k\bar{k}} \to 0$, $\forall \bar{k} \notin \mathcal{K}_{g_k}$. In this context, all the UEs experience high SINR, and the SINR of UE $k$ can be approximated as (cf. (2)) where $p_g$ is the transmit power allocated to group $g$. Finally, when $\omega_k = \omega$, $\forall k \in \mathcal{K}$, the sum MSE minimization in (34) reduces to the following power allocation problem: From the KKT conditions detailed in Appendix II, we obtain the optimal $p_g$ as with . Consequently, the rate difference between UE $k$ and UE $\bar{k}$ at high SNR can be written as which is independent of $\rho_{\mathrm{BS}}$. This suggests that all the UE-specific rates increase uniformly with the transmit power.
Considering the MSE fairness requirement of (27), it follows that the rate of UE $k \in \mathcal{K}_g$ obtained with the sum-group MSE minimization lies within the minimum and the maximum rates among all the UEs $\bar{k} \in \mathcal{K}_g$ obtained with the sum MSE minimization. Hence, at high SNR, the UE-specific rates obtained with the sum MSE minimization asymptotically approximate the ones resulting from the sum-group MSE minimization.
In the rest of the paper, we focus on the sum MSE minimization in (34) to design the multi-group multicast precoders. The proposed distributed precoding designs presented in Sections IV and V with perfect and imperfect CSI, respectively, are compared with different reference schemes, namely: i) the centralized precoding design presented in Section III-C, which is referred to in the following as the Centralized; and ii) the local precoding designs based on MMSE and MF described in Appendix III, which are referred to in the following as the Local MMSE and the Local MF, respectively [31]. While the primary focus of this paper is to design the joint network-wide multi-group multicast precoders at the BSs in a fully distributed fashion, we point out that the Centralized, the Local MMSE, and the Local MF are also part of our contribution as they are tailored for the sum MSE minimization in the multi-group multicasting scenario.
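The structure of the sum MSE precoders for fixed combiners can be sketched numerically as follows. In this example, the dual variables of the per-BS power constraints are assumed to be given (in practice they must be tuned so that the power constraints are met), and the names `sum_mse_precoders` and `lagrangian` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def sum_mse_precoders(H, V, groups, weights, lam, M):
    """Weighted sum MSE precoders for fixed combiners and given dual variables.

    H: list of K aggregated channels (BM, N); V: list of K combiners (N,)
    lam: array of B dual variables (assumed given); M: antennas per BS
    Returns W of shape (BM, G), column g is the aggregated precoder of group g.
    """
    w = np.asarray(weights, dtype=float)
    U = np.stack([Hk @ vk for Hk, vk in zip(H, V)], axis=1)  # (BM, K), H_k v_k
    A = (U * w) @ U.conj().T + np.diag(np.repeat(lam, M))
    rhs = np.stack([(U[:, m] * w[m]).sum(axis=1) for m in groups], axis=1)
    return np.linalg.solve(A, rhs)

def lagrangian(H, V, W, groups, weights, lam, M, sigma2_ue):
    """Weighted sum MSE plus dual-weighted per-BS powers (constant term omitted)."""
    total = 0.0
    for g, members in enumerate(groups):
        for k in members:
            a = V[k].conj() @ H[k].conj().T @ W
            total += weights[k] * (np.sum(np.abs(a) ** 2) - 2 * np.real(a[g])
                                   + sigma2_ue * np.linalg.norm(V[k]) ** 2 + 1)
    power = np.sum(np.abs(W) ** 2, axis=1).reshape(-1, M).sum(axis=1)  # per-BS power
    return float(total + np.dot(lam, power))
```

For positive dual variables, the Lagrangian is strictly convex in the precoders, so the returned solution is its unique minimizer; the right-hand side of the linear system is the weighted sum of the effective channels in each multicast group, which plays the role of the MF front-end.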

C. Centralized Precoding Design with Pilot-Aided Channel Estimation
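The practical implementation of the Centralized requires the antenna-specific uplink channel estimation (see Section II-A) to enable the computation of the precoders in (36) and the combiners in (30) at the CPU. First, each BS $b$ obtains $\{\hat{\mathbf{H}}_{b,k}\}_{k \in \mathcal{K}}$ and forwards them to the CPU via backhaul signaling. Then, the CPU computes the aggregated precoders $\{\mathbf{w}_g\}_{g \in \mathcal{G}}$ and the combiners $\{\mathbf{v}_k\}_{k \in \mathcal{K}}$ via alternating optimization by replacing $\{\mathbf{H}_{b,k}\}$ with $\{\hat{\mathbf{H}}_{b,k}\}$ in (36) and (30), respectively. After convergence, the resulting BS-specific precoders are fed back to the corresponding BSs via backhaul signaling. Finally, the effective downlink channel estimation (see Section II-A) is carried out to allow each UE $k$ to compute its (final) combiner as

$$\mathbf{v}_k = \big( \mathbf{Y}_k^{\mathrm{DL}} (\mathbf{Y}_k^{\mathrm{DL}})^{\mathrm{H}} \big)^{-1} \mathbf{Y}_k^{\mathrm{DL}} \mathbf{p}_{g_k}^{\mathrm{DL}}, \tag{45}$$

where $\mathbf{Y}_k^{\mathrm{DL}}$ is the signal received at UE $k$ during the effective downlink channel estimation and $\mathbf{p}_{g_k}^{\mathrm{DL}}$ is the downlink pilot of its group. Note that (45) coincides with (30) for perfect CSI, i.e., when $\tau^{\mathrm{DL}} \to \infty$. The implementation of the Centralized is summarized in Algorithm 1.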
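Algorithm 1 (Centralized)
1) UL: Each UE $k$ transmits its antenna-specific uplink pilots.
2) Each BS $b$ obtains $\{\hat{\mathbf{H}}_{b,k}\}_{k \in \mathcal{K}}$ as in (7) and forwards them to the CPU via backhaul signaling.
Initialization: Combiners $\{\mathbf{v}_k\}_{k \in \mathcal{K}}$.
Until a predefined termination criterion is satisfied, do:
3) The CPU computes the precoders $\{\mathbf{w}_g\}_{g \in \mathcal{G}}$ as in (36) and the combiners $\{\mathbf{v}_k\}_{k \in \mathcal{K}}$ as in (30).
End
4) The CPU feeds back the BS-specific precoders to the corresponding BSs via backhaul signaling.
5) DL: Each BS $b$ transmits its precoded downlink pilots and each UE $k$ computes its combiner as in (45).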

IV. DISTRIBUTED PRECODING DESIGN WITH PERFECT CSI
In this section, we describe the proposed distributed multi-group multicast precoding designs with perfect CSI and backhaul signaling for the CSI exchange. Their practical implementation with imperfect CSI and without any backhaul signaling for the CSI exchange is presented in Section V. The precoders are optimized locally at each BS by means of either best-response or gradient-based updates, as discussed in the following sections. Regardless of the computation of the precoders, each UE $k$ computes its combiner as in (30) with perfect CSI.

A. Best-Response Distributed Precoding Design
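In the best-response distributed precoding design, which is referred to in the following as the Distributed BR, the optimal $\mathbf{w}_{b,g}$ is obtained by setting $\nabla_{\mathbf{w}_{b,g}} \mathcal{L}_{(34)}(\{\mathbf{w}_g, \lambda_b\}) = \mathbf{0}$, which yields

$$\mathbf{w}_{b,g} = \bigg( \sum_{k \in \mathcal{K}} \omega_k \mathbf{H}_{b,k} \mathbf{v}_k \mathbf{v}_k^{\mathrm{H}} \mathbf{H}_{b,k}^{\mathrm{H}} + \lambda_b \mathbf{I}_M \bigg)^{-1} \bigg( \sum_{k \in \mathcal{K}_g} \omega_k \mathbf{H}_{b,k} \mathbf{v}_k - \boldsymbol{\xi}_{b,g} \bigg), \tag{46}$$

with

$$\boldsymbol{\xi}_{b,g} \triangleq \sum_{\bar{b} \neq b} \sum_{k \in \mathcal{K}} \omega_k \mathbf{H}_{b,k} \mathbf{v}_k \mathbf{v}_k^{\mathrm{H}} \mathbf{H}_{\bar{b},k}^{\mathrm{H}} \mathbf{w}_{\bar{b},g}.$$

The above precoder can be computed locally at BS $b$ provided that $\boldsymbol{\xi}_{b,g}$, which comprises group-specific cross terms from the other BSs, is known. To reconstruct $\boldsymbol{\xi}_{b,g}$, BS $b$ needs to obtain $\{\mathbf{v}_k^{\mathrm{H}} \mathbf{H}_{\bar{b},k}^{\mathrm{H}} \mathbf{w}_{\bar{b},g}\}_{k \in \mathcal{K}}$ from each BS $\bar{b} \neq b$ via backhaul signaling as in [19]. In practice, each BS is required to share $GK$ complex scalars with the other BSs. In addition, the backhaul signaling introduces a delay that causes each BS to reconstruct the cross terms based on outdated CSI from the other BSs. As done in [20], we assume that such a delay consists of a single bi-directional training iteration. Hence, the cross terms at iteration $i$, denoted by $\boldsymbol{\xi}_{b,g}^{(i)}$, are reconstructed from the terms shared by the other BSs at iteration $i-1$. With this information, all the BSs can compute their precoders simultaneously building on the parallel optimization framework [32], which uses best-response updates to ensure the convergence to a solution of the sum MSE minimization in (34). Finally, the BS-specific precoder at iteration $i$ is computed as

$$\mathbf{w}_{b,g}^{(i)} = \mathbf{w}_{b,g}^{(i-1)} + \alpha_{\mathrm{BR}} \Delta \mathbf{w}_{b,g}^{\star},$$

where the step size $\alpha_{\mathrm{BR}} \in (0, 1]$ strikes a balance between convergence speed and accuracy of the solution [32], and the update direction $\Delta \mathbf{w}_{b,g}^{\star}$ in (49) is the difference between the precoder obtained by replacing $\boldsymbol{\xi}_{b,g}$ with $\boldsymbol{\xi}_{b,g}^{(i)}$ in (46) and $\mathbf{w}_{b,g}^{(i-1)}$.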
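Theorem 1. For a fixed set of combiners $\{\mathbf{v}_k\}_{k \in \mathcal{K}}$, the update direction $\Delta \mathbf{w}_{b,g}^{\star}$ in (49) is a steepest descent direction for the sum MSE minimization in (34).

Proof: Let us write the gradient of (35) with respect to $\mathbf{w}_{b,g}$ as

$$\nabla_{\mathbf{w}_{b,g}} \mathcal{L}_{(34)} = \bigg( \sum_{k \in \mathcal{K}} \omega_k \mathbf{H}_{b,k} \mathbf{v}_k \mathbf{v}_k^{\mathrm{H}} \mathbf{H}_{b,k}^{\mathrm{H}} + \lambda_b \mathbf{I}_M \bigg) \mathbf{w}_{b,g} + \boldsymbol{\xi}_{b,g} - \sum_{k \in \mathcal{K}_g} \omega_k \mathbf{H}_{b,k} \mathbf{v}_k.$$

Furthermore, let us define $\mathbf{C}_b \triangleq \sum_{k \in \mathcal{K}} \omega_k \mathbf{H}_{b,k} \mathbf{v}_k \mathbf{v}_k^{\mathrm{H}} \mathbf{H}_{b,k}^{\mathrm{H}} + \lambda_b \mathbf{I}_M$ and $\mathbf{C} \triangleq \mathrm{blkdiag}(\mathbf{C}_1, \ldots, \mathbf{C}_B)$. Then, we simplify (49) as $\Delta \mathbf{w}_{b,g}^{\star} = -\mathbf{C}_b^{-1} \nabla_{\mathbf{w}_{b,g}} \mathcal{L}_{(34)}$ and, exploiting the fact that $\mathbf{C}$ is block-diagonal, we obtain the aggregated update direction

$$\Delta \mathbf{w}_g^{\star} = -\mathbf{C}^{-1} \nabla_{\mathbf{w}_g} \mathcal{L}_{(34)}. \tag{52}$$

Finally, we observe that $\Delta \mathbf{w}_g^{\star}$ in (52) is a steepest descent direction for the quadratic norm $\|\mathbf{x}\|_{\mathbf{C}} \triangleq (\mathbf{x}^{\mathrm{H}} \mathbf{C} \mathbf{x})^{1/2}$.

Remark 1. Theorem 1 states that, for a fixed set of combiners $\{\mathbf{v}_k\}_{k \in \mathcal{K}}$, the Distributed BR solves the sum MSE minimization in (34) via a steepest descent method characterized by the quadratic norm $\|\mathbf{x}\|_{\mathbf{C}}$. Since $\mathbf{C}$ is a block-diagonal matrix with blocks $\{\mathbf{C}_b\}_{b \in \mathcal{B}}$, each BS $b$ greedily aims to reduce its individual MSE by following the steepest descent direction for the quadratic norm $\|\mathbf{x}\|_{\mathbf{C}_b}$, whereas the convergence to a solution of the sum MSE minimization is guaranteed by a proper choice of $\alpha_{\mathrm{BR}}$. On the other hand, the centralized precoding design with best-response updates is obtained by replacing $\mathbf{C}$ with the Hessian of (35) in (52), where the latter is a full matrix. Therefore, the Distributed BR is not equivalent to its centralized implementation and, as a consequence, may be characterized by slow convergence. This motivates the development of the gradient-based distributed precoding design in Section IV-B. Lastly, we point out that the outdated CSI used to reconstruct the cross terms at each BS further slows down the convergence.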
Remark 2. To speed up the convergence of the Distributed BR, we impose that, for a fixed set of combiners $\{\mathbf{v}_k\}_{k \in \mathcal{K}}$, the BS-specific precoders $\{\mathbf{w}_{b,g}\}_{g \in \mathcal{G}}$ are updated only once at each BS $b$. In this respect, a sufficiently small $\alpha_{\mathrm{BR}}$ would ensure the monotonic (yet slow) convergence to a solution of the sum MSE minimization in (34) even with a single update of the precoders for a fixed set of combiners [32]. However, considering a practical scenario where only a limited number of bi-directional training iterations is admissible, we disregard the strictly monotonic convergence and choose $\alpha_{\mathrm{BR}}$ to promote an aggressive reduction of the sum MSE objective during the first few iterations.
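The parallel best-response iteration can be sketched as follows for fixed combiners. This is a simplified illustration under several assumptions: the dual variables of the power constraints are held fixed and given, each BS knows its own effective channels perfectly, and the cross terms are built from the scalars that the other BSs computed with their precoders of the previous iteration (modeling the one-iteration delay). The function name `distributed_br` and the array layout are hypothetical.

```python
import numpy as np

def distributed_br(U_blocks, groups, weights, lam, alpha=1.0, iters=2000):
    """Parallel best-response precoder updates for fixed combiners.

    U_blocks[b]: (M, K) array whose column k is the effective channel H_{b,k} v_k
    lam: B dual variables (assumed fixed and given); alpha: step size in (0, 1]
    Returns the stacked precoders of shape (B*M, G).
    """
    B, (M, K) = len(U_blocks), U_blocks[0].shape
    G = len(groups)
    w = np.asarray(weights, dtype=float)
    W = [np.zeros((M, G), dtype=complex) for _ in range(B)]
    for _ in range(iters):
        # Scalars v_k^H H_{b,k}^H w_{b,g} shared by each BS (K x G per BS)
        shared = [U.conj().T @ Wb for U, Wb in zip(U_blocks, W)]
        W_new = []
        for b in range(B):
            U = U_blocks[b]
            # Cross terms from the other BSs, based on the previous iteration
            cross = sum((shared[c] for c in range(B) if c != b),
                        np.zeros((K, G), dtype=complex))
            xi = (U * w) @ cross
            C_b = (U * w) @ U.conj().T + lam[b] * np.eye(M)
            rhs = np.stack([(U[:, m] * w[m]).sum(axis=1) for m in groups], axis=1)
            delta = np.linalg.solve(C_b, rhs - xi) - W[b]  # best-response direction
            W_new.append(W[b] + alpha * delta)
        W = W_new  # all the BSs update simultaneously
    return np.vstack(W)
```

With a fixed set of combiners and fixed dual variables, the fixed point of this iteration satisfies the same stationarity condition as the centralized sum MSE precoders; the number of iterations needed depends on how strongly the BSs are coupled through the cross terms.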

B. Gradient-Based Distributed Precoding Design
The Distributed BR presented in Section IV-A is not equivalent to its centralized implementation and may thus be characterized by slow convergence (see Remark 1). Hence, in this section, we propose a gradient-based distributed precoding design, which is referred to in the following as the Distributed GB and follows directly from its centralized implementation. In this method, the BS-specific precoders are first updated using the gradient of the sum MSE objective and then projected to meet the per-BS transmit power constraints. To this end, we write the gradient of the sum MSE objective (cf. (25)) with respect to w_{b,g} as in (54). Then, the corresponding gradient-based update is expressed as in (56), where α_GB is the step size. The above gradient-based update can be computed locally at BS b upon receiving the CSI from the other BSs (necessary to reconstruct the cross terms) via backhaul signaling. Finally, the BS-specific precoders at iteration i are obtained by projecting the gradient-based updates in (56) to meet the per-BS transmit power constraint, as in (57). Note that this approach can be easily extended to a unicasting scenario by considering a single UE in each multicast group.
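The two-stage update described above (gradient step followed by projection onto the per-BS power constraint) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names `project_per_bs_power` and `gradient_step` are hypothetical, the gradient is passed in as a precomputed array rather than evaluated from the sum MSE objective, and the projection is assumed to be the Euclidean projection onto the set of precoders whose total power does not exceed ρ_BS, which reduces to a rescaling.

```python
import numpy as np

def project_per_bs_power(W, rho_bs):
    """Project the precoders of one BS onto {W : ||W||_F^2 <= rho_bs}.

    W has shape (M, G): one column per multicast group (assumed layout).
    The Euclidean projection onto this ball is a simple rescaling.
    """
    power = np.sum(np.abs(W) ** 2)
    if power <= rho_bs:
        return W
    return W * np.sqrt(rho_bs / power)

def gradient_step(W, grad, alpha_gb, rho_bs):
    """One gradient-based update followed by the per-BS power projection."""
    return project_per_bs_power(W - alpha_gb * grad, rho_bs)
```

Because the projection acts on each BS separately, the same routine can be applied independently at every BS once the gradient with respect to its own precoders is available.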
Theorem 2. The Distributed GB is equivalent to its centralized implementation.
Proof: Considering the centralized implementation, the gradient of the sum MSE objective (cf. (25)) with respect to w_g corresponds to the concatenation of the gradients with respect to the BS-specific precoders (see (54)). As a consequence, the gradient-based update of w_g can be expressed as the concatenation of the gradient-based updates of the BS-specific precoders (see (56)) at iteration i. Then, the aggregated precoders at iteration i are obtained by projecting the aforementioned gradient-based updates to meet the per-BS transmit power constraints, as in (61). Finally, we observe that the aggregated precoders in (61) correspond to the concatenation of the BS-specific precoders in (57).
Remark 3. Theorem 2 states that, for a fixed set of combiners {v_k}_{k∈K}, the Distributed GB (where the BS-specific precoders {w_{b,g}}_{g∈G} are optimized locally at each BS b) solves the sum MSE minimization in (34) in the same way as its centralized implementation (where the aggregated precoders {w_g}_{g∈G} are optimized at the CPU). Therefore, each BS directly aims to reduce the sum MSE rather than its individual MSE as in the Distributed BR. Moreover, the convergence to a solution of the sum MSE minimization is guaranteed by a proper choice of α_GB. Lastly, the comments in Remark 2 on how to speed up the convergence of the Distributed BR also apply here.

V. DISTRIBUTED PRECODING DESIGN WITH PILOT-AIDED CHANNEL ESTIMATION
In this section, we describe the practical implementation of the proposed distributed multi-group multicast precoding designs with imperfect CSI and without any backhaul signaling for CSI exchange. We recall that the local computation of the precoders at each BS in (46) relies on group-specific cross terms from the other BSs. To avoid the resulting CSI exchange via backhaul signaling, we adopt an OTA signaling scheme similar to that proposed in our previous work on distributed precoding design for cell-free massive MIMO unicasting [20]. Therein, we introduced a UE-specific OTA uplink training resource to eliminate the need for backhaul signaling to exchange the UE-specific CSI. In this paper, we propose a new group-specific OTA uplink training resource tailored for the multi-group multicasting scenario, which eliminates the need for backhaul signaling to exchange the group-specific CSI.

New group-specific OTA uplink training resource (UL-3).
To reconstruct the cross terms ξ_{b,g} locally at BS b, each UE k transmits Y^{DL}_k in (20) after precoding it as in (63), where the power scaling factor √β^{UL-3} (equal for all the UEs) ensures that X^{UL-3}_k complies with the per-UE transmit power constraint. We observe that (63) contains the group-specific effective downlink channels between all the BSs and UE k, and we recall that Y^{DL}_k is obtained by means of group-specific pilots (see Section II-A). Therefore, this new group-specific OTA uplink training resource generates the same training overhead as the effective downlink channel estimation, which depends on G rather than K as in the unicasting scenario. Then, the signal received at BS b is given by Y^{UL-3}_b in (65).
Building on the new group-specific OTA uplink training resource, the precoders are optimized locally at each BS by means of either best-response updates (based on both UE- and group-specific pilots or group-specific pilots only) or gradient-based updates (based on group-specific pilots), as discussed in the following sections. Regardless of the computation of the precoders, each UE k computes its combiner as in (45) with imperfect CSI.

Algorithm 2 (Distributed BR)
Data: Pilots {p^{UL-1}_k}_{k∈K} and {p^{DL}_g}_{g∈G}.
Initialization: Combiners {v_k}_{k∈K}.
Until a predefined termination criterion is satisfied, do:
3) Each BS b reconstructs {∆w⋆_{b,g}}_{g∈G} as in (68) and computes the precoders {w_{b,g}}_{g∈G} as in (47).
4) DL: Each BS b transmits X^{DL}_b in (19); each UE k receives Y^{DL}_k in (20).
5) Each UE k computes its combiner v_k as in (45).
End

A. Best-Response Distributed Precoding Design with UE- and Group-Specific Pilots
The practical implementation of the Distributed BR requires, at each bi-directional training iteration, the UE-specific effective uplink channel estimation and the effective downlink channel estimation (see Section II-A) together with the new group-specific OTA uplink training resource (see Section V). In this setting, Y^{UL-1}_b in (10) and Y^{UL-3}_b in (65) are suitably combined to reconstruct ∆w⋆_{b,g} in (49) as shown in (68), which is used to compute the BS-specific precoder in (47). Note that (68) becomes equal to (49) with perfect CSI, i.e., when τ^{UL-1} → ∞ and τ^{DL} → ∞. If pilot contamination is to be avoided entirely, the Distributed BR requires a minimum of K + G orthogonal pilots, i.e., K orthogonal pilots to obtain Y^{UL-1}_b in (10) and G orthogonal pilots to obtain Y^{UL-3}_b in (65), in each uplink training instance. The implementation of the Distributed BR is summarized in Algorithm 2.

B. Best-Response Distributed Precoding Design with Group-Specific Pilots
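The practical implementation of the Distributed BR described in Section V-A relies on the UE-specific effective uplink channel estimation, which requires a minimum of K orthogonal pilots in each uplink training instance to avoid pilot contamination. Hence, to reduce the training overhead, we propose a best-response distributed precoding design based solely on group-specific pilots, which is referred to in the following as the Distributed BR-GS. This method is obtained by replacing the UE-specific effective uplink channel estimation with its group-specific counterpart (see Section II-A). Consequently, if pilot contamination is to be avoided entirely, the Distributed BR-GS requires a minimum of 2G < K + G orthogonal pilots, i.e., G orthogonal pilots to obtain Y^{UL-2}_b in (15) and G orthogonal pilots to obtain Y^{UL-3}_b in (65), in each uplink training instance. In this setting, assuming ω_k = ω, ∀k ∈ K, Y^{UL-2}_b in (15) and Y^{UL-3}_b in (65) are suitably combined to reconstruct ∆w⋆_{b,g} in (49) as shown in (69), which is used to compute the BS-specific precoder in (47). To understand the convergence behavior of the Distributed BR-GS, let us assume for a moment that perfect CSI is available at BS b, i.e., τ^{UL-2} → ∞ and τ^{DL} → ∞. In this case, ∆w⋆_{b,g} is given by (70), with ∇_{w_{b,g}} L_{(34)}({w_g, λ_b}) given in (50) and together with the definition in (71). We observe that (70) includes an extra interference term with respect to (49), which arises from reconstructing the local interference covariance matrix based solely on group-specific CSI (i.e., Y^{UL-2}_b in (15)) rather than UE-specific CSI (i.e., Y^{UL-1}_b in (10)) as in (68) for the Distributed BR. The implementation of the Distributed BR-GS is summarized in Algorithm 3.
Theorem 3. ∆w⋆_{b,g} in (70) is a steepest descent direction for the sum MSE minimization in (34).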
Proof: The proof follows similar steps to the proof of Theorem 1 and is thus omitted.
Remark 4. Following similar arguments to Remark 1, Theorem 3 states that, for a fixed set of combiners {v_k}_{k∈K}, the Distributed BR-GS solves the sum MSE minimization in (34) via a steepest descent method characterized by the quadratic norm ‖x‖_D, with D ≜ blkdiag(D_1, …, D_B) ∈ ℂ^{BM×BM}. Due to the extra interference term in (70), the Distributed BR-GS may be characterized by slower convergence than the Distributed BR. Nonetheless, as shown in Section VI, this drawback may be well compensated by the reduced training overhead, especially for small resource blocks. Hence, the Distributed BR-GS may outperform the Distributed BR in terms of effective sum-group rate. Lastly, the comments in Remark 2 on how to speed up the convergence of the Distributed BR also apply here.

C. Gradient-Based Distributed Precoding Design with Group-Specific Pilots
The practical implementation of the Distributed GB requires, at each bi-directional training iteration, the group-specific effective uplink channel estimation and the effective downlink channel estimation (see Section II-A) together with the new group-specific OTA uplink training resource (see Section V). In this setting, Y^{UL-2}_b in (15) and Y^{UL-3}_b in (65) are suitably combined to reconstruct the gradient in (54) as shown in (72), which is used to compute the corresponding gradient-based update in (56). Note that (72) becomes equal to (54) with perfect CSI, i.e., when τ^{UL-2} → ∞ and τ^{DL} → ∞. Finally, the BS-specific precoders are obtained by projecting the gradient-based updates to meet the per-BS transmit power constraint as in (57). Remarkably, the Distributed GB can be implemented based solely on group-specific pilots. Consequently, if pilot contamination is to be avoided entirely, the Distributed GB requires a minimum of 2G orthogonal pilots in each uplink training instance (like the Distributed BR-GS). Another significant advantage of the Distributed GB is that the computation of the precoders does not involve any matrix inversion, which yields a reduced computational complexity with respect to the Distributed BR and the Distributed BR-GS. The implementation of the Distributed GB is summarized in Algorithm 4.

D. Training Overhead
The practical implementation of the proposed distributed precoding designs requires, at each bi-directional training iteration, the UE- or group-specific effective uplink channel estimation and the effective downlink channel estimation (see Section II-A). In addition, it also relies on the new group-specific OTA uplink training resource (see Section V), which eliminates the need for backhaul signaling to exchange the group-specific CSI. In principle, the iterative bi-directional training comprising the above signaling can be integrated into the flexible 3GPP 5G NR frame/slot structure, as discussed in [20], [23]. Table I shows the minimum number of orthogonal pilots (and thus the minimum number of pilot symbols) necessary for the iterative bi-directional training without pilot contamination in the proposed and reference precoding schemes.
Remark 5. The Distributed GB, if implemented via backhaul signaling for the CSI exchange similarly to [19], would still require the UE-specific effective uplink channel estimation (see Section II-A) and would generate the same backhaul signaling overhead as the Distributed BR described in Section IV-A. In fact, reconstructing the cross terms ξ_{b,g} in (46) at BS b is not possible with group-specific CSI exchange. On the other hand, adopting iterative bi-directional training with the new group-specific OTA uplink training resource allows the Distributed GB (and the Distributed BR-GS) to be implemented with reduced training overhead with respect to the Distributed BR.
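The uplink pilot requirements stated in the text for each scheme can be collected in a small helper. The sketch below covers only the minimum number of orthogonal pilots per uplink training instance as given in Sections V-A to V-C and Appendix III; it does not reproduce Table I (downlink pilots and the number of iterations are not included), and the function and key names are hypothetical.

```python
def min_uplink_pilots(scheme, K, G):
    """Minimum orthogonal pilots per uplink training instance (no contamination).

    K: total number of UEs, G: number of multicast groups.
    """
    counts = {
        "distributed_br": K + G,     # UL-1 (UE-specific) + UL-3 (group-specific)
        "distributed_br_gs": 2 * G,  # UL-2 + UL-3, both group-specific
        "distributed_gb": 2 * G,     # UL-2 + UL-3, both group-specific
        "local_mmse": K,             # UL-1 only
        "local_mf": G,               # UL-2 only
    }
    return counts[scheme]

# Setup of Section VI: K = 32 UEs in G = 8 multicast groups.
for name in ("distributed_br", "distributed_br_gs", "distributed_gb",
             "local_mmse", "local_mf"):
    print(name, min_uplink_pilots(name, K=32, G=8))
```

With K = 32 and G = 8, the Distributed BR needs 40 orthogonal uplink pilots, whereas the Distributed BR-GS and the Distributed GB need 16.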

E. Computational Complexity
Based on the minimum number of pilot symbols specified in Table I, Table II presents the computational complexity for each bi-directional training iteration of the proposed and reference precoding schemes. The computational complexity mainly arises from matrix multiplications and inversions in the computation of the precoders. Notably, the Local MF and the Distributed GB exhibit remarkably low computational complexity compared with the other methods. Additionally, the Distributed BR-GS is less complex than the Distributed BR as the former relies solely on group-specific pilots. Among all the considered methods, the Centralized entails the highest computational complexity.

VI. NUMERICAL RESULTS AND DISCUSSION
Table I: Minimum number of pilot symbols necessary for the iterative bi-directional training without pilot contamination in the proposed and reference precoding schemes, where I denotes the total number of bi-directional training iterations (recall that the Centralized requires a single uplink-downlink training iteration).
Table II: Computational complexity for each bi-directional training iteration of the proposed and reference precoding schemes, where δ denotes the number of bisection steps at each iteration (recall that the Centralized requires a single uplink-downlink training iteration).
In this section, we compare the performance of the proposed distributed multi-group multicast precoding designs presented in Section V, i.e., the Distributed BR (Algorithm 2), the Distributed BR-GS (Algorithm 3), and the Distributed GB (Algorithm 4), with that of the reference precoding schemes described in Section III-C and Appendix III, i.e., the Centralized (Algorithm 1), the Local MMSE, and the Local MF. Unless otherwise stated, the simulation setup comprises the following parameters. B = 25 BSs, each equipped with M = 8 antennas, are placed on a square grid with a distance of 100 m between neighboring BSs. K = 32 UEs, each equipped with N = 2 antennas, are uniformly distributed across the square grid. The UEs are divided into G = 8 multicast groups, each consisting of 4 randomly selected UEs.⁴ Assuming uncorrelated Rayleigh fading, the entries of H_{b,k} are i.i.d. CN(0, δ_{b,k}) random variables, where δ_{b,k} ≜ −48 − 30 log_10(d_{b,k}) [dB] is the large-scale fading coefficient and d_{b,k} is the distance between BS b and UE k.⁵ The maximum transmit power for both the data and the pilot transmission is ρ_BS = 30 dBm at the BSs and ρ_UE = 20 dBm at the UEs. The AWGN power at the BSs and at the UEs is fixed to σ²_BS = σ²_UE = −95 dBm. As a performance metric, we evaluate the sum-group rate in (3) averaged over 10^3 independent channel realizations and UE drops. In all the algorithms, the combiners at the UEs are initialized with random vectors and the step sizes are appropriately chosen to promote an aggressive reduction of the sum MSE objective during the first few iterations.
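The simulation setup described above can be sketched in a few lines. The following is only an illustrative sketch, not the authors' simulator: the function name `drop_network` is hypothetical, the UEs are assumed to be dropped uniformly over the square spanned by the BS grid (without wrap-around), the distances are taken in meters, and each channel H_{b,k} is assumed to be stored as an M × N matrix.

```python
import numpy as np

def drop_network(rng, grid=5, spacing=100.0, K=32, M=8, N=2, G=8):
    """Draw one UE drop and one channel realization for the described setup."""
    # B = grid * grid BSs on a square lattice with the given spacing.
    coords = np.arange(grid) * spacing
    bs = np.array([(x, y) for x in coords for y in coords])      # (B, 2)
    side = (grid - 1) * spacing
    ue = rng.uniform(0.0, side, size=(K, 2))                     # (K, 2)
    # Distances d[b, k] and large-scale fading -48 - 30 log10(d) in dB.
    d = np.linalg.norm(bs[:, None, :] - ue[None, :, :], axis=2)  # (B, K)
    delta_db = -48.0 - 30.0 * np.log10(d)
    delta = 10.0 ** (delta_db / 10.0)
    # Uncorrelated Rayleigh fading: i.i.d. CN(0, delta[b, k]) entries.
    shape = (grid * grid, K, M, N)
    H = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)
    H *= np.sqrt(delta)[:, :, None, None]
    # G multicast groups of K // G randomly selected UEs each.
    groups = rng.permutation(K).reshape(G, K // G)
    return bs, ue, delta_db, H, groups

bs, ue, delta_db, H, groups = drop_network(np.random.default_rng(0))
```

Averaging over many calls with independent random generators would mimic the averaging over channel realizations and UE drops mentioned above.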
We begin by validating Proposition 1 considering a centralized implementation. Figure 2 compares the average sum-group rate resulting from the sum-group MSE minimization (see Section III-A) and the sum MSE minimization (see Section III-B) for different values of ρ_BS. We observe that, as the SNR increases, the gap between the two curves does not increase. Therefore, at high SNR, the sum-group rate obtained with the sum MSE minimization closely approximates the one resulting from the sum-group MSE minimization.
Figure 3 illustrates the average sum-group rate as a function of the number of bi-directional training iterations, where the Centralized with perfect CSI is also included as an upper bound. The proposed distributed precoding designs greatly outperform the local precoding designs. During the first few iterations, the Distributed BR and the Distributed BR-GS are superior to the Distributed GB. Indeed, in the distributed precoding designs with best-response updates, each BS greedily aims to reduce its individual MSE by exploiting its local interference covariance matrix, yielding a slower convergence to a solution of the sum MSE minimization. On the other hand, the Distributed GB directly aims to reduce the sum MSE and thus outperforms all the other distributed algorithms after a few iterations. The proposed distributed precoding designs eventually provide a higher sum-group rate than the Centralized. In fact, the iterative bi-directional training involves multiple uplink-downlink training instances with independent AWGN realizations, whereas only one (antenna-specific) noisy channel estimate is used in the Centralized (see [20]). Therefore, the impact of AWGN on the distributed precoding designs is averaged out over the iterations and, eventually, the Distributed BR and the Distributed GB outperform the Centralized. As expected, the Local MMSE is the best among the local precoding designs as it exploits the local interference covariance matrix that is not considered in the Local MF.
⁴ If the multicasting services demand the UEs to be grouped based on similar geographical locations, the interference among the multicast groups could be mitigated more effectively, thus yielding better performance with respect to the considered random UE grouping.
⁵ The simulation results would be very similar with correlated channel models such as the one-ring model [34].
In the following, we compare the effective performance of the distributed precoding designs in terms of effective sum-group rate, defined as where r_ce is the number of pilot symbols used in each bi-directional training iteration and r_t is the resource block size including the transmission of both pilot symbols and data symbols. The switching time between uplink and downlink training instances is neglected. Figure 4 plots the average effective sum-group rate as a function of the number of bi-directional training iterations with resource block size r_t = 1000. All the algorithms achieve the maximum effective sum-group rate (indicated by the larger dots) within a few iterations. After the peak, the performance starts to decrease as the number of data symbols transmitted within the resource block reduces at each bi-directional training iteration. As shown in Table I, the Distributed BR-GS uses fewer pilot symbols than the Distributed BR. As a result, the effective sum-group rate of the Distributed BR-GS is slightly higher than that of the Distributed BR. Note that the sum-group rate (which does not consider the training overhead) of the Distributed BR-GS is inferior to that of the Distributed BR (as shown in Figure 3). Furthermore, the training overhead of the Distributed GB is smaller than in the Distributed BR due to the use of group-specific pilots. In this example, the maximum effective sum-group rates of the Distributed BR, the Distributed BR-GS, and the Distributed GB are 1.6, 1.65, and 2.1 times higher, respectively, than that of the local precoding designs.
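The trade-off between training overhead and rate described above can be sketched numerically. The snippet below is a hedged illustration only: it assumes that the effective rate after i bi-directional training iterations is the sum-group rate scaled by the fraction of the resource block left for data, i.e., (1 − i·r_ce/r_t), which is a common form of such a definition and is an assumption here rather than a formula taken from this text. The function name `effective_rate` and the numbers in the usage lines are hypothetical.

```python
def effective_rate(rate, iterations, r_ce, r_t):
    """Rate scaled by the share of the resource block left for data.

    Assumed form: (1 - iterations * r_ce / r_t) * rate, floored at zero
    when the pilots would occupy the whole resource block.
    """
    overhead = iterations * r_ce / r_t
    return max(0.0, 1.0 - overhead) * rate

# Hypothetical numbers: 5 iterations with 20 pilot symbols each in a
# resource block of 1000 symbols leave 90% of the block for data.
print(effective_rate(10.0, iterations=5, r_ce=20, r_t=1000))
```

Under this assumed form, each additional iteration costs a fixed fraction r_ce/r_t of the resource block, so the effective rate peaks once the rate gain of one more iteration no longer offsets that cost.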
Figure 5 depicts the average effective sum-group rate as a function of the resource block size r_t. For r_t = 1000, the effective sum-group rates correspond to the maximum values in Figure 4. Note that the optimal number of bi-directional training iterations to obtain the maximum effective sum-group rate increases with r_t as a higher training overhead can be tolerated for larger resource blocks. In general, the distributed precoding designs perform well for r_t ≥ 500. For example, with r_t = 500, the Distributed GB greatly outperforms all the other methods. Furthermore, the Distributed BR-GS performs better than the Distributed BR due to the use of fewer pilot symbols in each bi-directional training iteration and despite the extra interference term in (70). With large resource blocks, the training overhead becomes insignificant and the effective sum-group rate approaches the sum-group rate in Figure 3, which does not account for the training overhead.
Figure 7 depicts the average effective sum-group rate as a function of the number of antennas at each BS with resource block size r_t = 1000. Increasing M improves the performance of all the considered methods. What is more, the proposed distributed methods provide significant gains over the local precoding designs even with a relatively high number of antennas at each BS, e.g., M = 32, which motivates the use of the distributed precoding designs even in such scenarios.
Figure 8 illustrates the average effective sum-group rate as a function of the resource block size at low SNR, where the joint interference suppression across the BSs becomes less important. Nonetheless, the Distributed GB is superior to all the other methods, whereas the Distributed BR-GS, which now depends on the noisy feedback with the extra interference term in (70), suffers from inaccuracies in the local interference covariance matrix, making it inferior to the local precoding designs.
Lastly, Figure 9 compares the proposed distributed multi-group multicast precoding designs with the distributed unicast precoding design developed in [20] in the multi-group multicasting scenario considered so far (i.e., with K = 32 UEs divided into G = 8 multicast groups of 4 randomly selected UEs). The unicast precoding design is intended to suppress the interference among all the UEs and does not consider that the latter are divided into groups. For this method, the same data symbols (distinctly modulated for each UE) are transmitted to all the UEs in a multicast group by means of UE-specific precoders. In this setting, the rate is still limited by the worst UE in the multicast group and, therefore, we use the sum-group rate in (3) as a metric to evaluate the performance of the unicast precoding design. Moreover, the unicast precoding design requires UE-specific pilots, which are longer than the group-specific pilots used for the multicast precoding designs and thus result in higher CSI accuracy. Hence, to compare the impact of the training overhead between the multicast and unicast precoding designs, we scale the transmit power of the group-specific pilots to achieve the same CSI accuracy as the UE-specific pilots.
Figure 9 plots the average effective sum-group rate as a function of the resource block size. Here, the Distributed BR (unicasting) indicates the distributed precoding design proposed in [20], while the Distributed GB (unicasting) corresponds to Algorithm 4 adapted to consider each UE as a multicast group. We observe that all the proposed distributed methods tailored for the multi-group multicasting scenario outperform the unicast precoding designs. In addition, with small resource blocks, the performance of the unicast precoding designs is further penalized due to the higher impact of the training overhead. For instance, with r_t = 1000, the Distributed GB (unicasting) delivers around 4.5 bps/Hz per UE, while the Distributed GB provides approximately 8.5 bps/Hz per UE. We point out that even the performance of the multicast precoding designs with non-scaled transmit power of the group-specific pilots in Figure 5 is significantly better than that of the unicast precoding designs in Figure 9.

VII. CONCLUSIONS
We proposed fully distributed multi-group multicast precoding designs for cell-free massive MIMO systems with modest training overhead. The sum-group MSE minimization is initially considered to guarantee absolute MSE fairness within each multicast group. Subsequently, to simplify the computation and signaling, the sum-group MSE is approximated with the sum MSE objective. Considering the UE-specific rates as the performance metric, the aforementioned approximation holds well, especially at high SNR. An iterative bi-directional training is adopted to design the precoders and the combiners locally at each BS and at each UE, respectively. To this end, a new group-specific OTA uplink training resource is introduced to obtain the required group-specific cross terms from other BSs in the distributed precoding design, which eliminates the need for backhaul signaling to exchange the CSI. Furthermore, the distributed precoding designs are implemented by means of either best-response or gradient-based updates exploiting UE- and/or group-specific pilots. Consequently, the distributed precoding design with best-response updates results in a steepest descent direction for the sum MSE minimization, which makes it inferior to its centralized implementation. However, the gradient-based update solves the sum MSE minimization as it would be in a centralized design. Numerical results show that the distributed gradient-based precoding design with group-specific pilots always yields the best effective performance. Moreover, all the proposed distributed methods greatly outperform conventional cell-free massive MIMO precoding designs that rely solely on local CSI.

= max 0, µ where i is the iteration index and ζ is the step size. Finally, (76) is normalized to meet the constraint in (74).
2) Sub-gradient update of {λ b } b∈B .To meet the per-BS transmit power constraint, λ b is updated as [30] λ where η is the step size.

APPENDIX II KKT CONDITIONS OF (38)
The Lagrangian of (38) can be written as where κ is the dual variable corresponding to the constraint in (38).Then, the optimal p g is obtained as (82)

APPENDIX III LOCAL PRECODING DESIGNS
To avoid the prohibitive complexity and backhaul signaling of large-scale centralized precoding designs, most works on cell-free massive MIMO assume simple local precoding strategies exploiting the large-antenna regime across the BSs [31]. In this setting, the BS-specific precoders are designed based solely on local CSI, ignoring the contribution from the other BSs. Nevertheless, iterative bi-directional training is required to update the precoders at the BSs based on the combiners at the UEs and vice versa. With perfect CSI, at each bi-directional training iteration, the Local MMSE precoder at each BS b is computed as in (83), whereas the corresponding Local MF precoder is computed as in (84). Note that the dual variable λ_b in (83) and (84) can be easily obtained via bisection. In both cases, each UE k computes its combiner as in (30). The local precoding designs may not converge to a solution of the sum MSE minimization in (34). However, the resulting UE-specific rates improve over the iterations since the combiners are better focused towards the intended signals and increase the accuracy of the effective channel estimation. The practical implementation of the Local MMSE and the Local MF requires, at each bi-directional training iteration, the UE- and group-specific effective uplink channel estimations, respectively, as well as the effective downlink channel estimation (see Section II-A). Accordingly, the Local MMSE precoder at each BS b is computed as with Ω ≜ Diag(ω_1, …, ω_K) ∈ ℝ^{K×K}, whereas the corresponding Local MF precoder is computed as In both cases, each UE k computes its combiner as in (45).
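The bisection search for the dual variable λ_b mentioned above can be sketched as follows. This is a generic illustration under stated assumptions, not the exact expressions in (83) and (84): the precoder is assumed to take the regularized form (C + λI)⁻¹F for a positive definite local covariance matrix C and a matched-filter front-end F, and the names `precoder`, `power`, and `bisect_dual` are hypothetical. The search relies on the transmit power being non-increasing in λ.

```python
import numpy as np

def precoder(C, F, lam):
    """Assumed regularized form: W(lam) = (C + lam * I)^{-1} F."""
    return np.linalg.solve(C + lam * np.eye(C.shape[0]), F)

def power(C, F, lam):
    """Total transmit power of the precoder for a given dual variable."""
    return np.sum(np.abs(precoder(C, F, lam)) ** 2)

def bisect_dual(C, F, rho, tol=1e-9, max_steps=200):
    """Smallest lam >= 0 (up to tol) such that power(C, F, lam) <= rho."""
    if power(C, F, 0.0) <= rho:
        return 0.0                      # constraint inactive
    lo, hi = 0.0, 1.0
    while power(C, F, hi) > rho:        # grow the bracket until feasible
        hi *= 2.0
    for _ in range(max_steps):
        mid = 0.5 * (lo + hi)
        if power(C, F, mid) > rho:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return hi                           # feasible end of the bracket
```

Each bisection step requires one linear solve, which is consistent with the number of bisection steps appearing as a parameter in the computational complexity of Table II.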

[Figure 1 diagram: UE k initializes v_k; BS b and UE k then alternately update w_b and v_k, exchanging precoded uplink pilots (UE to BS) and precoded downlink pilots (BS to UE).]
Figure 1: Schematic representation of iterative bi-directional training in a single-UE, single-BS setting.To transmit the precoded uplink pilots, UE k uses v k as precoder.
where R S-MSE k and R SG-MSE k indicate the rates of UE k obtained with the sum MSE minimization and with the sum-group MSE minimization, respectively.The asymptotic approximation of the normalized difference between R S-MSE k and R SG-MSE k is given by by replacing {H k } k∈K with { Ĥk } k∈K .End 4) The CPU forwards the resulting BS-specific precoders to the corresponding BSs via backhaul signaling.5) DL: Each BS b transmits X DL b in (19); each UE k receives Y DL k in (20).6) Each UE k computes its combiner v k as in (45).
and G orthogonal pilots to obtain Y UL-3 b in (65), in each uplink training instance.The implementation of the Distributed BR is summarized in Algorithm 2.

Algorithm 3 (Distributed BR-GS)
Data: Pilots {p^{UL-2}_g}_{g∈G} and {p^{DL}_g}_{g∈G}.
Initialization: Combiners {v_k}_{k∈K}.
Until a predefined termination criterion is satisfied, do:
1) UL-2: Each UE k transmits X^{UL-2}_k in (14); each BS b receives Y^{UL-2}_b as in (15).
2) UL-3: Each UE k transmits X^{UL-3}_k in (63); each BS b receives Y^{UL-3}_b as in (65).
3) Each BS b reconstructs {∆w⋆_{b,g}}_{g∈G} as in (69) and computes the precoders {w_{b,g}}_{g∈G} as in (47).
4) DL: Each BS b transmits X^{DL}_b in (19); each UE k receives Y^{DL}_k in (20).
5) Each UE k computes its combiner v_k as in (45).
End

following as the Distributed BR-GS. This method is obtained by replacing the UE-specific effective uplink channel estimation with its group-specific counterpart (see Section II-A). Consequently, if pilot contamination is to be avoided entirely, the Distributed BR-GS requires a minimum of 2G < K + G orthogonal pilots, i.e., G orthogonal pilots to obtain Y^{UL-2}_b in (15) and G orthogonal pilots to obtain Y^{UL-3}_b in (65), in each uplink training instance. In this setting, assuming ω_k = ω, ∀k ∈ K, Y^{UL-2}_b in (15) and Y^{UL-3}_b in (65) are suitably combined to reconstruct ∆w⋆_{b,g} in (49). We observe that (70) includes an extra interference term with respect to (49), which arises from reconstructing the local interference covariance matrix based solely on group-specific CSI (i.e., Y^{UL-2}_b in (15)) rather than UE-specific CSI (i.e., Y^{UL-1}_b in (10)) as in (68) for the Distributed BR. The implementation of the Distributed BR-GS is summarized in Algorithm 3.

Figure 2: Average sum-group rate resulting from the sum-group MSE minimization and the sum MSE minimization versus number of alternating optimization iterations for different values of ρBS.

Figure 3: Average sum-group rate versus number of bi-directional training iterations.

Figure 4: Average effective sum-group rate versus number of bi-directional training iterations, with r_t = 1000.

Figure 5: Average effective sum-group rate versus resource block size.

Figure 6: Average effective sum-group rate versus number of UEs in each multicast group, with r_t = 1000.

Figure 7: Average effective sum-group rate versus number of antennas at each BS, with r_t = 1000.

Figure 6 plots the average effective sum-group rate as a function of the number of UEs in each multicast group |K_g| with resource block size r_t = 1000. In general, the sum-group rate decreases when |K_g| grows as more spatial degrees of freedom are used to suppress the interference among the multicast groups. However, at the same time, the sum rate across all the UEs ∑_{g∈G} |K_g| R_g is increased. The training overhead associated with X^{UL-1}_k in the Distributed BR and the Local MMSE depends on K = ∑_{g∈G} |K_g|, while the training overhead associated with X^{UL-2}_k in the Distributed BR-GS, the Distributed GB, and the Local MF is dictated by G. Consequently, the Distributed BR and the Local MMSE are more severely penalized by an increase in K. For example, considering the case of |K_g| = 16, the performance of the Distributed BR and the Local MMSE is inferior even to that of the Local MF.

Figure 9: Average effective sum-group rate versus resource block size in comparison with the unicast precoding design [20] with the same CSI accuracy.

{w g , t g , µ k , λ b } {p g , κ} = 0 =⇒ p g = is computed to satisfy ∑_{g∈G} p_g = ρ_BS, which yields

If pilot contamination is to be avoided entirely, the Local MMSE requires a minimum of K ≥ G orthogonal pilots to obtain Y^{UL-1}_b in (10) in each uplink training instance, whereas the Local MF requires a minimum of G orthogonal pilots to obtain Y^{UL-2}_b in (15) in each uplink training instance. For a fixed number of bi-directional iterations, the Local MMSE outperforms the Local MF by exploiting the local interference covariance matrix, although it has a higher training overhead.

• We formulate the multi-group multicast precoding design problem as a sum-group MSE minimization, which is approximated with a weighted sum MSE minimization to avoid the resulting backhaul signaling overhead.
• We show that the UE-specific rates obtained with the weighted sum MSE minimization asymptotically approximate the ones resulting from the sum-group MSE minimization.