Communication-Efficient Federated Learning for Resource-Constrained Edge Devices

Federated learning (FL) is an emerging paradigm to train a global deep neural network (DNN) model by collaborative clients that store their private data locally, through the coordination of a central server. A major challenge is the high communication overhead during the training stage, especially when the clients are edge devices that are linked wirelessly to the central server. In this paper, we propose efficient techniques to reduce the communication overhead of FL from three perspectives. First, to reduce the amount of data exchanged between the clients and the central server, we propose employing low-rank tensor models to represent neural networks, substantially reducing the model parameter size and leading to significant reductions in both computational complexity and communication overhead. Then, we consider two edge scenarios and propose the corresponding FL schemes over wireless channels. In the first scenario, the edge devices have just enough computing and communication capabilities, and we propose a lattice-coded over-the-air computation scheme for the clients to transmit their local model parameters to the server; compared with traditional repetition transmission, this scheme significantly reduces the distortion. In the second scenario, the edge devices have very limited computing and communication power, and we propose natural gradient-based FL, which involves only the forward pass and in which each client transmits only one scalar to the server at each training iteration. Numerical results on the MNIST and CIFAR-10 data sets demonstrate the effectiveness of the proposed communication-efficient FL techniques, which significantly reduce the communication overhead while maintaining high learning performance.


I. INTRODUCTION
WITH the increasing attention to user privacy, data protection becomes a fundamental requirement in many machine learning applications. A promising solution is federated learning (FL) [1], [2], where a number of clients perform individual local training using locally stored data, and periodically fuse the local models into a global model through the coordination of a central server. FL methods play a critical role in privacy-sensitive applications [3], [4], where training data are typically stored locally at the wireless network edge [5], [6], [7].
A major challenge of FL is the high communication overhead during the training process, especially when the communication links are wireless. In particular, the periodic model fusion involves two data transmissions between the clients and the central server: the uplink transmission, where all clients send their local models or gradients to the server, and the downlink transmission, where the server broadcasts the updated global model to all clients. Both transmissions typically involve a large neural network model. The uplink transmission is usually considered the bottleneck [8] since it is a many-to-one transmission (i.e., multiple-access), while the downlink transmission is one-to-many, i.e., broadcast [7], [9].
Most existing FL works focus on reducing the data exchanges between the clients and the server by model or gradient sparsification, without considering the actual communication channel. For the uplink transmission, the clients can transmit either their local models [9] or their gradients to the server. The former commonly occurs once every several local training epochs, while the latter needs to be performed every epoch. In [10], each client transmits only part of its updated local model according to a random sparsity pattern. In [11] and [12], each client transmits only the k largest values of its gradient. In [13], such top-k gradient sparsification is extended to the downlink transmission. Note that these sparsification methods are ad hoc and reduce the amount of data exchanged at the expense of degraded learning performance.
When the communication channels are explicitly taken into account in FL, both digital and analog transmission schemes have been considered. In particular, for digital transmission, the model parameters are first quantized before transmission. In [14], the feasibility of 1-bit quantization is investigated, where only the signs of the gradients are transmitted. In [15] and [16], a dithered scalar quantization method is adopted. The works in [6] and [17] extended scalar quantization to vector quantization via lattice quantization and Grassmannian quantization, respectively. However, they all assume that the communication channel is perfect, so the performance loss is due to the quantization distortion only, and the noisy nature of the channel, especially the wireless channel, is ignored.
Considering non-ideal wireless communication channels, it is shown in [18] and [19] that by taking advantage of the over-the-air computation property of a wireless multiple-access channel (MAC), analog transmission outperforms digital transmission. Specifically, the clients transmit their local models simultaneously to the central server over the same wireless channel, and the server receives the sum of the local models directly. Thus, it is more efficient than the conventional scheme where each client separately transmits its local model to the server, and the server performs model averaging after receiving all local models. The works in [19] and [20] extended Gaussian MACs to Gaussian fading MACs, and the work in [21] focused on receiver beamforming design. However, these works use only uncoded transmission, which is not effective in combating the channel noise. Overall, lattice-coded over-the-air computation has not been well exploited in federated learning.
In this paper, we propose communication-efficient federated learning by making use of three key techniques: low-rank tensor models for neural networks, lattice-coded over-the-air computation, and natural gradient learning. We next briefly review related works in these three areas.
2) Lattice code is a vector quantization scheme, and has recently been adopted for digital transmission in FL [6]. On the other hand, lattice-coded analog transmission for over-the-air computation has been analyzed in [31] for the Gaussian MAC. Information-theoretic analysis in [32] showed that the lattice-coded scheme can significantly outperform the conventional uncoded repetition scheme for over-the-air computation. However, this technique has not been used for FL.
3) The natural gradient descent method [33] essentially replaces the true gradient in the back propagation process of DNN training with random perturbations and therefore consists of the forward pass only. The work in [34] employed an unbiased estimate of the natural gradient. In [25], the natural gradient is used for parallel GPU training with high performance. In [35], the effectiveness of a parallel natural gradient method is shown in training a very deep neural network. However, natural gradient descent has not been studied in FL over wireless channels.
In this paper, to reduce the communication overhead in FL, we first propose to replace the linear layers in conventional neural networks by low-rank tensor layers to achieve model compression. Compared with the existing sparsification approaches, in our low-rank tensor representation the computations in both the forward and backward passes operate on the tensor parameters, instead of the original model parameters as in sparsification, which leads to reduced computational complexity as well as reduced communication overhead. Moreover, the tensor approach has a sound theoretical basis, unlike the ad hoc method of randomly discarding model parameters as in sparsification. Then we propose to employ a lattice-coded scheme for each client to transmit its local model to the central server, which can substantially reduce the distortion, leading to a smaller number of channel uses and therefore smaller communication overhead. Finally, for the FL scenario where the clients are low-complexity devices with very limited computing and communication capabilities, we propose natural gradient-based FL, where each client performs only the forward pass and transmits a single scalar at each iteration.
The remainder of this paper is organized as follows. Section II introduces some background and outlines our proposed schemes for reducing the communication overhead in FL. In Section III, we describe neural network compression using low-rank tensor decompositions. In Section IV, we present the lattice-coded transmission scheme from the clients to the central server. In Section V, we describe the natural gradient-based FL. Section VI presents the experimental results, and we conclude the paper in Section VII.

II. BACKGROUND AND PROBLEM STATEMENT
In this section, we briefly describe deep neural networks and wireless communication models in a federated learning scenario. Then, we give an overview of our proposed approach.

A. DEEP NEURAL NETWORKS
A linear layer is the building block of neural network architectures [36], such as FC networks, CNNs, and RNNs. For an input vector x ∈ R^{N_0}, it applies a transform using a weight matrix A ∈ R^{N_0×N_1}, resulting in a feature vector y ∈ R^{N_1}, which can be expressed as

y = A^T x + b, (1)

where b ∈ R^{N_1} is an offset vector.

1) FULLY CONNECTED (FC) NETWORK
For an L-layer FC network [36] with input y_0 = x ∈ R^{N_0}, the forward pass can be expressed as

y_ℓ = σ(A_ℓ^T y_{ℓ-1} + b_ℓ), ℓ = 1, ..., L − 1, (2)

where y_ℓ ∈ R^{N_ℓ} is the ℓ-th layer's feature vector, and σ(•) is an element-wise activation function, e.g., ReLU, sigmoid, or softmax [36]. The last layer produces an output vector

y = f(A_L^T y_{L−1} + b_L) ∈ R^{N_L} (3)

that denotes an estimated label, where f(•) is an output function, e.g., softmax or maxout.
For an L-layer FC network, we denote the parameters as W = {A_ℓ, b_ℓ}_{ℓ=1}^{L}. (4) Consider a supervised learning task: an update step takes a mini-batch of B samples {(x_b, y_b)}_{b=1}^{B} from a training data set, where x_b ∈ R^{N_0} is a data sample and y_b ∈ R^{N_L} is the corresponding label, such that if x_b belongs to class i then y_b(i) = 1 and y_b(i′) = 0 for i′ ≠ i. The cross-entropy loss of the b-th sample-label pair (x_b, y_b) can be computed through a forward pass as

L(x_b, y_b) = − Σ_{i=1}^{N_L} 1(y_b(i) = 1) log y(i), (5)

where 1(•) is an indicator function and y is the network output in (3) for input x_b. Other types of loss functions L(•) can also be employed, such as the mean squared error (MSE) and the mean absolute error (MAE). The model parameters are updated by the gradient descent method as

W ← W − ξ g, (6)   with   g = (1/B) Σ_{b=1}^{B} ∇_W L(x_b, y_b), (7)

where g is an estimate of the gradient, and ξ is a learning rate parameter.
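As an illustration, the forward pass (2)-(3) and the cross-entropy loss (5) can be sketched in a few lines of NumPy; the layer sizes, the ReLU/softmax choices, and all function names below are illustrative, not the paper's implementation:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def softmax(v):
    e = np.exp(v - v.max())          # subtract the max for numerical stability
    return e / e.sum()

def forward(W, x):
    # Forward pass of an L-layer FC network: each layer computes
    # y_l = sigma(A_l^T y_{l-1} + b_l); the last layer uses softmax as f.
    y = x
    for l, (A, b) in enumerate(W):
        z = A.T @ y + b
        y = softmax(z) if l == len(W) - 1 else relu(z)
    return y

def cross_entropy(W, x, y_onehot):
    # Loss (5): negative log-probability assigned to the true class.
    p = forward(W, x)
    return -np.log(p[int(np.argmax(y_onehot))] + 1e-12)

rng = np.random.default_rng(0)
# A toy 4 -> 8 -> 3 network with parameters W = {(A_l, b_l)}.
W = [(rng.normal(size=(4, 8)), rng.normal(size=8)),
     (rng.normal(size=(8, 3)), rng.normal(size=3))]
x = rng.normal(size=4)
p = forward(W, x)
loss = cross_entropy(W, x, np.eye(3)[1])
```

In practice, the gradient g in (6)-(7) would be obtained by automatic differentiation rather than written by hand.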

2) CONVOLUTIONAL NEURAL NETWORK (CNN)
For an L-layer CNN [36], an input image is a third-order data tensor X_0 ∈ R^{W_0×H_0×C_0}, where W_0 × H_0 is the spatial size and C_0 is the number of channels. The ℓ-th layer convolves its input X_{ℓ−1} with a fourth-order kernel tensor A_ℓ and applies an activation,

P_ℓ = σ(X_{ℓ−1} ∗ A_ℓ), (8)

where P_ℓ is the activated output, ∗ denotes the convolution, and σ(•) is an element-wise activation function, e.g., ReLU and tanh. P_ℓ is then processed by a pooling operation channel by channel independently: within each channel, the matrix is partitioned into subregions (9), and the pooling operator maps each subregion to a scalar, resulting in the output of the ℓ-th layer X_ℓ = pooling(P_ℓ) ∈ R^{W_ℓ×H_ℓ×C_ℓ}, which is the input to the (ℓ + 1)-th layer.
Finally, the L-th layer outputs a feature vector y that denotes the estimated label,

y = f(vec(X_L)), (10)

where vec(•) maps a tensor to a vector, and f(•) is an output function. For an L-layer convolutional neural network, the parameters are denoted as W = {A_ℓ}_{ℓ=1}^{L}. (11) The loss function and back propagation model update are similar to (5)-(7) of the FC networks.

B. FEDERATED LEARNING
Federated learning [2] enables multiple clients to learn a global DNN model with the training data stored locally. During the training process of conventional FL, each client performs gradient descent updates using its local training data; periodically, the latest trained models of all clients are fused through a central server. In particular, one FL training iteration consists of the following three steps:
1) The server broadcasts the current global model W to all K clients, where W represents the model parameters.
2) Using its own local training data {(x_i^k, y_i^k)}_{i=1}^{n_k}, each client k updates its local model, initialized as W_k = W, by running multiple gradient descent steps of the form (6)-(7) to obtain the updated local model W_k. Then, it sends W_k back to the server.
3) The server updates the global model as a linear combination of the local models,

W = Σ_{k=1}^{K} ρ_k W_k, (13)

where the weights {ρ_k}_{k=1}^{K} are non-negative and sum to one, and the corresponding global training objective is

F(W) = Σ_{k=1}^{K} ρ_k F_k(W), (14)

where F_k(W) is the training loss based on the local data set of client k. The weight ρ_k specifies the relative impact of client k; one natural setting is ρ_k = n_k/n, where n_k is the number of data samples at client k and n = Σ_{k=1}^{K} n_k is the total number of data samples [2].
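The three steps above can be sketched end to end; the toy scalar objective and the function names below are illustrative placeholders for the clients' actual networks and losses:

```python
import numpy as np

def fedavg_round(W_global, client_data, local_steps, lr, grad_fn):
    # One FL iteration: each client initializes W_k = W, runs several local
    # gradient steps, and the server fuses the local models with weights
    # rho_k = n_k / n.
    n = sum(len(d) for d in client_data)
    W_new = np.zeros_like(W_global)
    for d in client_data:
        W_k = W_global.copy()
        for _ in range(local_steps):
            W_k = W_k - lr * grad_fn(W_k, d)   # local update at client k
        W_new += (len(d) / n) * W_k            # weighted model fusion
    return W_new

# Toy example: each client holds scalar samples and its local loss is
# (1/2) * mean((W - x_i)^2), so the local gradient is W - mean(d).
rng = np.random.default_rng(0)
data = [rng.normal(loc=3.0, size=20), rng.normal(loc=-1.0, size=60)]
W = np.zeros(1)
for _ in range(20):
    W = fedavg_round(W, data, local_steps=5, lr=0.5,
                     grad_fn=lambda W, d: W - d.mean())
```

For this quadratic toy problem the iterations converge to the sample-size-weighted mean of the clients' data, i.e., the minimizer of the weighted global objective.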
In each FL iteration described above, there are two communication rounds between the clients and the server: the uplink round at the end of step 2, where each client sends its local model W_k to the server, and the downlink round in step 1, where the server broadcasts the global model W to all clients. The uplink is the bottleneck since it is a multiple-access channel (MAC). Next, we describe traditional methods for the uplink transmission in FL.

C. UPLINK TRANSMISSION SCHEMES
Let the size of the local model W_k be S, and reshape the parameters {W_k}_{k=1}^{K} into vectors {w_k ∈ R^S}_{k=1}^{K} for transmission. To combat the channel noise, at each FL iteration {w_k}_{k=1}^{K} need to be transmitted M times. There are two wireless analog transmission schemes in the literature [37]: one is for each client to use a point-to-point channel to transmit its data to the server, and the other is for all clients to transmit their data simultaneously through a multiple-access channel to the server.

1) TRANSMISSION THROUGH ORTHOGONAL CHANNELS
In this case, each client k transmits its data x_k = cρ_k w_k to the server through a separate additive white Gaussian noise (AWGN) channel, where c is a scaling parameter to meet the average codeword power constraint of all clients:

(1/S) E[∥x_k∥²] ≤ P, k = 1, ..., K, (15)

where E[•] denotes the expectation operator, and ∥•∥ denotes the Euclidean norm. In the following, we focus on the transmission of one element x_k of x_k. The channel between the k-th client and the central server is modeled as

y_k = h_k x_k + z_k, (16)

where h_k is the channel gain, x_k is the transmitted signal from client k, z_k is the Gaussian channel noise, and y_k is the received signal at the server. Each client k transmits the channel-inverted signal x_k/h_k through this channel M times, and therefore the central server receives

y_k(m) = x_k + z_k(m), m = 1, ..., M, (17)

where z_k(m) ∼ N(0, σ²) are i.i.d. Gaussian noise samples.
Recall that the server needs the combined weight w = Σ_{k=1}^{K} ρ_k w_k. We consider the simple estimator of x_k based on the sample average, i.e.,

x̂_k = (1/M) Σ_{m=1}^{M} y_k(m). (18)

This estimator is unbiased and its mean squared error (MSE) is σ²/M. An estimate of w is then given by ŵ = (1/c) Σ_{k=1}^{K} x̂_k, which is also unbiased, with the MSE

E[(ŵ − w)²] = Kσ²/(Mc²). (19)

2) TRANSMISSION OVER MULTIPLE-ACCESS CHANNEL
In this case, each client k transmits its data x_k = cρ_k w_k to the server through a Gaussian MAC, where c is a scaling parameter to meet the same average codeword power constraint in (15). The Gaussian MAC is modeled as y = Σ_{k=1}^{K} h_k x_k + z, where h_k and x_k are the channel gain and the transmitted signal of client k, respectively, z is the Gaussian channel noise, and y is the received signal at the server. Each client k transmits its channel-inverted signal x_k/h_k through this MAC M times, and the central server receives

y(m) = Σ_{k=1}^{K} x_k + z(m), m = 1, ..., M, (20)

where z(m) ∼ N(0, σ²) are i.i.d. Gaussian noise samples. Since the server needs w, we consider the simple sample average estimator of w, i.e., ŵ = (1/(Mc)) Σ_{m=1}^{M} y(m). This estimator is unbiased and the MSE is

E[(ŵ − w)²] = σ²/(Mc²). (21)

Compared to (19), the MAC reduces the MSE by a factor of K. This scheme is also called over-the-air computation, since the MAC automatically performs the summation. In Sec. IV, we will employ the lattice code to substantially reduce the MSE in (21) of this simple repetition transmission.
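The factor-of-K gap between the two repetition schemes' MSEs in (19) and (21) can be checked with a small Monte Carlo simulation; channel inversion is assumed perfect, and the constants K, M, σ, and c below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
K, M, sigma, c = 10, 8, 1.0, 1.0
trials = 50_000
x = rng.normal(size=K)        # x_k = c * rho_k * w_k, one element per client
w = x.sum() / c               # the combined value the server needs

# Orthogonal channels: each x_k is estimated separately from its M noisy
# copies and the estimates are summed; the MSE should be K * sigma^2 / (M c^2).
y_orth = x[None, :, None] + rng.normal(scale=sigma, size=(trials, K, M))
w_orth = y_orth.mean(axis=2).sum(axis=1) / c
mse_orth = np.mean((w_orth - w) ** 2)

# Multiple-access channel: the K signals add over the air, so each of the
# M channel uses carries a single noise sample; the MSE should be sigma^2 / (M c^2).
y_mac = x.sum() + rng.normal(scale=sigma, size=(trials, M))
w_mac = y_mac.mean(axis=1) / c
mse_mac = np.mean((w_mac - w) ** 2)
```

With these settings the empirical MSEs concentrate around 1.25 for the orthogonal scheme and 0.125 for the MAC scheme, matching (19) and (21).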

D. OVERVIEW OF PROPOSED SOLUTIONS
Our goal is to reduce the communication overhead of the uplink wireless transmission in federated learning, while keeping the training loss in (14) as low as possible. Our proposed solution consists of three main ingredients to mitigate the communication overhead. As shown in Fig. 1, the upper process is our proposed method for edge devices with sufficient or limited computing and communication power, while the lower process is the conventional method, regardless of edge constraints.

III. NETWORK COMPRESSION USING TENSOR MODELS
Since the communication overhead is proportional to the network parameter size S, in this section, we propose to replace the linear layers in FC networks and CNNs by low-rank tensor layers to achieve model compression.

A. TENSOR OPERATIONS
Mapping between a vector/matrix and a tensor: We specify a mapping rule between a vector/matrix and a tensor based on [27]. A vector x ∈ R^N is mapped into a J-th order tensor X ∈ R^{N^(1)×···×N^(J)} with N = N^(1)···N^(J) by rearranging its entries (22). This mapping is denoted as x = vec(X), and its inverse as X = vec^{−1}(x).
On the other hand, a mode-(J, D) mapping (23) converts a matrix X ∈ R^{N×M} into a D-th order tensor. Tensor contraction operation: A tensor contraction [38], [39] combines two tensors along a shared dimension. In particular, given two third-order tensors A ∈ R^{N_1×N_2×N_3} and B ∈ R^{N_3×N_4×N_5}, the tensor contraction along the common dimension N_3 results in a fourth-order tensor C ∈ R^{N_1×N_2×N_4×N_5} with

C(i, j, l, p) = Σ_{k=1}^{N_3} A(i, j, k) B(k, l, p). (24)
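The contraction (24) is a one-line `einsum`; the shapes below match the third-order example above:

```python
import numpy as np

# Contract A (N1 x N2 x N3) with B (N3 x N4 x N5) along the shared
# dimension N3: C[i, j, l, p] = sum_k A[i, j, k] * B[k, l, p].
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3, 4))
B = rng.normal(size=(4, 5, 6))
C = np.einsum('ijk,klp->ijlp', A, B)
```

The result agrees with `np.tensordot(A, B, axes=([2], [0]))`, which performs the same contraction.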

B. TENSOR LAYERS FOR DEEP NEURAL NETWORKS
We explain how to compress FC layers and convolutional layers using different tensor decomposition models. Specifically, we adopt the tensor train (TT) decomposition for FC layers [27] and the canonical polyadic (CP) decomposition for convolutional layers [28] to reduce the communication overhead in FL, since previous experimental results show that they achieve the highest compression ratios for the respective layer types.

1) TT DECOMPOSITION FOR FC LAYER
We convert a fully connected layer in (2) into a TT layer using the following four steps:
i. Map the weight matrix A_ℓ and the input/output vectors into tensors using the mappings in (22) and (23).
ii. Decompose the weight tensor A_ℓ into D core tensors; that is, a conventional fully connected layer in (2) is decomposed into D sub-layers.
iii. Forward pass: Propagate X_{ℓ−1} through these D sub-layers and obtain the output tensor that corresponds to y_ℓ ∈ R^{N_ℓ}.
iv. Back propagation: Compute the gradient descent updates using the automatic differentiation module.
To illustrate the above procedure, we set J = 2 and D = 4. Using a mode-(2, 4) mapping in (23), the matrix A_ℓ is mapped into a fourth-order tensor A_ℓ ∈ R^{N^(1)×N^(2)×N^(3)×N^(4)} with N_ℓ = N^(1)N^(2) and N_{ℓ−1} = N^(3)N^(4). The vectors y_ℓ, b_ℓ, and x_{ℓ−1} are mapped into second-order tensors Y_ℓ, B_ℓ ∈ R^{N^(1)×N^(2)} and X_{ℓ−1} ∈ R^{N^(3)×N^(4)}, respectively. Next, the fourth-order weight tensor A_ℓ is decomposed into 4 third-order core tensors under the TT decomposition:

A_ℓ(i_1, i_2, i_3, i_4) = Z_1^ℓ(:, i_1, :) Z_2^ℓ(:, i_2, :) Z_3^ℓ(:, i_3, :) Z_4^ℓ(:, i_4, :), (25)

where Z_d^ℓ(:, i, :) denotes the i-th lateral slice of the core Z_d^ℓ, the TT-ranks are at most R, and R ≪ N (assuming N^(d) = N for all d). The layer size is thus reduced from N^4, the size of the matrix A_ℓ, to at most 4R²N, the total size of the four core tensors {Z_d^ℓ}_{d=1}^{4}. The forward pass in (2) then becomes four successive tensor contractions of the input tensor X_{ℓ−1} with the cores Z_4^ℓ, Z_3^ℓ, Z_2^ℓ, and Z_1^ℓ (26).
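The four contraction steps can be sketched as follows for the J = 2, D = 4 example; the core shapes follow the standard TT convention with boundary ranks equal to 1, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 8, 3                        # mode size and TT-rank, R << N
# Four TT cores replacing the N^2 x N^2 weight matrix (N^4 parameters).
Z1 = rng.normal(size=(1, N, R))
Z2 = rng.normal(size=(R, N, R))
Z3 = rng.normal(size=(R, N, R))
Z4 = rng.normal(size=(R, N, 1))

def tt_forward(X):
    # Input X is x reshaped to N x N (mode-(2, 4) mapping). The full weight
    # tensor is never formed; we contract one core at a time.
    U = np.einsum('ij,cj->ic', X, Z4[:, :, 0])   # contract over i4
    V = np.einsum('bic,ic->b', Z3, U)            # contract over i3 and rank c
    T = np.einsum('ajb,b->aj', Z2, V)            # attach mode i2
    return np.einsum('ia,aj->ij', Z1[0], T)      # attach mode i1 -> N x N output

# Sanity check against the explicitly reconstructed fourth-order tensor.
X = rng.normal(size=(N, N))
A_full = np.einsum('xia,ajb,bkc,cly->ijkl', Z1, Z2, Z3, Z4)
Y_ref = np.einsum('ijkl,kl->ij', A_full, X)
params_full = N ** 4                               # 4096 for N = 8
params_tt = Z1.size + Z2.size + Z3.size + Z4.size  # 192 for N = 8, R = 3
```

The stepwise contractions reproduce the dense result while storing only the cores, here roughly a 20x parameter reduction.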

2) CP DECOMPOSITION FOR CONVOLUTIONAL LAYER
We convert a convolutional layer in (8) into a CP layer to reduce the kernel size, using the following three steps:
i. Decompose the weight tensor into four factor matrices as in (27); that is, a convolutional layer in (8) is decomposed into four sub-layers.
ii. Forward pass: Propagate X_{ℓ−1} through these four sub-layers and obtain the output tensor.
iii. Back propagation: Compute the gradient descent updates using the automatic differentiation module.
In the CP decomposition, a fourth-order kernel tensor A_ℓ in (8) is decomposed into four factor matrices as

A_ℓ(k_1, k_2, n_1, n_2) = Σ_{r=1}^{R} U_1(k_1, r) U_2(k_2, r) U_3(n_1, r) U_4(n_2, r), (27)

where the minimal possible R is called the CP-rank. The kernel size is reduced from N²K², the size of the tensor A_ℓ, to 2R(N + K), the total size of the four factor matrices.
The forward pass in (8) then becomes four successive smaller convolutions with the factor matrices (28), where {X^(k)}_{k=1}^{4} are intermediate tensors. Note that the time complexity of the forward pass is reduced from O(W²N²K²) for the conventional convolutional layer to O(W²R(N + K)) for the CP decomposition-based convolutional layer.
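The parameter counts above are easy to check numerically; the sizes N = 256 channels, K = 3, and R = 16 below are illustrative:

```python
def conv_params(N, K):
    # Conventional convolutional layer: an N x N x K x K kernel tensor.
    return N * N * K * K

def cp_conv_params(N, K, R):
    # CP layer: factor matrices of sizes K x R, K x R, N x R, N x R,
    # i.e. 2R(N + K) parameters in total.
    return 2 * R * (N + K)

ratio = conv_params(256, 3) / cp_conv_params(256, 3, 16)  # compression ratio
```

For these sizes the kernel shrinks from 589,824 parameters to 8,288, a compression ratio above 70x.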
One can train the tensor layers using the stochastic gradient descent (SGD) method. The backward propagation in (6) employs the chain rule [36] over the tensor factors in (26) and (28), respectively, which can be computed automatically by PyTorch's automatic differentiation module Autograd [40]. We summarize the theoretical complexities of the various networks in Table 1.
Remark: We have shown how to compress FC layers and convolutional layers using tensor decomposition models in FL. For other neural networks, tensor methods may still be applicable, e.g., RNN compression in [26].

IV. OVER-THE-AIR COMPUTATION VIA LATTICE CODE
In Sec. II-C.2, it is seen that when the clients transmit their locally updated weights through a multiple-access channel, the central server receives the combined weights, realizing over-the-air computation. In this section, we propose a lattice-coded transmission scheme over the same multiple-access channel that can significantly reduce the MSE of the combined weights received by the central server. We first provide some basic background on lattice codes. Then we present our proposed lattice-coded uplink transmission scheme for federated learning. Finally, we describe the implementation details of some lattice operations.

A. BACKGROUND ON LATTICE CODE
Lattice code [41] is a vector quantization scheme that exhibits a number of asymptotic optimalities. We next list several key concepts that will be used in our proposed lattice-coded transmission scheme [42], [43], [44], [45].
An s-dimensional lattice Λ is a set of points in R^s such that if s_1, s_2 ∈ Λ, then s_1 + s_2 ∈ Λ, and if s ∈ Λ, then −s ∈ Λ. Moreover, Λ can be directly defined in terms of a nonsingular generator matrix G ∈ R^{s×s}:

Λ = {s = Gq : q ∈ Z^s}. (29)

A lattice quantizer Q_Λ : R^s → Λ maps x ∈ R^s to the nearest lattice point in Euclidean distance:

Q_Λ(x) = argmin_{s∈Λ} ∥x − s∥. (30)

We denote the quantization residual as

x mod Λ = x − Q_Λ(x). (31)

The fundamental Voronoi region V_0 of a lattice is the set of all points that are closest to the zero lattice point:

V_0 = {x ∈ R^s : Q_Λ(x) = 0}. (32)

The second moment per dimension of a lattice is G(Λ) = (1/s) E[∥d∥²], where d is uniformly distributed in V_0.
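These definitions are easy to verify for the simplest lattice Λ = Z^s (generator G = I): the quantizer rounds coordinate-wise, the Voronoi region is the cube [−1/2, 1/2)^s, and the second moment per dimension is 1/12. A quick NumPy check:

```python
import numpy as np

def quantize_zs(x):
    return np.rint(x)              # Q(x): round each coordinate to Z

def mod_zs(x):
    return x - quantize_zs(x)      # x mod Lambda: the quantization residual

rng = np.random.default_rng(0)
# Monte Carlo estimate of the second moment per dimension of Z^4:
# d uniform over the Voronoi cube, G = E[||d||^2] / s, which should be 1/12.
d = rng.uniform(-0.5, 0.5, size=(100_000, 4))
second_moment = np.mean(np.sum(d * d, axis=1)) / 4
residual = mod_zs(rng.normal(size=8))  # always lands inside the Voronoi cube
```

The E_8 lattice used later in Sec. IV-C has a smaller second moment (about 0.0717) than Z^8 at the same density, which is exactly why it is a better quantizer.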

B. LATTICE-CODED UPLINK TRANSMISSION SCHEME
Recall from Sec. II-C that at each FL iteration, each client k transmits M copies of x_k = cρ_k w_k to the central server through a multiple-access channel. By using a sample average estimator of the combined weight w = Σ_{k=1}^{K} ρ_k w_k, the following MSE can be achieved:

E[(ŵ − w)²] = σ²/(Mc²). (33)

In this section, we consider a lattice-based transmission scheme over the same multiple-access channel that can significantly reduce the MSE. The first transmission is the same as that described in Sec. II-C with M = 1. Then, in each subsequent transmission, each client transmits a lattice quantization residual signal.
First, each client k divides its local model w_k ∈ R^S into subvectors of size s. Denote such a subvector as w_k ∈ R^s, and the corresponding transmitted signal as x_k = cρ_k w_k ∈ R^s. We next describe the procedure for transmitting x_k, k = 1, ..., K, over the multiple-access channel M times using an s-dimensional lattice code Λ, which is scaled such that its second moment per dimension is G(Λ) = KP, where P is given in (15). The transmission consists of two phases. The first phase is an uncoded transmission, the same as the transmission scheme over multiple-access channels described in Sec. II with M = 1: the server receives

y(1) = Σ_{k=1}^{K} x_k + z(1), (34)

where z(1) ∈ R^s contains i.i.d. Gaussian noise N(0, σ²) samples. The server then obtains an initial estimate of w as ŵ(1) = y(1)/c. (35)

VOLUME 1, 2023

Denote the estimate of w at the central server after the (m−1)-th transmission as ŵ(m−1). In the m-th transmission, client k generates a dither d_k(m) ∈ R^s uniformly distributed over the fundamental Voronoi region V_0, and makes it available to the server through a shared random seed [6]. Client k then transmits the dithered lattice quantization residual of its current estimation error through channel inversion 1/h_k, where the lattice residue is scaled down to meet the average codeword power constraint in (15). The server receives the superimposed signal y(m) = Σ_{k=1}^{K} x_k(m) + z(m), where z(m) ∈ R^s consists of i.i.d. Gaussian noise N(0, σ²) samples, and computes the latest estimate ŵ(m) of w according to (41). Assume P > ((K−1)/K)σ², and let the constants be given as in (42). If s is large enough, the MSE of the estimate ŵ(m) is given by (43); see the Appendix for the derivation of (43).
Hence, after the M-th transmission, the MSE of the final estimate ŵ(M) of w is given by (44). Compared with (33), we conclude that for the uncoded case the MSE is inversely proportional to the number of transmissions M, whereas for the proposed lattice-coded scheme the MSE decreases exponentially with M. Hence, to achieve a given MSE, the lattice-coded approach requires a smaller number of channel uses M than the uncoded repetition transmission scheme discussed in Sec. II-C.2. Note that for each lattice transmission step m, each client k and the central server share a common random dither signal d_k(m), which can be realized through a random seed shared between client k and the server at step m.

C. IMPLEMENTATION DETAILS OF LATTICE OPERATIONS
In this subsection, we explain the computational procedures for lattice scaling, modulo operation, and uniform dither generation.

1) LATTICE SCALING
Given an s-dimensional lattice Λ_0 with second moment per dimension G(Λ_0) = P_0, and for any v ∈ R^s, denote the corresponding lattice quantizer as Q_0(v). Let Λ be the scaled version of Λ_0 with second moment per dimension G(Λ) = KP, and denote the corresponding lattice quantizer as Q(v). Then these two quantization operations are related as Q(v) = λQ_0(v/λ), where the scaling factor λ is determined by λ² P_0 = KP. Thus, v mod Λ = v − λQ_0(v/λ).

2) MODULO OPERATION
In this paper, we choose s = 8 and use the E_8 lattice [46] as Λ_0. The E_8 lattice is the 8-dimensional lattice Λ_0 = {s ∈ Z^8 ∪ (Z + 1/2)^8 : Σ_{i=1}^{8} s(i) is even} with second moment per dimension P_0 = 929/12960 ≈ 0.0717. Hence, for a given vector v ∈ R^8, v mod Λ = v − λQ_0(v/λ), where λ = 3.735·√(KP). Denoting x = v/λ, we outline the procedure for computing the quantization Q_0(x) on the E_8 lattice based on the fast algorithm in [47]. We first define element-wise functions f : R^8 → Z^8 and g : R^8 → Z^8 such that f(x) = [x] rounds each element of x to the nearest integer, where [•] is the rounding operator; g(x) is the same as f(x) except at the element x_i with the largest absolute rounding residual, which is rounded the other way, i.e., g(x_i) = ⌈x_i⌉ if [x_i] = ⌊x_i⌋, and g(x_i) = ⌊x_i⌋ if [x_i] = ⌈x_i⌉, where ⌈•⌉ and ⌊•⌋ are the ceiling and flooring operators, respectively. In case of a tie, we choose the x_i with the smallest index i. To encode x, we first compute f(x) and g(x) and select the one whose elements sum to an even number, calling it y_0. Then, we compute f(x − 1/2) and g(x − 1/2), again select the one with an even sum, add 1/2 to each element, and call the result y_1. Finally, we compare y_0 and y_1 and choose the one that is closer to x, i.e., Q_0(x) = argmin_{y∈{y_0, y_1}} ∥y − x∥².
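The two-coset search described above can be implemented directly; the sketch below follows the f/g construction (the function names are ours, and `np.rint` supplies the rounding operator):

```python
import numpy as np

def f(x):
    # Round every element to the nearest integer.
    return np.rint(x)

def g(x):
    # Same as f, except the element with the largest absolute rounding
    # residual is rounded the other way, flipping the coordinate-sum parity.
    y = np.rint(x)
    r = x - y
    i = int(np.argmax(np.abs(r)))
    y[i] += np.sign(r[i]) if r[i] != 0 else 1.0
    return y

def quantize_e8(x):
    # Nearest point of E8 = D8 union (D8 + 1/2): in each coset pick the
    # candidate with an even coordinate sum, then keep the closer of the two.
    candidates = []
    for shift in (0.0, 0.5):
        a, b = f(x - shift), g(x - shift)
        y = a if np.sum(a) % 2 == 0 else b
        candidates.append(y + shift)
    dists = [np.sum((x - y) ** 2) for y in candidates]
    return candidates[int(np.argmin(dists))]
```

For example, points near an integer root snap to it, points near the half-integer coset snap to (1/2, ..., 1/2), and lattice points are fixed points of the quantizer.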

3) GENERATION OF UNIFORM DITHERS
We use an acceptance-rejection method [48] to generate a dither uniformly distributed in the fundamental Voronoi region of a lattice.
For the E_8 lattice, we list all 240 lattice points {s_i}_{i=1}^{240} that are closest to the origin [49], with ∥s_i∥ = √2. The perpendicular bisector between each s_i and the origin forms a face of the fundamental Voronoi region. We then generate a vector d ∈ R^8 of i.i.d. samples uniformly distributed in [−1, 1]. By computing the projections of d onto all lattice points {s_i}_{i=1}^{240}, we accept d as a dither if

d^T s_i ≤ ∥s_i∥²/2 = 1, for all i = 1, ..., 240. (45)

If d does not satisfy (45), it is discarded, and the procedure is repeated until a dither satisfying (45) is found.
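The rejection sampler can be sketched as follows; the 240 minimal vectors of E_8 are the 112 integer roots (±1 in exactly two coordinates) plus the 128 half-integer roots (±1/2 everywhere, with an even number of minus signs), and the function names are ours:

```python
import itertools
import numpy as np

def e8_roots():
    roots = []
    # 112 roots: two entries equal to +/-1, the rest 0.
    for i, j in itertools.combinations(range(8), 2):
        for si, sj in itertools.product((1.0, -1.0), repeat=2):
            v = np.zeros(8)
            v[i], v[j] = si, sj
            roots.append(v)
    # 128 roots: all entries +/-1/2 with an even number of minus signs.
    for signs in itertools.product((0.5, -0.5), repeat=8):
        if sum(s < 0 for s in signs) % 2 == 0:
            roots.append(np.array(signs))
    return np.stack(roots)  # shape (240, 8); every root has squared norm 2

def sample_dither(rng):
    # Acceptance-rejection: propose uniform in [-1, 1]^8 (a box that covers
    # the Voronoi region) and accept when the proposal satisfies
    # |d . s_i| <= ||s_i||^2 / 2 = 1 for every minimal vector s_i.
    S = e8_roots()
    while True:
        d = rng.uniform(-1.0, 1.0, size=8)
        if np.all(np.abs(S @ d) <= 1.0):
            return d
```

Since E_8 has unit covolume while the proposal box has volume 2^8, roughly one proposal in 256 is accepted, which is still fast in practice.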

V. NATURAL GRADIENT BASED FEDERATED LEARNING
Conventional federated learning imposes certain computing and communication requirements on the clients, since back propagation training is computationally intensive, and each transmission involves sending all model parameters from the client to the central server. In this section, we consider the scenario where each client has only very limited computing and communication power, and propose a corresponding federated learning strategy based on natural gradient learning [34]. Specifically, the local update at each client involves only the forward pass and no back propagation. Moreover, in each uplink transmission, each client transmits only a single scalar.
As before, denote the training data of client k as {(x_i^k, y_i^k)}_{i=1}^{n_k}. To update the global model parameters W, the gradient in (6)-(7) is replaced by the following natural gradient: each client k computes the loss of a randomly perturbed model on its local data,

L_k = L(W + Δ_k), (46)

and the natural gradient is formed as

g = Σ_{k=1}^{K} ρ_k L_k Δ_k, (47)

where L(•) is the loss function, Δ_k is a perturbation matrix of the same size as W with S i.i.d. Gaussian N(0, ν²) samples, and ρ_k = n_k/n with n = Σ_{k=1}^{K} n_k being the total number of data samples. It is seen from (46)-(47) that the natural gradient g is a linear combination of the perturbations to the global model W, with the weights being the losses of the perturbed networks on the local training data. At each iteration during training, the scalar loss L_k in (46) is computed at client k and sent to the central server; the central server receives the loss values {L_k}_{k=1}^{K} from all clients, computes the natural gradient g in (47), and updates the global model as

W ← W − ξ g, (48)

where ξ is the learning rate parameter. The updated global model W is then broadcast to all clients. Note that the same random perturbation matrix Δ_k needs to be generated at both client k and the central server, which can be implemented by using a shared common random seed. According to the analysis in [50] and [51], without channel noise, the convergence rate of exact natural gradient descent is O(1/t), where t is the number of global updates.
The simple natural gradient estimate (46)-(47) has a high variance. Here, we adopt an antithetic sampling (AS) method [52] to reduce the variance. AS perturbs the network twice in opposite directions using the same Δ_k, and computes the loss as

L_k = (1/2) [L(W + Δ_k) − L(W − Δ_k)], k = 1, ..., K. (49)

Other estimators can be found in [34] and [53].
For natural gradient-based federated learning, the central server shares a common seed with each client k, which is used to generate a random seed_k at each iteration. Each training iteration consists of the following steps:
1) The server broadcasts the current global model W to all K clients, where W represents the model parameters.
2) Client k generates a random seed_k using the common seed shared with the server, and then generates the perturbation matrix Δ_k, which contains i.i.d. Gaussian N(0, ν²) samples, using seed_k. Client k then performs forward passes using the parameters W + Δ_k and W − Δ_k on its local training data set, and obtains the loss L_k in (49).
3) Each client k transmits the channel-inverted signal cρ_k L_k/h_k to the server M times through orthogonal channels as described in Sec. II-C, where h_k is the channel gain, and c is a scaling parameter chosen to meet the codeword power constraint (50).
4) The server receives y_k(m) = cρ_k L_k + z_k(m), m = 1, ..., M, where z_k(m) ∼ N(0, σ²) are i.i.d. Gaussian noise samples, and uses the sample average to obtain an estimate of ρ_k L_k. The server then generates seed_k from the common seed shared with client k, generates Δ_k using seed_k, and updates the global model via (47)-(48).
Fig. 3 gives a visualized illustration of the natural gradient method: each client k computes the weighted loss ρ_k L_k (blue lines) along its perturbation direction Δ_k (black lines), and transmits the loss to the server. The server then obtains the weighted average as the natural gradient g (the red line).
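The whole loop can be sketched end to end on a toy problem; the linear-model loss, the client sizes, ν, and the learning rate are all illustrative, and the wireless channel is taken as noiseless here:

```python
import numpy as np

def local_loss(W, X, y):
    # Hypothetical client objective: mean squared error of a linear model
    # (evaluated by a forward pass only -- no gradients are ever computed).
    return np.mean((X @ W - y) ** 2)

def ng_round(W, clients, seeds, nu):
    # One natural-gradient FL iteration with antithetic sampling: client k
    # evaluates the loss at W + D_k and W - D_k and would report only the
    # scalar L_k; the server regenerates D_k from the shared seed and forms
    # g as the loss-weighted combination of the perturbations.
    n = sum(len(y) for _, y in clients)
    g = np.zeros_like(W)
    for (X, y), seed in zip(clients, seeds):
        D = np.random.default_rng(seed).normal(0.0, nu, size=W.shape)
        L = 0.5 * (local_loss(W + D, X, y) - local_loss(W - D, X, y))
        g += (len(y) / n) * L * D
    return g

rng = np.random.default_rng(0)
W_star = rng.normal(size=5)
clients = [(X, X @ W_star) for X in (rng.normal(size=(50, 5)) for _ in range(4))]

W = np.zeros(5)
for _ in range(300):
    seeds = rng.integers(0, 2**31, size=len(clients))
    W = W - 0.5 * ng_round(W, clients, seeds, nu=0.1)
```

Note that each round moves only K scalars uplink; the perturbations themselves are reproduced at the server from the shared seeds.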

VI. PERFORMANCE EVALUATIONS
We first evaluate the compression-performance tradeoff of the tensor models for federated learning with ideal (i.e., noiseless) communications. Then, under noisy communication channels, we evaluate the lattice-coded over-the-air computation method and the natural gradient-based FL method.

A. EXPERIMENTAL SETUP
Data sets: We use the following data sets: • The MNIST data set [54] contains gray-scale images of handwritten digits, where each image has 28×28 pixels. There are 60,000 training images and 10,000 testing images, both evenly split into 10 classes.
• The CIFAR-10 data set [55] contains 60,000 color images in 10 classes, where each image has size 32×32×3. There are 50,000 training images and 10,000 testing images, both evenly split among the 10 classes. The training data sets are split evenly and stored locally among the clients. Our experiments are executed on two NVIDIA Quadro RTX 5000 GPUs, each with 16 GB of memory. We use PyTorch to implement the neural networks.
Performance metrics: We consider the following performance metrics: 1) Compression ratio: the ratio between the parameter size of the original network and that of the low-rank tensor-based network; 2) Convergence: the loss value versus the communication round during training; 3) Test accuracy: the percentage of correctly predicted labels for samples in the testing data set.

Remark:
We adopt the FedAvg framework [9] as described in Sec. II-B. Although our experimental study focuses on i.i.d. data, a recent work [56] shows that the FedAvg framework remains effective under data heterogeneity.

B. FL WITH LOW-RANK TENSOR MODELS AND IDEAL COMMUNICATION
In this subsection, we assume that the communication channels are ideal (noiseless) and evaluate the three FL performance metrics under different rank values of the tensor models described in Sec. III, corresponding to different compression ratios. Note that a higher compression ratio leads to a smaller network parameter size S and therefore lower communication overhead.
Federated learning settings: We adopt the FedAvg framework [9] as described in Sec. II-B. We take the conventional FedAvg as the baseline, i.e., the red lines in Fig. 4 and Fig. 5. There are 10 clients with equal weights. Each client uploads its local model to the central server after every 10 epochs of local updates. During training, we use 120 communication rounds for the fully connected networks and 350 communication rounds for the convolutional neural networks.

1) FC NETWORKS
We compare the performances of a conventional FC network and the TT decomposition-based FC networks on the MNIST data set [54].
FC network structure: There are three hidden layers and one output layer, where each hidden layer has width N_ℓ = 1024, ℓ = 1, 2, 3, and the output layer has size 10, i.e., N_4 = 10. The activation function is ReLU and the loss function is cross-entropy. The batch size is 128 and the learning rate is 0.01. We used the Adadelta optimizer [57]. The dimensions of the FC model parameters in (2), and those of the corresponding TT model in (25), are shown in Table 2. The conventional FC network has in total 2,913,290 parameters.
For the TT decomposition-based FC network described in (26), we set all TT-ranks to a common value R, and the output layer has the same size as that of the original FC network. There are in total 795,402 parameters when R = 64. Similarly, there are 211,850 parameters when the TT-rank R = 32, and 64,458 parameters when R = 16.
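The parameter counts above can be checked with two small helpers. The conventional count (weights plus biases) is fully determined by the stated layer widths; the TT counter is a generic sketch, since the exact mode factorization of each 1024-wide layer is given in Table 2 and is not reproduced here, so `tt_matrix_params` illustrates the counting rule rather than the paper's specific numbers.

```python
def fc_params(dims):
    """Parameter count (weights + biases) of a conventional FC network
    with layer widths dims = [input, hidden_1, ..., output]."""
    return sum(dims[i] * dims[i + 1] + dims[i + 1] for i in range(len(dims) - 1))

def tt_matrix_params(row_modes, col_modes, ranks):
    """Parameter count of a weight matrix stored in TT format: rows and
    columns are factorized into d modes, and core d has shape
    ranks[d] x row_modes[d] x col_modes[d] x ranks[d+1],
    with boundary ranks ranks[0] = ranks[-1] = 1."""
    return sum(r0 * m * n * r1
               for r0, m, n, r1 in zip(ranks[:-1], row_modes, col_modes, ranks[1:]))

# The stated MNIST FC architecture: 784 -> 1024 -> 1024 -> 1024 -> 10.
total = fc_params([28 * 28, 1024, 1024, 1024, 10])   # 2,913,290 parameters
```

This reproduces the 2,913,290-parameter figure for the conventional network, confirming that the count in the text includes biases.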
CNN network structure (VGG-like [58]): There are six convolutional layers, where the last two have C_5 = 128 and C_6 = 256 channels, followed by one hidden FC layer with width N_7 = 256. The activation function is ReLU and the loss function is cross-entropy. The batch size is 128 and the learning rate is 0.005. We used BatchNormalization [59] in each layer and the RMSprop optimizer [60]. The dimensions of the model parameters are shown in Table 4. Therefore, the conventional CNN network has in total 837,898 parameters.
We decompose the six convolutional layers using the CP tensor decomposition and the FC layer using the TT tensor decomposition as described in (28). In the first case (CP/TT-1), the CP ranks are 8, 16, 16, 32, 32, 64 for the six convolutional layers, respectively, and the TT-rank is 16 for the FC layer. The output layer has the same size as that of the original CNN. Therefore, there are 60,338 parameters in total, as shown in Table 4. Similarly, the second case (CP/TT-2) with CP ranks 6, 12, 12, 25, 25, 51 and TT-rank 16 has in total 51,022 parameters, while the third case (CP/TT-3) with CP ranks 3, 6, 6, 12, 12, 25 and TT-rank 16 has in total 33,363 parameters.
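The CP-side counting can be sketched as follows. This is a generic four-way CP count (one rank-R factor matrix per mode of the (C_out, C_in, k, k) kernel tensor); the paper's exact factorization in (28) and its bias handling may differ, so the helpers illustrate the scaling rather than reproduce Table 4.

```python
def conv_params(c_out, c_in, k):
    """Parameter count of a conventional k x k convolutional layer (with bias)."""
    return c_out * c_in * k * k + c_out

def cp_conv_params(c_out, c_in, k, rank):
    """Parameter count of a rank-R CP factorization of the 4-way kernel tensor
    of shape (c_out, c_in, k, k): one rank-R factor matrix per mode, so the
    count grows linearly in each dimension instead of multiplicatively."""
    return rank * (c_out + c_in + k + k)
```

For example, a hypothetical 64-to-128-channel 3×3 layer has 73,856 parameters conventionally but only 6,336 under a rank-32 CP factorization, which is the source of the order-of-magnitude compression reported in Table 5.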
We summarize the compression ratios and test accuracies in Table 5. For CP/TT-1, the test accuracy drops by only 1.53%, while the compression ratio is nearly 14. The test accuracy drops as the tensor rank decreases, but in return we achieve much higher compression ratios and therefore much lower communication overhead. As shown in Fig. 5, the training loss increases as the tensor rank decreases. Although CP/TT-3 achieves a 25-fold compression ratio, it is beyond the compression achievable within an acceptable accuracy loss. All neural networks converge after 200 communication rounds.

C. FL WITH LOW-RANK TENSOR MODELS AND LATTICE-CODED TRANSMISSION
For uplink transmission, we compare our lattice-coded over-the-air computation method with the traditional repetition scheme on the MNIST data set. The FC network and the FL system follow the same settings as in Sec. VI-B. The TT decomposition-based FC network with TT-rank R = 32 is used in our simulations, with total parameter size S = 211,850 and a compression ratio of 13.75. The lattice dimension is s = 8 and the lattice type is E_8. The signal-to-noise ratio is defined as SNR = P/σ^2. In Table 6, we summarize the test accuracies with different repeat times M under several SNRs, and in Fig. 6 we show the corresponding training performances. For any given M, both the test accuracy and the training loss of the lattice-coded scheme are better than those of the repetition scheme, and the improvement grows substantially as M increases. This is because the distortion of the repetition scheme is inversely proportional to M, whereas that of the lattice-coded scheme drops exponentially with M. Moreover, the lattice-coded scheme is especially effective in very noisy channels, i.e., at low SNR, in the sense that it requires only a moderate number of transmissions to reach satisfactory learning performance, whereas the repetition scheme needs a huge number of transmissions. Therefore, the proposed lattice-coded transmission plays a key role in efficient FL over noisy wireless environments.
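The 1/M distortion law of the repetition baseline is easy to verify numerically. The sketch below is a toy Monte Carlo under an additive-Gaussian channel assumption; the lattice-coded scheme's exponential decay depends on the E_8 code construction and is not simulated here.

```python
import numpy as np

def repetition_distortion(sigma, M, trials=200_000, seed=1):
    """Monte Carlo estimate of the residual MSE of the repetition scheme:
    the receiver averages M independent noisy copies of the same value,
    so the distortion scales as sigma^2 / M."""
    rng = np.random.default_rng(seed)
    residual = rng.normal(0.0, sigma, size=(trials, M)).mean(axis=1)
    return float(np.mean(residual ** 2))
```

With sigma = 1, quadrupling M from 4 to 16 cuts the distortion by a factor of 4, matching the inverse-linear law: this is why the repetition scheme needs a huge number of transmissions at low SNR.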

D. FL WITH LOW-RANK TENSOR MODELS AND NATURAL GRADIENT
We evaluate the performance of the natural gradient approach. For the MNIST data set, we use the TT decomposition-based FC network with TT-rank R = 32, following the same settings as in Sec. VI-B; the model has parameter size S = 211,850 and a compression ratio of 13.75. For the CIFAR-10 data set, we use CP/TT-1 from Sec. VI-B.2 with learning rate 0.02; the model has parameter size S = 60,338 and a compression ratio of 13.894. For the perturbation in (46), we set the standard deviation ν = 0.01.
Federated learning settings: There are 10 clients with equal weights. Each client k uploads its loss value L_k to the central server using the orthogonal channels described in Sec. II-C.
In Table 7, we summarize the test accuracies with different repeat times M under several SNRs. In Fig. 7, the training performances under different repetition times M are shown for different SNR values. It is seen that at high SNR, a small number of transmissions M suffices to reach learning performance close to that under the ideal channel. Recall that the natural gradient-based FL is designed for edge devices with very limited computing and communication capabilities. Accordingly, its performance is also limited, in the sense that it should operate in channels with high SNR, and its training takes a large number of communication rounds, though each round requires much less bandwidth.

VII. CONCLUSION
We have proposed communication-efficient techniques for federated learning over wireless channels. We employ low-rank tensor models to represent neural networks, with both forward and backward passes performed with respect to the tensor parameters, which leads to a significant reduction in both computational complexity and communication overhead. Moreover, for the case where edge clients have barely sufficient computing and communication power, we have proposed a lattice-coded over-the-air computation scheme for the uplink transmission of model parameters that achieves substantial distortion reduction compared with the conventional repetition transmission. Finally, for the case of extremely resource-constrained edge clients, we have proposed the natural gradient-based FL scheme, which involves forward passes only and requires each client to transmit a single scalar to the server at each training iteration. Numerical results on the MNIST and CIFAR-10 data sets demonstrate that the proposed techniques significantly reduce the communication overhead while maintaining high learning performance.

The tensors indexed by d = 1, ..., 4 are intermediate tensors. As shown in Fig. 2, the TT decomposition compresses the FC layer. The dotted lines on the left side denote the tensor contraction operation described in Sec. III-A. Note that the time complexity of the forward pass is reduced from O(N^4) in the conventional FC layer to O(2RN(R + N)) ≈ O(2RN^2) in the TT decomposition-based FC layer.

FIGURE 2. Illustrations of the TT decomposition for the FC layer, where TT-rank R = 1.

FIGURE 3. Illustrations of natural gradients, where the black arrows are exploration directions, the blue arrows are the weighted losses in each direction, and the red arrow, which is the weighted average of explorations, denotes the natural gradient.

FIGURE 4. Training losses of FC networks on MNIST data set.

FIGURE 5. Training loss of CNN networks on CIFAR-10 data set.

FIGURE 6. Training loss of the repetition scheme and the lattice-coded scheme on MNIST data set.

FIGURE 7. Training loss of the natural gradient method. (a) The MNIST data set with 20 dB noise. (b) The MNIST data set with 25 dB noise. (c) The CIFAR-10 data set with 20 dB noise. (d) The CIFAR-10 data set with 25 dB noise.

Test accuracy (in percentage %) of the repetition scheme and the lattice-coded scheme on MNIST data set