ResFed: Communication-Efficient Federated Learning With Deep Compressed Residuals

Federated learning allows for cooperative training among distributed clients by sharing their locally learned model parameters, such as weights or gradients. However, as model sizes increase, the communication bandwidth required for deployment in wireless networks becomes a bottleneck. To address this, we propose a residual-based federated learning framework (ResFed) that transmits residuals instead of gradients or weights over the network. By predicting model updates at both the clients and the server, residuals are calculated as the difference between the updated and the predicted models and carry denser information than weights or gradients. We find that the residuals are less sensitive to an increasing compression ratio than other parameters and hence apply lossy compression techniques to the residuals to improve communication efficiency in federated training. At the same compression ratio, ResFed outperforms current methods (weight- or gradient-based federated learning) by over $1.4\times$ on federated data sets, including MNIST, FashionMNIST, SVHN, CIFAR-10, CIFAR-100, and FEMNIST, in client-to-server communication, and it can also be applied to reduce the communication cost of server-to-client communication.

Federated learning has been applied in various applications [2], such as mobile keyboard prediction, speech recognition, and image object detection.
However, with the increasing size of machine learning models, the existing mobile communication infrastructure cannot always meet the bandwidth and latency requirements of federated learning, which limits its wide deployment. For instance, when training a transformer model with billions of 32-bit float parameters, the messages of a single federated learning round can amount to several tens or even hundreds of gigabytes, e.g., for a CTRL model [3] with 1.6 billion parameters or a T5 model [4] with up to 11 billion parameters. This causes enormous and extremely costly data traffic, even in 5G NR networks with throughputs of 5 to 18 Gb/s. Another application scenario is improving machine learning models for road traffic object recognition and detection in vehicle-to-everything (V2X) communication networks [5], where the V2X bandwidth is simultaneously occupied by other traffic services, e.g., the collective perception service, and safety-related services clearly have higher priority. Therefore, communication efficiency is a pivotal component for deploying federated learning, especially in wireless networks.
In an attempt to tackle the communication bottleneck, parameter compression is considered one of the most effective approaches: it allows models to be updated by transmitting much smaller messages over the network and thereby reduces the time required per communication round in federated learning. The approaches proposed by [6], [7], and [8] can effectively reduce the communication volume (CV) in each round through various quantization techniques; however, they only consider the communication efficiency of uploading (client-to-server), not of downloading (server-to-client). Lin et al. [9] compressed the gradients instead of the model weights for distributed learning, which does not fit federated learning well, where clients may train for multiple epochs in each round. Wu et al. [10] and Song et al. [11] used knowledge or data set distillation to learn and transmit a more compact model or a synthetic data set, which can alter the original local model structure or data distribution. Furthermore, most of those works compress the model parameters or gradients of a specific round without considering the inter-round similarity of model updates, which leaves additional sequential redundancy unexploited.
Inspired by residuals in video compression protocols [12], we introduce a residual-based federated learning framework, termed ResFed. It allows the server and clients to share and update models by exchanging model residuals rather than model weights or gradients. In particular, by observing the training trajectory in each local client and the aggregation trajectory in the server, we argue that model updates in the clients and the server are predictable. These predictive models, in analogy with predictive frames in video transmission, can foresee model updates in federated learning. After each communication round, we use the deviations between the predicted and the actually updated model parameters, which we call the model residuals, for the communication in the network. Note that the actually updated models are always recoverable from the residuals, because the senders' predictors are shared with the receivers in ResFed. More details are provided in Sections III and IV. Unlike transmitting model weights, ResFed wrings out potential redundancy by removing the information that is predictable from the update history and keeping only the residuals for communication. Compared to transmitting model gradients after each training epoch, ResFed allows the models to be trained locally multiple times. Compared to transmitting the residual accumulation over multiple epochs, ResFed further reduces the transmitted information by predicting the model updates from history. As shown in Fig. 2, the values of the residuals are overall smaller than those of the weights and gradients during the entire training process. To further shrink the message size, we compress only the residuals using sparsification and quantization, and encode the messages for information sharing in both client-to-server and server-to-client communication.
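To make the workflow concrete, the following minimal sketch (ours, with hypothetical helper names such as predictor, local_train, compressor, and channel) outlines steps (b)-(e) on the client side; it is an illustration of the idea under these assumptions, not the paper's reference implementation.

import copy

def resfed_client_round(global_params, local_traj, global_traj,
                        predictor, local_train, compressor, channel):
    # Sketch of one client-side ResFed round; `predictor`, `local_train`,
    # `compressor`, and `channel` are hypothetical stand-ins for the shared
    # predictor f_predict, the local training routine, the lossy residual
    # compressor, and the uplink, respectively. Models are parameter dicts.

    # (b) Run the same model prediction the server will run, using the
    #     cached local and global trajectories.
    predicted = predictor(local_traj, global_traj)

    # Local training (possibly several epochs) starting from the received
    # global model.
    updated = local_train(copy.deepcopy(global_params))

    # (c) Residual = updated parameters minus predicted parameters.
    residual = {name: updated[name] - predicted[name] for name in predicted}

    # (d) + (e) Compress the residual and transmit only the residual bits.
    channel.send(compressor.compress(residual))
    return updated, residual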
Our main contributions are summarized as follows.

II. RELATED WORK
Deep Compression: Deep compression was originally proposed by [13] and aims at compressing deep learning models through a pipeline of sparsification (pruning), quantization, and encoding for more efficient deployment. Based on deep compression, Lin et al. [9] proposed deep gradient compression to reduce communication costs in distributed learning by compressing gradients rather than model weights, which can also be used in federated learning.
However, models are usually trained for more than one epoch locally in federated learning [1], [14], [15], [16], which yields a gradient accumulation rather than the single-step gradients of other distributed learning scenarios [17]. Our empirical experiments in Section VI-B also show that compressing residuals achieves better communication efficiency than compressing gradients, thanks to the additional prediction step. In ResFed, we specifically consider residuals, which remove the inter-round model similarity from each federated learning communication round and achieve better compression performance by leveraging deep residual compression.
Federated Learning and Communication Efficiency: Communication efficiency is key for deploying federated learning in real application scenarios, especially when training a large model. Previous research by [16], [18], [19], [20], and [21] attempted to reduce the number of communication rounds needed for better communication efficiency. Meanwhile, the approaches proposed by [6], [7], [8], and [22] build upon deep compression and focus on improving communication efficiency by decreasing the CV. However, unlike compressing residuals in ResFed, they compress model weights without considering the redundancy in the sequential updates of federated learning. The recent work in [23], concurrent with ours, also mentions a predictive model update in federated learning, but it does not use the model update history to reduce the redundancy of the parameters.
Additionally, all the algorithms above can only improve communication efficiency for up-streaming, while ResFed can handle both up- and down-streaming for heterogeneous resource-constrained environments. This is in line with the bidirectional compression design in [47]. We also observe that recent research on modern error feedback, such as [48] and [49], provides proven enhancements in communication efficiency that are not limited to the communication of weights and gradients.
Residuals in Video Encoding: Residuals have been widely and successfully used in video encoding since H.261 [12], [24]. By exploiting interframe correlations, the pixel values in the current frame are predicted from previous frames, and only the residuals, i.e., the deviations between the predicted and the actual pixel values of the current frame, are encoded and streamed to the receivers. Inspired by the residuals in video encoding, we integrate model residuals into federated learning in ResFed, where the inter-round similarity of model updates is analogous to the interframe correlation in video encoding.

III. MODEL RESIDUAL AND COMPRESSION
To set the foundation for our framework, we begin by presenting the relevant concepts and methods. Our focus is on the communication between a single client $i$ and the server in a federated learning system comprising $N$ clients and one server, as shown in Fig. 1. The information exchange occurs from communication round $t$ to $t + 1$ and is identical for all other clients.
Notation: Throughout this article, we adopt the following notation: $w$ represents the model weights and $r$ denotes the model residuals. To distinguish between the parameters in the clients and the server, $v$ is used for the model weights on the server, while $u$ stands for the model weights on the client side. The residuals of client $i$ during uploading are denoted $r^t_{i,\mathrm{ul}}$ and the residuals of the server during downloading are denoted $r^t_{i,\mathrm{dl}}$. We provide an overview of the most relevant notations in Table I.

C. Model Prediction

Client: Given a client $i$ at time $t$, we predict its local update $\hat u^t_i \rightarrow u^t_i$ from the local and global training trajectories $L^{t-1}_i$ and $\hat G^{t-1}_i$ as
$$\tilde u^t_i = f_{\mathrm{predict},i}\big(L^{t-1}_i, \hat G^{t-1}_i\big), \qquad (1)$$
where $f_{\mathrm{predict},i}$ is the predictor used for model prediction in client $i$.

Server: For the server, we predict the model update $v^t_i \rightarrow v^t$ for each client $i$ from the local and global trajectories $L^{t-1}_i$ and $G^{t-1}_i$ as
$$\tilde v^t_i = h_{\mathrm{predict},i}\big(L^{t-1}_i, G^{t-1}_i\big), \qquad (2)$$
where $h_{\mathrm{predict},i}$ is the predictor used for model prediction in the server for each client $i$.

D. Model Residual
Given a model update $v^{t-1} \rightarrow v^t_i$ for client $i$ at time $t$ in the server, or $\hat u^t_i \rightarrow u^t_i$ in client $i$, and the model predictions $\tilde v^t_i$ or $\tilde u^t_i$ computed from (1) and (2), we define the model residuals as
$$r^t_{i,\mathrm{ul}} = u^t_i - \tilde u^t_i, \qquad r^t_{i,\mathrm{dl}} = v^t_i - \tilde v^t_i,$$
where $r^t_{i,\mathrm{ul}}$ is the residual of client $i$ for uploading and $r^t_{i,\mathrm{dl}}$ is the residual of the server for downloading, respectively.

E. Model Compression
To reduce the model size for more efficient communication, we shrink the model before sending it out. We define a compressed model in client $i$ and in the server as $\hat u_i = f_{\mathrm{compress}}(u_i)$ and $\hat v_i = h_{\mathrm{compress}}(v_i)$, where $f_{\mathrm{compress}}$ and $h_{\mathrm{compress}}$ are the compressors used for model compression in the clients and the server, respectively. In our system, we compress and communicate the model residuals instead of the models themselves, i.e.,
$$\hat r^t_{i,\mathrm{ul}} = f_{\mathrm{compress}}\big(r^t_{i,\mathrm{ul}}\big) \qquad (6)$$
in client $i$ and
$$\hat r^t_{i,\mathrm{dl}} = h_{\mathrm{compress}}\big(r^t_{i,\mathrm{dl}}\big) \qquad (7)$$
in the server.

IV. RESFED: RESIDUAL-BASED FEDERATED LEARNING FRAMEWORK
An overview of ResFed is shown in Fig. 1, and the detailed steps of one communication round with LC are given in Fig. 3. In particular, we introduce (a) predictor sharing, (b) model prediction, and (c) residual generation in Section IV-A. In Section III-E, (d) residual compression is formulated in (6) and (7). Then, after (e) communicating the residual bits, we provide details on (f) model recovery in Section IV-B. Finally, in Section IV-C, we describe (g) trajectory synchronization.

A. Predictor Sharing
In our ResFed framework, a pair of predictors is deployed in both the clients and the server, enabling model predictions based on the local and global model trajectories over time. The server caches the local model trajectories of all clients as soon as local model updates are received. When the models received by the server match both the local models and the models in $\hat G_i$ for client $i$, both of the client's trajectories are considered fully observable by the server.
The predictor of client $i$, $f_{\mathrm{predict},i}$, is shared with the server, so that the server holds an identical copy of $f_{\mathrm{predict},i}$. If the trajectories are fully observable by the server, the same model predictions can be made by both the server and the clients using (1). The model updates at time $t + 1$ can then be recovered by communicating the model residuals and combining them with the cached models of the previous time points. Similarly, the set of predictors on the server $\{h_{\mathrm{predict},i} \mid i = 1, \ldots, N\}$ is shared with the corresponding clients, so that every client $i$ holds an identical copy of $h_{\mathrm{predict},i}$, allowing the same model predictions to be made using (2).
In this work, the predictor design is based on the model transition dynamics within a sliding history time window. The predictor is formulated as
$$\tilde w = f_{\mathrm{predict}}\big(w^{t-M}, \ldots, w^{t}\big), \qquad (8)$$
i.e., the prediction of the next model uses the cached models in the window $[t - M, t]$. To reduce the memory used for caching trajectories in the clients, we apply a short time window in the prediction process. We categorize the predictors into two types: 1) the stationary predictor when $M = 0$ and 2) the linear predictor when $M = 1$. Note that we consider the model updates introduced by [25], [26], and [27] as special residuals, which are calculated using the stationary predictor. In Fig. 4, we compare the statistical properties (sum and variance) of the residual values generated by the stationary and linear predictors. The residual values generated by the linear predictor have a lower sum and a more concentrated variance than those of the stationary predictor at the start of federated learning (before convergence), potentially achieving higher accuracy at the same compression ratio, as shown in Fig. 5.
Specifically, the stationary predictor predicts the next model as the currently cached model, $\tilde w^t_i = \hat w^t_i$, resulting in a model residual of $r^t_i = w^t_i - \hat w^t_i$. When the number of local training epochs is fixed to 1, the stationary residuals become proportional to the gradients, i.e., $r = \eta g$, where $\eta$ is the learning rate and $g$ represents the gradients. The linear predictor, on the other hand, also takes the model transition of the last step into account, $\tilde w^t_i = \hat w^t_i + (\hat w^t_i - \hat w^{t-1}_i)$. The predictor of client $i$ on the server, $h_{\mathrm{predict},i}$, is designed analogously to $f_{\mathrm{predict},i}$.
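For illustration, the two predictor types can be sketched as follows (our Python sketch; the trajectory is assumed to be a list of parameter dictionaries, and the linear form is one plausible realization of extrapolating the last cached transition):

def stationary_predictor(trajectory):
    # M = 0: predict the next model as the most recently cached model.
    return trajectory[-1]

def linear_predictor(trajectory):
    # M = 1: extrapolate the last cached model transition one step forward.
    if len(trajectory) < 2:        # fall back to the stationary form at the start
        return trajectory[-1]
    prev, last = trajectory[-2], trajectory[-1]
    return {name: last[name] + (last[name] - prev[name]) for name in last}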

B. Model Recovery
In this setup, we cache the model trajectories in both the server and the clients. Each client maintains a history of its local and global model updates, while the server caches the global trajectory and the local trajectories of all connected clients. By sharing the predictors, the server can obtain the same model prediction $\tilde u^t_i$ as client $i$ at round $t$. If the client sends an uncompressed model residual ($\hat r^t_{i,\mathrm{ul}} = r^t_{i,\mathrm{ul}}$), the server can recover the model update after local training as $u^t_i = \tilde u^t_i + r^t_{i,\mathrm{ul}}$, where the prediction $\tilde u^t_i$ is computed from the cached trajectories, which start from the global model of the previous round. Similarly, by predicting the global model update in the client and sending uncompressed residuals, the aggregated model can also be recovered in the client.
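A minimal server-side sketch of this recovery, followed by a FedAvg-style aggregation of the recovered client models, is given below; the aggregation weights are our assumption for this sketch (the experiments in Appendix B use FedAvg).

def recover_and_aggregate(predictions, residuals, client_weights):
    # Recover each client's updated model from its predicted model and the
    # received (decompressed) residual, then aggregate the recovered models
    # FedAvg-style. predictions[i] and residuals[i] are parameter dicts for
    # client i; client_weights[i] is an aggregation weight (e.g., relative
    # local data size), which is an assumption of this sketch.
    recovered = [
        {name: predictions[i][name] + residuals[i][name] for name in residuals[i]}
        for i in range(len(residuals))
    ]
    total = sum(client_weights)
    return {
        name: sum(w * model[name] for w, model in zip(client_weights, recovered)) / total
        for name in recovered[0]
    }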

C. Model Trajectory Synchronization
To ensure that the model updates remain consistent across all clients and the server, we need to synchronize the model trajectories. The compression applied to the model residuals causes a loss of information, so the updated model of a sender cannot be fully recovered by the receivers, as $\hat r \neq r$. This leads to a difference between the original model $w$ in the sender and the recovered model $\hat w = \tilde w + \hat r$ in the receivers, causing the results of the shared predictors to drift apart.
To avoid this drift effect, we synchronize the model trajectories by simulating the recovery process in the sender. Instead of caching the original model updates, the sender recovers the model locally from the compressed residuals, which ensures that the trajectories in the sender remain consistent with those in the receiver. This is implemented in the ResFed pseudocode, as outlined in Algorithm 1.
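A minimal sketch of this sender-side simulation, assuming a hypothetical compressor object with compress and decompress methods:

def synchronize_trajectory(trajectory, predicted, residual, compressor):
    # Simulate the receiver's recovery on the sender side: cache
    # predicted + decompress(compress(residual)) instead of the true updated
    # model, so that the predictors on both sides keep operating on identical
    # trajectories despite the lossy compression.
    lossy = compressor.decompress(compressor.compress(residual))
    recovered = {name: predicted[name] + lossy[name] for name in predicted}
    trajectory.append(recovered)
    return recovered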

V. THEORETICAL ANALYSIS
The ResFed process can be divided into two distinct parts: calculating the residual during uploading and during downloading. This section focuses solely on the analysis of the residual properties during the uploading phase; the characteristics of the downloading residual are analogous. If we treat the remaining clients as a single entity, the process can be viewed as approximately symmetric, which is experimentally validated in Fig. 6.

A. Lossless Residual
In terms of the residual, we can propose the following statement.
Definition 1 (Full Recoverability): For any given time $t > 0$ during the uploading process, let $\hat v^t_i$ denote the recovered weights of the $i$th client on the server. If $\hat v^t_i = u^t_i$ holds consistently, we say the weights of the $i$th client are fully recoverable on the server. This condition preserves the convergence property of gradient-based federated learning. An analogous definition applies to the downloading process.
Theorem 1: If ResFed is implemented without an integrated LC method, the weights are always fully recoverable. As a result, the convergence of ResFed is consistent with that of gradient-based federated learning.
Proof: Suppose that for all $t \leq T$ the weights in the client and the server are identical, i.e., $u^t_i = v^t_i$. Assuming that the predictor $f_{\mathrm{predict},i}$ is shared between the client and the server, at $t = T$ both sides compute the same prediction, $\tilde u^T_i = \tilde v^T_i$. The residual of the $i$th client can then be calculated as $r^{T+1}_{i,\mathrm{ul}} = u^{T+1}_i - \tilde u^T_i$. After this, the $i$th client communicates $r^{T+1}_{i,\mathrm{ul}}$ to the server. On the server side, the residual is added to the shared prediction, $v^{T+1}_i = \tilde v^T_i + r^{T+1}_{i,\mathrm{ul}} = u^{T+1}_i$, so the weights remain fully recoverable, and the claim follows by induction.

B. Assumptions and Remarks
Now we consider the case where LC is integrated in ResFed. Typically, in federated learning, the objective is to reduce the average loss over all clients. In this section, we denote the loss function of client $i$ by $f_i$.

Definition 2 (Compression Error): For LC, for all $i \in \{1, 2, \ldots, N\}$, there exists a compression error $\varepsilon_i$ between the sent weights $u_i$ and the received weights $\hat v_i$. Here, we know $\hat v_i - u_i = \tilde v_i + \hat r_i - \tilde u_i - r_i$. Because the predicted weights are the same on the sender and the receiver, $\tilde v_i = \tilde u_i$, we have $\hat v_i - u_i = \hat r_i - r_i$.

For a local client $i$, the local update at time $t > 0$ can be written in terms of the inexact gradient based on the prediction, $\tilde g^t_i = g^t_i - r^t_i/\eta$, which is fully observable by both the sender and the receiver due to the trajectory synchronization introduced in Section IV-C. To compare with a compression result in nondistributed learning, we also consider the corresponding update with the global gradient. We follow the assumptions made in previous work [28], [29], [30].
Assumption 1 (L-Lipschitz Smoothness): Each local objective $f_i$ is $L$-smooth, i.e., its gradient is $L$-Lipschitz continuous.

Assumption 2 (Bounded Gradient): The stochastic gradients of $f$ are unbiased, and there exists a constant $G > 0$ such that $\mathbb{E}\|g_i\|^2 \leq G^2$ for all $i \in \{1, 2, \ldots, N\}$, where $g_i$ is a stochastic gradient at each node. We further consider the error between compressing locally and compressing globally to be small, as also introduced in previous work [31], [32].

Assumption 3 (Bounded Distributed Compression Error [32]): There exists a small constant $\xi$ bounding the difference between applying the compression locally at the clients and applying it to the global average.

Assumption 4 (Bounded Residual): There exists a constant $1 \geq \zeta > 0$ such that the expected magnitude of the residual is at most $\zeta$ times that of the corresponding gradient step $\eta g$. Combining Assumption 4 with the bounded gradient in Assumption 2 then bounds the residual by $\zeta \eta G$.

Remark 1: For the constant (stationary) residual, we have $\zeta = 1$. For the linear residual, based on Assumption 1, an additional bound can be derived; given that inequalities (20) and (21) need to be met concurrently, we adopt the lower value of their right-hand sides as the upper bound of the residual.

Remark 2: Assumption 4 is thus the key assumption on the residuals. We experimentally validate Assumption 4 for linear residuals in Fig. 4, showing that the residual expectations are lower than the gradients when comparing the sums of the parameter values in the model weights.

C. Convergence for ResFed
Now we specify the concrete compression method we want to use. In the compression pipeline defined in Fig. 3, the compression error $\varepsilon_i$ arises from sparsification and quantization. Here, we take the Top-K sparsification method as an example for further analysis and obtain a convergence guarantee for ResFed.
Theorem 2: ResFed with a sparse linear residual can converge throughout the duration of training, given a fixed learning rate defined as $\theta/\sqrt{T}$, where $\theta > 0$ is a constant, for any time instance $t > 0$.
Proof: If Assumption 1 holds, we can infer the descent inequality. Calculating the expectation at $t$, we obtain (26). Then we calculate the expectation with respect to the gradients before $t$, which yields (27). Next, we incorporate Lemma 3 into (26).
Using Assumption 1, we can infer (28). Taking the expectation of (28), incorporating (27), and using a fixed learning rate defined as $\theta/\sqrt{T}$, where $\theta > 0$ is a constant, it follows that the convergence rate of ResFed is approximately $1/\sqrt{T}$, aligning with that of FedSGD. When employing only lossless compression methods, ResFed's performance mirrors conventional federated learning, as depicted in Fig. 1(b). Opting to lossy-compress the residual in ResFed, rather than compressing gradients as in other federated learning frameworks, can potentially enhance the convergence rate. This enhancement arises from the reduced upper bound on the residual's expected value compared to the gradient, as shown in Figs. 2 and 4. Furthermore, due to the superior compression efficiency of the residual, ResFed can achieve a higher compression ratio while maintaining the same accuracy, as demonstrated in Fig. 5. More details on the lemmas and proofs can be found in Appendix A.
We employ an LC approach that combines sparsification and quantization, as depicted in Fig. 3. The method first sparsifies the model parameters using the Top-K approach, where sparsity is achieved by setting a certain number of parameters to zero, adopting the strategy employed in [43] and [44]. To reduce the storage required for these zero parameters, we follow a strategy similar to [9] and represent them by a single float-16 parameter that denotes the number of zero parameters between two nonzero parameters. The remaining nonzero parameters are quantized to a single bit, either 1 or -1, inspired by the approach in [6]; a codebook maps the 1s and -1s back to their representative values. Finally, to further reduce the data size for communication, we apply Huffman encoding to all parameters, following the technique used in [13].
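As an illustration of this pipeline, the following sketch (ours, in PyTorch) applies the three lossy stages to a single flattened residual tensor; the Huffman stage is omitted, and the helper names and data layout are our assumptions rather than the paper's reference implementation.

import torch

def compress_residual(residual, sparsity=0.99):
    # Sketch of the lossy pipeline: Top-K sparsification, zero-run-length
    # encoding, and 1-bit quantization with a two-value codebook.
    flat = residual.flatten()
    k = max(1, int(flat.numel() * (1 - sparsity)))

    # Top-K sparsification: keep only the K largest-magnitude entries.
    _, idx = torch.topk(flat.abs(), k)
    idx, _ = torch.sort(idx)
    values = flat[idx]

    # Zero-run-length encoding: store the number of zeros preceding each
    # kept value (a 16-bit value, as in the pipeline described above).
    gaps = torch.diff(idx, prepend=torch.tensor([-1])) - 1

    # 1-bit quantization: each kept value is reduced to its sign; a two-entry
    # codebook maps the signs back to representative magnitudes (the median
    # of the positive and of the negative kept values).
    signs = torch.sign(values)
    pos, neg = values[values > 0], values[values < 0]
    codebook = (pos.median() if pos.numel() else torch.tensor(0.0),
                neg.median() if neg.numel() else torch.tensor(0.0))
    return gaps.to(torch.int16), signs.to(torch.int8), codebook, flat.numel()

def decompress_residual(gaps, signs, codebook, numel):
    # Inverse of the sketch above (up to the quantization loss).
    flat = torch.zeros(numel)
    idx = torch.cumsum(gaps.to(torch.int64) + 1, dim=0) - 1
    flat[idx] = torch.where(signs > 0, codebook[0], codebook[1])
    return flat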

B. Residuals Versus Gradients and Weights
On top of the primary evaluation in Fig. 2, we expect deep residual compression to save more CV in federated learning with minimal impact on accuracy. We therefore evaluate federated learning with compression of weights, gradients, and the two different residuals, i.e., the stationary and linear residuals in (8). As indicated in Fig. 5, the deep residual compression approach consistently outperforms weight and gradient compression in terms of accuracy on both IID and Non-IID data sets. Moreover, the linear residuals yield higher accuracy and faster convergence than the stationary residuals. These results highlight the efficacy of communicating residuals in federated learning, as it enables a higher compression ratio per communication round than other parameters.

C. Communication Efficiency Improvement
Next, we evaluate the CV required for training three sizes of models on various data sets in both IID and Non-IID settings. Table III shows that, to reach a promising target accuracy, ResFed with LC (compression ratio set to 350x-375x) saves on average around 99% of the total CV over all clients in up- or down-streaming alone. Furthermore, the bit-saving ratios of ResFed in the IID and Non-IID settings are similar, which indicates that the compression performance of ResFed is robust to data heterogeneity in federated learning. We show how accuracy and training loss change with increasing CV in Fig. 6. The results indicate that communicating residuals in federated learning can remarkably reduce the overall CV.

D. Comparison With Other Methods
We benchmark ResFed against other advanced methods focusing on communication efficiency, namely, SignSGD [34], Top-K [31], 1-bit CS-FL [35], FTTQ [6], and QSFL [22], in Table IV. To ensure a fair comparison, especially with the state-of-the-art method QSFL, we adopt the same FEMNIST experimental setup as [22] and consider the first 36 writers as federated learning clients to obtain natural data heterogeneity. We compare the maximum compression ratio after reaching 70% accuracy over 200 communication rounds, and also record the peak accuracy within this span. We designate the CV of FedSGD as our baseline, i.e., a compression ratio of 1x. By using deep compressed residuals, ResFed clearly outperforms the other methods in terms of compression ratio. In particular, ResFed raises the compression ratio by around 15% compared to QSFL, with a marginal accuracy decline of 3.4%.

E. Scalability for Resource-Constrained Communication Environments
Finally, we explore the scalability of ResFed by tuning the compression ratios for client-to-server and server-to-client communication. This is important, as real-world applications may have heterogeneous network resources for up- and down-streaming. Fig. 7 illustrates the impact on test accuracy for various levels of sparsity, which result in different compression ratios per communication round. From Fig. 7(a), we observe that the testing accuracy decreases with a higher compression ratio per communication round when the number of communication rounds is fixed. However, when we consider a dedicated budget for the overall CV in up- or down-streaming, a higher compression ratio in ResFed results in improved accuracy, since more communication rounds become possible, as shown in Fig. 7(b). By adapting the compression ratio to the available network resources for up- or down-streaming, ResFed can effectively enhance federated learning through deep residual compression.

VII. CONCLUSION
In this work, we present ResFed, a novel framework that utilizes residual sharing instead of weight or gradient sharing. The framework leverages deep residual compression to enhance communication efficiency in both up- and down-streaming in federated learning, making it suitable for deployment in diverse network environments. Our experimental results demonstrate that the framework significantly reduces the overall CV while achieving prediction accuracy equivalent to standard federated learning. Furthermore, ResFed outperforms weight or gradient compression in terms of accuracy and convergence speed.
Limitations: In ResFed, local and global trajectories are stored in each client and in the server for model prediction and residual calculation. When running ResFed with $N$ clients to train a model with $V$ 32-bit float parameters and a trajectory length of $M$, each client needs $2 \times 32 \times V \times M$ bits of memory to store its local and global trajectories, i.e., a memory requirement proportional to $M \times V$ per client. On the server side, storing the local and global trajectories of all $N$ clients requires $2 \times 32 \times V \times M \times N$ bits of memory. To reduce the memory requirements, one potential solution is to cache compressed models in the trajectories of both the sender and the receiver. However, this could reduce the accuracy of the model prediction, and the tradeoff between memory and accuracy needs to be explored in future work.
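As a rough illustration with our own numbers, take the LeNet size from Fig. 2, $V = 61{,}706$ parameters, a window of $M = 1$, and $N = 10$ clients: each client then needs $2 \times 32 \times V \times M \approx 3.9$ Mb ($\approx 0.49$ MB) for its trajectories, and the server needs $2 \times 32 \times V \times M \times N \approx 39$ Mb ($\approx 4.9$ MB).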

APPENDIX A LEMMAS AND PROOFS
Building on the previous Top-K studies in [32] and [45], we establish the following lemmas concerning the upper bound of the error introduced by Top-K.
Lemma 1: For any local model residual $r_i \in \mathbb{R}^V$ in client $i$, when we sparsify the parameters below the top $K$ parameters (setting them to zero), the resulting sparsification error is bounded. We denote by $\hat r_i$ the compressed residual at client $i$; based on Lemma 1, the corresponding bound on $\hat r_i$ follows.

Lemma 2: If the above assumptions hold, then when the Top-K method is integrated, the compression error remains bounded for any $t > 0$.
Proof: For each client performing residual compression, according to (13), we obtain the per-client bound. Given Assumption 3, we deduce the bound for the aggregated update. When employing the Top-K approach for residual compression, the claim then follows from Lemma 1.

Lemma 3: When selecting a learning rate schedule for ResFed with a linear residual such that the schedule condition holds for any iteration $t > 0$ and some constant $D$, the accumulated error remains bounded.
Proof: Based on Lemma 2, iterating the inequality (39) for $k$ from 0 to $t$ and applying (37) and (21) to (40), the conclusion is reached.

APPENDIX B EXPERIMENTAL DETAILS AND FURTHER RESULTS
In this section, we provide details on the experiments presented in Section VI. We run the experiments on a computer cluster with 4x NVIDIA A100-PCIE-40GB GPUs and 4x 32-core AMD EPYC 7513 CPUs. The environment is a Linux system with PyTorch 1.8.1 and CUDA 11.1.
We demonstrate the learning task on six different data sets.
1) MNIST [38]: 60 000 training and 10 000 test data points. Each data point is a 28x28 gray-scale digit image, associated with a label from ten classes.
2) CIFAR-10 [41]: 50 000 training and 10 000 test data points. Each data point is a 32x32 RGB image, associated with a label from ten classes.
3) Fashion-MNIST [39]: 60 000 training and 10 000 test data points. Each data point is a 28x28 gray-scale image, associated with a label from ten classes.
4) SVHN [40]: 73 257 training and 26 032 test data points. Each data point is a 32x32 RGB digit image, associated with a label from ten classes.
5) CIFAR-100 [41]: 50 000 training and 10 000 test data points. Each data point is a 32x32 RGB image, associated with a label from 100 classes.
6) FEMNIST [36], [42]: 814 255 data points from 3500 writers. Each data point is a 28x28 gray-scale image, associated with a label from 62 classes (ten digits, 26 lowercase, and 26 uppercase letters).
The models trained on these data sets are shown in Table II.

A. Experiments for Section VI-B
We divide the data sets, i.e., MNIST and CIFAR-100, among 10 clients and run FedAvg with a local stochastic gradient descent (SGD) optimizer (momentum 0.9) to train LeNet-5 and a CNN (5Conv + 3FC), respectively. The learning rate is fixed to 0.01 and the batch size is 64 for all tests. The number of local epochs is 5. In the Non-IID setting, each client owns only 2 out of 10 classes in MNIST and 50 out of 100 classes in CIFAR-100.
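A sketch of such a label-skew partition is given below; the concrete class-to-client assignment is illustrative, and the paper's exact split may differ.

from collections import defaultdict

def partition_label_skew(labels, num_clients=10, classes_per_client=2):
    # Non-IID split sketch: client i holds samples from only
    # `classes_per_client` of the classes (2 of 10 for MNIST in this setup).
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[int(y)].append(idx)
    num_classes = len(by_class)

    # Assign each client a fixed set of classes, e.g., client i gets classes
    # {2i mod 10, 2i+1 mod 10} for MNIST with 10 clients.
    owned = {i: [(classes_per_client * i + j) % num_classes
                 for j in range(classes_per_client)]
             for i in range(num_clients)}
    owners = defaultdict(list)
    for i, classes in owned.items():
        for c in classes:
            owners[c].append(i)

    # Each class is shared evenly among the clients that own it.
    clients = [[] for _ in range(num_clients)]
    for c, idxs in by_class.items():
        holders = owners[c] if owners[c] else list(range(num_clients))
        share = len(idxs) // len(holders)
        for j, i in enumerate(holders):
            clients[i].extend(idxs[j * share:(j + 1) * share])
    return clients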
In the clients, we consider five different approaches.
1) No Compression: Standard federated learning without any compression is used as the baseline, which provides the best results among all approaches for the same number of communication rounds.
2) Compress Weights: Before communication in standard federated learning, the model weights are first compressed.
3) Compress Gradients: The gradients of each epoch are compressed and communicated to the server.
4) Compress Res-0: The residuals are computed by the stationary predictor, i.e., (8) with M = 0.
5) Compress Res-1: The residuals are computed by the linear predictor, i.e., (8) with M = 1.
For each approach on each data set, we run 10 tests with different seeds and report the mean and standard deviation in Fig. 5.
LC: We compress the model parameters using a deep compression pipeline [13] for client-to-server communication only. In particular, we set the sparsity to 80% and 99% for the residuals in LeNet-5 and the CNN (5Conv + 3FC), respectively. We use an SGD optimizer with momentum 0.9. The sparsified parameters are zero parameters, and the number of consecutive zero parameters is encoded as a 16-bit float parameter [9]. After that, we quantize the nonzero parameters to 1 bit using the median values of the positive and negative parameters [6]. Finally, Huffman encoding [46] is applied.

B. Experiments for Section VI-C
We train LeNet-5, a CNN (5Conv + 3FC), and ResNet-18 (from small to large) on Fashion-MNIST, CIFAR-10, and SVHN with ten clients in both IID and Non-IID settings. We run ResFed and lossy-compress the residuals either only for uploading (UL) or only for downloading (DL) to study the effect on each direction separately. The learning rate is fixed at 0.01 and the batch size is 64 for all tests. In the Non-IID setting, each client owns 50% of the classes (5 out of 10). We use the mean values of five tests per experiment for the evaluation in Table III and show the results with standard deviation in Fig. 6, which indicate that the overall CV can be remarkably reduced with ResFed.
For all experiments, we set the sparsity to 99% and use 1 bit for each nonzero parameter. Consequently, the compression ratio per communication round is approximately 350x for LeNet-5, 375x for the CNN (5Conv + 3FC), and 356x for ResNet-18.
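As a rough, purely illustrative back-of-envelope estimate (ours, not a reported figure), the per-parameter cost at 99% sparsity before Huffman coding is approximately $0.01 \times 1$ bit for the 1-bit nonzero parameters plus $0.01 \times 16$ bits for the 16-bit run-length values, i.e., about $0.17$ bits per parameter, or a ratio of roughly $32/0.17 \approx 188\times$; the remaining gap to the reported 350x-375x per round would then come from the Huffman coding step and implementation details.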

C. Experiments for Section VI-D
We train a CNN (2Conv + 1FC) for the experiments on FEMNIST [36]. We select the first 36 writers as clients. The data of each client are partitioned so that 80% is used for training and 20% for testing. A consistent learning rate of 0.01 is used for all methods. Except for FedSGD, all methods run ten local epochs with a batch size of 1. For Top-K, we test multiple sparsity levels in {80%, 85%, 90%, 95%, 99%, 99.5%} to identify the level that yields the highest compression ratio; the best result is achieved with a sparsity of 80%, while the other levels cannot efficiently reach the target accuracy of 70%. FTTQ does not reach the target accuracy of 70%. For SignSGD and QSFL, we refer to the results of [22], given that their experimental setup aligns with ours. The reported result for ResFed is based on a sparsity of 99.5%, with 1-bit quantization for all nonzero parameters.

D. Experiments for Section VI-E
To show the correlation between deep residual compression in up- and down-streaming, we train the CNN (5Conv + 3FC) model on IID CIFAR-10 with ten clients and tune the sparsity to achieve different compression ratios per communication round in ResFed. The learning rate is fixed at 0.01 and the batch size is 64. We use an SGD optimizer with momentum 0.9. The number of local epochs is 1. Specifically, the sparsity takes values in {0%, 90%, 95%, 99%, 99.5%} for both up- and down-streaming, and 1 bit is used for all nonzero parameters in the quantization.

Fig. 1. Paradigm shift from a standard federated learning system to a residual-based federated learning system, with two additional pairs of predictors and the corresponding operators (in green and orange). The model is updated in client i by local training (Train) and in the server by aggregation (Agg). (a) Standard federated learning system. (b) Residual-based federated learning system.

Fig. 2. Value comparison of model parameters, gradients, and residuals in federated learning. We train a LeNet with 61,706 32-bit float weights on the MNIST data set distributed among 10 clients, with a fixed learning rate of 0.001 and a batch size of 64. To fairly compare gradients and residuals, the number of local epochs per client is set to 1. We set six checkpoints at communication rounds {1, 4, 8, 16, 32, 128}. The results show that most residual values are smaller than the weights and gradients during training, suggesting that LC of residuals preserves more information and has less impact on the accuracy.

Fig. 3. ResFed system model with a lossy deep compression pipeline for efficiently transmitting encoded model residuals. The following steps are performed for one communication round. (a) Share the predictor between sender and receiver before federated learning starts. (b) Execute the same model prediction before communicating any model information. (c) Generate the model residuals. (d) Compress the residuals using deep compression. (e) Communicate the residual bits with encoding and decoding. (f) Recover the model from the received residuals. (g) Synchronize the model trajectory by simulation, recovering the model locally under consideration of LC.

Algorithm 1 ResFed: Residual-Based Federated Learning Framework. The highlighted sections indicate the model recovery module and the module for residual computation and compression.
1: server initializes the global model v, the empty local model trajectories L_1, ..., L_N, the global model trajectories G_1, ..., G_N, and the predictor h_predict
2: for i in {1, 2, ..., N} do
3:   client i initializes an empty local model trajectory L_i and an empty global model trajectory G_i
4:   h_predict,i = h_predict   (share predictors to client i)

Fig. 4. Comparison of linear residuals (Res-1) and stationary residuals (Res-0) during down- and uploading (server-to-client and client-to-server) in federated learning on IID and Non-IID MNIST for each of 10 clients, with a fixed learning rate of 0.001 and a batch size of 64. The number of local epochs per client is set to 3. We show the training loss, the variance (Var), and the sum of the parameter (Param) values. The results show that the parameter values of Res-1 are lower and more concentrated than those of Res-0 at the beginning of federated learning (before convergence), which allows us to compress them with less information loss and higher accuracy under the same sparsification and quantization. This conclusion is further validated in Fig. 5.

Fig. 7. Testing accuracy for various values of sparsity and compression ratio per communication round in deep residual compression for both up- and down-streaming in ResFed (Res-1). We train a CNN model on a federated CIFAR-10 data set with ten clients. The testing accuracy decreases with greater sparsity for the same number of communication rounds, whereas under a constrained communication budget, deep residual compression with greater sparsity achieves more promising testing accuracy. (a) The number of communication rounds (300) is the same. (b) The communication cost budget (14 Mb) is the same in up-streaming (left) and down-streaming (right).
The models are then aggregated to produce the second update $\{v^t_i \mid i = 1, 2, \ldots, N\} \rightarrow v^t$. Similarly, at the server, we cache the local and global model updates of all clients in the local and global model trajectories, respectively, i.e., $\{L_i \mid i = 1, 2, \ldots, N\}$ and $\{G_i \mid i = 1, 2, \ldots, N\}$. Note that if the server can always send lossless global model updates to all clients, the global trajectories at time $t$ are the same for all clients. Each client $i$ caches the received global models $\{\ldots, \hat u^t_i\}$ as its global model trajectory $\hat G_i$.

TABLE II
DATA SETS AND MODELS IN THE EXPERIMENTS

TABLE III
CV AND THE BIT-SAVING RATE (BR) TO REACH THE TARGET ACCURACY (ACC) WHEN USING RESFED ONLY IN UPLOADING (UL) OR DOWNLOADING (DL). WE USE FEDAVG FOR BOTH THE BASELINE (WITHOUT ANY COMPRESSION) AND RESFED (RES-1). NOTE THAT THE COMPRESSION RATIO PER COMMUNICATION ROUND IS SET FROM 350x TO 375x.