A Flexible Model Compression and Resource Allocation Scheme for Federated Learning

Communication overhead has become one of the major constraints on the application of federated learning (FL). To reduce this overhead by trading off the number of communication rounds against per-round latency, significant research efforts have been devoted to the joint optimization of model compression, client scheduling, and resource allocation to reduce the total training time. To keep the joint optimization tractable, existing methods assume the same compression level, unchanged participating clients, and identical round durations throughout training, resulting in low resource usage efficiency. In this paper, we propose a flexible model compression and resource allocation scheme that minimizes the total communication time of FL in mobile networks. The proposed scheme assigns adaptive compression levels and communication resources to each client in each round. Simulation results show that the proposed scheme outperforms state-of-the-art methods and is robust to outdated channel information.


I. INTRODUCTION
In federated learning (FL), edge devices collaboratively train a model with the help of a central server [1]. Each device only exchanges models or gradients with the central server while keeping its data local. By exploiting the computation and storage capabilities of edge devices, FL provides artificial intelligence (AI) capability at the edge while avoiding the problems caused by uploading raw data to the central server, such as privacy leakage. FL is a promising technology for edge AI and has broad applications in wireless networks [2].
In FL, a typical training process involves multiple rounds of communication between the clients and the central server, causing massive communication overhead. Significant research efforts have been devoted to mitigating this overhead by reducing the number of communication rounds, the per-round latency, or the total training time [3].
To reduce the number of communication rounds, the existing approaches include: 1) Federated averaging (FedAvg) that uses multiple local iterations of stochastic gradient descent (SGD) in each round [4], 2) client scheduling that selects appropriate clients to participate in FL [5], and 3) optimization acceleration, such as momentum FL [6], which introduces the momentum to accelerate the convergence for FL.
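The multi-local-iteration structure of FedAvg can be sketched as follows. This is an illustrative toy implementation on a linear least-squares task, not the paper's setup; the client data, learning rate, and iteration counts are assumptions chosen for the example:

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, n_iter=5):
    """Run n_iter local gradient steps on a linear least-squares loss."""
    w = w.copy()
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of 0.5 * mean squared error
        w -= lr * grad
    return w

def fedavg_round(w_global, client_data, lr=0.1, n_iter=5):
    """One FedAvg round: clients run local SGD, the server averages the results."""
    local_models = [local_sgd(w_global, X, y, lr, n_iter) for X, y in client_data]
    return np.mean(local_models, axis=0)

# hypothetical setup: 4 clients share the same ground-truth linear model
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
clients = []
for _ in range(4):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ w_true))

w = np.zeros(2)
for _ in range(50):                     # 50 communication rounds
    w = fedavg_round(w, clients)
```

Running multiple local iterations per round is what lets FedAvg trade extra local computation for fewer communication rounds.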
To reduce per-round latency, model compression is a widely used method. It reduces the amount of data to be transmitted through quantization, sparsification, and their combinations. The commonly used quantization methods include scalar quantization (e.g., probabilistic quantization in [4]) and vector quantization (e.g., universal quantization in [7]). Compared with vector quantization, scalar quantization incurs higher quantization errors but finds wider application due to its lower computational complexity and more adjustable quantization level. Typical sparsification methods include Rand-M sparsification (considered in the sketched update of [4]) and Top-M sparsification (used in sparse binary compression (SBC) [8]), where M parameters are randomly selected or selected by maximal magnitude from the local parameters or gradients. Compared with Rand-M sparsification, Top-M sparsification can reduce the sparsification error. However, to simplify the joint optimization, existing methods impose restrictive constraints: the compression levels of all clients must be identical [3], [15], all clients must participate in FL in each round [3], [12], or the number of scheduled clients must remain constant [15]. These methods perform well in static wireless networks. However, mobile networks have the following characteristics. 1) Changing participating clients: When some mobile clients move out of the cell or run low on battery, they cannot participate in every round. 2) Dynamic available bandwidth: Since FL is usually regarded as a service [17] in mobile networks that coexists with other wireless services, the available bandwidth for FL may change dynamically with the traffic load. 3) Time-varying channel gain: For most FL tasks, the total training time ranges from a few minutes to several hours [3], [16]; it is unrealistic to expect static channels during training. Jointly optimizing resource allocation and model compression for mobile networks faces the following challenges.
1) Unknown channel information: To solve the joint optimization problem, the central server is often assumed to know the channel information of all clients in all rounds at the beginning of FL. In [3], [12], and [16], the channel information is assumed to remain unchanged during the training period, which is not reasonable for mobile networks. How to solve a joint optimization problem without future channel information is still an open problem. 2) Low resource usage efficiency: The constraints used to simplify the joint optimization problem in [3], [12], [15], and [16] reduce the flexibility of model compression or resource allocation. For example, when all clients have the same amount of transmission data, the optimal BA has to allocate more bandwidth to the clients with worse channel quality to reduce per-round latency [3], [11], [15], [16]. If the clients are allowed to use different compression levels, more bandwidth can be allocated to the clients with good channel quality, which reduces the resources required to achieve the expected learning performance. Similarly, to enhance resource usage efficiency, it is also necessary to adaptively adjust the compression level and the per-round duration according to the dynamic participating clients and available bandwidth. 3) Unsuitable sparsification method: In [12], Top-M sparsification with perfect error compensation (EC) was considered. Rand-M sparsification without error compensation is better suited for mobile clients, but it has never been considered in existing joint optimization. To address these challenges, we develop a flexible model compression and resource allocation scheme to minimize the total communication time for FL in mobile networks.
In the proposed scheme, the model compression includes probabilistic quantization and Rand-M sparsification without error compensation, and the resource allocation consists of intra-round BA and inter-round uplink transmission time allocation (UTTA), which can adaptively adjust the compression level and allocate radio resources to different clients in different rounds. Our main contributions are summarized as follows: 1) To formulate a solvable joint optimization problem, we decompose the multi-round optimization problem that minimizes the total communication time into multiple single-round optimization problems that maximize the convergence speed. The decomposed problem is independent of future channel information and is capable of balancing the number of rounds and the per-round latency, which reduces the computational complexity significantly without performance loss. 2) To optimize the intra-round BA for higher resource usage efficiency, all clients are allowed to employ different sparsity ratios. Since our analysis reveals that the convergence speed depends on the average sparsity ratio of all clients rather than the minimal sparsity ratio, the proposed BA method allocates more bandwidth to the client with the maximal channel gain, achieving higher resource usage efficiency than existing BA methods. 3) To optimize the inter-round UTTA to maximize the convergence speed in a dynamic environment, a closed-form expression of the convergence speed would be needed. Since this expression is too complicated to derive for nonlinear learning models such as deep neural networks (DNNs), we use model-free unsupervised learning to optimize the UTTA. The proposed method can adapt to the dynamic bandwidth, participating clients, and time-varying channel gains, and provides the optimal uplink transmission time in arbitrary mobile scenarios. The rest of the paper is organized as follows.
We introduce the system model in Section II and propose a flexible model compression and resource allocation scheme in Section III. The performance of the proposed scheme is evaluated in Section IV. Finally, we conclude the paper in Section V.

II. SYSTEM MODEL
Consider a mobile network consisting of a base station (BS) and $K$ mobile clients, denoted by $\mathcal{K} = \{1, \cdots, K\}$, who collaboratively train a model through FL. The model parameter vector (referred to as the model vector for short) is represented by $\mathbf{w} \in \mathbb{R}^{N_{\mathrm{model}}}$, where $N_{\mathrm{model}}$ is the model size. For client $k$, the local empirical loss function is
$$F_k(\mathbf{w}) = \frac{1}{|\mathcal{D}_k|} \sum_{j=1}^{|\mathcal{D}_k|} L\big(\mathbf{w}; \mathbf{x}_k^j, y_k^j\big),$$
where $(\mathbf{x}_k^j, y_k^j)$ is the $j$th sample in the local dataset $\mathcal{D}_k$, $L(\mathbf{w}; \mathbf{x}_k^j, y_k^j)$ is the loss of the $j$th sample, and $|\mathcal{D}|$ is the cardinality of set $\mathcal{D}$, i.e., the number of elements in $\mathcal{D}$.
The goal of FL is to minimize the global loss
$$L(\mathbf{w}) = \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} F_k(\mathbf{w}),$$
where $\mathcal{D} = \bigcup_{k \in \mathcal{K}} \mathcal{D}_k$ is the global dataset.

A. MODEL COMPRESSION
To adapt to the varying participating clients, we consider the sketched update that includes probabilistic quantization and Rand-M sparsification without error compensation [4].
The relationship between the model vectors before and after compression can be expressed as
$$\tilde{\mathbf{w}} = \Phi(\mathbf{w}, \rho, b) = \mathbf{s} \odot \hat{\mathbf{w}},$$
where $\Phi(\cdot)$ is the compression function, $b \in [1, 32]$ is the number of quantization bits, $\rho = M/N_{\mathrm{model}}$ is the sparsity ratio satisfying $\rho \in [0, 1]$, $\mathbf{w}$ is the model vector before compression, $\hat{\mathbf{w}}$ is the model vector after quantization only, $\tilde{\mathbf{w}}$ is the model vector after both quantization and sparsification, $\mathbf{s}$ is a binary vector that denotes the pre-defined random sparsity pattern (i.e., a random mask), and $\odot$ denotes the element-wise multiplication operation. For $b$-bit probabilistic quantization, the range of the model parameters $[w_{\min}, w_{\max}]$ is first equally divided into $2^b - 1$ intervals, where $w_{\min} = \min_j \{w_j\}$, $w_{\max} = \max_j \{w_j\}$, and $w_j$ denotes the $j$th element of the model vector. The $n$th interval can then be expressed as $[q_n, q_{n+1}]$, where $n = 1, \cdots, 2^b - 1$, $q_n = w_{\min} + (n-1)\Delta$, and $\Delta = (w_{\max} - w_{\min})/(2^b - 1)$. When $w_j$ falls in the $n$th interval, i.e., $q_n < w_j \le q_{n+1}$, the $j$th element becomes
$$\hat{w}_j = \begin{cases} q_{n+1}, & \text{with probability } (w_j - q_n)/\Delta, \\ q_n, & \text{with probability } (q_{n+1} - w_j)/\Delta, \end{cases}$$
so that the quantization is unbiased. For Rand-$M$ sparsification, $M$ elements of $\mathbf{s}$ are randomly selected and set to one, and the rest are set to zero. The sparsity pattern $\mathbf{s}$ can be fully specified by a random seed, and therefore only the nonzero model parameters need to be sent after sparsification [4].
It is worth noting that $b$ has a limited range for adjustment: in most cases, 8-bit quantization is sufficient to avoid performance degradation, whereas 1- to 2-bit quantization may degrade performance significantly [4]. As a result, rather than adjusting both $\rho$ and $b$ at the same time, we only adjust $\rho$ with a given value of $b$.
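The sketched-update compression above can be illustrated as follows. This is a minimal sketch of unbiased probabilistic quantization followed by seed-synchronized Rand-M sparsification; the array sizes and seeds are arbitrary choices for the example:

```python
import numpy as np

def prob_quantize(w, b=8, rng=None):
    """b-bit probabilistic quantization onto a uniform grid over [w_min, w_max].
    Each element rounds up with probability proportional to its distance from
    the lower grid point, which makes the quantizer unbiased."""
    if rng is None:
        rng = np.random.default_rng()
    w_min, w_max = w.min(), w.max()
    levels = 2**b - 1
    delta = (w_max - w_min) / levels
    n = np.floor((w - w_min) / delta).clip(0, levels - 1)  # lower grid index
    q_low = w_min + n * delta
    p_up = (w - q_low) / delta
    return q_low + delta * (rng.random(w.shape) < p_up)

def rand_m_sparsify(w, rho, seed):
    """Rand-M sparsification: keep a random fraction rho of the entries.
    The mask is generated from a shared seed, so only the kept values
    need to be transmitted."""
    mask_rng = np.random.default_rng(seed)
    mask = np.zeros(w.size, dtype=bool)
    m = int(rho * w.size)
    mask[mask_rng.choice(w.size, size=m, replace=False)] = True
    return w * mask.reshape(w.shape)

rng = np.random.default_rng(1)
w = rng.normal(size=1000)
w_q = prob_quantize(w, b=8, rng=rng)         # quantization only
w_qs = rand_m_sparsify(w_q, rho=0.25, seed=42)  # then sparsification
```

With the seed shared between client and server, the receiver can regenerate the mask locally and only the surviving quantized values travel over the air.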
In the $r$th round, the BS first compresses the global model and then broadcasts the compressed global model $\tilde{\mathbf{w}}^{(r)} = \Phi(\mathbf{w}^{(r)}, \rho^{(r)}, b_{\mathrm{down}})$, where $\mathbf{w}^{(r)}$ is the global model before compression, $\rho^{(r)}$ is the sparsity ratio of the global model, and $b_{\mathrm{down}}$ is the number of bits for downlink quantization. The downlink communication overhead in the $r$th round is
$$C^{(r)} = \rho^{(r)} b_{\mathrm{down}} N_{\mathrm{model}}.$$
We use FedAvg to reduce the total number of communication rounds. For client $k$, the local model vector at the $i$th iteration in the $r$th round can be obtained as
$$\mathbf{w}_k^{(r,i)} = \mathbf{w}_k^{(r,i-1)} - \mu \nabla F_k\big(\mathbf{w}_k^{(r,i-1)}\big), \quad i = 1, \cdots, N_{\mathrm{iter}},$$
where $\mu$ is the learning rate, $N_{\mathrm{iter}}$ is the number of local iterations, $\mathcal{K}^{(r)} \subseteq \mathcal{K}$ is the set of participating clients in the $r$th round, and $\mathbf{w}_k^{(r,0)} = \tilde{\mathbf{w}}^{(r)}$. After $N_{\mathrm{iter}}$ iterations, the local model update becomes
$$\nabla \mathbf{w}_k^{(r)} = \mathbf{w}_k^{(r, N_{\mathrm{iter}})} - \tilde{\mathbf{w}}^{(r)}.$$
The compressed model update can be expressed as
$$\nabla \tilde{\mathbf{w}}_k^{(r)} = \Phi\big(\nabla \mathbf{w}_k^{(r)}, \rho_k^{(r)}, b_{\mathrm{up}}\big) = \mathbf{s}_k \odot \nabla \hat{\mathbf{w}}_k^{(r)},$$
where $\rho_k^{(r)}$ is the sparsity ratio of client $k$, $b_{\mathrm{up}}$ is the number of bits for uplink quantization, $\nabla \hat{\mathbf{w}}_k^{(r)}$ is the local model update with quantization only, and $\mathbf{s}_k$ is the random sparsity pattern for client $k$.
The uplink communication cost required by client $k$ is
$$C_k^{(r)} = \rho_k^{(r)} b_{\mathrm{up}} N_{\mathrm{model}}.$$
The BS aggregates the model updates of all clients as
$$\nabla \mathbf{w}^{(r)} = \frac{1}{|\mathcal{K}^{(r)}|} \sum_{k \in \mathcal{K}^{(r)}} \nabla \tilde{\mathbf{w}}_k^{(r)}$$
to update the global model in the next round as
$$\mathbf{w}^{(r+1)} = \tilde{\mathbf{w}}^{(r)} + \nabla \mathbf{w}^{(r)}.$$
The global model converges when the following condition is satisfied:
$$L\big(\mathbf{w}^{(r)}\big) \le (1 + \varepsilon) L(\mathbf{w}^*),$$
where $\mathbf{w}^*$ is the optimal model vector of FL without model compression, and $\varepsilon$ is the acceptable relative error to $\mathbf{w}^*$. Therefore, the number of rounds required for FL is
$$N_{\mathrm{round}} = \min\big\{r \,\big|\, L\big(\mathbf{w}^{(r)}\big) \le L_\varepsilon\big\},$$
where $L_\varepsilon = (1 + \varepsilon) L(\mathbf{w}^*)$ is the performance required for model convergence. We divide the round duration into multiple frames, each with a duration of $\Delta T$ (on the second scale), and assume that the average channel gain (i.e., the large-scale channel gain including pathloss and shadowing) remains constant in each frame but may vary among frames. The downlink and uplink transmissions occupy $N_{\mathrm{down}}^{(r)} = \lceil T_{\mathrm{down}}^{(r)}/\Delta T \rceil$ and $N_{\mathrm{up}}^{(r)} = \lceil T_{\mathrm{up}}^{(r)}/\Delta T \rceil$ frames, respectively. In the downlink transmission, the data rate per unit bandwidth in the $t$th frame is
$$R_k^{(r,t)} = \log_2\Big(1 + \frac{P_{\mathrm{BS}} h_k^{(r,t)}}{N_0}\Big),$$
where $P_{\mathrm{BS}}$ is the transmit power of the BS per unit bandwidth, $h_k^{(r,t)}$ is the average channel gain from the BS to client $k$, and $N_0$ is the noise power spectral density.
Given the total available bandwidth $B_{\mathrm{sum}}$, the total number of downlink bits that can be transmitted is $\sum_t B_{\mathrm{sum}} R^{(r,t)} \Delta T$. Then, the maximal downlink sparsity ratio is
$$\rho^{(r)} = \min\Big\{\frac{\sum_t B_{\mathrm{sum}} R^{(r,t)} \Delta T}{b_{\mathrm{down}} N_{\mathrm{model}}}, 1\Big\}.$$
In the uplink transmission, clients use OFDMA to upload their model updates and avoid interference among clients. Each client is allocated non-overlapping bandwidth, as shown in Fig. 1. For client $k$, the data rate per unit bandwidth in the $t$th frame is
$$R_k^{(r,t)} = \log_2\Big(1 + \frac{P_{\mathrm{UE}} h_k^{(r,t)}}{N_0}\Big),$$
where $P_{\mathrm{UE}}$ is the transmit power of each client per unit bandwidth. Given the allocated bandwidth $B_k^{(r,t)}$ and the uplink transmission time $T_{\mathrm{up}}^{(r)}$, the total number of bits transmitted by client $k$ is $\sum_t B_k^{(r,t)} R_k^{(r,t)} \Delta T$. Then, the maximal uplink sparsity ratio of client $k$ is
$$\rho_k^{(r)} = \min\Big\{\frac{\sum_t B_k^{(r,t)} R_k^{(r,t)} \Delta T}{b_{\mathrm{up}} N_{\mathrm{model}}}, 1\Big\}.$$
Since the total communication time reflects the communication resources consumed by FL, we minimize the total communication time instead of the total training time, i.e., $\sum_{r=1}^{N_{\mathrm{round}}} T^{(r)}$.
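The maximal uplink sparsity ratio above can be computed as in the following sketch. All numbers in the usage example (bandwidths, gains, powers, model size) are hypothetical:

```python
import math

def max_uplink_sparsity(bandwidths_hz, gains, p_ue, n0, dt, b_up, n_model):
    """Maximal uplink sparsity ratio of one client over the frames of a round.
    bandwidths_hz[t] and gains[t] are the allocated bandwidth and channel
    gain in frame t; the per-unit-bandwidth rate follows the Shannon formula."""
    bits = 0.0
    for bw, h in zip(bandwidths_hz, gains):
        rate = math.log2(1.0 + p_ue * h / n0)  # bits/s/Hz
        bits += bw * rate * dt                 # bits sent in this frame
    return min(bits / (b_up * n_model), 1.0)   # cap at 1 (full update sent)

# hypothetical numbers: 1 MHz for 3 frames of 10 ms, 8-bit uplink
# quantization, a 10^6-parameter model
rho = max_uplink_sparsity(
    bandwidths_hz=[1e6, 1e6, 1e6],
    gains=[1e-7, 2e-7, 1.5e-7],
    p_ue=0.1, n0=1e-9, dt=0.01, b_up=8, n_model=1e6)
```

The `min(..., 1.0)` cap mirrors the constraint that a client never needs to send more than its full quantized update.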

III. FLEXIBLE MODEL COMPRESSION AND RESOURCE ALLOCATION
In this section, we introduce a flexible model compression and resource allocation scheme. We first formulate the optimization problem and decompose it into two subproblems, i.e., the BA optimization with given uplink transmission time and the UTTA optimization with the optimal BA. Then, we learn to optimize BA and UTTA for adapting to time-varying mobile networks. Finally, we summarize the proposed method.

A. PROBLEM FORMULATION
It is well understood that increasing the downlink and uplink sparsity ratios $\rho^{(r)}$ and $\rho_k^{(r)}$ can reduce the per-round duration $T^{(r)}$ but increase the number of communication rounds $N_{\mathrm{round}}$. To minimize the total communication time, the sparsity ratios must be optimized to balance $T^{(r)}$ and $N_{\mathrm{round}}$. Because the sparsity ratios are determined by the allocated bandwidth and transmission time through (13) and (15), we need to jointly optimize the intra-round BA and inter-round UTTA. The optimization problem is formulated as
$$\mathcal{P}_1: \min_{\{T_{\mathrm{up}}^{(r)}\},\, \{B_k^{(r,t)}\}} \sum_{r=1}^{N_{\mathrm{round}}} T^{(r)} \quad \text{s.t.} \quad \sum_{k \in \mathcal{K}^{(r)}} B_k^{(r,t)} \le B_{\mathrm{sum}}, \;\; B_k^{(r,t)} \ge 0.$$
Problem $\mathcal{P}_1$ is hard to solve for the following reasons: 1) the parameter $N_{\mathrm{round}}$ is unknown in dynamic scenarios, 2) the channel information $h_k^{(r,t)}$, $r = 1, \cdots, N_{\mathrm{round}}$, is unavailable at the start of FL, and 3) jointly optimizing the variables of all rounds incurs an unaffordable computational complexity for practical systems. To solve these problems, we decompose the multi-round joint optimization problem into multiple single-round optimization problems. To achieve performance close to that of the joint optimization, we find an appropriate objective function that balances $T^{(r)}$ and $N_{\mathrm{round}}$ within one round.
In Fig. 2, we illustrate the relationship between the loss function and the total communication time of FL. When the global model converges, the cumulative loss decrement equals $\sum_{r=1}^{N_{\mathrm{round}}} \Delta L^{(r)} = L\big(\mathbf{w}^{(1)}\big) - L_\varepsilon$, where $\Delta L^{(r)} = L\big(\mathbf{w}^{(r)}\big) - L\big(\mathbf{w}^{(r+1)}\big)$ is the loss decrement in the $r$th round.
Problem $\mathcal{P}_1$ aims to minimize $\sum_{r=1}^{N_{\mathrm{round}}} T^{(r)}$, which is equivalent to maximizing the convergence speed of FL. We can express the convergence speed as
$$S = \frac{\sum_{r=1}^{N_{\mathrm{round}}} \Delta L^{(r)}}{\sum_{r=1}^{N_{\mathrm{round}}} T^{(r)}},$$
which can be further expressed as the weighted sum
$$S = \sum_{r=1}^{N_{\mathrm{round}}} \beta^{(r)} S^{(r)},$$
where $S^{(r)} = \Delta L^{(r)}/T^{(r)}$ is the convergence speed in the $r$th round, $\beta^{(r)} = T^{(r)}/\sum_{r'=1}^{N_{\mathrm{round}}} T^{(r')}$ is the proportion of the $r$th round's transmission time in the total transmission time, $S_{\mathrm{past}}^{(r)}$ is the weighted convergence speed over the previous $r-1$ rounds, and $S_{\mathrm{future}}^{(r)}$ is that over the remaining rounds. As shown in (19), the total convergence speed is the weighted sum of the past, present, and future convergence speeds. However, in the $r$th round, the past speed $S_{\mathrm{past}}^{(r)}$ cannot be changed, and the future speed $S_{\mathrm{future}}^{(r)}$ is unknown due to the unavailable channel information. Therefore, we can only maximize the present convergence speed $S^{(r)}$. It is important to note that $S^{(r)}$ is a random variable affected by random sparsification. Thus, the objective function is chosen as the statistically averaged convergence speed $\mathbb{E}_{\mathbf{s}}[S^{(r)}]$, where $\mathbb{E}_{\mathbf{s}}$ denotes the expectation over the randomness of the sparsity patterns. Compared with the model update vector $\nabla \mathbf{w}_k$, the model vector $\mathbf{w}_k$ has many non-zero elements and is sensitive to sparsification. Moreover, the BS has more communication resources (e.g., more bandwidth and higher transmit power) to transmit the global model than the users, so downlink model sparsification cannot save much communication time. Therefore, we set $\rho^{(r)} = 1$ and allocate downlink communication time to satisfy $C^{(r)} \ge b_{\mathrm{down}} N_{\mathrm{model}}$. Then, we only need to maximize $\mathbb{E}_{\mathbf{s}}[S^{(r)}]$ by optimizing $T_{\mathrm{up}}^{(r)}$ and $\{B_k^{(r,t)}\}$, which yields the single-round optimization problem $\mathcal{P}_2$. Unlike $\mathcal{P}_1$, Problem $\mathcal{P}_2$ does not depend on $N_{\mathrm{round}}$ and only requires the channel information of the current round. Moreover, only the variables of a single round need to be optimized, which reduces the computational complexity dramatically.
Because $T_{\mathrm{up}}^{(r)}$ and $\{B_k^{(r,t)}\}$ have different time scales and are hard to optimize simultaneously, we further decompose $\mathcal{P}_2$ into two sub-problems $\mathcal{P}_3$ and $\mathcal{P}_4$. Specifically, $\mathcal{P}_3$ optimizes the intra-round BA to maximize $\mathbb{E}_{\mathbf{s}}[\Delta L^{(r)}]$ with given $T_{\mathrm{up}}^{(r)}$, and then $\mathcal{P}_4$ optimizes the inter-round UTTA to maximize $\mathbb{E}_{\mathbf{s}}[S^{(r)}]$ with the optimal BA, as detailed in the following subsections.

B. BA OPTIMIZATION

1) PROBLEM FORMULATION AND OPTIMAL BA
Since the allocated bandwidth determines the sparsity ratio $\rho_k^{(r)}$ of each client, we first investigate the impact of $\rho_k^{(r)}$ on the average loss decrement in each round, $\mathbb{E}_{\mathbf{s}}[\Delta L^{(r)}]$.
From (22), the average loss decrement can be decomposed into the loss decrement without sparsification, which is not affected by the sparsification method, minus the performance loss caused by sparsification, which is always larger than or equal to zero. This indicates that to maximize $\mathbb{E}_{\mathbf{s}}[\Delta L^{(r)}]$, we need to minimize the performance loss caused by sparsification. In FL, the learning model is usually nonlinear, such as a DNN, which makes it challenging to derive a closed-form expression for $\mathbb{E}_{\mathbf{s}}[\Delta L^{(r)}]$. To address this, we bound the performance loss from above by using Lipschitz continuity. Specifically, when the loss function is Lipschitz continuous [18], the performance loss is bounded by $\alpha_p \|\mathbf{e}^{(r)}\|_p$, where $\mathbf{e}^{(r)}$ is the model error caused by random sparsification, $\alpha_p$ is the Lipschitz constant, $\|\cdot\|_p$ is the $L_p$-norm of a vector, and $p$ is a positive integer.
This suggests that the performance loss can be reduced by minimizing the model error introduced by sparsification. In the following, we further analyze the impact of $\rho_k^{(r)}$ on the model error.
In (24), Lipschitz continuity is defined using the L p -norm, where the commonly used norms include the L 1 -norm and the L 2 -norm. The L 1 -norm is the sum of the absolute values of the elements in a vector, while the L 2 -norm is the square root of the sum of squared elements of the vector. Compared to the L 2 -norm, the L 1 -norm is more suitable for capturing the sparsity of a vector, which allows us to conveniently analyze the impact of random sparsification on the model error. Therefore, we consider the L 1 -norm.
Proposition 1: When the loss function is Lipschitz continuous with the $L_1$-norm, and the $L_1$-norms of the model updates of all clients are bounded, the model error $\mathbf{e}^{(r)}$ caused by random sparsification satisfies
$$\mathbb{E}_{\mathbf{s}}\big[\big\|\mathbf{e}^{(r)}\big\|_1\big] \le \big(1 - \bar{\rho}^{(r)}\big) \max_{k \in \mathcal{K}^{(r)}} \big\|\nabla \mathbf{w}_k^{(r)}\big\|_1,$$
and the performance loss is upper bounded by
$$\alpha_1^{(r)} \big(1 - \bar{\rho}^{(r)}\big) \max_{k \in \mathcal{K}^{(r)}} \big\|\nabla \mathbf{w}_k^{(r)}\big\|_1,$$
where $\alpha_1^{(r)}$ is the local Lipschitz constant in the $r$th round with the $L_1$-norm and $\bar{\rho}^{(r)} = \frac{1}{|\mathcal{K}^{(r)}|}\sum_{k \in \mathcal{K}^{(r)}} \rho_k^{(r)}$ is the average sparsity ratio. The tightness of the upper bound in (25) depends on the differences among the norms of the client model updates, which we measure by the ratio of the minimal and maximal $L_1$-norms, $\lambda^{(r)} = \min_{k} \|\nabla \mathbf{w}_k^{(r)}\|_1 / \max_{k} \|\nabla \mathbf{w}_k^{(r)}\|_1$. When $\lambda^{(r)}$ is close to one, the upper bound is tight. We will evaluate the values of $\lambda^{(r)}$ in the forthcoming simulation to show the tightness of this upper bound.
The tightness of the upper bound in (26) also depends on the value of the Lipschitz constant: a small Lipschitz constant tightens the bound. To achieve a tighter upper bound, we introduce the local Lipschitz constant of each round, $\alpha_1^{(r)}$, instead of the global Lipschitz constant in (24). This Lipschitz constant also reflects the impact of model sparsification on the learning performance: when sparsification leads to a significant performance degradation, the Lipschitz constant tends to be large; otherwise, it is small. The analysis and simulation results in [4] and [8] show that, to balance the number of rounds and the per-round latency, the chosen sparsity ratios generally do not cause a significant performance loss. Therefore, when we optimize the sparsity ratio to maximize the convergence speed, it is reasonable to assume that the Lipschitz constant is small and the upper bound is tight.
Minimizing this upper bound of the performance loss is a conservative design, which ensures that the performance loss always stays within the bound.
Proposition 1 indicates that one can maximize $\mathbb{E}_{\mathbf{s}}[\Delta L^{(r)}]$ by maximizing the average sparsity ratio $\bar{\rho}^{(r)}$, which depends on the allocated bandwidth in all frames. To maximize $\bar{\rho}^{(r)}$ by optimizing the BA over multiple frames, the BS would need to know the channel information of all frames at the start of each round. However, the future channels of mobile clients are unknown. Consequently, we maximize the cumulative sparsity ratio up to the current frame, $\bar{\rho}^{(r,t)} = \frac{1}{|\mathcal{K}^{(r)}|}\sum_{k \in \mathcal{K}^{(r)}} \rho_k^{(r,t)}$, where $\rho_k^{(r,t)}$ is the cumulative sparsity ratio of client $k$ from the first frame to the $t$th frame. From (15), we have
$$\rho_k^{(r,t)} = \min\big\{\rho_k^{(r,t-1)} + \Delta\rho_k^{(r,t)}, 1\big\}, \qquad \Delta\rho_k^{(r,t)} = \frac{B_k^{(r,t)} R_k^{(r,t)} \Delta T}{b_{\mathrm{up}} N_{\mathrm{model}}},$$
where $\Delta\rho_k^{(r,t)}$ is the increment of the sparsity ratio in the $t$th frame, and $\rho_k^{(r,t-1)}$ is the cumulative sparsity ratio over the past frames, which is independent of the current BA.
We remove the minimization operation in (27) because it makes $\rho_k^{(r,t)}$ hard to analyze and optimize. To meet $\rho_k^{(r,t)} \le 1$, the clients who already have enough bandwidth to upload all of their model updates are allocated no more bandwidth. As a result, the allocated bandwidth for client $k$ should satisfy
$$B_k^{(r,t)} \le \frac{\big(1 - \rho_k^{(r,t-1)}\big) b_{\mathrm{up}} N_{\mathrm{model}}}{R_k^{(r,t)} \Delta T}, \quad k \in \mathcal{K}^{(r,t)},$$
where $\mathcal{K}^{(r,t)} = \{k \mid \rho_k^{(r,t-1)} < 1, k \in \mathcal{K}^{(r)}\}$ is the set of clients who still have data to transmit in the $t$th frame. When (28) holds, the cumulative sparsity ratio of client $k$ becomes $\rho_k^{(r,t)} = \rho_k^{(r,t-1)} + \Delta\rho_k^{(r,t)}$, and we can select the sum of the sparsity-ratio increments as the objective to maximize $\bar{\rho}^{(r,t)}$. The BA optimization problem can then be formulated as
$$\mathcal{P}_3: \max_{\{B_k^{(r,t)}\}} \sum_{k \in \mathcal{K}^{(r,t)}} B_k^{(r,t)} R_k^{(r,t)} \quad \text{s.t.} \quad (28), \;\; \sum_{k \in \mathcal{K}^{(r,t)}} B_k^{(r,t)} \le B_{\mathrm{sum}}, \;\; B_k^{(r,t)} \ge 0.$$
Since Problem $\mathcal{P}_3$ is a linear program with linear constraints, a simplex algorithm can obtain the optimal solution. From (30a), the optimal solution always allocates the bandwidth to the client with the largest channel gain. If there is remaining bandwidth after this allocation, it is allocated to the client with the best channel quality among the remaining clients, and this procedure repeats until all bandwidth is allocated.
When the bandwidth of each frame is allocated to only one client, the closed-form expression of the optimal BA is
$$B_k^{(r,t)*} = \begin{cases} B_{\mathrm{sum}}, & k = \arg\max_{k' \in \mathcal{K}^{(r,t)}} h_{k'}^{(r,t)}, \\ 0, & \text{otherwise}. \end{cases}$$
Hence, we call it the maximal-gain BA (Max-BA).
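The greedy allocation described above can be sketched as follows. The per-frame rate model and all numbers in the usage example are assumptions for illustration:

```python
def max_ba(b_sum, gains, residual_bits, rate_of, dt):
    """Greedy Max-BA for one frame: give bandwidth to the client with the
    largest channel gain first, capped so no client receives more than it
    needs to finish its upload (constraint (28)); leftover bandwidth spills
    to the next-best client. residual_bits[k] is what client k still owes."""
    alloc = {k: 0.0 for k in gains}
    remaining = b_sum
    for k in sorted(gains, key=gains.get, reverse=True):
        if remaining <= 0:
            break
        rate = rate_of(gains[k])                # bits/s/Hz for client k
        need = residual_bits[k] / (rate * dt)   # bandwidth to finish this frame
        alloc[k] = min(remaining, need)
        remaining -= alloc[k]
    return alloc

# toy usage: a linear rate model and hypothetical residual bits
alloc = max_ba(b_sum=10.0,
               gains={1: 2.0, 2: 5.0, 3: 1.0},
               residual_bits={1: 100.0, 2: 30.0, 3: 100.0},
               rate_of=lambda h: h, dt=1.0)
```

Client 2 (best gain) finishes with 6 units of bandwidth, the remaining 4 spill to client 1, and client 3 gets nothing in this frame.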

2) PERFORMANCE ANALYSIS
In the following, we analyze the performance of Max-BA.
To show the performance gain of the optimal BA, we consider two classical allocation methods as baselines.
• Equal BA (Equ-BA): the bandwidth is equally allocated to each client, i.e., $B_k^{(r,t)} = B_{\mathrm{sum}}/|\mathcal{K}^{(r,t)}|$, which does not require the channel information of any client.
• Rate-inversion BA (Inv-BA): to ensure that all clients have the same sparsity ratio, the allocated bandwidth is inversely proportional to the data rate, i.e.,
$$B_k^{(r,t)} = \frac{B_{\mathrm{sum}} / R_k^{(r,t)}}{\sum_{k' \in \mathcal{K}^{(r,t)}} 1/R_{k'}^{(r,t)}}.$$
To provide the closed-form expression of $\bar{\rho}^{(r)}$, we focus on the case where all clients have balanced sparsity ratios, i.e., $0 < \rho_k^{(r)} < 1$, $\forall k \in \mathcal{K}^{(r)}$. Then, $\mathcal{K}^{(r,t)}$ remains unchanged among frames, i.e., $\mathcal{K}^{(r,t)} = \mathcal{K}^{(r)}$. Proposition 2: When the sparsity ratios of all clients satisfy $0 < \rho_k^{(r)} < 1$, $\forall k \in \mathcal{K}^{(r)}$, the average sparsity ratios with the Max-BA, Equ-BA, and Inv-BA methods can all be expressed through a generalized mean of the clients' data rates, where $p$ is the order of the mean and its value is listed in Table 1.
Proof: See Appendix V-B. According to the monotonicity of the generalized mean, we can compare the performance of the different BA methods as follows.
Corollary 1: $\bar{\rho}^{(r)}_{\mathrm{Max\text{-}BA}} \ge \bar{\rho}^{(r)}_{\mathrm{Equ\text{-}BA}} \ge \bar{\rho}^{(r)}_{\mathrm{Inv\text{-}BA}}$; moreover, as the disparity of the channel gains among clients grows, the gap over $\bar{\rho}^{(r)}_{\mathrm{Inv\text{-}BA}}$ also increases. According to Corollary 1, Max-BA always achieves the maximal average sparsity ratio, Equ-BA is in the middle, and Inv-BA achieves the minimum.
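The ordering in Corollary 1 follows from the monotonicity of the generalized (power) mean in its order. The sketch below illustrates it with toy rates, assuming that Max-BA, Equ-BA, and Inv-BA correspond to the maximum, the arithmetic mean, and the harmonic mean of the clients' rates, respectively (the exact orders are those listed in Table 1):

```python
import numpy as np

def gen_mean(x, p):
    """Generalized (power) mean of order p; p -> inf gives the maximum,
    p = 1 the arithmetic mean, and p = -1 the harmonic mean."""
    x = np.asarray(x, dtype=float)
    if np.isinf(p):
        return x.max()
    return float(np.mean(x**p)) ** (1.0 / p)

rates = np.array([1.0, 2.0, 8.0])      # toy per-unit-bandwidth data rates
m_max = gen_mean(rates, np.inf)         # Max-BA  -> maximum
m_equ = gen_mean(rates, 1)              # Equ-BA  -> arithmetic mean
m_inv = gen_mean(rates, -1)             # Inv-BA  -> harmonic mean
```

Since the power mean is non-decreasing in its order, the three values are ordered the same way as the average sparsity ratios in Corollary 1, and the gap widens as the rates become more disparate.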
Since Inv-BA in (33) is the optimal BA for maximizing the minimal sparsity ratio $\rho_{\min}^{(r)} = \min_{k \in \mathcal{K}^{(r)}} \rho_k^{(r)}$, it always allocates more bandwidth to the clients with the lowest channel gains. However, Proposition 1 indicates that the convergence speed depends on $\bar{\rho}^{(r)}$ rather than $\rho_{\min}^{(r)}$. Therefore, it is not necessary to force all clients to use the same sparsity ratio. If the clients are allowed to have different sparsity ratios, better BA methods can be chosen to improve resource usage efficiency.
It is worth noting that the Max-BA method always allocates all resources to the client with the largest channel gain, resulting in imbalanced sparsity ratios among the clients: some clients have higher sparsity ratios, while others have lower ones. Due to this imbalance, the global model may be learned mainly from the local datasets of a few clients, which degrades the learning performance of FL. However, in mobile networks, the channel gain varies among frames, so the client with the largest channel gain may change across rounds. When the channel conditions of different clients fluctuate significantly, the client with the largest channel gain differs from round to round, giving every client an opportunity to be allocated bandwidth. When the client with the largest channel gain remains the same in every round, we can instead use the suboptimal Equ-BA method to avoid the imbalance. Therefore, we consider both the Max-BA and Equ-BA methods when optimizing the inter-round UTTA in the next subsection.

C. UTTA OPTIMIZATION
Given the BA method, Problem $\mathcal{P}_2$ can be simplified to $\mathcal{P}_4$. In Problem $\mathcal{P}_4$, the convergence speed $J$ depends on the optimization variable $T_{\mathrm{up}}^{(r)}$. Although exhaustive search can yield the optimal solution, solving $\mathcal{P}_4$ at the beginning of each round leads to unacceptable computational complexity and decision-making latency. To cope with these issues, we use a learning-based policy network to obtain $T_{\mathrm{up}}^{(r)*}$. The input-output relation of the policy network is expressed as
$$\hat{T}_{\mathrm{up}}^{(r)*} = \pi\big(\mathbf{x}^{(r)}; \boldsymbol{\theta}_p\big),$$
where $\hat{T}_{\mathrm{up}}^{(r)*}$ is the learned optimal uplink transmission time from the policy network, and $\boldsymbol{\theta}_p$ contains the model parameters of the policy network.
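The exhaustive-search baseline mentioned above can be sketched as follows. The toy convergence-speed model is an assumption for illustration, not the paper's objective; it simply encodes the trade-off that a longer uplink time raises the sparsity ratio but also lengthens the round:

```python
def best_uplink_time(candidates, speed_of):
    """Exhaustive-search baseline for P4: evaluate the (estimated)
    convergence speed for every candidate uplink transmission time and
    return the maximizer."""
    return max(candidates, key=speed_of)

def toy_speed(t_up, t_fixed=0.5, k=1.0):
    """Hypothetical speed model: the sparsity ratio grows with T_up until the
    full update fits, while the round duration grows with T_up throughout."""
    rho = min(k * t_up, 1.0)          # loss decrement proxy
    return rho / (t_fixed + t_up)     # loss decrement per unit time

t_star = best_uplink_time([0.1 * i for i in range(1, 31)], toy_speed)
```

Under this toy model the speed rises until the full update fits (at `t_up = 1.0`) and decays afterwards, so the grid search lands on that knee point; the learned policy network is meant to replace exactly this per-round search.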
To find a UTTA policy in dynamic environments, $\mathbf{x}^{(r)}$ should contain the parameters that change in each round, such as the available bandwidth, the set of clients participating in FL, and the channel gains of all clients. Hence, the input vector is
$$\mathbf{x}^{(r)} = \big[B_{\mathrm{sum}}^{(r)}, |\mathcal{K}^{(r)}|, \{h_{\mathrm{dB},k}^{(r,t)}\}_{k \in \mathcal{K}^{(r)},\, t = 1, \cdots, N_{\mathrm{up}}^{(r)}}\big]^{\mathrm{T}}.$$
When the policy network is obtained by deep learning, the following difficulties arise.
1) If supervised learning is used to learn the policy network, one has to use exhaustive search to find the optimal solution for generating the training samples $(\mathbf{x}^{(r)}, T_{\mathrm{up}}^{(r)*})$, which causes high computational complexity in generating labels.

2) The dimension of the input vector in (38) is $2 + |\mathcal{K}^{(r)}| N_{\mathrm{up}}^{(r)}$, which depends not only on the varying number of clients $|\mathcal{K}^{(r)}|$ but also on the unknown number of uplink frames $N_{\mathrm{up}}^{(r)}$. Moreover, the channel information from the second to the last frame is unknown at the start of the round. In the following, we address these issues by resorting to the model-free unsupervised learning (MFUL) proposed in [19].

1) MODEL-FREE UNSUPERVISED LEARNING
The basic idea of MFUL, as shown in Fig. 3, is to introduce a value network that first estimates the objective function $\hat{J}$, and then to use a policy network to obtain the optimal variable that maximizes $\hat{J}$. The input-output relationship of the value network is denoted as
$$\hat{J} = V\big(\mathbf{x}^{(r)}, T_{\mathrm{up}}^{(r)}; \boldsymbol{\theta}_v\big),$$
where $\boldsymbol{\theta}_v$ is the model vector of the value network. Here, ''model-free'' indicates that the objective function $J = J\big(\mathbf{x}^{(r)}, T_{\mathrm{up}}^{(r)}\big)$ is unknown and is estimated from the training data.
In $\mathcal{P}_4$, the constraint in (36b) is easy to satisfy by setting the activation function of the output layer to ReLU and introducing a rounding-down operation. We only need to optimize the value network to minimize the estimation error of the objective $\hat{J}$, and then optimize the policy network to maximize $\hat{J}$. Consequently, using MFUL to solve $\mathcal{P}_4$ amounts to solving the following two optimization problems:
$$\min_{\boldsymbol{\theta}_v} \mathbb{E}\Big[\big(V\big(\mathbf{x}^{(r)}, T_{\mathrm{up}}^{(r)}; \boldsymbol{\theta}_v\big) - J\big)^2\Big] \quad \text{and} \quad \max_{\boldsymbol{\theta}_p} V\big(\mathbf{x}^{(r)}, \pi\big(\mathbf{x}^{(r)}; \boldsymbol{\theta}_p\big); \boldsymbol{\theta}_v\big).$$
According to (38) and (39), the current UTTA decision $T_{\mathrm{up}}^{(r)}$ does not affect the future input vector $\mathbf{x}^{(r+1)}$ or the future objective $\mathbb{E}_{\mathbf{s}}[S^{(r+1)}]$. This indicates that Problem $\mathcal{P}_5$ is a typical non-Markov decision process (non-MDP) problem. As MFUL can be viewed as a simplified version of reinforcement learning (RL) tailored to non-MDP problems [19], we employ MFUL instead of RL.
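The two-step idea of MFUL, estimate the objective from data and then maximize the estimate, can be illustrated in miniature. Here, as simplifying assumptions, the value network is replaced by a least-squares quadratic model and the policy by the closed-form maximizer of that model:

```python
import numpy as np

def true_speed(t):
    """Hypothetical environment: the learner never sees this formula,
    only noisy (T, J) observations of it (model-free)."""
    return -((t - 1.2)**2) + 2.0

rng = np.random.default_rng(0)
T = rng.uniform(0.0, 3.0, size=200)
J = true_speed(T) + 0.05 * rng.normal(size=200)   # observed convergence speeds

# "value network": fit J_hat(T) = a*T^2 + b*T + c by least squares
A = np.stack([T**2, T, np.ones_like(T)], axis=1)
a, b, c = np.linalg.lstsq(A, J, rcond=None)[0]

# "policy": output the T maximizing the fitted model (vertex of the parabola)
t_policy = -b / (2 * a)
```

The fitted curvature is negative and the policy recovers a near-optimal uplink time without ever evaluating the true objective in closed form, which is the essence of the value-then-policy structure; the actual scheme uses DNNs for both networks and gradient updates instead of a closed-form maximizer.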

2) MODEL INPUT SIMPLIFICATION
In order to keep the input dimension unchanged, we simplify the input of the policy network. According to Proposition 2, we can approximate the average sparsity ratio as follows.
Corollary 2: When the SNR is high enough such that $R_k^{(r,t)} \approx \log_2\big(P_{\mathrm{UE}} h_k^{(r,t)}/N_0\big)$, the average sparsity ratio can be approximated as a function of the average logarithmic channel gain, where $h_{\mathrm{dB},k}^{(r,t)}$ is the logarithmic channel gain in the $t$th frame. According to Corollary 2, at high SNR, the objective function $\mathbb{E}_{\mathbf{s}}[S^{(r)}]$ only depends on the average logarithmic channel gain of all clients, $\bar{h}_{\mathrm{dB}}^{(r)}$, and the number of clients $|\mathcal{K}^{(r)}|$. Then, the input vector becomes
$$\mathbf{x}^{(r)} = \big[B_{\mathrm{sum}}^{(r)}, |\mathcal{K}^{(r)}|, \bar{h}_{\mathrm{dB}}^{(r)}\big]^{\mathrm{T}}.$$
Consequently, the dimension of the input vector becomes a constant $N_x = 3$. Moreover, since $N_x$ in (43) is much smaller than that in (38), the numbers of model parameters of the value and policy networks decrease, which reduces the training complexity.
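The high-SNR approximation in Corollary 2 can be checked numerically; at a moderate 20 dB SNR (a hypothetical operating point), the error from dropping the "+1" in the Shannon formula is already small:

```python
import math

def exact_rate(snr):
    """Shannon rate per unit bandwidth."""
    return math.log2(1.0 + snr)

def high_snr_rate(snr):
    """High-SNR approximation: log2(SNR), linear in the gain in dB."""
    return math.log2(snr)

snr = 100.0                                  # 20 dB
err = exact_rate(snr) - high_snr_rate(snr)   # absolute rate error, bits/s/Hz
```

The absolute error equals log2(1 + 1/SNR), so it shrinks roughly as 1/SNR; this is what justifies replacing the per-client rates with a function of the average logarithmic channel gain.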

3) END-TO-END (E2E) LEARNING
In (43), $\bar{h}_{\mathrm{dB}}^{(r)}$ also depends on the future channel information, which is unknown to the BS at the start of each round. Fortunately, the average channel gain over multiple frames and multiple clients, $\bar{h}_{\mathrm{dB}}^{(r)}$, is highly correlated with the historical channel gain of the previous round, $\bar{h}_{\mathrm{dB}}^{(r-1)}$, which can therefore serve as its prediction. To enable E2E learning, the set of training samples is $\mathcal{D}_{\mathrm{MFUL}} = \big\{\big(\mathbf{x}_{\mathrm{Mix}}^{(r)}, T_{\mathrm{up}}^{(r)}, J\big)\big\}$, where $\mathbf{x}_{\mathrm{Mix}}^{(r)}$ is the input vector with $\bar{h}_{\mathrm{dB}}^{(r)}$ replaced by $\bar{h}_{\mathrm{dB}}^{(r-1)}$, and $J$ is the convergence speed associated with the predictive channel information $\bar{h}_{\mathrm{dB}}^{(r-1)}$. In this way, the networks learn to compensate for the prediction error end-to-end.

4) TRAINING PROCESS
The procedure for training the policy network and the value network is shown in Fig. 3. First, the parameter vector of the value network is updated in the $i$th iteration by
$$\boldsymbol{\theta}_v^{i+1} = \boldsymbol{\theta}_v^{i} - \eta_v \nabla_{\boldsymbol{\theta}_v} \frac{1}{|\mathcal{B}_i|} \sum_{\mathcal{B}_i} \Big(V\big(\mathbf{x}_{\mathrm{Mix}}^{(r)}, \tilde{T}_{\mathrm{up}}^{(r)}; \boldsymbol{\theta}_v^{i}\big) - J\Big)^2,$$
where $\tilde{T}_{\mathrm{up}}^{(r)} = \pi\big(\mathbf{x}_{\mathrm{Mix}}^{(r)}; \boldsymbol{\theta}_p^{i}\big) + \xi$ is the explored uplink transmission time, $\xi$ denotes Gaussian white noise with zero mean and variance $\sigma_\xi^2$ that increases the exploration opportunities around the policy output, $J$ is the corresponding convergence speed, $\mathcal{B}_i \in \mathcal{D}_{\mathrm{MFUL}}$ is the batch in the $i$th iteration, $|\mathcal{B}_i|$ is the batch size, and $\eta_v$ is the learning rate of the value network.
Then, according to the updated value network, the parameter vector of the policy network is updated by
$$\boldsymbol{\theta}_p^{i+1} = \boldsymbol{\theta}_p^{i} + \eta_p \nabla_{\boldsymbol{\theta}_p} \frac{1}{|\mathcal{B}_i|} \sum_{\mathcal{B}_i} V\big(\mathbf{x}_{\mathrm{Mix}}^{(r)}, \pi\big(\mathbf{x}_{\mathrm{Mix}}^{(r)}; \boldsymbol{\theta}_p^{i}\big); \boldsymbol{\theta}_v^{i+1}\big),$$
where $\eta_p$ is the learning rate of the policy network.
To train the DNNs of the policy network and the value network offline, we assume that the BS has a simulation dataset for a learning task similar to that of the FL and generates the training samples $\big(\mathbf{x}_{\mathrm{Mix}}^{(r)}, T_{\mathrm{up}}^{(r)}, J\big)$ from it. First, the BS simulates FL on the simulation dataset to generate the global models and the local model updates. Then, to estimate the convergence speeds, the BS generates different channel realizations, compresses the local updates according to the corresponding BA and UTTA, aggregates the compressed local updates to update the global model, and estimates the loss reductions on the simulation dataset.
Existing works, such as [3], [11], and [15], assume that all rounds have identical durations. By contrast, the $T_{\mathrm{up}}^{(r)}$ obtained from MFUL can be adaptively adjusted according to the available bandwidth, the participating clients, and the channel qualities. We call the UTTA with an unchanged $T_{\mathrm{up}}^{(r)}$ ''Fixed-UTTA'' and the UTTA obtained from MFUL ''Adap-UTTA''.

5) CHANNEL INFORMATION REPORTING
Different from the static scenario where the channels remain constant during the training period, in time-varying channels the clients have to report their channel information to the BS frequently. For Max-BA, the BS needs to know the channel information of all clients at the beginning of the current frame, h^(r,t)_dB,k, while for Adap-UTTA, the BS needs to collect the channel gains of all frames in the previous round. There are two ways to report the channel information, called per-frame reporting and per-round reporting, depending on whether the participating clients report their channel information once per frame or once per round. For per-frame reporting, each client reports its channel gain at the beginning of each frame. Then, at the end of each round, the BS can collect the required channel gains for Adap-UTTA. In this way, the channel information for BA is perfect, which however is not applicable to new clients that did not participate in the previous round and hence did not report their channel information. For per-round reporting, the clients only report the channel gains from the past N^(r−1)_up-th frame to the current frame at the beginning of each round. Then, the channel information for BA becomes outdated. Notably, as shown in (31), the channel information for the Max-BA method is primarily used to determine the prioritization of clients for resource allocation. When the channel information is imprecise, the BS may incorrectly identify the client with the highest channel gain. If the channel information is not severely outdated, the Max-BA method may allocate resources to the client with the second highest channel gain instead. However, this does not lead to a significant decrease in the average sparsity rate, indicating the robustness of Max-BA to outdated channel information and the feasibility of adopting the per-round reporting scheme for resource allocation.
Compared with per-frame reporting, per-round reporting can reduce the decision delay. For both per-frame reporting and per-round reporting, each client needs to upload no more than (N^(r−1)_up + 1) channel gains per round. This overhead is far less than that caused by uploading the local model update and can therefore be ignored.
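The claim that reporting overhead is negligible can be illustrated with a back-of-the-envelope comparison. All concrete sizes below (float width, frame count, model size) are illustrative assumptions, not values from the paper.

```python
# Compare the per-round channel-reporting overhead with the cost of
# uploading a local model update.
BYTES_PER_GAIN = 4        # assume one float32 per reported channel gain
N_UP_PREV = 7             # assumed frames in the previous uplink phase
MODEL_PARAMS = 100_000    # assumed parameters in the local model update

reporting_bytes = (N_UP_PREV + 1) * BYTES_PER_GAIN  # at most N_up + 1 gains
model_bytes = MODEL_PARAMS * 4                       # float32 parameters

print(f"reporting: {reporting_bytes} B, model update: {model_bytes} B")
print(f"ratio: {reporting_bytes / model_bytes:.6%}")
```

Even for a modest 100k-parameter model, the reporting traffic is orders of magnitude smaller than one model upload, which is why it can be ignored in the latency analysis.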
When the channel information is severely outdated, we can use the Equ-BA method, which does not require any channel information.

D. SUMMARY
Finally, we summarize the proposed model compression and resource allocation scheme, which combines Rand-M sparsification with the intra-round BA and inter-round UTTA methods. When some clients satisfy ρ^(r,t)_k = 1, the BS does not allocate any bandwidth to these clients.

E. POTENTIAL APPLICATIONS
The proposed method in this paper is applicable to various scenarios that require mobile client involvement in federated learning. Some typical use cases are listed as follows.
1) Mobile client trajectory or channel prediction: predicting the trajectory or channel conditions of mobile clients based on their local location data. It can be applied in traffic management, route planning, wireless resource management, and mobility management.
2) Intelligent traffic flow prediction: predicting traffic flow based on mobile client data such as speed, direction, and travel time. It can be used to optimize traffic flow and reduce congestion.
3) Mobile client behavior analysis: analyzing client behavior patterns and providing personalized recommendations based on mobile client behavior data, such as app usage, clicks, and purchase preferences.
4) Others: there are various other applications where the proposed method can be used, such as mobile client health monitoring, speech recognition, image recognition, and object detection.
In these cases, the mobile clients exhibit mobility, resulting in variations in the number of participating clients, channel gains, and available bandwidth. The proposed model compression and resource allocation method has the following advantages: 1) adaptability to fluctuating client numbers, ensuring flexible resource allocation; 2) requiring only the average channel gain over a period of time, avoiding frequent reporting of channel quality and effectively reducing communication overhead; and 3) dynamic adjustment of transmission duration based on available bandwidth, enhancing resource utilization. Therefore, the proposed method can be easily applied to these use cases.

IV. SIMULATION RESULTS
In this section, we evaluate the performance of the flexible model compression and resource allocation scheme. We first introduce the learning task and simulation setups, then justify the selected compression method and optimization objective, and finally compare the performance of different BA and UTTA methods.

A. LEARNING TASK AND SIMULATION SETUPS
We consider a task of predicting the average channel gains for vehicle clients, where Figs. 4(a) and 4(b) show the time series of channel gains for prediction and the trajectories of the vehicle clients.
As shown in Fig. 4(a), the average channel gains in the prediction window are predicted from those in the history window. Let ĥ_Pre = [ĥ_1, · · ·, ĥ_{N_Pre}]^T and h_His = [h_{1−N_His}, · · ·, h_0]^T denote the vectors of the channel gains in the prediction and history windows, where N_Pre and N_His are the numbers of frames in the two windows and h_n is the average channel gain in the n-th frame. Using the path loss model in 3GPP [20], the average channel gain can be expressed as h_n = 36.8 + 36.7 log10(d_n) + log10(χ), where d_n is the distance between the client and the BS in the n-th frame, χ is the shadowing, and log10(χ) is a Gaussian random variable with zero mean and standard deviation σ_χ = 8 dB.
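As a concrete illustration, the channel model above can be implemented as follows; the trajectory distances and random seed are illustrative assumptions.

```python
import numpy as np

def channel_gain_db(d, sigma_chi=8.0, rng=None):
    """Average channel gain in dB following the model
    h_n = 36.8 + 36.7*log10(d_n) + log10(chi), where log10(chi) is
    Gaussian with zero mean and standard deviation sigma_chi (dB)."""
    rng = rng or np.random.default_rng()
    shadowing_db = rng.normal(0.0, sigma_chi, size=np.shape(d))
    return 36.8 + 36.7 * np.log10(d) + shadowing_db

# Gains along an illustrative trajectory from 10 m to 500 m from the BS
# (the minimum and maximum distances on the simulated roads).
d = np.linspace(10.0, 500.0, 50)
h = channel_gain_db(d, rng=np.random.default_rng(0))
```

Without shadowing (σ_χ = 0), the gain at d = 10 m is 36.8 + 36.7 = 73.5 dB, which matches the deterministic path-loss term.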
To reduce the complexity of the predictor, we consider the single-output predictor in [21], which predicts the average channel gain in the n-th frame at the (n − N_Pre)-th frame, i.e.,

ĥ_n = f(x_n; w), n = 1, · · ·, N_Pre,    (48)

where w is the prediction model vector and x_n = [h_{n−N_His}, · · ·, h_{n−N_Pre}]^T is the input vector. Then, the average channel gains in the prediction window are denoted as ĥ_Pre = [ĥ_1, · · ·, ĥ_{N_Pre}]^T. To minimize the average prediction error, the mean absolute error (MAE) is taken as the loss function.

To evaluate the performance of the predictor under time-varying channels, we consider a mobile network deployed in an urban area. The training and testing data are generated from simulated datasets. First, we generate the vehicle clients' trajectory dataset based on the road topology in Fig. 4(b). Subsequently, we generate the channel gains using the aforementioned 3GPP channel model. As shown in Fig. 4(b), the vehicle clients traverse three roads, each with a length of 500 m. The minimum distances between Roads A, B, C and the BS are 10 m, 100 m, and 200 m, respectively. The vehicle clients strictly adhere to road traffic safety regulations. Speed limits on urban roads vary among countries; for reference, we consult the Road Traffic Safety Law of the People's Republic of China [22], under which the typical maximal speed limit on urban roads is around 60 km/h (approximately 16.7 m/s), 80 km/h (approximately 22 m/s), or even higher. Therefore, in the training dataset, the vehicle speeds range from 18 km/h to 54 km/h (i.e., 5 m/s to 15 m/s). To evaluate the impact of outdated channel information, we consider a maximal speed of 90 km/h (i.e., 25 m/s) during the testing phase. For each client, the training set contains approximately 25,000 to 25,600 samples, while the testing set contains around 8,500 to 8,600 samples.
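The single-output predictor and its MAE loss can be sketched as follows. The window sizes, hidden width, and random weights are illustrative assumptions standing in for the trained FNN f(·; w), not the architecture from Table 2.

```python
import numpy as np

rng = np.random.default_rng(0)
N_HIS, N_PRE, HIDDEN = 8, 2, 16
IN_DIM = N_HIS - N_PRE + 1   # length of x_n = [h_{n-N_His}, ..., h_{n-N_Pre}]

# Hypothetical random weights standing in for the trained predictor.
W1 = rng.normal(scale=0.1, size=(HIDDEN, IN_DIM))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=HIDDEN)

def predict(x):
    """Single-output FNN: predicts one gain h_n from the window x_n."""
    return float(W2 @ np.maximum(W1 @ x + b1, 0.0))  # one ReLU layer

def mae(pred, true):
    """Mean absolute error, the loss used to train the predictor."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(true))))

# Slide the predictor over the prediction window n = 1, ..., N_Pre.
# history holds h_{1-N_His}, ..., h_{N_Pre} (toy values around 100 dB).
history = rng.normal(loc=100.0, scale=5.0, size=N_HIS + N_PRE)
h_hat = [predict(history[n : n + IN_DIM]) for n in range(N_PRE)]
loss = mae(h_hat, history[N_HIS:])
```

Reusing one small network for every frame in the prediction window is what keeps the predictor's complexity low compared with a multi-output model.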
To reduce the MAE of prediction errors, the clients participate in FL to collaboratively train a fully connected DNN (FNN) as the predictor, where the parameters of the mobile network, predictor, and FNN are listed in Table 2.

B. SIMULATION RESULTS
We compare sketched update [4] with SBC [8] in Fig. 5, which are examples of Rand-M sparsification and Top-M sparsification, respectively, where their sparsity ratios are set according to the results in [4] and [8]. The performance of FedAvg without compression is also provided as a baseline. When some mobile clients cannot participate in each round, SBC either does not compensate the error or compensates the stale cumulative error, labeled ''w/o EC'' and ''w/ imperfect EC'' in the legend, respectively. We can see that SBC suffers from severe performance degradation. By contrast, sketched update achieves better learning performance without error compensation, making it applicable to mobile networks. In addition, as shown in Fig. 5, the minimal MAE achieved by FedAvg without model compression is L(w*) = 6.15 dB. When the acceptable relative error is ε = 3%, the required MAE for model convergence is L_ε = (1 + ε)L(w*) = 6.15 × 1.03 ≈ 6.33 dB.
In Table 3, we show the values of λ (r) to demonstrate the tightness of the upper bound on model error caused by random sparsification, as stated in Proposition 1. It can be observed that in each round, the values of λ (r) vary from 0.89 to 0.95. Therefore, the upper bound is tight.
In Fig. 6 and Fig. 7, we investigate the impact of the uplink transmission time T_up on the total communication time and convergence speed, where Equ-BA is considered and the clients only move on Road A. The learning curves for different T_up are compared in Fig. 6. When T_up is set to 1, 3, 5, and 7 seconds, the number of communication rounds decreases monotonically to 286, 80, 61, and 51, respectively. Considering that the downlink transmission time is T_down = 1 second, the total communication time is T_tot = N_round(T_down + T_up) = 572, 320, 366, and 408 seconds, respectively. It decreases first, then increases, and reaches its minimum when T_up = 3 seconds. This indicates that there is an optimal value of T_up that balances the number of communication rounds and the per-round delay, and explains why optimizing UTTA can reduce the total communication time.
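The trade-off above is easy to verify numerically from the reported round counts:

```python
# Verify the totals: with T_down = 1 s, the total communication time is
# T_tot = N_round * (T_down + T_up).
T_DOWN = 1
rounds = {1: 286, 3: 80, 5: 61, 7: 51}   # T_up (s) -> N_round from Fig. 6

totals = {t_up: n * (T_DOWN + t_up) for t_up, n in rounds.items()}
print(totals)   # {1: 572, 3: 320, 5: 366, 7: 408}

best = min(totals, key=totals.get)
print(best)     # -> 3
```

Larger T_up shrinks the round count but inflates the per-round cost, so the product is minimized at an interior point, here T_up = 3 s.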
In Fig. 7, we show the convergence speed and the total communication time. It can be seen that the convergence speed decreases with the round number r during the training process. In every round, as T_up increases, the convergence speed increases first, then decreases, and reaches its maximum when T_up = 3 seconds. This suggests that minimizing the total communication time is equivalent to maximizing the convergence speed of each round. As a consequence, transforming the optimization problem from P1 to P2 does not incur performance loss.
To illustrate the performance gain provided by inter-round UTTA or intra-round BA, we first compare the performance of different BA methods with a given uplink transmission time, and then compare the performance of different UTTA methods with the optimal BA (i.e., Max-BA). The hyperparameters of MFUL for Adap-UTTA are listed in Table 4, where FNNs are used for both the policy network and the value network. In Fig. 8, we compare the total communication time when the available bandwidth B^(r)_sum and the number of participating clients K^(r) are unchanged. In Fig. 9, the performance is shown when either B^(r)_sum or K^(r) changes in different rounds, where clients only move on Road A.
To demonstrate the advantages of the proposed method, we compare it with existing joint optimization approaches [3], [11], [15], [16]. However, these approaches address joint user scheduling optimization [11], [15], [16], consider the impact of computation time on the total round duration [3], [15], [16], or use theoretical analysis results that assume static channel conditions during training [3], [16]. Since our focus is on optimizing the intra-round BA and inter-round UTTA methods, we compare them with the BA and UTTA methods in these studies. Specifically, when ignoring the impact of computation time, the BA algorithms in [3], [15], and [16] reduce to the Inv-BA method, which can be compared with our intra-round BA method. Additionally, the works in [3], [11], and [15] assume identical round durations, which is referred to as the Fixed-UTTA method and can be compared with our inter-round UTTA method. The optimal performance of the Fixed-UTTA method is achieved by exhaustively searching all possible UTTA strategies. Therefore, if our Adap-UTTA method outperforms the Fixed-UTTA method with exhaustive search, we can conclude that our inter-round UTTA method surpasses the existing UTTA methods.
We have discussed two ways of collecting channel information, i.e., per-frame and per-round channel information reporting. Since the former provides perfect channel information for BA, while the latter only provides outdated channel information, we refer to the former as ''perfect channel information'' and the latter as ''outdated channel information''. We first provide the total communication time with perfect channel information in Figs. 8 and 9. Then, to show the robustness of the BA methods to outdated channel information, the performance with outdated channel information is provided in Fig. 10.
As shown in Fig. 8, for any given T_up, Max-BA always achieves the best performance, Equ-BA is in the middle, and Inv-BA is the worst. Since Max-BA requires the minimal total communication time, it can always reach the maximal convergence speed for a given T_up and therefore achieves the highest resource usage efficiency. When clients move on all three roads, the differences between the channel gains and the data rates of the clients are the largest among all scenarios, and hence the gap among the total communication times of the three BA methods is also the largest. This agrees with the analysis in Corollary 1.
When Max-BA is used, the optimal uplink transmission time T*_up in the three scenarios is 3, 4, and 2 seconds, respectively, i.e., it varies across scenarios. To obtain T*_up, Fixed-UTTA needs to collect the channel information of all clients in all rounds at the beginning of FL and find T*_up by exhaustive search. Adap-UTTA does not require future channel information; it can always obtain the optimal value from the well-trained DNN with low complexity.
As shown in Fig. 9, T*_up varies in different rounds as B^(r)_sum or K^(r) changes. As a result, Fixed-UTTA with exhaustive search cannot achieve the minimal total communication time. By contrast, Adap-UTTA can always adjust T_up adaptively according to the available bandwidth and the number of participating users, which shortens the total communication time.
As Max-BA and Adap-UTTA can reach the required performance with the minimal total communication, the proposed scheme can improve resource usage efficiency to accelerate the convergence of FL.
We compare the total communication time of different BA methods with different channel information in Fig. 10, where Adap-UTTA is considered. Specifically, the total communication time when clients move on different trajectories at 5 m/s is shown in Fig. 10(a), while the performance when clients move on Road A at different speeds is shown in Fig. 10(b). Since Equ-BA does not require any channel information, its performance without channel information is provided as a baseline.
In all scenarios, the performance of Max-BA with outdated channel information is close to that with perfect channel information. By contrast, Inv-BA with outdated channel information suffers a considerable performance loss compared with Inv-BA with perfect channel information, especially when the movement speed of the clients increases. Therefore, Max-BA is more robust to outdated channel information than Inv-BA. This is because for Max-BA, the channel information mainly determines which client is allocated resources first, whereas for Inv-BA, the channel information determines how many resources are allocated to each user.
Moreover, with outdated channel information, the total communication time of Max-BA is always lower than that of Equ-BA and Inv-BA. This means that Max-BA outperforms existing BA methods even when using outdated channel information. Specifically, when the clients move on all three roads as shown in Fig. 10(a), the total communication time with Max-BA, Equ-BA, and Inv-BA is 6.2, 7.2, and 8.2 minutes, respectively. When clients move at 25 m/s, the total communication time of Max-BA, Equ-BA and Inv-BA is 4.1, 4.7, and 9.0 minutes.

V. CONCLUSION
In this paper, we proposed a flexible model compression and resource allocation scheme for federated learning in a bandwidth-limited mobile network, where the considered Rand-M sparsification does not require error compensation and is more suitable for mobile clients than Top-M sparsification. The formulated single-round optimization problem to maximize the convergence speed achieves the same performance as the multi-round optimization problem, which not only avoids the difficulty of deriving the number of communication rounds and obtaining future channel information, but also reduces the complexity of solving the joint optimization problem. By taking advantage of end-to-end model-free unsupervised learning, the designed intra-round bandwidth allocation and inter-round uplink transmission time allocation can adaptively assign compression levels and transmission resources to different clients in different rounds. The proposed scheme outperforms the existing methods in dynamic environments and is robust to outdated channel information.

The average sparsity ratios are proportional to the maximum, arithmetic mean, and harmonic mean of R^(r,t)_k, k ∈ K^(r), which are special cases of the generalized mean [23]. By substituting (B.2) into (B.1), we obtain (34).
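For reference, the generalized-mean identity invoked above can be written as follows (standard definition; the notation M_p is ours, not the paper's):

```latex
% Generalized mean of the rates R_k^{(r,t)}, k \in \mathcal{K}^{(r)}:
M_p(R_1,\dots,R_K) = \Bigl(\tfrac{1}{K}\textstyle\sum_{k=1}^{K} R_k^{\,p}\Bigr)^{1/p},
% with the three special cases referenced above:
\lim_{p\to\infty} M_p = \max_k R_k, \qquad
M_1 = \tfrac{1}{K}\textstyle\sum_{k=1}^{K} R_k, \qquad
M_{-1} = \Bigl(\tfrac{1}{K}\textstyle\sum_{k=1}^{K} R_k^{-1}\Bigr)^{-1}.
```

The maximum, arithmetic mean, and harmonic mean thus correspond to p → ∞, p = 1, and p = −1, respectively.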