A Traffic Model Based Approach to Parameter Server Design in Federated Learning Processes

Abstract-This letter proposes a model to describe the data traffic generated by a Federated Learning (FL) process in a wireless network with asynchronous Parameter Server (PS) orchestration and heterogeneous clients. The model accounts for the local update processes implemented by individual clients and it is used to enforce requirements on the PS design, namely to regulate the interval among consecutive global model updates. PS requirements are validated on realistic pools of resource-constrained wireless edge devices, typically found in Internet-of-Things (IoT) setups. Numerical results show that the proposed policy is effective when devices have unbalanced resources, namely, different sample distributions and computational capabilities. It permits an accuracy gain of up to 15-17% on average with respect to typical asynchronous PS designs.
Index Terms-Federated learning over networks, traffic modelling, edge devices, computing.

I. INTRODUCTION
FEDERATED Learning (FL) enables resource-constrained edge devices to cooperate over a network to train a shared Machine Learning (ML) model. It protects data ownership by ensuring that the observations used for training never leave the device that produced them. As depicted in Fig. 1, FL alternates the computation at each device of local model parameters, i.e., the weights of deep neural network layers, with the communication to a Parameter Server (PS) that fuses the local models and returns a global model [1]. Different FL implementations [2] and enablers [3] have emerged in the past few years. Most applications call for geographically distributed [4] and heterogeneous clients with different temporal alignments. In many cases, an asynchronous orchestration of the FL process is also a prerequisite, especially in next generation networks.
Current state-of-the-art asynchronous FL strategies have two main limitations. First, in vanilla algorithms, the PS updates the global model as soon as a local model is received [5], with no regard to client-specific resource constraints. This can lead, for example, to biased updates from faster clients. Second, the update at the clients is not optimized, as the number of local epochs is usually fixed and not tuned according to the type of traffic or the quantity/quality of the data [6]. This letter proposes a moment-matching approximation to represent the traffic generated by clients engaged in an asynchronous FL process. The model is validated through a real FL prototype consisting of physically separated clients implementing distributed training over a wireless network, using the Message Queuing Telemetry Transport (MQTT) protocol. Besides adapting and formalizing the moment-matching technique to the context of FL, the letter provides the necessary requirements on the time interval between consecutive global model updates, namely the server response time $T_{PS}$. This is used by the PS to decide whether to update the global model or not, depending on the backlog of local models retained by the clients. The proposed requirements account for the traffic type, the local data size and quality, and the channel impairments affecting the FL local round time.
The letter is organized as follows. Sec. II introduces the proposed traffic model for asynchronous FL. The model uses the moment-matching approximation and allows the traffic of each client to be categorized using the dispersion index ($D$) metric. The requirements in Sec. III exploit the proposed model to define operational points that regulate the clients and the PS behavior, while Sec. IV describes a practical policy for $T_{PS}$ selection that fulfills the proposed requirements. The policy is validated through a real-time FL platform prototype.

II. FL DATA TRAFFIC MODEL
We consider an FL system composed of one PS and a set of $K$ clients $\mathcal{K} = \{1, \ldots, K\}$, each with its own private dataset $\mathcal{S}_k$ of size $S_k = |\mathcal{S}_k|$. As depicted in Fig. 1, the aim of the FL process is to obtain a global ML model, of size $G$, that minimizes a loss function, $w_G = \arg\min_{w} L(w)$ with $L(w) = \frac{1}{K} \sum_{k=1}^{K} L_k(w, \mathcal{S}_k)$ and $L_k$ being the local costs measured by clients using the data batches $\mathcal{S}_k$. The FL process requires the clients to produce local models through optimization, typically via supervised and gradient-based methods. Each client $k$ performs $M(k)$ local epochs before exchanging the local model with the PS, which is in charge of the global model update.
In asynchronous FL, the PS produces a new instance of the global model $w_{G,t}$ at time $t = nT_{PS}$, $n = 1, \ldots, N_{FL}$ [4]: where $w_{k,t-T_k}$ represents the $k$-th local model available at time $t - T_k$, $T_k \ge 0$ is the time interval required by client $k$ to produce an updated local model, while $T_{PS}$ regulates the time span between two global model updates. $n \le N_{FL}$ is the index of the federated rounds. Finally, $\epsilon_t$ controls the stability of the update. Considering that clients may have different computing capabilities and datasets, the associated network traffic can vary significantly depending on $T_k$. A client-specific characterization is thus proposed to model the local model update process and identify the corresponding traffic pattern. For the exchange of the NN model parameters among the clients and the PS, we propose to employ the MQTT protocol [7], as it enables a real-time exchange of the model parameters and allows the monitoring of the client training time required for $T_{PS}$ tuning. The time required by a client $k$ to implement a local round can be broken down into the time span to download the weights ($T_{down}$), train the new model for $M(k)$ epochs using local data batches ($T_{train}$), and encrypt, compress and upload the weights, i.e., to the MQTT broker ($T_{up}$):
$$T_k = T_{down} + T_{train} + T_{up}. \tag{2}$$
These quantities can be computed locally by each client, through standard time measurement functions, and make it possible to separate the contribution of computing capabilities ($T_{train}$) from possible channel disturbances affecting uplink (UL) and downlink (DL) communications ($T_{up}$, $T_{down}$). Based on the above assumption, we introduce a model to approximate the probability density function $p_{T_k}(T_k)$ of the traffic pattern generated by each client $k$. A moment-matching approximation is employed which divides the process into three categories: Bernoulli, Poisson, and Pascal [8].
We classify the traffic into one of these categories by matching the first two moments, defined respectively as:
$$A(k) = E_n[T_k], \qquad \sigma^2(k) = E_n\left[(T_k - A(k))^2\right]. \tag{3}$$
The Dispersion Index ($D$), also called Variance to Mean Ratio (VMR), is:
$$D(k) = \frac{\sigma^2(k)}{A(k)}. \tag{4}$$
According to the moment-matching technique, we can obtain a Poisson traffic by setting $D(k) = 1$, i.e., by imposing a regular traffic pattern. On the contrary, bursty traffic, i.e., Pascal, and smooth traffic, i.e., Bernoulli, are obtained with $D(k) > 1$ and $D(k) < 1$, respectively. Based on the above metrics, in the following, we give upper and lower bounds on the characteristics of the PS, especially regarding $T_{PS}$.
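As an illustration of the classification step, the following minimal sketch computes the sample mean, variance and dispersion index from a list of measured local round times and maps the result to one of the three traffic categories. The function names and the tolerance band `tol` around $D(k) = 1$ are assumptions introduced here for illustration; they are not part of the letter.

```python
from statistics import fmean, pvariance


def dispersion_index(round_times):
    """Return (mean, variance, D) for measured local round times T_k."""
    mean = fmean(round_times)
    var = pvariance(round_times, mu=mean)
    return mean, var, var / mean


def classify_traffic(d, tol=0.1):
    """Map a dispersion index to a traffic category.

    tol is a hypothetical tolerance band around D = 1, needed because a
    sample estimate is never exactly equal to 1.
    """
    if d > 1 + tol:
        return "Pascal"      # bursty traffic, D > 1
    if d < 1 - tol:
        return "Bernoulli"   # smooth traffic, D < 1
    return "Poisson"         # D close to 1


# Made-up round times (seconds) for a client with nearly constant T_k.
times = [3.1, 3.5, 3.4, 3.6, 3.3]
A, var, D = dispersion_index(times)
print(A, var, D, classify_traffic(D))
```

In practice each client could run this at the end of a local round on its own timing measurements, so that only the resulting statistics need to be reported to the PS.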

III. MINIMAL REQUIREMENTS ON CLIENTS AND $T_{PS}$
The choice of the server response time $T_{PS}$ is underpinned by the local model update process running on each client, and therefore by the number $M(k)$ of epochs, which directly affects the dispersion index $D(k)$ in (4). Low values of $M(k)$ correspond to frequent contributions of the clients to the global model, at the expense of an increased communication overhead and possibly non-informative local model updates. Conversely, a large $M(k)$ forces the client to implement many local epochs and possibly produce a biased local model (penalized by overfitting). Optimal values of $M(k)$ satisfy $M_L(k) \le M(k) \le M_U(k)$, such that the client local model can improve the FL process while satisfying the communication overhead constraints. The upper bound $M_U(k)$ is the maximum $M(k)$ before the client starts overfitting. Note also that $M_L(k)$ is limited by the UL and DL maximal communication efficiency $\eta_{MAX}$ [bit/sec/Hz] dedicated to the link between the PS and the clients, with bandwidth $B$. Being $T_{FL} = N_{FL} T_{PS}$ the FL training duration, it is: This leads to the following: Note that $M_U(k)$ depends mainly on the size of the training data and of the local model, since more data (or smaller models) require more local epochs for overfitting. For client $k$, and $w_{k,m}$ being the local model observed at local epoch $m \in \{1, \ldots, M(k)\}$, $M_U(k)$ is assigned as: where $L_k(w_{k,m}, \mathcal{S}^{val}_k)$ is the validation loss computed by client $k$ on the validation dataset $\mathcal{S}^{val}_k$ at epoch $m$. Too many local epochs between consecutive global model updates, namely $M(k) > M_U(k)$, might produce biased local models when $T_{PS} \ll A(k)$. These could negatively contribute to the FL process by slowing down convergence, reducing the accuracy [9], or possibly preventing the device from completing the local round [10].

Fig. 2. In orange, red and green, the probability density function $p_{T_k}(T_k)$ of a client with hardware ARM-Cortex-A57 SoC, GPU: 128-core Maxwell (Jetson Nano model, i), ARM-Cortex-A72 SoC (Raspberry Pi 4, ii), ARMv8-Cortex-A53 SoC (Raspberry Pi 3, iii), respectively. The dotted black line represents the log-normal distribution that fits the real probability density function of $T_k$. $M(k)$ is set to 2 and the model size $S$ is 51 KB.
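One simple way to locate the onset of overfitting from the per-epoch validation loss is sketched below. It assumes that $M_U(k)$ is taken as the epoch at which the validation loss is lowest; this reading, the function name and the loss values are illustrative assumptions and are not taken from the letter.

```python
def upper_epoch_bound(val_losses):
    """Return the 1-based local epoch with the lowest validation loss.

    val_losses[m - 1] is the validation loss measured after local epoch m.
    Assumption: the overfitting region starts right after this epoch.
    """
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


# Made-up validation losses: they decrease up to epoch 4, then rise again.
losses = [2.10, 1.70, 1.50, 1.42, 1.45, 1.60]
m_upper = upper_epoch_bound(losses)
print(m_upper)
```

A noisy loss curve would likely call for smoothing or a patience window before the minimum is accepted; this is omitted to keep the sketch short.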

IV. PS DESIGN PRINCIPLES AND VALIDATION
This section proposes a policy to regulate the PS response time $T_{PS}$ based on the knowledge of the client dispersion index $D(k)$, the dataset size $S_k$ and possible conditions on local overfitting. The proposed policy is validated in two scenarios where clients are characterized by different traffic patterns, namely varying computing capabilities, and non Independent and Identically Distributed (non-IID) datasets. Validation is based on an FL platform prototype.

A. FL Network Platform and Traffic Modelling
Fig. 2 provides a validation of the proposed client-specific traffic modelling approach based on moment matching. We consider a realistic pool of resource-constrained devices equipped with: i) CPU ARM-Cortex-A57 and GPU 128-core Maxwell (Jetson Nano model [11], orange), ii) CPU ARM-Cortex-A72 SoC (Raspberry Pi 4, red) and iii) CPU ARMv8-Cortex-A53 SoC (Raspberry Pi 3, green). For each client, we collected measurements of the local round times $T_k$ to obtain the sample probability functions $p_{T_k}(T_k)$. Notice that each client is connected via WLAN to a router which forwards the MQTT packets to the PS. The traffic parameters $T_k$ and $D(k)$ are computed directly by the clients at the end of each local round and then sent through a dedicated connection to the PS. The measured statistics $p_{T_k}(T_k)$ are thus reliable and realistic, as they are independent of the PS hardware and of the FL processing. As evident from Fig. 2, the local round time distributions are well approximated by log-normal distributions (dashed lines) with mean and standard deviation of 1/0.07, 3.4/0.2 and 10/0.3 for clients i), ii) and iii), respectively.
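Local round times with a prescribed mean and standard deviation can be drawn from a log-normal model as in the sketch below. It reads each reported pair as the mean and standard deviation of $T_k$ itself (not of $\log T_k$), which is an assumption; the function names and the seed are likewise illustrative.

```python
import math
import random


def lognormal_params(mean, std):
    """Convert the mean/std of T_k into (mu, sigma) of the underlying normal."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


def sample_round_times(mean, std, n, seed=0):
    """Draw n simulated local round times from the matched log-normal."""
    rng = random.Random(seed)
    mu, sigma = lognormal_params(mean, std)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


# Client ii) under the stated assumption: mean 3.4, standard deviation 0.2.
samples = sample_round_times(3.4, 0.2, 10_000)
print(sum(samples) / len(samples))
```

The conversion uses the standard identities $E[T_k] = e^{\mu + \sigma^2/2}$ and $\mathrm{Var}[T_k] = (e^{\sigma^2} - 1)\, e^{2\mu + \sigma^2}$ of the log-normal distribution.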
The goal of the following tests is to analyze the impact of client heterogeneity on the PS response time $T_{PS}$. Based on the experiments in Fig. 2, we simulate different execution times of the local rounds according to the log-normal model.

Algorithm 1 (excerpt)
  Compute $M_L(k)$ with (6), $M_U(k)$ with (7)
  Compute performance metric: $P = \mathcal{P}(w_{k,M_U(k)}, \mathcal{S}^{val}_k)$
  Return $T^*_{PS}(k) \leftarrow A^*(k) = E_n[T^*_k]$
end procedure

B. Client-Specific Policy for the PS Response Time
The choice of the PS response time $T_{PS}$ must take into account both the traffic model of Sec. II and the requirements of Sec. III. The optimal server time $T^*_{PS}$ corresponds to a value $M^*(k)$ bounded by $M_L(k)$ and $M_U(k)$. The main idea, shown in Algorithm 1, is that each client computes its own optimal $M^*(k)$: where $F$ is a policy function. Function $F$ takes as input the local accuracy $P$ and the traffic type $D(k)$. It can be written analytically as: where $Q_k(\gamma P) = m : \mathcal{P}(w_{k,m}, \mathcal{S}^{val}_k) = \gamma P$ is the number of epochs that corresponds to a validation accuracy of $\gamma P$, and $\mathcal{P}(\cdot)$ is the cross-entropy accuracy function. $P = \mathcal{P}(w_{k,M_U(k)}, \mathcal{S}^{val}_k)$ is obtained at local epoch $M_U(k)$, $C > 0$ is a constant (see Sec. IV-C) and $0 < \gamma < 1$ is a hyper-parameter. The optimal $M^*(k)$ is bounded as $M^*(k) \ge M_L(k)$ from (8), and $M^*(k) \le M_U(k)$, which follows from (9), since $C \cdot D(k)$ is a positive term and $Q_k \le M_U(k)$ as $\gamma P < P$.
By replacing $M^*(k)$ in (2), each client derives the PS time $T^*_{PS}(k) = A^*(k)$ using the device-specific parameters $M_U(k)$, $M_L(k)$ and $D(k)$, as well as the training and validation datasets, $\mathcal{S}^{train}_k$ and $\mathcal{S}^{val}_k$, respectively, as inputs. The device returns the $T^*_{PS}(k)$ value to the PS, which makes a final decision for $T_{PS}$. The traffic statistics and the upper and lower bounds, $M_U(k)$ and $M_L(k)$, are obtained independently by each client during an initial training stage using $M(k) = 1$. The log-normal model parameters (3) are computed by means of consecutive time measurements $T_k$ that account for the global model download, local training and model upload steps as in (2). After the training stage, we obtain the performance metric $P$ and apply the policy function $F$ in (8) using $P$ and the VMR $D(k)$. Fig. 3 shows an example of local training, with loss and validation accuracy $P$ for varying local epochs. Notice that a few epochs are typically sufficient to improve the local model without incurring overfitting. The cross-entropy function $P$ in FL also follows a negative exponential behavior, namely $P \approx a - e^{-bm}$ for $m < M_U$, and for this case $\gamma = 0.5$ is found to be reasonable (see Sec. IV-C). With the proposed policy we avoid the overfitting region while still transferring a large portion of the local information. The term $C \cdot D(k)$ in (9) accounts for the traffic variance. Intuitively, a high variance, as a result of a client with varying computing resources, might increase the probability of observing long local training intervals (slow client). In such a scenario, choosing a low value of $M(k)$ allows the PS to process the client local model updates more often and compensate for this effect.

Fig. 3. Example of validation accuracy (blue) and loss (red) during local model training on a client. $M_U(k)$ and $M^*(k)$ values are highlighted together with accuracy $P$ and $\gamma P$ ($\gamma = 0.5$), respectively. Solid and dotted lines are obtained with 100% and 10% of the training dataset (of size $S_k$). Note that overfitting is expected, as the local training process uses few training samples.
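The client-side selection of $M^*(k)$ can be sketched as follows. The code assumes that the policy takes the form $Q_k(\gamma P) - C \cdot D(k)$, rounded down and clipped to $[M_L(k), M_U(k)]$; this form is inferred from the description of the bounds and of the term $C \cdot D(k)$ and is an assumption, as are the function names, the use of "first epoch reaching $\gamma P$" for $Q_k$, and the accuracy values.

```python
import math


def epochs_to_reach(val_acc, target):
    """Q_k: first 1-based epoch whose validation accuracy reaches target."""
    for m, acc in enumerate(val_acc, start=1):
        if acc >= target:
            return m
    return len(val_acc)


def optimal_epochs(val_acc, m_lower, m_upper, d, gamma=0.5, c=0.2):
    """Hypothetical policy: floor(Q_k(gamma * P) - c * D), clipped to the bounds.

    val_acc[m - 1] is the validation accuracy after local epoch m, and P is
    the accuracy observed at epoch m_upper (the overfitting threshold).
    """
    p = val_acc[m_upper - 1]
    q = epochs_to_reach(val_acc[:m_upper], gamma * p)
    m_star = math.floor(q - c * d)
    return max(m_lower, min(m_upper, m_star))


# Made-up accuracy curve with a saturating shape.
acc = [0.10, 0.18, 0.24, 0.29, 0.33, 0.36, 0.38, 0.39]
for d in (0.0, 10.0, 100.0):
    print(d, optimal_epochs(acc, m_lower=1, m_upper=8, d=d))
```

Under this assumed form, a larger dispersion index lowers the number of local epochs until the lower bound $M_L(k)$ is reached, which matches the intuition that clients with highly variable round times should report more often.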
The PS collects the optimal $T^*_{PS}(k)$ computed by Algorithm 1 and derives the response time to be used for all clients. Considering the previous analysis, this is obtained as: with the requirements on the efficiency already satisfied by $M^*(k) \ge M_L(k)$. In the following, we explore two scenarios in detail. In the first one, the clients are homogeneous, as they feature the same VMR $D(k)$, $M^*(k)$ and $T_{train}(k)$. In the second scenario, the clients are heterogeneous and organized into two clusters ($C_1$ and $C_2$). Clients in each cluster, $k \in C_i$, have similar computing power/capabilities, namely $A(k) = A_i$, $\sigma^2(k) = \sigma^2_i$ and VMR $D(k) = D_i$, $\forall k \in C_i$, corresponding to the same processing unit, and TPU, if any. Accuracy improvements obtained by following the policy (10) are assessed in both cases.

C. Experimental Assessment
For the experiments, we consider the CIFAR10 [12] dataset using the full validation data and a local training set of $S_k = 500$ images for each of the $K = 9$ clients. Policy validation is based on a real-time FL platform prototype featuring physically separated clients (here Jetson Nano devices). The approach adopted for modelling the client heterogeneity is twofold. First, we used an additive delay whose log-normal distribution fits the observed completion times in Fig. 2. Second, the training data is non-IID distributed. To simplify the analysis, the data size is the same for all clients (as typical in FL). The prototype consists of devices connected via WLAN to the PS, which makes it possible to quantify the impact of communication impairments. The adopted FL algorithm is FedAvg [9], while Adam [13] is used as local optimizer with default hyper-parameters. The ML model is described in Fig. 1 and consists of $10^5$ parameters with size $G = 2.53$ MB. Loss and performance metrics are the categorical cross-entropy and categorical accuracy, respectively. For each considered case, the validation accuracy is evaluated after $T_{FL} = 800$ seconds. Considering the communication with the PS, the bandwidth is $B = 40$ MHz while the maximum efficiency dedicated to FL is set to $\eta_{MAX} = 0.05$ bit/sec/Hz. MQTT publishing uses Quality of Service (QoS) level 2, as this permits re-transmissions in exchange for larger $T_{up}(k)$ and $T_{down}(k)$.
As previously described, the FL process starts with a client local training to find $M_U(k)$ and subsequently $M^*(k)$ and $T^*_{PS}(k)$ according to Algorithm 1, with $\gamma = 0.5$. $C = 0.2$ adapts the VMR $D(k)$ contribution in (9) to the considered traffic types. Fig. 3 shows the validation loss/accuracy versus the local epochs that are used to retrieve $M_U(k)$ and $M^*(k)$ from (7) and (9). For clients with training data size $S_k$, the optimal $M^*(k)$ is 6 epochs. Setting now the training size to 10%, we observe shifts in the overfitting region (around $M_U(k)$) according to the bias-variance of the model: now $M^*(k) = 4$.
In Table I we consider at first homogeneous clients, i.e., clients with the same computing power and traffic distribution, but with non-IID data (80% of the local samples are drawn from the same class, chosen randomly). Traffic parameters are set to vary within the set $A(k) \in [2, 80]$ s and $\sigma^2(k) \in [0.001, 100]$. To evaluate the proposed policy, we choose $T_{PS} \in [1, 400]$ s and obtain the empirical optimal $T_{PS}$ for each type of traffic. Notice that for the proposed example, setting $\eta_{MAX} = 0.05$, a client with $A(k) \le 80$ s would require $M_L(k) = 1$. Table I summarizes the results for all experiments grouped based on the traffic type (Bernoulli, Poisson, Pascal), mainly determined by the value of $\sigma^2(k)$. For each experiment, we highlight the traffic parameters $A(k)$, $D(k)$ (obtained for $M(k) = 1$), the resulting optimal $T^*_{PS}$, the corresponding $M^*(k)$ and the fraction (%) of utilized bandwidth w.r.t. $\eta_{MAX}$. The last column quantifies the accuracy improvement w.r.t. a vanilla asynchronous strategy where the PS updates the global model as soon as a local model is available.

TABLE II. CLIENTS ORGANIZED IN TWO CLUSTERS ($C_1$ AND $C_2$) AND NON-IID DATA (SEE TABLE I)
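The non-IID local datasets described above can be emulated with a sampling routine such as the one sketched below. It assumes that the remaining 20% of the samples are drawn from the other classes, which the text does not specify; the function name, the seed and the stand-in label list are illustrative only.

```python
import random


def non_iid_indices(labels, n_samples, dominant_frac=0.8, seed=0):
    """Pick n_samples indices so that dominant_frac of them share one class.

    The dominant class is chosen at random. Assumption: the remaining
    samples come from the other classes only.
    """
    rng = random.Random(seed)
    dominant = rng.choice(sorted(set(labels)))
    in_class = [i for i, y in enumerate(labels) if y == dominant]
    others = [i for i, y in enumerate(labels) if y != dominant]
    n_dom = round(dominant_frac * n_samples)
    picked = rng.sample(in_class, n_dom) + rng.sample(others, n_samples - n_dom)
    rng.shuffle(picked)
    return dominant, picked


# Stand-in for the labels of a 10-class dataset with 50,000 training images.
labels = [i % 10 for i in range(50_000)]
dominant, idx = non_iid_indices(labels, n_samples=500)
print(dominant, len(idx))
```

Calling the routine with a different seed per client yields local sets of equal size whose dominant classes differ across clients.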
Considering the Bernoulli distribution (Table I.a) and $D(k) \approx 0$, the optimal values of $T_{PS}$ and $M(k)$, obtained empirically, are in line with the model (9), which gives $M^*(k) = 6$ for $\gamma = 0.5$ and $C = 0.2$. On the other hand, when $A(k) > 20$, namely when the clients are all very slow, it is more convenient to keep $T_{PS}$ as low as possible, rather than following the policy (waiting for the end of the optimized round). Increasing $D(k)$, namely using Poisson and Pascal traffic, this behaviour is less evident, as the optimal $M^*(k)$ (4 and 1, respectively) is now in line with the policy and maintained for all $A(k)$. To summarize, the policy is effective for the prediction of the optimal values of $M^*(k)$ and can be used to tune $T^*_{PS}$ when $A(k) \ll T_{FL}$ (see cases highlighted in green). Table II analyzes a more general case where the clients belong to clusters $C_1$ and $C_2$ and have different local learning completion times. Cluster $C_2$ contains much slower clients compared with $C_1$: the example is thus useful to verify the proposed policy for $T_{PS}$ and whether the PS should specifically follow any cluster $C_i$ or not. To this end, we vary $A(k) = A_i$, $i = 1, 2$, of both clusters within the set $[1, 80]$ s, $T_{PS} \in [1, 400]$ s and $\sigma^2_i \in [0.001, 100]$ s$^2$, $i = 1, 2$. Cluster $C_1$ contains 5 clients while the remaining 4 clients belong to $C_2$. The results highlighted in blue show that the optimized $T_{PS}$ should be set to follow the optimal number of local rounds $M^*(k)$ of the faster clients. For example, this can be seen in the extreme case of ($A_1 = 1$, $A_2 = 80$), where the cluster $C_2$ has almost no effect on the choice of $T^*_{PS}$. For all the considered cases, the use of an underestimated value of $M^*(k)$ should be preferred to prevent overfitted local models. In other words, it is more beneficial to update the global model more often, following the faster clients, $k \in C_1$, as opposed to waiting for the slower clients, $k \in C_2$, to complete their round.
To conclude, we observe that designing $T_{PS}$ based on the knowledge of the client-specific traffic outperforms typical asynchronous FL strategies. Optimization of $T_{PS}$ is particularly critical for the case of two clusters. The observed accuracy gain increases from 5.1% (single cluster) to 15-17% on average.

V. CONCLUSION
The letter proposed a stochastic traffic model to describe the clients' behavior in FL processes. The model is based on the moment-matching approximation and is verified with practical resource-constrained devices communicating with a Parameter Server (PS) using MQTT transport. Traffic characterization is used to develop a policy for the selection of the PS response time $T_{PS}$ in asynchronous FL. The policy satisfies spectral efficiency constraints and avoids FL overfitting impairments. Results obtained in real setups show that an accuracy increase of up to 15-17% is possible when clients exhibit different local model completion times.