Federated Learning Using Three-Operator ADMM

Federated learning (FL) has emerged as an instance of the distributed machine learning paradigm that avoids the transmission of data generated on the users' side. Although data are not transmitted, edge devices have to deal with limited communication bandwidths, data heterogeneity, and straggler effects due to the limited computational resources of users' devices. A prominent approach to overcome such difficulties is FedADMM, which is based on the classical two-operator consensus alternating direction method of multipliers (ADMM). The common assumption of FL algorithms, including FedADMM, is that they learn a global model using data only on the users' side and not on the edge server. However, in edge learning, the server is expected to be near the base station and often has direct access to rich datasets. In this paper, we argue that it is much more beneficial to leverage the rich data on the edge server than to utilize only user datasets. Specifically, we show that the mere application of FL with an additional virtual user node representing the data on the edge server is inefficient. We propose FedTOP-ADMM, which generalizes FedADMM and is based on a three-operator ADMM-type technique that exploits a smooth cost function on the edge server to learn a global model in parallel to the edge devices. Our numerical experiments indicate that FedTOP-ADMM achieves a substantial gain, up to 33%, in communication efficiency to reach a desired test accuracy with respect to FedADMM that includes a virtual user on the edge server.


I. INTRODUCTION
Centralized training of machine learning models becomes prohibitive for a large number of users, particularly if the users, also known as clients, agents, or workers, have to share a large dataset with the central server. Furthermore, sharing a dataset with the central server may not be feasible for some users due to privacy concerns. Therefore, training algorithms using distributed and decentralized approaches are preferred. This has led to the concept of federated learning (FL), which results from the synergy between large-scale distributed optimization techniques and machine learning. Consequently, FL has received considerable attention in the last few years since its introduction in [1], [2].
In the FL framework, illustrated in Figure 1, a distributed optimization problem, such as problem (1), is essentially solved considering a central server and other devices by exchanging the parameters/weights of a considered model rather than sharing private data among themselves. The devices desire to achieve a learning model using data from all the other devices for the training. Instead of sending data from the devices to the edge server that computes such a model, these devices execute some local computations and periodically share only their parameters. Specifically, the FL technique intends to minimize a finite sum of (usually assumed) differentiable functions that depend on the data distributions on the various devices. The common solution to such a minimization involves an iterative procedure, wherein at each global communication iteration, the clients transmit their updated local parameters to a central edge server, illustrated as Step ① in Figure 1. The edge server then updates the global model parameters, shown as Step ② in Figure 1. However, the processing Steps 2a and 2b are specific to our proposed approach, and leverage the possibility of using a dataset on the edge server to improve the learning model. Thereafter, the server broadcasts the updated parameters or weights to all or the selected nodes, see Step ③ in Figure 1. Lastly, the clients update their local parameters using their local (private) dataset and the received global model parameters from the central server, Step ④ in Figure 1, to proceed to the next iteration.

S. Kant, B. Göransson, and G. Fodor are with Ericsson AB and KTH Royal Institute of Technology, Stockholm, Sweden (e-mail: {shashi.v.kant, bo.goransson, gabor.fodor}@ericsson.com). The work of S. Kant was supported in part by the Swedish Foundation for Strategic Research under grant ID17-0114.
José Mairton B. da Silva Jr. is with Princeton University, Princeton, NJ, USA, and also with the KTH Royal Institute of Technology, Stockholm, Sweden (e-mail: jmbdsj@kth.se). His work was supported by the European Union's Horizon Europe research and innovation programme through the Marie Skłodowska-Curie project FLASH under Grant 101067652.
M. Bengtsson and C. Fischione are with KTH Royal Institute of Technology, Stockholm, Sweden (e-mail: {matben, carlofi}@kth.se).

Many state-of-the-art FL techniques, such as FedAvg [2] and FedProx [3], can be seen as instances of one-operator proximal splitting techniques [4], i.e., described in (1) with function g = 0, which consider learning mainly on the users' side. Differently from these techniques, we propose to go beyond and use a three-operator proximal splitting technique, such as our recently proposed three-operator alternating direction method of multipliers (TOP-ADMM) method [6], to leverage the possibility of learning on the edge server together with the traditional learning on the edge devices.

A. Motivation for Learning Model on the Edge Server
In future (6G) cellular networks, machine learning services will be used both to design the networks and as a service provided by the networks [7]. We expect to leverage the raw data generated not only on the users' side, but also on the edge server's side, collocated at the base station and/or the radio access network. More concretely, we foresee scenarios in which the raw data is available at the base station because data is generated continuously at the physical layer and radio in 5G NR and beyond. Thus, data on the server side, together with data on the users' side, enable many distributed signal processing applications, such as joint communication and sensing, or edge learning with the Internet of Things (including edge devices). The optimization problem for FL considering learning on the edge server and on the edge devices is naturally described by a sum of three functions or three operators, see (5). A three-operator problem for FL can be solved via traditional block-wise two-operator splitting techniques or by treating the learning model on the edge server as a virtual user, e.g., in the existing two-operator-based FedADMM [8]. However, we show numerically, see Figure 5, that such an approach is not necessarily more communication efficient than tackling such a problem fundamentally from the three-operator proximal splitting perspective.
To the best of our knowledge, FL using three-operator techniques, which tackle the learning on both the server and the users independently, has not been addressed in the literature. This proposal is highly novel and benefits from the rich datasets available on the edge server and edge devices in current fifth generation (5G) and future 6G cellular networks. Hence, these are the key motivating reasons to consider a three-operator problem being tackled in an FL fashion.

B. Contributions
We present a new communication- and computationally efficient FL framework, referred to as FedTOP-ADMM, using a three-operator ADMM method. Specifically, our key contributions are:
• We demonstrate the viability of a practical edge learning scenario in which private datasets are available on the devices, and another private dataset is available on the edge server/base station. We model this edge learning scenario using a novel three-operator splitting method that benefits from the private datasets on both the edge server and the edge devices.
• We propose the FedTOP-ADMM method by applying and extending our recently proposed TOP-ADMM method [6] to tackle a composite optimization problem (5) comprising a sum of three functions (or three operators). More specifically, we propose two variants of FedTOP-ADMM, termed FedTOP-ADMM I and FedTOP-ADMM II, where FedTOP-ADMM II does not learn on the server side when aggregating the parameters from the users and the server itself to generate a common model parameter. However, FedTOP-ADMM I learns a model before the aggregation of the parameters, in addition to learning in parallel with the users. Thus, FedTOP-ADMM I has slightly better performance than FedTOP-ADMM II.
• We extend the results of [6] by establishing a new theoretical convergence proof of TOP-ADMM under general convex settings. Additionally, extending the convergence results of TOP-ADMM, we prove the optimality conditions of FedTOP-ADMM.
• FedTOP-ADMM capitalizes on the data possibly available on the edge server collocated at the base station, in addition to the data available on the users' side. Consequently, our numerical experiments show a noticeable communication efficiency gain over the existing state-of-the-art FL schemes using real-world data.
• Our proposed FedTOP-ADMM is built on the existing framework of the communication- and computationally efficient FedADMM [8]. Therefore, FedTOP-ADMM inherits all the merits of FedADMM. Furthermore, we show that our proposed FedTOP-ADMM is up to 33% more efficient in terms of communication rounds to achieve the same target test accuracy as FedADMM, where the extra dataset on the edge server is modelled as an additional virtual client collocated at the base station.

II. STATE OF THE ART
In this section, we briefly describe the related works on FL, including two- and three-operator proximal splitting techniques useful for FL.
A. Related Works on Communication-Efficient FL

FL may suffer from two drawbacks: privacy leakage and communication inefficiency. Although FL keeps the data local on the clients and thus inherently has privacy properties, it does not guarantee complete privacy, because significant information may still leak through observing the gradients [13]. Moreover, FL algorithms may have an unsustainable communication cost: the local parameters must be communicated via uplink from the devices to the edge server, and via downlink from the server to the local devices. The local parameters can be vectors of huge sizes whose frequent transmission and reception may deplete the battery of the devices and consume precious communication resources.
A number of works have addressed the problem of communication efficiency in FL [11], [14]–[19]. We can roughly divide them into two classes: 1) data compression, in terms of quantization and sparsification of the local parameters before every transmission [14]–[17], and 2) reduction of the communication iterations [11], [17]–[19]. The works in the second class attempt to reduce some communication rounds between the devices and the edge server, as for example proposed in the lazily aggregated gradient (LAG) approach [11], [18]. In LAG, each device transmits its local parameter only if the variation from the last transmission is large enough. However, both classes of approaches assume an underlying iterative algorithm whose iterations are sometimes eliminated or whose carried information (in bits) is processed to consume fewer communication resources. The process of making the underlying algorithm itself more communication efficient, regardless of the improvements that can be done on top of it, has been less investigated. The state of the art can be found in FedADMM [8], which not only allows the periodic averaging of users' parameters to reduce the communication rounds but also aims to improve the communication efficiency by using one-operator proximal splitting techniques in FL methods. Our paper focuses on this line of research and extends it to three-operator proximal splitting while proposing learning jointly on the edge server and the edge devices' side.

B. Related Works on Operator/Proximal Splitting
In recent decades, a plethora of proximal or operator splitting techniques, see, e.g., [6], [20]–[28], have been proposed in the literature. Although the convergence of many proximal/operator splitting algorithms is proven only for convex settings, these techniques can still be employed to solve many nonconvex problems prevalent in machine learning (without convergence or performance guarantees). In the past decade, ADMM-like methods, see, e.g., [21], [28], have enjoyed a renaissance because of their wide applicability in large-scale distributed machine learning problems, by breaking down a large-scale problem into easy-to-solve smaller problems. However, these proximal splitting techniques are typically first-order methods, which can be very slow to converge to solutions of high accuracy. Nonetheless, modest accuracy can be sufficient for many practical FL applications.
Operator splitting for two operators, or loosely speaking, for optimization problems in which the objective function is given by the sum of two functions, has recently been employed for FL [4], [8], [29]–[31]. However, some FL optimization problems, see problem (5), can be cast as a composite sum of three functions comprising smooth and nonsmooth terms. Unfortunately, operator splitting with more than two composite terms in the objective function is either not straightforward or converges slowly [26], [27]. Recently, some authors have extended the primal-type or dual-type algorithms, including ADMM-type algorithms, as well as the primal-dual classes of splitting algorithms, from two operators to three operators [6], [26]–[28], [32]–[36].

C. Notation and Paper Organization
Let the sets of complex and real numbers be denoted by C and R, respectively. ℜ{x} denotes the real part of a complex number x ∈ C. The i-th element of a vector a ∈ C^{m×1} and the j-th column vector of a matrix A ∈ C^{m×n} are denoted by a[i] := (a)_i ∈ C and A[:, j] ∈ C^{m×1}, respectively. We form a matrix by stacking a set of vectors column-wise. The transpose and conjugate transpose of a vector or matrix are denoted by (·)^T and (·)^H, respectively. The complex conjugate is represented by (·)^*. The K × K identity matrix is written as I_K. The i-th iterative update reads (·)^{(i)}.
The remainder of the paper is organized as follows. In the next section, we introduce the TOP-ADMM technique. In Section IV, we establish our proposed FedTOP-ADMM algorithm. In Section V, we present the numerical results, and in Section VI we conclude with a summary and future work. Appendix A contains some useful definitions and lemmas. Appendix B presents a novel convergence proof of our recently proposed TOP-ADMM [6] algorithm.

III. INTRODUCTION TO THE TOP-ADMM ALGORITHM
In this section, we first introduce the classical two-operator consensus ADMM. Subsequently, we present our recently proposed TOP-ADMM algorithm [6], [37].

A. Classical Consensus ADMM
The two-operator consensus ADMM [21], [22] is a popular method in the optimization and machine learning communities to solve problems of the form

  minimize_{x_1,…,x_M, z}  Σ_{m=1}^{M} f_m(x_m) + g(z)
  subject to  x_m = z,  m = 1, …, M,        (1)

where {f_m} and g(·) are closed, convex, and proper functions. The classical ADMM algorithm that solves problem (1) can be summarized, following [21, Chapter 7], as

  x_m^{(i+1)} = argmin_{x_m}  f_m(x_m) + (ρ/2) ‖x_m − z^{(i)} + y_m^{(i)}/ρ‖_2^2,   (2a)
  z^{(i+1)} = argmin_{z}  g(z) + (ρ/2) Σ_{m=1}^{M} ‖x_m^{(i+1)} − z + y_m^{(i)}/ρ‖_2^2,   (2b)
  y_m^{(i+1)} = y_m^{(i)} + ρ (x_m^{(i+1)} − z^{(i+1)}),   (2c)

where y_m denotes the scaled dual variable and ρ ∈ R_{>0} is the penalty parameter. Linearized (proximal) ADMM variants add proximal terms with positive definite matrices Q_x and Q_z to (2a) and (2b), respectively.
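As a toy illustration of the consensus updates (2a)-(2b) and the dual ascent step, the following sketch runs consensus ADMM on quadratic local costs f_m(x) = ½‖x − b_m‖² with g = 0, for which every update has a closed form; the function name and test data are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

# Minimal consensus-ADMM sketch for problem (1) with M quadratic local
# costs f_m(x) = 0.5*||x - b_m||^2 and g = 0, so the minimizer of the
# sum is the average of the b_m.

def consensus_admm(b, rho=1.0, iters=100):
    M, n = b.shape
    x = np.zeros((M, n))          # local primal variables x_m
    z = np.zeros(n)               # global consensus variable
    y = np.zeros((M, n))          # scaled dual variables y_m
    for _ in range(iters):
        # (2a): x_m-update, closed form for the quadratic f_m
        x = (b + rho * (z - y)) / (1.0 + rho)
        # (2b): z-update; with g = 0 this is the average of x_m + y_m
        z = np.mean(x + y, axis=0)
        # (2c): dual ascent on the consensus constraint x_m = z
        y = y + (x - z)
    return z

b = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
z_star = consensus_admm(b)
# converges to the average of the rows, i.e., [3.0, 4.0]
```

With g = 0 the z-update reduces to plain averaging, which is exactly the aggregation step that FL methods built on this recursion mimic.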

B. Consensus TOP-ADMM
The recently proposed three-operator-based TOP-ADMM [6] is a generalization of the classical consensus ADMM. TOP-ADMM can be employed to solve many centralized and distributed optimization problems in signal processing that are difficult to solve with traditional ADMM methods. Specifically, the TOP-ADMM algorithm has been used for spectrum shaping via spectral precoding and for peak-to-average power ratio reduction in multicarrier wireless communication systems, such as 5G cellular networks [6], [37], which were difficult to solve with the classical two-operator ADMM methods. More precisely, [6, Figure 4 and Figure 5] give a concrete and realistic example within wireless communications showing numerically that two-operator ADMM-like algorithms solving a three-operator problem exhibit slow convergence compared to TOP-ADMM.
First, let us understand the shortcomings of the classical (linearized) ADMM and subsequently present TOP-ADMM to deal with these demerits. To this end, let us add a convex L-smooth (L > 0) function h to the cost function of (1) with some scaling β ∈ R_{>0}, such that the problem formulation becomes

  minimize_{x_1,…,x_M, z}  Σ_{m=1}^{M} f_m(x_m) + g(z) + β h(z)
  subject to  x_m = z,  m = 1, …, M.        (3)

There are at least two possibilities to tackle problem (3) using the classic two-operator ADMM: 1) product space or problem reformulation [38], [39], and 2) setting g(z) := g(z) + βh(z) or f_m(x_m) := f_m(x_m) + (β/M)h(x_m). The first approach is either not straightforward or the resulting algorithm converges slowly [26], [27]. The second approach yields a subproblem, i.e., the z^{(i+1)} or x_m^{(i+1)} update in (2), which may not have a computationally efficient solution in general. Furthermore, the classical two-operator ADMM does not exploit the smoothness of h. Recall that if h is a quadratic function, then clubbing it with either update (2a) or (2b) and employing linearized ADMM can solve the respective subproblem. However, this approach does not necessarily yield better performance than TOP-ADMM.
In essence, the TOP-ADMM algorithm can also be classified as a divide-and-conquer method, which decomposes a large optimization problem, difficult to solve in its composite form (3), into smaller subproblems that are easy to solve.
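To make the three-operator idea concrete, the following schematic iteration, written in the spirit of TOP-ADMM but not reproducing the paper's exact updates (4), handles the f_m by proximal steps, g = 0 by plain averaging, and the smooth term βh only through its gradient. The quadratic choices of f_m and h are our own, picked so that the solution Σ_m b_m/(M + β) is known in closed form.

```python
import numpy as np

# Schematic three-operator iteration: f_m enter through (closed-form)
# proximal steps, g = 0 through its prox (plain averaging), and the
# smooth term beta*h only through its gradient.  With
# f_m(x) = 0.5*||x - b_m||^2 and h(x) = 0.5*||x||^2, the minimizer of
# sum_m f_m + beta*h over the consensus variable is sum_m b_m / (M + beta).

def three_operator_sketch(b, beta=1.0, rho=1.0, iters=200):
    M, n = b.shape
    x = np.zeros((M, n)); z = np.zeros(n); y = np.zeros((M, n))
    for _ in range(iters):
        x = (b + rho * (z - y)) / (1.0 + rho)   # prox steps on f_m
        grad_h = z                              # grad h(z) for h = 0.5||z||^2
        # consensus update: averaging plus a gradient step on beta*h
        z = np.mean(x + y, axis=0) - (beta / (M * rho)) * grad_h
        y = y + (x - z)                         # scaled dual ascent
    return z

b = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
z_star = three_operator_sketch(b)
# approaches sum(b)/(M + beta) = [2.25, 3.0]
```

The point of the sketch is structural: βh never appears through a prox or an inner solve, only through one gradient evaluation per iteration, which is what makes exploiting a smooth third term cheap.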
In the following, we lay out the general definition of our TOP-ADMM method, which was introduced in [6]. For completeness, we present the convergence proof of TOP-ADMM, which is novel and was not given in [6].
Theorem 1 (TOP-ADMM). Consider a problem given in (3) with at least one solution and a suitable step size τ ∈ R_{≥0}. Assume subproblems (4a) and (4b) have solutions, and consider a relaxation/penalty parameter ρ ∈ R_{>0} and arbitrary initial values. Then the sequence generated by the TOP-ADMM iterations (4), at any limit point, converges to a Karush-Kuhn-Tucker (KKT) stationary point of (3).
Proof: See Appendix B. ■

Note that the classical consensus ADMM algorithm is a special case of our proposed TOP-ADMM algorithm when h = 0 in (3), or ∇h = 0 in (4). Although classical consensus ADMM can solve many problems in machine learning, it does not necessarily yield an implementation-friendly algorithm, particularly if the proximal operator of the L-smooth function h is inefficient to compute.

IV. FEDERATED LEARNING USING TOP-ADMM
FL can be used to solve the distributed consensus problem (3). Furthermore, we envision learning on the edge server, say with a loss function h in (3), together with some constraint or regularizer expressed by g in (3). Therefore, we pose the generic distributed problem (3) for FL using new variables as follows:

  minimize_{w, w_1,…,w_M}  Σ_{m=1}^{M} α_m f_m(w_m; D_m) + g(w) + β h(w; D)
  subject to  w_m = w,  m = 1, …, M,        (5)

where the global weight vector of the considered learning model is given by w ∈ R^n and the weight vector of user m corresponds to w_m ∈ R^n. Furthermore, the loss functions at user m and at the server are denoted by f_m and h, with training datasets D_m and D, respectively. The loss function f_m of user m is weighted by α_m ≥ 0 satisfying Σ_m α_m = 1, and the server's loss function h is weighted by some nonnegative β ≥ 0.
The edge server is expected to be collocated at the base station such that the server learns the global model using the data D generated or stored on the server side, together with the learning from the data generated/available on the users' side. Observe that in many existing federated learning frameworks, such as FedADMM [8], g = 0 and h = 0 in (5).
We are now ready to apply the TOP-ADMM algorithm to (5) for the FL purpose, with some modifications. Firstly, we swap the update order of TOP-ADMM, i.e., of (4a) and (4b). As suggested in [24, Chapter 5], the sequential updates of ADMM for z^{(i+1)} in (2b) and x_m^{(i+1)} in (2a) are performed in Gauss-Seidel fashion. Specifically, one can interchange these updates without penalizing the convergence guarantee, but the generated sequences may in general differ over the iterations. Since TOP-ADMM generalizes ADMM, we can also interchange the updates. Secondly, we add a proximal term scaled by the parameter ζ^{(i)} to the subproblem in (2b). Lastly, we employ a so-called Glowinski relaxation factor γ ∈ (0, 2) in the dual update, see, e.g., [28]. Hence, the TOP-ADMM algorithm tackling (5), summarized in (6), first updates the global weight vector w^{(i+1)} through a proximal step on g that combines the aggregated parameters with a gradient step on βh, then updates each local weight vector w_m^{(i+1)}, and finally performs the relaxed dual update, where the step size τ^{(i)} ∈ R_{≥0} and the proximity parameter ζ^{(i)} ∈ R_{≥0} are adapted over the iterations, i.e., more concretely, τ^{(i+1)} ≤ τ^{(i)} and ζ^{(i+1)} ≤ ζ^{(i)}.

Unfortunately, a direct application of the enhanced TOP-ADMM algorithm (6) in the FL context would incur high communication costs due to the exchange of parameters between the server and the selected clients in every global iteration. Hence, TOP-ADMM may negatively affect the communication efficiency. Additionally, at a given global iteration of the TOP-ADMM algorithm (6), each client is expected to send and receive the updates to and from the server synchronously. Unfortunately, not all of the involved clients can transmit and receive the updated parameters in FL due to the limited communication bandwidth and their limited computational capabilities. Considering the aforementioned limitations of the direct application of TOP-ADMM to FL, we extend the TOP-ADMM algorithm using the FL framework introduced in [8]. Therefore, we establish a novel algorithm, named FedTOP-ADMM, described in the next section and catering to the FL purpose.
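The two modifications named above, the Glowinski-relaxed dual step with γ ∈ (0, 2) and a non-increasing step-size schedule τ^{(i+1)} ≤ τ^{(i)}, can be sketched as follows; the function names and the geometric decay rule are our own illustrative choices, not prescribed by the paper.

```python
# Sketch of two FedTOP-ADMM-specific ingredients: the over-relaxed dual
# ascent on the consensus constraint w_m = w, and one simple schedule
# satisfying tau^(i+1) <= tau^(i).

def relaxed_dual_update(y_m, w_m_new, w_new, rho_m, gamma=1.5):
    # Glowinski relaxation: gamma in (0, 2); gamma = 1 recovers the
    # standard dual update of classical ADMM
    return y_m + gamma * rho_m * (w_m_new - w_new)

def tau_schedule(tau0, i, decay=0.99):
    # geometric decay; any non-increasing rule would qualify
    return tau0 * decay ** i

# e.g., relaxed_dual_update(0.0, 1.0, 0.5, rho_m=1.0) -> 0.75
```

Choosing γ > 1 over-relaxes the dual ascent, which is often reported to speed up ADMM-type methods in practice, while the decaying τ^{(i)} damps the server-side gradient step as the iterates settle.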
We would like to accentuate that it is unclear how to extend the existing FedADMM [8] framework to support learning on the server, i.e., as described by a loss function h in (5), while supporting a nonsmooth regularizer/function g for the distributed learning. Ignoring g for the moment in (5), one could argue to artificially add yet another parallel client on the server in the existing FedADMM framework, i.e., h ≡ f_{M+1}. However, the additional virtual client does not necessarily yield better convergence performance. Therefore, we evolve the FedADMM [8] framework using our TOP-ADMM algorithm. Subsequently, we show in the numerical Section V that FedTOP-ADMM renders superior performance over FedADMM with an additional virtual client, see, e.g., Figure 5. Recall that the framework proposed in [8] is based on the classical two-operator consensus ADMM, cf. (1) with g = 0. Therefore, the proposed FL using TOP-ADMM, i.e., FedTOP-ADMM, not only inherits all the properties of the classical two-operator ADMM, i.e., of FedADMM [8], but additionally exploits the L-smooth function on the server. In other words, FedADMM is a special case of FedTOP-ADMM.
In the sequel, we establish the FedTOP-ADMM algorithm that is built on the FedADMM [8] framework utilizing the TOP-ADMM algorithm.
Algorithm 1: FedTOP-ADMM (outline)
Server weight updates:
  Steps 3-6: if FedTOP-ADMM I, or FedTOP-ADMM II and i ∉ P, the server performs its weight update.
Uplink communications and global parameter updates:
  Steps 7-9: if i ∈ P (uplink communications with user selection), the server receives the updated weights from the selected users and updates v_m for the downlink communications;
  Steps 10-12: else, the server utilizes the previously received weights v_m.
  Step 13: global parameter aggregation/update.
Downlink communications and local parameter updates (in parallel):
  Step 14: the server multicasts the updated global parameters;
  Steps 15-21: for every m ∈ U^{(i)}: if i ∈ P, the selected users receive the updated weights; the users then perform their local weight updates (Step 19);
  Steps 22-25: for every m ∉ U^{(i)}, the previously received parameters are carried over.

A. FedTOP-ADMM: Communication-Efficient Algorithm
We present in Algorithm 1 our novel FedTOP-ADMM, which uses the TOP-ADMM updates (6) and the FedADMM [8] framework for communication-efficient FL. Specifically, we propose two variants of the FedTOP-ADMM algorithm, referred to as FedTOP-ADMM I and FedTOP-ADMM II. In FedTOP-ADMM I, we learn the model on the server side continuously in every global iteration. Conversely, in FedTOP-ADMM II we learn the considered global model only when the server is not communicating and aggregating the parameters of the model, i.e., the server learns in parallel to the selected users. Since the performance difference between variants I and II is hardly noticeable, we refer to both simply as FedTOP-ADMM; hence, FedTOP-ADMM corresponds to either FedTOP-ADMM I or FedTOP-ADMM II depending on the context.
As our proposed FedTOP-ADMM generalizes the FedADMM algorithm, it inherits all the properties of the FedADMM algorithm, including its communication efficiency when using only the data available on the edge devices. FedTOP-ADMM has three additional hyperparameters compared to FedADMM, namely τ^{(i)}, ζ^{(i)}, and γ. Meanwhile, note that FedADMM requires tuning of the {Q_m} and {ρ_m} parameters, see [8] and the discussion in Section V.
To establish the convergence of FedTOP-ADMM Algorithm 1, we extend the vanishing property of residual errors proved in Theorem 1.Using this result, we establish the global convergence of FedTOP-ADMM in the following theorem.
Proof: See Appendix C. ■

We describe the necessary processing steps of our proposed Algorithm 1 as follows. In Algorithm 1, the total number of users participating in the FL process is denoted by M. We specify the maximum number of global iterations as I. However, note that many heuristic early stopping techniques can potentially be employed on the server, e.g., stopping when reaching the required test accuracy, but they are not considered herein.
In Step-1 of Algorithm 1, we provide the required inputs to the algorithm with appropriate initialization of the vectors and (iterative) parameters, including the set U^{(0)} of users selected for communication with the server. We denote the communication events with the users by P := {0, J, 2J, …}, which represents periodic events occurring every J iterations. Therefore, J represents the number of local iterations on the users' side. Consequently, the number of communication rounds in Algorithm 1 is given by [8]

  Communication rounds := ⌊i/J⌋,

where ⌊·⌋ denotes flooring to the nearest integer.
The iterative FedTOP-ADMM algorithm starts at Step-2. Note that FedTOP-ADMM stops when the total number of global iterations I is exhausted, or when an early stopping criterion is met within the loop, e.g., a test accuracy criterion, as the server is expected to have some test dataset.
If FedTOP-ADMM I is employed, then at Step-3 and Step-4, for any global iteration i, the server performs some intermediate processing reminiscent of a gradient-descent step. Conversely, if FedTOP-ADMM II is employed, then when there is no communication event between the server and any users for a given iteration i, i.e., i ∉ P, the server performs the same processing in Step-3 and Step-4 as FedTOP-ADMM I. This intermediate processing at the server, i.e., Step-4, is part of the global weight vector w update, cf. (8) and Step-4 of Algorithm 1. More specifically, assuming g is a closed, convex, and proper function (possibly nonsmooth), the solution corresponding to the subproblem (6a), i.e., the w update in (8), is given by the proximal operator of g (see Definition 3 of Appendix A) evaluated at an aggregate point u. In the FedTOP-ADMM algorithm, instead of directly using u in (8), the server utilizes v_m, which is updated as described from Step-7 to Step-12 of Algorithm 1. When communication events i ∈ P occur, the server receives the parameter vector u_m^{(i)} from each selected user m ∈ U^{(i)} and consequently updates v_m^{(i)}. Note the difference between the update in (8) and Step-13 of Algorithm 1, which aggregates the weights of the users and the server appropriately. Consequently, the global updated weight vector w^{(i+1)} is generated at Step-13.
The second subproblem of TOP-ADMM (6b), corresponding to the primal update of the weight vector w_m^{(i+1)} for each user m, is equivalent to the corresponding subproblem in the classical ADMM or FedADMM [8], [31]. We use the inexact solution to this subproblem, i.e., a linear approximation of the function f_m, proposed in [8], [31], whose recipe is given in Step-19 of Algorithm 1. We refer the interested readers to [8], [31] for the detailed analysis of the inexact solution.
Step-19 is repeated J times before the server receives the updated parameter u_m^{(i+1)} from each selected user m ∈ U^{(i)}. In Step-23, if the user m was not selected for communication with the server, i.e., m ∉ U^{(i)}, the server essentially utilizes the previously received parameter from the non-selected user m.

C. Connections with Existing Works on FL
The other benchmarking algorithms besides FedADMM [8] are FedProx [3] and FedAvg [2], which can easily coexist with the Algorithm 1 framework, as highlighted in [8]. Specifically, in Step-19 of Algorithm 1, FedTOP-ADMM becomes FedProx or FedAvg by setting the following TOP-ADMM parameters to zero, i.e., τ^{(i)} = 0, ζ^{(i)} = 0, and γ = 0, and replacing the weight update for each user with a gradient-type step, where η is the step size and µ is the scaling parameter of the proximal term, with µ = 0 for FedAvg. Moreover, note that FedADMM-VC denotes FedADMM with a virtual client collocated at the edge server.

V. NUMERICAL RESULTS
In this section, we conduct experiments using distributed (sparse) logistic regression to benchmark the performance of the existing algorithms against our proposed FedTOP-ADMM algorithms:

  minimize_{w}  Σ_{m=1}^{M} Σ_{j=1}^{d_m} [ log(1 + exp(A_m[:, j]^T w)) − t_m[j] A_m[:, j]^T w ] + (κ/2) ‖w‖_2^2   (9a)
        + υ ‖w‖_1,   (9b)

where the training dataset of user m is D_m := {t_m, A_m}, i.e., the binary output is t_m ∈ {0, 1}^{d_m}, the input feature matrix is A_m ∈ R^{n×d_m}, and the regression weight vector is w ≡ w_m ∈ R^n. The scaling factor of the regularizer is κ = 0.001 in the experiments, unless otherwise mentioned. In the case of the non-sparse logistic regression problem, one can ignore the ℓ1-norm regularizer (9b) by setting υ = 0, where υ ∈ R_{≥0}. Notice that the non-sparse problem is also used in [8].
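For reference, a numerically stable evaluation of a per-user logistic loss of the kind appearing in (9), for labels t ∈ {0, 1}, together with its gradient, can be sketched as below. This is an illustrative NumPy sketch, not the authors' code; the ℓ2 term scaled by kappa is included as an optional regularizer.

```python
import numpy as np

# Logistic loss for labels t in {0,1} with feature matrix A (n x d):
# sum_j log(1 + exp(a_j^T w)) - t_j a_j^T w, plus 0.5*kappa*||w||^2.

def logistic_loss(w, A, t, kappa=0.001):
    s = A.T @ w                                  # per-example scores
    # log(1 + exp(s)) - t*s, computed stably via logaddexp
    loss = np.sum(np.logaddexp(0.0, s) - t * s)
    return loss + 0.5 * kappa * np.dot(w, w)

def logistic_grad(w, A, t, kappa=0.001):
    s = A.T @ w
    p = 1.0 / (1.0 + np.exp(-s))                 # sigmoid(s)
    return A @ (p - t) + kappa * w
```

Using `np.logaddexp` avoids overflow in `exp(s)` for large scores, which matters once features are scaled aggressively as in the experiments below.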
For all the considered methods, including our proposed method, the loss function at each user m (or at the server, in the case of FedTOP-ADMM) is the logistic loss in (9) evaluated on the corresponding local dataset.

A. Experimental Settings
In this section, we present numerical results to illustrate the performance of our proposed FedTOP-ADMM I and FedTOP-ADMM II algorithms compared to the state-of-the-art algorithms FedADMM [8], FedProx [3], and FedAvg [2].
Before we proceed further with a more realistic dataset and the comparison of our proposed FedTOP-ADMM with the abovementioned benchmarking methods, we want to highlight the strength of our proposed FedTOP-ADMM in contrast to these existing algorithms. In other words, we show the potency of the three-operator structure of FedTOP-ADMM in contrast to the existing one-operator methods, i.e., corresponding to a sum of separable f_m functions: FedAvg, FedProx, and FedADMM. More concretely, we show how easily our proposed three-operator FedTOP-ADMM can exploit the non-differentiable regularizer (9b), which can be handled by setting the function g as a scaled ℓ1-norm, whose proximal operator is well known in the literature, see, e.g., [25], [28]. Although it is unclear how to employ the ℓ1-norm regularizer with the existing one-operator methods FedAvg and FedProx because of the non-differentiability of the ℓ1-norm, we can modify FedADMM, which is built on the classical two-operator consensus ADMM (1), to incorporate the second operator corresponding to the function g. Thus, in the modified FedADMM, after the global aggregation one can include the second operator, i.e., the proximal operator corresponding to the function g, without the third operator corresponding to the gradient of h. More precisely, Step-13 of Algorithm 1 for the modified FedADMM without the third operator (corresponding to the gradient of the function h) can be expressed as the proximal operator of g applied to the aggregated weights, with γ = 1. The numerical convergence behaviour of FedTOP-ADMM and the modified FedADMM, in terms of objective, primal residual, and dual residual against iterations, is depicted in Figure 2.
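The proximal operator of the scaled ℓ1-norm invoked above is the elementwise soft-thresholding map; a minimal sketch:

```python
import numpy as np

# prox_{thresh*||.||_1}(v) = sign(v) * max(|v| - thresh, 0), applied
# elementwise; this is the closed-form prox used when g is a scaled l1-norm.

def prox_l1(v, thresh):
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

v = np.array([1.5, -0.2, 0.7, -2.0])
prox_l1(v, 0.5)   # -> [1.0, 0.0, 0.2, -1.5]
```

Entries with magnitude below the threshold are set exactly to zero, which is what produces sparse weight vectors in the sparse logistic regression experiments.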
In this simple but illustrative example, the random test setup is similar to [21, Section 11.2]. Additionally, in this toy example, we have generated a synthetic sparse training dataset with a total of 20000 examples and a feature vector length n = 100. We have employed M = 100 distributed users, each with d_m = 200 training examples, where all the users are active and J = 1. Clearly, FedTOP-ADMM shows faster convergence than FedADMM, while delivering a similar or better training error (0.82%) than that of FedADMM (0.85%).
In our next set of experiments, we have ignored the sparsity parameter in the logistic regression so that we can compare FedTOP-ADMM not only with (modified) FedADMM but also with FedAvg and FedProx. Additionally, in all the subsequent simulations, the total number of users is fixed to M = 200. However, 10 users are selected uniformly at random during each communication event of the global iteration, i.e., i ∈ P.
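The per-round sampling described above can be sketched as follows (a minimal illustration; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
M, num_active = 200, 10  # total users, participants per communication event

# Draw the active participant set uniformly at random, without replacement,
# independently at each communication event of the global iteration.
participants = rng.choice(M, size=num_active, replace=False)
```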
We have conducted experiments using one of the most popular real-world datasets, namely MNIST [41]. Specifically, we have scaled/normalized the MNIST input data. There are many ways to scale the input data matrix A ∈ R^{n×d}, where d = Σ_{m=1}^M d_m and n represent the total number of data samples (for training and testing) and the feature-vector length, respectively. We have analyzed two scaling approaches: 1) a := mean(A, 2) ⊘ std(A, [], 2) ∈ R^{n×1}, and 2) a := mean(A, 2) ⊘ var(A, [], 2) ∈ R^{n×1}, where ⊘ denotes elementwise division, and mean, std, and var represent the mean, standard deviation, and variance along the second dimension of the matrix A, as in the corresponding MATLAB expressions. Finally, the scaled version of the MNIST dataset is expressed as A ← A − 1_{1×d} ⊗ a, where 1_{1×d} denotes a row vector of all ones with dimension 1 × d, and ⊗ represents the Kronecker product. Additionally, we use other popular datasets, such as CIFAR-10 and CIFAR-100 [42].
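The scaling above can be sketched in NumPy as follows (rows of A are features, columns are samples; the zero-variance guard is our addition, since constant features such as MNIST border pixels would otherwise cause a division by zero):

```python
import numpy as np

def scale_dataset(A, mode="var"):
    """Subtract  a := mean(A,2) ./ std(A,[],2)   (mode='std'), or
                 a := mean(A,2) ./ var(A,[],2)   (mode='var'),
    from every column of A, mirroring  A <- A - 1_{1xd} (kron) a."""
    mu = A.mean(axis=1, keepdims=True)                   # mean along dim 2
    s = (A.std(axis=1, keepdims=True) if mode == "std"
         else A.var(axis=1, keepdims=True))              # std or var along dim 2
    s = np.where(s == 0.0, 1.0, s)   # guard (our addition) for constant features
    a = mu / s                       # elementwise division
    return A - a                     # broadcasting replaces the Kronecker product
```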
We evaluate the performance using two data partitionings: 1) independent and identically distributed (i.i.d.), where the data are randomly shuffled and the corresponding labels are shuffled accordingly, and 2) non-i.i.d., where the training labels are sorted in ascending order and the corresponding input data are ordered accordingly. Thus, in the case of the MNIST dataset, this non-i.i.d. data split is one of the pathological cases, because each user or base station would have at most two class labels.
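The two partitionings can be sketched as follows (a minimal illustration with our own function name; real dataset loading is omitted):

```python
import numpy as np

def partition(labels, num_users, iid=True, seed=0):
    """i.i.d.: shuffle example indices before splitting among users.
    non-i.i.d. (pathological): sort indices by label, so each user ends
    up holding only one or two class labels."""
    rng = np.random.default_rng(seed)
    if iid:
        idx = rng.permutation(len(labels))
    else:
        idx = np.argsort(labels, kind="stable")
    return np.array_split(idx, num_users)
```

With balanced classes and enough users, the non-i.i.d. split leaves every user with at most two distinct labels, matching the pathological case described above.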
In Figure 3, we illustrate examples of MNIST digits without scaling (see Figure 3a) and with the two aforementioned scaling approaches (see Figs. 3b-3c). Based on our exhaustive experiments, we have found the second approach in Figure 3c more challenging to learn than the first scaling approach in Figure 3b and the unscaled original version in Figure 3a. Consequently, we have employed the second scaling approach for our further numerical analysis. The data distribution among the users and the server is i.i.d. unless otherwise stated. Moreover, we have considered a binary classifier by simply employing the digit 1 as the true label and the other digits as false labels.
Figure 5 compares the performance, in terms of both the loss function or objective (10) and the test accuracy, among FedTOP-ADMM I, FedTOP-ADMM II, FedADMM, and FedADMM-VC with the aforementioned chosen parameters. Additionally, Figure 6 compares the performance for J ∈ {1, 5, 10}. Noticeably, these results substantiate our argument that exploiting the data knowledge on the edge server using our proposed FedTOP-ADMM schemes outperforms FedADMM-VC, i.e., FedADMM with a virtual client. Moreover, these results indicate that FedTOP-ADMM II has a negligible performance loss compared to FedTOP-ADMM I when J > 1. For instance, with J = 10, FedTOP-ADMM has a gain of up to 33% in communication efficiency with respect to FedADMM to reach a test accuracy of 98%. Furthermore, as mentioned before, FedTOP-ADMM II boils down to FedADMM when J = 1.
In Figure 7, we compare the performance of FedTOP-ADMM with FedADMM, FedProx, and FedAvg under a non-i.i.d. distribution of the MNIST dataset for J = 10. Recall that this non-i.i.d. data split is one of the pathological cases, because each user or base station would have at most two class labels. Nevertheless, FedProx performs slightly better than FedAvg. However, both FedTOP-ADMM and FedADMM outperform FedProx and FedAvg. Further, FedTOP-ADMM has a gain of up to 27% in communication efficiency with respect to FedADMM to reach a test accuracy of 97% under the non-i.i.d. distribution of the MNIST dataset. Observe that we have changed the tunable parameters of all the methods compared to the previous MNIST results. In particular, we have chosen the following parameters for the respective methods: a) FedTOP-ADMM I/II (mean({ρ_m}) = 6.5731e-6, τ^(0) = 1e-7, ζ^(0) = 1.5); b) FedADMM(-VC) (mean({ρ_m}) = 6.5731e-6); c) FedProx (η = 1e-3, µ = 0.5); and d) FedAvg (η = 0.5e-3). For completeness, in Figure 8 we compare the performance under non-i.i.d. and i.i.d. data with the same tunable parameters used for Figure 7, where the performance under i.i.d. data is, unsurprisingly, slightly better than under non-i.i.d. data.
Lastly, in Figure 9 and Figure 10, we present the performance for the CIFAR-100 and CIFAR-10 datasets, respectively. We observe a performance trend similar to that on MNIST.

VI. CONCLUSIONS
In this paper, we proposed FedTOP-ADMM, a novel algorithmic framework for communication-efficient FL that utilizes our recently proposed consensus TOP-ADMM algorithm, which can tackle the sum of three composite functions in a distributed manner. Specifically, we developed two variants, FedTOP-ADMM I and FedTOP-ADMM II, that learn a global machine learning model using data on both the edge server and the users. Our experiments showed that FedTOP-ADMM has a significant gain of up to 33% in communication efficiency with respect to FedADMM to reach a desired test accuracy of 98% using the proposed scaling of the MNIST dataset. For future work, we intend to establish the convergence analysis of FedTOP-ADMM for J > 1 and an enhanced TOP-ADMM. Moreover, we intend to investigate the scheduling of edge devices participating in FL using FedTOP-ADMM, as well as the power allocation of the selected devices.

APPENDIX A SOME USEFUL LEMMAS AND DEFINITIONS
We present herein some useful definitions, propositions, and lemmas that are important for ADMM methods. Definition 1 (L-smooth function [5], [25]). A differentiable function f : C^n → R is L-smooth if its gradient is L-Lipschitz continuous, i.e., ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y. Definition 3 (Proximal mapping [22], [25]). Let us consider a proper closed convex function f : dom f → (−∞, +∞], where dom f corresponds to the domain of the function f. Then, the proximal mapping of f is the operator given by prox_{λf}(z) := argmin_{x ∈ dom f} { f(x) + (1/(βλ)) ‖x − z‖² }, where λ > 0 and ∂f denotes a subdifferential of f [5], [43]. If z is complex-valued or real-valued, β = 1 or β = 2, respectively. Note that the proximal operator of an indicator function becomes an orthogonal projection, i.e., prox_{λδ_C}(z) = proj_C(z).
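The last remark of Definition 3 can be checked numerically: the prox of the indicator δ_C of a box C ignores λ and reduces to clipping. A minimal real-valued sketch (β = 2; the helper names are ours):

```python
import numpy as np

def proj_box(z, lo, hi):
    """Orthogonal projection onto C = [lo, hi]^n, which equals
    prox_{lambda * delta_C}(z) for any lambda > 0."""
    return np.clip(z, lo, hi)

def prox_objective(x, z, lam):
    """The minimand in the prox definition for f = delta_C (f(x) = 0 on C):
    (1/(2*lam)) * ||x - z||^2."""
    return np.sum((x - z) ** 2) / (2.0 * lam)
```

Any feasible point attains an objective at least as large as the projection, for every λ > 0, which is why λ drops out of the indicator-function case.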

APPENDIX B CONVERGENCE ANALYSIS OF TOP-ADMM
To establish the convergence of the TOP-ADMM algorithm (4), we first present two standard assumptions from ADMM proofs in the literature, followed by five lemmas. These assumptions and lemmas are then required to prove Proposition 1, which guarantees that the primal and dual residual errors vanish asymptotically. Finally, we establish the global convergence of TOP-ADMM, i.e., the proof of Theorem 1. Although the proof structure is inspired by the convergence results for the classic ADMM [21], our convergence analysis is new from the perspective of the sum of three functions with consensus constraints, i.e., for TOP-ADMM.
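To make the objects in the analysis concrete, the following toy sketch runs a three-operator consensus iteration of the form (4) on a small lasso-plus-ridge problem: exact x_m-updates on the users (4a), a z-update that combines a gradient step on a smooth h with the prox of g (4b), and dual updates (4c). The exact TOP-ADMM step sizes and scalings are not reproduced here; this is a generic linearized variant for illustration only.

```python
import numpy as np

# Toy problem: min_w  sum_m 0.5*||A_m w - b_m||^2 + lam*||w||_1 + (mu/2)*||w||^2
# (f_m on the users, g = lam*||.||_1 and h = (mu/2)*||.||^2 on the server).
rng = np.random.default_rng(0)
M, n, d_m = 3, 8, 20                      # users, features, samples per user
A = [rng.standard_normal((d_m, n)) for _ in range(M)]
w_true = np.zeros(n); w_true[:3] = [1.0, -2.0, 0.5]
b = [A_m @ w_true + 0.01 * rng.standard_normal(d_m) for A_m in A]
lam, mu, rho = 0.1, 0.05, 1.0             # g weight, h weight, ADMM penalty

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
solve = [np.linalg.inv(A_m.T @ A_m + rho * np.eye(n)) for A_m in A]

z = np.zeros(n)
u = [np.zeros(n) for _ in range(M)]       # scaled duals y_m / rho
for i in range(2000):
    # (4a): local x_m-updates, in parallel on the users (closed form here)
    x = [S @ (A_m.T @ b_m + rho * (z - u_m))
         for S, A_m, b_m, u_m in zip(solve, A, b, u)]
    # (4b): server z-update -- prox of g after a gradient step on h
    v = np.mean([x_m + u_m for x_m, u_m in zip(x, u)], axis=0)
    grad_h = mu * z
    z = soft(v - grad_h / (M * rho), lam / (M * rho))
    # (4c): dual updates
    u = [u_m + x_m - z for u_m, x_m in zip(u, x)]

primal_res = max(np.linalg.norm(x_m - z) for x_m in x)
```

At a fixed point, summing the local optimality conditions with the server step recovers 0 ∈ Σ_m A_mᵀ(A_m z − b_m) + μz + λ∂‖z‖₁, i.e., the optimality condition of the centralized problem.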
Towards the convergence analysis goal, we define the augmented Lagrangian of problem (3) as L_ρ({x_m}, z, {y_m}) = Σ_{m=1}^M f_m(x_m) + g(z) + h(z) + Σ_{m=1}^M Re⟨y_m, x_m − z⟩ + (ρ/2) Σ_{m=1}^M ‖x_m − z‖², and, for brevity, we define the objective value at iteration i as p^(i) := Σ_{m=1}^M f_m(x_m^(i)) + g(z^(i)) + h(z^(i)). Then, let us consider two assumptions that are standard in the ADMM literature [21], [24].
Assumption 1. Let ({x_m^⋆}_{m=1}^M, z^⋆, {y_m^⋆}_{m=1}^M) be a saddle point of the (unaugmented) Lagrangian L_0 in (12). Specifically, the following holds for all {x_m}_{m=1}^M, z, {y_m}_{m=1}^M: L_0({x_m^⋆}, z^⋆, {y_m}) ≤ L_0({x_m^⋆}, z^⋆, {y_m^⋆}) ≤ L_0({x_m}, z, {y_m^⋆}). Assumption 2. Consider subproblems (4a) and (4b). We assume that each subproblem has at least one solution.
Note that Assumption 2 does not require the uniqueness of the solution.
Lemma 1. The difference between the optimal objective p^⋆ and the objective at iteration i+1 satisfies p^⋆ − p^(i+1) ≤ Σ_{m=1}^M Re⟨y_m^⋆, ∆r_m^(i+1)⟩, where ∆r_m^(i+1) := x_m^(i+1) − z^(i+1) denotes the primal residual. Proof: Using the primal feasibility x_m^⋆ − z^⋆ = 0 ∀m = 1, ..., M, and Assumption 1 (or duality theory [43]), we can write p^⋆ = L_0({x_m^⋆}, z^⋆, {y_m^⋆}) ≤ L_0({x_m^(i+1)}, z^(i+1), {y_m^⋆}) = p^(i+1) + Σ_{m=1}^M Re⟨y_m^⋆, ∆r_m^(i+1)⟩, which can be rearranged into the claimed inequality. ■ The following lemma will be useful in Lemma 3.
Lemma 2 (Three-point inequality). Let the convex and differentiable function h : C^n → R have an L-Lipschitz continuous gradient (for L ≥ 0). Then, for all z, z^(i), z^(i+1) ∈ C^n, the following inequality holds: h(z^(i+1)) ≤ h(z) + Re⟨∇h(z^(i)), z^(i+1) − z⟩ + (L/2)‖z^(i+1) − z^(i)‖². Proof: It follows by applying the descent lemma in [5] at z^(i), and the convexity of h. ■ In the subsequent lemma, we will use the dual residual error, defined in terms of ∆x_m. Observe that we will use Lemma 1 and the following Lemma 3 in Lemma 4.
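The three-point inequality of Lemma 2 holds for any triple of points, since it is just the descent lemma at one point combined with convexity at another. A quick numerical sanity check on a convex quadratic h (real-valued for simplicity):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
B = rng.standard_normal((n, n))
Q = B @ B.T                                  # h(z) = 0.5 z^T Q z is convex
L = np.linalg.eigvalsh(Q).max()              # Lipschitz constant of grad h

h = lambda z: 0.5 * z @ Q @ z
grad_h = lambda z: Q @ z

# Check: h(z_next) <= h(z) + <grad_h(z_prev), z_next - z> + (L/2)||z_next - z_prev||^2
violations = 0
for _ in range(200):
    z, z_prev, z_next = (rng.standard_normal(n) for _ in range(3))
    lhs = h(z_next)
    rhs = (h(z) + grad_h(z_prev) @ (z_next - z)
           + 0.5 * L * np.linalg.norm(z_next - z_prev) ** 2)
    violations += lhs > rhs + 1e-9
```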
Lemma 3. The difference between the achieved objective at iteration i+1, i.e., p^(i+1), and the optimal objective p^⋆ is bounded by terms involving the primal residuals and the successive difference z^(i+1) − z^(i). Proof: We know that x_m^(i+1) minimizes the update defined in (4a); writing the corresponding optimality condition and substituting the dual update (4c) yields the claimed bound, where the notation ∂ denotes the subdifferential [5], [43], and ⇐⇒ denotes "if and only if" as well as equivalence of conditions.

■
We define the following function for the subsequent lemma. Definition 4. Let a Lyapunov candidate function for the TOP-ADMM algorithm at a given iteration i be defined as in (22). Lemma 4. The difference between the Lyapunov function (22) at iteration i+1 and at the previous iteration i fulfils the inequality (23). Proof: We add the inequalities of Lemma 1 and Lemma 3 and rearrange the terms to obtain (24). Next, we replace part (a) of (24) with (25), so that (24) becomes (26). Thus, (27) can be expressed accordingly. We then substitute the sum of parts (b)-(d) of (26) with (28), so that (26) becomes (29). Using the Lyapunov definition (22) in (29), the inequality can be rearranged as (30). We will now bound the last component in (30). Utilizing the result in (20), recalling that z^(i+1) minimizes (4b), subsequently applying the three-point inequality of Lemma 2, and replacing z^⋆ with z^(i), the inequality (31) holds at iteration i+1. Similarly, recalling that z^(i) minimizes (4b), applying Lemma 2, and replacing z^⋆ with z^(i) at the i-th iteration with y_m^(i) at hand, the inequality (32) holds. Now, we add (31) and (32) and rearrange to obtain (33). We use (33) in (30), which finally yields (23). ■ In the subsequent lemma, we show that the Lyapunov function is non-increasing for τ ≥ 0.
Lemma 5. Consider the Lyapunov function V^(i) from Definition 4. Then, V^(i) is non-increasing over the iterations when τ ≥ 0.
Proof: The proof follows by considering τ ≥ 0 on the right-hand side of (23) (with ρ > 0). ■ We are now prepared to present and prove Proposition 1, which establishes the convergence to zero of the primal residual error, the objective residual error, and the dual residual error.

Proposition 1. The TOP-ADMM iterative scheme in (4) ensures that the primal residual error, the objective residual error, and the dual residual error converge to zero asymptotically. To show the convergence of the proposed FedTOP-ADMM, we show that the server and client processing satisfy the above-mentioned optimality conditions asymptotically, i.e., when i → ∞.

For the processing at the user side corresponding to (6b), each w_m^(i+1) minimizes its local subproblem. Now, using the dual-variable update (6c), (48) follows, which satisfies the stationarity condition (43) for a sufficiently large iteration number, i.e., as i → ∞.

Fig. 1: Illustration of the FL architecture, with the new scenario investigated in this paper of a dataset available on the edge server.

Fig. 3: Examples of MNIST handwritten digits without scaling and with two different scaling approaches.
More specifically, the inexact version of FedADMM sets Q_m = r_m I with r_m = eig_max(A_m^T A_m)/(4+κ), which utilizes the maximum eigenvalue of the Gram matrix A_m^T A_m of the input data, such that the hyperparameter ρ_m(a) = [a log(M d_m) α_m r_m]/log(2+J) with α_m = 1/(M d_m) and a user-defined parameter a. We have considered the following mean values of the hyperparameters: mean({ρ_m}) ∈ {3.4035e-1, 3.4035e-2, …}.

TABLE I: Distributed optimization problem formulations, each subject to the consensus constraint w_m = w ∀m.