GTAdam: Gradient Tracking with Adaptive Momentum for Distributed Online Optimization

This paper deals with a network of computing agents aiming to solve an online optimization problem in a distributed fashion, i.e., by means of local computation and communication, without any central coordinator. We propose the gradient tracking with adaptive momentum estimation (GTAdam) distributed algorithm, which combines a gradient tracking mechanism with first- and second-order momentum estimates of the gradient. The algorithm is analyzed in the online setting for strongly convex cost functions with Lipschitz continuous gradients. We provide an upper bound for the dynamic regret given by a term related to the initial conditions and another term related to the temporal variations of the objective functions. Moreover, a linear convergence rate is guaranteed in the static setup. The algorithm is tested on a time-varying classification problem, on a (moving) target localization problem, and in a stochastic optimization setup from image classification. In these numerical experiments from multi-agent learning, GTAdam outperforms state-of-the-art distributed optimization methods.


I. INTRODUCTION
In this paper, we deal with online optimization problems over networks and propose a new distributed algorithm. In this framework, interconnected computing agents have only a partial knowledge of the problem to solve, but can exchange information with neighbors according to a given communication graph and without any central unit. In particular, we consider networks represented by a weighted graph G = (V, E, W), where V = {1, . . . , N} is the set of agents, E ⊆ V × V is the set of edges (or communication links), and W ∈ R^{N×N} is the (weighted) adjacency matrix of the graph. The matrix W is compliant with the topology described by E, i.e., denoting by w_ij the (i, j)-entry of W, it holds w_ij > 0 if (i, j) ∈ E and w_ij = 0 otherwise. We denote by N_i = {j ∈ V | (j, i) ∈ E} the set of (in-)neighbors of agent i.
The aim of the network is to cooperatively solve the online optimization problem

min_{x ∈ R^n} Σ_{i=1}^N f_i^t(x),   t = 1, 2, . . . ,   (1)

where each f_i^t : R^n → R is a local function revealed only to agent i at time t. In the following, we let f^t(x) ≜ Σ_{i=1}^N f_i^t(x). This setting arises in several estimation and learning applications over networks, such as distributed data classification and localization in smart sensor networks; see the recent survey [1] for an overview.
In this paper, we address the distributed solution of the online optimization problem (1) in terms of dynamic regret (see, e.g., [1]). In particular, let x_i^t be the solution estimate of the problem at time t maintained by agent i, and let x_⋆^t be a minimizer of Σ_{i=1}^N f_i^t. Then, the agents want to minimize the dynamic regret defined as

R_T ≜ Σ_{t=1}^T ( f^t(x̄^t) − f^t(x_⋆^t) ),   (2)

for a finite value T > 1, with x̄^t ≜ (1/N) Σ_{i=1}^N x_i^t. Another possible performance metric is the so-called static regret (see, e.g., [1]). The dynamic regret (2) is known to be more challenging than the static one [1] and, for this reason, consistently with the majority of the recent papers in the literature, this work focuses on the dynamic regret (2). As is customary in the distributed setting, we also complement these measures with the consensus metric Σ_{i=1}^N ∥x_i^T − x̄^T∥², quantifying how far from consensus the local decisions are.
Related work: The proposed distributed algorithm combines a gradient tracking mechanism with an adaptive estimation of first- and second-order momenta.
We organize the literature review in three main parts: distributed algorithms for online optimization, gradient tracking distributed schemes (mainly suited for static optimization), and centralized methods for online and stochastic optimization based on adaptive momentum estimation.
Online optimization problems, characterized by time-varying cost functions, have been originally addressed in the centralized framework, see, e.g., [2], [3] and references therein, but have recently received significant attention also in the distributed optimization literature. In [4] an online optimization algorithm based on a distributed subgradient scheme is proposed. In [5] an adaptive diffusion algorithm is proposed to address changes regarding both the cost function and the constraints characterizing the problem. A class of coordination algorithms that generalize distributed online subgradient descent and saddle-point dynamics is proposed in [6] for network scenarios modeled by jointly-connected graphs. An algorithm consisting of a subgradient flow combined with a push-sum consensus is studied in [7] for time-varying directed graphs. Cost uncertainties and switching communication topologies are addressed in [8] by using a distributed algorithm based on dual subgradient averaging. A distributed version of the mirror descent algorithm is proposed in [9] to address online optimization problems. In [10] an online algorithm based on the alternating direction method of multipliers is proposed, and in [11] time-varying inequality constraints are also considered. Online optimization is strictly related to stochastic optimization. Regarding distributed algorithms for stochastic optimization, in [12] the authors investigate the convergence properties of a distributed algorithm dealing with subgradients affected by stochastic errors. In [13] a block-wise method is proposed to deal with high-dimensional stochastic problems, while in [14] a distributed gradient tracking method is analyzed in a stochastic set-up.
The gradient tracking scheme, which we extend in the present paper, has been proposed in several variants in recent years and studied under different problem assumptions [15]- [22]. This algorithm leverages a "signal tracking action" based on the dynamic average consensus (see [23], [24]) in order to let the agents obtain a local estimate of the gradient of the whole cost function. Recently, in [25] the gradient tracking algorithm has been applied to online optimization problems. Finally, in [26] a dynamic gradient tracking update is combined with a recursive least squares scheme to address in a distributed way the (centralized) personalized optimization framework introduced in [27].
The other algorithm inspiring our work is Adam, a centralized method originally proposed in [28]. Adam is an optimization algorithm based on adaptive estimates of first- and second-order gradient momenta that has been successfully employed in many online and stochastic optimization frameworks. Additional insights about Adam are given in [29]-[31], where some frameworks in which the algorithm is not able to reach the optimal solution are also shown. This limitation is addressed in [32], where an effective extension of Adam, namely AdaShift, is proposed. In [33], the authors proposed an enhanced version of the distributed gradient method with adaptive estimates of first- and second-order gradient momenta.
Contribution: The main contribution of this paper is the design of a new distributed algorithm to solve online optimization problems for multi-agent learning over networks. This novel scheme builds on the recently proposed gradient tracking distributed algorithm. Specifically, in the gradient tracking the agents update their local solution estimates using a consensus averaging scheme perturbed with a local variable representing a descent direction. This variable is concurrently updated using a dynamic consensus scheme aiming at reconstructing the total cost function gradient in a distributed way. Inspired by the centralized Adam algorithm, we accelerate the basic gradient tracking scheme by enhancing the descent direction through first- and second-order momenta of the cost function gradient. The use of momenta has proved very effective in enabling the centralized Adam to solve online optimization problems at a fast rate. Therefore, we design our novel gradient tracking with adaptive momentum estimation (GTAdam) distributed algorithm to solve online optimization problems over networks. The algorithm relies on local estimators for the two momenta, in which the total gradient is replaced by a (local) gradient tracker. Although the intuition behind the construction of GTAdam is clear, i.e., mimicking the centralized Adam in a distributed setting by using a gradient tracking scheme, its analysis presents several additional challenges with respect to both the gradient tracking and Adam. Indeed, since the descent direction is a nonlinear combination of the local states, which are updated through consensus averaging, the proof approach of the gradient tracking needs to be carefully reworked. We provide an upper bound on the dynamic regret for strongly convex online optimization problems.
This bound consists of a constant term, related to the initial conditions of the algorithm, and another term depending on the temporal variations of both the optimal solution of the problem and the gradients of the objective functions. Thus, if the latter variations are sublinear with respect to time, then our bound on the dynamic regret is sublinear too. A similar result is also guaranteed for an agent-specific dynamic regret. Moreover, we show that in the static case our algorithm reaches the optimal solution with a linear rate. Finally, we perform extensive numerical simulations on three application scenarios from distributed machine learning: a classification problem via logistic regression, a source localization problem in smart sensor networks, and an image classification task. We show that GTAdam outperforms in all cases the current state-of-the-art algorithms in terms of convergence rate.
Organization and Notation: The paper is organized as follows. In Section II we recall the two algorithms that inspired the novel distributed algorithm proposed in this paper. In Section III GTAdam is presented with its convergence properties which are proved in Section IV. Finally, Section V shows numerical examples highlighting the advantages of GTAdam.
The vertical concatenation of the vectors v 1 and v 2 is col(v 1 , v 2 ). We use diag(v) to denote the diagonal matrix with diagonal elements given by the components of v. The Hadamard product is denoted with ⊙, while the Kronecker product with ⊗. The identity matrix in R m×m is I m , while 0 m is the zero matrix in R m×m . The column vector of N ones is denoted by 1 N and we define 1 ≜ 1 N ⊗ I n . The spectral radius of a square matrix M is denoted as ρ(M ).

II. INSPIRING ALGORITHMS
In this section we briefly recall two existing algorithms that represent the building blocks for GTAdam.

A. Adam centralized algorithm
Adam [28] is an optimization algorithm that solves problems in the form (1) in a centralized computation framework. It is an iterative gradient-like procedure in which, at each iteration t, a solution estimate x^t is updated by means of a descent direction which is enhanced by a proper use of the gradient history, i.e., through estimates of its first- and second-order momenta. Specifically, the (time-varying) gradient g^t = ∇f^t(x^t) of the function drives two exponential moving average estimators. The two estimates, denoted by m^t and v^t, represent, respectively, the mean and variance (1st and 2nd momenta) of the gradient sequence and are nonlinearly combined to build the descent direction. A pseudo-code of the Adam algorithm is reported in Algorithm 1, in which α > 0 is the step-size, the constant 0 < ϵ ≪ 1 is introduced to guarantee numerical robustness of the scheme, while the hyper-parameters β_1, β_2 ∈ (0, 1) control the exponential-decay rate of the moving average dynamics.
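As a concrete illustration, one Adam iteration can be sketched in a few lines (a minimal sketch, not the authors' implementation; the quadratic test function and the step-size in the usage snippet are illustrative assumptions):

```python
import numpy as np

def adam_step(x, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam iteration (cf. Algorithm 1): exponential moving averages of the
    gradient (1st momentum m) and of its element-wise square (2nd momentum v),
    with bias correction, are nonlinearly combined into the descent direction."""
    m = beta1 * m + (1 - beta1) * g          # 1st-momentum estimate
    v = beta2 * v + (1 - beta2) * g * g      # 2nd-momentum estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected mean
    v_hat = v / (1 - beta2 ** t)             # bias-corrected variance
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# usage: minimize the static quadratic f(x) = 0.5 * ||x||^2, whose gradient is x
x = np.array([5.0, -3.0])
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 20001):
    x, m, v = adam_step(x, x, m, v, t, alpha=0.01)
```

Note how the element-wise ratio m̂/(√v̂ + ϵ) adaptively rescales each coordinate of the step, which is the feature GTAdam later imports into the distributed setting.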

B. Gradient tracking distributed algorithm
The gradient tracking is a distributed algorithm mainly tailored to static instances of problem (1). Agents in a network maintain and update two local states x_i^t and s_i^t by iteratively combining a perturbed average consensus and a dynamic tracking mechanism. Consensus is used to enforce agreement among the local agents' estimates x_i^t. The agreement is also locally perturbed in order to steer the local estimates toward a (static) optimal solution of the problem. The perturbation is obtained by using a tracking scheme that allows agents to locally reconstruct a progressively accurate estimate of the gradient of the whole (static) cost function in a distributed way. A pseudo-code of the gradient tracking distributed algorithm is reported in Algorithm 2, in which N_i denotes the set of (in-)neighbors of agent i, while α > 0 is the step-size. The protocol is shown from the perspective of agent i only.

III. THE GTADAM DISTRIBUTED ALGORITHM

In this section we present the main contribution of this paper, i.e., the gradient tracking with adaptive momentum estimation (GTAdam) distributed algorithm. GTAdam is designed to address problem (1) in a distributed fashion, taking inspiration both from Adam and from the gradient tracking distributed algorithm.
Along the evolution of the algorithm, each agent i maintains four local states: (i) a local estimate x_i^t of the current optimal solution x_⋆^t; (ii) an auxiliary variable s_i^t whose role is to track the gradient of the whole cost function; (iii) an estimate m_i^t of the 1st momentum of s_i^t; (iv) an estimate v_i^t of the 2nd momentum of s_i^t. The momentum estimates of s_i^t are initialized as m_i^0 = v_i^0 = 0, while the tracker of the gradient is initialized as s_i^0 = ∇f_i^0(x_i^0). At each iteration, agent i performs three operations: (i) it updates the momentum estimates m_i^t and v_i^t by using the gradient tracker s_i^t in place of the (unavailable) global gradient; (ii) it updates the local solution estimate x_i^t through a consensus averaging step perturbed along the Adam-like descent direction built from m_i^t and v_i^t; (iii) it updates the local gradient tracker s_i^t via a "dynamic consensus" mechanism. A pseudo-code of GTAdam is reported in Algorithm 3.
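To make the three steps above concrete, the following is a minimal synchronous sketch of one GTAdam round over all agents (an illustrative reconstruction of Algorithm 3, not the authors' code: the quadratic local costs, the complete-graph weights, the step-size, and the saturation level G in the usage snippet are assumptions, and bias correction is omitted):

```python
import numpy as np

def gtadam_round(x, s, m, v, W, grads, alpha, beta1=0.9, beta2=0.999,
                 eps=1e-8, G=10.0):
    """One synchronous GTAdam round. x, s, m, v: (N, n) arrays stacking the
    local states; W: doubly stochastic (N, N) weights; grads: callable
    mapping the stacked estimates to the stacked local gradients."""
    m = beta1 * m + (1 - beta1) * s                     # (i) 1st momentum of the tracker
    v = np.minimum(beta2 * v + (1 - beta2) * s * s, G)  # (i) 2nd momentum, saturated at G
    d = m / (np.sqrt(v) + eps)                          # Adam-like descent direction
    g_old = grads(x)
    x_new = W @ x - alpha * d                           # (ii) perturbed consensus step
    s = W @ s + grads(x_new) - g_old                    # (iii) dynamic consensus tracking
    return x_new, s, m, v

# usage: N agents with assumed quadratic local costs f_i(x) = 0.5 * ||x - b_i||^2
N, n = 10, 2
rng = np.random.default_rng(0)
b = rng.normal(size=(N, n))
grads = lambda x: x - b                   # stacked local gradients
W = np.full((N, N), 1.0 / N)              # complete graph: trivially doubly stochastic
x = rng.normal(size=(N, n))
s = grads(x)                              # tracker initialized at the local gradients
m = np.zeros((N, n))
v = np.zeros((N, n))
for _ in range(5000):
    x, s, m, v = gtadam_round(x, s, m, v, W, grads, alpha=0.02)
# the agents reach (approximate) consensus near mean(b), the minimizer of sum_i f_i
```

Note that step (iii) reuses the gradient evaluated at the previous iterate, so each round costs one new gradient evaluation per agent, exactly as in the plain gradient tracking.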
Some remarks are in order. The algorithm proposed in this paper is different from the one in [33]: although both use a similar strategy involving first- and second-order momenta, in that work only local gradients are considered, without resorting to any tracking mechanism.

Note that a saturation level G ≫ 0 is introduced in the update of v_i^t, where the min operator is to be intended element-wise. The value of G guarantees a bound on the scaling factor that multiplies the descent direction; such a bound will turn out to be important for analysis purposes. We suggest taking G proportional to the initial estimates v_i^0.

We now state some regularity requirements on problem (1). We first make two assumptions regarding each f_i^t.

Assumption 1 (Lipschitz continuous gradients). There exists L > 0 such that, for all i ∈ {1, . . . , N} and t ≥ 0, it holds ∥∇f_i^t(x) − ∇f_i^t(y)∥ ≤ L∥x − y∥ for all x, y ∈ R^n.

Assumption 2 (Strong convexity). There exists µ > 0 such that, for all i ∈ {1, . . . , N} and t ≥ 0, the function f_i^t is µ-strongly convex.

We point out that, in light of Assumption 2, the minimizer x_⋆^t is unique for all t ≥ 0. Finally, the following assumption characterizes the communication structure.
Assumption 3 (Network structure). The weighted graph G is connected and its weighted adjacency matrix W is doubly stochastic.
In order to analyze GTAdam, we rewrite it into an aggregate form. Given the local variables x_i^t, we define their stacked version x^t ≜ col(x_1^t, . . . , x_N^t) and the average x̄^t ≜ (1/N) Σ_{i=1}^N x_i^t. Similar definitions apply to the quantities m^t, v^t, d^t, g^t, s^t and their averages m̄^t, v̄^t, d̄^t, s̄^t. With these definitions at hand, GTAdam can be rephrased from a global perspective as in (3); moreover, the averaged quantities of (3) satisfy the dynamics (4). Our analysis is based on studying the aggregate dynamical evolution of the following quantities: the average first momentum ∥m̄^t∥, the average tracking momentum difference ∥s̄^t − m̄^t∥, the first momentum error ∥m^t − 1m̄^t∥, the gradient tracking error ∥s^t − 1s̄^t∥, the consensus error ∥x^t − 1x̄^t∥, and the solution error ∥x̄^t − x_⋆^t∥. Let y^t be the vector stacking the above quantities at iteration t (cf. (5)). Notice that, due to the distributed context and the absence of assumptions on the boundedness of the gradients, we need to take all these quantities into account to study convergence. Let us also introduce two useful quantities, η^t and ζ^t (cf. (6)), which capture, respectively, the variation over time of the gradients of the cost functions and of the optimal solution, i.e., ζ^t ≜ ∥x_⋆^{t+1} − x_⋆^t∥. Then, the main result of this paper is stated as follows.
Theorem 1. Consider GTAdam as given in Algorithm 3 and let Assumptions 1, 2, and 3 hold. Then, for a sufficiently small step-size α > 0, there exists a constant 0 < ρ̃ < 1 such that the dynamic regret R_T defined in (2) satisfies the bound (7), where the constant λ is defined in the proof (cf. (18)), η^t and ζ^t are defined in (6), and the accumulated variations S_T and Q_T defined in (8) are assumed to be finite. Moreover, the asymptotic bound (9) holds. As it requires several intermediate results, the proof of Theorem 1 is carried out in Section IV.
There is evidence in the literature, see, e.g., [1], [9], [26], [34]-[36], that the bound on the dynamic regret cannot be sublinear with respect to T. As stated, e.g., in [1], when the objective functions are strongly convex and have bounded gradients, the bound on the dynamic regret is O(1 + Σ_t η^t). Our work does not assume gradient boundedness and, thus, our bound has additional terms due to the variations over time of the gradients. Specifically, Theorem 1 shows that R_T is upper bounded by a constant depending on the initial conditions plus two other terms. The latter involve S_T and Q_T, which capture the time-varying nature of the problem itself. Indeed, suppose that the problem varies at most linearly, i.e., there exists C > 0 so that η^t, ζ^t ≤ C for all t ≥ 0. Then, since ρ̃ ∈ (0, 1), we can exploit the properties of geometric series to bound the corresponding terms. In this case, (7) ensures that the average regret R_T/T asymptotically approaches a constant as T → ∞.

The key point of the proof consists in showing that the error vector y^t (see (5)) evolves according to a linear system with state matrix A(α) (whose entries depend on the problem parameters, e.g., the strong convexity parameter or the network connectivity), perturbed by an input q^t related to the variations of the problem over time (see (11)). Notice that the parameter ρ̃ is related to the spectral radius of A(α) and, thus, depends also on the network topology.

Agent regret: We may also consider the regret attained by each single agent i, namely R_{T,i} ≜ Σ_{t=1}^T ( f^t(x_i^t) − f^t(x_⋆^t) ). The following corollary provides a bound for R_{T,i} analogous to the one of Theorem 1.

Corollary 1 (Agent regret). Under the same assumptions of Theorem 1, for a sufficiently small step-size α > 0, the agent regret R_{T,i} satisfies a bound of the same form as (7), where λ, ρ̃, S_T, and Q_T are defined as in Theorem 1.
The proof is given in Appendix G.

Static set-up: We provide an additional corollary of Theorem 1 asserting theoretical guarantees in a static scenario. Specifically, in this special case the GTAdam distributed algorithm converges to the optimal solution with a linear rate.
Corollary 2 (Static set-up). Under the same assumptions of Theorem 1, if additionally f^t = f holds for all t ≥ 0, then, for a sufficiently small step-size α > 0, there exists a constant 0 < ρ̃ < 1 such that the cost error f(x̄^t) − f(x_⋆) decreases linearly with rate ρ̃², i.e., proportionally to ρ̃^{2t}∥y^0∥², where the proportionality constant is expressed through λ defined in (18).
The proof is given in Appendix H.

IV. ANALYSIS
This section is devoted to providing the proof of Theorem 1.

A. Preparatory Lemmas
We now give a sequence of intermediate results, providing proper bounds on the components of y t (defined in (5)), that are then used as building blocks for proving Theorem 1.
Lemma 1 (Average first momentum magnitude). Let Assumption 1 hold. Then, for all t ≥ 0, ∥m̄^{t+1}∥ satisfies a linear bound in terms of the components of y^t. The proof is given in Appendix A.

Lemma 2 (First momentum error) provides an analogous bound on ∥m^{t+1} − 1m̄^{t+1}∥; its proof follows by combining (3a) and (4a) with the triangle inequality.

Lemma 3 (Descent direction error) bounds ∥d^{t+1} − 1d̄^{t+1}∥. The proof is given in Appendix B.

Lemma 4 (Gradient tracking error) bounds ∥s^{t+1} − 1s̄^{t+1}∥. The proof is given in Appendix C.

Lemma 5 (Consensus error). Let Assumptions 1 and 3 hold. Then, for all t ≥ 0, ∥x^{t+1} − 1x̄^{t+1}∥ satisfies a linear bound in terms of the components of y^t. The proof is given in Appendix D.

Lemma 6 (Tracking momentum difference magnitude). Let Assumptions 1, 2, and 3 hold. Then, for all t ≥ 0, ∥s̄^{t+1} − m̄^{t+1}∥ satisfies a linear bound in terms of the components of y^t and of η^t. The proof is given in Appendix E.

Lemma 7 (Solution error). Let Assumptions 1, 2, and 3 hold. Then, for all t ≥ 1, ∥x̄^{t+1} − x_⋆^{t+1}∥ satisfies a linear bound in terms of the components of y^t, of ζ^t defined in (6), and of δ ≜ min{µ/(√ϵ + G), L/√ϵ}. The proof is given in Appendix F.

B. Proof of Theorem 1
By recalling the definition of y^t given in (5) and combining Lemmas 1, 2, 4, 5, 6, and 7, it is possible to write the linear dynamics (11), whose input q^t collects the terms related to the temporal variations η^t and ζ^t, and where we used the shorthand A(α) ≜ A_0 + αE for the state matrix. Since A_0 is triangular, it is easy to see that its spectral radius is 1, because both β_1 and σ_W lie in (0, 1). We want to study how the perturbation matrix αE affects the simple eigenvalue 1 of A_0. Hence, we denote by χ(α) such an eigenvalue of A(α) as a function of α. Call w and v, respectively, the left and right eigenvectors of A_0 associated with the eigenvalue 1, i.e., w = col(0, 0, 0, 0, 0, 1) and v = col(L, 0, 0, 0, 0, 1). Since the eigenvalue 1 is simple, from [37, Theorem 6.3.12] it holds χ'(0) = (w^⊤ E v)/(w^⊤ v), which can be shown to be negative. Then, by continuity of the eigenvalues with respect to the matrix entries, χ(α) is strictly less than 1 for sufficiently small α > 0, and it is always possible to choose α > 0 so that the remaining eigenvalues stay in the unit circle. Therefore, the spectral radius satisfies ρ(A(α)) < 1. Moreover, since A(α) and q^t have only non-negative entries, one can use (11) to write the rolled-out bound (12), i.e., y^{t+1} ≤ A(α)^{t+1} y^0 + Σ_{k=0}^{t} A(α)^{t−k} q^k. From [37, Lemma 5.6.10], for any γ > 0 there exists a matrix norm, say |||·|||_γ, such that (13) holds, i.e., |||A(α)|||_γ ≤ ρ(A(α)) + γ. Let us pick γ ∈ (0, 1 − ρ(A(α))) and define ρ̃ ≜ ρ(A(α)) + γ.
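As a sanity check of the eigenvalue-perturbation step, the formula χ'(0) = (w^⊤Ev)/(w^⊤v) from [37, Theorem 6.3.12] can be verified numerically; the triangular A0 and the perturbation E below are illustrative toys, not the actual matrices of the proof:

```python
import numpy as np

# toy upper-triangular A0 with the simple eigenvalue 1, and a perturbation E
# whose (3,3) entry makes the perturbed eigenvalue decrease, as in the proof
A0 = np.array([[0.5, 1.0, 0.0],
               [0.0, 0.7, 2.0],
               [0.0, 0.0, 1.0]])
E = np.array([[0.1, 0.0, 0.2],
              [0.3, 0.1, 0.0],
              [0.0, 0.0, -1.0]])

def eig_pair(M, target=1.0):
    """Eigenvalue of M closest to `target` and its right eigenvector."""
    evals, evecs = np.linalg.eig(M)
    k = np.argmin(np.abs(evals - target))
    return evals[k].real, evecs[:, k].real

_, v = eig_pair(A0)        # right eigenvector of A0 for the eigenvalue 1
_, w = eig_pair(A0.T)      # left eigenvector (right eigenvector of A0^T)

deriv = (w @ E @ v) / (w @ v)      # [37, Theorem 6.3.12]: d(chi)/d(alpha) at 0

# central finite-difference check of the same derivative
a = 1e-6
chi = lambda a: eig_pair(A0 + a * E)[0]
fd = (chi(a) - chi(-a)) / (2 * a)
```

For this toy pair the derivative is negative, so the eigenvalue of A0 + αE near 1 dips below 1 for small α > 0, which mirrors how the proof obtains ρ(A(α)) < 1.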
Then, in light of (13), it holds |||A(α)|||_γ ≤ ρ̃ < 1 with a compatible vector norm on R^6. Hence, we can manipulate (12) by taking the norm and using the triangle inequality to obtain (14), which shows that the first term decreases linearly with rate ρ̃ < 1, while the second one is bounded. By using the Lipschitz continuity of the gradients of f^t (cf. Assumption 1), we obtain (15), where in (a) we use the fact that ∥x̄^t − x_⋆^t∥ represents a component of y^t, leading to the trivial bound ∥x̄^t − x_⋆^t∥ ≤ ∥y^t∥. Recalling that all norms are equivalent on finite-dimensional vector spaces, there always exist λ_1 > 0 and λ_2 > 0 such that (16a) and (16b) hold, namely ∥y∥ ≤ λ_1 |||y|||_γ and |||y|||_γ ≤ λ_2 ∥y∥ for all y ∈ R^6. Thus, by applying (16a), we bound (15) as in (17), where in (a) we use the geometric series property and the relation (16b). The proof of (7) then follows by using the triangle inequality, the definitions of S_T and Q_T (cf. (8)), and by setting λ as in (18). Finally, in order to prove (9), we notice that ∥x̄^T − x_⋆^T∥ ≤ ∥y^T∥ ≤ λ_1 |||y^T|||_γ, in which we apply (16a). By applying the bound (14) for t = T, we get that the first term of the resulting inequality vanishes as T → ∞, while the second one can be bounded by relying on the geometric series property and on max_k ∥q^k∥². By exploiting these arguments and applying (16b) together with the definition (18) of λ, the result (9) follows.
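For completeness, the geometric-series manipulations invoked above rest on the elementary bounds (valid for any ρ̃ ∈ (0, 1) and nonnegative scalars c_k):

```latex
\sum_{k=0}^{t} \tilde{\rho}^{\,t-k} c_k
\;\le\; \Big(\max_{0\le k\le t} c_k\Big) \sum_{j=0}^{t} \tilde{\rho}^{\,j}
\;\le\; \frac{\max_{0\le k\le t} c_k}{1-\tilde{\rho}},
\qquad
\sum_{t=0}^{T} \tilde{\rho}^{\,t} \;\le\; \frac{1}{1-\tilde{\rho}} .
```

In particular, when the temporal variations are uniformly bounded by some C > 0, the input-related term of (14) stays uniformly bounded, which is the property used for the second term of (17) and for the asymptotic expression (9).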

V. NUMERICAL EXPERIMENTS
In this section we consider three multi-agent distributed learning problems to show the effectiveness of GTAdam. The first scenario regards the computation of a linear classifier via a regularized logistic regression function for a set of points that change over time. The second scenario involves the localization of a moving target. The third example is a stochastic optimization problem arising in a distributed image classification task. In all the examples, the parameters of GTAdam are chosen as β_1 = 0.9, β_2 = 0.999, and ϵ = 10^{−8}. Moreover, we compare GTAdam with the gradient tracking distributed algorithm (GT) (cf. Algorithm 2 in Section II), the distributed gradient descent (DGD) (see [38]), and the distributed Adam (DAdam) (see [33]), in which each agent i ∈ {1, . . . , N} runs an Adam-like update based on its local gradients only. As suggested in [33], we set β_1 = β_3 = 0.9, β_2 = 0.999, and a diminishing step-size γ^t = (αt)^{−1/2} for some α > 0.
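For reference, the DGD baseline admits a particularly compact sketch (illustrative, with assumed quadratic costs and complete-graph weights; a constant step-size is used here, so DGD only reaches a neighborhood of the optimum):

```python
import numpy as np

def dgd_round(x, W, grads, alpha):
    """One round of distributed gradient descent (cf. [38]): consensus
    averaging followed by a step along the local gradients."""
    return W @ x - alpha * grads(x)

# usage on an assumed quadratic toy problem f_i(x) = 0.5 * ||x - b_i||^2
N, n = 10, 2
rng = np.random.default_rng(1)
b = rng.normal(size=(N, n))
grads = lambda x: x - b                 # stacked local gradients
W = np.full((N, N), 1.0 / N)            # complete graph: doubly stochastic
x = rng.normal(size=(N, n))
for _ in range(500):
    x = dgd_round(x, W, grads, alpha=0.05)
# with a constant step-size the iterates settle in an O(alpha)-neighborhood
# of the minimizer mean(b); a diminishing step-size removes this bias
```

The residual O(α) bias of DGD is precisely what the tracking mechanism of GT and GTAdam eliminates, as the static experiment below illustrates.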

A. Distributed classification via logistic regression
Consider a network of agents that want to cooperatively train a linear classifier for a set of (moving) points in a given feature space. At time t ≥ 0, each agent i is equipped with m_i ∈ N points p_{i,1}^t, . . . , p_{i,m_i}^t ∈ R^d with binary labels l_{i,k} ∈ {−1, 1} for all k ∈ {1, . . . , m_i}. The problem consists of building a linear classification model from the given points, also called training samples. In particular, we look for a separating hyperplane {p ∈ R^d | w^⊤p + b = 0}, described by a pair (w, b) ∈ R^d × R. This online classification problem can be posed, at each time t ≥ 0, as the minimization problem (20), where C > 0 is the so-called regularization parameter. Notice that the presence of the regularization makes the cost function strongly convex. Each point p_{i,k}^t ∈ R² moves along a circle of radius r = 1 centered at a randomly generated point p_{i,k}^c ∈ R². We consider a network of N = 50 agents and pick m_i = 5 for all i. We performed an experimental tuning of the step-sizes to enhance the convergence properties of each algorithm; in particular, we selected α = 0.1 for GTAdam, α = 0.05 for GT, α = 0.1 for DGD, and α = 0.1 for DAdam. We performed Monte Carlo simulations consisting of 100 trials, in which we alternately consider an undirected, connected Erdős-Rényi graph with connectivity parameter 0.5 and a ring graph. In Fig. 1, we plot the average across the trials of the relative cost error with respect to x_⋆^t, the minimizer of f^t for all t. The plot highlights that GTAdam exhibits a faster convergence compared to the other algorithms, and achieves a smaller tracking error. Finally, we consider a static instance of problem (20), i.e., with fixed objective functions f_i^t = f_i for all t ≥ 0 and i ∈ {1, . . . , N}, and a network of N = 50 agents in a ring topology.
We take α = 0.001 for GTAdam, α = 0.01 for GT, α = 0.1 for DGD, and α = 0.5 for DAdam. In Fig. 2, we plot the error ∥x t − x ⋆ ∥ achieved by the considered methods, where x ⋆ ∈ R d is the (fixed) optimal solution of the problem. Fig. 2 clearly shows the benefit of the tracking mechanism, which allows GTAdam and GT to achieve the exact problem solution. The plot also shows that GTAdam is faster than GT.
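A possible form of the local logistic cost in (20) and of its gradient is sketched below (the exact regularization used in the paper is not shown in the text, so the l2 term on both w and b is an assumption that keeps the cost strongly convex; function names are illustrative):

```python
import numpy as np

def local_logistic_cost(w, b, P, l, C=0.1):
    """Assumed local cost of agent i: log-loss over its m_i points
    (rows of P, labels l in {-1, +1}) plus an l2 regularizer."""
    z = l * (P @ w + b)                          # margins l_k * (w^T p_k + b)
    return np.sum(np.log1p(np.exp(-z))) + (C / 2.0) * (w @ w + b * b)

def local_logistic_grad(w, b, P, l, C=0.1):
    """Gradient of the assumed local cost with respect to (w, b)."""
    z = l * (P @ w + b)
    coef = -l / (1.0 + np.exp(z))                # d/dz log(1 + e^{-z}) = -1 / (1 + e^z)
    return P.T @ coef + C * w, np.sum(coef) + C * b
```

In a distributed run, each agent would feed its local_logistic_grad into the GTAdam updates in place of the abstract gradient ∇f_i^t.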

B. Distributed source localization in smart sensor networks
The estimation of the exact position of a source is a key task in several applications in multi-agent distributed estimation and learning. Here, we consider an online version of the static localization problem considered in [39, Section 4.2]. An acoustic source is positioned at an unknown and time-varying location θ_target^t ∈ R². A network of N sensors is capable of measuring an isotropic signal related to such a location and aims at cooperatively estimating θ_target^t. Each sensor is placed at a fixed location c_i ∈ R² and takes, at each time instant, a noisy measurement y_i^t according to the isotropic propagation model y_i^t = A/∥θ_target^t − c_i∥^γ + ϵ_i^t, where A > 0 and γ ≥ 1 describe the attenuation characteristics of the medium through which the signal propagates, and ϵ_i^t is a zero-mean Gaussian noise with variance σ². With these data, each node i addresses, at each time t ≥ 0, a nonlinear least-squares online problem. We consider a network of N = 50 agents randomly located according to a two-dimensional Gaussian distribution with zero mean and covariance a²I₂ = 100I₂. The agents want to track the location of a moving target which starts at a random location θ_target^0 ∈ R², generated according to the same distribution as the agents. The target moves along a circle of radius r = 0.5 centered at a randomly generated point θ_center ∈ R². We pick γ = 1, A = 100, and a noise variance σ² = 0.001. We take α = 0.05 for GTAdam, α = 0.02 for GT, α = 0.05 for DGD, and α = 0.0725 for DAdam. The agents communicate according to a ring graph. In Fig. 3 we compare the algorithm performance in terms of the (instantaneous) cost function evolution, while Fig. 4 shows that the best performance in terms of average dynamic regret is obtained by GTAdam, which achieves a smaller error with respect to the other algorithms.
We make these comparisons by using θ_target^t as the optimal estimate associated with iteration t; note, however, that the actual optimal solution may be slightly different, since the noise ϵ_i^t affects the measurement of each agent.
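The measurement model and the resulting local least-squares cost described above can be sketched as follows (a reconstruction from the description; function names are illustrative):

```python
import numpy as np

def measurement(theta, c, A=100.0, gamma=1.0, sigma=0.0, rng=None):
    """Noisy isotropic reading of the sensor at position c for a source at
    theta: A / ||theta - c||^gamma plus zero-mean Gaussian noise."""
    noise = rng.normal(scale=sigma) if rng is not None else 0.0
    return A / np.linalg.norm(theta - c) ** gamma + noise

def local_nls_cost(theta, c, y, A=100.0, gamma=1.0):
    """Nonlinear least-squares residual of agent i at time t for reading y."""
    return (y - A / np.linalg.norm(theta - c) ** gamma) ** 2
```

With zero noise the cost vanishes at the true target position, e.g. for a sensor at c = (4, 6) reading a source at θ = (1, 2); with noisy readings, the minimizer of the summed costs is only approximately θ_target^t, which is the caveat noted above.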

C. Distributed image classification via neural networks
In this example, we consider an image classification problem in which N nodes have to cooperatively learn how to correctly classify images. We pick the Fashion-MNIST dataset [40], consisting of black-and-white 28×28-pixel images of clothes belonging to 10 different classes. Each agent i has a local dataset D_i = {(p_{i,k}, y_{i,k})}_{k=1}^{m_i} consisting of m_i images p_{i,k} ∈ R^{28×28} and their associated labels y_{i,k} ∈ {1, . . . , 10}. The goal of the agents is to learn the parameters x_⋆ of a function h(p; x) so that h(p_{i,k}; x_⋆) gives the correct label for p_{i,k}. The resulting optimization problem has local cost functions given by the categorical cross-entropy loss V(·) evaluated over the local dataset plus a regularization term with parameter C > 0. We represent h(·) by a neural network with one hidden layer (300 units with ReLU activation function) and an output layer with 10 units. Moreover, we pick N = 16 agents and associate with each of them m_i = 3750 labeled images. We performed Monte Carlo simulations consisting of 100 trials, each lasting 10 epochs over the local datasets. The results are reported in Fig. 5 and Fig. 6 in terms of the global training loss evaluated at x̄^{ep} and of the accuracy achieved with x̄^{ep} on the local dataset of agent i at the end of epoch ep. We take α = 0.001 for GTAdam, and α = 0.1 for DGD, GT, and DAdam. As can be appreciated from Fig. 5 and Fig. 6, in both cases GTAdam outperforms the other algorithms.

CONCLUSIONS
We proposed GTAdam, a novel distributed optimization algorithm tailored for multi-agent online learning. Inspired by the popular Adam algorithm, our novel GTAdam is based on the gradient tracking distributed scheme, which is enhanced with adaptive first- and second-order momentum estimates of the gradient. We provided theoretical bounds on the convergence of the proposed algorithm. Moreover, we tested GTAdam in three different scenarios, showing a performance improvement with respect to state-of-the-art algorithms.

APPENDIX
We report a lemma that will be used in the proof of Lemma 7 (cf. Appendix F).
Proof. Let h(x) be a function such that ∇h(x) = D∇f(x) for all x. It can be easily shown that h has Lipschitz continuous gradients; in fact, the Lipschitz constant of ∇h follows from that of ∇f and the boundedness of D. Notice that, by definition, g is convex and has (L − s)-Lipschitz continuous gradients. Thus, by co-coercivity, we have ⟨∇g(x) − ∇g(y), x − y⟩ ≥ (1/(L − s)) ∥∇g(x) − ∇g(y)∥². (21) Now, by using the definition of g, one has (22); moreover, (23) holds. By combining (21), (22), and (23) we get (24). Finally, by using the update rule together with the result (24) applied with ∇h(x) = D∇f(x), the proof follows by taking the square root of both sides.

A. Proof of Lemma 1
By using the update (4a), we can write (25), in which we use the triangle inequality. Regarding the term ∥s̄^t∥, we use the relation s̄^t = (1/N) Σ_{i=1}^N ∇f_i^t(x_i^t) to write (26), where in (a) we exploit the Lipschitz continuity of the gradients of the cost functions (cf. Assumption 1), in (b) we use the basic algebraic property Σ_{i=1}^N ∥θ_i∥ ≤ √N ∥θ∥ for a generic vector θ ≜ col(θ_1, . . . , θ_N), and in (c) we add and subtract the term 1x̄^t and apply the triangle inequality. The proof follows by combining the bounds (25) and (26).

B. Proof of Lemma 3
By using (3c) and (4c), one has (27), where in (a) we apply the Cauchy-Schwarz inequality combined with ∥I − (1/N)11^⊤∥ ≤ 1, in (b) we use the bound ∥(V^{t+1} + ϵI)^{−1/2}∥ ≤ 1/√ϵ (justified by the fact that v^{t+1} ≥ 0 for all t ≥ 0), and in (c) we add and subtract 1m̄^{t+1} within the norm and apply the triangle inequality and an algebraic property. The proof follows by using Lemmas 1 and 2 in (27).

C. Proof of Lemma 4
By combining (3e) and (4e), one has (28), where (a) uses 1 ∈ ker(W − (1/N)11^⊤) and the triangle inequality, and (b) combines the Cauchy-Schwarz inequality with the bounds ∥W − (1/N)11^⊤∥ ≤ σ_W and ∥I − (1/N)11^⊤∥ ≤ 1. We then manipulate the term g^{t+1} − g^t in (28) as in (29), where in (a) we use the Lipschitz continuity of the gradients of the cost functions (cf. Assumption 1), (b) uses the variable η^t (cf. (6)), and (c) uses the update (3d) of x^{t+1}. Let us manipulate the first term on the right-hand side of (29) as in (30), where (a) uses the fact that ker(W − I) = span(1), and in (b) we add and subtract the term 1d̄^{t+1} within the norm and apply the triangle inequality and the Cauchy-Schwarz inequality. Regarding ∥1d̄^{t+1}∥, we use (3c) and (4c) to write (31), where in (a) we apply the Cauchy-Schwarz inequality and the bounds ∥(1/N)11^⊤∥ ≤ 1 and ∥(V^{t+1} + ϵI)^{−1/2}∥ ≤ 1/√ϵ, and in (b) we add and subtract the term 1m̄^{t+1} within the norm, apply the triangle inequality, and use an algebraic property. By combining (30) and (31), we bound (29) as in (32). Now, by using the bound (32) within (28), we get (33). The proof follows by using Lemmas 1, 2, and 3 to bound ∥m̄^{t+1}∥, ∥m^{t+1} − 1m̄^{t+1}∥, and ∥d^{t+1} − 1d̄^{t+1}∥.

D. Proof of Lemma 5
By combining (3d) and (4d), we obtain a chain of inequalities in which (a) applies the triangle inequality and (b) follows from ∥W − (1/N)11^⊤∥ ≤ σ_W. The proof follows by Lemma 3.

E. Proof of Lemma 6
From the updates of s̄^{t+1} and m̄^{t+1} (cf. (4e), (4a)), we get a first bound, where (a) uses the triangle inequality. By adding and subtracting a suitable average term within the second norm, we use the triangle inequality to obtain (34), where in (a) we use the Lipschitz continuity of the gradients of the cost functions (cf. Assumption 1) for the second and the third norms, and we use η^t (cf. (6)). Now, we replace x̄^{t+1} with its update (4d) within the last term of (34), obtaining (35), where in (a) we use (31) to bound ∥1d̄^{t+1}∥. The proof follows by using Lemmas 5, 1, and 2 to bound ∥x^{t+1} − 1x̄^{t+1}∥, ∥m̄^{t+1}∥, and ∥m^{t+1} − 1m̄^{t+1}∥, respectively.

F. Proof of Lemma 7
By using (4d), one has (36), where in (a) we add and subtract x_⋆^t within the norm, use the triangle inequality, and use ζ^t (cf. (6)). Now, we add and subtract within the norm the term α (1^⊤(V^{t+1} + ϵI)^{−1/2}1/N²) ∇f^t(x̄^t) and use the triangle inequality. Consider the second term of (36) and use (4c) to write (37), where in (a) we add and subtract within the norm the term (1^⊤(V^{t+1} + ϵI)^{−1/2}1/N) m̄^{t+1} and apply the triangle inequality, and in (b) we apply the Cauchy-Schwarz inequality combined with the bound ∥1^⊤(V^{t+1} + ϵI)^{−1/2}1/N∥ ≤ 1/√ϵ. Now, we add and subtract the term (1/N) Σ_{i=1}^N ∇f_i^t(x_i^t) and then use the triangle inequality to rewrite the first term of the second member of (37) as in (38), where (a) uses (4a), and (b) uses the relation s̄^t = (1/N) Σ_{i=1}^N ∇f_i^t(x_i^t) and the Lipschitz continuity of the gradients of the cost functions (cf. Assumption 1). Next, in order to bound the right-hand side of (36), first notice that (39) holds. Moreover, since f^t is µ-strongly convex for all t ≥ 0 (cf. Assumption 2) and has L-Lipschitz continuous gradients (cf. Assumption 1), we apply Lemma 8 (in the Appendix) to write a contraction with factor ϕ ≜ max{1 − αµ/(√ϵ + G), 1 − αL/√ϵ}. If we take α < min{(√ϵ + G)/µ, √ϵ/L}, then it holds ϕ = 1 − αδ, where δ is defined in the statement of Lemma 7. By combining the latter with (38) and (39), it is possible to upper bound (36) as in (40). The proof follows by invoking Lemma 2 to bound ∥m^{t+1} − 1m̄^{t+1}∥ within (40).

G. Proof of Corollary 1
We add and subtract f^t(x̄^t) to obtain (41), where in (a) we apply (15) and in (b) we use the Lipschitz continuity of the gradients of the cost functions (cf. Assumption 1). Since ∇f^t(x_⋆^t) = 0, we rewrite (41) as (42), where in (a) we use the Cauchy-Schwarz inequality and the Lipschitz continuity of the gradients of the cost functions (cf. Assumption 1). Now, we notice that both ∥x̄^t − x_⋆^t∥ and ∥x_i^t − x̄^t∥ represent components of the vector y^t defined in (5) and, thus, can both be upper bounded by ∥y^t∥. Hence, inequality (42) can be elaborated as in (43). By summing (43) over t, we bound R_{T,i} as in (44), where in (a) we apply (16a). As done above to prove (7), the proof follows by combining (44), (14), and (16b).

H. Proof of Corollary 2
Using the same arguments as in the proof of Theorem 1, we start from (14). Differently from the dynamic case, in the static set-up we have ∇f_i^t(x) = ∇f_i(x) for all t and i, leading to x_⋆^t = x_⋆ for all t. Thus, we can combine (14) with q^t ≡ 0, the Lipschitz continuity of the gradients of the cost function (cf. Assumption 1), and (16a), to write

f(x̄^t) − f(x_⋆) ≤ ρ̃^{2t} (Lλ_1²/2) |||y^0|||_γ² ≤ ρ̃^{2t} (Lλ_1²λ_2²/2) ∥y^0∥²,

in which we use (16b). The proof follows by using the definition (18) of λ.