Dif-MAML: Decentralized Multi-Agent Meta-Learning

The objective of meta-learning is to exploit the knowledge obtained from observed tasks to improve adaptation to unseen tasks. As such, meta-learners are able to generalize better when they are trained with a larger number of observed tasks and with a larger amount of data per task. Given the amount of resources that are needed, it is generally difficult to expect the tasks, their respective data, and the necessary computational capacity to be available at a single central location. It is more natural to encounter situations where these resources are spread across several agents connected by some graph topology. The formalism of meta-learning is actually well-suited to this decentralized setting, where the learner would be able to benefit from information and computational power spread across the agents. Motivated by this observation, in this work, we propose a cooperative fully-decentralized multi-agent meta-learning algorithm, referred to as Diffusion-based MAML or Dif-MAML. Decentralized optimization algorithms are superior to centralized implementations in terms of scalability, avoidance of communication bottlenecks, and privacy guarantees. The work provides a detailed theoretical analysis to show that the proposed strategy allows a collection of agents to attain agreement at a linear rate and to converge to a stationary point of the aggregate MAML objective even in non-convex environments. Simulation results illustrate the theoretical findings and the superior performance relative to the traditional non-cooperative setting.


Introduction
Training of highly expressive learning architectures, such as deep neural networks, requires large amounts of data in order to ensure high generalization performance. However, the generalization guarantees apply only to test data following the same distribution as the training data. Human intelligence, on the other hand, is characterized by a remarkable ability to leverage prior knowledge to accelerate adaptation to new tasks. This evident gap has motivated a growing number of works to pursue learning architectures that learn to learn (see [Hospedales et al., 2020] for a recent survey).
The work [Finn et al., 2017] proposed a model-agnostic meta-learning (MAML) approach, an initial-parameter-transfer methodology whose goal is to learn a good "launch model". Several works have extended and/or analyzed this approach to great effect, such as [Nichol et al., 2018; Finn et al., 2018; Raghu et al., 2020; Li et al., 2017; Fallah et al., 2020a; Ji et al., 2020; Zhuang et al., 2020; Balcan et al., 2019]. However, no prior work appears to consider model-agnostic meta-learning in a decentralized multi-agent setting. This setting is very natural for meta-learning, where different agents can be assumed to run local meta-learners based on their own experiences. Interactions with neighbors can then infuse their models with new information and speed up adaptation to new tasks.
Decentralized multi-agent systems consist of a collection of agents with access to data and computational capabilities, and a graph topology that imposes constraints on peer-to-peer communications. In contrast to centralized architectures, which require some central aggregation of data, decentralized solutions rely solely on the diffusion of information over connected graphs through successive local aggregations over neighborhoods. While decentralized methods have been shown to be capable of matching the performance of centralized solutions [Lian et al., 2017; Sayed, 2014a], the absence of a fusion center is advantageous in the presence of communication bottlenecks and concerns around robustness or privacy. Decentralized settings are also well motivated by swarm intelligence or swarm robotics concepts, where relatively simple agents (insects, machines, robots, etc.) collaboratively form a more robust and complex system, one that is flexible and scalable [Beni, 2004; Sahin, 2004]. Applications that can benefit from decentralized meta-learning algorithms include, but are not limited to, the following:
• A robot swarm might be assigned to do environmental monitoring [Dunbabin and Marques, 2012]. The individual robots can share spatially and temporally dispersed data, such as images or temperatures, in order to learn better meta-models and adapt to new scenes. This teamwork is vital in circumstances where data collection is hard, such as natural disasters.
• Different hospitals or research groups can work on clinical risk prediction with limited patient health records [Zhang et al., 2019] or drug discovery with small amount of data [Altae-Tran et al., 2017]. The individual agents in this context will benefit from cooperation, while avoiding the need for a central hub in order to preserve the privacy of medical data.
• In some situations, it is advantageous to distribute a single agent problem over multiple agents. For example, training a MAML can be computationally demanding since it requires Hessian calculations [Finn et al., 2017]. In order to speed up the process, tasks can be divided into different workers or machines.
The contributions in this paper are three-fold:
• By combining MAML with the diffusion strategy for decentralized stochastic optimization [Sayed, 2014a], we propose Diffusion-based Model-Agnostic Meta-Learning (Dif-MAML). The result is a decentralized algorithm for meta-learning over a collection of distributed agents, where each agent is provided with tasks stemming from potentially different task distributions.
• We establish that, despite the decentralized nature of the algorithm, all agents agree quickly on a common launch model, which subsequently converges to a stationary point of the aggregate MAML objective over the task distribution across the network. This implies that Dif-MAML matches the performance of a centralized solution, which would have required central aggregation of data stemming from all tasks across the network.
In this way, agents will not only learn from locally observed tasks to accelerate future adaptation, but will also learn from each other, and from tasks seen by the other agents.
• We confirm through numerical experiments across a number of benchmark datasets that Dif-MAML outperforms the traditional non-cooperative solution and matches the performance of the centralized solution.
Notation. We denote random variables in boldface. Single data points are denoted by lowercase letters such as $x$, and batches of data are denoted by calligraphic capital letters such as $\mathcal{X}$. To refer to a loss function evaluated at a batch $\mathcal{X}$ with elements $\{x_n\}_{n=1}^{N}$, we use the notation $Q(w;\mathcal{X}) \triangleq \frac{1}{N}\sum_{n=1}^{N} Q(w;x_n)$. To denote expectation with respect to task-specific data, we use $\mathbb{E}_{x^{(t)}}$, where $t$ corresponds to the task.

Problem Formulation
We consider a collection of $K$ agents (e.g., robots, workers, machines, processors) where each agent $k$ is provided with data stemming from tasks $\mathcal{T}_k$. We denote the probability distribution over $\mathcal{T}_k$ by $\pi_k$, i.e., the probability of drawing task $t$ from $\mathcal{T}_k$ is $\pi_k(t)$. In principle, for any particular task $t \in \mathcal{T}_k$, each agent could learn a separate model $w_k^{o}(t)$ by solving:
$$ w_k^{o}(t) \triangleq \arg\min_{w}\; J_k^{(t)}(w), \qquad J_k^{(t)}(w) \triangleq \mathbb{E}_{x^{(t)}}\, Q_k^{(t)}\big(w;\, \boldsymbol{x}_k^{(t)}\big) \tag{1} $$
where $w$ denotes the model parametrization (such as the parameters of a neural network), while $\boldsymbol{x}_k^{(t)}$ denotes the random data corresponding to task $t$ observed at agent $k$. The loss $Q_k^{(t)}(w;\boldsymbol{x}_k^{(t)})$ denotes the penalization encountered by $w$ under the random data $\boldsymbol{x}_k^{(t)}$, while $J_k^{(t)}(w)$ represents the stochastic risk. Instead of training separately in this manner, meta-learning presumes an a priori relation between the tasks in $\mathcal{T}_k$ and exploits this fact. In particular, MAML seeks a "launch model" such that, when faced with data arising from a new task, the agent is able to update the launch model with a small number of task-specific gradient updates. It is common to allow for multiple gradient steps for task adaptation. For the analytical part of this work, we restrict ourselves to a single gradient step for simplicity. Nevertheless, our experimental results suggest that the theoretical conclusions hold more broadly, even when allowing for multiple gradient updates to the launch model. With a single gradient step, agent $k$ can seek a launch model by minimizing the modified risk function:
$$ J_k(w) \triangleq \mathbb{E}_{t\sim\pi_k}\, J_k^{(t)}\big(w - \alpha \nabla J_k^{(t)}(w)\big) \tag{2} $$
where $\alpha > 0$ is the step-size parameter. The resulting gradient vector is given by (assuming the possibility of exchanging expectation and gradient operations, which is valid under mild technical conditions):
$$ \nabla J_k(w) = \mathbb{E}_{t\sim\pi_k}\, \big(I - \alpha \nabla^2 J_k^{(t)}(w)\big)\, \nabla J_k^{(t)}\big(w - \alpha \nabla J_k^{(t)}(w)\big) \tag{3} $$
In practice, due to the lack of information about $\pi_k$ and the distribution of $\boldsymbol{x}_k^{(t)}$, evaluation of (2) and (3) is not feasible.
It is common to collect realizations of data and replace (3) by the stochastic gradient approximation:
$$ \nabla Q_k(w) \triangleq \frac{1}{|\mathcal{S}_k|} \sum_{t\in\mathcal{S}_k} \big(I - \alpha \nabla^2 Q_k^{(t)}(w;\, \mathcal{X}_{\text{in}}^{(t)})\big)\, \nabla Q_k^{(t)}\big(w - \alpha \nabla Q_k^{(t)}(w;\, \mathcal{X}_{\text{in}}^{(t)});\; \mathcal{X}_{o}^{(t)}\big) \tag{4} $$
where $\mathcal{X}_{\text{in}}^{(t)}$ and $\mathcal{X}_{o}^{(t)}$ are two random batches of data 1, $\mathcal{S}_k \subset \mathcal{T}_k$ is a random batch of tasks, and $|\mathcal{S}_k|$ is the number of selected tasks. We assume that all elements of $\mathcal{X}_{\text{in}}^{(t)}$ and $\mathcal{X}_{o}^{(t)}$ are independently sampled from the distribution of $\boldsymbol{x}_k^{(t)}$, and that all tasks $t \in \mathcal{S}_k$ are independently sampled from $\mathcal{T}_k$.
In a non-cooperative MAML setting, each agent $k$ would optimize (2) in an effort to obtain a launch model that is likely to adapt quickly to tasks similar to those encountered in $\mathcal{T}_k$. In a cooperative multi-agent setting, however, one would expect transfer learning to occur between agents. This motivates us to seek a decentralized scheme where the launch model obtained by agent $k$ is likely to generalize well to tasks similar to those observed by agent $\ell$ during training, for any pair of agents $(k, \ell)$. This can be achieved by pursuing a launch model that optimizes instead the aggregate risk:
$$ J(w) \triangleq \frac{1}{K} \sum_{k=1}^{K} J_k(w) \tag{5} $$
By pursuing this network objective in place of the individual objectives, the effective number of tasks and the amount of data each agent is trained on increase, and hence better generalization performance is expected. Even though both the centralized and decentralized strategies seek a solution to (5), in the decentralized strategy the agents rely only on their immediate neighbors and there is no central processor.
1 Different batches of data are used while computing the inner and outer gradients. The reason is that we want our model to adapt to models that perform well on data that is not used for training. If the two batches were the same, then this would train the launch model to be an initialization for task-specific models that memorize their training data. This memorization would get in the way of generalization.
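To make the two-batch construction concrete, the following is a minimal NumPy sketch (our own illustration, not the paper's implementation) of the stochastic MAML gradient in (4) for the simple quadratic loss $Q(w;x) = \tfrac{1}{2}\|w-x\|^2$, whose Hessian is the identity; all function and variable names here are ours.

```python
import numpy as np

def loss_grad(w, batch):
    # For Q(w; x) = 0.5 * ||w - x||^2, the batch gradient is w - mean(batch)
    return w - batch.mean(axis=0)

def maml_stochastic_grad(w, tasks, alpha):
    """Stochastic MAML gradient as in (4): average over sampled tasks of
    (I - alpha * H_in) @ grad(adapted model; outer batch), using separate
    inner/outer batches per task. The quadratic Hessian H_in is the identity."""
    M = w.size
    g = np.zeros(M)
    for X_in, X_out in tasks:
        H_in = np.eye(M)                              # Hessian of the quadratic loss
        w_adapted = w - alpha * loss_grad(w, X_in)    # one inner adaptation step
        g += (np.eye(M) - alpha * H_in) @ loss_grad(w_adapted, X_out)
    return g / len(tasks)
```

Keeping $\mathcal{X}_{\text{in}}^{(t)}$ and $\mathcal{X}_{o}^{(t)}$ distinct mirrors the footnote above: the outer gradient is evaluated on data not used for adaptation.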

Related Work
Early works on meta-learning, or learning to learn, date back to [Schmidhuber, 1987, 1992; Bengio et al., 1991; Bengio et al., 1992]. Recently, there has been increased interest in meta-learning, with various approaches such as learning an optimization rule [Andrychowicz et al., 2016; Ravi and Larochelle, 2017] or learning a metric that compares support and query samples for few-shot classification [Koch et al., 2015; Vinyals et al., 2016].
In this paper, we consider a parameter-initialization-based meta-learning algorithm. This kind of approach was introduced by MAML [Finn et al., 2017], which aims to find a good initialization (launch model) that can be adapted to new tasks rapidly. It is model-agnostic, which means that it can be applied to any model that is trained with gradient descent. MAML has shown competitive performance on benchmark few-shot learning tasks. Many algorithmic extensions have been proposed [Nichol et al., 2018; Finn et al., 2018; Raghu et al., 2020; Li et al., 2017], and several works have focused on the theoretical analysis and convergence of MAML [Fallah et al., 2020a; Ji et al., 2020; Zhuang et al., 2020; Balcan et al., 2019] in single-agent settings.
A different line of work [Khodak et al., 2019; Jiang et al., 2019; Fallah et al., 2020b; Chen et al., 2018] studies meta-learning in a federated setting, where the agents communicate with a central processor in a manner that preserves the privacy of their data. In particular, [Fallah et al., 2020b] and [Chen et al., 2018] propose algorithms that learn a globally shared launch model, which can be updated by a few agent-specific gradient steps for personalized learning. In contrast, we consider a decentralized scheme where there is no central node and only localized communications with neighbors occur. This leads to a more scalable and flexible system and avoids a communication bottleneck at the central processor.
Our extension of MAML is based on the diffusion algorithm for decentralized optimization [Sayed, 2014a]. While there exist many useful decentralized optimization strategies, such as consensus [Nedic and Ozdaglar, 2009; Xiao and Boyd, 2003; Yuan et al., 2016] and diffusion [Sayed, 2014a,b], the latter class of protocols has been shown to be particularly suitable for adaptive scenarios where the solutions need to track drifts in the data and models. Diffusion strategies have also been shown to lead to wider stability ranges and lower mean-square-error performance than other techniques in the context of adaptation and learning, due to an inherent symmetry in their structure. Several works have analyzed the performance of diffusion strategies, such as [Sayed, 2014b; Nassif et al., 2016; Chen and Sayed, 2015a,b]. The works [Vlaski and Sayed, 2019a,b] examined diffusion under non-convex losses and stochastic gradient conditions; these results are applicable to our work after proper adjustment, since the MAML risk takes a gradient term as an argument of the loss.

Dif-MAML
Our algorithm is based on the Adapt-then-Combine variant of the diffusion strategy [Sayed, 2014a].

Diffusion (Adapt-then-Combine)
The diffusion strategy is applicable to scenarios where $K$ agents, connected via a graph topology with combination matrix $A = [a_{\ell k}]$, collectively try to minimize an aggregate risk $J(w) \triangleq \frac{1}{K}\sum_{k=1}^{K} J_k(w)$, which includes the setting (5) considered in this work. To solve this objective, at every iteration $i$, each agent $k$ simultaneously performs the following steps:
$$ \phi_{k,i} = w_{k,i-1} - \mu\, \nabla Q_k(w_{k,i-1}) \tag{6a} $$
$$ w_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \phi_{\ell,i} \tag{6b} $$
where $\mu > 0$ is the step size and $\mathcal{N}_k$ denotes the neighborhood of agent $k$. The coefficients $\{a_{\ell k}\}$ are non-negative and add up to one:
$$ \sum_{\ell=1}^{K} a_{\ell k} = 1, \qquad a_{\ell k} > 0 \ \text{ if agents } \ell \text{ and } k \text{ are connected.} $$
For example, the matrix $A$ can be selected according to the Metropolis rule.
Expression (6a) is an adaptation step where all agents simultaneously obtain intermediate states $\phi_{k,i}$ by a stochastic gradient update. Recall that $\nabla Q_k(w_{k,i-1})$ from (4) is the stochastic approximation of the exact gradient $\nabla J_k(w_{k,i-1})$ from (3). Expression (6b) is a combination step where the agents combine their neighbors' intermediate states to obtain the updated iterates $w_{k,i}$.
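To make (6a)-(6b) concrete, here is a small self-contained simulation (our own toy construction, not from the paper) in which four agents run adapt-then-combine diffusion on simple scalar quadratic risks over a ring topology; the network centroid converges to the minimizer of the aggregate risk, while each individual agent stays within a small step-size-dependent neighborhood of it.

```python
import numpy as np

# Each agent k minimizes J_k(w) = 0.5 * (w - b_k)^2; the aggregate risk
# J(w) = (1/K) sum_k J_k(w) is minimized at w = mean(b) = 3.0.
K, mu, n_iter = 4, 0.05, 2000
b = np.array([1.0, 2.0, 3.0, 6.0])

# Doubly-stochastic combination matrix for a ring topology
A = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

w = np.zeros(K)                    # one scalar model per agent
for _ in range(n_iter):
    phi = w - mu * (w - b)         # (6a) adapt: local gradient step
    w = A.T @ phi                  # (6b) combine: w_k = sum_l a_{lk} phi_l
```

In this deterministic setting the centroid `w.mean()` converges to 3.0 exactly, because the columns of $A$ sum to one, while the individual agents disagree only by an amount proportional to the step size.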

Diffusion-based MAML (Dif-MAML)
We present the proposed algorithm for decentralized meta-learning in Algorithm 1. Each agent is assigned an initial launch model. At every iteration, the agents sample a batch of i.i.d. tasks from their agent-specific distribution of tasks. Then, in the inner loop, task-specific models are found by applying task-specific stochastic gradient updates to the launch models.

Algorithm 1: Dif-MAML
initialize launch models $w_{k,0}$ for all agents
for each iteration $i$ do
    for all agents $k$ do
        sample a batch of tasks $\mathcal{S}_k$ and data batches $\mathcal{X}_{\text{in}}^{(t)}$, $\mathcal{X}_{o}^{(t)}$ for each task
        compute the task-specific models $w_{k,i-1} - \alpha\,\nabla Q_k^{(t)}(w_{k,i-1};\mathcal{X}_{\text{in}}^{(t)})$ for each task (check (4) to see the gradient expression explicitly)
        obtain the intermediate state $\phi_{k,i} = w_{k,i-1} - \mu\,\nabla Q_k(w_{k,i-1})$
    end for
    for all agents $k$ do
        update the launch models by combining the intermediate states: $w_{k,i} = \sum_{\ell\in\mathcal{N}_k} a_{\ell k}\,\phi_{\ell,i}$
    end for
end for
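Under the same illustrative quadratic setup used throughout (our own toy construction, not the paper's implementation), the full Dif-MAML loop can be sketched as follows: each agent samples task data, computes the stochastic MAML gradient with separate inner and outer batches, takes the adaptation step (6a), and combines with its neighbors via (6b).

```python
import numpy as np

rng = np.random.default_rng(0)

def meta_grad(w, task_params, alpha, rng):
    # Stochastic MAML gradient for Q(w; x) = 0.5 * ||w - x||^2 (Hessian = I),
    # using separate inner/outer data batches per sampled task
    g = np.zeros_like(w)
    for theta in task_params:
        X_in = theta + rng.normal(0.0, 0.1, size=(10, w.size))   # inner batch
        X_out = theta + rng.normal(0.0, 0.1, size=(10, w.size))  # outer batch
        w_ad = w - alpha * (w - X_in.mean(axis=0))               # inner adaptation step
        g += (1.0 - alpha) * (w_ad - X_out.mean(axis=0))         # (I - alpha*I) @ outer grad
    return g / len(task_params)

K, mu, alpha = 3, 0.05, 0.1
A = np.full((K, K), 1.0 / K)                  # fully connected, uniform weights
# Each agent draws tasks centered at a different task parameter (overall mean 1.0)
agent_tasks = [[np.full(2, float(k))] for k in range(K)]
W = np.zeros((K, 2))                          # one launch model per agent
for _ in range(1500):
    Phi = np.stack([W[k] - mu * meta_grad(W[k], agent_tasks[k], alpha, rng)
                    for k in range(K)])       # adapt (6a)
    W = A.T @ Phi                             # combine (6b)
```

Despite each agent seeing only its own task distribution, all launch models agree and settle near the minimizer of the aggregate MAML objective, which for this quadratic construction lies at the average of the task parameters.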

Theoretical Results
In this section, we provide convergence analysis for Dif-MAML in non-convex environments.

Assumptions
Assumption 1 (Lipschitz gradients). For each agent $k$ and task $t \in \mathcal{T}_k$, the gradient $\nabla Q_k^{(t)}(\cdot\,;\cdot)$ is Lipschitz, namely, for any $w, u \in \mathbb{R}^M$ and $x_k^{(t)}$ denoting a data point:
$$ \big\|\nabla Q_k^{(t)}(w;\, x_k^{(t)}) - \nabla Q_k^{(t)}(u;\, x_k^{(t)})\big\| \le L_k^{(t)}(x_k^{(t)})\, \|w - u\| \tag{7} $$
We assume the second-order moment of the Lipschitz constant is bounded by a data-independent constant:
$$ \mathbb{E}\,\big[L_k^{(t)}(\boldsymbol{x}_k^{(t)})\big]^2 \le \big(L_k^{(t)}\big)^2 \tag{8} $$
We establish in Appendix A.1 that Assumption 1 holds for gradients under a batch of data. In this paper, for simplicity, we will mostly work with $L_{\max} \triangleq \max_{k,t} L_k^{(t)}$.

Assumption 2 (Lipschitz Hessians). For each agent $k$ and task $t \in \mathcal{T}_k$, the Hessian $\nabla^2 Q_k^{(t)}(\cdot\,;\cdot)$ is Lipschitz in expectation, namely, for any $w, u \in \mathbb{R}^M$ and $x_k^{(t)}$ denoting a data point:
$$ \mathbb{E}\,\big\|\nabla^2 Q_k^{(t)}(w;\, \boldsymbol{x}_k^{(t)}) - \nabla^2 Q_k^{(t)}(u;\, \boldsymbol{x}_k^{(t)})\big\| \le \rho_k^{(t)}\, \|w - u\| \tag{9} $$
We establish in Appendix A.2 that Assumption 2 holds for Hessians under a batch of data. In this paper, for simplicity, we will mostly work with $\rho_{\max} \triangleq \max_{k,t} \rho_k^{(t)}$.

Assumption 3 (Bounded gradients). For each agent $k$ and task $t \in \mathcal{T}_k$, the gradient $\nabla Q_k^{(t)}(\cdot\,;\cdot)$ is bounded in expectation, namely, for any $w \in \mathbb{R}^M$ and $x_k^{(t)}$ denoting a data point:
$$ \mathbb{E}\,\big\|\nabla Q_k^{(t)}(w;\, \boldsymbol{x}_k^{(t)})\big\| \le B_k^{(t)} \tag{10} $$
We establish in Appendix A.3 that Assumption 3 holds for gradients under a batch of data. In this paper, for simplicity, we will mostly work with $B_{\max} \triangleq \max_{k,t} B_k^{(t)}$.

Assumption 4 (Bounded noise moments). For each agent $k$ and task $t \in \mathcal{T}_k$, the gradient $\nabla Q_k^{(t)}(\cdot\,;\cdot)$ and the Hessian $\nabla^2 Q_k^{(t)}(\cdot\,;\cdot)$ have bounded fourth-order central moments, namely, for any $w \in \mathbb{R}^M$:
$$ \mathbb{E}\,\big\|\nabla Q_k^{(t)}(w;\, \boldsymbol{x}_k^{(t)}) - \nabla J_k^{(t)}(w)\big\|^4 \le \sigma_g^4 \tag{11} $$
$$ \mathbb{E}\,\big\|\nabla^2 Q_k^{(t)}(w;\, \boldsymbol{x}_k^{(t)}) - \nabla^2 J_k^{(t)}(w)\big\|^4 \le \sigma_h^4 \tag{12} $$
We establish in Appendix A.4 that Assumption 4 holds for gradients and Hessians under a batch of data.
Defining the mean of the risk functions of the tasks in $\mathcal{T}_k$ as $\bar{J}_k(w) \triangleq \mathbb{E}_{t\sim\pi_k}\, J_k^{(t)}(w)$, we have the following assumption on the relations between the tasks of a particular agent.

Assumption 5 (Bounded task variability). For each agent $k$, the gradient $\nabla J_k^{(t)}(\cdot)$ and the Hessian $\nabla^2 J_k^{(t)}(\cdot)$ have bounded fourth-order central moments, namely, for any $w \in \mathbb{R}^M$:
$$ \mathbb{E}_{t\sim\pi_k}\,\big\|\nabla J_k^{(t)}(w) - \nabla \bar{J}_k(w)\big\|^4 \le \tau_g^4, \qquad \mathbb{E}_{t\sim\pi_k}\,\big\|\nabla^2 J_k^{(t)}(w) - \nabla^2 \bar{J}_k(w)\big\|^4 \le \tau_h^4 $$
Note that we do not assume any constraint on the relations between tasks of different agents.
Assumption 6 (Doubly-stochastic combination matrix). The weighted combination matrix $A = [a_{\ell k}]$ representing the graph is doubly stochastic. This means that the matrix has non-negative elements and satisfies:
$$ A\mathbb{1} = \mathbb{1}, \qquad A^{\mathsf{T}}\mathbb{1} = \mathbb{1} $$
The matrix $A$ is also primitive, which means that a path with positive weights can be found between any pair of nodes $(k, \ell)$ and, moreover, $a_{kk} > 0$ for at least one agent $k$.
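The Metropolis rule mentioned earlier gives one concrete way of building a combination matrix that satisfies Assumption 6. The sketch below (our own helper, with the neighborhood size taken to include the node itself) constructs such a matrix from an undirected adjacency matrix.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis rule: a_{lk} = 1 / max(n_l, n_k) for connected l != k,
    where n_k is the neighborhood size of agent k (including itself), and
    a_{kk} = 1 - sum of the other entries in column k."""
    K = adj.shape[0]
    n = adj.sum(axis=1) + 1                # neighborhood sizes (self included)
    A = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            if l != k and adj[l, k]:
                A[l, k] = 1.0 / max(n[l], n[k])
        A[k, k] = 1.0 - A[:, k].sum()
    return A

# Path graph over 4 agents: 0 - 1 - 2 - 3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
A = metropolis_weights(adj)
```

For a connected graph, the resulting matrix is symmetric and doubly stochastic with positive diagonal entries, hence primitive, so its mixing rate (the second-largest eigenvalue magnitude) is strictly less than one.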

Alternative MAML Objective
The stochastic MAML gradient (4), because of the gradient within a gradient, is not an unbiased estimator of (3). We therefore consider the following alternative objective in place of (2):
$$ \widetilde{J}_k(w) \triangleq \mathbb{E}_{t\sim\pi_k}\, \mathbb{E}_{\mathcal{X}_{\text{in}}^{(t)}}\, J_k^{(t)}\big(w - \alpha \nabla Q_k^{(t)}(w;\, \mathcal{X}_{\text{in}}^{(t)})\big) \tag{16} $$
The gradient corresponding to this objective is the expectation of the stochastic MAML gradient (4):
$$ \nabla \widetilde{J}_k(w) = \mathbb{E}\, \nabla Q_k(w) \tag{17} $$
See Table 1 for a summary of the notation used in the paper. We establish (17) in Appendix B.1. This means that the stochastic MAML gradient (4) is an unbiased estimator of the gradient of the alternative objective (16).
While the MAML objective (2) captures the goal of coming up with a launch model that performs well after a gradient step, the adjusted objective (16) searches for a launch model that performs well after a stochastic gradient step. Using the adjusted objective allows us to analyze the convergence of Dif-MAML by exploiting the fact that it results in an unbiased stochastic gradient approximation.
In the following two lemmas, we perform perturbation analyses on the MAML objective $J_k(w)$ and the adjusted objective $\widetilde{J}_k(w)$. We will work with $\widetilde{J}_k(w)$ afterwards. At the end of our theoretical analysis, we will use the perturbation results to establish convergence to stationary points for both objectives.

Lemma 1 (Objective perturbation bound). Under Assumptions 1, 3, 4, for each agent $k$, the disagreement between $J_k(\cdot)$ and $\widetilde{J}_k(\cdot)$ is bounded, namely, for any $w \in \mathbb{R}^M$: 2
$$ \big| \widetilde{J}_k(w) - J_k(w) \big| \le c_1\, \frac{\alpha^2}{|\mathcal{X}_{\text{in}}|} $$
for a non-negative constant $c_1$ that depends only on the constants in the assumptions.

2 In this paper, for simplicity, we assume that for each agent and task the inner batches have the same size, denoted by $|\mathcal{X}_{\text{in}}|$.

Table 1: Summary of some notation used in the paper.
Next, we perform a perturbation analysis at the gradient level.

Lemma 2 (Gradient perturbation bound). Under Assumptions 1, 3, 4, for each agent $k$, the disagreement between $\nabla J_k(\cdot)$ and $\nabla\widetilde{J}_k(\cdot)$ is bounded, namely, for any $w \in \mathbb{R}^M$:
$$ \big\| \nabla\widetilde{J}_k(w) - \nabla J_k(w) \big\| \le c_2\, \frac{\alpha}{\sqrt{|\mathcal{X}_{\text{in}}|}} $$
for a non-negative constant $c_2$ that depends only on the constants in the assumptions.

Lemmas 1 and 2 show that the standard MAML objective and the adjusted objective approach each other as the inner learning rate $\alpha$ decreases and the inner batch size $|\mathcal{X}_{\text{in}}|$ increases. Next, we establish some properties of the adjusted objective, which will be called upon in the analysis.
Lemma 3 (Bounded gradient of adjusted objective). Under Assumptions 1, 3, for each agent $k$, the gradient $\nabla\widetilde{J}_k(\cdot)$ of the adjusted objective is bounded, namely, for any $w \in \mathbb{R}^M$:
$$ \big\|\nabla\widetilde{J}_k(w)\big\| \le (1 + \alpha L_{\max})\, B_{\max} $$
Lemma 4 (Lipschitz gradient of adjusted objective). Under Assumptions 1-3, for each agent $k$, the gradient $\nabla\widetilde{J}_k(\cdot)$ of the adjusted objective is Lipschitz, namely, for any $w, u \in \mathbb{R}^M$:
$$ \big\|\nabla\widetilde{J}_k(w) - \nabla\widetilde{J}_k(u)\big\| \le \widetilde{L}\, \|w - u\| $$
where $\widetilde{L} \triangleq L_{\max}(1 + \alpha L_{\max})^2 + \alpha\rho_{\max}B_{\max}$ is a non-negative constant.
Lemma 5 (Gradient noise for adjusted objective). Under Assumptions 1-5, the gradient noise, defined as $\nabla Q_k(w) - \nabla\widetilde{J}_k(w)$, is bounded for any $w \in \mathbb{R}^M$, namely:
$$ \mathbb{E}\,\big\|\nabla Q_k(w) - \nabla\widetilde{J}_k(w)\big\|^2 \le C^2 $$
for a non-negative constant $C^2$, whose expression is given in (123) in Appendix B.6.

Evolution Analysis
In this section, we analyze the Dif-MAML algorithm over the network. The analysis is similar to [Vlaski and Sayed, 2019a,b]. We first prove that the agents cluster around the network centroid within $O(\log(1/\mu)) = o(1/\mu)$ iterations, and then show that this centroid reaches an $O(\mu)$-mean-square-stationary point in at most $O(1/\mu^2)$ iterations. Figure 1 summarizes the analysis.
The network centroid is defined as
$$ w_{c,i} \triangleq \frac{1}{K}\sum_{k=1}^{K} w_{k,i} $$
i.e., it is the average of the agents' parameters. In the following theorem, we study the difference between the centroid launch model and the launch model of each agent $k$.
Theorem 1 (Network disagreement). Under Assumptions 1-6, the disagreement between the centroid launch model and the launch model of each agent $k$ is bounded after $O(\log(1/\mu)) = o(1/\mu)$ iterations, namely:
$$ \frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\,\big\|w_{c,i} - w_{k,i}\big\|^2 \le O(\mu^2), \qquad \text{for } i \ge i_o = O(\log(1/\mu)) $$
where the constant hidden in $O(\mu^2)$ depends on the gradient-noise bound $C^2$ and on $\lambda_2$, the mixing rate of the combination matrix $A$, i.e., the spectral radius of $A^{\mathsf T} - \frac{1}{K}\mathbb{1}\mathbb{1}^{\mathsf T}$.

Proof. See Appendix C.1.
In Theorem 1, we proved that the disagreement between the centroid launch model and the agent-specific launch models is bounded after a sufficient number of iterations. Therefore, we can use the centroid model as a proxy for all models and establish the convergence properties for it.

Theorem 2 (Stationary points of adjusted objective). In addition to Assumptions 1-6, assume that $\widetilde{J}(w)$ is bounded from below, i.e., $\widetilde{J}(w) \ge \widetilde{J}^{\,o}$. Then, the centroid launch model $w_{c,i}$ will reach an $O(\mu)$-mean-square-stationary point in at most $O(1/\mu^2)$ iterations. In particular, there exists a time instant $i^\star \le O(1/\mu^2)$ such that:
$$ \mathbb{E}\,\big\|\nabla\widetilde{J}(w_{c,i^\star})\big\|^2 \le O(\mu) \tag{26} $$
Proof. See Appendix C.2.
Next, we prove that a similar conclusion holds for the standard MAML objective, by appealing to the gradient perturbation bound relating it to the adjusted objective (Lemma 2).

Corollary 1 (Stationary points of MAML objective). Assume that the conditions of Theorem 2 hold. Then, the centroid launch model $w_{c,i}$ will reach an $O(\mu)$-mean-square-stationary point, up to a constant, in at most $O(1/\mu^2)$ iterations. Namely, for the time instant $i^\star$ defined in (26):
$$ \mathbb{E}\,\big\|\nabla J(w_{c,i^\star})\big\|^2 \le O(\mu) + O\!\left(\frac{\alpha^2}{|\mathcal{X}_{\text{in}}|}\right) $$
Corollary 1 states that the centroid launch model can reach an $O(\mu)$-mean-square-stationary point, for a sufficiently small inner learning rate $\alpha$ and a sufficiently large inner batch size $|\mathcal{X}_{\text{in}}|$, in at most $O(1/\mu^2)$ iterations. Note that, as $\mu \to 0$, the number of iterations required for network agreement, $O(\log(1/\mu))$, becomes negligible compared to the number of iterations necessary for convergence, $O(1/\mu^2)$.

Experiments
In this section, we provide experimental evaluations.
In particular, we present comparisons between the centralized, diffusion-based decentralized, and noncooperative strategies. Our demonstrations cover both regression and classification tasks. Even though our theoretical analysis is general with respect to various learning models, for the experiments, our focus is on neural networks.
For all cases, we consider the network with the underlying graph in Figure 2a with K = 6 agents. The centralized strategy corresponds to a central processor that has access to all data and tasks. Note that this is equivalent to having a network with a fully-connected graph. The non-cooperative strategy represents a solution where agents do not communicate with each other. In other words, they all try to learn separate launch models.

Regression
For regression, we consider the benchmark from [Finn et al., 2017], in which each task requires predicting the output of a sine wave from its input. The test performance during training, averaged over 1000 tasks, is shown in Figure 2b. It can be seen that Dif-MAML quickly converges to the centralized solution and clearly outperforms the non-cooperative solution throughout training. This suggests that cooperation helps even when agents have access to different task distributions. Moreover, we also test the performance after training as a function of the number of gradient updates used for adaptation in Figure 2c. The match between the centralized and decentralized solutions persists, and the performance of the non-cooperative solution remains inferior. Note that this plot also shows the average performance over 1000 tasks.

Classification
For classification, we consider widely used few-shot image recognition tasks on the Omniglot [Lake et al., 2015] and MiniImagenet [Ravi and Larochelle, 2017] datasets (see Appendix D.2 for dataset details). In contrast to the regression experiment, in these simulations all agents have access to the same tasks and data. However, in the centralized and decentralized strategies, the effective number of samples is larger, as we limit the number of data and tasks processed at each agent. See Appendix D.3 for details on the architecture and hyperparameters. Average accuracy on test tasks at every 50th training iteration is shown in Figure 3 for the MiniImagenet 5-way 5-shot setting trained with Adam. See Appendix D.4 for additional experiments on 5-way 1-shot MiniImagenet, 5-way 1-shot and 20-way 1-shot Omniglot, and the SGD variants. Similar to the regression experiment, the decentralized solution matches the centralized solution and is substantially better than the non-cooperative solution.
Moreover, we observe that batch normalization [Ioffe and Szegedy, 2015] is necessary for applying Dif-MAML, and diffusion in general, to neural networks, since the combination step (6b) reduces the variance of the weights due to averaging.

Conclusion
In this paper, we proposed a decentralized algorithm for meta-learning. Our theoretical analysis establishes that the agents' launch models cluster quickly in a small region around the centroid model and this centroid model reaches a stationary point after sufficient iterations. We illustrated by means of extensive experiments on regression and classification problems that the performance of Dif-MAML indeed consistently coincides with the centralized strategy and surpasses the non-cooperative strategy significantly.

APPENDIX A The Implications of Assumptions for Batches of Data
In Appendices A.1-A.4, we denote a batch of data by $\mathcal{X}$ with elements $\{x_n\}_{n=1}^{N}$.

A.1 The Implication of Assumption 1

Assumption 1 implies, for the stochastic gradient constructed using a batch, that the Lipschitz property (7) and the moment bound (8) continue to hold with the batch Lipschitz constant $L_k^{(t)}(\mathcal{X}) \triangleq \frac{1}{N}\sum_{n=1}^{N} L_k^{(t)}(x_n)$.

Proof. For the stochastic gradients under a batch of data:
$$ \big\|\nabla Q_k^{(t)}(w;\mathcal{X}) - \nabla Q_k^{(t)}(u;\mathcal{X})\big\| = \Big\|\frac{1}{N}\sum_{n=1}^{N}\big(\nabla Q_k^{(t)}(w;x_n) - \nabla Q_k^{(t)}(u;x_n)\big)\Big\| \overset{(a)}{\le} \frac{1}{N}\sum_{n=1}^{N}\big\|\nabla Q_k^{(t)}(w;x_n) - \nabla Q_k^{(t)}(u;x_n)\big\| \overset{(b)}{\le} L_k^{(t)}(\mathcal{X})\,\|w-u\| $$
where (a) follows from Jensen's inequality, and (b) follows from (7). Likewise,
$$ \mathbb{E}\,\big[L_k^{(t)}(\mathcal{X})\big]^2 = \mathbb{E}\,\Big[\frac{1}{N}\sum_{n=1}^{N} L_k^{(t)}(\boldsymbol{x}_n)\Big]^2 \overset{(a)}{\le} \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\,\big[L_k^{(t)}(\boldsymbol{x}_n)\big]^2 \overset{(b)}{\le} \big(L_k^{(t)}\big)^2 $$
where (a) follows from Jensen's inequality, and (b) follows from (8).

A.2 The Implication of Assumption 2
Assumption 2 implies, for the loss Hessian under a batch of data:
$$ \mathbb{E}\,\big\|\nabla^2 Q_k^{(t)}(w;\mathcal{X}) - \nabla^2 Q_k^{(t)}(u;\mathcal{X})\big\| \le \rho_k^{(t)}\,\|w-u\| $$
Proof. For the loss Hessians under a batch of data:
$$ \mathbb{E}\,\Big\|\frac{1}{N}\sum_{n=1}^{N}\big(\nabla^2 Q_k^{(t)}(w;\boldsymbol{x}_n) - \nabla^2 Q_k^{(t)}(u;\boldsymbol{x}_n)\big)\Big\| \overset{(a)}{\le} \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\,\big\|\nabla^2 Q_k^{(t)}(w;\boldsymbol{x}_n) - \nabla^2 Q_k^{(t)}(u;\boldsymbol{x}_n)\big\| \overset{(b)}{\le} \rho_k^{(t)}\,\|w-u\| $$
where (a) follows from Jensen's inequality, and (b) follows from (9).

A.3 The Implication of Assumption 3
Assumption 3 implies, for the loss gradient under a batch of data:
$$ \mathbb{E}\,\big\|\nabla Q_k^{(t)}(w;\mathcal{X})\big\| \le B_k^{(t)} $$
Proof. The bound for the norm of the stochastic gradients constructed using a batch is derived as follows:
$$ \mathbb{E}\,\Big\|\frac{1}{N}\sum_{n=1}^{N}\nabla Q_k^{(t)}(w;\boldsymbol{x}_n)\Big\| \overset{(a)}{\le} \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\,\big\|\nabla Q_k^{(t)}(w;\boldsymbol{x}_n)\big\| \overset{(b)}{\le} B_k^{(t)} $$
where (a) follows from Jensen's inequality, and (b) follows from (10).

A.4 The Implication of Assumption 4
Assumption 4 implies that the gradient and the Hessian under a batch of data also have bounded fourth-order central moments, with constants reduced by the batch size ((36) and (37)). Proof. We apply induction on the batch size $N$. Then, we get a chain of inequalities in which (a) follows from expanding the square and dropping the cross-terms that are zero due to the independence assumption on the data, (b) follows from the Cauchy-Schwarz inequality, (c) follows from the independence assumption on the data, and (d) follows from the induction hypothesis, (11), and the following variance-reduction formula for i.i.d. quantities $\{\boldsymbol{g}_n\}$ with mean $\bar{g}$:
$$ \mathbb{E}\,\Big\|\frac{1}{N}\sum_{n=1}^{N}\boldsymbol{g}_n - \bar{g}\Big\|^2 = \frac{1}{N}\,\mathbb{E}\,\big\|\boldsymbol{g} - \bar{g}\big\|^2 $$
To prove (37), it suffices to replace the gradients with the Hessians in (38).
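The variance-reduction formula invoked in step (d), namely that the second central moment of a batch average of N i.i.d. quantities is 1/N times the per-sample second central moment, can be checked numerically with a quick sanity experiment of ours (not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, N, trials = 4.0, 8, 200_000

# i.i.d. zero-mean samples; the batch average should have variance sigma2 / N
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
batch_means = samples.mean(axis=1)
empirical_var = batch_means.var()   # should be close to sigma2 / N = 0.5
```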

B Alternative MAML Objective Proofs
B.1 Proof of (17)

Recall the definition of the adjusted objective:
$$ \widetilde{J}_k(w) \triangleq \mathbb{E}_{t\sim\pi_k}\, \mathbb{E}_{\mathcal{X}_{\text{in}}^{(t)}}\, J_k^{(t)}\big(w - \alpha \nabla Q_k^{(t)}(w;\, \mathcal{X}_{\text{in}}^{(t)})\big) $$
The gradient corresponding to this objective is:
$$ \nabla\widetilde{J}_k(w) = \mathbb{E}_{t\sim\pi_k}\, \mathbb{E}_{\mathcal{X}_{\text{in}}^{(t)}}\, \big(I - \alpha \nabla^2 Q_k^{(t)}(w;\, \mathcal{X}_{\text{in}}^{(t)})\big)\, \nabla J_k^{(t)}\big(w - \alpha \nabla Q_k^{(t)}(w;\, \mathcal{X}_{\text{in}}^{(t)})\big) $$
Taking the expectation of the stochastic MAML gradient (4) yields the same expression, where (a) follows from the i.i.d. assumption on the batch of tasks, (b) follows from conditioning on $\mathcal{X}_{\text{in}}^{(t)}$, and (c) follows from the relation between loss functions and stochastic risks. This establishes (17).

B.2 Proof of Lemma 1
The disagreement between (2) and (16) is bounded as follows, where $w_1 \triangleq w - \alpha\nabla Q_k^{(t)}(w;\mathcal{X}_{\text{in}}^{(t)})$, $w_2 \triangleq w - \alpha\nabla J_k^{(t)}(w)$, and (a) follows from Jensen's inequality. The Lipschitz property of the gradient (Assumption 1) implies a quadratic upper bound on the difference $J_k^{(t)}(w_1) - J_k^{(t)}(w_2)$. Combining the inequalities yields the result, where (a) follows from Assumption 3, (b) follows from inserting the expressions for $w_1$ and $w_2$, and (c) follows from Assumption 4.

B.3 Proof of Lemma 2
Recall (3) and (17): The norm of the disagreement then follows: where (a) follows from applying Jensen's inequality and rearranging the terms, and (b) follows from applying triangle inequality. We bound the terms in (50) separately. For the first term we have: where (a) follows from Assumption 1, and (b) follows from Assumption 4.
Rewriting the second term in (50), (a) follows from adding and subtracting an intermediate cross term and applying the triangle inequality. We bound the terms in (52) separately. For the first term, (a) follows from the sub-multiplicativity of the norm, (b) follows from Assumption 3, and (c) follows from Assumption 4. For the second term in (52), (a) follows from the sub-multiplicativity of the norm, (b) and (c) follow from Assumption 1, and (d) follows from Assumption 4. Combining the results completes the proof.

B.4 Proof of Lemma 3
Recall the formula for the gradient of the adjusted objective (17). The bound then follows from a chain of inequalities in which (a) follows from Jensen's inequality, (b) follows from the sub-multiplicativity of the norm, (c) follows from conditioning on $\mathcal{X}_{\text{in}}^{(t)}$, (d) follows from Assumption 3, and (e) follows from Assumption 1.

B.5 Proof of Lemma 4
Define the intermediate variables $w_2, u_2$ obtained by applying the stochastic inner update to $w$ and $u$, respectively. Recall the formula for the gradient of the adjusted objective (17). Bounding the disagreement, (a) follows from Jensen's inequality, and (b) follows from the triangle inequality. We bound the terms in (60) separately. For the first term, (a) follows from Assumption 1, (b) follows from replacing $w_2$ and $u_2$ and applying the triangle inequality, (c) follows from Assumption 1, and (d) follows from the independence assumption on $\mathcal{X}_{\text{in}}^{(t)}$, $\mathcal{X}_{o}^{(t)}$ and taking the expectation. For the second term, (a) follows from adding and subtracting the same term and the triangle inequality. For the first term in (63), (a) follows from the sub-multiplicativity of the norm, (b) follows from Assumption 1 and (61), and (c) follows from the independence assumption on $\mathcal{X}_{\text{in}}^{(t)}$, $\mathcal{X}_{o}^{(t)}$ and taking the expectation. For the second term in (63), (a) follows from the sub-multiplicativity of the norm, (b) follows from conditioning on $\mathcal{X}_{\text{in}}^{(t)}$, (c) follows from Assumption 3, and (d) follows from Assumption 2. Combining the results completes the proof.

B.6 Proof of Lemma 5
We will first prove three intermediate lemmas, then conclude the proof.
First, we define the task-specific meta-gradient and the task-specific stochastic meta-gradient (the per-task summand of (4) and its expectation over the data batches).

Lemma 6. Under Assumptions 1, 3, 4, for each agent $k$, the disagreement between the task-specific stochastic meta-gradient $\nabla Q_k^{(t)}(\cdot)$ and the task-specific meta-gradient $\nabla J_k^{(t)}(\cdot)$ is bounded in expectation, namely, for any $w \in \mathbb{R}^M$, by a constant $C_1$. Proof. Define the Hessian-noise and gradient-noise error terms $e_{h,x}^{(t)}$ and $e_{g,x}^{(t)}$. Rewriting (66) in terms of these error terms and bounding the disagreement, (a) follows from the sub-multiplicativity of the norm and the triangle inequality, and (b) follows from $ab \le \frac{a^2 + b^2}{2}$.
Taking the square of the norm and then the expectation with respect to the inner and outer batches of data yields (75). We bound the terms of (75) one by one.
where (a) follows from Assumption 1.
where (a) follows from defining $w_2 \triangleq w - \alpha\nabla Q_k^{(t)}(w;\mathcal{X}_{\text{in}}^{(t)})$ and Assumption 4.
where (a) follows from Assumption 1, (b) follows from Assumption 4, (c) follows from taking square and expectation of (a), and (d) follows from Assumption 4.
where (a) and (b) follow from Assumption 4. Moreover, the remaining terms are bounded because of Assumption 3, where (a) follows from (78) and (b) follows from (84). Substituting the results into (75) completes the proof.
Proof. Recall the definitions and define the error terms accordingly. Substituting the new error definitions, we obtain a bound in which (a) follows from sub-multiplicativity and the triangle inequality, and (b) follows from $ab \le \frac{a^2 + b^2}{2}$. Applying the Cauchy-Schwarz inequality and taking expectations yields (95). We bound the terms in (95) one by one. Note that one term is bounded directly by Assumption 1, while for another, (a) follows from the definition $w_3 \triangleq w - \alpha\nabla\bar{J}_k(w)$, where $\bar{J}_k$ denotes the task-averaged risk, (b) follows from the triangle inequality, and (c) follows from Assumption 1.
For the fourth-order moment of $e_{g,t}^{(t)}$, using $(a+b)^4 \le 8(a^4 + b^4)$ (which follows from convexity) and taking expectations results in a bound in which (a) follows from Assumption 5. Also, by Assumption 3, the remaining term is bounded, where (a) follows from Assumption 5, (b) follows from Jensen's inequality, and (c) follows from taking the square root of (a). Inserting all the results into (95) completes the proof.
Next, we prove the last intermediate lemma.
Proof. Recall the definitions and define the error terms accordingly. Then, we can rewrite the components of the gradient of the adjusted objective in terms of these error terms and bound the distance, where (a) follows from Jensen's inequality, and (b) follows from the triangle inequality together with $(a+b)(c+d) \le a^2 + b^2 + c^2 + d^2$ (a consequence of $xy \le \frac{x^2 + y^2}{2}$ and sub-multiplicativity).
Inserting all the bounds into (112) completes the proof. Now, combining the results of the previous three intermediate lemmas, we prove that
$$ C^2 = \frac{3}{|\mathcal{S}_k|}\big(C_1^2 + C_2^2 + C_3^2\big) $$
where the expressions for $C_1$, $C_2$ and $C_3$ are given in Lemma 6, Lemma 7 and Lemma 8, respectively. Proof.

D.1 The Regression Experiment Details
The same model architecture (a neural network with 2 hidden layers of 40 neurons each, with ReLU activations) is used for each agent. The loss function is the mean-squared error. As in [Finn et al., 2017], during training, 10 random points (10-shot) are chosen from each sinusoid and used with 1 stochastic gradient update (α = 0.01). For the Adam experiment, µ = 0.001, and for the SGD experiment, µ = 0.005. Each agent is trained on 4000 tasks over 6 epochs (24000 iterations in total). As in training, 10 data points from each sinusoid and 1 gradient update are used for adaptation.

D.2 The Classification Experiment Dataset Details
The Omniglot dataset comprises 1623 characters from 50 different alphabets. Each character has 20 samples, which were hand-drawn by 20 different people. It is therefore suitable for few-shot learning scenarios, as there is a small number of samples per class. The number of gradient updates is equal to 5 for training and 10 for testing. For 5-way 1-shot, the training meta-batch has 4 tasks, whereas the 5-way 5-shot training meta-batch has 2 tasks. Note that the first testing occurs after the first training step. In other words, the first data point of all classification plots is at the 1st iteration, not at the 0th iteration.

D.4 Additional Plots
In this section, we provide additional plots.
In Figure 6, the results of the SGD experiment on the regression setting can be found. Evidently, Dif-MAML matches the centralized solution and outperforms the non-cooperative solution, as our analysis suggests. Also, similar to the Adam experiment, the relative performances remain the same as the number of gradient updates varies.
In Figures 7, 8, 9, and 10, additional plots for MiniImagenet 5-way 5-shot, MiniImagenet 5-way 1-shot, Omniglot 5-way 1-shot, and Omniglot 20-way 1-shot can be found, respectively. The results confirm that our conclusions are valid for different task distributions, and that they extend to Adam as well as to multi-step adaptation in the inner loop.