Online Decentralized Multi-Agent Meta-Learning With Byzantine Resiliency

Meta-learning is a learning-to-learn paradigm that leverages past learning experiences for quick adaptation to new learning tasks. It has wide applications in few-shot learning, reinforcement learning, neural architecture search, federated learning, and beyond. It has been extended to the online learning setting, where task data distributions arrive sequentially, enabling continuous lifelong learning. However, in the online meta-learning setting, a single agent has to learn many varieties of related tasks, yet it is limited to its local task data. Therefore, online decentralized meta-learning algorithms are designed to allow an agent to collaborate with neighboring agents in order to improve its learning performance. Despite their advantages, online decentralized meta-learning algorithms are susceptible to Byzantine attacks caused by the diffusion of poisonous information from unidentifiable Byzantine agents in the network. This is a serious problem in which normal agents are unable to learn and convergence to the global meta-initializer is thwarted. State-of-the-art algorithms designed to provide robustness against Byzantine attacks, such as BRIDGE, are slow and cannot work in online learning settings. Therefore, we propose an online decentralized meta-learning algorithm that works with two Byzantine-resilient aggregation techniques: modified coordinate-wise screening and centerpoint aggregation. The proposed algorithm provides faster convergence and guarantees both resiliency and continuous lifelong learning. Our simulation results show that the proposed algorithm performs better than state-of-the-art algorithms.


I. INTRODUCTION
Meta-learning is the process of extracting experience from multiple related tasks over a series of learning episodes to improve learning performance on new tasks. This is similar to the human brain, which leverages past experiences to learn new tasks. Thus, meta-learning is a learning-to-learn paradigm that improves data and computational efficiency. Meta-learning overcomes the weakness of conventional deep learning algorithms, which do not leverage past experiences for quick adaptation [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Ramakrishnan Srinivasan.
Model-Agnostic Meta-Learning (MAML), proposed by Finn et al. [12], is a seminal work on the application of stochastic gradient descent in meta-learning. The work aims at learning a suitable meta-initialization model at training time that can generalize to multiple related tasks at test time. From this work, meta-learning has grown into an active research area in machine learning. Refer to [1], [13], [14], and [15] for in-depth surveys on meta-learning. The majority of published work in meta-learning focuses on settings where the task distribution is fixed. This does not model the lifelong learning capabilities of humans.
Online meta-learning (OML) was developed to enable lifelong learning capabilities similar to humans. It connects two distinct fields of machine learning: meta-learning and online learning. In online learning, task data arrive sequentially and are drawn from a time-varying task data distribution [3], [16]. Thus, an agent learns, in a sequential manner, a good meta-initialization that can quickly adapt to new tasks [17]. Most variants of OML algorithms are based on online convex optimization approaches, such as online mirror descent and online gradient descent [18], [19]. Although these algorithms can support continual lifelong learning, a single agent has to learn many tasks on its own, leading to a cold-start problem [19].
Decentralized meta-learning allows multiple agents spatially distributed within a task environment to collaborate among themselves. Each agent only needs to learn from its local task data and share its local model with other agents in its neighborhood [20], [21], [22], [23], [24]. This way, learning is fast and the cold-start problem is eliminated [19]. Convergence is better in multi-agent meta-learning than in single-agent meta-learning. Related to decentralized networks are distributed networks based on a server-client architecture. Decentralized learning avoids the congestion and single point of failure that occur in distributed (or centralized) learning, where a central coordinator coordinates the interactions among the agents [25]. There are a few works on distributed meta-learning, some of which are applied to federated learning [26], [27], [28]. Since these works do not incorporate online learning, they are not capable of continuous lifelong learning, even though they address the cold-start problem.
Online decentralized meta-learning algorithms can allow multiple agents to cooperatively learn a meta-initializer in a sequential manner and can also address the cold-start problem [19]. However, these algorithms do not provide resiliency against malicious attacks from Byzantine agents in the network [29]. When each agent receives task-related information from its neighbors, the agent does not screen the information before using it to update its local model. Thus, the network becomes susceptible to malicious attacks. Therefore, the existing online decentralized meta-learning approach does not accurately model the human brain, which acts like a natural filter, screening out fake information received from the environment while engaging in continuous lifelong learning.
One of the interesting areas of research in distributed and decentralized networks is providing resiliency against Byzantine attacks [30], [31], [32], [33], [34], [35]. Non-Byzantine agents are said to be normal or non-faulty. A single Byzantine agent can hazardously disrupt a network and preclude convergence. Byzantine agents are hard to identify because they act arbitrarily, and their behavior is difficult to predict [25]. Some state-of-the-art Byzantine-resilient algorithms have been designed to provide resiliency against Byzantine attacks in decentralized learning, namely ByRDiE [36] and BRIDGE [37]. They work by screening each coordinate of the received local model vectors for outliers during model aggregation. Thus, these algorithms are adaptations of the scalar-based coordinate-wise screening technique [38]. ByRDiE and BRIDGE are not computationally efficient for high-dimensional data, since they screen one coordinate at a time. However, in BRIDGE, the computational efficiency can be improved by screening every coordinate simultaneously with parallel computing [39]. It is worth noting that ByRDiE and BRIDGE are not meta-learning algorithms and cannot adapt quickly to new tasks with a small test dataset. Moreover, ByRDiE and BRIDGE cannot work when agents' data distributions are both heterogeneous and time-varying. To the best of our knowledge, there is no existing online meta-learning algorithm that can overcome Byzantine attacks in time-varying environments, such as autonomous vehicles, where the model must be continuously updated with small sample sizes as data keeps arriving.

A. OBJECTIVE AND MOTIVATION FOR THIS RESEARCH
In this research work, our overarching objective is to design a Byzantine-resilient online meta-learning algorithm for multi-agent systems with better performance than state-of-the-art online meta-learning algorithms. The motivation for this research is as follows. Distributed and decentralized learning algorithms give a good training model when agents have independent and identically distributed (i.i.d) data [40]. However, in many real-world applications, such as autonomous vehicles and intelligent robots, agents have non-i.i.d data distributions and must collaborate in a time-varying environment. Also, some of these agents may be malicious. Therefore, we ask: can we develop a robust decentralized learning algorithm that converges quickly to the optimal model for real-time applications with time-varying and heterogeneous data distributions in the presence of Byzantine agents? This is an important question in machine learning, faced with the urgent need to provide fast adaptation, continuous lifelong learning, and security. We answer in the affirmative by proposing a Byzantine-resilient online decentralized meta-learning algorithm that works with two Byzantine-resilient aggregation techniques.

B. NOVELTY OF THIS RESEARCH
This work aims to overcome the limitations of the coordinate-wise screening used in state-of-the-art Byzantine-resilient algorithms, such as ByRDiE [36] and BRIDGE [37]. Therefore, we propose a Byzantine-resilient online decentralized meta-learning algorithm that can work with two superior Byzantine-resilient aggregation techniques. The first Byzantine-resilient aggregation technique is a modification of the coordinate-wise screening technique. It improves performance by allowing every normal agent in a directed graph network topology to scale its local model with its self-loop weight rather than a uniform weight. The intuition behind our approach is that, since a normal agent cannot identify a Byzantine neighbor, it should rely more on its local model and less on the local models received from its neighbors, even after the removal of outliers. This minimizes the effect of undetected Byzantine infiltration that escapes the screening process. The second Byzantine-resilient aggregation technique is centerpoint aggregation. Unlike coordinate-wise screening, centerpoint aggregation is vector-based. It can completely remove the vectorial contributions of Byzantine agents. It has proven to be an effective method for removing outliers in discrete mathematics [41], [42], [43], yet it remains largely unexplored in machine learning. The centerpoint theorem is a generalization of the median to data in higher dimensions. We have successfully applied centerpoint aggregation in our recent work to provide Byzantine resiliency in online federated learning, but with only empirical results provided [44]. However, the performance of centerpoint aggregation in online decentralized meta-learning settings is yet to be investigated.
Our proposed algorithm does not need to know the identities of the Byzantine agents or the exact number of Byzantine agents in the network, and there is no restriction on the behavior of the Byzantine agents. The only available information, as in other works in this research area, is a known upper bound on the number of Byzantine agents in the network [25]. Succinctly, the contributions of this paper are the following:
• This work proposes a Byzantine-resilient online decentralized meta-learning algorithm that works with two Byzantine-resilient aggregation techniques to provide fast adaptation, continuous lifelong learning, and security against Byzantine attacks for real-time applications with time-varying data distributions.
• The first Byzantine-resilient aggregation technique is a modification of the coordinate-wise screening used in state-of-the-art decentralized learning algorithms, such as BRIDGE [37]. This modification scales the local model of a normal agent with its self-loop weight during the update process. By utilizing the self-loop weight, each normal agent is allowed to trust its own local model more than what it receives from its screened neighbors. This reduces the effects of undetected Byzantine infiltration and improves convergence.
• The second proposed Byzantine-resilient aggregation technique is centerpoint aggregation, based on the centerpoint theorem in discrete mathematics. Centerpoint aggregation can completely remove Byzantine local model vectors, unlike scalar-based coordinate-wise screening. Thus, centerpoint aggregation is computationally efficient and performs better than coordinate-wise screening.
• This work provides both theoretical and simulation results to show that the proposed Byzantine-resilient online meta-learning algorithm performs better than existing algorithms.
The remainder of the paper is organized as follows: Section II discusses related works; Section III discusses the preliminaries; Section IV presents the problem formulation; Section V discusses the proposed algorithm; Section VI presents the theoretical results; Section VII discusses the simulation setup and results; Section VIII provides an elaborate discussion on other urgent problems faced in decentralized learning; and Section IX concludes the findings in the paper.

II. RELATED WORKS
Decentralized meta-learning was proposed to allow multiple agents, with local data scattered in different locations, to collaboratively learn a good meta-initializer. Learning performance improves over that of a single agent with access to only its local data. Due to the diffusion of learning parameters across the agents, the cold-start problem is eliminated [19]. Works on decentralized meta-learning include [22] and [23]. Decentralized meta-learning was extended to the online setting to also provide continuous lifelong learning [19]. However, both offline and online decentralized meta-learning algorithms are vulnerable to the security threats common to any decentralized network; they are not designed to withstand such threats. This is a serious concern because cyberattacks impact the performance of machine learning algorithms [52].
There are some techniques proposed in the literature to provide resilience against Byzantine attacks in machine learning. These techniques include Krum, Multi-Krum [32], geometric mean [31], geometric median [45], coordinate-wise screening [38], Zeno [46], Zeno++ [47], etc. These techniques were originally developed for distributed learning based on a worker-master architecture. The worker-master architecture is liable to congestion and a single point of failure at the coordinator [53]. Aside from coordinate-wise screening, the other techniques have not been extended to decentralized multi-agent settings, where multiple agents communicate in a peer-to-peer fashion. This is because these other techniques do not guarantee good resilience or provide good convergence analysis in decentralized learning [25]. In coordinate-wise screening, each normal agent screens the local models it receives from its neighbors and trims out small and large values per coordinate.
Recently, two coordinate-wise screening-based algorithms, ByRDiE [36] and BRIDGE [37], were proposed to guarantee Byzantine resiliency in decentralized learning. In the BRIDGE algorithm, the screening operation occurs in parallel, which results in better computational efficiency than the ByRDiE algorithm. However, both ByRDiE and BRIDGE use scalar-wise operations per coordinate. A Byzantine contribution can be trimmed in one coordinate but pass the screening test in other coordinates. Thus, both ByRDiE and BRIDGE cannot guarantee the total removal of Byzantine contributions in every coordinate. Hence, they cannot guarantee good convergence towards the global optimal model. Moreover, ByRDiE and BRIDGE are not computationally efficient with high-dimensional data. BREDA was also proposed using coordinate-wise screening for Byzantine-resilient resource allocation in decentralized networks [48]. It inherits the limitations of coordinate-wise screening. Aside from coordinate-wise screening, a total variation norm-penalized (TV-norm) approximation method was proposed to handle Byzantine attacks in [49]. However, its limitation is that the optimal local models of the agents are not necessarily consensual.
The Byzantine-resilient algorithms discussed in the previous two paragraphs are distance-based because they identify Byzantine contributions by finding outliers with large distances from a reference point. These algorithms are mostly designed for i.i.d data distributions, although BRIDGE and ByRDiE have shown good performance with non-i.i.d data. To handle non-i.i.d data distributions while guaranteeing Byzantine resiliency in decentralized networks, UBAR was proposed in [50]. UBAR is a performance-based approach that screens out Byzantine local model vectors based on their performance on a validation dataset. Similarly, BASIL was proposed for non-i.i.d data distributions and achieves better performance than UBAR [51]. However, BASIL trains sequentially over a logical ring, so it suffers from high latency that scales with the number of agents in the network.
It is noteworthy that none of the aforementioned Byzantine-resilient algorithms can work in a time-varying environment. This limits their applicability to many real-world settings, such as autonomous vehicles and traffic monitoring systems. Moreover, none of these algorithms can quickly adapt to a new task with a small dataset. Therefore, we propose a novel Byzantine-resilient algorithm that can work with two superior Byzantine-resilient techniques in a time-varying environment with non-i.i.d data distributions. The first technique is a modification of the coordinate-wise screening with better performance. The second technique uses a superior screening approach called centerpoint aggregation. Centerpoint aggregation is a vector-wise operation that can screen out Byzantine vectorial infiltration. It is based on the centerpoint theorem, long known in discrete mathematics [41], [42], [43]. However, its application in machine learning to provide resiliency against Byzantine attacks is new. Recently, it was applied to provide Byzantine resiliency in robotics [54], [55]. Similarly, we applied centerpoint aggregation in online federated learning for the first time to provide Byzantine resiliency, though without rigorous experimental and theoretical results [44]. Table 1 shows a comparison of our proposed algorithm with other existing algorithms, with K as the total number of agents, b as a known upper bound on the number of Byzantine agents, d as the dimension of the decision set, and n as the unknown number of normal agents. It can be seen from Table 1 that our centerpoint-based meta-learning algorithm is the only Byzantine-resilient algorithm that works in the online setting and can also guarantee quick adaptation when the dataset is small. Table 2 defines the notation used in the paper.

III. PRELIMINARIES
A. NOTATIONS AND DEFINITIONS
Definition 1 (Lipschitz continuity): A function f is said to be L-Lipschitz continuous if |f(ω) − f(ω′)| ≤ L ||ω − ω′|| for all ω, ω′. This implies that the gradient is bounded, i.e., ||▽f(ω)|| ≤ L and L > 0.
Definition 2 (smoothness): A differentiable function f is said to be L′-smooth if ||▽f(ω) − ▽f(ω′)|| ≤ L′ ||ω − ω′|| for all ω, ω′, where L′ > 0.
Definition 3 (co-coercivity): A convex and continuously differentiable function f is said to satisfy the co-coercivity condition if ⟨▽f(ω) − ▽f(ω′), ω − ω′⟩ ≥ (1/L′) ||▽f(ω) − ▽f(ω′)||² for all ω, ω′.
Definition 4 (general position): A set containing at least d + 1 points in d-dimensional affine space is said to be in general position if no hyperplane contains more than d points, i.e., the points do not satisfy any more linear relations than they must.
Definition 5 (interior point): A point x is an interior point of a set S if there exists an open ball centered at x that is completely contained in S.
Definition 6 (half-space): A half-space is either of the two parts into which a hyperplane divides an affine space.
Definition 1, Definition 2, Assumption 1, and Definition 3 are standard in convex optimization and are necessary for the proof of convergence in this work. For a better understanding of convex optimization, we refer readers to [39] and [56]. Definition 4, Definition 5, and Definition 6 are well known in computational geometry and are necessary here to analyze centerpoint aggregation [54].

B. OFFLINE DECENTRALIZED META LEARNING
Assume a decentralized multi-agent setting where the agents are connected to form a directed graph network G = (K, E), with K = {1, . . . , K} representing the set of agents and E the set of edges. The edge (j, k) ∈ E if agent j can send information directly to agent k, and the edge (k, j) ∈ E if agent k can send information directly to agent j. The adjacency matrix A for the graph network is formed from the edge weights between any two agents. That is, [A]_jk is the weight of edge (j, k) and [A]_kk is the self-loop weight of agent k. The matrix is column-stochastic, which means that each of its columns must sum to one. The neighbors of agent k are members of the set N_k := {j ∈ K : (j, k) ∈ E}. The objective of the agents is to collaboratively learn the best model as follows [57]:

min_ω ∑_{k=1}^{K} f_k(ω),  where f_k(ω) := E_{d_k ∼ D_k} [ℓ(ω; d_k)]

where ℓ(ω; d_k) is the penalization (loss) of the model ω on data sample d_k drawn randomly from data distribution D_k for agent k. The expectation is over the randomness of d_k.
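As a small illustration of the graph setup above, a column-stochastic adjacency matrix can be built as follows; the uniform split of the non-self weight among in-neighbors is an illustrative choice, not one prescribed by the paper:

```python
import numpy as np

def column_stochastic_adjacency(edges, K, self_weight=0.5):
    """Build a column-stochastic adjacency matrix A for a directed graph.

    edges: list of pairs (j, k), meaning agent j sends information to agent k.
    Column k holds the weights agent k assigns; [A]_kk is the self-loop
    weight, and each column sums to one.
    """
    A = np.zeros((K, K))
    in_neighbors = {k: [j for (j, kk) in edges if kk == k] for k in range(K)}
    for k in range(K):
        nbrs = in_neighbors[k]
        if not nbrs:              # isolated agent keeps the full weight
            A[k, k] = 1.0
            continue
        A[k, k] = self_weight
        for j in nbrs:            # split the remaining weight uniformly
            A[j, k] = (1.0 - self_weight) / len(nbrs)
    return A

# Directed ring over four agents: 0 -> 1 -> 2 -> 3 -> 0
A = column_stochastic_adjacency([(0, 1), (1, 2), (2, 3), (3, 0)], K=4)
```

Each column summing to one is exactly the column-stochasticity required above.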
In offline decentralized MAML, the goal is not to find the best model for the agents. Rather, MAML exploits the relatedness among the underlying task data distributions of the agents to find a meta-initializer that can quickly adapt to new tasks arriving for each agent after one or more gradient descent updates [12]. We restrict our analysis to a single gradient descent update, although this can easily be extended to multiple updates. Using a single gradient descent update, the agents' objective is formulated as follows [27]:

min_ω ∑_{k=1}^{K} J_k(ω)   (6)

where the meta function J_k(ω) := f_k(ω − α▽f_k(ω)) and α > 0 is the step size. Any agent can take the solution of Equation (6) as an initial point and update it with respect to its own task data. Equation (6) can be seen as the sum of the meta functions J_1, . . . , J_K. The gradient ▽J_k(ω) is given as follows:

▽J_k(ω) = (I − α▽²f_k(ω)) ▽f_k(ω − α▽f_k(ω)).
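For intuition, the meta-gradient ▽J_k(ω) = (I − α▽²f_k(ω))▽f_k(ω − α▽f_k(ω)) can be evaluated on a toy quadratic task, where the gradient and Hessian are available in closed form; the quadratic f_k below is an illustrative assumption, not the paper's general risk function:

```python
import numpy as np

# Toy quadratic task f_k(w) = 0.5 w^T Q w + c^T w, so that
# grad f_k(w) = Q w + c and Hessian f_k(w) = Q.
def meta_gradient(Q, c, omega, alpha):
    grad = Q @ omega + c                    # grad f_k at omega
    adapted = omega - alpha * grad          # one-step adaptation
    grad_adapted = Q @ adapted + c          # grad f_k at the adapted point
    return (np.eye(len(omega)) - alpha * Q) @ grad_adapted

Q = np.array([[2.0, 0.0], [0.0, 1.0]])
c = np.array([1.0, -1.0])
omega = np.array([0.5, 0.5])
g = meta_gradient(Q, c, omega, alpha=0.1)
```

The result can be verified against a finite-difference approximation of J_k(ω) = f_k(ω − α▽f_k(ω)).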

IV. PROBLEM FORMULATION
In this section, we will start by formulating the problem of online decentralized meta-learning, which differs from the offline decentralized meta-learning discussed in the preliminaries. Then, we will extend our formulation to the case where there are Byzantine agents in the graph network.

A. ONLINE DECENTRALIZED META LEARNING
In online learning, an agent k is faced with a sequence of global stochastic risk functions f_1, . . . , f_T over a period of T iterations. The risk function sequence is chosen adversarially from an unknown distribution. Each agent k must choose local models ω_{k,1}, . . . , ω_{k,T} that perform well on this risk function sequence. In essence, the goal of the agent is to minimize a performance metric called regret. A single agent's regret is defined as the difference between the cumulative risk of the agent from time t = 1 to t = T and the cumulative risk of an oracle with hindsight knowledge of the minimizer (or best model). Formally, a single agent's regret is represented as [58]

Regret_T = ∑_{t=1}^{T} f_t(ω_t) − min_ω ∑_{t=1}^{T} f_t(ω).   (8)

The regret definition in (8) does not solve the online optimization problem in a decentralized fashion for the multiple agents K = {1, . . . , K}. The agents can leverage their graph connections to collaboratively learn the minimizer as follows:

Regret_T = ∑_{t=1}^{T} ∑_{k=1}^{K} f_{k,t}(ω_{k,t}) − min_ω ∑_{t=1}^{T} ∑_{k=1}^{K} f_{k,t}(ω)   (9)

where f_{k,t}(ω) := E_{d_{k,t} ∼ D_{k,t}} [ℓ(ω; d_{k,t})] is the time-varying local risk function of agent k, and the expectation is over a time-varying data sample d_{k,t} drawn from a time-varying data distribution D_{k,t}. It is often better to compute the expected regret of the agents instead of the instantaneous regret. Therefore, Equation (9) can be reformulated as follows:

E[Regret_T] = E[∑_{t=1}^{T} ∑_{k=1}^{K} f_{k,t}(ω_{k,t})] − min_ω E[∑_{t=1}^{T} ∑_{k=1}^{K} f_{k,t}(ω)].   (10)

The expectation is over the randomness of the local risk function f_{k,t}. We can consider an online decentralized meta-learning setting where each agent performs a task-specific update to its local model before it is used for evaluation in each iteration. The goal is no longer to learn the minimizer but to learn a meta-initializer. This update is done using an update function U_{k,t} : ω → ω̂. The update function takes ω and produces ω̂, which performs better on f_{k,t}. The update function U_{k,t} is one step of stochastic gradient descent, i.e., U_{k,t}(ω) = ω − α▽f_{k,t}(ω) [3].
Therefore, Equation (10) becomes

E[Regret_T] = E[∑_{t=1}^{T} ∑_{k=1}^{K} J_{k,t}(ω_{k,t})] − min_ω E[∑_{t=1}^{T} ∑_{k=1}^{K} J_{k,t}(ω)]   (11)

where the online meta function J_{k,t}(ω) := f_{k,t}(U_{k,t}(ω)) and its gradient ▽J_{k,t}(ω) = (I − α▽²f_{k,t}(ω)) ▽f_{k,t}(ω − α▽f_{k,t}(ω)).
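As a concrete single-agent instance of the regret in (8), online gradient descent on a stream of scalar quadratic losses can be compared against the best fixed model in hindsight; the Gaussian loss sequence and the step-size schedule are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
targets = rng.normal(0.0, 1.0, T)          # task optima a_t, one per round
f = lambda w, a: (w - a) ** 2              # f_t(w) = (w - a_t)^2, strongly convex

# Online gradient descent with step size alpha_t = 1/(2t)
w, losses = 0.0, []
for t, a in enumerate(targets, start=1):
    losses.append(f(w, a))
    w -= (1.0 / (2 * t)) * 2.0 * (w - a)   # w becomes the running mean of a_1..a_t

# Best fixed model in hindsight: the minimizer of sum_t (w - a_t)^2 is the mean
w_star = targets.mean()
regret = sum(losses) - sum(f(w_star, a) for a in targets)
print(f"regret = {regret:.3f}, regret/T = {regret / T:.4f}")
```

For strongly convex losses the regret grows only logarithmically in T, so the average regret per round vanishes.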

B. BYZANTINE AGENTS AND ATTACKS
Cooperation among agents in the neighborhood is essential in multi-agent settings to converge to the minimizer. However, some of the agents in the neighborhood may deviate arbitrarily from the expected behavior during the iterative training process. Such agents are said to be Byzantine. Byzantine agents refer to any malfunctioning or malicious agents. Definition 7 ([37]): An agent j ∈ K is said to be Byzantine at any iteration of the online decentralized meta-learning process if it follows a different update rule, say g̃_j(·) instead of g_j(·), or broadcasts an arbitrary summary of its local information to agents in its neighborhood that is different from the intended summary.
For the rest of this paper, we denote M ⊂ K as the set of normal agents and B ⊂ K as the set of Byzantine agents, such that M ∩ B = ∅ and M ∪ B = K. Let n represent the cardinality of M. As is common in the literature on Byzantine resiliency, it is assumed that the Byzantine agents are unknown and hard to identify [37]. Hence, the cardinality and members of B are unknown. However, a known upper bound on the number of Byzantine agents is denoted by b, i.e., 0 ≤ |B| ≤ b and n ≥ K − b.
The presence of Byzantine agents makes (11) unsolvable. Hence, we focus on training only the normal agents. Thus, (11) is reformulated as follows:

E[Regret_T] = E[∑_{t=1}^{T} ∑_{k∈M} J_{k,t}(ω_{k,t})] − min_ω E[∑_{t=1}^{T} ∑_{k∈M} J_{k,t}(ω)]   (12)

where the global meta-initializer ω* := arg min_ω E[∑_{t=1}^{T} ∑_{k∈M} J_{k,t}(ω)]. In order to develop an online decentralized meta-learning algorithm that can solve (12), the following assumptions will be made.
Definition 8 ([37]): (source component) A source component of a graph is a subset of the graph's nodes such that every node in the source component has a directed path to every other node in the graph. Definition 9 ([37]): (reduced graph) A subgraph G_red(b) of a graph G with parameter b is a reduced graph generated by (i) removing all Byzantine agents together with their incoming and outgoing edges, and (ii) removing b incoming edges from each normal agent.
Assumption 2 ( [37]): (sufficient network connectivity) The reduced graphs G red (b) of the graph network G = (K, E) give rise to a sufficiently connected decentralized network in the sense that there exists at least one source component among the reduced graphs of cardinality greater than (b + 1).
Assumption 2 ensures that each normal node can still receive and send information to other normal nodes despite the removal of Byzantine nodes and edges from the graph network. Empirical findings show that this assumption is valid [37].
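Assumption 2 can be spot-checked computationally. The sketch below tests a single candidate reduced graph with breadth-first search (Assumption 2 quantifies over all reduced graphs, so this is only a necessary-condition check on one of them); the example graph and b = 1 are hypothetical:

```python
from collections import deque

def source_component(edges, nodes):
    """Return the set of nodes with a directed path to every node in the
    graph (the source component of Definition 8; possibly empty)."""
    adj = {v: [] for v in nodes}
    for j, k in edges:
        adj[j].append(k)

    def reachable(s):
        seen, q = {s}, deque([s])
        while q:
            for v in adj[q.popleft()]:
                if v not in seen:
                    seen.add(v)
                    q.append(v)
        return seen

    return {s for s in nodes if reachable(s) == set(nodes)}

# One candidate reduced graph: the Byzantine node and its edges removed,
# plus one incoming edge dropped per normal node (b = 1, illustrative).
reduced_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
src = source_component(reduced_edges, nodes=[0, 1, 2, 3])
```

With b = 1, Assumption 2 asks for a source component of cardinality greater than b + 1 = 2; the example's source component contains all four normal nodes.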

V. PROPOSED ALGORITHM
In practice, there is always a lack of information about the time-varying data distribution D_{k,t} for any agent k in an online learning setting. This makes computing f_{k,t}(ω), and in essence both ▽f_{k,t}(ω) and ▽J_{k,t}(ω), computationally infeasible. Hence, data realizations are usually collected and an empirical risk minimization is computed. Thus, we can estimate the gradient ▽f_{k,t}(ω) as follows:

▽f_{k,t}(ω; S_{k,t}) = (1/|S_{k,t}|) ∑_{n=1}^{|S_{k,t}|} ▽ℓ(ω; d^n_{k,t})   (13)

where S_{k,t} is a time-varying independent mini-batch of data containing realizations {d^n_{k,t}}_{n=1}^{|S_{k,t}|}. The gradient estimate ▽f_{k,t}(ω; S_{k,t}) is an unbiased version of the true gradient ▽f_{k,t}(ω). Similarly, the Hessian in ▽J_{k,t} can be approximated with its unbiased estimate ▽²f_{k,t}(ω; S_{k,t}). Therefore, the approximate online meta gradient is computed as follows:

▽J_{k,t}(ω; S_{k,t}, S′_{k,t}) = (I − α▽²f_{k,t}(ω; S_{k,t})) ▽f_{k,t}(ω − α▽f_{k,t}(ω; S_{k,t}); S′_{k,t})   (14)

where S_{k,t} and S′_{k,t} are two independent mini-batches. Computing the Hessian at every iteration can be computationally challenging. Therefore, the Hessian part in ▽J_{k,t}(ω; S_{k,t}, S′_{k,t}) is often removed to obtain the approximate online meta gradient ▽f_{k,t}(ω − α▽f_{k,t}(ω; S_{k,t}); S′_{k,t}) [59]. It is shown in [12] that during the local model update, this first-order update can be implemented as two stochastic gradient descent steps, i.e., ω̂_{k,t+1} ← ω_{k,t} − α▽f_{k,t}(ω_{k,t}; S_{k,t}) followed by ω_{k,t+1} ← ω_{k,t} − α▽f_{k,t}(ω̂_{k,t+1}; S′_{k,t}). However, updating the local model ω_{k,t+1} this way does not include collaborative learning with other agents in the neighborhood set N_k.
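The mini-batch gradient estimate described above can be sketched for an illustrative squared loss ℓ(ω; d) = ½||ω − d||², whose per-sample gradient is ω − d (the loss and the batch are assumptions for the example):

```python
import numpy as np

def grad_estimate(omega, batch):
    """Mini-batch estimate of grad f_{k,t}: average the per-sample
    gradients grad l(w; d) over the realizations in S_{k,t}.
    Here l(w; d) = 0.5 ||w - d||^2, so grad l(w; d) = w - d."""
    batch = np.asarray(batch)
    return (omega - batch).mean(axis=0)

rng = np.random.default_rng(1)
omega = np.zeros(3)
S = rng.normal(1.0, 0.1, size=(32, 3))     # mini-batch of 32 realizations
g = grad_estimate(omega, S)                # close to omega - E[d] = -1 per coordinate
```

Averaging per-sample gradients over S_{k,t} is what keeps the estimate unbiased for the true gradient.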
To allow for collaborative learning, updating ω_{k,t+1} follows a diffusion learning process [60]. The conventional diffusion learning process uses the adapt-then-combine (ATC) technique. However, this diffusion learning process is not robust against Byzantine attacks. Hence, we propose an adapt-meta-adapt-then-robust-combine (AMATRC) diffusion learning process with time-varying step sizes, summarized in Algorithm 1 for each normal agent k:

Algorithm 1 AMATRC for a Normal Agent k
1: Input: time duration T and time-varying step size α_t
2: Output: local model ω_{k,T}
3: Initialize ω_{k,1}
4: for t = 1, . . . , T do
5:   Obtain the estimated risk f_{k,t}(ω_{k,t}; S_{k,t})
6:   Compute ω̂_{k,t+1} = ω_{k,t} − α_t ▽f_{k,t}(ω_{k,t}; S_{k,t})
7:   Compute θ_{k,t+1} = ω_{k,t} − α_t ▽f_{k,t}(ω̂_{k,t+1}; S′_{k,t})
8:   Cooperate with neighbors to update the local meta-model ω_{k,t+1}

The proposed Algorithm 1 works as follows: Step 1 shows the necessary inputs to the algorithm, which are the time duration T and the time-varying step size α_t. The use of a time-varying step size rather than a fixed step size is to provide faster convergence.
Step 2 shows the expected output of the algorithm, which is the local model ω k,T of agent k at time T . This local model ω k,T is an estimate of the global meta-initializer ω * .
Step 3 initializes the local model ω k,1 at the start of the algorithm. From Steps 4 to 8, the algorithm iterates for round t = 1, . . . , T .
Step 5 computes the estimated risk.
Step 6 is the adapt step which computes a stochastic gradient descent update on the local model. Step 7 is the meta-adapt step, which is another stochastic gradient descent update.
Step 8 is the robust-combine step, which uses a Byzascreen protocol to screen out Byzantine contributions during the aggregation of the local model of agent k with the contributions of its neighbors. The Byzascreen protocol is Byzantine-resilient, unlike the conventional ATC or consensus approaches in decentralized optimization, which are highly susceptible to Byzantine infiltration [25]. We propose two independent Byzascreen protocols. The first is the modified coordinate-wise screening, a modified version of the conventional coordinate-wise screening technique used for removing outliers in distributed and decentralized networks [38]. The second is centerpoint aggregation, based on the centerpoint theorem in discrete geometry but still largely unexplored in machine learning [42].
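Putting Steps 5-8 together, one AMATRC round for a normal agent can be sketched with the Byzascreen protocol left as an abstract callback; the quadratic gradient estimate and the plain-mean combiner below are illustrative placeholders (a plain mean is, of course, not Byzantine-resilient):

```python
import numpy as np

def amatrc_round(omega, grad_fn, alpha_t, neighbor_thetas, byzascreen, S, S2):
    """One adapt-meta-adapt-then-robust-combine round for a normal agent.
    grad_fn(w, batch) is a stochastic gradient estimate; byzascreen
    aggregates the agent's own theta with its neighbors' intermediate models."""
    omega_hat = omega - alpha_t * grad_fn(omega, S)       # step 6: adapt
    theta = omega - alpha_t * grad_fn(omega_hat, S2)      # step 7: meta-adapt
    return byzascreen(theta, neighbor_thetas)             # step 8: robust-combine

# Placeholder pieces: quadratic gradient pulling toward 1, plain-mean combiner
grad_fn = lambda w, batch: 2.0 * (w - 1.0)
mean_combine = lambda theta, nbrs: np.mean([theta] + list(nbrs), axis=0)
w_next = amatrc_round(np.array([0.0]), grad_fn, 0.1,
                      [np.array([0.2]), np.array([0.1])],
                      mean_combine, None, None)
```

Either of the two Byzascreen protocols described next can be dropped in for mean_combine.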

A. MODIFIED COORDINATE-WISE BYZANTINE SCREENING
The conventional coordinate-wise screening trims the b largest and the b smallest values in each coordinate of the intermediate models {θ_{j,t+1}}_{j∈N_k} received from the neighbors of the normal agent k at time t. Then, it averages the remaining values to update each coordinate of ω_{k,t} and form a new local model ω_{k,t+1} [37]. In our work, we use a modified updating approach that allows each normal agent to leverage its self-loop weight for faster convergence. Specifically, in each iteration, the proposed coordinate-wise screening technique computes the following in parallel for each coordinate i ∈ {1, . . . , d} of the normal agent k [37]:

Ṉ^i_{k,t} := arg min_{X ⊆ N_k, |X| = b} ∑_{j∈X} θ^i_{j,t+1}   (15)
N̄^i_{k,t} := arg max_{X ⊆ N_k, |X| = b} ∑_{j∈X} θ^i_{j,t+1}   (16)
C^i_{k,t} := N_k \ (Ṉ^i_{k,t} ∪ N̄^i_{k,t})   (17)

i.e., Ṉ^i_{k,t} and N̄^i_{k,t} collect the b neighbors with, respectively, the smallest and largest values of coordinate i. It should be noted that the members of the sets Ṉ^i_{k,t} and N̄^i_{k,t} are not fixed but time-dependent due to the arbitrary behavior of Byzantine attacks. It is possible for a neighbor of agent k to belong to Ṉ^i_{k,t} or N̄^i_{k,t} at coordinate i but not belong to either set at other coordinates in a given iteration. After the completion of (15), (16), and (17), the update is done in parallel for all i ∈ {1, . . . , d} as follows:

ω^i_{k,t+1} = [A]_{kk} θ^i_{k,t+1} + ∑_{j ∈ C^i_{k,t}} ((1 − [A]_{kk}) / |C^i_{k,t}|) θ^i_{j,t+1}.   (18)

The normal agent k leverages its self-loop weight in (18), unlike in the conventional coordinate-wise screening technique. The edge weight for each member of the set C^i_{k,t} is (1 − [A]_{kk})/|C^i_{k,t}| for all i ∈ {1, . . . , d}. The modified coordinate-wise Byzantine screening process requires that each agent k has at least 2b + 1 neighbors. Since the modified coordinate-wise screening process is not vector-wise, there is no guarantee that this screening process will completely screen out a Byzantine neighbor's local model contribution in all coordinates at any iteration.
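A minimal sketch of the modified rule, assuming (as our reading of the update) that the surviving neighbors share the remaining weight (1 − [A]_kk)/|C| equally:

```python
import numpy as np

def modified_cw_screen(theta_self, neighbor_thetas, b, a_kk):
    """Modified coordinate-wise Byzantine screening (sketch).

    Per coordinate: trim the b largest and b smallest neighbor values,
    average the survivors, and combine with the agent's own intermediate
    model, which keeps its self-loop weight a_kk."""
    thetas = np.asarray(neighbor_thetas, dtype=float)  # shape (|N_k|, d)
    assert thetas.shape[0] >= 2 * b + 1, "need at least 2b + 1 neighbors"
    srt = np.sort(thetas, axis=0)
    survivors = srt[b:thetas.shape[0] - b, :]          # drop b smallest, b largest per coordinate
    return a_kk * np.asarray(theta_self) + (1.0 - a_kk) * survivors.mean(axis=0)

theta_self = np.array([0.0, 0.0])
neighbors = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0],
             [100.0, -100.0], [-100.0, 100.0]]        # last two rows: Byzantine
omega_next = modified_cw_screen(theta_self, neighbors, b=1, a_kk=0.5)
```

The two Byzantine rows here are trimmed in both coordinates, but a subtler attack could survive in some coordinates, which is exactly the limitation noted above.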

B. CENTERPOINT AGGREGATION
The centerpoint aggregation is a vector-wise screening technique for outliers, well known in discrete geometry. The centerpoint is a generalization of the median to higher dimensions: for any point set P of z points in R^d, there exists a point c such that any halfspace containing c contains not less than z/(d+1) points of P [61]. Recently, the centerpoint technique was applied to provide resilient aggregation against Byzantine agents in the consensus ATC algorithm [54]. Also, our recent work applied the centerpoint aggregation technique to provide Byzantine resiliency in online federated learning for the first time [44]. However, we provided only preliminary empirical results without rigorous experimentation or convergence analysis, and our previous work does not discuss meta-learning. Hence, we provide rigorous analysis and experimentation for the online decentralized meta-learning setting discussed in this paper.
A safe point, which is an interior centerpoint, exists when the number of Byzantine agents b in the neighborhood of any normal agent k satisfies b < |N_k|/(d+1), where |N_k| is the size of the neighborhood set N_k and d is the dimension of the local model [55]. In our work, we show that the centerpoint aggregation technique ensures that a centerpoint lies in the convex hull of the points corresponding to the received intermediate model vectors of the screened neighbors of agent k. This requires no knowledge of the location or behavior of the Byzantine agents. However, computing the centerpoint can be challenging in practice for large dimensions. There are approximate algorithms in the literature that compute a centerpoint lying in the interior of the convex hull of the normal agents within computational time O((rd)^d), given that there are at most |N_k|/(d · r/(r−1)) Byzantine agents, where r is an integer greater than 1; increasing the value of r improves the approximation [62]. A more feasible approximate algorithm is Tverberg partitioning, which finds an approximate centerpoint [63], [64].
Definition 10 ([44]): Given a set P of z points in R^d in general position, where z ≥ d + 1, a centerpoint c is a point, not necessarily a member of P, such that any closed half-space containing c also contains at least ⌈z/(d+1)⌉ points from P. The value of z is the same as the cardinality |N_k|.
Remark 1: The centerpoint divides the set P into roughly two equal halves, such that sufficient points exist on both sides of the centerpoint.
Theorem 1 ([44], [65]): There exists a centerpoint for any given point set in general position in any arbitrary dimension. This is the centerpoint theorem.
Remark 2: The centerpoint is generally not unique; there can be more than one. The set of all centerpoints is called the centerpoint region, and it is closed and convex.
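Since exact centerpoint computation is expensive in high dimensions, a brute-force sketch can illustrate the idea: estimate the halfspace (Tukey) depth of candidate points by sampling directions and keep the deepest candidate. This is an illustrative approximation only; the function names, the random-subset candidate set, and the direction sampling are our assumptions, not the Tverberg-partitioning algorithms cited above:

```python
import numpy as np

def approx_tukey_depth(c, points, dirs):
    """Estimated halfspace (Tukey) depth of c: the minimum, over the
    sampled directions, of the number of points lying on the closed
    positive side of the halfspace through c."""
    counts = ((points - c) @ dirs.T >= 0).sum(axis=0)
    return counts.min()

def approx_centerpoint(points, n_candidates=200, n_dirs=500, seed=0):
    """Brute-force sketch: score the averages of random (d + 1)-subsets
    by estimated depth and return the deepest one. A true centerpoint
    has depth >= ceil(z / (d + 1)); this only approximates that."""
    rng = np.random.default_rng(seed)
    z, d = points.shape
    dirs = rng.normal(size=(n_dirs, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    subsets = [rng.choice(z, size=d + 1, replace=False)
               for _ in range(n_candidates)]
    candidates = np.array([points[s].mean(axis=0) for s in subsets])
    depths = [approx_tukey_depth(c, points, dirs) for c in candidates]
    return candidates[int(np.argmax(depths))]
```

Because a deep point must have many points of P on every side, a far-away Byzantine model vector cannot drag the selected aggregate toward itself, which is the vector-wise robustness the text describes.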

VI. THEORETICAL RESULTS
In this section, we discuss the theoretical guarantee for the proposed algorithm using the centerpoint aggregation scheme.

Lemma 2: Algorithm 1 is resilient convergent if for every t ∈ T and k ∈ M we have ||ω_{k,t+1} − ω*|| ≤ ||ω_{k,t} − ω*||.

Proof: Let the set of normal clients in the neighborhood of agent k, with k inclusive, be N^+_k ⊆ N_k ∪ {k}. It holds from Algorithm 1 that

||ω_{k,t+1} − ω*|| ≤ Σ_{j∈N^+_k} [A]_{jk} ||θ_{j,t+1} − ω*||.   (19)

Let us simplify ||θ_{j,t+1} − ω*|| before returning to (19). Substituting Step 6 of Algorithm 1 into Step 7 gives

θ_{j,t+1} = ω_{j,t} − α_t ∇_ω f̃_{j,t}(U_{j,t}(ω_{j,t}, S_{j,t}), S'_{j,t}).   (20)

The empirical online meta function f̃_{j,t}(U_{j,t}(ω_{j,t}, S_{j,t}), S'_{j,t}) is an approximation of the stochastic online meta function J_{j,t}(ω_{j,t}) := f_{j,t}(U_{j,t}(ω_{j,t})). Thus, in approximation, ||θ_{j,t+1} − ω*|| can be written as ||ω_{j,t} − α_t ∇_ω f_{j,t}(U_{j,t}(ω_{j,t})) − ω*||, which we can expand as

||ω_{j,t} − α_t ∇_ω f_{j,t}(U_{j,t}(ω_{j,t})) − ω*||² = ||ω_{j,t} − ω*||² − 2α_t ⟨∇_ω f_{j,t}(U_{j,t}(ω_{j,t})), ω_{j,t} − ω*⟩ + α_t² ||∇_ω f_{j,t}(U_{j,t}(ω_{j,t}))||².   (21)

From the co-coercivity of f_{j,t}(·) in Definition 3, we have

⟨∇_ω f_{j,t}(U_{j,t}(ω_{j,t})) − ∇_ω f_{j,t}(U_{j,t}(ω*)), ω_{j,t} − ω*⟩ ≥ (1/L) ||∇_ω f_{j,t}(U_{j,t}(ω_{j,t})) − ∇_ω f_{j,t}(U_{j,t}(ω*))||².   (22)

However, ∇_ω f_{j,t}(U_{j,t}(ω*)) = 0, hence (22) becomes

⟨∇_ω f_{j,t}(U_{j,t}(ω_{j,t})), ω_{j,t} − ω*⟩ ≥ (1/L) ||∇_ω f_{j,t}(U_{j,t}(ω_{j,t}))||².   (23)

Substituting (23) into (21) gives

||θ_{j,t+1} − ω*||² ≤ ||ω_{j,t} − ω*||² − α_t (2/L − α_t) ||∇_ω f_{j,t}(U_{j,t}(ω_{j,t}))||² ≤ ||ω_{j,t} − ω*||²,   (24)

where the last inequality holds for α_t ≤ 2/L. Now, substituting (24) into (19), we obtain

||ω_{k,t+1} − ω*|| ≤ Σ_{j∈N^+_k} [A]_{jk} ||ω_{j,t} − ω*||.   (25)

Since Σ_{j∈N^+_k} [A]_{jk} = 1, the right-hand side of (25) is a convex combination of the distances of the normal neighbors to ω*, and the claim follows. Remark 3: Lemma 2 shows that every round of iteration of Algorithm 1 improves convergence accuracy by reducing the distance to the global meta-initializer.
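The core step of the argument, namely that a gradient step with stepsize α_t ≤ 2/L on an L-smooth convex loss does not increase the distance to the minimizer, can be sanity-checked numerically on a quadratic. This is an illustrative sketch; the quadratic loss and all constants are our assumptions:

```python
import numpy as np

# Quadratic f(w) = 0.5 (w - w_star)^T H (w - w_star): L-smooth and convex,
# with gradient H (w - w_star) vanishing at the minimizer w_star.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 3))
H = Q @ Q.T + np.eye(3)            # symmetric positive definite Hessian
w_star = rng.normal(size=3)        # minimizer of f
L = np.linalg.eigvalsh(H).max()    # smoothness constant of f

def grad(w):
    return H @ (w - w_star)

alpha = 2.0 / L                    # largest stepsize the argument allows
w = rng.normal(size=3)
dists = [np.linalg.norm(w - w_star)]
for _ in range(50):
    w = w - alpha * grad(w)
    dists.append(np.linalg.norm(w - w_star))

# co-coercivity with a vanishing gradient at w_star makes the distance
# to the minimizer non-increasing at every step
assert all(dists[i + 1] <= dists[i] + 1e-12 for i in range(50))
```

For the quadratic, each step multiplies the error by I − αH, whose eigenvalues lie in [−1, 1) when α = 2/L, so the distance sequence is monotonically non-increasing, mirroring inequality (24).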
Theorem 3: The expected regret incurred by a normal agent k ∈ M running Algorithm 1, with f_{k,t} a λ-strongly convex function and a stepsize α_t = β/(γ + t) for some β > 1/(λμ), γ > 0, and α_1 = μ/(Lσ_k²), is bounded as follows, where ν := max
Proof: It can be obtained from Theorem 4.7 in [67] that
where in (a), the sum of the second and third terms on the left-hand side of the above equation is negative by the definition of ν and can be removed. Moreover, in the first

Remark 5:
The regret bound of Algorithm 1 per round of iteration is of order O(1/t) as long as the stepsize parameter satisfies β > 1/(λμ).

A. SIMULATION OVERVIEW
For the simulation, we will use a synthetic dataset similar to existing work [48]. We will first show that the conventional ATC aggregation technique used in decentralized algorithms helps agents converge when there is no Byzantine attack in the network. Then, we will show that ATC aggregation causes oscillation but no convergence in the presence of one Byzantine agent. By increasing the number of Byzantine agents to two, we will show that the ATC aggregation technique exhibits irrational behavior and never converges. This indicates the need for Byzantine-resilient aggregation techniques. Next, we will compare the convergence of the coordinate-wise aggregation used in the state-of-the-art BRIDGE and ByRDiE algorithms with both the proposed modified coordinate-wise screening and centerpoint aggregation, for the cases of one and two Byzantine agents in the network. Then, we will determine the effect of the learning rate on the speed of convergence. Finally, we will compare the regret bound of our proposed meta-learning algorithm with existing online meta-learning algorithms.

B. SIMULATION SETUP
We consider online linear regression where the loss function is F_{k,t}(ω_{k,t}; d_{k,t}) := (1/2)(⟨ω_{k,t}, d_{k,t}⟩ − y_{k,t})² + (λ/2)||ω_{k,t}||², where d_{k,t} ∈ R^d is a data sample and y_{k,t} ∈ R is its label. The time-varying mini-batch is S_{k,t} = {(d^n_{k,t}, y^n_{k,t})}_{n=1}^N, where N is the cardinality of S_{k,t}. Each coordinate of the data sample d_{k,t} is drawn from the interval (−1, 1), and the label is computed as y_{k,t} = ⟨ω_0, d_{k,t}⟩ + Z, where [ω_0]_i = 1 for 1 ≤ i ≤ ⌊d/2⌋ and 0 otherwise, and Z ∼ N(0, 1) is Gaussian noise with a mean of 0 and a variance of 1. We use a fully connected directed graph network with 10 agents for the simulation. We let the self-loop weight be [A]_{kk} = 1/2 for all k ∈ K and the edge weight be [A]_{jk} = 1/18 for all j ∈ N_k. The graph network is strongly connected and the adjacency matrix is column stochastic. The adjacency matrix is shown below.
The parameters of the simulation are as follows: the step size α_t = 1/t, the dimension d = 3, and the time duration T = 100. These are also shown in Table 3.
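The data-generation scheme of the setup can be sketched as follows. This is a hedged reading of the setup: the label rule y = ⟨ω_0, d⟩ + Z, the batch-averaged loss, and the value of λ are our assumptions:

```python
import numpy as np

d, N, lam = 3, 32, 0.1   # dimension, mini-batch size, regularizer (lam assumed)
rng = np.random.default_rng(0)

# ground-truth model: first floor(d/2) coordinates are 1, the rest are 0
w0 = np.zeros(d)
w0[: d // 2] = 1.0

def sample_batch():
    """Mini-batch S_{k,t}: features uniform on (-1, 1); labels are the
    inner product with w0 plus N(0, 1) Gaussian noise (our reading of
    the label equation in the setup)."""
    X = rng.uniform(-1.0, 1.0, size=(N, d))
    y = X @ w0 + rng.normal(size=N)
    return X, y

def loss_and_grad(w, X, y):
    """Regularized squared loss F_{k,t} and its gradient, averaged over
    the mini-batch (batch averaging is our choice, not stated in the text)."""
    r = X @ w - y
    F = 0.5 * np.mean(r ** 2) + 0.5 * lam * w @ w
    g = X.T @ r / len(y) + lam * w
    return F, g
```

Each agent would draw a fresh batch per round and take the inner-loop adaptation and meta-gradient steps of Algorithm 1 on this loss.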

C. SIMULATION RESULTS
First, we will show that the conventional ATC aggregation technique [23] converges when there is no Byzantine agent in the graph network. To achieve this, we replace Step 8 in Algorithm 1 with the ATC aggregation technique to enable its application to the online decentralized meta-learning problem. In the presence of Byzantine agents, this ATC online diffusion meta-learning algorithm is expected not to converge. Figure 1(a)-(c) plots the Euclidean norm of ω_{k,t} for all k ∈ M against the iteration time t using the ATC online diffusion meta-learning algorithm. It can be seen from Figure 1(a) that when there is no Byzantine agent in the network, the agents converge at iteration time t = 9. In Figure 1(b), we let one agent become Byzantine, say agent 2, such that it uses a different update equation to produce its intermediate local model θ_{2,t+1}, i.e., θ_{2,t+1} = [5 sin t, 3 cos t, 5 tanh t]^T, where T denotes the transpose operation. It can be seen from Figure 1(b) that there is oscillation but no convergence. In Figure 1(c), we let two agents become Byzantine, say agents 2 and 6. Agent 6 updates its intermediate local model θ_{6,t+1} as [cot t, 5 sin t, 3 tan t]^T. It can be seen from Figure 1(c) that the poisonous information from two Byzantine agents pervades the graph network, leading to irrational behavior and no convergence. Next, we will compare the performance of the conventional coordinate-wise screening technique [36], [37], when applied in Algorithm 1, with our proposed modified coordinate-wise screening and centerpoint techniques. It is of note that existing algorithms such as BRIDGE and ByRDiE cannot work directly in online settings. Therefore, for the sake of comparison, we apply the conventional coordinate-wise screening used in BRIDGE and ByRDiE in Step 8 of our proposed algorithm.
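The failure of unscreened averaging under even one Byzantine agent can be seen directly: averaging the fabricated model θ_{2,t+1} with honest models injects a perturbation that never vanishes. A minimal sketch, in which the common honest model w_hat and the uniform 1/10 weights are illustrative assumptions:

```python
import numpy as np

def byz_theta2(t):
    """Byzantine agent 2's fabricated intermediate model from the text."""
    return np.array([5 * np.sin(t), 3 * np.cos(t), 5 * np.tanh(t)])

# Plain uniform averaging of 10 intermediate models, 9 of which are honest
# and already agree on a common point w_hat (an illustrative assumption):
# the single Byzantine term injects a perturbation that never dies out.
w_hat = np.ones(3)
devs = []
for t in range(1, 101):
    avg = (9 * w_hat + byz_theta2(t)) / 10.0
    devs.append(np.linalg.norm(avg - w_hat))
# the deviation from the honest model stays bounded away from zero,
# which is the persistent oscillation seen in the no-screening plots
```

Since the averaged model equals w_hat plus (θ_{2,t+1} − w_hat)/10, and θ_{2,t+1} keeps oscillating, the aggregate can never settle, regardless of how long training runs.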
The conventional coordinate-wise screening, modified coordinate-wise screening, and centerpoint aggregation techniques can all guarantee convergence in the presence of a bounded number of Byzantine clients, but with different speeds of convergence. In both the conventional and modified coordinate-wise screening techniques, each normal agent must have at least 2b + 1 neighbors. In the fully connected graph network, each normal agent has 9 neighbors, which means that the number of Byzantine agents cannot exceed four for these techniques to provide resiliency and guarantee convergence. For the centerpoint aggregation, b < |N_k|/(d+1), which means that it can provide resiliency against at most two Byzantine agents. We will compare the speed of convergence of the conventional coordinate-wise, modified coordinate-wise, and centerpoint aggregation techniques for one and two Byzantine agents only.
For the case of one Byzantine agent in Figure 2(a)-(c), the centerpoint aggregation technique converges the fastest, while the modified coordinate-wise screening technique converges faster than the conventional coordinate-wise screening technique. The convergence time is 27 for the conventional coordinate-wise screening technique in Figure 2(a), 16 for the modified coordinate-wise screening technique in Figure 2(b), and 11 for the centerpoint aggregation technique in Figure 2(c). It should be noted that the conventional coordinate-wise aggregation technique uses a uniform weight instead of a self-loop weight, which is responsible for its slower convergence.
We repeat the simulation with two Byzantine agents. Again, the centerpoint aggregation technique converges faster than both the conventional and modified coordinate-wise aggregation techniques. This is shown in Figure 3(a)-(c), where the convergence time for the centerpoint aggregation is 12, while the convergence times for the conventional and modified coordinate-wise screening techniques are 28 and 16, respectively. Increasing the number of Byzantine agents from 1 to 2 thus has a negligible effect on the speed of convergence for all three techniques.

FIGURE 1. Convergence plots using the diffusion meta-learning algorithm for 10 agents in the presence of 0, 1, and 2 Byzantine agents with α_t = 1/t.

FIGURE 2. Convergence plots using conventional coordinate-wise screening, modified coordinate-wise screening, and centerpoint aggregation for 10 agents in a fully connected graph network that includes 1 Byzantine agent, with α_t = 1/t.
We adjust the learning rate in the simulation from α_t = 1/t to α_t = 2/t to observe its influence on the speed of convergence. The result is shown in Figure 4(a)-(c), where the convergence time is 42 for the conventional coordinate-wise screening, 29 for the modified coordinate-wise screening, and 20 for the centerpoint aggregation. It is surprising that increasing the learning rate slows down convergence instead of speeding it up. We deduce that finding the optimal hyperparameters is not straightforward.
We compare the sublinearity of our regret bound with that of the online decentralized meta-learning algorithm of [19], which has a regret bound of O(1/√t) for each agent, and the OML algorithm of [3], which has a regret bound of Õ(ln t), where Õ(·) hides constant parameters. Sublinearity is defined as lim_{t→∞} regret/t = 0. This is shown in Figure 4, from which it can be seen that our proposed algorithm performs better.
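The different sublinear rates can be illustrated numerically by evaluating the average regret implied by each bound with all hidden constants set to 1. This is illustrative only and does not simulate the algorithms:

```python
import numpy as np

# Average regret (cumulative regret divided by t) implied by the three
# bounds discussed in the text, with every hidden constant set to 1.
t = np.logspace(1, 6, num=6)      # t = 10, 100, ..., 1e6
ours = 1.0 / t                    # per-round O(1/t) bound of Algorithm 1
dec  = 1.0 / np.sqrt(t)           # O(1/sqrt(t)) bound of [19]
oml  = np.log(t) / t              # O~(ln t) cumulative regret of [3], averaged
# all three vanish as t grows (sublinear regret), at different speeds:
# 1/t decays faster than ln(t)/t, which decays faster than 1/sqrt(t)
```

All three averages tend to zero, confirming sublinearity, but the O(1/t) rate decays fastest, which matches the comparison drawn in the text.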

VIII. DISCUSSION
In this work, we developed a Byzantine-resilient online decentralized meta-learning algorithm that can mitigate Byzantine attacks while coping with the data (or statistical) heterogeneity of the agents and providing quick adaptation of the trained model to specific tasks in a time-varying environment. Yet, there are many other challenges in distributed and decentralized learning that are not covered in this work but require urgent attention. We discuss some of them below.

A. THE CHALLENGE OF SYNCHRONIZATION
Synchronization is the de facto communication mode in distributed and decentralized learning. It has given rise to many synchronization strategies, such as diffusion learning, consensus, adapt-then-combine, etc. However, synchronous aggregation may be hard to achieve due to the spatial distribution of agents and the system and statistical heterogeneity of the agents. Agents may have different power limitations, bandwidths, and processing speeds, referred to as system heterogeneity, and different data distributions, referred to as statistical heterogeneity. These differences can introduce straggling effects. Therefore, an asynchronous mode of communication can be an effective way to address straggling, although the use of stale updates can degrade convergence [68], [69]. Other techniques, such as knowledge distillation [70] and quantization [71], can help conserve network resources in a synchronous mode of communication by scaling down the size of the training models.

B. CONFLICTING OBJECTIVES
In many real-world problems, agents may have conflicting objectives, and seeking a common global optimal model or global meta-initializer may be infeasible. Therefore, agents must aim to achieve Pareto-optimality [72]. Algorithms for achieving Pareto-optimality in distributed and decentralized learning are not fully developed.

C. INCOMPLETE INFORMATION
In reality, agents may face unreliable communication networks that corrupt data and limit the accessible information. There may also be delays in transmitting and receiving information. Most distributed and decentralized learning algorithms are not designed to handle such cases. A potential solution is to develop bandit convex and non-convex optimization algorithms that provide worst-case regret guarantees. We refer interested readers to our published paper that addressed this problem in the non-convex setting [24]. However, obtaining a regret bound for the non-convex setting that matches those of convex optimization algorithms remains a challenge.

D. PRIVACY
It is possible for some normal agents to have a particular interest in the global model or in the local models of other agents. Such agents are not malicious but can act as spies. Therefore, it is important to protect the local and global models before and after the aggregation process. Differential privacy is a rigorous mathematical definition of privacy that is well developed in distributed learning but still underdeveloped in decentralized optimization. We took an early step toward addressing this problem in our previous work [44], [53].

E. DEFENSE AGAINST AN ARBITRARY NUMBER OF BYZANTINE AGENTS
Currently, many Byzantine-resilient algorithms provide resiliency only up to a bounded number of Byzantine agents, beyond which they fail. As cyberattacks increase and become more sophisticated, there is an urgent need to develop algorithms that are robust against an arbitrary number of Byzantine agents in the network.

F. FAIRNESS OF THE GLOBAL OBJECTIVES
Due to statistical and system heterogeneity, the trained global model may be fair to some agents but biased against others. This calls for approaches such as meta-learning, multi-task learning, and constrained optimization to address this pressing issue [68]. The concern is greater when the bias is against protected groups, such as those defined by race and gender.

G. COMPUTATIONAL SPEED AND COMPLEXITY
Although meta-learning algorithms can quickly adapt to a specific task with a small dataset, training them can be computationally intensive. This is because the number of stochastic gradient descent steps is larger than in traditional deep learning algorithms. With multiple nested iterations during training, the computational time complexity therefore increases [73]. However, the justification for meta-learning is realized at meta-test time.

IX. CONCLUSION
Meta-learning is a process of distilling task-agnostic knowledge from multiple related tasks over a series of learning episodes to improve learning performance on new tasks from the same task family. Meta-learning has been extended to the online learning setting to provide a continuous lifelong learning experience. However, in a decentralized setting, a single agent is limited to its local task data and must collaborate with other agents in its neighborhood to improve its learning performance. Therefore, online decentralized meta-learning algorithms are designed to allow multiple agents to learn faster and better from shared experiences. On the other hand, online decentralized meta-learning algorithms are susceptible to Byzantine attacks and failures, which lead to poor performance. Although there are some state-of-the-art Byzantine-resilient techniques applied in offline decentralized settings to withstand Byzantine attacks, these techniques are not applicable in online settings and their performance for meta-learning is unknown. Therefore, this paper developed an online decentralized meta-learning algorithm that works with two Byzantine-resilient aggregation techniques to provide better Byzantine resiliency and guarantee higher convergence speed and accuracy. The first proposed Byzantine-resilient aggregation technique is a modification of the coordinate-wise screening used in state-of-the-art Byzantine-resilient algorithms. The second is the centerpoint aggregation, based on the centerpoint theorem in discrete geometry. The centerpoint aggregation is both novel and superior to the coordinate-wise screening technique. Simulation results showed that the proposed algorithm is indeed superior to existing algorithms.