Selective Trimmed Average: A Resilient Federated Learning Algorithm With Deterministic Guarantees on the Optimality Approximation

Abstract—The federated learning (FL) paradigm aims to distribute the computational burden of the training process among several computation units, usually called agents or workers, while preserving private local training datasets. This is generally achieved by resorting to a server–worker architecture where agents iteratively update local models and communicate local parameters to a server that aggregates and returns them to the agents. However, the presence of adversarial agents, which may intentionally exchange malicious parameters or may have corrupted local datasets, can jeopardize the FL process. Therefore, we propose selective trimmed average (SETA), which is a resilient algorithm to cope with the undesirable effects of a number of misbehaving agents in the global model. SETA is based on properly filtering and combining the exchanged parameters. We mathematically prove that the proposed algorithm is resilient against data and local model poisoning attacks. Most resilient methods presented so far in the literature assume that a trusted server is in hand. In contrast, our algorithm works both in server–worker and shared memory architectures, where the latter excludes the necessity of a trusted server. The theoretical findings are corroborated through numerical results on the MNIST dataset and on the multiclass weather dataset (MWD).

Mauro Franceschelli is with the Department of Electrical and Electronic Engineering, University of Cagliari, 09122 Cagliari, Italy (e-mail: mauro.franceschelli@unica.it).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCYB.2024.3350387.
Digital Object Identifier 10.1109/TCYB.2024.3350387

I. INTRODUCTION

The federated learning (FL) paradigm distributes the computational burden of the training process among several computation units, referred to as agents in the following, for example, computers or smartphones, possessing local training datasets that may be heterogeneous [1], [2]. Although agents generally do not want to disclose their local training datasets to preserve their privacy [3], they are interested in jointly learning a globally optimal model. The conventional architecture in FL is the server–worker structure, as depicted in Fig. 1. In this architecture, starting from a global model, each agent i moves toward the optimal local model for its training dataset, D_i, through a stochastic gradient descent (SGD) algorithm and communicates the resulting parameters w_i to a server. In the next step, the server updates the global model by aggregating the received parameters w_i and sends the updated global parameters w back to the agents. The average of the received parameters is the most conventional aggregation rule in nonadversarial environments [4], [5], [6], [7]. However, all networked systems are threatened by cyberattacks [8], [9], and FL is not an exemption. Adversarial agents can perform two categories of attacks: 1) data poisoning [10], [11], [12] and 2) local model poisoning [13] attacks. In data poisoning attacks, the adversaries inject malicious data into the local training set of compromised agents, while the latter run the learning process and honestly send the results to the server. In model poisoning attacks, the adversaries intentionally exchange malicious parameters to corrupt the global model. Recent results [14], [15], [16] show that even a single misbehaving agent can arbitrarily manipulate the global model if the average aggregation rule is implemented.
Fig. 1. Server–worker architecture where n agents locally update their models and send the computed parameters w_i to a server that aggregates them into a global parameter vector w.

To overcome this issue, alongside methodologies that aim to detect and isolate misbehaving agents, such as [17] and [18], several aggregation rules, for example, trimmed average or median [14], Krum [15], Bulyan [19], Byrd-SAGA [16], Zeno [20], and RSA [21], have been proposed to make distributed learning resilient against adversaries in server–worker architectures. The common idea behind most existing resilient distributed learning algorithms is that outlier local parameters must be filtered out so as not to have any influence on the global model. For instance, the trimmed average aggregation rule [14] computes the coordinatewise average of the vectors of model parameters, discarding the β percentage of the highest and lowest values, where β is a design parameter. A similar idea was recently proposed in [22], wherein a trimmed average is applied to the agents' estimates of the global gradient vector, discarding the β percentage of the highest and lowest coordinatewise values. The median aggregation rule [14] considers the coordinatewise median of the vectors of model parameters received from the agents. A comparison between the trimmed average and median aggregation rules can be found in [14]. In Krum [15], the local parameter vector having the lowest distance to the others is selected as the global model. An extension of Krum is provided by Bulyan [19], where parameters are updated according to a two-stage approach: 1) the set of local parameters with the lowest distance from the others is recursively determined and 2) then, they are combined by discarding the farthest values from the coordinatewise median. A geometric median-based robust aggregation on corrected stochastic gradients is proposed in Byrd-SAGA [16], reducing the stochastic gradient-induced noise from regular (that is, not adversarial) agents. A further approach, based on computing and exchanging redundant gradients by the workers, is proposed in [23] to overcome the computational complexity of median-based approaches. In addition, the algorithm Zeno presented in [20] exploits the server's knowledge of a training dataset to score the gradients received from the workers. The approach is extended in [24] to handle asynchronous communication and an arbitrary number of adversaries. Finally, RSA in [21] is based on introducing a regularization term in the objective function to robustify the learning task by forcing the workers' local parameters to be close to the server's one.
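As a concrete illustration of these filtering rules, the following minimal Python sketch implements the coordinatewise trimmed average and median aggregations described above; the parameter values and the 20% trimming fraction are illustrative choices of ours, not taken from [14].

```python
import numpy as np

def trimmed_mean(params, beta):
    """Coordinatewise trimmed average: for each coordinate, drop the
    beta fraction of the highest and lowest values, average the rest."""
    W = np.sort(np.asarray(params), axis=0)   # shape (n, m), sorted per coordinate
    n = W.shape[0]
    t = int(np.floor(beta * n))               # values trimmed on each side
    return W[t:n - t].mean(axis=0)

def coordinatewise_median(params):
    """Coordinatewise median of the received parameter vectors."""
    return np.median(np.asarray(params), axis=0)

# Five agents' parameter vectors; the last agent sends extreme values.
w = [np.array([0.9, 1.1]), np.array([1.0, 1.0]), np.array([1.1, 0.9]),
     np.array([0.95, 1.05]), np.array([1e6, -1e6])]   # last is adversarial
print(trimmed_mean(w, beta=0.2))       # the adversarial extremes are discarded
print(coordinatewise_median(w))
```

With beta = 0.2 and five agents, one value per side is trimmed in each coordinate, so the adversarial extremes never enter the average.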
A key assumption in the resilient aggregation rules presented so far is the availability of a trusted server, that is, a server unit capable of aggregating the parameters in a fault-free and attack-free manner. However, in case of attacks on this unit, the entire learning process is at severe risk, since malicious global models can be transferred to all workers. Furthermore, many of them, for example, [14], [15], [19], [20], and [24], assume independent and identically distributed (IID) local datasets, which is unrealistic in FL, where each agent holds a private dataset. In this article, we propose a resilient algorithm that is also applicable when all agents communicate according to a virtual shared memory architecture, as depicted in Fig. 2, in addition to the server–worker one. In the shared memory architecture, we consider that each agent i holds a local private memory, where the dataset D_i is stored, and shares the estimated local parameters w_i(k + 1) in a memory shared with the other agents. In this shared memory, each agent can read the memory areas of all the others and can write only on its own. This enables the omission of the trusted server unit and is equivalent to letting all the agents communicate with each other and exchange local parameters. Note that this architecture ensures privacy of local data in alignment with the FL paradigm. However, in this way each agent has complete knowledge of the network, the local parameters of all agents, as well as the update rules. This implies that adversaries can exploit this information to craft serious attacks, as detailed in [13], for which we demonstrate resilience. More specifically, we look at the resilient FL problem from the perspective of resilient distributed optimization [25], [26], [27], [28], aiming to find suboptimal solutions that are not affected by adversarial agents, without the need to explicitly detect and isolate them as done, for instance, in [29]. We present a resilient algorithm, called
selective trimmed average (SETA), based on a coordinatewise trimmed average of local parameters to update the parameters of each regular agent.For each coordinate, trimmed values are selected based on whether the coordinate itself is evaluated as an outlier or not.We assume the general case where adversarial agents can collude with each other and can decide whether to perform data poisoning or local model poisoning attacks.
To the best of our knowledge, this is the first work that provides deterministic formal guarantees for resilient FL while also relaxing the server–worker architecture. Furthermore, SETA achieves better performance with non-IID datasets compared to most state-of-the-art baselines under different types of attacks. Note that a relaxation of the architecture can also be found in [30] and [31], where distributed protocols are designed but IID datasets are required.
The trimmed average is the closest existing approach to the one in this article.However, we identify the following fundamental differences.
1) In the trimmed average, agents evolve the same global model at each epoch. In contrast, in our algorithm, the starting model at each epoch may differ among the agents.
2) Our algorithm can be executed in an architecture with shared memory, which relaxes the need for a trusted server unit.
3) We provide deterministic theoretical guarantees, as opposed to the statistical results published for the trimmed average and many other aggregation rules.

The proposed algorithm builds on our previous work [32] on resilient distributed optimization of scalar functions. In particular, this article extends those results to resilient FL and multidimensional problems and provides a more detailed mathematical analysis.
In summary, the main contributions of this article are as follows.
1) We propose a resilient FL algorithm, called SETA, aimed at coping with both data and local model poisoning attacks with non-IID local datasets.
2) We extend resilient FL to the case of a shared memory architecture while guaranteeing that the additional information shared by regular agents, and accessible by adversaries, does not threaten the learning process.
3) We provide a deterministic mathematical analysis, as well as simulations on the MNIST dataset and the realistic multiclass weather dataset (MWD) [33], to confirm the effectiveness of our algorithm.

The remainder of this article is organized as follows. In Section II, preliminary notions in multiagent systems and distributed optimization are reviewed. The FL problem in a nonadversarial setting is introduced in Section III. Section IV is devoted to the problem statement and the proposed resilient FL algorithm. Section V focuses on the mathematical analysis. Numerical results are provided in Section VI to validate the approach, and Section VII concludes this article.

A. Notation
Table I reports the main notation of this article. Unless specified otherwise, we denote scalar values with a small regular font, vectors with a bold font, and matrices with capital letters. In addition, 1_n (0_n) is an n-element vector with all entries equal to 1 (0), |·| denotes the cardinality of a set, and ⌊·⌋ is the floor function.

II. PRELIMINARIES
Consider a network composed of n computation units which can interact with each other. In the remainder of the manuscript, we refer to each computation unit as an agent or node. Such a network is modeled as a graph G = (V, E), where V = {1, 2, . . ., n} represents the set of agents in the network and E ⊆ V × V the set of communication links (or edges) among the agents, that is, if agent i sends information to j, then (i, j) ∈ E. We denote the set of in-neighbors of agent i as N_i = {j ∈ V : (j, i) ∈ E}. A graph is called undirected if the communication links are bidirectional, that is, if (i, j) ∈ E implies that (j, i) ∈ E, and is defined directed otherwise. In addition, a graph G is called complete if there exists an edge between all pairs of agents, that is, N_i = V\{i} ∀i ∈ V. A path π_{i,j} between nodes i and j is a sequence of consecutive edges starting from node i and ending in node j. A directed graph G is defined as strongly connected if there exists a directed path between each pair of nodes (i, j) in V. If there exists π_{i,j} between nodes i and j, node j is said to be reachable from node i. Furthermore, if there exists an agent i ∈ V such that all agents in V are reachable from i, the graph G is said to be rooted. In case each edge (j, i) ∈ E is associated with a positive weight a_ij > 0, the graph G is called weighted. The matrix A = [a_ij] ∈ R^{n×n} collecting the weights is defined as the adjacency matrix, that is, a_ij > 0 if (j, i) ∈ E and a_ij = 0 otherwise. A square matrix A ∈ R^{n×n} with non-negative entries and with each row (column) summing to 1 is called row (column) stochastic. Moreover, A is called doubly stochastic if it is jointly row and column stochastic. If the edge weights a_ij(k) are time-varying, the weighted graph is time-varying as well and is denoted by G(k) = (V, E(k)). A time-varying graph G(k) is defined as jointly strongly connected if there exists a finite positive integer B such that the graph (V, E_B(k)), with E_B(k) = E(k) ∪ E(k + 1) ∪ · · · ∪ E(k + B − 1), is strongly connected for all finite k ≥ k_0.

A. Transition Matrix
Let w_i(k) ∈ R^m be the state vector of agent i at time step k, that is, in our case w_i(k) is the vector collecting the model parameters to optimize. The most common update rule of an agent i in consensus-based multiagent systems [34] consists of
a weighted summation over its own and its in-neighbors' state vectors, that is,

w_i(k + 1) = Σ_{j∈N_i∪{i}} a_ij(k)w_j(k).   (1)

According to the definition of the adjacency matrix, (1) can be rewritten in matrix form as

W(k + 1) = A(k)W(k)   (2)

where W(k) = [w_1(k), . . ., w_n(k)]^T ∈ R^{n×m} is the matrix containing the agents' states at time step k. If the adjacency matrix is row stochastic, the weighted summation becomes a weighted average over each agent's in-neighbors' values and its own state value at time step k. By virtue of (2), we can define the following equation:

W(k + 1) = A(k)A(k − 1)W(k − 1).   (3)

If we repeat this procedure, for all s < k it holds

W(k + 1) = A(k)A(k − 1) · · · A(s)W(s).   (4)

To compact (4), the transition matrix is defined as

Φ(k, s) = A(k)A(k − 1) · · · A(s)   (5)

for all s and k with k ≥ s, and Φ(k, k) = A(k), leading to

W(k + 1) = Φ(k, s)W(s).   (6)

From (6), we observe that if all the rows of the transition matrix asymptotically converge to the same stochastic vector, then agreement among the agents is reached, that is, it holds lim_{k→∞} Φ(k, s) = 1_n φ(s)^T for some stochastic vector φ(s). We recall a lemma providing a condition for convergence in terms of connectivity and adjacency matrix weights.
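The role of the transition matrix can be illustrated numerically. The sketch below (with arbitrary random row-stochastic weight matrices, our own choice) multiplies a sequence of matrices A(k) and checks that the rows of Φ(k, s) converge to a common stochastic vector, which is the agreement condition discussed above.

```python
import numpy as np

def transition_matrix(A_seq):
    """Phi(k, s) = A(k) A(k-1) ... A(s) for A_seq = [A(s), ..., A(k)]."""
    Phi = np.eye(A_seq[0].shape[0])
    for A in A_seq:
        Phi = A @ Phi
    return Phi

rng = np.random.default_rng(0)
n = 5
# Random row-stochastic matrices with a dominant diagonal (rooted graph,
# self-weights bounded away from zero as required by Lemma 1).
A_seq = []
for _ in range(200):
    M = rng.random((n, n)) + np.eye(n)
    A_seq.append(M / M.sum(axis=1, keepdims=True))

Phi = transition_matrix(A_seq)
# Agreement condition: all rows of Phi converge to one stochastic vector,
# so the spread between the rows is essentially zero.
print(np.ptp(Phi, axis=0).max())
```

The product of row-stochastic matrices remains row stochastic, and the maximum spread between rows shrinks geometrically, matching the bound in Lemma 1.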
Lemma 1 [35]: Consider a communication graph G. Assume that there exists a scalar τ ∈ (0, 1) such that ∀i ∈ V it holds a_ii(k) ≥ τ, and for all i ≠ j it holds either a_ij(k) = 0 or a_ij(k) ≥ τ. If G is rooted and the adjacency matrix A(k) is row stochastic ∀k, then there exist two positive scalars B > 0 and ξ ∈ (0, 1) and a stochastic vector φ(s) such that, for all i, j ∈ V and k ≥ s, |[Φ(k, s)]_ij − φ_j(s)| ≤ Bξ^{k−s}.

B. Ancillary Definitions and Lemmas
We introduce the following definitions that will be used in the theoretical analysis.
Definition 1 [25]: A subset S ⊂ V of agents is said to be r-reachable, with r ∈ N, if there exists an agent i ∈ S such that |N i \S| ≥ r.
Definition 2 [25]: For r ∈ N, graph G is said to be r-robust if for all pairs of disjoint nonempty subsets, S 1 , S 2 ⊂ V, at least one of S 1 or S 2 is r-reachable.
Definition 2 implies that a complete graph with n agents is ⌊(n + 1)/2⌋-robust. We additionally consider the following lemmas.
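Definitions 1 and 2 can be checked by brute force on small graphs. The sketch below (our own illustration) verifies that a complete graph with n = 5 agents is ⌊(n + 1)/2⌋ = 3-robust but not 4-robust.

```python
from itertools import combinations

def is_r_reachable(adj, S, r):
    """Definition 1: S is r-reachable if some node in S has at least r
    in-neighbors outside S."""
    return any(len(adj[i] - S) >= r for i in S)

def is_r_robust(adj, r):
    """Definition 2, checked by brute force over all pairs of disjoint
    nonempty subsets of nodes (feasible only for small graphs)."""
    nodes = set(adj)
    subsets = [set(c) for k in range(1, len(nodes))
               for c in combinations(nodes, k)]
    for S1 in subsets:
        for S2 in subsets:
            if S1 & S2:
                continue  # subsets must be disjoint
            if not (is_r_reachable(adj, S1, r) or is_r_reachable(adj, S2, r)):
                return False
    return True

n = 5
complete = {i: set(range(n)) - {i} for i in range(n)}  # in-neighbor map
print(is_r_robust(complete, (n + 1) // 2))       # True
print(is_r_robust(complete, (n + 1) // 2 + 1))   # False
```

For disjoint S1, S2 in a complete graph, the smaller set has at most ⌊n/2⌋ members, so every node in it sees at least ⌈n/2⌉ nodes outside, which is exactly the ⌊(n + 1)/2⌋-robustness stated above.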
Lemma 2 [36]: Suppose a graph G is r-robust. Let G′ be a graph obtained by removing r − 1 or fewer incoming edges from each node in G. Then, G′ is rooted.

III. FEDERATED LEARNING PROBLEM IN NONADVERSARIAL SETTINGS
Traditional centralized learning algorithms require all training samples to be available to a central processing unit computing the optimal model.However, such algorithms may not be suitable for certain scenarios, where, for instance, 1) the owners of the training samples prefer not to disclose private information with a central processing unit or 2) the number of samples is too large, and it is impractical or even impossible to process them with a single processing unit.FL overcomes these limitations by distributing the learning process among multiple agents that hold private local datasets.This ensures privacy preservation and enables the processing of large datasets in a distributed fashion.
We now present the FL problem in a nonadversarial setting. Consider n agents that communicate with a server unit or share a memory and aim to collaboratively learn the parameters w of a global model. Each agent i has access to a local training dataset D_i, and its local objective is to find the optimal model parameters w ∈ R^m obtained by solving the following optimization problem:

min_{w∈R^m} f_i(w)   (7)

where f_i, generally referred to as the loss function, depends on D_i; for example, the mean square error (MSE) function can be chosen.

Assumption 1: The objective functions f_i(w) are convex, and their gradients are continuous and bounded for bounded w.

A sensible global model can be obtained as follows:

min_{w_1,...,w_n} Σ_{i=1}^n f_i(w_i)   (8a)
s.t. w_i = w_j ∀i, j ∈ V   (8b)

where each agent i holds a copy of the decision vector w_i and these copies are required to agree in (8b). The formulation in (8) can be viewed as a distributed optimization problem.

Remark 1: Assumption 1 is not too restrictive, and several loss functions exist that satisfy it [38], [39]. Moreover, as discussed in [38], the main advantage of nonconvex loss functions is their ability to reduce the effects of outlier samples. Since our algorithm is resilient against manipulated samples, the importance of using these functions diminishes.
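As an illustration of the global problem (8) with MSE losses, the following sketch (synthetic data of our own choice) computes the minimizer of the summed quadratic losses of four agents in closed form via the normal equations.

```python
import numpy as np

# Each agent i holds a private dataset (X_i, y_i) and a quadratic MSE loss
# f_i(w) = ||X_i w - y_i||^2 / |D_i|, so the global problem min_w sum_i f_i(w)
# is solvable in closed form.
rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0])
datasets = []
for _ in range(4):                       # four agents, synthetic local data
    X = rng.normal(size=(30, 2))
    y = X @ w_true + rng.normal(scale=0.1, size=30)
    datasets.append((X, y))

G = sum(X.T @ X / len(y) for X, y in datasets)   # sum of local curvatures
b = sum(X.T @ y / len(y) for X, y in datasets)
w_star = np.linalg.solve(G, b)                   # minimizer of the summed loss
print(w_star)                                    # close to w_true
```

No agent needs to reveal its raw samples to compute w_star cooperatively; only aggregated quantities are exchanged, which is the premise of the FL formulations above.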

IV. RESILIENT FEDERATED LEARNING PROBLEM AND ALGORITHM DESIGN
As mentioned in the Introduction, it is possible to partition all kinds of attacks into two categories, depicted in Fig. 3.
1) Data Poisoning Attack (e.g., [40]): The adversary injects deceptive samples into the local datasets D_i of some agents, while these agents run the learning process to solve (7) and honestly send the results to the server. Since the local objective f_i(w) depends on the dataset, the effect of this attack is that a compromised agent i will locally optimize a different objective function f_i^a(w). Note that it is usually difficult to detect this kind of attack [41], since agents execute the update algorithm correctly, while their dataset is corrupted. Next, we will refer to compromised agents as adversarial ones.

2) Model Poisoning Attack (e.g., [13]): The adversary manipulates the local model parameters w_i that are sent during the learning process. As a result, the adversaries do not solve (7). Instead, they generally optimize an adversarial objective aiming to mislead all the regular agents. This leads to the possibility of adversarial agents sending arbitrary model parameters w_i to other agents, with no constraints imposed on their behavior.

Consider n agents among which n_r ≤ n are regular and aim to solve the FL problem in Section III, and n_a ≤ F are adversarial, that is, they perform either data or model poisoning attacks, with F a positive constant and n = n_r + n_a. A sensible solution to (8) in an adversarial setting is to ignore the adversaries and find the optimizer among the regular agents

min_{w∈R^m} Σ_{i∈V_r} f_i(w)   (9)

where V_r represents the set of regular agents with nominal behavior. In the following, without loss of generality, we model shared memory architectures as complete graphs in which each agent receives the model parameters from all the other agents, that is, all the agents share the respective model parameters. Similar to (8), (9) can be viewed as a distributed optimization problem and, in particular, as a resilient distributed optimization problem.
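The vulnerability of plain averaging noted in the Introduction can be reproduced in a few lines: with knowledge of the other updates (as in the shared memory setting), a single adversary can steer the averaged model to an arbitrary target. This is a minimal sketch with made-up numbers.

```python
import numpy as np

# With plain averaging, a single adversary can force the aggregate to any
# target t by sending w_a = n*t - (sum of the regular parameters).
regular = [np.array([1.0, 2.0]), np.array([0.9, 2.1]), np.array([1.1, 1.9])]
target = np.array([-50.0, 50.0])          # arbitrary adversarial goal
n = len(regular) + 1                      # three regular agents + one adversary
w_adv = n * target - sum(regular)         # crafted malicious parameters
aggregate = (sum(regular) + w_adv) / n    # plain average aggregation
print(aggregate)                          # equals the adversary's target
```

This is exactly the failure mode that motivates replacing the plain average with resilient aggregation rules such as SETA.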

A. Selective Trimmed Average Algorithm
We propose a resilient FL algorithm, referred to as the SETA algorithm, aimed at solving (9). The basic idea behind SETA is that each regular agent filters out coordinatewise outlier values received from the other agents and updates the parameter vector by averaging the remaining values. Algorithm 1 summarizes the proposed SETA protocol, which is composed of three main phases.
For each time step k, in the first phase, each regular agent i gathers the local parameters w_j(k) from the other agents in the network and, for each coordinate z ∈ {1, . . ., m}, runs a clustering algorithm which builds the sets V_l^z(k) and V_h^z(k), containing the F agents with the lowest and the highest values of the zth coordinate w_j^z(k), respectively, as well as the set V_n^z(k) comprising the remaining q = n − 2F agents, that is, V_n^z(k) = V \ (V_l^z(k) ∪ V_h^z(k)).

In the second phase, the weights a_ij^z(k) for the in-neighbors are assigned. More in detail, if the agent i belongs to V_n^z(k), then the weights a_ij^z(k) are set to 1/q ∀j ∈ V_n^z(k), and are set to 0 otherwise. In case the agent belongs to any of the outlier sets, that is, i ∈ V_l^z(k) ∪ V_h^z(k), it selects an agent i_r^z(k) ∈ V_n^z(k) whose value is replaced by its own. Next, the weights a_ij^z(k) are set to 1/q for j = i and ∀j ∈ V_n^z(k)\{i_r^z(k)}, and are set to 0 otherwise.
An illustration of the effect of Phase 2 of SETA is provided in Fig. 4. Starting from the complete graph with n agents (in the circle on the left) representing the communication network, Phase 2 leads to m graphs, one for each component z ∈ {1, . . ., m}. Specifically, each graph (as depicted in the zoom on the right) can be viewed as a combination of a complete graph composed of q = n − 2F agents (in the circle) and additional 2F agents (marked in gray) receiving information, but which do not have any influence on the others, that is, the values of such 2F agents are filtered out by the other agents. This implies that the edges of the graph considered for each component are time-varying and depend on the identified clusters at each time step.
At this point, the third phase updates the states following the distributed subgradient optimization algorithm introduced in [37] and using the graphs obtained in Phase 2 for each component z, that is, the update rule for agent i is

w_i^z(k + 1) = l_i^z(k) − c(k)d_i^z(k),   l_i^z(k) = Σ_{j∈V} a_ij^z(k)w_j^z(k)   (10)

where d_i^z(k) denotes the zth component of a subgradient d_i(k) of the local loss f_i,
with c(k) the step size, defined as a positive diminishing sequence satisfying

lim_{k→∞} c(k) = 0,   Σ_{k≥0} c(k) = ∞.   (11)

The driving force behind SETA is the principle of self-trust. In Phase 2 of SETA, an agent may discover that its coordinate z significantly deviates from those of the other agents, making it an outlier belonging to V_l^z(k) or V_h^z(k). Nevertheless, each agent maintains confidence in its own estimation and includes its own zth coordinate in the aggregation process, that is, a_ii^z = 1/q, regardless of its perceived outlier status.
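A minimal sketch of one SETA iteration for a single coordinate, under our reading of Phases 1-3, is given below; the choice of i_r^z(k) as the remaining agent with the closest value is our assumption, as is the toy data.

```python
import numpy as np

def seta_step(w, grads, i, F, c):
    """One SETA update of agent i for a single coordinate z.

    Sketch under our reading of Phases 1-3: the F lowest and F highest
    values are clustered as outliers (V_l, V_h); the q = n - 2F remaining
    values are averaged with weights 1/q; an outlier agent keeps weight
    1/q on its own value (self-trust) and drops the remaining agent i_r
    with the closest value (the choice of i_r is our assumption)."""
    n = len(w)
    q = n - 2 * F
    order = np.argsort(w)
    V_l, V_n, V_h = set(order[:F]), list(order[F:n - F]), set(order[n - F:])
    if i in V_l or i in V_h:
        i_r = min(V_n, key=lambda j: abs(w[j] - w[i]))  # assumed selection
        keep = [j for j in V_n if j != i_r] + [i]       # self-trust: keep own value
    else:
        keep = V_n
    l_i = sum(w[j] for j in keep) / q                   # weights a_ij^z = 1/q
    return l_i - c * grads[i]                           # subgradient step, cf. (10)

w = np.array([0.2, 1.0, 1.1, 0.9, 9.0])   # agents 0 and 4 hold outlier values
g = np.zeros(5)                           # zero gradients: pure filtering step
print([round(float(seta_step(w, g, i, F=1, c=0.1)), 3) for i in range(5)])
```

For the regular agents in V_n (indices 1-3 here), the update is the average of the q = 3 retained values, so the extreme values 0.2 and 9.0 have no influence on them; the outlier agents are pulled toward the bulk while keeping self-trust weight on their own estimates.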
Remark 2: SETA is also suitable for a server–worker architecture in which, in Phase 1, all the local parameters w_i(k) are communicated to the central server, which builds the clusters V_l^z(k), V_h^z(k), and V_n^z(k). Based on these, the server computes the variables a_ij^z(k) defined in Phase 2 and updates the states w_i(k + 1) as in Phase 3. Finally, the server sends back the updated states to each agent i according to the architecture in Fig. 1. Without loss of generality, we consider the shared memory architecture in our mathematical treatment.
The main difference compared to previous aggregation rules for server-worker architecture is that with SETA each agent i receives a different parameter update w i (k + 1), and these parameters converge to the same vector for all regular agents.

V. MATHEMATICAL ANALYSIS
We now focus on proving that SETA is resilient to all types of attacks described in Section IV. To this aim, we first prove that if n and F satisfy the inequality for the complete graph to be (2F + 1)-robust, that is,

n ≥ 4F + 1   (12)

then the agents reach consensus with the SETA protocol and the consensus value does not depend on the adversarial agents. Next, we provide a deterministic bound on the convergence to the optimal solution of the resilient FL problem.
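Under the reading that (12) reduces to n ≥ 4F + 1 for the complete graph (equivalently, 2F + 1 ≤ ⌊(n + 1)/2⌋, the robustness of the complete graph noted after Definition 2), the largest tolerable number of adversaries can be computed as follows; the helper name is ours.

```python
def max_tolerable_adversaries(n):
    """Largest F for which the complete graph on n agents is (2F+1)-robust,
    i.e., 2F + 1 <= (n + 1) // 2, equivalently n >= 4F + 1 (our reading
    of condition (12))."""
    return ((n + 1) // 2 - 1) // 2

for n in (5, 9, 13):
    print(n, max_tolerable_adversaries(n))
```

For example, five agents tolerate a single adversary, and each additional adversary requires four more agents in the network.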

A. Agents Consensus
As discussed above, SETA produces m different adjacency matrices corresponding to m different graph topologies. We notice that each agent i filters out at most 2F of its in-neighbors' states to update each coordinate of the parameter vector w_i. Considering Lemma 2, the (2F + 1)-robustness property in (12) ensures that the m graphs resulting from SETA are rooted.
Lemma 4: Consider n agents with a (2F + 1)-robust network in a shared memory architecture. Then, under Assumption 1, all agents executing SETA in Algorithm 1 converge to a constant consensus vector w̄ ∈ R^m, that is, lim_{k→∞} w_i(k) = w̄ ∀i ∈ V_r, and w̄ is not influenced by adversarial agents.
Proof: To prove this result we consider two steps: 1) by assuming that regular agents are at consensus, we demonstrate that this is not affected by adversaries and 2) then we prove that regular agents actually reach a constant consensus vector w.
Regarding the first step, in case all the adversaries belong to V_l^z(k) ∪ V_h^z(k), they are filtered out by all regular agents according to SETA and cannot deviate the consensus value. We thus consider the case where K_z adversarial agents a_i, i = 1, . . ., K_z, belong to V_n^z(k). This implies that there must also exist K_z regular agents belonging to V_l^z(k) and K_z regular agents belonging to V_h^z(k). The component z of the state of adversarial agent a_i can be written as a linear combination of two filtered regular agents as follows:

w_{a_i}^z(k) = σ_i^z(k)w_{h_i}^z(k) + (1 − σ_i^z(k))w_{l_i}^z(k)   (13)

for i = 1, . . ., K_z and 0 < σ_i^z(k) < 1, where w_{h_i}^z(k) and w_{l_i}^z(k) represent the zth components of regular agents that belong to V_h^z(k) and V_l^z(k), respectively. Therefore, A^z(k) is equivalent to an additional row stochastic adjacency matrix, Ã^z(k), where the two filtered regular agents are considered in place of each remaining adversarial agent in V_n^z(k). It follows that the case of K_z adversarial agents in V_n^z(k) is mathematically equivalent to the situation in which the adversarial agents do not communicate their state, but rather send the states of the respective two regular agents according to (13). In this equivalent network graph, the regular agents filter out at most 2F of their incoming edges as well. Thus, since the graph representing the network is (2F + 1)-robust, we can conclude, by virtue of Lemma 2, that the equivalent adjacency matrix Ã^z(k) is rooted and stochastic. Furthermore, by recalling that the adversarial agents are filtered out in Ã^z(k), we obtain that, if the regular agents reach a consensus, this value cannot be influenced by the adversaries.
At this point, we focus on proving that the regular agents reach a consensus vector w̄. From (10), we have

w_i(k + 1) = l_i(k) − c(k)d_i(k).   (14)

We can rewrite the zth value of w_i(k + 1) in (14) as follows:

w_i^z(k + 1) = l_i^z(k) − c(k)d_i^z(k).   (15)

According to the definition of l_i^z(k) in (10), (15) is equal to

w_i^z(k + 1) = Σ_{j∈V} a_ij^z(k)w_j^z(k) − c(k)d_i^z(k).   (16)

Then, using the transition matrix defined in (5), we obtain ∀i ∈ V and k > s

w_i^z(k + 1) = Σ_{j∈V} [Φ^z(k, s)]_ij w_j^z(s) − Σ_{r=s}^{k−1} Σ_{j∈V} [Φ^z(k, r + 1)]_ij c(r)d_j^z(r) − c(k)d_i^z(k)   (17)

where Φ^z is the transition matrix associated with [a_ij^z]. Since by Assumption 1 the gradients are bounded, it holds lim_{k→∞} c(k)d_i^z(k) = 0 according to (11). Therefore, the behavior of (17) is determined by the terms i) Σ_{j∈V} [Φ^z(k, s)]_ij w_j^z(s), ii) [Φ^z(k, r + 1)]_ij, and iii) c(r)d_j^z(r). Considering that the resulting graph after implementing SETA is equivalent to a row stochastic and rooted graph, in which the adversaries are filtered out, from Lemma 1 it follows ∀i ∈ V_r:

lim_{k→∞} [Φ^z(k, s)]_ij = φ_j^z(s)   (18)

which yields

lim_{k→∞} Σ_{j∈V} [Φ^z(k, s)]_ij w_j^z(s) = Σ_{j∈V} φ_j^z(s)w_j^z(s).   (19)

Since any s < k can be selected for which w_j^z(s) is bounded, we observe that term i) of (17) converges to the same value for all regular agents. Regarding term ii), Lemma 1 leads to

|[Φ^z(k, r + 1)]_ij − φ_j^z(r + 1)| ≤ Bξ^{k−r−1}   (20)

with B > 0 and 0 < ξ < 1. According to Assumption 1, denoting by D a bound on the subgradient components, we can write the following inequality for term iii):

|c(r)d_j^z(r)| ≤ c(r)D.   (21)

Thus, the overall term given by multiplying ii) and iii) is bounded in the interval [−c(r)D(φ_j^z(r + 1) + Bξ^{k−r−1}), c(r)D(φ_j^z(r + 1) + Bξ^{k−r−1})]. In view of (11) and Lemma 3, both the extremes of the interval tend to zero. Therefore, it holds

lim_{k→∞} |w_i^z(k) − w_j^z(k)| = 0 ∀i, j ∈ V_r   (22)

proving that the regular agents reach consensus. At this point, we prove that the consensus value is a constant vector w̄. Recall that the adjacency matrix of the mathematically equivalent graph achieved by (13), where the adversarial agents are filtered out, is row stochastic. Therefore, from (22) and the definition of l_i^z(k) in (10), ∀i ∈ V_r, it follows:

lim_{k→∞} |l_i^z(k) − w_i^z(k)| = 0.   (23)

On the other hand, the update rule in (10) yields

lim_{k→∞} |w_i^z(k + 1) − l_i^z(k)| = lim_{k→∞} c(k)|d_i^z(k)|.   (24)

Since in view of (11) it holds lim_{k→∞} c(k)d_i^z(k) = 0, by combining (23) and (24), one obtains

lim_{k→∞} |w_i^z(k + 1) − w_i^z(k)| = 0   (25)

which holds ∀z ∈ {1, . . ., m} and shows that the consensus vector of the regular agents is constant, proving the desired result.

B. Resiliency Analysis
In the previous section, we demonstrated that the regular agents reach a consensus that is not influenced by adversaries.We now need to prove that the agents converge to a sensible solution of the optimization problem in (9).To demonstrate this result, we first consider the following auxiliary lemma.
Lemma 5: Under the assumptions of Lemma 4, there exists a finite time step T such that, for all k ≥ T and each component z, the regular agents do not change the sets V_l^z(k), V_n^z(k), and V_h^z(k) to which they belong.

Proof: To prove the result, first note that, according to (10), for an agent i_l ∈ V_l^z(k) and an agent i_n ∈ V_n^z(k), the averages l_{i_l}^z(k) and l_{i_n}^z(k) differ only in the terms associated with agents i_l and i_r^z(k). Given the definition of i_r^z(k), it holds w_{i_l}^z(k) < w_{i_r^z(k)}^z(k), and therefore

l_{i_l}^z(k) − l_{i_n}^z(k) < 0.   (26)

Similarly, we can prove that l_{i_h}^z(k) − l_{i_n}^z(k) > 0 for an agent i_h ∈ V_h^z(k). Next, recall that according to (10), the update rules for agents i_l and i_n are

w_{i_l}^z(k + 1) = l_{i_l}^z(k) − c(k)d_{i_l}^z(k)   (27)
w_{i_n}^z(k + 1) = l_{i_n}^z(k) − c(k)d_{i_n}^z(k).   (28)

Then, if d_{i_l}^z(k) ≥ d_{i_n}^z(k), from (27) and (28) it follows

w_{i_l}^z(k + 1) − w_{i_n}^z(k + 1) ≤ l_{i_l}^z(k) − l_{i_n}^z(k) < 0   (29)

leading to

w_{i_l}^z(k + 1) < w_{i_n}^z(k + 1).   (30)

With the same argumentation, one obtains

w_{i_h}^z(k + 1) > w_{i_n}^z(k + 1)  if  d_{i_h}^z(k) ≤ d_{i_n}^z(k).   (31)

At this point, since in view of Assumption 1 the functions f_i(·) are differentiable ∀i and their gradients d_i(k) are continuous, we observe that, by virtue of Lemma 4, the gradients converge to constant vectors according to the consensus value of the states. This implies that there exists a large enough time step index s such that the ordering of the gradients does not change ∀k ≥ s, that is, it holds ∀i, j that if d_i^z(s) < d_j^z(s), then d_i^z(k) < d_j^z(k) ∀k ≥ s. In view of (26), the difference w_{i_n}^z(k + 1) − w_{i_l}^z(k + 1) can be written as

w_{i_n}^z(k + 1) − w_{i_l}^z(k + 1) = l_{i_n}^z(k) − l_{i_l}^z(k) − c(k)(d_{i_n}^z(k) − d_{i_l}^z(k))   (32)

where i_r^z(k) ∈ V_n^z(k) by construction. Let us define C as the supremum over k ≥ s of the distance between the states of the regular agents, for which it holds 0 < C < ∞, since the regular agents reach consensus (Lemma 4).

We first analyze the case in which the agents in V_n^z(k) are ordered consistently with their gradients; we observe that in this case it holds w_{i_n}^z(k) ≥ w_{i_r^z(k)}^z(k): otherwise, with a similar argumentation to (30), we would obtain w_{i_n}^z(k) < w_{i_r^z(k)}^z(k), which is a contradiction given the definition of i_r^z(k) for i_l ∈ V_l^z(k). We want to prove that if d_{i_l}^z(s) < d_{i_n}^z(s), then the agents i_l and i_n switch at a finite time step T, that is, there exists s < T < ∞ such that w_{i_n}^z(T) < w_{i_l}^z(T). From (32), we can recursively find that, for k > s, the difference w_{i_n}^z(k) − w_{i_l}^z(k) is upper bounded by the sum of a consensus term, bounded by C, and the term −Σ_{r=s}^{k−1} c(r)(d_{i_n}^z(r) − d_{i_l}^z(r)). Since the first term is bounded and, in view of (11), the second term diverges to −∞, the first term vanishes faster than the second one, and there exists a time step T < ∞ such that w_{i_n}^z(T) − w_{i_l}^z(T) < 0.

We now consider the case in which the agents in V_n^z(k) are not ordered consistently with their gradients. In this case, according to (30), there must exist an agent with a higher gradient that switches with i_l. This implies that the switching is a step toward sorting the agents with respect to their gradients. We can then reinitialize s = k and proceed as before. Since the number of agents is bounded and all the replacements are toward sorting the agents, after a finite number of time steps the same case as above is obtained, and it is proven that there exists a time step T < ∞ such that w_{i_n}^z(T) − w_{i_l}^z(T) < 0. We can write a similar argument for d_{i_n}^z(s) < d_{i_h}^z(s) and w_{i_n}^z(s) < w_{i_h}^z(s). Moreover, from (30) and (31), we observe that once the regular agents are ordered, they will not change their sets. This concludes the proof.
At this point, we can formally state our main theorem regarding the resiliency of SETA in case of attacks. We define the power set $\mathcal{P}(V_r)$ of the set of regular agents $V_r$ and its subset $S$. Theorem 1: Consider $n$ agents, among which up to $F$ can be adversarial. Assume $F$ satisfies (12) and Assumption 1 holds. Then, by implementing SETA in Algorithm 1 in a shared memory architecture, the states $w_i$ of the regular agents $i \in V_r$ converge to the smallest hypercube containing all $w^*_j$, for $j \in S$, $j = 1, \ldots, |S|$. Proof: By Lemma 5, it follows that there exists a time step $T$ after which the regular agents do not swap their position
between sets. Let us consider $k > T$ and define the sets $z_h$ and $z_l$ containing the $q$ regular agents with the highest and lowest state values $w^z_i$, respectively. Let us introduce the following auxiliary vectors for a given component $z$. From Lemma 4, it follows that the regular agents reach consensus despite the presence of adversaries in the network.
Let $w$ be the consensus vector, that is, $w = w_i(k)$ for all $i \in V_r$ as $k \to \infty$. To prove the theorem, we show that the following equivalent inequality is satisfied: $m^z \le w^z \le M^z$. Assume by contradiction that $w^z = M^z + \epsilon$, with $\epsilon > 0$. We introduce the following variables. By considering that there are at most $F$ adversarial agents and that it holds $|z_h| = q = n - 2F$, we deduce that there must exist at least $F$ regular agents with lower states $w^z_i(k)$ which are not included in $z_h$. This implies that $V^z_l(k) \cap z_h = \emptyset$ for all $k > T$. By applying a similar reasoning, we obtain the analogous property for $z_l$. Therefore, by virtue of the update procedure in (10), the following inequality holds true, which leads to a bound on $M^z(k+1)$. Considering (42) and (41) for $M^z(k+1)$, it follows a relation that can be generalized for $M^z(k+Z)$ as in (44). Since we assumed $w^z = M^z + \epsilon$, there exists a time step $k_0 > T$ such that the following inequality is verified for $k \ge k_0$ and $i \in V_r$, which, summing over all $j \in z_h$ and considering the definition in (41), leads to (45). In view of (39) and the convexity of the loss functions, we observe that if it holds $w^z_i(k) = M^z$ for all $i \in z_h$, then the sum of the respective gradients is null. Therefore, in order to fulfill (45), we obtain from (44) that, for a large enough $Z$, it holds $M^z(k_0 + Z) < q(M^z + \epsilon/2)$, which is in contradiction with (46). By applying a similar reasoning, one can prove $m^z \le w^z$. This concludes the proof. Remark 3: The adversarial agents' behavior is not constrained in the above proof. Therefore, SETA is resilient against both the data and model poisoning attacks described in Section IV.
Remark 4: The challenges of training FL algorithms are significantly increased with non-IID datasets, as extensively discussed in [42]. In some extreme cases, such as when each agent possesses data samples containing only one class in a multiclass classification problem, the non-IID distribution can lead the global model to fail in achieving satisfactory performance. In the mathematical analysis presented in this article, no assumptions about the properties of the datasets are made. This means that Algorithm 1 leads the states of all regular agents to a consensus vector, which is not influenced by adversaries and belongs to the hypercube defined in Theorem 1, regardless of whether the dataset is IID or non-IID. However, as noted in [43], the local decision vectors are closer in the IID case than in the non-IID case. Therefore, the convergence hypercube is a smaller neighborhood around the global optimal solution in the IID case compared to the non-IID case. Consequently, we can expect more accurate outcomes with IID datasets than with non-IID datasets.
The above theorem provides a region of convergence of the regular agents. A condition for reaching the optimal solution to (9) can be defined using the concept of redundancy in distributed optimization. In particular, the work in [44] proves that a necessary condition for finding a solution to (9) with up to $F$ adversarial agents is that the cost functions $f_i(w)$ fulfill the 2F-redundancy property, that is, for any subset of agents $S$ with $|S| = n - 2F$, the following holds true: $\arg\min_{w} \sum_{i \in S} f_i(w) = W^*$ (47), where $W^*$ is the convex set of optimal solutions of (9). Based on (47), we identify the condition for SETA to converge to an optimal solution of (9). Corollary 1: Assume that the conditions of Theorem 1 hold and the 2F-redundancy property is fulfilled. Then, by implementing SETA in Algorithm 1, the states $w_i$ of the regular agents $i \in V_r$ converge to an optimal solution of (9).
Proof: The proof easily follows by considering that, if the 2F-redundancy property is verified, then it holds $m, M \in W^*$. According to Assumption 1, the loss functions are convex, and from Theorem 1 $m^z \le w^z \le M^z$; therefore, it holds $w \in W^*$. Note that reaching the exact optimal solution in 2F-redundant problems can be viewed as a metric to evaluate resilient optimization algorithms. In particular, since 2F-redundancy is a necessary condition for finding the exact optimal solution of (9), a well-designed algorithm should converge to this solution when the 2F-redundancy property is met.

VI. SIMULATION RESULTS
In this section, we validate the resiliency of SETA against several attack types using two datasets with different levels of complexity and compare it against different baselines.

A. Setting
Datasets: We validate the proposed algorithm with two datasets: 1) the MNIST dataset [45], [46], collecting grayscale images of digits with resolution 28 × 28 and 2) the MWD in [33], collecting colored images of different weather conditions, that is, cloudy, sunny, rainy, and sunrise, with variable resolution. In both cases, our FL objective is to perform classification [of digits for 1) and of weather conditions for 2)], considering that each agent has access to a private local dataset D_i. The choice of these two datasets is motivated by the fact that the former is a widely used standard dataset in the literature, allowing us to conduct a relatively simple initial validation of our method, while the latter presents a more challenging case study, which is valuable for validating the algorithm in realistic contexts with practical applications. Specifically, as we operate within the context of precision agriculture robotics in the European project CANOPIES (https://canopies.inf.uniroma3.it/), the classification of weather conditions can help robots operate more safely in their environment. For instance, if the robots detect cloudy or rainy weather, they may decide to move to a sheltered location to avoid damage or to speed up their activities. The MNIST dataset is composed of 60,000 training samples and 10,000 testing samples, while the MWD includes 1125 samples in total, which we resize to a resolution of 50 × 50 and randomly split into 80% for training and 20% for validation.
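The random 80/20 split used for the MWD can be sketched as follows. This is a minimal illustration, not the paper's code; the helper name `train_val_split` and the fixed seed are our own choices.

```python
import numpy as np

def train_val_split(samples, train_frac=0.8, seed=0):
    """Randomly split a list of samples into train/validation subsets.

    Mirrors the 80/20 split described for the MWD dataset; `seed` is
    fixed here only for reproducibility of the sketch.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    cut = int(train_frac * len(samples))
    train = [samples[i] for i in idx[:cut]]
    val = [samples[i] for i in idx[cut:]]
    return train, val

# 1125 placeholder items stand in for the resized 50x50 MWD images.
data = list(range(1125))
train, val = train_val_split(data)
```

With 1125 samples, this yields 900 training and 225 validation samples.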
In the following, we consider both IID and non-IID distributions of the training datasets among the agents. In the IID case, the training samples are uniformly distributed among the agents. In the non-IID case, the local datasets D_i are composed of random samples associated with k classes of the dataset. For the MNIST dataset, we select k ∼ U(3, 4), that is, each agent has samples of either three or four digits in its local dataset, while for the MWD case, we consider k ∼ U(1, 2), that is, each agent has samples of one or two weather conditions (out of four). The testing samples are used to evaluate the performance.
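One plausible way to realize this non-IID partitioning is sketched below. The exact sampling scheme used in the paper is not specified, so the function `non_iid_partition` and its details (each agent receiving all indices of its chosen classes) are our own assumptions.

```python
import numpy as np

def non_iid_partition(labels, n_agents, k_range=(3, 4), seed=0):
    """Give each agent indices drawn only from k randomly chosen classes,
    with k uniform over {k_range[0], ..., k_range[1]} (k in {3, 4} for
    the non-IID MNIST setting). A sketch, not the paper's implementation.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    by_class = {c: np.flatnonzero(labels == c) for c in classes}
    agents = []
    for _ in range(n_agents):
        k = rng.integers(k_range[0], k_range[1] + 1)   # k ~ U{3, 4}
        chosen = rng.choice(classes, size=k, replace=False)
        idx = np.concatenate([by_class[c] for c in chosen])
        agents.append(rng.permutation(idx))
    return agents

labels = np.repeat(np.arange(10), 60)   # toy stand-in for MNIST labels
parts = non_iid_partition(labels, n_agents=5)
```

Each entry of `parts` then indexes samples from only three or four digit classes.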
Agents: In the MNIST case study, we analyze a system comprising n = 100 agents, out of which n_a = F = 20 agents are randomly designated as adversarial. In contrast, for the MWD case study, we consider a system of n = 50 agents, out of which n_a = F = 10 agents are randomly selected as adversarial. Each agent has a two-layer fully connected neural network with 100 hidden units in the MNIST case and 256 hidden units in the MWD case. Leaky ReLU and softmax activation functions are used for the hidden and output layers, respectively. Note that, although the leaky ReLU is not differentiable at the origin, continuous pseudo-derivatives of the leaky ReLU have been proposed in the literature, for example, in [47], which can be used to satisfy Assumption 1 and ensure the mathematical soundness of the backpropagation learning procedure. Weights are initialized by each agent according to a uniform distribution U(0_m, 0.01 · 1_m), and models are trained for 1000 steps with a step size c(k) that fulfills the conditions in (11), where s denotes the warm-up steps, set to s = 300, γ is a positive constant set to γ = 500, and c(0) = 1 in the MNIST case and c(0) = 0.01 in the MWD case.
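The per-agent model described above (two fully connected layers, leaky ReLU hidden activation, softmax output, U(0, 0.01) initialization) can be sketched as a forward pass in NumPy. This is an illustrative reconstruction; the function names and the leaky ReLU slope of 0.01 are our own assumptions.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Negative slope alpha is an assumed value, not stated in the text.
    return np.where(x > 0, x, alpha * x)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def init_params(n_in, n_hidden, n_out, seed=0):
    # Uniform U(0, 0.01) initialization, as described in the text.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.uniform(0, 0.01, (n_in, n_hidden)),
        "b1": rng.uniform(0, 0.01, n_hidden),
        "W2": rng.uniform(0, 0.01, (n_hidden, n_out)),
        "b2": rng.uniform(0, 0.01, n_out),
    }

def forward(p, x):
    h = leaky_relu(x @ p["W1"] + p["b1"])
    return softmax(h @ p["W2"] + p["b2"])

p = init_params(784, 100, 10)            # MNIST-sized network
probs = forward(p, np.ones((2, 784)))    # two dummy flattened 28x28 images
```

Each row of `probs` is a probability vector over the ten digit classes.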
Attacks and Baselines: To validate the resiliency of SETA, we implement three local model poisoning attacks and one data poisoning attack. Regarding the former, we consider: 1) the Gaussian attack, as reported, for example, in [21] and [30], where each adversarial agent i sends a parameter vector w_i drawn from a Gaussian distribution, that is, w_i ∼ N(0_m, 1_m); 2) the model flipping attack, as used, for example, in [30], where each adversary flips the sign of the weights computed according to SETA and communicates the flipped parameters; and 3) the optimization-based attack, introduced in [13], where each adversary determines its local parameter vector by solving an optimization problem. In particular, given the direction along which the global parameter vector would be updated in the absence of attacks, the optimization objective is to deviate the global parameter vector as much as possible toward the inverse of this direction. Regarding the data poisoning attack, a label flipping attack [40] is considered. For the MNIST case, each label is exchanged with the previous digit, that is, 1 is set instead of 2, 2 instead of 3, and so on, while 9 is used in place of label 0. Similarly, for the MWD, we exchange each label with the previous one in the ordered list consisting of cloudy, rainy, sunny, and sunrise. Hence, samples originally belonging to the class rainy are assigned the label cloudy, and so on. For all the attacks, we consider that the adversarial agents begin the attack from the start of our simulations.
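Three of the four attacks have simple closed forms and can be sketched directly (the optimization-based attack of [13] requires the full update direction and is omitted). These are minimal illustrations under our own naming; the label mapping follows the MNIST description above, where label 2 becomes 1, and label 0 becomes 9.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_attack(m):
    """Adversary ignores training and sends w ~ N(0_m, 1_m)."""
    return rng.standard_normal(m)

def model_flipping_attack(w_honest):
    """Adversary flips the sign of the honestly computed parameters."""
    return -w_honest

def label_flipping(labels, n_classes=10):
    """Shift every label to the previous class: 2 -> 1, ..., 0 -> 9."""
    return (labels - 1) % n_classes

w = np.array([0.5, -1.0, 2.0])
y = np.array([0, 1, 2, 9])
```

For example, `label_flipping(y)` maps the labels `[0, 1, 2, 9]` to `[9, 0, 1, 8]`.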
To compare results, we consider the following baselines.
1) Centralized SGD, representing the ideal case where data are not distributed among agents and a single server computes the parameters without any attack, that is, no aggregation rule is used and no adversaries are present. Therefore, this baseline provides an upper bound for the accuracy of SETA and the other FL baselines.
2) Average aggregation, the typical aggregation rule where the parameters are updated by performing a simple coordinatewise average of all agents' local parameters.
3) FedProx [48], which introduces a penalty term in the optimization objective to mitigate the impact of stragglers and non-IID data. It encourages the local model at each agent to stay close to the global model while accounting for the differences in local datasets.
4) Median aggregation, where the coordinatewise median is computed to update the parameters.
5) Trimmed average, where the parameters are updated by filtering out the F highest and F lowest values in a coordinatewise fashion and averaging the remaining values.
6) Krum [15], where the local parameter vector with the lowest distance to its n − F − 2 closest local parameter vectors is selected.
7) Bulyan [19], where n − 2F local parameter vectors are recursively selected by resorting to an aggregation rule, and then combined by discarding, for each coordinate, the 2F values furthest from the median and averaging the remaining ones. As the aggregation rule, we resort to Krum as done in [19].
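The purely statistical aggregation rules among these baselines can be sketched in a few lines. The sketch below stacks the n local parameter vectors as rows of a matrix; the Krum variant selects a single vector by the standard nearest-neighbor score, and all names here are our own.

```python
import numpy as np

def average_agg(W):
    """Coordinatewise average of the stacked local parameter vectors W (n x m)."""
    return W.mean(axis=0)

def median_agg(W):
    """Coordinatewise median."""
    return np.median(W, axis=0)

def trimmed_average_agg(W, F):
    """Drop the F highest and F lowest values per coordinate, average the rest."""
    S = np.sort(W, axis=0)
    return S[F:len(W) - F].mean(axis=0)

def krum(W, F):
    """Return the vector with the smallest summed squared distance to its
    n - F - 2 nearest neighbors (the Krum selection rule of [15])."""
    n = len(W)
    d = ((W[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    # Sorted row d[i] starts with the self-distance 0; skip it and keep
    # the n - F - 2 smallest remaining distances.
    scores = np.sort(d, axis=1)[:, 1:n - F - 1].sum(axis=1)
    return W[np.argmin(scores)]

W = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])  # one outlier vector
```

With F = 1, the average is pulled to 21.2 by the outlier, while the median and trimmed average both return 2.0 and Krum picks an inlier, illustrating why the robust rules matter.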

MNIST Case Study:
The performance reached with the different aggregation rules and attacks is reported in Table II for the MNIST dataset. Percentage accuracy on the test set using the weights at the last training step is shown. In particular, on the left, the IID distribution for the local datasets D_i is considered, while on the right, the non-IID distribution is used. Results in case of Gaussian (first column of each block), model flipping (second column), optimization-based (third column), and label flipping (fourth column) attacks are provided. An accuracy of 97.1% in case of no attack (not reported in the table) is achieved by the centralized SGD, representing the performance that the FL approaches should aim for. Starting from the case of IID distribution (left part of the table), we can observe that a maximum decrease in performance of ≈7% with respect to the centralized case is incurred by SETA under model flipping, while a decrease lower than 3% is reached with the other attacks. Similar performance is obtained by the median and trimmed average aggregation rules. Significantly lower performance is reached instead with the average aggregation rule, which achieves less than 14% accuracy under most attacks. Poor performance (lower than 16%) in case of the optimization-based attack is also obtained with the FedProx, Krum, and Bulyan methods, although Bulyan achieves the highest accuracy in case of model flipping, reaching 93.54% compared to 90.62% for SETA. In the case of non-IID distribution (right part of the table), we can notice that a significant drop in performance is recorded with most attacks when using the average and FedProx aggregation rules, both achieving, for instance, only ≈10% with the optimization-based attack. An overall performance decrease is also recorded with median, Krum, and Bulyan, which achieve accuracy lower than 30% in their respective worst cases. The best accuracy is achieved instead with most of the attacks by the proposed SETA. In particular, a maximum decrease in performance, with respect to the ideal centralized case, of ≈15% is incurred under model flipping, while a decrease lower than 5% is reached with the other attacks. Slightly lower performance is obtained in general by the trimmed average compared to SETA. Fig. 5(a) and (b) shows the accuracy on the test set during the learning process with IID and non-IID distributions, respectively. The four different attacks are reported, that is, Gaussian (top left), model flipping (top right), optimization-based (bottom left), and label flipping (bottom right). Centralized SGD results are shown with dotted dark green lines, while average, FedProx, median, trimmed average, Krum, Bulyan, and the proposed SETA algorithm are reported with dark blue, light red, yellow, purple, light green, light blue, and dark red solid lines, respectively. First, the plots confirm that the results in Table II are also observed during the entire learning process. More specifically, we can observe that in all cases SETA outperforms the other methods, approaching the centralized results more closely and, as expected, showing performance comparable to that of the trimmed average. Second, the plots show the robustness of the proposed algorithm, compared to the other baselines, to the chattering phenomena induced by the model flipping and optimization-based attacks.
MWD Case Study: Similar observations to the MNIST case apply to the MWD case study. Here, a classification accuracy of 83.93% is achieved in the ideal scenario of no attack and centralized SGD (not reported in the table). For the IID distribution setting, we can notice that SETA incurs a maximum performance decrease of approximately 4.5%, compared to the centralized approach, under the model flipping attack, while both the Gaussian and optimization-based attacks result in a performance decrease of less than 3%. Similar performance is obtained when using the trimmed average and median aggregation rules. As for MNIST, lower performance is generally obtained with the average, FedProx, Krum, and Bulyan aggregation rules. Specifically, the average, FedProx, and Krum rules achieve accuracy lower than 51% under the Gaussian and optimization-based attacks, while Bulyan obtains the best performance, 81.25%, under the model flipping attack but performs poorly under the Gaussian and optimization-based attacks, achieving accuracy lower than 42%. For the non-IID distribution setting (on the right), an overall performance decrease is obtained with the average, FedProx, median, Krum, and Bulyan aggregation rules, except for the label flipping attack with the average method, which reaches a best performance of ≈82%, and the model flipping attack with the Bulyan method, which reaches ≈81%. Under the remaining attacks, the best accuracy, similar to the trimmed average, is achieved by SETA. Specifically, a maximum performance decrease, compared to the centralized case, of ≈17% is incurred under model flipping, while a decrease lower than 5.4% is reached with the other attacks. Similar to Fig. 5, we show the accuracy on the MWD test set during the learning process in Fig. 6(a) and (b), for the IID and non-IID distributions, respectively. As for the MNIST case study, the figure shows that the results in Table III are also recorded during the entire learning process. Furthermore, the figure makes evident that the learning process is significantly more challenging in this case study compared to the MNIST one. However, in all tests, SETA is able to closely approach the centralized results without any chattering phenomena.
Consensus of the Agents: Fig. 7 reports the evolution of the norm of the state vectors w_i for all i ∈ V_r associated with the regular agents when using SETA. In particular, the case of non-IID distribution and optimization-based attack with the MNIST dataset is considered, but similar trends are observed in the other cases. The figure shows that, coherently with Lemma 4, different norm values are recorded during the initial training steps (reported in the zoomed inset), while, as the training advances, the agents reach consensus on the weights.

Impact of Varying Number of Adversaries:
We analyzed the performance of SETA when varying the number of adversaries within the set {5, 10, 15, 20, 24}, where each value satisfies the 2F + 1 robustness condition with n = 100. Fig. 8 depicts the accuracy achieved by SETA under the various attacks, that is, Gaussian, model flipping, optimization-based, and label flipping, using the MNIST dataset and the non-IID distribution. The figure shows that stable results are obtained with the Gaussian and optimization-based attacks as the number of adversaries increases. In contrast, a progressive decrease in performance is observed in the case of the model flipping attack, making it the most severe attack for SETA in our tests. Finally, a stable behavior is observed with the label flipping attack up to F = 20, while a noticeable drop in performance is evident at F = 24. This can be explained by the fact that, with the non-IID data distribution, increasing the number of adversarial agents may cause the number of adversarially flipped samples of a specific digit to exceed the number of benign samples of that digit. As a result, correct classification of that digit becomes challenging, leading to a sudden 10% drop in accuracy.

VII. CONCLUSION
In this article, we proposed a resilient FL algorithm, namely SETA, to tackle the presence of adversarial agents that can compromise the distributed learning process. Given the local models of the agents, SETA first performs a coordinatewise clustering of the local parameters. Then, it applies a coordinatewise trimmed average, in which the trimmed values are selected according to the respective clusters. SETA enables FL both in the standard server-worker architecture and in shared memory settings, where a trusted server is not needed. We formally proved the convergence bounds of the algorithm against model and data poisoning attacks and validated the approach using the MNIST and MWD datasets. We compared its performance with the average, median, trimmed average, FedProx, Krum, and Bulyan aggregation rules in case of different attack types. Simulation results confirmed the effectiveness of SETA in adversarial settings, providing generally better performance than the average, median, FedProx, Krum, and Bulyan aggregation rules, as well as results comparable to the trimmed average.
As part of future work, we aim to evaluate SETA's performance on additional real-world and heterogeneous datasets and extend it to fully distributed settings, eliminating both the shared memory and the need for a trusted server.

Manuscript received 12 April 2023; revised 12 August 2023 and 5 December 2023; accepted 23 December 2023. Date of publication 23 January 2024; date of current version 23 July 2024. This work was supported in part by the Knowledge Foundation (KKS) through Safe and Secure Adaptive Collaborative Systems (SACSys) under Grant 20190021, and in part by the Swedish Agency for Innovation Systems (Vinnova) through GREENER: Intelligent Energy Management in Connected Construction Sites under Grant 2019-05877. This article was recommended by Associate Editor P. Shi. (Mojtaba Kaheni and Martina Lippi are co-first authors.) (Corresponding author: Mauro Franceschelli.)

Fig. 2 .
Fig. 2. Shared memory architecture where n agents locally update their models and share the computed parameters.

Algorithm 1
Selective Trimmed Average (SETA) Protocol
Require: F, cost function f_i related to dataset D_i, ∀i ∈ V_r
Each agent i runs indefinitely the following:
Phase 1 - Parameter clustering
  Gather local parameters w_j(k), j ∈ V\{i}
  for each z ∈ {1, . . ., m}

Fig. 4 .
Fig. 4. Depiction of the graph reshape for each component z ∈ {1, . . ., m} given by the implementation of Phase 2 of SETA (Algorithm 1). Complete graphs are depicted by enclosing the agents in circles.

Fig. 5 .
Fig. 5. Accuracy on the MNIST test set during the learning process using IID (left) and non-IID (right) data distributions. Results achieved by average (dark blue), FedProx (light red), median (yellow), trimmed average (purple), Krum (light green), Bulyan (light blue), and SETA (thick red lines), as well as by the centralized architecture (dashed dark green lines), are reported. Three model poisoning attacks (top part and bottom left of each figure) and one data poisoning attack (bottom right of each figure) are considered. (a) MNIST case study: IID data distribution. (b) MNIST case study: non-IID data distribution.

Fig. 6 .
Fig. 6. Accuracy on the MWD test set during the learning process using IID (left) and non-IID (right) data distributions. Results achieved by average (dark blue), FedProx (light red), median (yellow), trimmed average (purple), Krum (light green), Bulyan (light blue), and SETA (thick red lines), as well as by the centralized architecture (dashed dark green lines), are reported. Three model poisoning attacks (top part and bottom left of each figure) and one data poisoning attack (bottom right of each figure) are shown. (a) MWD case study: IID data distribution. (b) MWD case study: non-IID data distribution.

Fig. 7 .
Fig. 7. Evolution of the norm of the state vectors w i ∀i ∈ V r of regular agents using SETA.

Fig. 8 .
Fig. 8. Accuracy of SETA with varying number of adversaries using MNIST dataset and non-IID distribution.
Selective Trimmed Average: A Resilient Federated Learning Algorithm With Deterministic Guarantees on the Optimality Approximation
Mojtaba Kaheni, Member, IEEE, Martina Lippi, Member, IEEE, Andrea Gasparri, Senior Member, IEEE, and Mauro Franceschelli, Senior Member, IEEE

TABLE II PERCENTAGE
ACCURACY ON THE MNIST TEST SET ACHIEVED BY ALL THE CONSIDERED AGGREGATION RULES USING IID (LEFT) AND NON-IID (RIGHT) DATA DISTRIBUTIONS. GAUSSIAN, MODEL FLIPPING, OPTIMIZATION-BASED, AND LABEL FLIPPING ATTACKS ARE CONSIDERED. BEST RESULTS IN BOLD

Table III summarizes the results obtained with MWD using the different aggregation rules and attacks. Results with IID (left) and non-IID (right) distributions of the dataset are shown.

TABLE III PERCENTAGE
ACCURACY ON THE MWD TEST SET ACHIEVED BY ALL THE CONSIDERED AGGREGATION RULES USING IID (LEFT) AND NON-IID (RIGHT) DATA DISTRIBUTIONS. GAUSSIAN, MODEL FLIPPING, OPTIMIZATION-BASED, AND LABEL FLIPPING ATTACKS ARE CONSIDERED. BEST RESULTS IN BOLD