A Greedy Agglomerative Framework for Clustered Federated Learning

Federated learning (FL) has received widespread attention for decentralized training of deep learning models across devices while preserving privacy. Industrial Big Data in key applications such as healthcare, smart manufacturing, autonomous driving, and robotics is inherently multisource and heterogeneous. Recent studies have shown that the quality of the global FL model deteriorates in the presence of such non-IID data. To address this, we present a novel clustered FL framework called federated learning via Agglomerative Client Clustering (FLACC). FLACC greedily agglomerates clients or groups of clients based on their gradient updates while learning the global FL model. Once the clustering is complete, each cluster is separated into its own federation, allowing clients with similar underlying distributions to train together. In contrast with existing methods, FLACC does not require the number of clusters to be specified a priori, can handle client fractions, and is robust to hyperparameter tuning. We demonstrate the efficacy of this framework through extensive experiments on three benchmark FL datasets and an FL case study simulated using industrial mixed fault classification data. Qualitative clustering results show that FLACC accurately identifies clusters in the presence of various statistical heterogeneities in the client data. Quantitative results show that FLACC outperforms vanilla FL and state-of-the-art personalized and clustered FL methods, even when the underlying clustering structure is not apparent.

learning. This approach raises two main concerns [1], [2], [3]. First, there is a huge communication cost in pooling data at the central server due to the scale of industrial Big Data (the number of industrial nodes and the amount of data generated at each node). Second, there are serious privacy threats as the data may contain sensitive or proprietary information.
These concerns of scalability and privacy can be addressed by using federated learning (FL), which is a decentralized framework that collaboratively learns a global model through distributed training across multiple industrial nodes [4]. The participants in FL are called clients, the central entity is called the server, and their collection is called the data federation. FL enables clients to learn a globally optimized model without revealing their private data to the server or to each other [5].
Despite its popularity, recent studies have shown that the global model learned by FL may not be suitable for all clients [6]. While FL can handle non-IID data, it yields suboptimal results when the client data distributions diverge. Since industrial Big Data is inherently heterogeneous, the local data distributions of individual clients are rarely identical. For instance, in healthcare applications, the distributions of users' data vary depending on their behavioral habits and physical characteristics [7] (feature distribution skew). In quality control of 3-D printed parts using computer vision, the types of defects and their pixel intensities differ across printers [8] (label distribution skew). In autonomous driving, images of parked cars look different under different conditions, such as day versus night and cloudy versus sunny [9] (concept drift). Finally, participating IoT devices can hold vastly different quantities of data [10] (quantity skew or unbalancedness). The impact of such statistical heterogeneity is exacerbated in FL as the server has no visibility into the clients' data.
In this article, we propose a novel client clustering framework, termed Federated Learning via Agglomerative Client Clustering (FLACC), to address this issue of divergent client data distributions in FL. We assume that the clients in the federation can be partitioned into disjoint clusters, where clients from the same cluster are likely to have the same underlying data distribution. FLACC begins with all clients belonging to their own singleton clusters. After each communication round, the server computes a similarity measure between the gradient updates sent by the clients and greedily agglomerates the most similar clients or groups of clients. Once the clustering halts, each group is separated into its own federation. This allows clients with the same underlying distributions to benefit by training together and restricts negative information transfer between dissimilar clients.
Our framework can be readily deployed in a realistic federation because it does not require the number of clusters to be known a priori, can seamlessly handle arbitrary client fractions, requires little to no hyperparameter tuning, and identifies hidden clustering structures in seemingly IID data to outperform vanilla FL.
We demonstrate the broad scope of our formulation through case studies using three common FL datasets: MNIST, CIFAR-10, and EMNIST [11], [12]. To demonstrate the applicability of FLACC to industrial IoT data, we formulate a case study to classify mixed faults in rotating machinery. We provide quantitative comparisons with state-of-the-art personalized and clustered FL methods along with qualitative client clusters identified by FLACC for different data distributions. Our results show that FLACC outperforms vanilla FL and state-of-the-art methods even when they use optimal hyperparameters, and correctly identifies client clusters for all data distributions. Additionally, we perform over 100 experiments to highlight FLACC's robustness to hyperparameter selection.

A. FL and Non-IID Data
FL was proposed with the federated averaging algorithm (FedAvg), which was envisioned to work with non-IID distributions, unlike previous distributed training approaches [5]. However, subsequent studies revealed the limitations of FL in the presence of diverging client distributions [9]. To counter these limitations, three main lines of research are prevalent. The first line of research focuses on improving the single global model learned by FedAvg to accommodate client-level statistical heterogeneity. Examples include sharing a small subset of data across clients [13], adding a regularization term to the local optimization to restrict divergence between global and local models [14], and optimizing for a mixture of client distributions [15]. The second line of research is pluralistic and learns individually personalized models for each client. This approach is termed personalized federated learning and can be achieved via multitask learning [16], model regularization [17], contextualization [18], local fine-tuning [19], and meta-learning [20]. The third line of research is clustered FL, which assumes that groups of clients share one data distribution. In this case, a personalized model is learned for each group rather than each individual client. Since our framework falls in this category, we review the related work for clustered FL in more detail.

B. Client Clustering in FL
The clustered FL setting allows for groups of clients to share one data distribution, thereby allowing fewer underlying distributions than clients. The objective is then to train one model for each distribution. To solve the clustered FL problem, a distance-based hierarchical clustering algorithm (FL+HC) is formulated in [21], which clusters clients using agglomerative clustering after an arbitrary number of FedAvg rounds. A multicenter aggregation mechanism is presented in [22] using stochastic expectation maximization (FeSEM), which optimally matches clients to a predetermined number of centers. A client grouping algorithm (FedGroup), developed in [23], performs clustering by decomposing the client gradients via singular value decomposition. A training-loss-based iterative federated clustering algorithm (IFCA) is presented in [24], where each client is greedily assigned to the cluster that yields the lowest loss on its local data. A similar hypothesis-based clustering algorithm (HypCluster) is proposed in [25], for which generalization guarantees are provided. The clustered federated learning algorithm (CFL) developed in [26] recursively splits clients into optimal bipartitions based on the cosine similarities between their gradient updates and then checks for heterogeneity in the bipartitions using gradient norms. More recently, soft clustering methods such as FedEM [27] and FedSoft [28] have been developed, which allow client data to follow a mixture of distributions.

C. Research Gap and Our Contributions
Almost all existing clustered FL algorithms, such as IFCA, HypCluster, FeSEM, and FedGroup, require the number of clusters to be known a priori. FL+HC does not require the number of clusters directly, but requires an arbitrary cutoff distance that governs the number of clusters. The performance of these algorithms is predominantly contingent upon the accuracy of the prespecified number of clusters. In real-world applications, little to no information about each client's data is available to the server. Moreover, clients can be clustered not only using their raw data but also using more nuanced features identified during training. Thus, it is challenging to accurately determine the number of clusters a priori. Among existing methods, only CFL does not require the number of clusters to be prespecified. Rather, it automatically identifies k clusters after k − 1 optimal bipartitions. However, to achieve this, CFL requires all clients to participate in each training round, which is not scalable in a realistic federation with hundreds or potentially thousands of participating clients.
The proposed FLACC framework bypasses these limitations using a greedy agglomerative strategy. Our algorithm computes the cosine similarity between the selected clients' updates and greedily merges the most similar clients or groups of clients after each communication round. We formulate stopping criteria for this merging based on inter- and intra-cluster similarities to ensure optimal client clustering. We also introduce the novel idea of system memory in our framework to deal with client fractions, which is potentially useful in other branches of FL research.
Our main contributions are summarized as follows.
1) We propose a framework for clustered FL that does not require the number of clusters a priori but rather automatically identifies optimal clusters. This improves utility in realistic industrial applications, where an underlying cluster structure is rarely known.
2) Our framework seamlessly incorporates client fractions during training.
3) Our framework is empirically seen to require minimal hyperparameter tuning, i.e., clustering performance is robust to the choice of hyperparameters.
4) We demonstrate the effectiveness of our framework, both quantitatively and qualitatively, using extensive experiments on three benchmark FL datasets and an FL case study simulated using industrial mixed fault classification data.

A. Preliminaries
In FL, a central server and a set of $m$ clients form a data federation wherein the $i$th client has a private dataset $\{(x_i^l, y_i^l)\}_{l=1}^{n_i}$ of $n_i$ data points. Each client has a local supervised learning model parameterized by $\theta_i \in \Theta$. The empirical risk associated with client $i$'s finite data can be written as

$$F_i(\theta) = \frac{1}{n_i} \sum_{l=1}^{n_i} \ell(\theta; x_i^l, y_i^l) \qquad (1)$$

where $\ell(\theta; x, y) : \Theta \to \mathbb{R}_{\geq 0}$ is the loss function associated with a point $(x, y)$. In the FL setting, we assume that there exists $\theta^* \in \Theta$ which can minimize the empirical risk on all $m$ clients simultaneously and thus solve the minimization

$$\min_{\theta \in \Theta} \sum_{i=1}^{m} \frac{n_i}{n} F_i(\theta) \qquad (2)$$

where $n = \sum_{i=1}^{m} n_i$ is the total number of data points across all clients.
This minimization can be solved using the FedAvg algorithm, which involves several communication rounds between the clients and the central server [5]. In communication round $t$, the clients first download the most recent server model $\theta^{(t)}$. Each client then trains this model on its local data for a few epochs. The model updates are sent back to the server, which averages them to obtain the global server state $\theta^{(t+1)}$.
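As an illustration, one FedAvg communication round can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation; `local_train` is a placeholder for whatever local solver each client runs, and clients are weighted by their share of the data:

```python
import numpy as np

def fedavg_round(server_weights, client_datasets, local_train, n_points):
    """One FedAvg communication round: broadcast, local training, weighted average."""
    updates = []
    for data in client_datasets:
        # each client starts from the current server model and trains locally
        local_w = local_train(server_weights.copy(), data)
        updates.append(local_w)
    total = sum(n_points)
    # aggregate: weight each client's model by its share of the data
    return sum((n / total) * w for n, w in zip(n_points, updates))
```

In a real deployment, only the sampled subset of clients would contribute updates in a given round.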
When clients train locally for one epoch, FedAvg can equivalently perform the global update by averaging either the gradient updates or the weight updates sent by the clients:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \sum_{i} \frac{n_i}{n} \nabla_\theta F_i(\theta^{(t)}) = \sum_{i} \frac{n_i}{n} \theta_i^{(t+1)}$$

where $\nabla_\theta F_i(\theta^{(t)})$ is the gradient of client $i$'s empirical risk computed at $\theta^{(t)}$, $\theta_i^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta F_i(\theta^{(t)})$ is client $i$'s local weight update, and $\eta$ is the learning rate. For more than one local epoch, gradient and weight averaging are no longer equivalent, and FedAvg uses weight averaging to allow more client-level computation and enhance communication efficiency. However, gradient averaging can still be used with methods like unbiased gradient aggregation and controllable meta-updating [29].
Finally, FedAvg generally uses only a small fraction $c \cdot m$ of the total clients in each communication round, where the client fraction $c \in (0, 1]$.

B. Clustered FL Problem
In the clustered FL setting, we consider a data federation with $m$ clients that can be partitioned into $s$ disjoint clusters $C_1, \ldots, C_s$ based on $s$ underlying data distributions $\mathcal{D}_1, \ldots, \mathcal{D}_s$, such that $2 \le s \le m$ and $\bigcup_{k=1}^{s} C_k = [m]$ with $C_k \cap C_{k'} = \emptyset$ for $k \neq k'$, where $[m]$ is the set of integers $\{1, 2, \ldots, m\}$. A client $i \in C_k$ has all its data points $(x_i^l, y_i^l) \sim \mathcal{D}_k$. The true risk for cluster $C_k$ is the expected loss of the data following $\mathcal{D}_k$:

$$F^k(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}_k} \left[ \ell(\theta; x, y) \right]. \qquad (3)$$

Our goal in clustered FL is to minimize the risk of every cluster. In this setting, it is easy to see that the FedAvg assumption of a single optimum $\theta^*$ will not hold, and a single model will be suboptimal for most individual clusters. Thus, we wish to find $s$ distinct optimal solutions $\{\theta^{k,*}\}_{k=1}^{s}$ such that

$$\theta^{k,*} = \arg\min_{\theta \in \Theta} F^k(\theta), \quad k = 1, \ldots, s. \qquad (4)$$

Since each cluster has clients with identically distributed data, and assuming that for any client $i \in C_k$ the empirical risk $F_i(\theta)$ approximates the true cluster risk $F^k(\theta)$ arbitrarily well, the above minimization can be solved optimally by first assigning clients to their respective clusters and then minimizing each cluster risk separately using FedAvg. Stated differently, we want to find an optimal $s$-partitioning among $m$ clients, which is the many-one-onto map

$$P_s : [m] \to [s] \qquad (5)$$

such that $\forall\, i, j$, we have $P_s(i) = P_s(j)$ iff $\mathcal{D}_i = \mathcal{D}_j$. We want to compute the optimal $s$-partitioning in (5) when $s$ is also unknown, i.e., the number of clusters is not supplied to the system as a hyperparameter. This is a more realistic setting for FL as very little information about client distributions is available in practice.
Consider the FedAvg iterate $\theta^{(t)}$ at communication round $t$. Assuming that the gradient updates from all clients are available at round $t$, we can define a similarity measure between two clients $i, j$ using the cosine similarity between their gradient updates:

$$\alpha_{i,j}^{(t)} = \frac{\left\langle \nabla_\theta F_i(\theta^{(t)}),\, \nabla_\theta F_j(\theta^{(t)}) \right\rangle}{\left\| \nabla_\theta F_i(\theta^{(t)}) \right\| \left\| \nabla_\theta F_j(\theta^{(t)}) \right\|}. \qquad (6)$$

Cosine similarity computes the directional similarity between gradients and is not affected by gradient magnitudes. In FL, gradient magnitudes between clients may have disproportionate differences and adversely affect measures utilizing L1 or L2 distances [30]. As is conjectured in most clustered FL frameworks, gradient updates from clients having the same underlying distribution have higher directional similarity relative to clients having dissimilar distributions, despite the batch gradients being noisy approximations of the true gradient, thereby resulting in higher values of $\alpha_{i,j}$ [21], [22], [23], [26].
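For concreteness, the cosine similarity between two clients' updates can be computed by flattening their per-layer update tensors into a single vector each. This is a minimal sketch; the function name and the list-of-arrays input format are illustrative, not from the paper:

```python
import numpy as np

def cosine_similarity(update_i, update_j):
    """Directional similarity between two clients' gradient (or weight) updates."""
    # flatten every layer's update tensor and concatenate into one vector per client
    u = np.concatenate([p.ravel() for p in update_i])
    v = np.concatenate([p.ravel() for p in update_j])
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Note that scaling either update by a positive constant leaves the similarity unchanged, which is exactly the magnitude invariance the text describes.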
The optimal $s$-partitioning problem at round $t$ can then be formulated using the cosine similarities between client updates. The optimal number of clusters and the corresponding clustering structure can be obtained by minimizing the average maximum similarity between clients belonging to different clusters:

$$(s^*, C^*) = \arg\min_{s,\, \{C\}_s} \frac{2}{s(s-1)} \sum_{k=1}^{s} \sum_{k'=k+1}^{s} \max_{i \in C_k,\, j \in C_{k'}} \alpha_{i,j}^{(t)} \qquad (7)$$
where $\{C\}_s$ represents one of many possible ways to create the clustering map $P_s$. For a fixed number of clusters $s$, the set $\{C\}_s$ can be generated in $\left\{ {m \atop s} \right\}$ (a Stirling number of the second kind) different ways, each of which satisfies the partition constraints $\bigcup_{k=1}^{s} C_k = [m]$ and $C_k \cap C_{k'} = \emptyset$ for $k \neq k'$.

Remark 1: A brute force solution search for (7) scans all possible numbers of partitions $s \in \{2, \ldots, m\}$ and all $\left\{ {m \atop s} \right\}$ clusterings for each $s$, with each evaluation computing $\max \alpha_{i,j}$ over $s(s-1)/2$ tuples. This is intractable when the number of clients $m$ is large.
Remark 2: Equation (7) yields meaningful results only when the cosine similarity $\alpha_{i,j}^{(t)}$ between all client pairs is available at round $t$, which is not true in practice.
Remark 3: On fixing $s^* = 2$ and $\theta^{(t)} = \theta^*$, we obtain

$$(C_1^*, C_2^*) = \arg\min_{C_1 \cup C_2 = [m]} \max_{i \in C_1,\, j \in C_2} \alpha_{i,j} \qquad (8)$$

which is exactly the optimal bipartitioning used in CFL [26] and can be solved in $O(m^3)$. Thus, the bipartitioning in CFL can be interpreted as a special case of (7).

A. Intuition
A brute force search for $s^*$ and $C^*$ in (7) is intractable, and since only a small fraction of clients participates in any round, the cosine similarity for a majority of client pairs is not available after each round. To accommodate this, rather than minimizing the maximum similarity between clients in different clusters, we propose to greedily maximize the minimum similarity to group similar clients into the same cluster. We illustrate this simple agglomerative strategy below.
Consider the gradient updates of seven clients from three underlying distributions as shown in Fig. 1. At step 0, all clients are treated as individual "entities." In step 1, the minimum similarity across all entities is maximized. In this case, the two most similar clients, viz. client 1 and client 2, are clustered together to form a new "subserver." For all future comparisons, this newly formed subserver is treated as one entity. Similarly, in steps 2 and 3, two new subservers are formed with clients 6, 7 and clients 4, 5, respectively, by maximizing the minimum inter-entity similarities. In step 4, the only remaining client 3 is to be added to one of the three subservers. Client 3 has the least similarity to clients 1, 7, and 5 in subservers 1, 2, and 3, respectively. We thus compute $\arg\max\{\alpha_{1,3}, \alpha_{7,3}, \alpha_{5,3}\}$ and add client 3 to subserver 1. The three subservers formed thus represent the optimal clustering solution, viz. $s^* = 3$ and $C^* = \{(1, 2, 3), (4, 5), (6, 7)\}$.
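The greedy max-min agglomeration illustrated above can be sketched as follows, assuming a full pairwise similarity matrix is available (a simplification; FLACC updates similarities incrementally across rounds and applies stopping criteria). The inter-entity similarity is the minimum cosine similarity over all cross pairs, and the best pair is merged at each step:

```python
import numpy as np

def greedy_agglomerate(sim, n_steps):
    """Greedily merge the entity pair whose minimum cross-entity similarity is largest."""
    entities = [[i] for i in range(sim.shape[0])]  # start: singleton entities
    for _ in range(n_steps):
        best, best_pair = -np.inf, None
        for a in range(len(entities)):
            for b in range(a + 1, len(entities)):
                # inter-entity similarity = minimum similarity over all cross pairs
                m = min(sim[i, j] for i in entities[a] for j in entities[b])
                if m > best:
                    best, best_pair = m, (a, b)
        a, b = best_pair
        entities[a] = entities[a] + entities[b]  # merge into one "subserver"
        del entities[b]
    return [sorted(e) for e in entities]
```

On a toy similarity matrix with three well-separated groups, four merge steps recover the three clusters, mirroring the seven-client example above.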
This idea of successively maximizing the minimum similarity across entities is central to the FLACC framework. Since only a limited number of entities are greedily agglomerated at each step, clients can be clustered "as they come" rather than in a one-shot manner. This notion implicitly handles client fractions and allows clients to be grouped from the first communication round rather than after FedAvg converges to $\theta^*$ (which involves a nontrivial communication cost).

B. Algorithm
The FLACC algorithm follows the FedAvg procedure with an additional computation at the server after each communication round, wherein similar clients or similar groups of clients are indexed to the same clusters. Once all clients have been assigned to clusters, or no new clusters are formed for a few FedAvg rounds, each cluster is separated into its own independent federation.
Let $\theta^{(t)}$ indicate the FedAvg iterate at the beginning of round $t$ and $M^{(t)}$ indicate the random subset of clients chosen at round $t$. Let $\alpha^{(t)} \in [-1, 1]^{m \times m}$ be the similarity matrix whose $(i, j)$th element is the cosine similarity between the gradient updates of client $i$ and client $j$. Let $E^{(t)}$ be the entity set whose elements are either individual clients or subservers of clients that have been agglomerated together. Both $\alpha^{(t)}$ and $E^{(t)}$ contain information from the past communication rounds $0, \ldots, t-1$, and are updated after every communication round.
The algorithm begins with all clients as their own singleton entities. Thus, $E^{(0)}$ contains all $m$ clients and $\alpha^{(0)} = -\mathbf{1}_{m \times m}$. In round $t$, after the gradient updates from all clients in $M^{(t)}$ are sent back to the server, the similarity matrix is first updated for all pairs of clients in $M^{(t)}$. We then identify the pair of entities for which the minimum inter-entity similarity is maximized:

$$(E_1, E_2) = \arg\max_{E, E' \in E^{(t)},\, E \neq E'}\; \min_{i \in E,\, j \in E'} \alpha^{(t)}_{i,j}. \qquad (9)$$

The two entities are combined together to update the entity set

$$E^{(t+1)} = \big(E^{(t)} \setminus \{E_1, E_2\}\big) \cup \{E_1 \cup E_2\}. \qquad (10)$$

Based on the identities of $E_1$ and $E_2$, three types of combinations are possible: client with client, client with subserver, and subserver with subserver. Two conditions govern whether $E_1$ and $E_2$ are to be combined.

In the first condition, we restrict the combination of two subservers to when their cross-subserver similarity is higher than their within-subserver similarity. We quantify this intuitive notion by defining the minimum intra-subserver similarities

$$\alpha^{\mathrm{intra}}_{\min}(E_1) = \min_{i, j \in E_1,\, i \neq j} \alpha^{(t)}_{i,j} \qquad (11)$$

and

$$\alpha^{\mathrm{intra}}_{\min}(E_2) = \min_{i, j \in E_2,\, i \neq j} \alpha^{(t)}_{i,j} \qquad (12)$$

where $\Phi(E_1, E_2)$, the margin by which the minimum cross-subserver similarity exceeds the minimum within-subserver similarity, is termed the subserver combination potential. We only combine subservers $E_1$ and $E_2$ if

$$\Phi(E_1, E_2) > 0. \qquad (14)$$

In the second condition, we place a lower bound on the minimum cross-entity similarity required between the entities $E_1$ and $E_2$ for them to be combined. Consider the value of $\alpha^{(t)}_{i,j}$ that solves (9), i.e., the minimum cross-entity similarity

$$\alpha^{\mathrm{cross}}_{\min} = \min_{i \in E_1,\, j \in E_2} \alpha^{(t)}_{i,j}. \qquad (15)$$

Then, we only combine $E_1$ and $E_2$ if

$$\alpha^{\mathrm{cross}}_{\min} > \alpha_0 \qquad (17)$$

where $\alpha_0 \in (-1, 1]$ represents the minimum allowable cross-entity cosine similarity for combination and can be set as a hyperparameter depending on the federation. In our experiments, we show that, realistically, when no prior information about the federation is available, setting $\alpha_0 = 0$ is sufficient. The condition restricts the gradient updates of the combining entities to the "more-similar" half-plane, not allowing entities with conflicting gradients to merge. Equation (17) acts as a contingency condition preventing undesirable entity combinations in rare cases where more similar entities are not available for merging. This notion has been used in the multitask learning literature to mitigate conflicting gradients [31], [32], [33] and in the FL literature to improve fairness [30]. This geometric interpretability of $\alpha^{\mathrm{cross}}_{\min} > \alpha_0 = 0$ is another advantage of using cosine similarity: formulating a stopping criterion with other distance-based similarity measures would not be equally interpretable and would require careful tuning of an arbitrary constant. A visual representation of the merging conditions for all entity pairs is illustrated in Fig. 2.
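A hypothetical sketch of the two merging guards follows. The exact form of the combination potential Φ is assumed here to be the margin by which the minimum cross-entity similarity exceeds the minimum within-subserver similarity; the paper's precise definition may differ:

```python
import numpy as np

def can_merge(sim, e1, e2, alpha_0=0.0):
    """Decide whether entities e1, e2 (lists of client ids) may merge."""
    # lower-bound condition (17): minimum cross-entity similarity must exceed alpha_0
    cross_min = min(sim[i, j] for i in e1 for j in e2)
    if cross_min <= alpha_0:
        return False
    # subserver-subserver case, condition (14): combination potential must be positive;
    # Phi is ASSUMED here to be cross_min minus the minimum within-subserver similarity
    if len(e1) > 1 and len(e2) > 1:
        intra_min = min(min(sim[i, j] for i in e for j in e if i != j) for e in (e1, e2))
        return cross_min - intra_min > 0
    return True
```

Singleton merges are gated only by the α₀ bound, while subserver merges additionally require the cross similarity to beat the weakest within-subserver link.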
At each communication round, the cosine similarities are computed between the selected client pairs, which takes $O(|M^{(t)}|^2)$ time. This is followed by entity merging based on (9) and (10), subject to (17) when merging any two entities and (14) when merging two subservers. In practice, separate "entity matrices" can be utilized to store the maximum and minimum similarities between entity pairs, which are updated after each round in $O(1)$ time. Using these matrices, (9) can be computed in $O(|E^{(t)}|^2)$ time. Overall, at each communication round, the FLACC server performs $O(|M^{(t)}|^2 + |E^{(t)}|^2)$ additional computation over FedAvg.
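One possible bookkeeping scheme for the minimum-similarity entity matrix (this scheme is illustrative, not taken from the paper): when two entities merge, the merged entity's minimum similarity to every other entity is the elementwise minimum of the two old rows, so each merge is a cheap row/column update rather than a rescan of all client pairs:

```python
import numpy as np

def merge_entity_rows(ent_min, a, b):
    """Fold entity b into entity a in the pairwise minimum-similarity matrix."""
    merged = np.minimum(ent_min[a], ent_min[b])  # new min similarity to every entity
    ent_min[a, :] = merged
    ent_min[:, a] = merged
    # remove entity b's row and column
    return np.delete(np.delete(ent_min, b, axis=0), b, axis=1)
```

A symmetric maximum-similarity matrix can be maintained the same way with `np.maximum`.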
Once there is no entity merging for a predetermined number of consecutive rounds ($N_{\mathrm{sep}}$), each remaining entity is separated into its own federation. FL then proceeds independently for each cluster for the remaining communication rounds.

C. Practical Considerations
In this section, we discuss some practical aspects of implementing the FLACC algorithm.

When all clients participate in local training, the cosine similarity between all client pairs is updated in every round. This is not the case when only a small fraction of clients is selected in each round. Consider the two clients shown in Fig. 3 belonging to different underlying distributions, which are selected for local training in $M^{(0)}$, i.e., the first communication round. $\alpha^{(0)}_{i,j}$ for these clients is updated at round 0 and leaks through all future rounds until the two clients are selected together again. At a future round $t$, the true similarity between the two clients is significantly smaller than the value of $\alpha^{(t)}_{i,j}$, which may lead to incorrect agglomeration. To mitigate this, we introduce the idea of a system "memory" for updating the cosine similarity matrix. Let $T$ be the memory of the system and $\tau$ be the latest round in which clients $i, j$ appeared in training together. At round $t \geq \tau$,

$$\alpha^{(t)}_{i,j} = \begin{cases} \alpha^{(\tau)}_{i,j}, & t - \tau \leq T \\ -1, & t - \tau > T. \end{cases}$$

Thus, the similarity matrix "remembers" only those client pairs that were selected for training in the last $T$ communication rounds, thereby preventing incorrect clustering due to similarity values leaked from older rounds. We account for this numerically in the algorithm by computing (9), (11), (12), and (15) subject to $\alpha^{(t)}_{i,j} > -1$.

FLACC computes the similarity measure between clients at each communication round using the accumulated model gradient, which is a stochastic variable. It is commonly conjectured that the accumulated gradients in distributed SGD approximate the full-batch gradient of the objective function [34]. For a sufficiently locally smooth loss function and low learning rate, a weight update over one epoch can be shown to approximate the direction of the true gradient [26]. Moreover, using weight updates for computing similarity is empirically shown to have a better signal-to-noise ratio in other clustered FL frameworks [21], [26]. We thus use weight updates $\Delta\theta^{(t)}$ instead of gradient updates $\nabla_\theta F(\theta^{(t)})$ to compute the cosine similarities between clients. Correspondingly, we use model averaging at the server instead of gradient averaging, which also better aligns with the original formulation of the FedAvg algorithm [5]. In Section VII, we show an empirical comparison between using weight updates and gradients for different data distributions.

When the number of clients is very large, the agglomeration process can be expedited in practice. Rather than merging only one entity pair after each communication round, the best $1 < n_{\mathrm{merge}} < |M^{(t)}|/2$ pairs can be merged greedily. For federations with no information about the underlying clustering structure, using $n_{\mathrm{merge}} = 1$ is recommended.
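The memory mechanism amounts to resetting any pairwise similarity that has not been refreshed within the last T rounds back to the sentinel value −1. A minimal sketch, assuming a matrix `last_seen` (an illustrative name) records the most recent round in which each client pair trained together:

```python
import numpy as np

def apply_memory(alpha, last_seen, t, T):
    """Forget stale similarities: pairs not co-selected in the last T rounds reset to -1."""
    stale = (t - last_seen) > T
    alpha = alpha.copy()
    alpha[stale] = -1.0  # -1 marks "no valid similarity available"
    return alpha
```

Entries equal to −1 are then simply excluded when computing the merging extrema, matching the α > −1 guard described above.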
Finally, note that an unusually small client fraction, especially in the initial communication rounds, might hamper the overall clustering results despite (14) and (17). For $s$ estimated underlying clusters in the federation, it is recommended to have at least $|M^{(t)}| = s$ clients participating in each round, with $|M^{(t)}| \geq 2s$ to ensure ideal qualitative clustering.
The FLACC algorithm is formally presented in Algorithm 1.
Rotated MNIST: The MNIST digit recognition dataset contains 60 000 training examples of 28 × 28 grayscale images and 10 classes. We first randomly assign between [200, 800] data points to m = 100 clients in an IID manner. We then apply rotations of 0°, 90°, 180°, and 270° to 10, 20, 30, and 40 clients, respectively, resulting in four underlying distributions. This federation has unbalanced data within each client and an unbalanced number of clients from each underlying distribution. Moreover, this represents a concept drift where the conditional distribution P(x|y) diverges between clients, i.e., the same label corresponds to different features in different clients.
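A sketch of the rotated-MNIST client split described above (the function and parameter names are illustrative, not from the paper; `counts` gives the number of clients per rotation group and per-client sizes are drawn uniformly from the given range):

```python
import numpy as np

rng = np.random.default_rng(0)

def rotated_mnist_split(images, labels, m, lo, hi, rotations, counts):
    """IID split into m clients, then rotate each group's images (concept drift)."""
    sizes = rng.integers(lo, hi + 1, size=m)  # unbalanced data per client
    idx = rng.permutation(len(images))
    angle = np.repeat(rotations, counts)      # e.g. 10/20/30/40 clients per angle
    clients, start = [], 0
    for c in range(m):
        take = idx[start:start + sizes[c]]
        # rotate this client's images in multiples of 90 degrees
        x = np.rot90(images[take], k=angle[c] // 90, axes=(1, 2))
        clients.append((x, labels[take]))
        start += sizes[c]
    return clients
```

The grouped and label-swapped CIFAR-10 splits below follow the same pattern, with rotation replaced by label filtering or label swapping.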
Grouped CIFAR-10: The CIFAR-10 dataset contains 50 000 training examples of 32 × 32 × 3 images and 10 classes. We separate m = 100 clients uniformly into five groups such that clients from each group contain training examples with two randomly assigned labels only. For instance, all clients from group 1 have labels {0, 4} only, clients from group 2 have labels {2, 3} only, and so on. Each client is randomly assigned between [300, 600] data points. This represents a label distribution skew, where the marginal distribution P(y) varies between clients, i.e., data labels are not evenly distributed.
Label-swapped CIFAR-10: We first randomly assign between [300, 600] data points to m = 100 clients in an IID manner. We then separate the 100 clients uniformly into five groups. For clients in each group, two labels are randomly swapped. For instance, all clients from group 1 have labels 0 and 5 swapped, clients from group 2 have labels 6 and 8 swapped, and so on. This represents a concept shift where the conditional distribution P(y|x) diverges between clients, i.e., different labels are assigned to the same features in different clients.
Grouped + Rotated EMNIST: The EMNIST dataset extends MNIST to include lowercase and uppercase letters along with digits [11]. Of these, we use the 52 letter classes (26 lowercase + 26 uppercase). We first assign between [200, 800] data points to m = 100 clients such that 50 clients contain lowercase letters only and 50 clients contain uppercase letters only. From both groups, we randomly select 15 clients and rotate their images by 90°, resulting in a total of four underlying distributions. This federation has concept drift along with label distribution skew.
FEMNIST: The FEMNIST dataset is a federated version of EMNIST, where the data are prepartitioned based on the authors of the characters [12]. A random subset of m = 195 clients (authors) is selected from this dataset, where each client has between [40, 525] data points. Although there are no fixed underlying distributions, there might be an ambiguous cluster structure due to a feature distribution skew (e.g., authors with similar writing styles may be clustered). The goal here is to examine whether FLACC, by identifying hidden clusters, outperforms vanilla FL even when the data distribution among clients is very similar.

B. Experimental Setup
For each data distribution, five independent runs are performed using FLACC. Each independent run comprises the following steps: First, the data are randomly split into clients as described in the previous section. Then, each client's data are randomly partitioned into an 85%-15% train-test split. Finally, FLACC is run using a random selection of clients at each communication round. Such a shuffled selection of clients helps evaluate the robustness of the clustering algorithm. For each client, the test accuracy is evaluated on its local test data, and the average accuracy over all clients is stored for each independent run. In the results section, we report the mean and standard deviation of the test accuracies over the five independent runs.
For all distributions, we use a simple CNN with two 5 × 5 convolutional layers with 32 and 64 channels, respectively, and ReLU activations. Each convolutional layer is followed by a 2 × 2 max pooling layer. The output from the convolutions is passed through a fully connected layer with 512 units and ReLU activations. The network outputs a softmax classification over ten classes for MNIST and CIFAR-10, and 62 classes for EMNIST. SGD is used as the local solver with learning rates η = 0.1, 0.075, and 0.05 for MNIST, CIFAR-10, and EMNIST, respectively.
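The flattened feature size feeding the 512-unit fully connected layer can be checked with simple shape arithmetic, assuming valid (no-padding) convolutions on a 28 × 28 MNIST input; the padding choice is an assumption, as it is not stated in the text:

```python
def conv_out(n, k, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * pad - k) // stride + 1

h = conv_out(28, 5)           # 5x5 conv, 32 channels: 28 -> 24
h = conv_out(h, 2, stride=2)  # 2x2 max pool:          24 -> 12
h = conv_out(h, 5)            # 5x5 conv, 64 channels: 12 -> 8
h = conv_out(h, 2, stride=2)  # 2x2 max pool:           8 -> 4
flat = 64 * h * h             # 64 channels x 4 x 4 = 1024 inputs to the FC layer
```

For the 32 × 32 CIFAR-10 inputs the same arithmetic yields a different flattened size, so the fully connected layer's input dimension is dataset dependent.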
Unless otherwise stated, the following parameters are used throughout all experiments: a client fraction of c = 0.2, i.e., 20 clients out of 100 (39 out of 195 for FEMNIST) are randomly chosen for training in each round. Local models are trained for E = 5 epochs with a batch size of B = 32. n_merge = 2 entities are merged in each round and a system memory of T = 10 rounds is used. Each server is trained for N_server = 100 rounds (N_server = 200 for FEMNIST) with N_sep = 10 nonmerging rounds before entity separation.

C. Results
The average test accuracy and qualitative clustering results for one independent run are illustrated in Fig. 4. In the entity merging stage, FLACC keeps track of client clusters by indexing clients and subservers at each communication round. In this stage, the test accuracies for FLACC and FedAvg are identical. Once there is no entity merging for N_sep consecutive rounds, the entities identified by FLACC are separated into their own federations. The single optimum θ* learned by FedAvg is suboptimal for all clients due to the mutually competing objectives of the underlying distributions. On separation, each cluster learns the optimal θ^{k,*} for its own underlying distribution. This causes a distinct and immediate jump in the client test accuracy, as seen at approximately round 60 in all four cases in Fig. 4 (top).
FLACC correctly separates clients into their respective clusters in each case despite unbalanced data among clients and an unbalanced number of clients in different clusters. The clustering, as seen in Fig. 4 (bottom), is found to be exact every time for all five independent runs. Fig. 5 shows the test accuracy comparison for the FEMNIST case. FLACC separates the 195 clients into 11 or 12 clusters depending on the initialization. There is a small yet distinct increase in test accuracy after separation at round 140, similar to Fig. 4. This result shows that FLACC outperforms FedAvg by identifying an underlying cluster structure even when client separation is not obvious.
A numerical comparison of test accuracy over five independent runs is shown in Table I. FLACC is compared to NoFed, FedAvg, two state-of-the-art personalized FL methods (pFedMe [17] and Per-FedAvg [20]), and three state-of-the-art clustered FL methods (IFCA [24], FL+HC [21], and CFL [26]). For NoFed, each client learns only from its local data, and model averaging is not performed. For IFCA, the number of clusters (k), and for FL+HC, the cutoff distance (d) at which hierarchical clustering stops, must be prespecified. Test accuracy for both methods is reported at the optimal value of the respective hyperparameter, along with two other values (one higher and one lower). CFL does not require the number of clusters but requires all 100 clients to participate in each round. The regularization λ and global model update parameter β for pFedMe, and the step size α for Per-FedAvg, are carefully tuned, and the results with the best parameters are reported. All other hyperparameters are kept consistent across methods.
Table I shows that vanilla FL (FedAvg) is clearly unsuitable for most cases due to divergent client distributions. Moreover, FLACC matches or outperforms the other methods even when those methods use optimal hyperparameters. The performance of IFCA (and FL+HC) is suboptimal when the number of clusters (and cutoff distance) is specified incorrectly. CFL sometimes matches the performance of FLACC; however, each CFL run incurs five times the communication cost of FLACC. Both personalized FL models report better performance than FedAvg but considerably worse performance than FLACC, as they constrain personalization by limiting deviations from a single global model. The datasets analyzed have multiple (sometimes mutually exclusive) latent contexts across local data distributions, which the personalized FL models may not capture. This has also been shown to degrade overall performance due to negative information transfer in personalized models [35]. Finally, in the FEMNIST case, where the underlying distributions are not obvious, FedAvg outperforms the other baselines due to the absence of an apparent clustering structure for them to exploit.

VI. MIXED FAULT CLASSIFICATION
In this section, we examine the effectiveness of FLACC using a simulated case study on an industrial machine fault dataset. This dataset is collected on a specialized rotating machinery fault simulator equipped with an electric motor, two rotor disks (A and B), and a shaft supported on two bearings. Lateral vibration signals are collected using an accelerometer placed on the shaft. Controlled simulations are performed for various fault conditions associated with the rotors and the bearings. To simulate a realistic scenario, data for combinations of rotor and bearing faults, i.e., mixed faults, are collected.
In particular, 4 rotor and 3 bearing conditions are simulated, resulting in a total of 12 machine health conditions. Rotor conditions are simulated by attaching additional standard weights to the rotor disks A and B. The four rotor conditions are: one weight on A and one weight diametrically opposite on B (A1B1O); one weight on A and one weight on B, aligned (A1B1A); two weights on A, diametrically opposite (A2O); and three adjacent weights on A (A3A). The three bearing conditions are: slight ball wear, severe ball wear, and outer race damage. The labels for the 12 conditions are shown in Table II.
Each fault label in Table II has 1920 data points (signals), resulting in 23 040 data points in total. We form a federation with m = 48 clients separated into six groups with eight clients per group. Clients from groups 1 to 6 have fault labels (0, 1), (2, 3), (4, 8), (5, 6), (7, 11), and (9, 10), respectively. Each client is randomly assigned between 200 and 600 signals from its respective group labels. This client grouping emulates a realistic federation, where no client has all fault types. Moreover, there is feature overlap between clients from different groups. For instance, clients from groups 1 and 2 both have slight ball wear and can only be distinguished based on their rotor faults. Similarly, clients from each group have a feature overlap with clients from (at least) four other groups. This makes the mixed fault classification task using FL very challenging.
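The federation construction described above can be sketched in a few lines; a minimal illustration (the helper name is hypothetical), where each of the 48 clients belongs to one of the six groups and draws a random signal count in [200, 600]:

```python
import random

# Six groups of eight clients; each group holds two of the twelve
# fault labels from Table II, so no client sees every fault type.
GROUP_LABELS = [(0, 1), (2, 3), (4, 8), (5, 6), (7, 11), (9, 10)]

def build_federation(seed=0):
    """Assign each of the 48 clients a group and a random signal count."""
    rng = random.Random(seed)
    clients = []
    for group, labels in enumerate(GROUP_LABELS):
        for _ in range(8):  # eight clients per group
            clients.append({
                "group": group,
                "labels": labels,
                "n_signals": rng.randint(200, 600),
            })
    return clients

clients = build_federation()
```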
We construct a 1-D CNN having three 1-D convolutional layers with ReLU activations, each followed by a max pooling layer. The convolutional layers have 64, 128, and 256 filters, respectively, each of size 5. The output of the convolutions is passed through two fully connected layers (128 and 64 units, respectively) with ReLU activations. The output layer is a softmax classification over the 12 fault classes. Adam is used as the optimizer.

Fig. 6 shows the test accuracy comparison between FedAvg and FLACC, along with the clusters identified by FLACC. Vanilla FL is not well suited to the realistic client grouping in this case, as shown by the FedAvg accuracy, which stagnates at ∼20%. In contrast, FLACC identifies and clusters similar entities based on their fault signals and separates them into independent federations, leading to a 4.5× increase in test accuracy, to ∼90%. Unlike the qualitative results in Fig. 4, the clustering in this case is not exact. While the number of identified clusters is always correct, two or four clients are incorrectly clustered in each trial, depending on the random initialization. However, the realistic data distribution in this case is also more challenging than in the previous studies.
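The layer dimensions of the CNN described above can be sanity-checked with a quick shape trace; a sketch assuming an input signal length of 1024, stride-1 "valid" convolutions, and non-overlapping pooling of size 2 (none of these values are stated in the text):

```python
# Trace feature-map sizes through the 1-D CNN described above.
# Assumptions (not stated in the text): input length 1024, stride-1
# "valid" convolutions, pool size 2.
def conv1d_len(n, kernel):   # valid convolution, stride 1
    return n - kernel + 1

def pool1d_len(n, pool=2):   # non-overlapping max pooling
    return n // pool

length, channels = 1024, 1
for filters in (64, 128, 256):  # three conv + pool stages, kernel size 5
    length = pool1d_len(conv1d_len(length, kernel=5))
    channels = filters

flat = length * channels  # input size of the first fully connected layer
# FC stack: flat -> 128 -> 64 -> 12 (softmax over fault classes)
```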
Table III shows the numerical comparison of test accuracy between FLACC and the other state-of-the-art methods over five independent runs. FLACC outperforms the other methods even when they use optimal hyperparameters. Overall, this case study demonstrates the usefulness of FLACC in complex real-world industrial applications.

A. Effect of Hyperparameters
In this section, we analyze the effect of the two main hyperparameters of the FLACC framework, namely the minimum allowable cross-entity cosine similarity for merging two entities (α_0) and the system memory (T), and empirically demonstrate that FLACC requires minimal hyperparameter tuning.
1) Effect of α_0: Experiments for rotated MNIST, grouped CIFAR-10, label-swapped CIFAR-10, and grouped+rotated EMNIST are run using three values of α_0, namely +0.1, 0, and −0.1, keeping all other parameters identical to Section V-B. The test accuracy and the number of clusters identified by FLACC in five independent runs are shown in Table IV.
When α_0 = 0, FLACC identifies the underlying clusters perfectly in all five independent runs for all four datasets. When α_0 = +0.1, FLACC identifies more than the actual underlying number of clusters, whereas when α_0 = −0.1, fewer clusters are identified. This is expected, as higher values of α_0 place stricter similarity requirements, resulting in fewer merges and more clusters, while lower values of α_0 place looser similarity requirements, resulting in more merges and fewer clusters. This behavior is illustrated using representative clusters identified for Grouped + Rotated EMNIST in Fig. 7.

TABLE III TEST ACCURACY COMPARISON BETWEEN FLACC AND STATE-OF-THE-ART METHODS FOR MIXED FAULT CLASSIFICATION
Contrary to the number of clusters, the test accuracy is affected in one direction only. For α_0 = +0.1, despite more clusters being identified, the clients in each cluster are still similar. Thus, on separation, each group learns a good model for its underlying data, and the test accuracy is consistent with the α_0 = 0 case. However, for α_0 = −0.1, clusters contain dissimilar clients, which deteriorates performance.
We emphasize that α_0 = 0 yields consistently accurate results across all four data distributions. While α_0 does implicitly control the number of clusters identified by FLACC, unlike in existing methods, it need not be tuned differently for different data distributions.
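The role of α_0 can be made concrete with a toy merge step; a minimal sketch (not the full FLACC algorithm; function names are hypothetical) that returns the most similar entity pair only if its cosine similarity clears the threshold:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two update vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_merge(updates, alpha_0=0.0):
    """Return the most similar entity pair, or None if no pair clears
    the minimum allowable cross-entity similarity alpha_0."""
    best, best_sim = None, alpha_0
    keys = list(updates)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            sim = cosine(updates[a], updates[b])
            if sim >= best_sim:
                best, best_sim = (a, b), sim
    return best

# Two entities pulling in the same direction and one pulling opposite.
updates = {
    "E1": np.array([1.0, 0.1]),
    "E2": np.array([0.9, 0.2]),
    "E3": np.array([-1.0, 0.0]),
}
```

Raising alpha_0 makes the merge test stricter (fewer merges, more clusters); lowering it makes the test looser (more merges, fewer clusters), matching the trend in Table IV.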
2) Effect of T: Keeping all other parameters unchanged, experiments are run for different system memories T chosen across a broad range. When T = 2, the system has very short-term memory, with access to the cosine similarities from only the last two communication rounds. When T = ∞, the system has access to the similarities from all previous rounds, which is equivalent to not using a memory formulation. The results for five independent runs are shown in Table V.
The test accuracy and number of clusters identified by FLACC remain optimal and nearly consistent across a broad range of T. However, when T = ∞, the number of clusters is no longer optimal, as cosine similarities computed in early communication rounds cause incorrect merges in later rounds. While the test accuracy with T = ∞ is not severely affected, the variability in the number of identified clusters may lead to incorrect inferences regarding the true underlying clustering structure.
These experiments show that the use of a system memory T ensures consistent and optimal results across different data distributions; however, careful tuning of T is not necessary to ensure such results.
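The system memory can be sketched as a fixed-length window of per-round similarity matrices; a minimal illustration (the class name is hypothetical) in which merge decisions would use the mean over the last T rounds:

```python
from collections import deque

import numpy as np

class SimilarityMemory:
    """Keep cosine similarities from the last T rounds only
    (deque maxlen=None mimics T = infinity)."""

    def __init__(self, T=10):
        self.history = deque(maxlen=T)

    def record(self, sim_matrix):
        """Store one round's entity-by-entity similarity matrix."""
        self.history.append(np.asarray(sim_matrix, dtype=float))

    def averaged(self):
        """Similarity used for merge decisions: mean over the window."""
        return np.mean(self.history, axis=0)

mem = SimilarityMemory(T=2)
for round_sim in ([[1.0, 0.9], [0.9, 1.0]],
                  [[1.0, 0.1], [0.1, 1.0]],
                  [[1.0, 0.3], [0.3, 1.0]]):
    mem.record(round_sim)
# Only the last two rounds survive; the stale first-round similarity
# of 0.9 no longer influences the merge decision.
```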

B. Weight Updates Versus Gradients
In this section, we compare the performance of FLACC when the cosine similarity is computed from the gradients of the individual client models against the results obtained in Section V using weight updates. The test accuracy and number of clusters obtained are shown in Table VI. Using gradients instead of weight updates does not hamper the clustering performance for Rotated MNIST and Grouped CIFAR-10; however, it leads to occasional incorrect clustering for Label-swapped CIFAR-10 and Grouped+Rotated EMNIST.
Both the gradients and the weight updates are stochastic in nature; thus, the similarity measure built upon them is noisy. The signal required from these noisy measures is a meaningful value of α_min^cross, which solves (9) to yield the entities to merge after each round. Fig. 8 plots the values of α_min^cross when using weight updates and gradients for the four data distributions. α_min^cross decreases with communication rounds, which is expected owing to the greedy nature of the algorithm. While the values of α_min^cross are relatively consistent between weight updates and gradients, the variability of the gradient-based cosine similarity is higher across all four distributions. This higher noise leads to the occasional incorrect clustering for the Label-swapped CIFAR-10 and Grouped+Rotated EMNIST distributions.
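The difference between the two signals can be illustrated on a toy two-parameter model: with plain SGD, the weight update is the (scaled) sum of the minibatch gradients over the local epochs, so it averages out noise that a single gradient retains. A minimal sketch (names hypothetical, not the paper's code):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def local_sgd(theta, grads, eta=0.1):
    """Run plain SGD over a list of minibatch gradients and return
    (total weight update, first-step gradient)."""
    theta = np.asarray(theta, dtype=float).copy()
    start = theta.copy()
    for g in grads:
        theta -= eta * np.asarray(g, dtype=float)
    return theta - start, np.asarray(grads[0], dtype=float)

# Two clients whose minibatch gradients agree on average, but whose
# individual minibatch gradients are noisy.
upd_a, grad_a = local_sgd([0.0, 0.0], [[1.0, 0.0], [1.0, 0.2]])
upd_b, grad_b = local_sgd([0.0, 0.0], [[0.0, 1.0], [2.0, -0.6]])

sim_updates = cosine(upd_a, upd_b)  # smoother, averaged signal
sim_grads = cosine(grad_a, grad_b)  # single-step, noisier signal
```

Here the weight-update similarity correctly reports the clients as near-identical, while the single-gradient similarity is dominated by minibatch noise, consistent with the higher variability seen in Fig. 8.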

C. Limitations and Future Work
The FLACC formulation is inherently stochastic and relies on extracting useful signals from noisy SGD updates. While we show compelling empirical results, a theoretical analysis of the interplay between client selection, SGD updates, hyperparameters, and the corresponding clustering results would be insightful. Sharing gradient or weight updates directly with the central server compromises the privacy of individual clients; in practice, privacy mechanisms are routinely used with FL to protect individual clients' data. An important next step is exploring the utility of FLACC with privacy mechanisms such as differential privacy and gradient encryption. We study the clustered FL problem assuming that clients come from disjoint generative distributions; a more general formulation for a mixture of generative distributions is an interesting direction for future research.

VIII. CONCLUSION
The performance of FL deteriorates when client data distributions are non-IID. This issue is commonly encountered in numerous industrial applications, where client data are inherently heterogeneous, critically limiting the successful deployment of FL. This article develops FLACC, a novel framework for clustering similar clients in FL. FLACC agglomerates clients or groups of clients while learning the global FL model and then separates each cluster to be trained independently. Case studies on widely used FL datasets show that FLACC outperforms FedAvg and state-of-the-art clustered FL methods. Moreover, using a mixed fault diagnosis dataset, we demonstrate the efficacy of FLACC in more complex and realistic industrial federations. Our framework is readily applicable to industrial settings because it is autonomous (it automatically identifies the optimal clustering structure) and scalable (it incorporates client fractions during training).

Fig. 2. Visual representation of different merging conditions for all entity pairs.

Fig. 3. Schematic illustrating the need for a system memory in FLACC.

Fig. 4. Average test accuracy comparison between FedAvg and FLACC (top) and client clusters identified by FLACC (bottom) for 100 clients in (a) rotated MNIST, (b) grouped CIFAR-10, (c) label-swapped CIFAR-10, and (d) grouped+rotated EMNIST. Each circle represents a client; circle color indicates the underlying distribution of the client's data, and circle size represents the number of data points.

Fig. 5. Test accuracy comparison for the FEMNIST case.

Fig. 6. Average test accuracy comparison between FedAvg and FLACC (left) and client clusters identified by FLACC (right) for mixed fault classification.
Merging rules for a selected entity pair (E_1, E_2):
1) If both E_1 and E_2 are clients, a new subserver is created containing E_1 and E_2.
2) If E_1 is a client and E_2 is a subserver (or vice versa), client E_1 is added to subserver E_2.
3) If both E_1 and E_2 are subservers, a new subserver is created containing all the clients of both E_1 and E_2.

Algorithm inputs: number of clients m, each parameterized by {θ_i}_1^m; client fraction c; local learning rate η; local epochs E; batch size B; memory T; number of entities to merge per round n_merge; number of rounds before separation N_sep; server rounds N_server; server initialization θ^(0).

TABLE IV TEST ACCURACY AND NUMBER OF CLUSTERS FOR DIFFERENT MINIMUM ALLOWABLE CROSS-ENTITY COSINE SIMILARITIES (α_0)

TABLE V TEST ACCURACY AND NUMBER OF CLUSTERS FOR DIFFERENT SYSTEM MEMORIES (T)

Fig. 7. Representative clusters for Grouped + Rotated EMNIST using different minimum allowable cross-entity cosine similarities.

TABLE VI TEST ACCURACY AND NUMBER OF CLUSTERS USING WEIGHT UPDATES VERSUS GRADIENTS FOR COSINE SIMILARITY