Self-organizing Democratized Learning: Towards Large-scale Distributed Learning Systems

Emerging cross-device artificial intelligence (AI) applications require a transition from conventional centralized learning systems towards large-scale distributed AI systems that can collaboratively perform complex learning tasks. In this regard, democratized learning (Dem-AI) lays out a holistic philosophy with underlying principles for building large-scale distributed and democratized machine learning systems. The outlined principles are meant to study a generalization in distributed learning systems that goes beyond existing mechanisms such as federated learning. Moreover, such learning systems rely on hierarchical self-organization of well-connected distributed learning agents who have limited and highly personalized data and can evolve and regulate themselves based on the underlying duality of specialized and generalized processes. Inspired by Dem-AI philosophy, a novel distributed learning approach is proposed in this paper. The approach consists of a self-organizing hierarchical structuring mechanism based on agglomerative clustering, hierarchical generalization, and corresponding learning mechanism. Subsequently, hierarchical generalized learning problems in recursive forms are formulated and shown to be approximately solved using the solutions of distributed personalized learning problems and hierarchical update mechanisms. To that end, a distributed learning algorithm, namely DemLearn is proposed. Extensive experiments on benchmark MNIST, Fashion-MNIST, FE-MNIST, and CIFAR-10 datasets show that the proposed algorithms demonstrate better results in the generalization performance of learning models in agents compared to the conventional FL algorithms. The detailed analysis provides useful observations to further handle both the generalization and specialization performance of the learning models in Dem-AI systems.


I. INTRODUCTION
Nowadays, AI has grown to be successful in solving complex real-life problems such as decision support in healthcare systems, advanced control in automation systems, robotics, and telecommunications. Numerous existing mobile applications incorporate AI modules that leverage users' data for personalized services, such as the Gboard mobile keyboard on Android, the QuickType keyboard, and the vocal classifier for Siri on iOS [1]. By exploiting the unique features and personalized characteristics of users, these applications not only improve the personal experience of the users but also help them better control their devices. Moreover, the rising concern over data privacy in existing machine learning frameworks has fueled a growing interest in developing distributed machine learning paradigms such as federated learning (FL) frameworks [1]-[11]. FL was first introduced in [2], where the learning agents coordinate via a central server to train a global learning model in a distributed manner. These agents receive the global learning model from the central server and perform local learning based on their available datasets. Then, they send the updated learning models back to the server, which updates the global model via an aggregation operation without revealing the private training data to the others.
In practice, the private dataset collected at each agent is unbalanced, highly personalized for some applications such as handwriting and voice recognition, and exhibits non-i.i.d. (non-independent and non-identically distributed) characteristics. Therefore, the iterative process of updating the global model improves the generalization of the model, but also hurts the personalized performance at the agents [1]. Hence, existing FL algorithms cannot efficiently handle the underlying cohesive relation between generalization and personalization (or specialization) abilities of the trained learning model [1]. To the best of our knowledge, the work in [9] was the first attempt to study and improve the personalized performance of FL using a personalized federated averaging (Per-FedAvg) algorithm based on a meta-learning framework (MLF). Furthermore, in a recent work [10], the authors propose an adaptive personalized FL framework where a mixture of the local and global model was adopted to reduce the generalization error. However, similar to [9], the cohesive relation between generalization and personalization was not adequately analyzed. Recently, [12] proposed the pFedMe algorithm by studying a bi-level optimization problem consisting of a global problem and personalized problems.

Fig. 1: Analogy of a hierarchical distributed learning system.
To better analyze the personalized and generalized learning performance of learning models in the FL framework, the Dem-AI philosophy discussed in [13] introduces a holistic approach and general guidelines to develop distributed and democratized learning systems. The approach refers to observations about the generalization and specialization capabilities of biological intelligence, and the hierarchical structure of society and swarm intelligence in large-scale distributed learning systems. Fig. 1 illustrates the analogy between the Dem-AI system and the hierarchical structure in an organization. The specialists from different knowledge domains are grouped into teams to deliver common products or targets. These groups in an organization need to collaborate towards the common goals under the supervision of a board of directors. Similarly, learning agents in different groups perform collaborative learning for group models. The outputs of these groups in a Dem-AI system are the specialized learning models that are created by group members. In this paper, inspired by Dem-AI guidelines, we develop a novel distributed learning framework that can directly extend the conventional FL scheme for collectively solving a common learning task at learning agents. Different from existing FL algorithms that build a single generalized model (a.k.a. global model), we maintain self-organizing hierarchical group models. Accordingly, we adopt agglomerative hierarchical clustering [14] and periodically update the hierarchical structure based on the similarity in the learning characteristics of users. In particular, we propose the hierarchical generalization and learning problems for each generalized level in a recursive form. To solve the formulated problem, which is complex due to its recursive structure, we develop a distributed learning algorithm, DemLearn. The proposed algorithm uses a bottom-up scheme to iteratively perform local learning by solving personalized learning problems and to hierarchically update the generalized models for groups at higher levels. With extensive experiments, we validate both the specialization and generalization performance of all learning models on the benchmark MNIST, Fashion-MNIST, Federated Extended MNIST (FE-MNIST), and CIFAR-10 datasets.
To that end, we discuss the preliminaries of democratized learning in Section II. Based on the Dem-AI guidelines, we formulate hierarchical generalized and personalized learning problems, and propose a novel distributed learning algorithm in Section III. We validate the efficacy of our proposed algorithm for both the specialization and generalization performance of the client, group, and global models compared to the conventional FL algorithms in Section IV. Finally, Section V concludes the paper.

II. DEMOCRATIZED LEARNING: PRELIMINARIES
Different from FL, the Dem-AI framework [13] introduces a self-organizing hierarchical structure for solving common single/multiple complex learning tasks by mediating contributions from a large number of learning agents in collaborative learning. Moreover, it unlocks the following features of democracy in future distributed learning systems. According to the differences in their characteristics, learning agents form appropriate groups that can be specialized for similar agents to deal with the learning tasks. These specialized groups are self-organized in a hierarchical structure and collectively construct the shared generalized learning knowledge to improve their learning performance by reducing individual biases due to the unbalanced, highly personalized local data. In particular, the learning system allows new group members to: a) speed up their learning process with the existing group knowledge, and b) incorporate their new learning knowledge in expanding the generalization capability of the whole group. In Dem-AI systems, learning agents are free to join any of the appropriate groups and exhibit equal power in the construction of their groups' generalized learning model. Here, the power of each group can be represented by the number of its members, which varies over the training time. We give a brief summary of Dem-AI concepts and principles [13] in the following discussion.
Definition and goal: Democratized Learning (Dem-AI in short) studies dual (coupled and working together) specialized-generalized processes in a self-organizing hierarchical structure of large-scale distributed learning systems. The specialized and generalized processes must operate jointly towards an ultimate learning goal identified as performing collective learning from biased learning agents, who are committed to learning from their own data using their limited learning capabilities. As such, the ultimate learning goal of the Dem-AI system is to establish a mechanism for collectively solving common (single or multiple) complex learning tasks from a large number of learning agents.
Specialized Process: This process is used to leverage specialized learning capabilities at the learning agents and specialized groups by exploiting their collected data. By incorporating the generalized knowledge of higher-level groups created by the generalization mechanism, the learning agents can update their model parameters so as to reduce biases in their personalized learning. Thus, the personalized learning objective has two goals: 1) to perform specialized learning, and 2) to reuse the available hierarchical generalized knowledge.
Generalized Process: The generalization mechanism encourages group members to share knowledge when performing learning tasks with similar characteristics and to construct hierarchical levels of generalized knowledge. The hierarchical generalized knowledge helps the Dem-AI system maintain the generalization ability for reducing biases of learning agents and efficiently dealing with environment changes or performing new learning tasks.
Self-organizing Hierarchical Structure: The hierarchical structure of specialized groups and the relevant generalized knowledge are constructed and regulated following a self-organization principle based on the similarity of learning agents. In particular, this principle governs the union of small groups to form a bigger group that eventually enhances the generalization capabilities of all members. Thus, specialized groups at higher levels in the hierarchical structure have more members and can construct more generalized (less biased) knowledge, which supports faster adaptation to new environments [15].
Transition in the dual specialized-generalized process: The specialized process becomes increasingly important compared to the generalized process during the training time. As a result, the learning system evolves to gain specialization capabilities from the learned tasks but also loses the capabilities to deal with environmental changes such as new learning agents and new learning tasks. Meanwhile, the hierarchical structure of the Dem-AI system is self-organized and evolves from a high level of plasticity to a high level of stability, i.e., from unstable specialized groups to well-organized specialized groups. The transition of the Dem-AI learning system is illustrated in Fig. 2 with three iterative sub-mechanisms, namely the generalization, specialized learning, and hierarchical structuring mechanisms. Accordingly, the transition of the dual specialized-generalized process represents the steps in a typical democratized learning framework [13]. In that transition, the learning agents are grouped according to the similarities of their learning tasks at the early stage. Then, the generalized process helps in the construction of hierarchical generalized knowledge for the specialized groups from the bottom up and encourages the group members to be close together. In the meantime, the specialized learning processes leverage personalized learning to exploit their biased datasets by incorporating higher-level generalized group knowledge from top-level to lower-level groups. In doing so, the group members deviate from the common generalized knowledge. After that, the hierarchical structure is updated according to the new learning models.
In the next section, we develop a democratized learning design that results in a hierarchical generalized learning problem. To that end, we propose a novel democratized learning algorithm, DemLearn, as an initial implementation of the Dem-AI philosophy.

III. DEMOCRATIZED LEARNING DESIGN
The Dem-AI philosophy and guidelines in [13] envision different designs for a variety of applications and learning tasks. In this work, we focus on developing a novel distributed learning algorithm that consists of hierarchical clustering, hierarchical generalization, and learning mechanisms with a common learning task for all learning agents.

A. Hierarchical Clustering Mechanism
To construct the hierarchical structure of the Dem-AI system with relevant specialized learning groups, we adopt the commonly used agglomerative hierarchical clustering algorithm (i.e., dendrogram implementation from scikit-learn [14], [16]), based on the similarity or dissimilarity of all learning agents. The dendrogram method is used to examine the similarity relationships among individuals and is often used for cluster analysis in many fields of research. During implementation, the dendrogram tree topology is built up by merging the pairs of agents or clusters having the smallest distance between them, following the bottom-up scheme. Accordingly, the measured distance captures the differences in the characteristics of learning agents (e.g., local model parameters or gradients of the learning objective function). Since we obtain similar performance when clustering is based on model parameters or on gradients, in what follows, we only present a clustering mechanism using the local model parameters. Additional discussion of gradient-based clustering is provided in the supplementary material.
Given the local model parameters $w_n = (w_{n,1}, \ldots, w_{n,M})$ of learning agent n, where M is the number of learning parameters, the measured distance $\phi_{n,l}$ between two agents n and l is derived based on the Euclidean distance, i.e., $\phi_{n,l} = \|w_n - w_l\|$. In addition, we consider the average-linkage method [17] for distance calculation between an agent and a cluster, using the Euclidean distance between the model parameters of the agent and the average model parameters of the cluster members. Accordingly, the hierarchical tree structure is in the form of a binary tree with many levels. Consequently, maintaining a large number of low-level generalized models for small groups would be inefficient and would incur unnecessarily high storage and computational costs. As a result, we keep only the top K levels in the tree structure and discard the lower-level structure. Therefore, at the top level K, the system could have two big groups that have a large number of learning agents.
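The merging procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the scikit-learn dendrogram implementation used in the paper: the function names `mean_vector`, `euclidean`, and `agglomerate` are hypothetical, and the distance between two clusters is assumed to be the Euclidean distance between the averages of their members' parameters, as the text describes for an agent and a cluster.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a list of equal-length parameter vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(params, n_groups):
    """Merge the two closest clusters until n_groups (>= 1) clusters remain.

    params: list of per-agent parameter vectors.
    Returns the clusters as sorted lists of agent indices.
    """
    clusters = [[i] for i in range(len(params))]  # start with one cluster per agent
    while len(clusters) > n_groups:
        # Assumption: a cluster is represented by the average of its members.
        centers = [mean_vector([params[m] for m in c]) for c in clusters]
        _, a, b = min(
            (euclidean(centers[a], centers[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy example with five agents and one parameter each.
toy_params = [[0.0], [1.0], [10.0], [12.0], [30.0]]
level_groups = {n: agglomerate(toy_params, n) for n in (3, 2)}
```

Stopping the merges at a small number of clusters plays the role of keeping only the top levels of the tree: in the toy example, cutting at two clusters gives the two groups directly below the root, and cutting at three gives the next level down.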

B. Hierarchical Generalization and Learning Mechanism
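The K-level hierarchical structure emerges via agglomerative clustering. Accordingly, the system constructs K levels of generalization, as in Fig. 2. As such, we propose hierarchical generalized learning problems (HGLP) to build these generalized models for specialized groups in a recursive form, starting from the construction of the global model $w^{(K)}$ at the top level K as follows:

HGLP problem at level K:
$$\min_{w^{(K)},\, W^{(K-1)}} \; \sum_{i \in \mathcal{S}^{(K)}} \frac{N_{g,i}^{(K-1)}}{N_{g}^{(K)}} \, L_i^{(K-1)}\big(w_i^{(K-1)} \mid D_i^{(K-1)}\big) \qquad (1)$$
$$\text{s.t.} \quad w^{(K)} = w_i^{(K-1)}, \quad \forall i \in \mathcal{S}^{(K)}, \qquad (2)$$

where $\mathcal{S}^{(K)}$ denotes the set of subgroups at level K − 1, $W^{(K-1)} = (w_i^{(K-1)})_{i \in \mathcal{S}^{(K)}}$ collects their learning models, and $L_i^{(K-1)}$ is the loss function of subgroup i given its collective dataset $D_i^{(K-1)}$. The objective function is weighted by the ratio of the number of learning agents $N_{g,i}^{(K-1)}$ of subgroup i to the total number of learning agents $N_{g}^{(K)}$ in the system. Hence, the subgroups which have more learning agents have a higher impact on the generalized model at level K. The hard constraints in (2) force these subgroups to share a common learning model (i.e., a global variable $w^{(K)}$). To preserve the specialization capabilities of each subgroup, the constraints (2) can be relaxed by using additional proximal terms in the objective. In this way, the problem encourages the subgroup learning models to become close to the global model but not necessarily equal. Thus, the relaxed problem HGLP' is defined as follows:

HGLP' problem at level K:
$$\min_{w^{(K)},\, W^{(K-1)}} \; \sum_{i \in \mathcal{S}^{(K)}} \frac{N_{g,i}^{(K-1)}}{N_{g}^{(K)}} \Big( L_i^{(K-1)}\big(w_i^{(K-1)} \mid D_i^{(K-1)}\big) + \frac{\mu_K}{2} \big\| w_i^{(K-1)} - w^{(K)} \big\|^2 \Big), \qquad (3)$$

where $\mu_K$ denotes the trade-off between the learning loss and the generalization constraint that keeps the group learning models close to the global model $w^{(K)}$. Since the dataset is distributed and only available at the learning agents, the problem (3) at the top level K can be solved starting from its members' problems first. Accordingly, the hierarchical generalized structure emerges naturally following the bottom-up scheme, where the learning models at lower levels are updated before solving the higher-level generalized problem of their upper-group. Specifically, the problem (3) can be decentralized and solved through the following problem of each subgroup i at level K − 1:

HGLP problem for each group i at level K − 1:
$$\min_{w_i^{(K-1)}} \; L_i^{(K-1)}\big(w_i^{(K-1)} \mid D_i^{(K-1)}\big) + \frac{\mu_K}{2} \big\| w_i^{(K-1)} - w^{(K)} \big\|^2 .$$

Therefore, we make a general approximation form of the generalized learning problem for group i at level k, given the prior higher generalized model $w^{(k+1)}$, as follows:

HGLP problem for each group i at level k:
$$\min_{w_i^{(k)},\, W_i^{(k-1)}} \; \sum_{j \in \mathcal{S}_i^{(k)}} \frac{N_{g,j}^{(k-1)}}{N_{g,i}^{(k)}} \Big( L_j^{(k-1)}\big(w_j^{(k-1)} \mid D_j^{(k-1)}\big) + \frac{\mu_k}{2} \big\| w_j^{(k-1)} - w_i^{(k)} \big\|^2 \Big) + \frac{\mu_{k+1}}{2} \big\| w_i^{(k)} - w^{(k+1)} \big\|^2 , \qquad (4)$$

where $\mathcal{S}_i^{(k)}$ is the set of members of group i (subgroups at level k − 1) with models $W_i^{(k-1)}$, $N_{g,i}^{(k)}$ is the number of learning agents of group i, and $w^{(k+1)}$ is the learning model of the upper-group at level k + 1 to which group i belongs. Since there exists coupling between the upper and lower levels, and the training dataset is decentralized, the learning problem (4) of group i at level k cannot be solved directly. Therein, similar to FL, the learning loss of the group can be distributed amongst the group members [2]. As a result, the objective of the group problem retains only the proximal terms, which force the learning models at different levels to be close to each other. Therefore, the learning model is constructed from the model of the upper-group at level k + 1 and the models of the group members at level k − 1 by solving the following problem:

$$\min_{w_i^{(k)}} \; \frac{\mu_k}{2} \sum_{j \in \mathcal{S}_i^{(k)}} \frac{N_{g,j}^{(k-1)}}{N_{g,i}^{(k)}} \big\| w_j^{(k-1)} - w_i^{(k)} \big\|^2 + \frac{\mu_{k+1}}{2} \big\| w_i^{(k)} - w^{(k+1)} \big\|^2 . \qquad (5)$$

The closed form of the optimal solution of the problem (5) can be readily derived by setting the gradient to zero and using $\sum_{j \in \mathcal{S}_i^{(k)}} N_{g,j}^{(k-1)} = N_{g,i}^{(k)}$ as follows:

$$w_i^{(k)} = \frac{\mu_{k+1}\, w^{(k+1)} + \mu_k \sum_{j \in \mathcal{S}_i^{(k)}} \frac{N_{g,j}^{(k-1)}}{N_{g,i}^{(k)}}\, w_j^{(k-1)}}{\mu_k + \mu_{k+1}} . \qquad (6)$$

Thus, the learning model of group i can be updated as

$$w_i^{(k)} = \alpha\, w^{(k+1)} + (1 - \alpha) \sum_{j \in \mathcal{S}_i^{(k)}} \frac{N_{g,j}^{(k-1)}}{N_{g,i}^{(k)}}\, w_j^{(k-1)} , \qquad (7)$$

where $\alpha = \mu_{k+1} / (\mu_k + \mu_{k+1})$. The trade-off parameter α can be tuned later in the experiments to control the contribution from the learning models of the upper-group and the group members.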
Given the closed-form solution (6), the coupling between different levels can be approximated by splitting the model updates at each level via a bottom-top and then a top-bottom scheme. Fig. 3 illustrates the proposed hierarchical updates. In particular, the lower groups and members are updated first, before the upper-groups. Then, the updated upper-groups broadcast the parameters to their group members to finish one update cycle.
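The two-pass update can be sketched on a small tree as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class and the function names are hypothetical, the bottom-top pass is assumed to be the size-weighted average of member models, and the top-bottom pass is assumed to blend each group model with its upper-group model using the weight α = µ_{k+1}/(µ_k + µ_{k+1}).

```python
class Node:
    """A learning agent (leaf, with a model) or a group (with children)."""

    def __init__(self, model=None, children=None):
        self.model = list(model) if model is not None else None
        self.children = children or []

    @property
    def size(self):
        # Number of learning agents below this node (1 for an agent).
        if not self.children:
            return 1
        return sum(child.size for child in self.children)


def bottom_up(node):
    """Set every group model to the size-weighted average of its members."""
    if not node.children:
        return node.model
    weighted = [(child.size, bottom_up(child)) for child in node.children]
    total = node.size
    dim = len(weighted[0][1])
    node.model = [sum(n * m[d] for n, m in weighted) / total for d in range(dim)]
    return node.model


def top_down(node, alpha):
    """Blend each subgroup model with its upper-group model (weight alpha)."""
    for child in node.children:
        if child.children:  # only group models are blended in this sketch
            child.model = [
                (1 - alpha) * wc + alpha * wp
                for wc, wp in zip(child.model, node.model)
            ]
            top_down(child, alpha)


# Toy hierarchy: two groups below the root, three agents in total.
group_1 = Node(children=[Node([0.0]), Node([2.0])])
group_2 = Node(children=[Node([10.0])])
root = Node(children=[group_1, group_2])

bottom_up(root)          # group_1 -> 1.0, group_2 -> 10.0, root -> 4.0
top_down(root, alpha=0.5)  # group_1 -> 2.5, group_2 -> 7.0
```

With a larger α the group models move closer to the root model, which matches the role of α as the weight on the upper-group contribution.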
At the lowest level, each learning agent n performs the local training process to fit its private data by solving the personalized learning problem (PLP) using the latest hierarchical generalized models as follows:

PLP at level 0:
$$w_n^{(0)} = \arg\min_{w \in \mathcal{W}} \; L_n^{(0)}\big(w \mid D_n^{(0)}\big) + \frac{\mu}{2} \sum_{k=1}^{K} \frac{1}{N_{g,n}^{(k)}} \big\| w - w_n^{(k)} \big\|^2 , \qquad (8)$$

where $L_n^{(0)}$ is the personalized learning loss function for the learning task (e.g., cross entropy loss for classification [18]) given its personalized dataset $D_n^{(0)}$, $w_n^{(k)}$ is the model of the level-k group to which agent n belongs, and $N_{g,n}^{(k)}$ is the number of learning agents of that group. By solving the PLP problem, the learning agent updates its personalized model $w_n^{(0)}$, which belongs to the parameterized deep learning model set $\mathcal{W}$. At this personalized level, the number of group members is 1.

Algorithm 1 Democratized Learning (DemLearn)
1: Input: K, T, τ.
2: for t = 0, . . ., T − 1 do
3:   for learning agent n = 1, . . ., N do
4:     Agent n uses the upper-group model $w_{n,t}^{(1)}$ as the initial learning model;
5:     Agent n iteratively updates the personalized learning model $w_{n,t+1}^{(0)}$ as an in-exact minimizer (i.e., gradient based) of the PLP problem (8);
6:     Agent n sends updated learning model to the server;
7:   end for
8:   if (t mod τ = 0) then
9:     Server reconstructs the hierarchical structure by the clustering algorithm;
10:  end if
11:  Hierarchical Update: Each group i at each level k performs an update for its learning model from bottom to top for updating the contribution of group members:
       $$w_{i,t+1}^{(k)} = \sum_{j \in \mathcal{S}_i^{(k)}} \frac{N_{g,j}^{(k-1)}}{N_{g,i}^{(k)}}\, w_{j,t+1}^{(k-1)} . \qquad (9)$$
     After the top-level model is updated, the lower levels start updating from top to bottom for the contribution of the upper-group as follows:
       $$w_{i,t+1}^{(k)} = (1 - \alpha)\, w_{i,t+1}^{(k)} + \alpha\, w_{t+1}^{(k+1)} . \qquad (10)$$
     The updated learning models at level 1 (i.e., $w_{t+1}^{(1)}$) are then broadcast to all agents to update their local models following equation (10).
12: end for
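The in-exact, gradient-based solution of the personalized problem can be illustrated with a single update rule. The sketch below is an assumption-laden illustration rather than the released implementation: `plp_gradient_step` is a hypothetical name, the proximal penalty toward the level-k group model is assumed to be weighted by µ divided by that group's size, and the loss gradient is supplied by the caller.

```python
def plp_gradient_step(w, loss_grad, group_models, group_sizes, mu, lr):
    """One gradient step on  loss(w) + (mu / 2) * sum_k ||w - w_k||^2 / N_k.

    w:            current personalized parameters (list of floats)
    loss_grad:    function returning the gradient of the local loss at w
    group_models: models w_k of the groups containing this agent (levels 1..K)
    group_sizes:  number of agents N_k in each of those groups
    """
    grad = loss_grad(w)
    updated = []
    for d, w_d in enumerate(w):
        # Gradient of the proximal terms for coordinate d.
        prox = sum(
            (w_d - w_k[d]) / n_k for w_k, n_k in zip(group_models, group_sizes)
        )
        updated.append(w_d - lr * (grad[d] + mu * prox))
    return updated


# Toy check with the quadratic loss 0.5 * (w - 3)^2 and one group model at 1.0
# shared by two agents: the stationary point solves (w - 3) + 0.5 * (w - 1) = 0.
def toy_grad(w):
    return [w[0] - 3.0]


w_toy = [0.0]
for _ in range(200):
    w_toy = plp_gradient_step(w_toy, toy_grad, [[1.0]], [2], mu=1.0, lr=0.1)
```

In the toy check the iterate settles at 7/3, between the local optimum 3 and the group model 1; a larger µ or a smaller group would pull it closer to the group model.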

C. Democratized Learning Algorithm
Inspired by the FedAvg [2] and FedProx [19] algorithms, we adopt the aforementioned recursive analysis and hierarchical clustering mechanism to develop a novel democratized learning algorithm, namely DemLearn. The details of our proposed algorithm are presented in Alg. 1. Each agent n uses the upper-group model at level 1 (i.e., $w_{n,t}^{(1)}$) as the initial learning model. Thereafter, the agent iteratively solves the PLP problem in equation (8) based on the gradient method. The updated client model is sent to the central server, which performs hierarchical clustering and updates the generalized models from level 1 to level K. After every τ global rounds, the hierarchical structure is reconstructed according to the changes in the personalized learning models of agents. The generalized learning models of groups are updated in bottom-top and then top-bottom fashion, following equations (9) and (10), respectively, as the approximation of the closed-form solution (6). This allows the lower-level subgroups to contribute their knowledge for updating the group model. In return, they receive (and incorporate) the better generalized knowledge from the upper-groups, which enhances the generalization capacity of their local learning models. Additionally, we introduce an amplification trick in the bottom-top update for the first 5 rounds to speed up the initial stage of the learning process. Accordingly, the update from group members (i.e., $w_{i,t+1}^{(k)}$ in equation (9)) is multiplied by a scaling constant of 1.15 in the first 5 rounds.
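The two scheduling rules in this description, periodic reclustering and early-round amplification, are simple enough to state as code. The following is a small sketch with hypothetical helper names; it assumes rounds are counted from t = 0, so "the first 5 rounds" are t = 0, ..., 4.

```python
def should_recluster(t, tau):
    """True when the hierarchical structure is rebuilt in global round t."""
    return t % tau == 0


def amplification(t, warmup_rounds=5, gain=1.15):
    """Scaling applied to the bottom-top group update in round t."""
    return gain if t < warmup_rounds else 1.0


def amplified_update(member_average, t):
    """Scale the averaged member model during the warm-up rounds."""
    scale = amplification(t)
    return [scale * value for value in member_average]


# Round-by-round schedule for tau = 2 over the first eight rounds.
schedule = [(t, should_recluster(t, 2), amplification(t)) for t in range(8)]
```

With τ = 1, as used in the experiments, `should_recluster` is true in every round.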

IV. EXPERIMENTS

A. Setting
In this section, we validate the efficacy of the DemLearn algorithm with the MNIST [20], Fashion-MNIST [21], Federated Extended MNIST [22], and CIFAR-10 [23] datasets for handwritten digit, fashion image, and object recognition. We conduct the experiments with 50 clients, where the median number of data samples per client is 64, 70, and 785 for the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively. Different from these three datasets, we … Using these datasets, 20% of the data samples on each client are used for evaluating the model testing performance. We divide the total dataset such that each client has a small amount of data from two specific labels amongst the overall ten in both datasets.
In doing so, we replicate a scenario of biased personal datasets, i.e., only highly unbalanced data and a small number of training samples can be collected at agents. The learning models consist of two convolution layers followed by two pooling and two fully connected layers, whereas three convolution layers are used for the CIFAR-10 dataset. We set the update period τ = 1, and validate the performance of the proposed algorithm with K = 4 generalized levels. Our implementation is developed based on the available code of FedProx in [19]. The implementation and datasets are available at https://github.com/nhatminh/Dem-AI.
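The biased data split described above can be mimicked with a short partitioning routine. This is a rough sketch and not the script from the repository: the function name and arguments are hypothetical, every client is assumed to draw the same number of samples, and clients sample independently, so the same index may appear at more than one client.

```python
import random


def two_label_partition(labels, n_clients, samples_per_client,
                        n_classes=10, test_frac=0.2, seed=0):
    """Give each client samples from two labels and hold out a test share.

    labels: list of integer class labels for the whole dataset.
    Returns one dict per client with its label pair and train/test indices.
    """
    rng = random.Random(seed)
    by_label = {c: [i for i, y in enumerate(labels) if y == c]
                for c in range(n_classes)}
    clients = []
    for _ in range(n_clients):
        pair = rng.sample(range(n_classes), 2)  # two distinct labels
        pool = by_label[pair[0]] + by_label[pair[1]]
        idx = rng.sample(pool, min(samples_per_client, len(pool)))
        n_test = int(round(test_frac * len(idx)))  # 20% held out by default
        clients.append({
            "labels": sorted(pair),
            "test": idx[:n_test],
            "train": idx[n_test:],
        })
    return clients


# Toy dataset: 1000 samples with labels cycling through the ten classes.
toy_labels = [i % 10 for i in range(1000)]
toy_clients = two_label_partition(toy_labels, n_clients=5, samples_per_client=50)
```

Each toy client ends up with 40 training and 10 test samples drawn from only two classes, which reproduces the kind of label skew the setting relies on.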

B. Results
Existing FL approaches such as FedProx and FedAvg focus more on the learning performance of the global model than on the learning performance at clients. Therefore, for forthcoming personalized applications, we implement DemLearn and measure the learning performance of all clients and the group models. In particular, we evaluate the average specialization (C-SPE) and generalization (C-GEN) of the learning models at agents, defined as the performance on their local test data only and on the collective test data from all agents in the region, respectively. Accordingly, we denote by Global the global model performance, and by G-GEN and G-SPE the average generalization and specialization performance of group models, respectively. In addition to the standard C-SPE performance for local models, the introduced C-GEN performance is an important metric that shows the generalized capabilities of local models. Even though the biased local models can achieve high C-SPE values very early, particularly due to their small local datasets, they still have very low generalized capabilities, which are needed to produce good predictions under the frequent changes of users. Meanwhile, the global and group models have the highest generalized capabilities, but lower specialized capabilities during deployment at the clients.
In Figs. 4 and 5, we compare the performance of our proposed method, DemLearn, with three FL methods, FedAvg [2], FedProx [19], and pFedMe [12], as baselines on four benchmark datasets: MNIST, Fashion-MNIST, FE-MNIST, and CIFAR-10. Fig. 4a depicts the performance comparison of DemLearn with FedProx and FedAvg on the MNIST dataset. Experimental evaluations show that the proposed approach outperforms the baselines in terms of convergence speed, especially in obtaining better client generalization performance. We observe that the local model requires only 40 rounds to reach a C-GEN performance level of 80% using the proposed algorithm, whereas existing FL algorithms such as FedProx, FedAvg, and pFedMe take more than 80 global rounds to achieve a comparable level of performance. Furthermore, after 100 rounds, DemLearn obtains better average client generalization performance (i.e., 88.77%) across client models, and C-SPE and Global performance comparable to those of FedAvg and pFedMe.
Following similar trends, Fig. 4b, Fig. 4c, and Fig. 5 show that the DemLearn algorithm performs better than FedProx and FedAvg on the Fashion-MNIST, FE-MNIST, and CIFAR-10 datasets in terms of client generalization performance. In Fig. 4c, we found it difficult to tune the parameters of pFedMe to obtain comparable performance on the FE-MNIST dataset, where it showed lower and more fluctuating performance than the other algorithms. Thus, pFedMe obtains a slow improvement of the client models in both specialization and generalization. Meanwhile, our proposed algorithm exhibits stable convergence speed and efficiency in achieving consistently high performance of learning models at all levels. In Fig. 5, we observe that DemLearn suffers a slight degradation in C-SPE and Global performance to gain a high C-GEN performance. After 100 rounds, DemLearn demonstrates good trade-off learning capabilities of client models with high C-SPE (79.09%) and C-GEN (57%) performance, while other baseline algorithms only produce biased local models with low generalized capabilities.
In Fig. 6, we evaluate and compare the performance of the proposed algorithm for a fixed and a self-organizing hierarchical structure via periodic reconstruction (i.e., τ = 1). We observe that DemLearn benefits from the self-organizing mechanism, and can provide slightly better generalization capability of client models. In addition, the amplification in the first 5 rounds speeds up the initial performance of the DemLearn algorithm.
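The two client-level metrics can be computed as in the following sketch. The helper names are hypothetical and a model is represented simply as a prediction function; the only assumption beyond the definitions above is that the collective test set is the concatenation of all clients' local test sets.

```python
def accuracy(predict, data):
    """Fraction of (x, y) pairs in data that predict classifies correctly."""
    return sum(1 for x, y in data if predict(x) == y) / len(data)


def client_metrics(models, test_sets):
    """Return (C-SPE, C-GEN) averaged over clients.

    models:    one prediction function per client
    test_sets: one list of (x, y) pairs per client, in the same order
    """
    collective = [sample for ts in test_sets for sample in ts]
    # C-SPE: each client model on its own local test data.
    c_spe = sum(accuracy(m, ts) for m, ts in zip(models, test_sets)) / len(models)
    # C-GEN: each client model on the collective test data of all clients.
    c_gen = sum(accuracy(m, collective) for m in models) / len(models)
    return c_spe, c_gen


# Two fully biased toy clients: each model always predicts its own label.
toy_models = [lambda x: 0, lambda x: 1]
toy_tests = [[("a", 0), ("b", 0)], [("c", 1), ("d", 1)]]
toy_c_spe, toy_c_gen = client_metrics(toy_models, toy_tests)
```

The toy clients reach perfect C-SPE but only 50% C-GEN, which is the gap between specialization and generalization that the evaluation is designed to expose.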

C. Tuning Parameters of the proposed algorithm
In this subsection, we show the impact of the parameter α on the testing accuracy for the MNIST and Fashion-MNIST datasets, as shown in Fig. 7. As can be seen in the results, when α is large, we obtain high client generalization performance with a slight degradation in the specialized capabilities. As such, increasing the value of α enhances generalization, but also reduces the specialization performance of client models to a marginal extent. Since $\alpha = \mu_{k+1} / (\mu_k + \mu_{k+1})$ controls the contribution from the learning models of the upper-group and the group members, tuning α affects the dissemination of the generalized knowledge from upper-groups to the lower-level groups and clients. At each level k, a higher value of α signifies that the objective is more focused on minimizing the gap with the upper-group model at level k + 1 than with the lower-group models at level k − 1. Furthermore, we evaluate the impact of other parameters, such as µ, K, and τ, but they do not show any clear effects on the performance of DemLearn, likely due to the small-scale simulation settings. In addition, reducing the frequency of cluster updates by increasing the parameter τ can help the algorithm obtain comparable performance with a lower reorganizing cost of the hierarchical structure. However, due to the limited scope of the simulated datasets and settings, we believe that the impact of these parameters on the performance of the proposed algorithm will be better understood when evaluating the algorithm with more practical data. In particular, we leave it as future work to evaluate our algorithm on a dataset that has a hierarchical structure based on groups of similar users, or on exceedingly large and different sets of users.

D. Clustering Approach
In addition to the Euclidean distance between learning parameters in the hierarchical clustering algorithm, we can also evaluate the measured distance φ between learning models based on the cosine similarity as follows:

$$\phi_{n,l}^{(\cos)} = \cos(w_n, w_l) = \frac{\sum_{m=1}^{M} w_{n,m}\, w_{l,m}}{\|w_n\|\, \|w_l\|} . \qquad (11)$$

Our experimental results demonstrate that the client and group models show almost similar learning performance (Fig. 8) and different trends of cluster topology (Fig. 9) using the DemLearn algorithm. As shown in the Client-GEN subfigure, some outliers contribute to the drop in C-GEN performance throughout the training. At the same time, most clients have generalized capacities comparable to that of the global model. Accordingly, the illustration further helps us identify the exceedingly different clients with low C-GEN performance. That means, in each round, the clustering mechanism detects (and classifies) as outliers the clients whose model parameters are exceedingly different from those of the other clients. However, we note that the outliers are not always the same and change throughout their local model updates. Still, some appear several times, which provides meaningful information for the learning system. Furthermore, as shown in Fig. 9, the difference in cluster topology does not have much effect on the overall learning performance.
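The cosine-based measure is straightforward to compute from two parameter vectors, as in the sketch below. The function names are hypothetical, and turning the similarity into a dissimilarity as one minus the cosine is an assumption made here so that smaller values mean closer models; the paper does not state which conversion is used.

```python
import math


def cosine_similarity(w_n, w_l):
    """cos(w_n, w_l) for two non-zero parameter vectors of equal length."""
    dot = sum(a * b for a, b in zip(w_n, w_l))
    norm_n = math.sqrt(sum(a * a for a in w_n))
    norm_l = math.sqrt(sum(b * b for b in w_l))
    return dot / (norm_n * norm_l)


def cosine_distance(w_n, w_l):
    """Assumed dissimilarity for clustering: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(w_n, w_l)


# Parallel vectors are maximally similar; orthogonal ones have similarity 0.
sim_parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
sim_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Unlike the Euclidean distance, this measure ignores the scale of the parameter vectors and compares only their directions.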

E. Discussion
Compared to FedAvg and FedProx, DemLearn has a similar on-device computational cost for solving the PLP problem using the gradient descent method. Different from the others, pFedMe requires extra steps for the θ approximation and hence requires more computation time on the client device. On the other hand, regarding the computation time at the server, the DemLearn algorithm requires extra computation cost for hierarchical clustering and a negligible cost for hierarchical updates. We note that the cost of the hierarchical clustering depends on the model size and the number of learning agents. In principle, the standard algorithm for hierarchical agglomerative clustering has a time complexity of O(n^3) and requires O(n^2) memory [24]. In this work, we evaluate the algorithm with n = 50 users and obtain a running time of 0.0015 s per step on average when using a computer with an i7-7700K CPU and 32 GB of memory. For the communication cost, all algorithms require similar costs to send the model parameters.
In practice, to deploy a typical hierarchical structure, the Dem-AI systems can include three entities:
• A cloud server that handles the global model (root node) and its sub-level generalized groups;
• Distributed regional servers, which are edge servers deployed within each region and whose role is to manage the subgroups and learning agents;
• Learning agents.
Our implementation of hierarchical clustering in this paper is centralized. However, since the agglomerative hierarchical clustering mechanism merges the learning agents and groups in a bottom-up manner, it is possible to implement it in a decentralized manner. Therefore, we can operate the hierarchical clustering in the cloud server for the top levels and use a decentralized implementation at the regional edge servers for lower-level groups. Also, this grouping mechanism can mitigate the negative effects of aggregating exceedingly different learning agents when constructing hierarchical generalized models. In that way, we maintain three or more levels of generalized models instead of a single global model.
To the best of our knowledge, this design has not been considered yet in recent studies of personalized FL.

V. CONCLUSION AND FUTURE WORKS
The novel Dem-AI philosophy has provided general guidelines for specialized, generalized, and self-organizing hierarchical structuring mechanisms in large-scale distributed machine learning systems. Inspired by these guidelines, we have formulated the hierarchical generalized learning problems and developed a novel distributed learning algorithm, DemLearn. In this work, based on the similarity in the learning characteristics, the agglomerative clustering enables the self-organization of learning agents in a hierarchical structure, which gets updated periodically. Detailed analysis of experimental evaluations has shown the advantages and disadvantages of the proposed algorithm. Compared to conventional FL, we show that DemLearn significantly improves the generalization performance of client models without largely compromising their specialization performance. As a result, DemLearn enables good trade-off learning capabilities of client models with high C-SPE and C-GEN performance, while other algorithms can only produce biased local models with low generalized capabilities. These observations contribute to a better understanding and improvement of the specialization and generalization performance of the learning models in future Dem-AI systems.
Democratized learning provides unique ingredients to develop future distributed personalized intelligent systems. To that end, the learning design could be further studied with personalized datasets, extended for multi-task learning capabilities, and validated with actual generalization capabilities in practice for new users and environmental changes. We acknowledge that our current design does not yet cope with multiple learning tasks [3] or with the adaptability to environmental changes expected of general intelligent systems. To turn general distributed learning systems into reality, we need to profoundly analyze Dem-AI from a variety of perspectives, such as the robustness and diversity of the learning models and novel knowledge transfer and distillation mechanisms [25], [26]. It is also possible to combine our flexible design with current approaches, such as meta-learning and optimization-based methods, to further improve the personalization in FL. We consider our work as an orthogonal contribution to the design architecture of distributed
Fig. 4b: Experiment with Fashion-MNIST dataset.
Fig. 4c: Experiment with Federated Extended MNIST dataset.

Fig. 6: Comparison of algorithms for a fixed and a self-organizing hierarchical structure.

DemLearn, FedAvg and FedProx use the common learning rate η = 0.05, local epoch E = 2, and batch size B = 10. For FedProx, we set the parameter µ = 0.5. Meanwhile, pFedMe needs detailed tuning to obtain a competitive accuracy for different datasets. The implementation of our proposed algorithm is written in Python using PyTorch.