FeLebrities: A User-Centric Assessment of Federated Learning Frameworks

Federated Learning (FL) is a new paradigm aimed at solving data access problems. It provides a solution by moving the focus from sharing data to sharing models. The FL paradigm involves different entities (institutions) holding proprietary datasets that collaborate with each other to train a global Artificial Intelligence (AI) model using their own locally available data. Although several studies have proposed methods to distribute the computation or aggregate results, few efforts have been made to describe how to implement FL pipelines. With the aim of accelerating the exploitation of FL frameworks, this paper proposes a survey of public tools that are currently available for building FL pipelines, an objective ranking based on the current state of user preferences, and an assessment of the growth trend of each tool's popularity over a one-year time window, with measurements performed every six months. These measurements include objective metrics, like the number of “Watch,” “Star,” and “Fork” indicators available from software repositories, as well as thirteen custom metrics grouped into three main categories: Usability, Portability, and Flexibility. Finally, a ranking of the maturity of the tools is derived based on the key aspects to consider when building an FL pipeline.


I. INTRODUCTION
Federated learning (FL) is a paradigm that aims to solve the data access problem. In the Artificial Intelligence (AI) domain, data represent the starting point for many research and development activities [1], [2], [3]. With increasing attention given to the field, data have also grown in demand and appreciation, redefining priorities in designing and building solutions for real-world applications. A clear demonstration of this growing importance is the creation of dedicated laws, such as the General Data Protection Regulation (GDPR) [4] in place in the European Union, the Protection of Personal Information Act (POPIA) [5], and the Health Insurance Portability and Accountability Act (HIPAA) [6] in the USA, which is specific to accessing clinical data and medical records. From the AI perspective, this reflects the need to access data to advance the State of the Art (SOA) in a given environment while fully complying with regulations. FL is an effective way to satisfy all these requirements. In a federation of collaborating institutions, what is shared is a common global model that is partially trained by every collaborator using local data. Historically, the approach to training AI models assumed that data would be collected and centralized in a unique infrastructure appropriately equipped with dedicated hardware and software to sustain the computation. High performance computing (HPC) centers are great examples of this approach, as illustrated in Figure 1. In contrast, in an FL setting, data are expected to stay in the exact location where they were collected, while a copy of the global AI model is shared across all institutions participating in a federation. A generic example is shown in Figure 2.
The research community has already started investigating this emerging topic either for its privacy-compliant aspects [1], [7] or as a viable tool for addressing AI challenges in critical domains such as the biomedical context [8], [9], [10]. Although the domain is still relatively new, the literature motivates the contributions of this work, which include:
4) Propose a ranking based on objective metrics, including common indicators and the ability to match the needs highlighted in the previous point.
We genuinely believe that by providing a quantitative and qualitative survey of FL tools, the research community will be able to accelerate its activities, promote fairness by adopting an inclusive method to collect comparable studies, and help tool providers identify ways to improve their products. The availability of a ranking of FL tools will also boost their exploitation in production environments, where such tools remain largely unexplored.

A. PAPER ORGANIZATION
This paper is composed of seven sections. In Section II, we discuss related works on FL implementations. Section III focuses on the list of tools currently available to the community, sharing a high-level overview of their popularity and adoption. Section IV augments the retrieved list of tools with the current state of adoption, including the growth trend observed over one year, and Section V discusses the key aspects that should be considered when implementing federated environments for research purposes. These factors are then translated into requirements that FL tools need to satisfy for successful exploitation and consolidated in a ranking table. The results are discussed in Section VI, and future directions and conclusions are finally addressed in Section VII.

II. RELATED WORKS
FL is a distributed machine learning (ML) approach that enables organizations to collaborate on projects without sharing sensitive data [12], such as patient records [13], [14] or financial data [15], or data that are not easily accessible, such as high-resolution sensor data stored in remote locations like satellites or space stations [16]. The basic premise of FL [1], [2] is that the model moves to meet the data rather than the data moving to meet the model. Therefore, the only data movement required across the federation consists of the model parameters and their updates.

A. FL SETTINGS
There are two essential components of an FL pipeline: one or multiple institutions owning data, and a mechanism to orchestrate the process. Each institution must have local data and be accountable for hosting the training process on proprietary data. The orchestration mechanism may vary, but it is mainly of two types: synchronous or asynchronous.
In a synchronous scenario, the idea is to have a central unit, often identified as an aggregator [12], [13], acting as a central pivot and determining when to start a new iteration. The aggregator is responsible for cloning the initial model to each collaborating institution, waiting to receive the locally trained copies, and finally merging them, as the name suggests. This type of FL pipeline is usually implemented in big data centers (cross-silo), such as those involved in medical environments [3], [17]. Data centers can store vast amounts of data and provide the computational power required to process them. In addition, big computing infrastructures, such as HPC centers, can rely on fast and stable network connections, simplifying the creation of a more reliable communication channel to interact with a hypothetical aggregator unit. However, as soon as we move away from data centers towards edge devices, new challenges arise owing to the high variance in products and manufacturers. Devices with different latencies, working frequencies, and hardware features can have different computation times [18], [19]. These are the reasons for the need for an asynchronous FL pipeline. In this scenario, each collaborating institution can share its update at any time, either with a unique aggregator [18], [20], [21] or with other participants in an ''all-to-all'' setup [22], [23].
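To make the synchronous workflow concrete, the following minimal sketch shows one aggregator-driven round with a sample-weighted average of local parameters (in the spirit of FedAvg); the function and variable names are our own illustration and do not come from any of the surveyed tools.

```python
import numpy as np

def aggregate(local_weights, sample_counts):
    """Merge locally trained copies by a sample-weighted average."""
    total = sum(sample_counts)
    return sum(w * (n / total) for w, n in zip(local_weights, sample_counts))

# One synchronous round: the aggregator clones the global model to each
# collaborator, waits for the locally trained copies, and merges them.
global_model = np.zeros(4)                             # toy parameter vector
local_copies, counts = [], []
for seed, n_samples in [(0, 100), (1, 250), (2, 50)]:  # three institutions
    rng = np.random.default_rng(seed)
    local = global_model + rng.normal(0, 0.1, size=4)  # stands in for local training
    local_copies.append(local)
    counts.append(n_samples)

global_model = aggregate(local_copies, counts)
```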
Another critical point to address is the difference between horizontal (HFL) and vertical (VFL) federated learning. To understand this difference, we need to consider the feature space and the model type.
In the examples shared thus far, we implicitly refer to Horizontal FL, where the different collaborators have different data but contribute to the federation by sharing the feature space and training the same model. This is the case for institutions with offices distributed across different locations that would like to train a common model by leveraging the local data stored in each facility in a privacy-compliant manner. In Vertical FL, each collaborator is expected to contribute by providing different bits of information about the same sample. This leads to a scenario in which the feature space accessed by every collaborator may differ from the others. Therefore, each collaborator might train a different model in the vertical configuration. Aggregation, in this case, is represented by the interoperability between collaborators, where updating a model might require information coming from the model of another collaborator [24], [25]. For example, in a typical VFL setting, a life insurance agency might collaborate with hospitals to build a decision model to obtain more precise estimations about their affiliates. In this case, the entities involved in the federation are expected to provide different information about the same user. These two ways of articulating the data for a federation impact the choice of model and how the federation is orchestrated. While in HFL there is only one model, and all the collaborators are responsible for ensuring that data are normalized to feed it, VFL brings some additional complexity. In this case, to handle different data types from several institutions, each collaborator should have a local model that can accept the data from that specific institution as input. In addition, there must be a federated model that takes the outputs of the various local models as inputs. As illustrated by Chen et al. [25], the procedure for training Deep Learning (DL) models based on back-propagation [26], [27] needs to deal with the two-level training procedure represented by the different models that need to be managed: one at the collaborator level and the other at the aggregation point. This complexity is also reflected in the challenges that might arise in finding a satisfactory convergence point for the adopted DL model.
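A minimal sketch may help visualize the vertical configuration: two collaborators hold different feature slices of the same samples, each feeds its own local model, and a federated head consumes the combined outputs. All names and shapes below are hypothetical illustrations, not taken from any surveyed tool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two collaborators hold different feature slices of the *same* samples
# (e.g., a hospital and an insurer describing the same affiliates).
n_samples = 8
features_a = rng.normal(size=(n_samples, 3))      # collaborator A's view
features_b = rng.normal(size=(n_samples, 5))      # collaborator B's view

# Each collaborator owns a local model matching its own input space,
# so the architectures can differ across the federation.
local_model_a = rng.normal(size=(3, 2))
local_model_b = rng.normal(size=(5, 2))

# The federated model takes all local outputs as inputs; with
# back-propagation, its gradients would flow back into both local
# models, which is the two-level training procedure discussed above.
federated_head = rng.normal(size=(4, 1))

local_out = np.hstack([features_a @ local_model_a,
                       features_b @ local_model_b])
prediction = local_out @ federated_head
```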

B. FL CHALLENGES
Regardless of which FL setting (synchronous or asynchronous) or configuration (horizontal or vertical) is adopted by a given federation, the research community currently addresses three main areas: 1) aggregation functions and model convergence starting from different data distributions; 2) privacy aspects and ways to build a secure FL pipeline that protects the IP during experiments; 3) communication efficiency and protocols to improve the FL base infrastructure. Protecting dataset ownership implies that, in most cases, the assumption of dealing with independent and identically distributed (i.i.d.) samples across local nodes does not hold for FL setups [28], [29]. Data distribution can severely impact the training performance by affecting the total accuracy [30], convergence capability, authentication processes (especially in the case of different devices), and the speed of the process, intended as the total time-to-train [31]. In summary, in this setting, the performance of the training process may vary significantly according to the imbalance of the local data samples and the particular statistical distribution of the training examples (i.e., features and labels) stored at the local nodes [2].
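To illustrate the non-i.i.d. issue, the sketch below partitions a labeled dataset across clients with label skew using a Dirichlet draw, a common device in the FL literature for emulating heterogeneous local distributions; the function name and parameters are ours, not from the surveyed works.

```python
import numpy as np

def dirichlet_label_split(labels, n_clients, alpha, seed=0):
    """Partition sample indices across clients with label skew:
    a smaller alpha yields more heterogeneous (non-i.i.d.) local data."""
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet([alpha] * n_clients)   # per-client share of class c
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(clients, np.split(idx, cuts)):
            client.extend(part.tolist())
    return clients

labels = np.random.default_rng(1).integers(0, 10, size=1000)
partitions = dirichlet_label_split(labels, n_clients=5, alpha=0.1)  # strongly skewed
```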
In the past few years, institutions have introduced FL deployments to address the need to train AI models. Sectors such as healthcare and finance would benefit from a setting with greater access to more extensive and diverse datasets without violating privacy laws [32], [33], such as HIPAA, GDPR [4], and POPIA [5]. Although FL has been designed with security in mind [30], the setup is just the beginning: securing execution environments introduces many open challenges to the research field [34]. Key questions include finding a consolidated method to guarantee secure execution (encryption, key exchange, and hardware features) and validating the reliability of intermediate results and collaborators within the federation.
Massive amounts of data are usually stored in ''Data-Lake'' infrastructures. The more machines/institutions participate in a federation, the more critical the ability to scale becomes. As mentioned in the previous section, a consolidated method for detecting poor training contributions (coming from institutions with corrupted or redundant data) is still lacking, to the best of our knowledge. Aggregation functions are currently being evaluated by the research community [28], [33], [35]. Another implication at large scale is the infrastructure and the connectivity chosen by the institutions for communication [2].

C. STUDY RELEVANCE
Several studies have proposed surveys to illustrate the advancement of the field [1], [11]; however, to the best of our knowledge, no one has provided a ranked list, based on ad hoc quality assessment criteria, of all the (possible) tools available to the community to implement FL experiments. A comparison of five tools, accessible through a licensed service, is provided in [36], without clarifying why or how precisely these tools were selected. Another study [37] provides an attractive comparison table. However, the main focus of that work is to promote an alternative tool specifically for FL benchmarks instead of providing a complete list of the available options to boost the exploitation of FL across the community. Even in this related work, it is unclear why and how the discussed tools were selected. Similarly, [38] proposed a complete benchmarking suite with a helpful decision tree to help users choose a tool based on their requirements. Their recommended ranking also includes some of the evaluation metrics proposed in this study, with an even deeper level of detail. However, while we believe in the value of such an approach, the breadth of the offer in terms of selectable tools might represent a constraint for end users. In fact, [38] centers its evaluation on nine tools, but the criteria by which those tools were identified and selected remain unclear. As we discovered in this work, the list of open-source FL tools exceeds 30, and it is interesting to note how the most popular tool to date was not considered in their decision tree.

III. FEDERATED LEARNING TOOLS
A. METHODS AND PREMISES
This article aims to provide an inclusive and informative list of the current FL tools available to the community for implementing research pipelines in any environment in which accessing distributed data is challenging. To better understand the present scenario, we performed three literature searches, H_i, where i ∈ {1, 2, 3}. H_1 was conducted on March 28, 2022, H_2 on September 28, 2022, and H_3 on April 10, 2023.
This activity was inspired by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. More specifically, we followed the Preferred Reporting Items for Systematic Review and Meta-Analysis of Diagnostic Test Accuracy Studies (PRISMA-DTA): explanation, elaboration, and checklist [39]. In particular, the guidelines we followed were a selection of those described in the PRISMA 2020 checklist, accessible on the official PRISMA website: http://www.prisma-statement.org/. Below is a detailed description of the items extracted from the PRISMA guidelines that were identified as applicable to this collection. Items not included in the list below either shared best practices unrelated to the ''Method'' (i.e., best practices for ''Title,'' ''Introduction,'' and ''Abstract'' for systematic reviews) or were not directly relatable to this contribution, as it does not fully match a ''Systematic Review.'' An example of a discarded item is: Cite studies that might appear to meet the inclusion criteria, but which were excluded, and explain why they were excluded.
• 23c) Discuss any limitations of the review processes used.
Each of these items was used to frame the study. The following map illustrates how the single guidelines contributed to shaping the sections:
• Items 5, 6, 7 and 8 were considered for building this Section;
• Items 13a and 13d were used to build the comparison table in Section IV;
• Items 16a, 16b and 23c were used to structure the discussion of the results provided in Section V.

B. EXPLORING TOOLS
To objectively build the list of tools, we performed three harvests, H_1, H_2, and H_3, with a cadence of roughly six months (184 and 194 days, respectively). The collection method, described below, was the same for all harvests. We used three different search engines: Google Scholar [40], Semantic Scholar [41], and the standard Google website.
For the first two, we developed a script to automatically query the search engines using a collection of keywords on the topic. We built such a collection by combining each item p of a list of prefixes P with each element s of a list of suffixes S. The set of prefixes was populated with the ''federated learning'' keyword and other synonyms or closely related terms used in the literature to express similar concepts: P = {'federated learning', 'privacy-preserving machine learning', 'collaborative learning', 'collaborative machine learning'}.
This led to a broad and inclusive search of all the relevant articles and works in the domain.
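As a sketch, the prefix-suffix combination can be expressed in a few lines; the prefixes below are those reported above, while the suffix list S is not fully reported in the text, so the suffixes shown are hypothetical placeholders.

```python
from itertools import product

P = ['federated learning', 'privacy-preserving machine learning',
     'collaborative learning', 'collaborative machine learning']
S = ['framework', 'tool']  # hypothetical suffixes, for illustration only

# Every prefix is combined with every suffix to form the query set.
queries = [f'{p} {s}' for p, s in product(P, S)]
```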
Google Scholar helped capture all related works where a given keyword (or part of it) was mentioned anywhere in the paper and not only in the title. In H_1, we identified a cumulative list of 420 related articles, of which 217 were unique. Despite this contribution, due to a service limitation, the reported numbers refer to H_1 only: the article harvest completed in March 2022.
To build a more robust and consistent set of related works, we leveraged the Semantic Scholar service [42]. The website allows users to perform queries and sort the outcome according to four metrics: ''Relevance'', ''Citations-count'', ''Most Influential Paper'', and ''Recency''.
We repeated the searches using all the keywords for all four sorting types mentioned above, obtaining 1121 (out of 8320) unique articles during H_1. Finally, we used the standard Google search engine to ensure we could capture all the relevant FL tools yet to be described in a published paper. To do so, we evaluated the first ten results obtained by querying the search engine with the same list of keywords used previously. This step allowed us to enrich the list with additional FL frameworks, such as Nvidia Flare [43], Tensorflow Federated [44], and IBM Federated [45].
Once we obtained the three lists of unique titles described above, we merged them, resulting in 1195 unique articles discovered in H_1, 1292 retrieved in H_2, and 1298 in H_3. We then started pruning the results by manually reviewing and labeling the list into three different buckets: ''relevant'' (R), ''non-relevant'' (NR), and ''uncertain'' (TBD).
Articles clearly unrelated to the topic (e.g., works mentioning ML methods or collaborative learning platforms for schools) were discarded from the collection. After the first labeling cycle, we had 65 R, 980 NR, and 150 TBD for H_1; 9 R, 227 NR, and 34 TBD for H_2; and 17 R, 198 NR, and 33 TBD for H_3. Figure 4 shows an example of articles captured by the three categories.
The ''uncertain'' category required us to conduct a deeper review of the works. All articles in this list underwent a second round of labeling. The objective was to review the papers on the TBD list and allocate them to either R or NR. As a result, we obtained 83 R and 1112 NR for H_1, 12 R and 258 NR for H_2, and 19 R with 229 NR for H_3.
A summary of the adopted research pipeline and the results collected during each harvest is shown in Figure 3.
Ultimately, a deeper review of the relevant papers was performed to draw the final list of FL tools. This final review allowed us to identify 36 suitable tools during H_1, two additional tools during H_2, and one last tool added in H_3. The complete list of tools retrieved in H_1, with the indicators from the Github and Gitlab repositories, is presented in Table 1.

IV. TOOLS POPULARITY AND LEVEL OF ADOPTION
After retrieving the list of tools, our goal was to understand each item's popularity and adoption level from a community perspective. Each Git repository has public indicators, such as the number of watches (W), forks (F), and stars (S). The Watch indicator captures the number of users who actively watch the repository; these users receive updates when new actions are taken in the repository. The number of forks indicates the number of times a repository has been forked. It is a good indicator of how many interested users might develop code to extend the tool. Finally, the number of stars indicates the number of likes that the repository has received. This final indicator may be less accurate in capturing actual users, but it can provide a reasonable estimation of reach in terms of how many people have seen the tool at least once. For practicality, we aimed to combine these three aspects into
one consolidated score to provide a popularity-driven ranking of the tools. Since popularity also depends on the time a given repository has been available to the community, we normalized all the values with a timing factor. This step ensured that newer repositories with less exposure to the community would not be affected by a low score. The scores are calculated as follows:

Score_{H_i} = (W + F + S) / ET_{H_i},

where ET_{H_i} is the time elapsed between the day of the tool's first commit and the harvest date H_i. Table 1 shows the scores associated with the tools retrieved in H_1, the March harvest. An initial understanding of which tools were accessible to the community was helpful but could provide only a limited view of the bigger picture. Indeed, while Git indicators can share important insights about user preferences in a given time frame, they do not necessarily capture community trends from a popularity growth rate perspective.
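A small sketch of how such a time-normalized score can be computed follows; the exact combination of W, F, and S is reconstructed from the description above, and the repository dates are hypothetical.

```python
from datetime import date

def popularity_score(watches, forks, stars, first_commit, harvest):
    """Popularity indicators combined and normalized by repository age
    (in days) so that newer tools are not penalized."""
    elapsed_days = (harvest - first_commit).days
    return (watches + forks + stars) / elapsed_days

# Hypothetical repository, scored at the H_1 harvest date.
score_h1 = popularity_score(120, 800, 4500,
                            first_commit=date(2019, 6, 1),
                            harvest=date(2022, 3, 28))
```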
To access this information, we observed the list of tools over a time window to determine which tools were being considered by the community at a higher pace. Thanks to the second harvest H_2, held in September 2022, and the third H_3, done in April 2023, we discovered new tools to add to the list and updated the values of W, F, and S for each of the tools found in March. Knowing the differences between the indicator values recorded in the various harvests, given two harvests H_i and H_j where j > i, we computed the growth rate for each tool as follows:

GR_{ij} = (Score_{H_j} - Score_{H_i}) / Δ_{ji},

where Δ_{ji} is the difference in days between Date(H_j) and Date(H_i). More precisely, Δ_{21} = 184 days, Δ_{32} = 194 days, and Δ_{31} = 378 days. We calculated the growth rate GR for all possible combinations: GR_{12} captures the growth of the repositories between H_1 and H_2, GR_{23} the growth between H_2 and H_3, and GR_{13} the yearly evolution between H_1 and H_3. The results of these computations are shown in Figure 5.
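In code, the growth rate computation is a one-liner applied to each tool's scores; the three scores below are hypothetical values for a single tool.

```python
def growth_rate(score_i, score_j, delta_days):
    """Score growth between two harvests, normalized by the days between them."""
    return (score_j - score_i) / delta_days

score_h1, score_h2, score_h3 = 5.4, 6.1, 7.9   # hypothetical tool scores

gr_12 = growth_rate(score_h1, score_h2, 184)   # H_1 -> H_2
gr_23 = growth_rate(score_h2, score_h3, 194)   # H_2 -> H_3
gr_13 = growth_rate(score_h1, score_h3, 378)   # H_1 -> H_3
```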
Interestingly, the order of the tools based on the popularity level observed in H_1 and captured in Table 1 does not match the growth rate highlighted in Figure 5.

V. PRIORITIES IN FL TOOLS FOR RESEARCH
The previous Sections illustrated how we retrieved a list of relevant tools for building FL pipelines and which of them are preferred by the community. In this Section, our goal is to provide a new ranking of the tools to identify the most mature ones. We focused on the specific features of each tool, regardless of the popularity aspects outlined in the previous Sections.
The task consists of evaluating which of the retrieved FL tools can be adopted to boost exploitability. To perform this classification, we defined a set of measures based on the different needs and expectations that a tool should satisfy according to the application field and final objectives. As described in the SOA Section, there are several ways of implementing FL. From bridging different data-center institutions at the production level to leveraging the agile nature of IoT devices, FL pipelines must be shaped according to the needs, goals, and constraints at hand. However, in line with our purpose of identifying the most mature FL tools for research activities, there is no need to filter results based on data centers or edge devices as long as the tools provide the possibility of simulating multiple decentralized, abstracted computing hubs.
However, other considerations should be drawn around what has been highlighted in the FL challenges regarding data distribution, confidential computing, and communication efficiency.
Indeed, those aspects are essential because they directly reflect requirements that FL tools need to fulfill to be considered. For example, they all highlight the need for a flexible and modular architecture to allow maximum research customization of aggregation functions, communication protocols, or privacy-preserving and security features. Another practical insight, derived from Items 2 and 3, concerns the ability of a tool to scale out on multiple computing machines. As demonstrated in [16], the applicability of FL is not only related to the need to access data in compliance with regulations. It can also refer to data that are not readily retrievable, such as those from a satellite or space station. Furthermore, the medical [3] or geo-spatial [16] environments are usually sources of high-resolution data acquired by machines manufactured by different companies, which could translate into the need for dedicated pre-processing routines to feed an AI model. FL approaches can be tested on multiple machines hosting different datasets (generated by different equipment) or by simulating multiple parallel instances running on the same computing node. The first setting is preferred because it enhances the reliability of the conclusions when investigating the privacy and communication efficiency aspects.
In addition to what has been outlined, other practical considerations related to the research environment may also apply. The easier the path to results in a research environment, the faster the deployment of this technology in real institutions. For example, being able to quickly set up a federated environment by leveraging friendly APIs, re-using common and well-established languages (such as Python) and AI platforms (such as PyTorch or TensorFlow, to mention two), and having access to direct support channels or useful documentation can represent critical aspects for simplifying research activities in different domains.

A. EVALUATION METRICS
Based on these observations, we consolidated a list of evaluation parameters and guidelines organized into three categories: Usability, Portability, and Flexibility, covering aspects that range from documentation to easy integration with other tools (the individual parameters are detailed, together with their scoring rules, below). We used these parameters to build an ''Evaluation Table'' (Table 2) for the tools identified in the previous Section. The table was populated with information retrieved from publicly available resources for each tool (see the ''Docs.'' column in Table 2). As can be seen, there is a mismatch between the tools listed in Table 1 and those listed in Table 2. This is mainly due to the following four reasons:
1) Tools that are not open-source, like Sherpa-ai [70].
2) Missing repositories: tools that have not yet released their code after the paper publication: Chiron [72], FedHealth [73], FAE [74], GENO [75], FedTGan [76], and IPLS [78].
3) Coherent but not suitable: LEAF [52], FL-Bench [77], and PyFed [64], which are positioned for benchmarking purposes and therefore might lack essential features for conducting more extensive research activities; FedGraphNN [53] is a sub-project of the larger FedML [48] initiative, already included in this survey.
4) New tools or new openings: Nvidia-Flare [43] (a sub-project of Nvidia-Clara) and FL-Pytorch [71], which opened their repositories at some point after the H_1 harvest, as well as FLUTE [79] and PLATO [80], retrieved (and added) during H_2, and XFL [81], retrieved in H_3.
After pruning the 12 tools that did not qualify and adding 2 (despite the substitution with Nvidia-Flare [43], Nvidia-Clara was already captured by Table 1), at the end of the April 2023 harvest H_3 we ended up with a list of 28 total tools.
In a second instance, a score was associated with each cell based on a quantitative assessment. Aiming at an objective classification of the tools, we captured qualitative aspects in a very inclusive manner, rewarding tools that demonstrated additional development efforts for the community through the available material, without penalizing new promising tools that might still be under development.
More precisely, the proposed scoring method rewards completeness rather than excellence. This means that a tool that supports 20 programming languages but lacks other relevant features would not outperform a tool that supports only one programming language but has more features that simplify the user experience. This is achieved by using scores with a reduced range of values to allow newer but promising tools to compete with more mature ones. In more detail, we adopted a simple approach to assign a score to each cell and designed the ''Score Table'' (Table 3), as follows (a sketch of how these rules can be encoded is given after the list):
• Documentation: We considered having a paper P and/or a public repository for the tool, Gh (Github) or Gl (Gitlab), as a minimum requirement. Therefore, we assigned zero to all the tools that did not match this expectation and rewarded with 1 point the tools with at least one additional source of information (such as a dedicated web page or richer documentation that goes beyond Readme files on repositories or Slack support). Finally, 1.5 points were given to all those that provided two or more sources.
• Developer Experience (DX): we assigned 0 points to all the tools that did not seem to mention or provide a user interface of some sort (e.g., Jupyter notebook [82] or Google Colab [83], to mention two). We assigned 1 point to all the tools with at least one form of user interface abstracting away from programming on the command line. Finally, 1.5 points were given to all the tools with two or more user interfaces.
• Language: We assigned 0 points where information about the supported version was not clearly outlined in the documentation. One point was given to the tools supporting at least one language (or one version), and 1.5 points were given to all tools supporting two languages (or two versions of a language). Finally, 2 points were given to all the tools whose engineering team made the extra effort to support more than two languages (or more than two versions of the same language).
• Supported AI frameworks: We assigned 0 points where the information about the supported AI frameworks was not clearly outlined in the documentation; 1 point was given for each supported framework. When the number of different supported frameworks exceeded 2, we assigned a maximum score of 2.5 points.
[Table 2 caption: In the documentation (''Docs.'') and distribution channels (''Dist. channel'') columns, P = Paper, Gh = Github, Gl = GitLab, W = Website. In the ''Supported AI Type'' column, DL = Deep Learning and ML = Machine Learning. Column ''H/V'' differentiates between ''Horizontal'' and ''Vertical'' FL, while the ''Sync/Async'' column indicates whether the tool supports synchronous (S) or asynchronous (A) workflows. The NM label refers to ''Not Mentioned'', meaning that the information did not appear in the available documentation.]
• Type of AI: 0 points if not mentioned in the documentation, and 1 point for each type of AI supported (ML or DL).
• Distribution channels: we set the minimum requirement at the ability to download a repository and install the tool from there. Therefore, we assigned zero points to all the tools meeting only this minimum requirement, 1 point to all the tools that had at least one additional way to access the software package (e.g., PyPI or Anaconda), and, finally, 1.5 points to all the tools that could be installed in two or more ways.
• Multi-node mode: zero points when only a simulated environment on a single computing machine was mentioned; 1 point to all the tools that allow implementing a real federation on multiple nodes, and 0.5 points if this capability has limitations or constraints.
• Open-source: all the tools presented in the table are open-source. This column was not included in the table.
• Containers/Virtualization: zero points were given where the documentation did not provide either of the two options; 1 point where either a containerized or a virtualized environment was supplied; 1.5 points when two or more options were listed; and 0.5 points when containers were available but also presented as the only way to access the tool.
• Modular architecture: based on the analysis of the tools' repositories, we trust that all of them respect this parameter. More specifically, they have all proven to have separate entities (such as client-server and orchestration processes) that can be launched independently.
• Horizontal or Vertical: a tool must implement at least one. Therefore, we rewarded 1 point only to the tools that allow execution in both settings.
• Synchronous or Asynchronous: asynchronous tools can simulate synchronous orchestration with less effort (i.e., active waiting at the aggregation point) than is required in the opposite scenario (i.e., orchestrating a different communication protocol). Therefore, we rewarded 1 point only to the tools that allow asynchronous execution. Where ''Not Mentioned'', given that a tool must implement at least one, we assumed the default to be synchronous and assigned 0 points.
• Privacy and Security independent module: zero points to the tools that focused on the ability to implement an FL pipeline but did not seem to mention or highlight the possibility of tweaking or injecting any privacy or security module (e.g., homomorphic encryption, secured communication protocols, blockchain); one point to all the tools that included at least one.
• Easy integration with other tools: the same process as applied for evaluating the containers and virtualization mechanism.
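As referenced above, the following minimal sketch shows how two of these rules could be encoded as scoring functions; the function names, the treatment of minimum-only documentation, and the example entry are our own assumptions, not the paper's actual implementation.

```python
def documentation_score(has_paper_or_repo, extra_sources):
    """Documentation rule: 0 when the minimum requirement (a paper
    and/or public repository) is missing; 1 point for one additional
    source; 1.5 for two or more. Tools meeting only the bare minimum
    are assumed to score 0 here."""
    if not has_paper_or_repo or extra_sources == 0:
        return 0.0
    return 1.0 if extra_sources == 1 else 1.5

def hv_score(horizontal, vertical):
    """Horizontal/Vertical rule: 1 point only when both settings
    are supported."""
    return 1.0 if (horizontal and vertical) else 0.0

# Hypothetical tool entry, for illustration only.
total = documentation_score(True, 2) + hv_score(True, False)  # -> 1.5
```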

VI. DISCUSSION
This paper proposes a survey of the public tools currently available for building FL pipelines. After retrieving the list of tools, we evaluated them using three different metrics: the tool's popularity (based on community adoption), growth rate, and maturity (based on our proposed review). These evaluations led to various rankings. An in-depth discussion of the results is provided below.
Based on the resulting ranking in Table 3, the most mature tools are Flower [49], OpenFL [50], and IBM-Federated [45]. Although the first two are fairly close to each other, the third trails at a considerable distance. PySyft [46], Nvidia-Flare [43], and FedML [48] follow at a similar level, and Fedn [60] comes 7th. This is just an initial observation, but things change when we bring the popularity results highlighted in Table 1 and the growth rates outlined in Figure 5 into the equation. In Table 1, PySyft [46] and FATE [47] are the two most popular tools according to the developer community, while Flower [49], OpenFL [50], and IBM-federated [45] cover the 5th, 6th, and 7th placements, respectively, at a considerable distance from the first two. An interesting aspect is the clear gap between what the community awarded as the most popular tools and what this work outlined as the most mature. Similarly, another essential element is the result highlighted by the growth rates reported in Figure 5. Leading that ranking is Flower [49], followed by FedML [48], FATE [47], and PySyft [46]. OpenFL [50] is in the 7th placement, with a growth rate of 0.23, approximately nine times smaller than the table leader, which has a value of 1.82. Looking more closely at the various growth rates, we can see that generally GR_23 > GR_12. This means that the tools grew more between H_2 and H_3 than between H_1 and H_2, despite each window spanning approximately six months. While this is true for almost all tools, the first four seem to have taken a much more significant leap. This could confirm the effectiveness of the work done by the development teams and community contributions. However, it is crucial to note that these leaps did not have a significant impact over an observation window of 378 days (roughly one year). With some exceptions, growth rate GR_13 seems to confirm the growth rates highlighted by GR_12. By comparing these two curves, we can see how Flower [49] had a stronger start in the first six months but then slowed over one year (i.e., GR_12 > GR_13) compared to FedML [48], which gained more points in GR_23 than it did in GR_12.
One interesting aspect is that, regardless of the scoring we decide to consider, the top five placements seem to be occupied by the same names, re-shuffled a bit. Of the 15 possible names, we can count only up to eight different tools. Of these eight, four are more dominant, as they appear in at least two rankings: Flower [49], FedML [48], FATE [47], and PySyft [46]. The remaining four are Tensorflow-Federated (tff) [44], OpenFL [50], Nvidia-Flare [43], and IBM-federated [45]. Among these eight, only OpenFL [50] supports Vertical FL.
Despite our efforts to adopt objective scoring when building Table 3, as described in Section V, we are aware that other valid scoring alternatives might exist. For example, a more in-depth analysis of all the functional features provided by each tool (such as communication protocols or the level of modularity in the architecture) and a more granular differentiation of the external tools that can be integrated into the FL pipeline could lead to different results. However, although we appreciate that such a finer approach might eventually change the distances between the elements in the lists, we would expect the main order to remain the same. This consideration arises when examining how the current ranking is defined. The success of the top two tools is mainly justified by the high scores obtained in the ''Usability'' and ''Portability'' factors outlined in Section V. This suggests that when tools have similar features with an equivalent level of maturity, the preference goes to the one with a lower entry barrier for users. Providing different documentation sources, tutorials, and access to multiple standard languages and tools may be critical for the community. As confirmed in the lower part of Table 3, the low scores of the worst-ranked tools might not necessarily be related to a lack of critical features, but rather to insufficient documentation that might have compromised their exploitation. However, we noticed a discrepancy in Table 1 that led us to the following question: why are tools with features comparable to the most popular ones, but with better documentation and more accessible entry points, not currently being considered at the same (or a higher) level by the community?
Among the possible causes, we identified three main factors: participation in more significant international projects involving multiple institutions, tool adoption in various application fields, and dissemination and marketing activities by the respective engineering teams.
Although summarizing the results of the three tables might be difficult, we can say that if someone does not know where to start with FL, tools such as Flower [49] or PySyft [46] represent a good compromise between maturity and popularity for horizontal pipelines (HFL), in either data centers or IoT devices. When Vertical FL (VFL) is required, OpenFL [50] can be the tool of choice. These recommendations are valid regardless of the application field, as all these tools can support different models and data types. At the same time, we recognize that, as a future direction, more in-depth benchmarking with dedicated tools such as [38], LEAF [52], or FL-bench [77] may be needed to further understand the peculiarities of each tool.
Another future goal is to revise the proposed criteria to account for these arguments and other factors, getting closer to a comprehensive measure that harmonizes the overall results.

VII. CONCLUSION
Several tools for implementing FL pipelines can accelerate research activities in this field. In this study, we provided a survey of all the open-source solutions and two rankings based on the tools' popularity and readiness, with the aim of guiding users (including non-experts) in adopting FL solutions, boosting their exploitation, and accelerating their research and development. One key finding of this study is that the tools primarily adopted by the community are not necessarily the most mature tools available. Thanks to the three harvests (searches) performed over roughly one year (378 days), we could track the growth rate of the majority of the tools. With all the data collected, we were able to provide clear recommendations to end users on which tool to choose when starting a new journey in FL research.

ILARIA BOSCOLO GALAZZO (Member, IEEE) received the degree (cum laude) in biomedical engineering from the University of Padova, in 2010, and the Ph.D. degree in neuroscience from the University of Verona, Italy, in May 2014.
She took a position as a Research Associate with the Institute of Nuclear Medicine, UCL, London, from 2014 to 2016, and then with the Department of Computer Science, University of Verona, from 2016 to July 2020. She is currently a temporary Assistant Professor of bioengineering with the Department of Engineering for Innovation Medicine, within the BraiNAVLab. Her research interests include imaging genetics, modeling of functional MRI data, brain connectivity, and multimodal neuroimaging data integration relying on classical and AI-based methods. She is a member of the IEEE Bioimaging and Signal Processing (BISP) Technical Committee. She is an Associate Editor of IEEE ACCESS.

Open Access funding provided by 'Università degli Studi di Verona' within the CRUI CARE Agreement.