Challenges, Applications and Design Aspects of Federated Learning: A Survey

Federated learning (FL) is a new technology that has been a hot research topic. It enables the training of an algorithm across multiple decentralized edge devices or servers holding local data samples without exchanging them. There are many application domains in which considerable properly labeled and complete data are not available in a centralized location (e.g., doctors’ diagnoses from medical image analysis). There are also growing concerns over data and user privacy, as artificial intelligence is becoming ubiquitous in new application domains. As such, much research has recently been conducted in several areas within the nascent field of FL. Various surveys on different subtopics exist in the current literature, focusing on specific challenges, design aspects, and application domains. In this paper, we review existing contemporary works in related areas to understand the challenges and topics emphasized by each type of FL survey. Furthermore, we categorize FL research in terms of challenges, design factors, and applications, conducting a holistic review of each and outlining promising research directions.


I. INTRODUCTION
Recently, machine learning (ML) and deep learning (DL)-based methods have seen tremendous growth, which is attributable to the availability of considerable data. However, not all application domains have considerable properly labeled and complete data available in a centralized location (e.g., doctors' diagnoses from medical image analysis). Curating such large, high-quality datasets can be time-consuming and tedious and often requires domain experts. Efforts from individual organizations result in data silos, with each containing high-quality but small datasets. In these application domains, very few organizations manage to gather high-quality, complete, fully labeled, and sufficiently large datasets, which are required for these DL applications to be effective. Traditionally, data were gathered in a centralized location to build ML models. However, due to concerns related to data ownership and confidentiality, user privacy, and new laws over data management and data usage, such as the General Data Protection Regulation, private, secure, efficient, and fair distributed model training is required.
The associate editor coordinating the review of this manuscript and approving it for publication was Hiu Yung Wong.
Thus, instead of training on centralized data, separate models can be trained locally where the data reside in a distributed manner. Then, the respective local model updates can be communicated to obtain a global model. This is the concept behind federated learning (FL), in which the communication process is carefully designed such that the data of an individual organization or device remain private. FL was first introduced by researchers at Google to update language models [1], [2] in Google's keyboard system for word autocompletion. FL builds a joint model using the data located at different sites, where each party contributes some data to train the model. Note that the data belonging to each party do not leave their premises. The model is then encrypted and shared among the participants so that no participant can reverse-engineer others' data. The performance of the resulting joint model approximates that of an ideal model trained on centralized data. In practice, this added security and privacy results in some accuracy loss, but the trade-off is often worthwhile in specific application domains. In addition to the privacy and security benefits, collaborative training in FL can yield better models than those trained by individual organizations or devices.
The FL architecture follows the client-server model (Fig. 1) or peer-to-peer model (Fig. 2) at the fundamental level. In the client-server model, a coordinator is responsible for centrally aggregating the model parameters using federated averaging (FedAvg).
First, the coordinator sends an initial model to each participating client. Each client then trains the model locally using its own dataset and sends the model update back to the coordinator for aggregation. After aggregation, the combined model is sent back to the participating clients. This process is repeated until the model converges or a preset number of iterations is reached. The client-server architecture incurs less communication overhead. The peer-to-peer architecture is even more secure, as the participating clients communicate directly without a third-party coordinator. The trade-off, however, is that the peer-to-peer architecture requires more computation for message encryption and decryption.
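The client-server loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [1]: the linear least-squares model, the helper names (`local_train`, `fedavg_round`, `run_federated_training`, `make_clients`), the learning rate, and the synthetic data are all hypothetical. Only the overall pattern (broadcast, local training, averaging weighted by local sample count) follows the text.

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=5):
    """Hypothetical client step: a few epochs of gradient descent on a
    linear least-squares model, starting from the broadcast weights."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(global_w, clients):
    """One round: broadcast the global model, train locally on each client,
    then average the local models weighted by local sample count."""
    local_models = [local_train(global_w, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(np.stack(local_models), axis=0, weights=sizes)

def run_federated_training(clients, dim, rounds=50):
    """Repeat rounds for a preset number of iterations."""
    w = np.zeros(dim)
    for _ in range(rounds):
        w = fedavg_round(w, clients)
    return w

def make_clients(true_w, sizes, seed=0):
    """Synthetic, noiseless client datasets (an assumption for illustration);
    the raw (X, y) pairs never leave the client in the loop above."""
    rng = np.random.default_rng(seed)
    clients = []
    for n in sizes:
        X = rng.normal(size=(n, len(true_w)))
        clients.append((X, X @ true_w))
    return clients
```

Only model parameters cross the client boundary in this sketch; encryption of the exchanged parameters, mentioned in the text, is omitted.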
Based on data partitioning among participants in feature and sample spaces, there are three fundamental categories of FL: horizontal FL (HFL) (Fig. 3), vertical FL (VFL) (Fig. 4), and federated transfer learning (FTL) (Fig. 5). For HFL, there is alignment in data features across participants, not in data samples. In contrast, for VFL, there is alignment in data samples, not in data features. Both HFL and VFL can be ineffective when the data are highly heterogeneous. In such cases, FTL is an effective approach that transfers the learned knowledge from the source domain to the target domain. FTL is inspired by transfer learning, where ML models that are trained on a dataset belonging to one domain are re-used and fine-tuned to solve a problem in a related domain.
The aforementioned architecture and FL categories only form the tip of the iceberg in the field of FL. There are numerous research thrusts, such as novel architectures, data partitioning schemes, and aggregation techniques. Moreover, the current research efforts aim to mitigate the core challenges in FL, such as privacy and security, communication costs, system and statistical heterogeneity, and personalization techniques. Depending on the application area in which the FL method is applied, unique application and domain-specific challenges and considerations arise.
VOLUME 9, 2021
Much research has already been conducted in the field of FL in recent years. Consequently, numerous survey papers have summarized different focus areas. In this study, we first reviewed existing surveys, which cover various domains and focus areas in FL research.
Several core challenges, such as privacy, security, communication cost, system and statistical heterogeneity, architecture, and aggregation algorithm designs, vary by domain and specific use cases. The motivation for this paper lies in reviewing the current body of literature and summarizing the state-of-the-art approaches that have recently been developed to deal with these challenges. In addition, we identify the gaps in the reviewed FL surveys and fill them by surveying the latest developments in all aforementioned FL areas of research. We conduct a holistic review of the challenges, applications, and design factors and outline promising future research directions.
We study papers in related areas and review, in depth, most of the contemporary survey papers in these areas. We classify the topics in the FL survey papers according to the following categories: communication cost, statistical heterogeneity, systems heterogeneity, and privacy/security as the core challenges; data partitioning, FL architectures, algorithms/aggregation techniques, and personalization techniques as the implementation details; and FL applications in different industries and domains. This paper makes the following contributions to the literature:
1) It thoroughly investigates and analyzes contemporary FL survey papers.
2) It classifies FL research into broad categories of design aspects, challenges, and application areas.
3) It conducts a holistic survey of the design aspects (data partitioning, FL architectures, aggregation techniques, and personalization techniques), the core challenges (communication cost, systems heterogeneity, statistical heterogeneity, and privacy/security), and different application areas.
4) It discusses open issues and challenges in FL research.
The remainder of this paper is organized as shown in Fig. 6. In Section II, we discuss the related studies. Section III illustrates the taxonomy of the survey papers and discusses them in detail. A discussion and analysis of all topics under each category are covered in Section IV. Section V discusses the open issues and challenges in FL. Section VI concludes the paper.

II. RELATED WORKS
In this section, we investigate and analyze the most contemporary survey papers. The reviewed papers, along with their summaries and focuses, are listed in Table 1.
Li, Sahu et al. [3] discussed how FL differs from standard distributed ML. In addition, they discussed FL's unique characteristics and challenges, along with its current methods and future scope. The paper did not focus on any specific domain; instead, it discussed approaches that deal with four core challenges: expensive communication, systems heterogeneity, statistical heterogeneity, and privacy/security. For expensive communication, local updating [1], [4] reduces the number of communication rounds, whereas compression schemes [5] reduce the message size in each round of communication. In addition, decentralized training [6], [7] decreases the communication burden on the central server. For systems heterogeneity, asynchronous communication [8]-[10] mitigates stragglers, active sampling selects or influences the participating devices based on system resources and the overheads incurred, and fault tolerance [11]-[16] either ignores failed devices or tolerates them through algorithmic redundancy. Statistical heterogeneity is dealt with by modeling heterogeneous data using methods such as meta-learning and multitask learning, adaptively selecting between global and device-specific models, and applying transfer learning for personalization. Some studies have also focused on convergence guarantees for non-independent and identically distributed (non-IID) data [4], [10], [17], [18]. Finally, this survey covers secure multiparty computation (SMC) [19], [20] and differential privacy (DP) [21]-[24] approaches.
The challenges and approaches discussed in [62] were centered around the healthcare domain. Consensus [37], [98] and pluralistic [99] solutions were mentioned to tackle statistical heterogeneity; client selection [33], compression schemes, update reduction, and peer-to-peer learning for expensive communication challenges; and SMC and DP for privacy/security challenges. The focus of [63] was on personalization techniques, which included adding user context [100], transfer learning [101], multitask learning [102], meta-learning [103], knowledge distillation, base + personalization layers, and combination of global and local models. Article [64] is also based on privacy/security challenges. In particular, it includes studies on threat models, various poisoning attacks, and inference attacks.
The reviewed survey papers did not cover all subtopics, as highlighted in Table 2. In particular, fewer than half of the surveys thoroughly reviewed FL architectures and personalization techniques. We classify the topics as design aspects, core challenges, and application areas, as shown in Fig. 7, and provide an in-depth discussion and analysis of all the subtopics.

III. TAXONOMY
The taxonomy of FL research, in terms of design aspects, core challenges, and application areas, is presented in Fig. 7. The design aspects include data partitioning, FL architectures, aggregation techniques, and personalization techniques. Communication cost, systems heterogeneity, statistical heterogeneity, and privacy/security are among the core challenges. In addition, the reviewed survey papers focused on the application areas of industrial engineering, mobile devices, healthcare, and IoT and edge devices. Table 2 compares the topics covered by the survey papers.
Data partitioning classifies FL as HFL, VFL, or FTL, as explained in the Introduction. Beyond these variants, several specialized FL architectures have been developed to improve features such as accuracy, training speed, efficiency, generalization, and applicability for different areas, such as IoT, healthcare, electronic health records (EHRs), and privacy/security. Depending on the FL architecture used, aggregation techniques/algorithms are employed to integrate the local model updates obtained from all participating clients during training to obtain the global model. Different aggregation techniques/algorithms have different priorities, such as increased privacy, optimal communication bandwidth, and support of asynchronous updates. Personalization is another design aspect that needs to be considered for certain scenarios, namely device heterogeneity (storage, computation, and communication), data heterogeneity (i.e., non-IID data), and model heterogeneity (customized models depending on the client's environment).
Expensive communication is a major challenge in FL systems. A federated network can comprise many devices, which means that network communication is much slower than local computation. Therefore, several studies have addressed communication efficiency. Moreover, there can be varying communication capabilities of devices in federated networks due to systems heterogeneity. The different devices may also exhibit varying computing and storage capacities. Due to system and network constraints under numerous settings, only a few selected devices can participate in a training iteration, and some devices may even drop out during an iteration due to connectivity or power issues. Thus, FL techniques need to overcome such systems heterogeneity challenges. In contrast, statistical heterogeneity issues arise due to the violation of IID assumptions in distributed optimization. The violation occurs because different devices across the network often comprise non-identically distributed data. The number of data points across the devices also varies. Therefore, FL approaches must handle the statistical heterogeneity of data. Finally, privacy/security issues are at the core of FL applications. Increased privacy/security achieved using novel methods often comes at the cost of decreased system efficiency or model performance.
All these trade-offs among the various application-specific challenges and design aspects need to be carefully considered and well-balanced to obtain effective privacy-preserving FL systems. These topics are detailed in the following section.

IV. DISCUSSION AND ANALYSIS
In this section, we review and discuss the design aspects, core challenges, and application areas to provide a comprehensive summary of the following subtopics: data partitioning, FL architectures, aggregation techniques, personalization techniques, communication cost, systems heterogeneity, statistical heterogeneity, privacy/security, and application areas.
The data used for training in FL are not identically distributed, as they reside on various devices. The sample space of a dataset comprises all dataset instances, whereas the feature space comprises the different dataset attributes. For instance, two hospitals may have records of different sets of patients (sample space) and different types of information stored about each patient in their EHR (feature space). Based on how the data are allocated over the sample and feature spaces across multiple participating devices in the FL process, FL systems (FLSs) can typically be categorized as HFL, VFL, and hybrid FL [61].
1) HFL is used in scenarios in which the datasets belonging to different organizations share the same feature space but differ in sample space. Such data partitioning is suitable for the cross-device mode, where individual users use FL to enhance their model's performance on a task. Horizontal partitioning is the more common setting in FL.
As the local datasets share the same feature space, each user can train a local model using the same model architecture. For example, two regional branches of an organization have different groups of users but the same feature space because they run the same business. At present, the focus of FL systems is on smart and IoT devices. The work by McMahan et al. [1] falls into the horizontal partitioning paradigm. In this framework, an individual user on the Android platform updates the model parameters locally and sends the updated parameters to the cloud server. This enables the centralized model to be trained jointly with other users. Furthermore, to address the issue of limited labeled entities, a hierarchical heterogeneous HFL framework was proposed in [115], which can address the shortage of labels by adapting each user multiple times as the target domain. The authors in [51] suggested a collaborative deep-learning framework in which each user trains independently and shares only a subset of parameters for updating.
2) In VFL, the datasets across institutions share the same or similar sample spaces, but their feature spaces do not have much in common. In this setting, the participants' data differ in feature space but match, at least partially, in sample space. For example, two organizations in a certain area want to train an ML model in collaboration. They serve the same clients, but the data held by each organization are of distinct types. Due to privacy and security concerns, they cannot exchange their data.
In such a scenario, VFL is suitable to train the model. VFL models aggregate these distinct features and calculate the model parameters in a privacy-preserving manner. Finally, they construct a model by combining data from both parties. An approach using linear regression was proposed by the authors in [116], [117] for vertically partitioned data. For such data, several secure models, including k-means [78], association rule mining [75], decision tree [77], and naive Bayesian classifier [76], were proposed by Vaidya et al. Usually, VFL systems perform entity alignment [118], [119] to combine common samples of different institutions. Then, employing encryption, the combined data are used for training the model. Cheng et al. [120] proposed a lossless VFL system to enable the joint training of gradient-boosting decision trees. To recognize common users between two distinct parties, they used privacy-preserving entity alignment. Finally, the selected samples were used to train the decision trees collaboratively.
3) FTL is used in situations where two datasets differ in terms of sample as well as feature space. FTL was first proposed in [79]. It extends existing FL systems to problems beyond the scope of existing FL algorithms. FTL has gained enormous attention in various industries, especially in the healthcare sector [121]. Using FTL, different types of information related to treatment and diagnosis can be shared between hospitals to diagnose different diseases. In general, transfer learning learns a common representation of the features of two different parties. Both parties still need to calculate the prediction results at the time of prediction. Hence, transfer learning [80] techniques can be adopted for the entire feature and sample space in a federated environment. To avoid the possibility of exposing the client data, FTL takes advantage of encryption and approximation to ensure that privacy is safeguarded.
Hence, both the actual sensitive data and the models are preserved locally [122]. Sharma et al. [123] improved FTL by integrating secret sharing technology. The authors in [124], [125] built a FedHealth model that collects data from different institutions through FL and provides customized services for healthcare by using transfer learning.
Each of the aforementioned data partitioning paradigms has its own advantages and disadvantages. For example, two different clinics or hospitals can benefit from securely sharing data with each other based on the number of instances or features they need. One clinic can own millions of patient records, but it might only have very specific information about these patients based on its specialty (e.g., oncology). In contrast, another clinic can be relatively new, possessing far fewer patient records. However, if this is a general clinic without a specialty, then it is likely to have different types of patient information. The first clinic can benefit from VFL, whereas the second one can benefit from HFL. Finally, through FTL, healthcare providers can provide more personalized care if they are given access to data from users' wearable devices for personal fitness.
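To make the sample-space and feature-space distinction concrete, the following sketch splits one toy table horizontally (by rows, as in HFL) and vertically (by columns, as in VFL). The array contents, the function names, and the plaintext ID intersection are assumptions for illustration only; in particular, `align_entities` merely stands in for the privacy-preserving entity alignment of [118], [119], which would not reveal IDs in the clear.

```python
import numpy as np

def horizontal_partition(data, row_splits):
    """HFL-style split: parties hold different samples (rows), same features."""
    return np.split(data, row_splits, axis=0)

def vertical_partition(data, col_splits):
    """VFL-style split: parties hold the same samples, different features (columns)."""
    return np.split(data, col_splits, axis=1)

def align_entities(ids_a, ids_b):
    """Plaintext stand-in for entity alignment: row indices of the
    sample IDs that both parties have in common."""
    _, idx_a, idx_b = np.intersect1d(ids_a, ids_b, return_indices=True)
    return idx_a, idx_b

# Toy table: 6 samples (rows) x 4 features (columns).
records = np.arange(24).reshape(6, 4)
hfl_a, hfl_b = horizontal_partition(records, [2])  # rows 0-1 vs rows 2-5
vfl_a, vfl_b = vertical_partition(records, [3])    # cols 0-2 vs col 3

# Two parties with partially overlapping sample IDs.
ids_a = np.array([101, 102, 103])
ids_b = np.array([103, 101, 200])
idx_a, idx_b = align_entities(ids_a, ids_b)
```

After alignment, `ids_a[idx_a]` and `ids_b[idx_b]` list the same samples in the same order, which is the precondition for joint training on vertically partitioned features.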
FL architectures represent how different components are integrated to form an FL environment. Two common architectures of FL are client-server and peer-to-peer architectures.
1) In client-server architecture, as illustrated previously in Fig. 1, a central server initializes a global model that it shares with clients to train on their local datasets. After local training, the trained models from the clients involved in the FL environment are collected by the server. The server then aggregates the models' parameters to build a global model and shares it with all clients. The client-server architecture is also known as a centralized architecture for FL. Here, the server coordinates the learning process, which is continuous. In the conventional client-server architecture, the server hosts a model and trains it on shared data. However, the server in the FL setting operates only on local models received from clients synchronously or asynchronously. The main advantage of this architecture is that it incurs less communication overhead. Google used this architecture to develop a virtual keyboard called Gboard for Android. Currently, almost all implementations of FL use client-server architecture.
2) As illustrated in Fig. 2, the peer-to-peer architecture, unlike the client-server architecture, has no central server for model aggregation. The role of the central server is replaced with algorithms to ensure security and reliability. Each participant in the FL environment has its own model. A participant improves its model by using information obtained from its neighbors [126]. In the adopted peer-to-peer topology, a protocol is established using a central authority.
During training rounds, the network follows this protocol. Such an architecture is more secure, as the participating clients communicate directly without a third-party coordinator [127]. However, it requires more computation for message encryption and decryption.
The aggregation algorithm describes how the global model is formed by combining local model updates from all clients who participate in the training round. It plays a significant role in HFL based on a centralized architecture. The most popular aggregation algorithms are compared in Table 3 and summarized below.
1) The FedAvg algorithm [1] proposed by Google is based on the stochastic gradient descent (SGD) optimization algorithm and is the best fit for HFL with a client-server architecture. In this algorithm, the server starts the training process by sharing the global model parameters with a group of clients selected randomly from a pool of clients. The clients then perform multiple epochs of SGD on their local dataset to train the global model and share the locally trained model with the server. Next, the server computes the weighted average of all local models to generate a new global model. This process is repeated for several rounds and is robust to unbalanced and non-IID data distributions. Although FedAvg has achieved great success, it exhibits convergence issues in some settings due to factors such as client drifting [67] and the lack of an adaptive learning rate [128].
2) Scaffold [67] solves the problem of client drifting by using a variance reduction technique in its local update. It estimates the update direction of the server model and that of each client. Using the difference, it measures client drifting, which is then used to correct the local update. This strategy helps overcome the problem of client heterogeneity and reduces the number of communication rounds needed for model convergence.

3) Adaptive federated optimization [128], proposed by Google's research team, introduces adaptability in server optimization. Server optimization is more informed, as the adaptive learning rates allow knowledge to be incorporated from previous iterations. In this optimization framework, a client optimizer minimizes loss using local data over multiple training epochs. Then, to update the global model, the server performs gradient-based optimization on the average of the model updates of clients. FedAvg is a special case in which SGD is used as both a client and a server optimizer with a server learning rate of 1. Although it incorporates adaptive learning rates in server optimization, it does not increase client storage or communication costs. Moreover, it is compatible with cross-device FL. It does not completely remove the effect of client heterogeneity; however, for moderate, naturally arising heterogeneity, the adaptive optimizer is quite effective, especially in cross-device settings.

4) FedBoost [129] is a communication-efficient algorithm for FL based on the ensemble learning technique. In this approach, an ensemble of pretrained base predictors is trained through FL. It reduces the cost of both server-to-client and client-to-server communication without relying on gradient compression or model compression. In addition to communication efficiency, other advantages of this method include computational speedups, convergence guarantees, privacy, and the optimality of the solution for density estimation, for which language modeling is a special case.
5) FedProx [4] addresses two inherent challenges of FL. The first is systems heterogeneity, which refers to the significantly varying characteristics of the systems or devices participating in FL. The second is statistical heterogeneity, which implies non-IID data across the network. FedProx is a reparametrized and generalized version of FedAvg, which it modifies in two ways. First, it enables partial work to be tolerated. Based on the availability of resources, a device can perform variable amounts of work locally; for example, each device can run a different number of local epochs. The partial solutions from resource-constrained devices are accepted for aggregation. Second, a proximal term is introduced in a device's local solver objective to control the impact of the variable amounts of local updates.
6) The FedMA [110] algorithm is proposed for introducing FL in modern network architectures for DL. Matching and averaging, based on similarity of features, is performed layer-wise across the channels of convolutional layers, hidden states of long short-term memory networks, and fully connected layer neurons to construct the shared global model at the server. FedMA can also handle client heterogeneity. Within a few rounds of training, it performs better than FedProx and FedAvg.
7) The secure aggregation [19] algorithm is developed based on the principle of SMC. It does not share the private information of the mutually distrustful parties, except for the learnable parameters derived from aggregation, and thus defends the privacy of each client model. It is fault-tolerant up to one-third of the users; that is, it works well even if one-third of the clients fail to engage in the aggregation.
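The idea behind secure aggregation in item 7 can be illustrated with pairwise additive masks that cancel in the sum. The sketch below is a simplified illustration under stated assumptions, not the protocol of [19]: the function names are hypothetical, the masks are drawn from a single seeded generator rather than agreed on by each client pair, and the key agreement, secret sharing for dropout recovery, and finite-field arithmetic of a real protocol are omitted.

```python
import numpy as np

def pairwise_masks(n_clients, dim, seed=0):
    """Build one mask per client such that all masks sum to zero.
    For each pair (i, j) with i < j, client i adds r_ij and client j subtracts it."""
    rng = np.random.default_rng(seed)
    masks = np.zeros((n_clients, dim))
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            r = rng.normal(size=dim)  # stands in for a secret shared by the pair (i, j)
            masks[i] += r
            masks[j] -= r
    return masks

def masked_updates(updates, masks):
    """What each client would send: its update hidden behind its mask."""
    return updates + masks

def aggregate(masked):
    """Server-side sum; the pairwise masks cancel, leaving the sum of true updates."""
    return masked.sum(axis=0)
```

In this toy setting the server recovers only the sum of the updates, while each individual masked vector differs from the client's true update. If a client dropped out, its masks would no longer cancel, which is why a real protocol needs the recovery machinery left out here.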

2) PERSONALIZATION TECHNIQUES
In FL, the goal is to train models with a central repository without exchanging the clients' data samples. Personalization adapts the global model to individual clients while still letting users benefit from a richer model trained over a larger set of data samples. Wu et al. [106] mentioned three major challenges handled by the FL process during personalization: 1) device heterogeneity in terms of communication capabilities, storage, and computation; 2) data heterogeneity because of non-IID data; and 3) model heterogeneity because of the different models required in personalized situations.
Adding contextual features to datasets in a privacy-preserving manner can lead to more personalized predictions. Moreover, based on the similarity of client data, different groups can be formed, and a different model can be trained for each similar cluster [100]. Transfer learning can also be used in a federated setting for model personalization [130]. In transfer learning, knowledge from a global model is transferred to local models, and then the local model parameters are fine-tuned using local data. Other approaches such as multitask learning and meta-learning are used to solve multiple tasks simultaneously. The joint learning in multitask learning enables the model to use the differences and similarities across the tasks. Meta-learning produces models that are quite adaptive and can solve new tasks with much less training data. Both meta-learning [102] and multitask learning [103], [131], [132] algorithms have been proposed in a federated setting to achieve greater personalization. Knowledge distillation is another method in which a student network mimics a larger teacher network. Using transfer learning and knowledge distillation, Li et al. [133] proposed an FL framework that allows clients to design their own networks independently. Arivazhagan et al. [134] proposed a neural network architecture in which global data are used to train only the base layers, whereas the personalization layers are trained on local data. A new gradient descent variant, developed by Hanzely et al. [135], called loopless gradient descent, allows each device to learn a mixture of its own local model and the global model. The different personalization techniques are summarized in Table 4.
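The base-plus-personalization-layer idea can be sketched as follows. This is a hedged illustration rather than the method of [134]: models are represented as plain dictionaries of NumPy arrays, the layer names (`"base"`, `"head"`) and function names are hypothetical, and an unweighted mean is assumed for aggregation. The point is only that shared layers are averaged while personal layers never leave the client.

```python
import numpy as np

def aggregate_base_layers(client_models, base_keys):
    """Average only the shared base layers across clients
    (unweighted mean, an assumption for this sketch)."""
    return {k: np.mean([m[k] for m in client_models], axis=0) for k in base_keys}

def apply_global_base(client_model, global_base):
    """Overwrite a client's base layers with the aggregated ones and
    keep its personalization layers untouched."""
    updated = dict(client_model)
    updated.update(global_base)
    return updated

# Two hypothetical clients, each with one shared layer and one personal layer.
client_models = [
    {"base": np.array([1.0, 3.0]), "head": np.array([10.0])},
    {"base": np.array([3.0, 5.0]), "head": np.array([20.0])},
]
global_base = aggregate_base_layers(client_models, ["base"])
personalized = [apply_global_base(m, global_base) for m in client_models]
```

After one aggregation step, both clients share the same base parameters but retain their own `"head"` values, which would continue to be trained on local data only.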

B. CORE CHALLENGES
Communication is a critical bottleneck in federated networks, which, coupled with security concerns over sending raw data, requires that the data generated on each device remain local. To overcome this issue, researchers have proposed several strategies, some of which involve local updating, compression schemes, decentralized training, and importance-based updating.
Local updating schemes address communication costs by performing additional work on the client that generates and consumes the ML model. As an extension of classical stochastic methods, mini-batch optimization methods have proven to be successful in many cases [142]. For both convex and non-convex objectives, distributed local-updating primal methods have also been successfully applied [143]. Since the pivotal FedAvg algorithm was proposed in [1], many directions have been explored, including quantizing uploads from edge devices [140].
Sketched and structured updates are among the compression schemes that enable the reduction of the model update size communicated to the FL server from the participating clients during each round [26], [141]. In addition, subsampling, probabilistic quantization, and sparsification were considered in [144]. The authors in [27] further extended the work of [26] to reduce the communication cost from the server to participant, employing approaches such as federated dropout and lossy compression. The accumulation of error and momentum is handled by the central aggregator instead of the clients [139].
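Two of the compression ideas mentioned above, sparsification and probabilistic quantization, can be sketched in NumPy. These are generic textbook variants written for illustration, not the specific schemes of [26], [141], or [144]; the function names and the choice of top-k selection and 1-bit min/max quantization are assumptions.

```python
import numpy as np

def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries; only (indices, values) are transmitted."""
    idx = np.argpartition(np.abs(update), -k)[-k:]
    return idx, update[idx]

def densify(idx, values, dim):
    """Server side: rebuild a dense vector, treating dropped entries as zero."""
    out = np.zeros(dim)
    out[idx] = values
    return out

def stochastic_binarize(update, rng):
    """1-bit probabilistic quantization: each entry is rounded to the vector's
    min or max, with probabilities chosen so the result is unbiased in expectation."""
    lo, hi = update.min(), update.max()
    if hi == lo:
        return update.copy()
    p_hi = (update - lo) / (hi - lo)
    return np.where(rng.random(update.shape) < p_hi, hi, lo)
```

With top-k sparsification, a client sends k index-value pairs instead of the full vector; with binarization, it sends one bit per entry plus the two scalars `lo` and `hi`. Both are lossy, so the choice of k or bit width trades accuracy against message size.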
Recent studies, such as [6], have carried out decentralized training over heterogeneous data. Hierarchical communication [145] is another approach that reduces dependency on the central server. First, updates from edge devices are aggregated on the edge servers. Then, the updates from the edge servers are aggregated on the cloud servers.
Importance-based updating relies on the observation that most parameter values of a deep neural network model are sparsely distributed. The edge stochastic gradient descent algorithm was proposed in [28], in which only selected important gradients are sent to the server for updating parameters in each round of communication. The authors in [29] proposed a communication-mitigated federated learning algorithm, which reduces the communication cost by uploading only the relevant updates of the local model while still guaranteeing global convergence. During each iteration, the local update of a participant is compared with the global update to assign a relevance score to the update. Strategies and approaches for reducing communication costs are summarized in Table 5.
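One plausible way to score the relevance of a local update against the global update is the fraction of parameters whose signs agree. The sketch below uses that measure; the exact scoring rule, the threshold value, and the function names are assumptions made for illustration and should not be read as the definition used in [29].

```python
import numpy as np

def relevance(local_update, global_update):
    """Fraction of parameters whose local update points in the same
    direction (same sign) as the global update."""
    return float(np.mean(np.sign(local_update) == np.sign(global_update)))

def should_upload(local_update, global_update, threshold=0.8):
    """Upload only if the local update is sufficiently aligned with the global one.
    The threshold is a hypothetical tuning parameter."""
    return relevance(local_update, global_update) >= threshold

local_update = np.array([0.5, -0.2, 0.1, -0.4])
global_update = np.array([0.3, -0.1, -0.2, -0.9])
score = relevance(local_update, global_update)  # three of four signs agree
```

A client whose score falls below the threshold would skip the upload for that round, saving one message at the cost of discarding its contribution.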

1) SYSTEMS HETEROGENEITY
Due to differences in factors such as network connectivity, memory, CPU, and battery power level, the participants in a federated network often exhibit varying capacities in terms of communication, computation, and storage. Straggler mitigation and other challenges are further compounded due to these system-level characteristics. Popular approaches include asynchronous communication, client participation, and fault tolerance.
Straggler mitigation in heterogeneous environments using asynchronous communication schemes [10] is a promising approach. When there is device variability, synchronous approaches are more susceptible to stragglers. However, asynchronous communication has its own drawback: it relies on bounded-delay assumptions to control the degree of staleness.
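The role of staleness in asynchronous aggregation can be sketched as follows. The mixing rule `alpha / (1 + staleness)` and the `max_staleness` cutoff are hypothetical choices for illustration, not the scheme of [10]; the cutoff plays the part of a bounded-delay assumption.

```python
from typing import List


def async_merge(global_w: List[float], client_w: List[float], staleness: int,
                alpha: float = 0.5, max_staleness: int = 10) -> List[float]:
    """Mix a client model into the global model as soon as it arrives.

    staleness counts how many global versions were published since the client
    downloaded its copy; older contributions receive a smaller mixing weight,
    and contributions beyond max_staleness are discarded.
    """
    if staleness > max_staleness:
        return list(global_w)
    mix = alpha / (1 + staleness)
    return [(1 - mix) * g + mix * c for g, c in zip(global_w, client_w)]


fresh = async_merge([0.0], [1.0], staleness=0)
stale = async_merge([0.0], [1.0], staleness=1)
dropped = async_merge([0.0], [1.0], staleness=11)
```

The server never waits for slow devices, but the choice of `max_staleness` and the decay rule determine how much outdated information enters the global model.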
Client participation schemes involve actively selecting participating devices in each round based on system resources, as in FedCS [33], or on data quality [47]. The FedCS protocol was extended by the authors in [34]. Their hybrid-FL protocol addresses the differences that exist in the data distributions of participating clients. Deep Q-learning [35] is also used to optimize the allocation of resources required for training models. Client participation is controlled by the number of clients in [136] and the amount of data contributed or consumed by clients in [27], [137], [138].
Fault tolerance [102] becomes more critical when learning over remote devices, as some devices in the network often drop out, even before an iteration is completed. Another option, known as coded computation, is to introduce algorithmic redundancy to tolerate device failures. The authors in [15] explored the use of codes to increase the speed of distributed training. The strategies and approaches for managing system heterogeneity are summarized in Table 6.
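Resource-aware client selection can be sketched as follows. This is a simplified illustration, not the FedCS protocol of [33]: it assumes the server already has per-client time estimates, and the `ClientInfo` fields, `deadline`, and `max_clients` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClientInfo:
    client_id: str
    compute_seconds: float  # estimated local training time (assumed to be reported)
    upload_seconds: float   # estimated time to upload the model update


def select_clients(clients: List[ClientInfo], deadline: float, max_clients: int) -> List[str]:
    """Pick the fastest clients whose estimated round time fits within the deadline."""
    eligible = [c for c in clients if c.compute_seconds + c.upload_seconds <= deadline]
    eligible.sort(key=lambda c: c.compute_seconds + c.upload_seconds)
    return [c.client_id for c in eligible[:max_clients]]


pool = [
    ClientInfo("a", 5.0, 2.0),
    ClientInfo("b", 1.0, 1.0),
    ClientInfo("c", 20.0, 5.0),
    ClientInfo("d", 3.0, 1.0),
]
chosen = select_clients(pool, deadline=10.0, max_clients=2)
```

Selecting only fast clients reduces straggler delays but can bias the model toward the data held by well-resourced devices, which is one motivation for the extensions discussed above.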
Statistical heterogeneity refers to the existence of non-IID data across the network. The data generated and collected by network devices are usually non-identically distributed. This generates complexity in terms of analysis, modeling, and evaluation. The usage patterns of different users are distinct. For some clients, the globally shared model does not perform as well as models that are trained locally. Thus, they are disincentivized to participate in the federated network. Moreover, there can be significant variance in terms of the amount of data per device. Also, there may be an underlying structure that captures the relationship between the devices and their distributions.
In general, an FL system focuses on learning a single global model. There also exist other approaches, such as learning distinct local parameters simultaneously through multitask learning frameworks [102]. The authors of [155] developed tools to measure statistical heterogeneity using metrics such as local dissimilarity. However, calculating these metrics is quite difficult for a federated network before the training begins. These metrics point to a future direction: developing efficient algorithms to quickly quantify the heterogeneity in an FL system.
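One way to make a dissimilarity metric concrete is sketched below. It assumes a formulation in which the mean squared norm of the client gradients is compared with the squared norm of the averaged gradient, with uniform client weights; whether this matches the exact definition in [155] is an assumption, and the function name is hypothetical.

```python
import math
from typing import List


def local_dissimilarity(client_grads: List[List[float]]) -> float:
    """Return sqrt(mean_k ||g_k||^2 / ||mean_k g_k||^2) for client gradients g_k.

    The value is 1.0 when all clients report the same gradient and grows as the
    client gradients diverge from one another.
    """
    n = len(client_grads)
    dim = len(client_grads[0])
    mean_grad = [sum(g[i] for g in client_grads) / n for i in range(dim)]
    mean_sq_norm = sum(sum(x * x for x in g) for g in client_grads) / n
    global_sq_norm = sum(x * x for x in mean_grad)
    if global_sq_norm == 0.0:
        # Gradients cancel out (or are all zero): report 1.0 if all are zero, else infinity.
        return 1.0 if mean_sq_norm == 0.0 else math.inf
    return math.sqrt(mean_sq_norm / global_sq_norm)


iid_like = local_dissimilarity([[1.0, 2.0], [1.0, 2.0]])
non_iid = local_dissimilarity([[1.0, 0.0], [0.0, 1.0]])
```

Because the metric depends on gradients at the current model, it can only be evaluated once training is under way, which illustrates why measuring heterogeneity before training is difficult.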
To tackle statistical heterogeneity, the authors in [134] utilized the concept of multitask learning. In the FEDPER approach, the participants use a set of base layers pretrained with the FedAvg [1] algorithm. Then, each participant individually trains another set of layers using their local data. The authors empirically showed that the FEDPER approach outperforms a pure FedAvg approach using the Flickr-AES dataset [134], as the personalization layers can represent the personal preferences of an FL user.
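The split between shared base layers and per-participant personalization layers can be sketched as follows. This is a minimal illustration under assumed data structures (flat parameter lists in a dictionary with hypothetical keys `base`, `personal`, and `n`), not the implementation from [134].

```python
from typing import Dict, List

ClientState = Dict[str, object]  # keys: "base" (list), "personal" (list), "n" (int)


def fedper_aggregate(clients: List[ClientState]) -> List[ClientState]:
    """Average only the base-layer parameters across clients, weighted by sample count.

    Personalization-layer parameters never leave the client and are returned unchanged.
    """
    total = sum(c["n"] for c in clients)
    dim = len(clients[0]["base"])
    new_base = [sum(c["base"][i] * c["n"] for c in clients) / total for i in range(dim)]
    return [
        {"base": list(new_base), "personal": list(c["personal"]), "n": c["n"]}
        for c in clients
    ]


states = [
    {"base": [1.0, 1.0], "personal": [10.0], "n": 1},
    {"base": [3.0, 5.0], "personal": [-4.0], "n": 3},
]
updated = fedper_aggregate(states)
```

After aggregation, every client holds the same base parameters while keeping its own personalization parameters, so only the base layers contribute to communication cost.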

2) FL THREAT MODELS
FL offers an emerging paradigm for facilitating multiple organization data collaborations without revealing their private data to each other. However, recent research has demonstrated that FL may not always provide sufficient privacy guarantees during model update; it may face several vulnerabilities from both the server and participants. As summarized in Table 7, according to the threat models, the following are two prominent forms of attacks that occur:
1) Poisoning Attacks, which can be executed either in the training phase of the model or on the data. Two types of poisoning occur:
a) Data poisoning occurs during local data collection. Data poisoning attacks can occur in two forms: clean-label attacks (adversaries can poison the correct class of data samples) and dirty-label attacks (adversaries try to misclassify the target label of the FL training dataset) [146], [147].
b) Model poisoning occurs during model training. According to Bhagoji et al. [55], model poisoning is accomplished by an adversary controlling a few malicious representatives with the aim of misclassifying specific inputs with high confidence. Bagdasaryan et al. [149] introduced a new scope of FL vulnerability by inserting the backdoor into the joint model. FL models are more vulnerable to model poisoning attacks than data poisoning attacks. This form of attack can be used to create misclassification in image and next-word prediction problems.
2) Inference Attacks: Serious privacy leakage may occur in FL during updates of the model. When exchanging gradients, the private information of participants can be exposed to the adversary [70], [150], [152], [156].
Pyrgelis et al. [151] conducted membership inference attacks to identify vulnerability at the aggregate location. According to the threat model surveyed by Lyu et al. [64], inference attacks fall into two categories: white-box attacks and black-box attacks. Deep leakage from gradients (DLG) [152] obtains private training data in the inference phase. Another algorithm, iDLG, also exposes the labels of training inputs [153]. Hitaj et al. [154] applied a GAN attack, which allows the adversarial party in the training process to fabricate an inferring class representative.
Privacy is one of the most critical parts of FL. This section briefly reviews various privacy and security techniques for FL:
1) Secure Multiparty Computation (SMC) is a privacy mechanism used in FL. An SMC model comprises multiple parties and provides proper security. This model ensures that each party knows only its inputs and outputs and nothing about the other parties. Bonawitz et al. designed a communication-efficient SMC protocol for high-dimensional data to protect the privacy of users' model gradients [19].
2) DP is a privacy-preserving mechanism that protects individual privacy by adding noise to the data. There are various types of DP:
a) Local DP: Each data point is distorted with noise.
b) Global DP: To protect individuals' privacy, the output of the dataset query is distorted with noise.
c) Hybrid DP: Multiple trust models are combined by partitioning users according to their trust model preferences.
Geyer et al. [23] developed a method for obtaining DP at the client level for FL. Wei et al. [157] proposed an aggregation algorithm called NbAFL, in which noise was added to client-side parameters before aggregation. The authors in [158] used both SMC and DP mechanisms to avoid differential attacks.
3) HE is another security mechanism in FL that protects user data by exchanging parameters in encrypted form. HE is a cryptographic technique that allows mathematical operations to be performed on encrypted data as if they were unencrypted. Many researchers have worked with homomorphic encryption to preserve privacy [159], [160]. To guarantee the privacy of users' local gradients during FL, Xu et al. [114] proposed a double-masking protocol.
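Two of the mechanisms reviewed above can be illustrated with a short sketch. The first function clips a client update and adds Gaussian noise, a common pattern for client-side DP; the second generates pairwise additive masks that cancel in the sum, which conveys the idea behind secure aggregation. Both are simplified assumptions for illustration (floating-point masks, no key agreement, no privacy accounting) and are not the protocols of [19], [23], [114], or [157]. All function and parameter names are hypothetical.

```python
import math
import random
from typing import List


def clip_and_noise(update: List[float], clip_norm: float, sigma: float,
                   rng: random.Random) -> List[float]:
    """Scale the update to L2 norm at most clip_norm, then add Gaussian noise
    with standard deviation sigma * clip_norm to every coordinate."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale + rng.gauss(0.0, sigma * clip_norm) for v in update]


def pairwise_masks(num_clients: int, dim: int, rng: random.Random) -> List[List[float]]:
    """For every pair (i, j), client i adds a shared random vector and client j
    subtracts it, so all masks sum to zero across clients."""
    masks = [[0.0] * dim for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            shared = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for d in range(dim):
                masks[i][d] += shared[d]
                masks[j][d] -= shared[d]
    return masks


rng = random.Random(42)
updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
masks = pairwise_masks(len(updates), 2, rng)
masked = [[u + m for u, m in zip(upd, msk)] for upd, msk in zip(updates, masks)]
# The server sees only the masked vectors, yet their sum equals the true sum.
aggregate = [sum(col) for col in zip(*masked)]
clipped = clip_and_noise([3.0, 4.0], clip_norm=1.0, sigma=0.0, rng=rng)
```

Practical secure aggregation protocols operate over finite fields and handle client dropouts, and practical DP deployments track a privacy budget; neither is modeled here.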

3) APPLICATIONS
Although FL faces some limitations and severe challenges, it has been successfully implemented in several real-life applications:
1) Applications in NLP: FL has become a hot research topic since the concept was first introduced by Google to predict the next word in a virtual keyboard for smartphones [161]. Further improvements in predicting the next word using pretrained word embeddings were achieved by other researchers [87]. Wake word detection was another contribution made by Leroy et al. [74]. Emoji prediction from text typed on a mobile keyboard was introduced by Ramaswamy et al. [88]. In addition, some researchers have worked on learning out-of-vocabulary words on virtual keyboards for smartphones [86], and some have tried to improve the virtual keyboard's search suggestion quality [162].
2) Applications in healthcare: Huang et al. [17] predicted the mortality rate of patients suffering from heart disease by using electronic medical records from multiple hospitals. Brisimi et al. [82] used electronic health records (EHRs) to determine whether a heart disease patient needs to be hospitalized. Li et al. [163] also studied mortality and hospital stay time. Using health records, Lee et al. [164] proposed a method, based on a federated patient hashing framework, to determine similar patients from different hospitals while preserving the patients' privacy.

3) Applications in computer vision: Another important application area of FL is computer vision. Shao et al. [165] proposed a federated face presentation attack detection method. Liu et al. [166] worked on smart city safety monitoring solutions based on computer vision.
4) Applications in transportation: The development of intelligent transportation systems using FL was explored by Elbir et al. [167]. Lim et al. [168] proposed an FL-based approach in UAV-enabled Internet of Vehicles for developing applications such as the management of car parking occupancy and traffic prediction.

V. OPEN ISSUES AND CHALLENGES
There are several open issues and challenges in FL [169]. Trade-offs among accuracy, privacy, communication cost, and personalization level must be carefully considered when designing an FL system. Such considerations often depend on the specific use case or application area. In this section, we discuss some open issues related to design aspects, core challenges, and application areas.

A. DESIGN ASPECTS

1) DATA PARTITIONING AND FL ARCHITECTURES
In addition to the primary forms of data partitioning schemes and FL architectures discussed in this study, other variations in FL architectures have recently been developed. For instance, PerFit [106] is cloud-based and enables personalized FL approaches to be selected flexibly, thus making it suitable for IoT applications. Another architecture is FedHealth [107], which uses the FTL framework for wearable healthcare to build personalized models, thus enabling personalized healthcare services. Future studies can focus on developing FL architecture schemes that meet the specific requirements of different industries and application areas.

2) AGGREGATION TECHNIQUES
Developers who wish to implement FL solutions can benefit from toolkits that offer standardized and preconfigured aggregation algorithms that are suitable for their specific application areas and use cases. Similar to AutoML solutions, such a toolkit for FL can lower the barrier of entry for nonspecialist developers.

3) PERSONALIZATION TECHNIQUES
Adding suitable user and context features to the shared global model is a possible alternative to having device-specific personalization. For example, the filter order in applications such as Snapchat can be arranged according to certain user features, such as browsing history, age, sex, likes and dislikes, and usage patterns. Thus, developing architectures that can accommodate such user and context features effectively for different tasks is another open problem. Moreover, as observed in [170], a gap exists between the accuracy of personalized and global models, making personalization techniques an important research area in FL. Nevertheless, no clear metrics have yet been formulated to evaluate the performance of personalization techniques. Wang et al. [130] evaluated the conditions under which personalization yields desirable models. Further research is required to develop comprehensive metrics to assess the effectiveness of personalized approaches.

B. CORE CHALLENGES

1) COMMUNICATION
There is a trade-off between communication costs and accuracy in FL. The benchmarks in ML do not usually set any restriction criterion. It is worth considering setting the communication budget as a restriction criterion in communication-focused FL benchmarks. For example, the authors in [171], [172] explored one-shot or few-shot communication schemes in FL, and those in [172] attempted to maximize the performance for fixed rounds of communication (i.e., single or few rounds). Additionally, these methods need to be thoroughly evaluated and analyzed in the FL setting, where the networks can be highly heterogeneous.
In cross-device FL, only a few devices are often active during an iteration. There is scope for an in-depth analysis of the consequences of this asynchronous communication scheme where the devices become active based on certain events.

2) SYSTEMS HETEROGENEITY
Various algorithms [33], [35] have been proposed to address systems heterogeneity. However, wireless connectivity might not be available consistently. Many participating devices may drop from the FL system during training. Future studies can design new FL algorithms that are more robust, even when a larger number of devices drop out of the network due to connectivity issues.
Li et al. [4] recently introduced a proximal term in the optimization objective to allow partial solutions obtained from stragglers to be carefully incorporated and aggregated instead of totally dropping them. The authors in [173] took a different approach and implemented an FL system that addressed device heterogeneity by selecting different levels of quantized models following a device-specific analysis conducted by the FL server.
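The effect of a proximal term on a local update can be sketched as follows. The sketch assumes the common quadratic form (mu / 2) * ||w - w_global||^2 added to the local objective; the function name, `mu`, and `lr` are hypothetical, and the exact formulation in [4] should be taken from that work.

```python
from typing import List


def proximal_sgd_step(w: List[float], grad: List[float], w_global: List[float],
                      mu: float, lr: float) -> List[float]:
    """One gradient step on local_loss(w) + (mu / 2) * ||w - w_global||^2.

    grad is the gradient of the local loss at w; the proximal term contributes
    mu * (w - w_global), which pulls the local model back toward the global one.
    """
    return [wi - lr * (gi + mu * (wi - gwi)) for wi, gi, gwi in zip(w, grad, w_global)]


# With a zero local gradient, the step moves purely toward the global model.
pulled = proximal_sgd_step([1.0], [0.0], [0.0], mu=1.0, lr=0.5)
# With mu = 0, the step reduces to plain SGD.
plain = proximal_sgd_step([1.0], [2.0], [0.0], mu=0.0, lr=0.5)
```

Because the proximal term limits how far a local model can drift, a straggler that completes only a few local steps still returns a partial solution that is reasonable to aggregate.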

3) STATISTICAL HETEROGENEITY
Eichner et al. [99] developed a pluralistic solution to alleviate a form of data heterogeneity in which devices exhibited different characteristics during the day versus those at night. Further research can be conducted to explore similar methods to address diurnal variations at more granular times of day (instead of only day versus night) or at different times of the week. For example, let us consider a federated network over a commercial neighborhood. The data characteristics obtained from devices available over the weekdays would likely be very different from those available over the weekends. The effectiveness of a pluralistic solution in such a scenario can be investigated.
The authors in [99] worked only with convex objectives and sequential SGD. As they noted, further analysis can be conducted to explore block-cyclic data in a nonconvex setting and to employ methods such as parallel SGD.

4) PRIVACY/SECURITY
While device-specific local or global level privacy has been well-studied and understood, finer privacy requirements at the sample level form a promising, ongoing research topic. The sample-specific privacy guarantee technique developed by Li et al. [174] trades off privacy for higher accuracy.
Hybrid methods deal with both sample- and device-level privacy requirements. One approach can be to use sample-specific privacy for a subset of data based on specific levels of a category or date range while using device-specific privacy for the remaining data.

5) ABLATION ANALYSIS
The evaluation of an FL system is often more complex than that of traditional ML and DL systems. While different research efforts deal with specific focus areas, a holistic industrial system would need to consider several aspects while building FL solutions, such as privacy, accuracy/loss, communication rounds, and heterogeneity. A standard platform needs to be developed to facilitate a holistic ablation analysis of the different parts of an FL system.

C. APPLICATION AREAS
FL has mainly been applied to supervised learning problems. Future research can attempt to tackle the challenges that may arise when using FL in applications that call for data exploration or for unsupervised, semi-supervised, and reinforcement learning.
The challenges faced in implementing FL solutions for different application areas have not yet been thoroughly studied, with the current studies primarily focusing on training FL models. In addition to the core challenges discussed in this paper, issues that are specific to the industry domain or application area also need to be considered. For instance, in application areas such as mobile edge networks, energy-efficient communication must be strongly emphasized.

VI. CONCLUSION
FL allows participating organizations to collaboratively train prediction models without having to share their data. Recently, there has been growing interest in FL research in both industry and academia. FL enables certain industries, such as healthcare, to overcome challenges related to data collection and privacy.
This growing interest in FL has motivated us to review most of the contemporary survey papers on FL and to classify FL into several topics under the design aspects, core challenges, and application domains. In this study, we thoroughly investigated and analyzed the FL survey papers and conducted a holistic review of each FL topic. Finally, we outline promising future research directions. This study is expected to help future researchers in FL and related areas to scope their work.

University. His research interests include medical image processing, blockchain intelligence, healthcare systems, natural language processing, and data mining. His undergraduate thesis was in wireless sensor networks.
RUHUL AMIN received the B.Sc. degree in computer science and engineering from United International University, Dhaka, Bangladesh, in 2020, where he is currently pursuing the M.Sc. degree with the Computer Science and Engineering Department. He has publications in several international conferences and journals. His research articles are published in Bioinformatics and Scientific Reports journals. His research interests include deep learning, bioinformatics, natural language processing, data mining, and digital forensics.
KAZI EHSAN AZIZ received the B.Sc. degree in computer science and engineering from Shahjalal University of Science and Technology (SUST), Bangladesh, in 2016. He is currently pursuing the M.Sc. degree in computer science and engineering with United International University (UIU), Bangladesh. He is also working as a Software Engineer in the industry. His research interests include intelligent transportation systems, computer vision, and audio signal processing.