Differential Privacy for Deep and Federated Learning: A Survey

Users’ privacy is vulnerable at all stages of the deep learning process. Sensitive information of users may be disclosed during data collection, during training, or even after releasing the trained learning model. Differential privacy (DP) is one of the main approaches proven to ensure strong privacy protection in data analysis. DP protects the users’ privacy by adding noise to the original dataset or the learning parameters. Thus, an attacker could not retrieve the sensitive information of an individual involved in the training dataset. In this survey paper, we analyze and present the main ideas based on DP to guarantee users’ privacy in deep and federated learning. In addition, we illustrate all types of probability distributions that satisfy the DP mechanism, with their properties and use cases. Furthermore, we bridge the gap in the literature by providing a comprehensive overview of the different variants of DP, highlighting their advantages and limitations. Our study reveals the gap between theory and application, accuracy, and robustness of DP. Finally, we provide several open problems and future research directions.


I. INTRODUCTION
In recent years, deep learning (DL) has demonstrated great success in many fields such as healthcare, marketing, and transportation. For example, DL is used for early disease detection [1]-[3], predicting the future and adapting to market needs [4], [5], helping people with disabilities [6], and facilitating our daily activities [7]. To produce models with high accuracy, DL requires large training datasets. However, datasets may contain sensitive information [8] that should not be disclosed to any third party, which raises concerns about privacy protection in DL. In fact, users' privacy is threatened even when attackers do not have direct access to the dataset. Attackers may query the trained learning model to recover the original training dataset [9]; this type of attack is called a model inversion attack. In another type of attack, called a membership inference attack [10], the attacker aims to determine whether an individual was part of the training dataset. We refer the reader to [11] for a well-presented review of research on the different types of privacy attacks facing DL. Therefore, ensuring users' privacy in DL is of great importance.
The associate editor coordinating the review of this manuscript and approving it for publication was Yan Huo.
In this paper, we present the different techniques proposed to tackle the privacy issues in deep and federated learning (FL). In particular, we focus on differential privacy (DP), which has become a de facto standard for protecting users' privacy in statistical computations. The techniques can be divided into three categories:
• Techniques protecting users' privacy before publishing a dataset, such as k-anonymity, l-diversity, and t-closeness. These techniques produce a new dataset, called a privacy-preserving (PP) dataset, that protects users' sensitive information. Attackers cannot learn any critical information even if they have full access to the dataset.
• Techniques protecting users' privacy during the training. These techniques allow collaboratively training a model between many clients (i.e., parties) while keeping the dataset of each client private.
• DP-based techniques. DP may protect users' privacy in the three stages of training a DL model, namely: 1) before the training, by producing PP datasets; 2) during the training, by protecting the gradients sent from clients to the server in the case of collaborative training; 3) after the training, by producing DL models resistant to model inference and model inversion attacks.
DP is also used for protecting users' privacy while querying a database. This is because an attacker with some background knowledge can perform count and sum queries on a database and hence deduce the sensitive information of the victim. In the rest of this section, we provide an overview of the approaches proposed in each category. Then, we present the different review works done on privacy protection in DL. Afterward, we outline the contributions of this paper compared to the recent literature in the field of DP applied to DL.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

1) TECHNIQUES PRODUCING PRIVACY-PRESERVING DATASETS
k-anonymity [12], [13] is a mechanism for ensuring privacy before releasing a dataset. It consists of generalizing quasi-identifier attributes and redacting some others so that a record cannot be distinguished from at least k − 1 other records in the dataset; in other words, the probability of re-identification is at most 1/k. Nevertheless, k-anonymity performs poorly on high-dimensional datasets and does not provide strong protection against attribute disclosure [14], [15]: an attacker with some background knowledge of the victims could infer critical information about them.
l-diversity [16] has been proposed to overcome the shortcomings of k-anonymity. It is based on the k-anonymity principle, i.e., generalizing quasi-identifier attributes and redacting some others so that a tuple cannot be distinguished from at least k − 1 other tuples. It then divides the dataset into q-blocks, where each block contains k tuples with the same values of the quasi-identifier attributes. In addition, l-diversity ensures that each block has l distinct values for the sensitive attribute. Hence, l-diversity provides stronger privacy against background knowledge and homogeneity attacks. The larger the value of l, the stronger the privacy guarantee.
t-closeness [17] addresses some drawbacks of l-diversity, especially when the sensitive attribute can take only two values (i.e., when l = 2). The t-closeness mechanism is also based on the k-anonymity principle and creates t-closeness classes (blocks) for sensitive attributes. A class is said to have t-closeness for a sensitive attribute A if the earth mover's distance [18] between the distribution of A in the class and in the whole dataset is not higher than a threshold t. A dataset is said to have t-closeness if all classes satisfy t-closeness. Limiting the distance between each class and the whole dataset limits what an adversary can learn from the quasi-identifier values of an individual and the distribution of the class, since it limits disclosure about the correlation between quasi-identifier attributes and the sensitive attribute.
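As an illustration of the definitions above, the following minimal sketch measures the k and l levels actually achieved by an already-generalized table. The function name `anonymity_levels`, the column names, and the toy records are hypothetical and only serve the example; this is not taken from [12] or [16].

```python
from collections import defaultdict

def anonymity_levels(records, quasi_identifiers, sensitive):
    """Return (k, l): the smallest block size and the smallest number of
    distinct sensitive values over all blocks of identical quasi-identifiers."""
    blocks = defaultdict(list)
    for row in records:
        key = tuple(row[q] for q in quasi_identifiers)
        blocks[key].append(row[sensitive])
    k = min(len(values) for values in blocks.values())
    l = min(len(set(values)) for values in blocks.values())
    return k, l

# Hypothetical toy table whose quasi-identifiers are already generalized.
table = [
    {"age": "20-29", "zip": "476**", "disease": "flu"},
    {"age": "20-29", "zip": "476**", "disease": "cancer"},
    {"age": "30-39", "zip": "479**", "disease": "flu"},
    {"age": "30-39", "zip": "479**", "disease": "flu"},
]
k, l = anonymity_levels(table, ["age", "zip"], "disease")
```

In this toy table every block has two records, but the second block contains a single disease value, so the table is 2-anonymous without being 2-diverse; this is the homogeneity situation that l-diversity is meant to rule out.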

2) TECHNIQUES PROTECTING USERS' PRIVACY DURING THE TRAINING
Secure multiparty computing (SMC) [19], [20] is a subfield of cryptography that provides methods for jointly computing a function over inputs from different parties without revealing those inputs either to each other or to the central server. Thus, SMC does not require a trusted third party. Figure 1 illustrates an example of calculating the average salary of three clients without revealing any salary to the central server or to the other clients. In step 1, each client splits his/her salary into three pieces. In step 2, each client keeps one piece and shares the remaining two pieces with the other clients. For example, in Figure 1, client 1 divides his/her salary into −30, 50, and 60, keeps −30 locally, and shares 50 with the second client and 60 with the third client. In step 3, each client adds the received pieces to his/her local piece. In our example, the first client adds the pieces received from the second and third clients to the local piece, obtaining 60 = 50 + 40 + (−30). In step 4, clients send the calculated sums to the central server. Finally, in step 5, the server calculates the average of the received values, which in this example is (60 + 130 + 90)/3 = 93.33. Thus, we get the same average as if it had been calculated from the true salaries.
An SMC protocol is said to be secure if it satisfies the following properties:
• Privacy: A client should not be able to learn any information about any other client in the network, except the information that can be derived from his/her own input and output.
• Correctness: The output received by each participant should be correct.
• Independence of input: The inputs of malicious clients must be independent of the inputs of the honest clients.
• Guarantee of output: Malicious clients should not be able to prevent legitimate clients from receiving their outputs.
• Fairness: Malicious clients receive their outputs if and only if honest participants receive their outputs.
There are some recent works using SMC in federated learning (FL) to protect the privacy of clients [21]-[23]. However, SMC is costly in terms of computational complexity and communication overhead. Thus, SMC is unsuitable for training complex models over big datasets involving many clients.
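The five-step salary example described above can be sketched as follows. This is only an illustration of additive sharing, not a secure implementation: the helper names are hypothetical, the salaries of clients 2 and 3 are assumed values chosen so that the total matches the example (280), and a real protocol would draw shares uniformly from a finite field rather than from a small integer range.

```python
import random

def split_into_shares(value, n_shares, rng):
    """Step 1: split an integer into n_shares random pieces that sum to value."""
    shares = [rng.randint(-100, 100) for _ in range(n_shares - 1)]
    shares.append(value - sum(shares))
    return shares

def secure_average(salaries, seed=0):
    rng = random.Random(seed)
    n = len(salaries)
    # Steps 1-2: shares[i][j] is the piece that client i hands to client j
    # (the piece with j == i stays local).
    shares = [split_into_shares(s, n, rng) for s in salaries]
    # Step 3: each client j sums its local piece and the pieces it received.
    partial_sums = [sum(shares[i][j] for i in range(n)) for j in range(n)]
    # Steps 4-5: the server only sees the partial sums and averages them.
    return sum(partial_sums) / n, partial_sums

# Client 1 holds 80 (= -30 + 50 + 60, as in the text); 90 and 110 are assumed.
salaries = [80, 90, 110]
average, partial_sums = secure_average(salaries)
```

The server recovers the exact average because the shares of each salary cancel out in the total, while no single partial sum reveals an individual salary.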
Homomorphic encryption (HE) [24] provides strong privacy protection, as it allows training a model on an encrypted dataset. HE achieves the same accuracy as training on the original, unencrypted dataset [25], [26]. However, using HE in DL is inefficient in practice due to its computational complexity, especially when the training dataset is too large to fit in memory. HE is better suited to machine learning as a service (MLaaS) [27], [28], where the model is already trained and ready to use. In this case, users send their encrypted inputs to the cloud, which computes the prediction and sends the encrypted results back to the users.

3) DIFFERENTIAL PRIVACY
Recently, DP [29], [30] has attracted a great deal of attention in DL, especially for guaranteeing users' privacy. DP allows analyzing a dataset without revealing the private information of any single individual. In other words, analyzing the dataset and computing statistics about it (such as the mode, median, or mean) does not reveal whether an individual's information was included in the original dataset.
Although DP was first defined in 2006, it has received attention in practical use only in the last few years. The main obstacle to using DP in practice is accuracy: accuracy decreases as the level of privacy protection increases. To overcome this problem, researchers either try to find a trade-off between accuracy and privacy [31], [32] or combine DP with another technique (e.g., memoization, adding a proxy server) to strengthen the privacy protection [33], [34]. There are many applications of DP in practice. For example, Google proposed RAPPOR [35], an approach based on DP for privately collecting statistics from clients' devices (e.g., software hangs and time of utilization). Microsoft [33] applies DP with the memoization technique for privately and periodically collecting statistics from their clients' devices. Apple [36] also uses DP to collect statistics from their clients' devices to enhance their quality of experience. DP was also adopted by the US Census Bureau to protect the publications of the 2018 End-to-End Census Test. DP applications can be divided into two categories:
1) Central differential privacy (CDP), as defined in [29], requires that users trust the database holder (i.e., the data curator) to protect their privacy. CDP consists of adding random noise after collecting the data from individuals. The random noise is added to the original dataset or to the results of queries launched on the original dataset.
2) Local differential privacy (LDP) [37], [38] overcomes this shortcoming of CDP and ensures privacy when individuals do not trust the data curator. During data collection, individuals perturb and/or encode their responses before submitting them to the central server. LDP mechanisms should be carefully implemented: as each individual perturbs his/her response individually, the frequencies estimated on the dataset may be inaccurate [39].
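To illustrate local perturbation, the sketch below uses binary randomized response, a classical LDP mechanism that is not detailed in this section and is chosen here only as an assumed example. Each user reports the true bit with probability e^ε/(1 + e^ε); the server then debiases the observed frequency. The function names and the population are hypothetical.

```python
import math
import random

def randomized_response(true_bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_bit if rng.random() < p_truth else 1 - true_bit

def estimate_frequency(reports, epsilon):
    """Debias the observed frequency of 1s to estimate the true frequency."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)

rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000  # hypothetical population with 30% ones
reports = [randomized_response(b, 1.0, rng) for b in true_bits]
estimate = estimate_frequency(reports, 1.0)
```

The estimate is unbiased but noisy, and its variance grows as ε decreases, which is one reason why frequencies estimated under LDP may be inaccurate.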
All DP schemes share the same principle, which is adding noise to protect the sensitive information of individuals. Adding more noise strengthens privacy protection but degrades accuracy. On the other hand, adding too little noise allows attackers to reveal sensitive information about individuals. Recently, Ren et al. [9] succeeded in recovering the original dataset when only a small amount of noise was added to the gradient. Thus, based on what we will discuss in Section II, one has to evaluate the privacy leakage for a given privacy budget before publishing a dataset or a learning model, or responding to a query.

A. RELATED WORKS
Fatemehsadat et al. [40] present a summary of information disclosure attacks to better situate the need for privacy protection in DL. The authors divide PP methods into three categories: 1) methods for PP datasets that protect the privacy of clients in a dataset, 2) methods protecting the privacy of clients during the training phase, and 3) methods for PP models that protect the privacy of clients after deploying the trained model. However, the authors do not detail DP and PP methods for FL models; they only provide a brief introduction to FL and split learning (SL) without detailing the state-of-the-art PP methods proposed in FL. Ha et al. [41] detail the inference attacks and present methods for producing PP DL models. They categorize these methods into three groups: 1) gradient-level methods that consist of adding noise to the gradient, 2) function-level methods that consist of adding noise to the loss function, and 3) label-level methods that consist of adding noise to the label set during the training. Amine et al. [42] provide a review of 45 papers handling the problem of PP in DL. The authors present works based on different techniques, with DP and HE among the strongest approaches, in addition to model splitting [43], mimic learning [44], and partial parameter sharing [45]. All presented works date from before July 2019; however, many works have emerged since then, especially for preserving privacy in FL. Xue et al. [46] provide a detailed explanation of the different attacks that may threaten a DL model. More specifically, they categorize these attacks into five types: 1) data poisoning attacks, 2) backdoor attacks, 3) adversarial example attacks, 4) model stealing attacks, and 5) recovery of sensitive training data, which includes model inversion attacks and membership inference attacks. Accordingly, the authors present some approaches to deal with the different attacks, including approaches for producing PP models. Chang et al. [47] present a summary of privacy issues in DL. The authors divide these problems into two types: issues during training and issues during prediction, i.e., after deploying the trained model. Accordingly, the authors present some countermeasures to deal with these issues. Zhang et al. [48] discuss PP approaches proposed to deal with attacks threatening collaborative learning. The authors categorize these approaches into two categories: 1) PP during the training phase, and 2) PP after deploying the trained model.
All survey works on privacy in DL focus on detailing possible attacks against DL and presenting the different PP methods that protect users' privacy. Yet, we did not find any paper detailing DP in DL as well as presenting the different variants of DP proposed so far. The main differences between the contributions of the present survey and the above state-of-the-art works are summarized in Table 1.

B. CONTRIBUTIONS
This paper presents a detailed survey of DP mechanisms designed for PP in DL and FL. We bridge the gap in the existing literature by providing:
• A comprehensive description of the probability distributions that satisfy the $\epsilon$-DP definition, with their use cases.
• A review of the different works based on DP for protecting users' privacy in DL and FL. We divide these approaches into three categories based on their type of application: 1) PP queries, 2) PP datasets, 3) PP models.
• An analysis of the main ideas and recent approaches based on DP regarding the computational complexity, communication cost, and accuracy. This analysis illustrates the gap between theory, application, accuracy, and robustness of DP and brings forth many future research directions.

II. DIFFERENTIAL PRIVACY AND ITS VARIANTS
A. $\epsilon$-DP
The main objective of DP is to allow studying the properties of a dataset (about a population) as a whole without revealing any individual's information. In other words, DP consists of adding noise to either statistical queries or the original dataset so that an adversary cannot know whether a particular individual is included in the dataset or not. DP as first defined in [29] requires that users trust the data curator, since users send their correct data without any modification to the data curator. The data is stored in the central server as received from users. Nevertheless, the data curator does not trust the third party or the data analyst. Hence, the data curator uses DP to perturb the original dataset before responding to statistical queries of third parties for analysis. This type of implementation of DP is called central DP. The name central DP comes from the fact that the perturbation is done centrally at the data curator, unlike local DP, which will be discussed in the next section. We denote a mechanism that satisfies DP by $\epsilon$-DP [29], where $\epsilon$ denotes the privacy loss or privacy leakage. Before providing the definition of an $\epsilon$-DP mechanism, we define two neighboring datasets and the sensitivity of a given query function $f$.
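Definition 1 (Neighboring datasets): Two datasets $D, D' \in \mathcal{D}^n$ are called neighboring if they differ in at most one record.
Definition 2 (Sensitivity): The sensitivity of a query function $f : \mathcal{D}^n \to Y$ is
$$\Delta f = \max_{D, D'} \| f(D) - f(D') \|_1,$$
where the maximum is taken over all neighboring datasets $D, D'$ and $\|\cdot\|_1$ denotes the $\ell_1$ norm.
Definition 3 ($\epsilon$-differential privacy): A mechanism or an algorithm $M$ is called $\epsilon$-differentially private if for all neighboring datasets $D, D' \in \mathcal{D}^n$ and for all $S \subseteq Y$, where $Y$ is the set of all possible outputs, we have:
$$\Pr[M(D) \in S] \le e^{\epsilon} \Pr[M(D') \in S],$$
that is to say, the output when the mechanism $M$ is applied to $D$ is similar to the output when $M$ is applied to $D'$. The smaller $\epsilon$ is, the stronger the privacy guarantee.
It is worth mentioning that the combination, known in the literature as composition, of two DP mechanisms is also a DP mechanism (see the proof in [49]). The composition theorem is stated as follows.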
Theorem 1 (Composition): Let $M_1$ be an $\epsilon_1$-DP mechanism and $M_2$ be an $\epsilon_2$-DP mechanism. Then, the composition of $M_1$ and $M_2$ is $(\epsilon_1 + \epsilon_2)$-DP.
The composition theorem allows using DP in practical use cases such as guaranteeing the privacy of gradients in FL. For example, suppose a client in FL applies an $\epsilon$-DP mechanism to the gradient before sending it to the central server. After $k$ epochs, the composition theorem implies that the overall mechanism is $(k \times \epsilon)$-DP. That is to say, the privacy leakage after the first epoch is $\epsilon$, and after $k$ epochs it becomes $k \times \epsilon$.
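The bookkeeping implied by Theorem 1 can be sketched as follows. The function names are hypothetical, and splitting the budget evenly across epochs is an assumption made for the example; the theorem itself allows any per-epoch values.

```python
def total_leakage(per_epoch_epsilons):
    """Sequential composition: the epsilons of the composed mechanisms add up."""
    return sum(per_epoch_epsilons)

def even_per_epoch_epsilon(total_budget, num_epochs):
    """Largest equal per-epoch epsilon that keeps k * epsilon within the budget."""
    return total_budget / num_epochs

eps_per_epoch = even_per_epoch_epsilon(total_budget=2.0, num_epochs=20)
leak_after_training = total_leakage([eps_per_epoch] * 20)
```

With a total budget of 2 and 20 epochs, each epoch may spend an ε of 0.1, which shows how quickly the per-step noise must grow when training runs for many epochs.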

B. $(\epsilon, \delta)$-DP
The first definition of $\epsilon$-DP was introduced by Dwork et al. [29], as stated earlier in Definition 3. Afterward, the same authors proposed a relaxation of $\epsilon$-DP called $(\epsilon, \delta)$-DP [50], [51] by adding $\delta$ as an additive term to the original definition. $\delta$ was added to capture the privacy protection of the Gaussian distribution (see Definition 6 below).
Definition 4 ($(\epsilon, \delta)$-DP [50], [51]): A mechanism $M$ is called $(\epsilon, \delta)$-differentially private if for all neighboring datasets $D, D' \in \mathcal{D}^n$ and for all $S \subseteq Y$, where $Y$ is the set of all possible outputs, we have:
$$\Pr[M(D) \in S] \le e^{\epsilon} \Pr[M(D') \in S] + \delta.$$
A mechanism $M$ that satisfies $(\epsilon, \delta)$-DP can be interpreted as being $\epsilon$-DP except with probability $\delta$; that is to say, $M$ is $\epsilon$-DP with probability $1 - \delta$. $(\epsilon, \delta)$-DP is proposed to mitigate the privacy leakage of $\epsilon$-DP under composition, as $\epsilon$-DP is closed under composition [52]. $(\epsilon, \delta)$-DP provides a smaller cumulative loss under composition.² $(\epsilon, \delta)$-DP is not appropriate in the scenario where $S$ is a singleton set. It is worth mentioning that $\delta$ should be negligible compared to the size of the set $S$ (i.e., $\delta \ll 1/|S|$), to avoid the worst-case scenario of always violating the privacy of a $\delta$ fraction of the dataset.
In this subsection, we present the different probability distributions proposed in the literature that satisfy either $\epsilon$-DP or $(\epsilon, \delta)$-DP. We also point out the type of noise generated, the conditions of application, and the use cases.
1) Laplace mechanism [29]: the most used approach in the literature, as it can be used for any type of data [53]. The Laplace mechanism consists of adding noise drawn from the continuous Laplace distribution $\mathrm{Lap}(0, \Delta f/\epsilon)$.
Definition 5: Given a function $f : \mathcal{D}^n \to Y$, where $Y$ is the set of all possible outputs, and $\epsilon > 0$, the Laplace mechanism is defined as
$$M(D) = f(D) + \mathrm{Lap}(0, \Delta f/\epsilon).$$
2) Gaussian mechanism [54]: satisfies the new variant of $\epsilon$-DP called $f$-DP (see subsection II-E for more details) and supports tractable tracking of the privacy budget under composition.
Definition 6: Given two neighboring datasets $D$ and $D'$ in the dataset universe $\mathcal{D}^n$, a query function $f : \mathcal{D}^n \to Y$, where $Y$ is the set of all possible outputs, and $\epsilon > 0$, the $\epsilon$-Gaussian DP ($\epsilon$-GDP) mechanism is defined as:
$$M(D) = f(D) + \mathcal{N}\left(0, (\Delta f)^2/\epsilon^2\right),$$
where $\mathcal{N}(0, (\Delta f)^2/\epsilon^2)$ stands for the normal distribution.
² Composition means the sequential application of DP. For example, if we apply DP to the result of a query function $f$, calling $f$ once is $\epsilon$-DP, and calling $f$ sequentially $k$ times is $(k \times \epsilon)$-DP.
3) Geometric mechanism [55]: used to add discrete noise to the result of a query function for integer-valued data [56].
Definition 7: Given a dataset $D$, a query function $f : \mathcal{D}^n \to Y$, and $\epsilon > 0$, the two-sided geometric mechanism adds independent noise to the query function $f$:
$$M(D) = f(D) + \eta,$$
where $\eta$ is a random variable with a two-sided geometric distribution:
$$P(\eta = \delta) = \frac{1 - e^{-\epsilon}}{1 + e^{-\epsilon}}\, e^{-\epsilon |\delta|} \quad \text{for every integer } \delta.$$
The probability $P(\eta = \delta)$ can be interpreted as the probability of adding discrete noise $\delta$ to the result of the query function $f$ (here $\delta$ denotes an integer noise value, not the parameter of $(\epsilon, \delta)$-DP). The Geometric mechanism is a discretized version of the Laplace mechanism [57].
4) Exponential mechanism [58]: most suited when we have to select a noisy (i.e., random) response from the set of all possible outputs, instead of adding noise to the result of the query function [53].
Definition 8: Given a dataset universe $\mathcal{D}^n$, a set of all possible outputs $Y$, and a scoring function $u : \mathcal{D}^n \times Y \to \mathbb{R}$ that assigns a real-valued score to any pair $(D, y)$ from $\mathcal{D}^n \times Y$, with the understanding that higher scores correspond to better-suited outputs, the exponential mechanism consists of selecting an output $y \in Y$ with probability proportional to $\exp\left(\frac{\epsilon\, u(D, y)}{2 \Delta u}\right)$, where $\Delta u$ is the sensitivity of $u$. This means that the output with the highest score is the most likely to be returned, but the exponential mechanism sometimes returns a $y \in Y$ that does not have the highest score. The Laplace mechanism can be recovered from the exponential mechanism by taking $u(D, y) = -|f(D) - y|$, where $f$ is the function defined in Definition 5; $u(D, y)$ takes its maximal value when the output $y$ is equal to the exact query result $f(D)$.
5) Binomial mechanism [59]: used to add discrete noise to the result of the query function. However, the Binomial mechanism satisfies $(\epsilon, \delta)$-DP (a variant of $\epsilon$-DP, see Definition 4) only under constraints [59], as illustrated in the following definition.
Definition 9: Given two neighboring datasets $D$ and $D'$ in the dataset universe $\mathcal{D}^n$, and a query function $f : \mathcal{D}^n \to Y$, the Binomial mechanism is defined as:
$$M(D) = f(D) + s\,(Z - Np),$$
where $Z \sim \mathrm{Bin}(N, p)$, and $s = 1/t$ is the quantization scale for some $t \in \mathbb{N}$; $s$ helps to normalize the noise correctly. The parameters $\delta$, $N$, $p$, and $s$ should satisfy the following condition:
$$Np(1 - p) \ge \max\left(23 \log(10 d/\delta),\ 2\Delta_\infty/s\right),$$
where $d$ is the dimension of the output of the query function $f$, and $\Delta_\infty$ is the infinity norm of the sensitivity of the query function $f$.
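The Laplace, geometric, and exponential mechanisms listed above can be sketched with the standard library as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the geometric noise assumes a query of sensitivity 1, and floating-point sampling of this kind is not hardened against known side-channel issues.

```python
import math
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. Exp(1) variables, times scale, is Laplace(0, scale).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Definition 5: add Laplace noise with scale sensitivity / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)

def two_sided_geometric_noise(epsilon, rng):
    """Integer noise with P(z) proportional to exp(-epsilon * |z|)."""
    def geometric():
        # Number of failures before a success with probability 1 - e^(-epsilon).
        u = 1.0 - rng.random()  # u lies in (0, 1]
        return int(math.floor(math.log(u) / -epsilon))
    return geometric() - geometric()

def exponential_mechanism(candidates, score, score_sensitivity, epsilon, rng):
    """Pick y with probability proportional to exp(eps * u(y) / (2 * sensitivity))."""
    scores = [score(y) for y in candidates]
    top = max(scores)  # subtracting the maximum avoids overflow, weights are rescaled only
    weights = [math.exp(epsilon * (s - top) / (2.0 * score_sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)
noisy_count = laplace_mechanism(true_value=42, sensitivity=1.0, epsilon=0.5, rng=rng)
noisy_int_count = 42 + two_sided_geometric_noise(epsilon=0.5, rng=rng)
votes = {"a": 0, "b": 10, "c": 1}  # hypothetical scores for three candidate outputs
winner = exponential_mechanism(list(votes), votes.get, 1.0, 100.0, rng)
```

With a very large ε, as in the last line, the exponential mechanism almost surely returns the top-scoring candidate; smaller values of ε spread the probability over lower-scoring candidates.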
In this subsection, we compare the most used DP distributions in the literature, namely the Laplace distribution (i.e., $\epsilon$-DP) and the Gaussian distribution (i.e., $(\epsilon, \delta)$-DP), in terms of privacy protection and accuracy. For this purpose, we develop and train a learning model under three scenarios that differ in the dataset used for training:
• Scenario 1: In the first scenario, we train the learning model on the original MNIST dataset [60] without any noise. This is our reference scenario to evaluate the impact of privacy protection (i.e., noise) on accuracy.
• Scenario 2: In the second scenario, we train the learning model on the privacy-preserving MNIST dataset generated by adding Laplace noise.
• Scenario 3: In the third scenario, we train the learning model on the privacy-preserving MNIST dataset generated by adding Gaussian noise.
The learning model (see Figure 2) is composed of two convolutional layers, each followed by a ReLU activation function. The second convolutional layer is followed by dropout regularization to prevent overfitting. Then, we add three fully connected linear layers; the output dimension of the last linear layer is 10, which corresponds to the number of classes in the training dataset.
Figure 3 illustrates the accuracy of the learning model for the three scenarios during 20 epochs of training. We notice that the highest accuracy is achieved in the first scenario (i.e., the blue curve), compared to the other two scenarios. On the other hand, for scenarios 2 and 3, where we train the model on privacy-preserving datasets, we notice that the Gaussian distribution gives higher accuracy than the Laplace distribution. This agrees with the theoretical analysis, as the Laplace distribution is $\epsilon$-DP, whereas the Gaussian distribution is $\epsilon$-DP only with probability $1 - \delta$ (i.e., $(\epsilon, \delta)$-DP).
This difference is illustrated in Figure 4, which shows samples from the privacy-preserving MNIST datasets generated using the Laplace and the Gaussian distributions. Images generated with the Laplace distribution are noisier than those generated with the Gaussian distribution, especially when the privacy leakage $\epsilon$ decreases. For example, in the case of $\epsilon = 2$, we can still notice some white pixels for the Gaussian distribution (see subfigure 4-d). On the contrary, for the Laplace distribution (see subfigure 4-h), the image is so noisy that we cannot extract any useful information. Thus, the Laplace distribution guarantees stronger privacy protection than the Gaussian distribution, but at the expense of accuracy.
Overall, the accuracy decreases as the privacy leakage decreases for the two distributions. In addition, the Gaussian distribution gives higher accuracy than the Laplace distribution, and the difference grows as the privacy leakage decreases. For example, in the case of $\epsilon = 5$, the Gaussian distribution gives an accuracy of 98.24 compared to 97.99 for the Laplace distribution, whereas in the case of $\epsilon = 2$, the Gaussian distribution gives an accuracy of 96.43 compared to 93.70 for the Laplace distribution.
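The visual difference between the two noisy datasets can be reproduced in spirit with the sketch below. The text does not state how the noise was scaled, so the sketch assumes pixel values with sensitivity 1, Laplace noise of scale 1/ε, and Gaussian noise of standard deviation 1/ε; the function name and the blank test image are hypothetical, and clipping back to the valid pixel range is omitted.

```python
import random
import statistics

def perturb_image(pixels, epsilon, kind, rng, sensitivity=1.0):
    """Add independent Laplace or Gaussian noise to every pixel."""
    scale = sensitivity / epsilon
    noisy = []
    for value in pixels:
        if kind == "laplace":
            noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
        elif kind == "gaussian":
            noise = rng.gauss(0.0, scale)
        else:
            raise ValueError("kind must be 'laplace' or 'gaussian'")
        noisy.append(value + noise)
    return noisy

rng = random.Random(0)
image = [0.0] * (28 * 28)  # hypothetical blank 28x28 image
laplace_std = statistics.pstdev(perturb_image(image, 2.0, "laplace", rng))
gaussian_std = statistics.pstdev(perturb_image(image, 2.0, "gaussian", rng))
```

Under these assumptions, Laplace noise of scale 1/ε has standard deviation √2/ε, which is larger than the 1/ε of the Gaussian noise at the same ε; this is consistent with the Laplace images looking noisier.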

E. VARIANTS OF DIFFERENTIAL PRIVACY
In this subsection, we present the pertinent variants of DP, namely $(\alpha, \epsilon)$-Rényi DP ($(\alpha, \epsilon)$-RDP) and $f$-DP. We state the main differences between these new definitions of privacy protection, as well as the advantages and disadvantages of each variant.
The most challenging problem of the DP mechanism is that the privacy leakage increases due to composition (see Theorem 1). In fact, the privacy leakage increases by increasing k, the number of compositions. Thus, determining a tighter bound of the privacy leakage due to composition allows learning more features (e.g., producing accurate learning models) from a dataset while protecting individuals' sensitive information.
Dwork et al. [61] determine a bound on the privacy budget after $k$ compositions: the composition of $k$ mechanisms, each $(\epsilon, \delta)$-DP, is
$$\left(\sqrt{2k\ln(1/\delta')}\,\epsilon + k\epsilon(e^{\epsilon} - 1),\ k\delta + \delta'\right)\text{-DP}$$
for any $\epsilon, \delta, \delta' \in ]0, \infty[$. Afterward, Kairouz et al. [62] define a procedure that achieves the optimal bound on the privacy budget after $k$ queries. The authors prove that for any $i \in \{0, 1, \ldots, \lfloor k/2 \rfloor\}$, the composition of $k$ queries satisfies
$$\left(\epsilon_i,\ 1 - (1 - \delta)^k (1 - \delta_i)\right)\text{-DP}, \quad (9)$$
where
$$\epsilon_i = (k - 2i)\epsilon, \qquad \delta_i = \frac{\sum_{\ell=0}^{i-1} \binom{k}{\ell}\left(e^{(k-\ell)\epsilon} - e^{(k-2i+\ell)\epsilon}\right)}{(1 + e^{\epsilon})^k}.$$
Thus, in practice, we may use Eq. (9) to determine the optimal values of $\epsilon_i$ and $\delta_i$ for each query $q$ so as not to exceed a predefined privacy leakage $(\epsilon, \delta)$ after $k$ queries. Thereafter, Mironov [52] determines a tighter bound on the privacy leakage due to composition using Rényi divergence-based DP, which is the subject of the next subsection.
1) $(\alpha, \epsilon)$-RDP
Although the original definition of $\epsilon$-DP provides strong protection of data privacy, it does not tightly handle the privacy leakage due to composition. The problem of composition also appears while training a federated learning model, as the privacy leakage increases with the number of training epochs. For example, if we apply a mechanism $M$ with a privacy loss $\epsilon$ at each epoch, we end the training with a privacy loss of $k\epsilon$, where $k$ is the total number of training epochs.
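To see how much the bound of Dwork et al. [61] quoted above can save over basic composition (Theorem 1), the following sketch evaluates both. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
import math

def basic_composition(epsilon, delta, k):
    """Theorem 1 applied k times: the parameters simply add up."""
    return k * epsilon, k * delta

def advanced_composition(epsilon, delta, k, delta_prime):
    """Bound of [61]: (sqrt(2k ln(1/delta')) eps + k eps (e^eps - 1), k delta + delta')."""
    eps_total = (math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) * epsilon
                 + k * epsilon * (math.exp(epsilon) - 1.0))
    return eps_total, k * delta + delta_prime

basic_eps, basic_delta = basic_composition(0.1, 0.0, 100)
adv_eps, adv_delta = advanced_composition(0.1, 0.0, 100, delta_prime=1e-5)
```

For 100 compositions of a 0.1-DP mechanism, basic composition gives a total ε of 10, whereas the advanced bound gives roughly 5.85 at the cost of a small δ' of 1e-5. The advantage only appears when the per-step ε is small and k is large.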
Mironov [52] introduces a new relaxation of $\epsilon$-DP based on the concept of Rényi divergence [63]. This variant, $(\alpha, \epsilon)$-RDP, generalizes the notion of differential privacy and provides a quantitatively accurate way of tracking cumulative privacy leakage under composition. Before defining $(\alpha, \epsilon)$-RDP, we define the Rényi divergence [63].
Definition 10 (Rényi divergence): Given two probability distributions $P$ and $Q$ defined over $R$, the Rényi divergence of order $\alpha > 1$ is
$$D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right],$$
where $P(x)$ is the density of $P$ at $x$. The logarithm here is natural, and $x \sim Q$ means that $x$ follows the distribution $Q$.
Definition 11 ($(\alpha, \epsilon)$-RDP): A mechanism $M : \mathcal{D}^n \to Y$ is said to satisfy $(\alpha, \epsilon)$-RDP of order $\alpha$ if for any neighboring datasets $D, D'$, it holds that
$$D_\alpha(M(D) \,\|\, M(D')) \le \epsilon. \quad (11)$$
For all $S \subseteq Y$, the following inequality also holds for $(\alpha, \epsilon)$-RDP:
$$\Pr[M(D) \in S] \le \left(e^{\epsilon} \Pr[M(D') \in S]\right)^{(\alpha - 1)/\alpha}.$$
$(\alpha, \epsilon)$-RDP achieves a tighter bound on the privacy leakage due to composition than the bounds determined in [61], [62]. Using the $(\alpha, \epsilon)$-RDP definition, Mironov demonstrates the following corollary.
Corollary 1: Let $0 < \delta < 1$ such that $\log(1/\delta) \ge \epsilon^2 k$. The composition of $k$ queries, each satisfying $\epsilon$-DP, is $(\epsilon', \delta)$-DP, where $\epsilon' = 4\epsilon\sqrt{2k\log(1/\delta)}$.
Thus, we may use this result to track the privacy leakage due to composition. For example, in federated learning, we may use this corollary to determine the scale of the Laplace distribution (i.e., $\lambda = 1/\epsilon_i$, where $\epsilon_i$ is the privacy leakage of each training epoch $i$ calculated from Corollary 1) so as not to exceed a predefined privacy leakage $(\epsilon', \delta)$.
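The use of Corollary 1 suggested above can be sketched by inverting its bound: given a target leakage ε' and a number of epochs k, solve for the per-epoch ε and derive the Laplace scale. The function names and the numerical values are hypothetical, and the scale assumes a query of sensitivity 1.

```python
import math

def epsilon_per_epoch(target_epsilon, delta, k):
    """Solve eps' = 4 * eps * sqrt(2 k log(1/delta)) for the per-epoch eps."""
    log_term = math.log(1.0 / delta)
    eps = target_epsilon / (4.0 * math.sqrt(2.0 * k * log_term))
    if log_term < eps ** 2 * k:
        raise ValueError("precondition log(1/delta) >= eps^2 * k does not hold")
    return eps

def laplace_scale(eps):
    """Scale lambda = 1 / eps of the Laplace noise added at each epoch."""
    return 1.0 / eps

eps_i = epsilon_per_epoch(target_epsilon=8.0, delta=1e-5, k=100)
scale = laplace_scale(eps_i)
```

The explicit check of the precondition matters, because the corollary gives no guarantee when the per-epoch ε is too large relative to log(1/δ)/k.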
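An $(\alpha, \epsilon)$-RDP mechanism is also $\left(\epsilon + \frac{\log(1/\delta)}{\alpha - 1}, \delta\right)$-DP for any $0 < \delta < 1$ [52]. Since this holds for all $\alpha > 1$, the optimal privacy bound can be determined by minimizing over $\alpha$ the following expression:
$$\epsilon(\alpha) + \frac{\log(1/\delta)}{\alpha - 1},$$
where $\epsilon(\alpha)$ is the RDP parameter of $M$ at order $\alpha$. Developing this expression for the case when $M$ is a Gaussian mechanism applied to a query of sensitivity 1, we get:
$$\frac{\alpha}{2\sigma^2} + \frac{\log(1/\delta)}{\alpha - 1},$$
where $\sigma^2$ is the variance of the Gaussian distribution.
For the case when $M$ is a Laplace mechanism applied to a query of sensitivity 1, we get:
$$\frac{1}{\alpha - 1}\log\left(\frac{\alpha}{2\alpha - 1}\, e^{(\alpha - 1)/\lambda} + \frac{\alpha - 1}{2\alpha - 1}\, e^{-\alpha/\lambda}\right) + \frac{\log(1/\delta)}{\alpha - 1},$$
where $\lambda$ is the scale of the Laplace distribution. Ultimately, RDP allows determining a tighter bound on the privacy leakage due to composition than the state-of-the-art privacy bounds calculated using the original definition of $(\epsilon, \delta)$-DP [49], [61], [62].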
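The optimization over α described above can be sketched for the Gaussian mechanism as follows. The sketch relies on two facts from [52]: a Gaussian mechanism with standard deviation σ on a sensitivity-1 query is (α, α/(2σ²))-RDP, and RDP parameters add up under composition. The function name, the grid of α values, and the numbers are hypothetical.

```python
import math

def gaussian_rdp_to_dp(sigma, k, delta, alphas=None):
    """Return (epsilon, alpha) minimizing k*alpha/(2 sigma^2) + log(1/delta)/(alpha-1)."""
    if alphas is None:
        alphas = [1.0 + j / 100.0 for j in range(1, 20001)]  # 1.01, 1.02, ..., 201.0
    log_term = math.log(1.0 / delta)
    candidates = ((k * a / (2.0 * sigma ** 2) + log_term / (a - 1.0), a) for a in alphas)
    return min(candidates)

eps, best_alpha = gaussian_rdp_to_dp(sigma=4.0, k=100, delta=1e-5)

# For this particular expression the minimizer is also available in closed form.
a = 100 / (2.0 * 4.0 ** 2)
closed_form = a + 2.0 * math.sqrt(a * math.log(1e5))
```

A simple grid search is used because the same routine then works for mechanisms, such as the Laplace mechanism, whose RDP curve has no convenient closed-form minimizer.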
2) $f$-DP
Dong et al. [54] propose $f$-DP, a new relaxation of $\epsilon$-DP based on the hypothesis testing interpretation. $f$-DP is parameterized by a function rather than by scalar parameters (e.g., $\epsilon$, $\delta$), which offers a complete characterization of privacy.
In fact, $f$-DP is based on the following simple idea behind $\epsilon$-DP: by querying two neighboring datasets $D$ and $D'$, an attacker cannot conclude whether an individual belongs to $D$ or to $D'$. Thus, this problem can be formulated as a test of the following two hypotheses:
• $H_0$: the underlying dataset is $D$,
• $H_1$: the underlying dataset is $D'$,
with the objective of making these two hypotheses indistinguishable. This is equivalent to finding the optimal trade-off between the achievable type I error (rejecting $H_0$ when it is true) and type II error (failing to reject $H_0$ when $H_1$ is true). More precisely, for a rejection rule $0 \le \phi \le 1$, the type I error rate and the type II error rate are respectively defined as follows:
$$\alpha_\phi = \mathbb{E}_{M(D)}[\phi], \qquad \beta_\phi = 1 - \mathbb{E}_{M(D')}[\phi],$$
where $M(D)$ and $M(D')$ are the probability distributions of the mechanism $M$ applied to the two datasets $D$ and $D'$, respectively. The two error rates satisfy the constraint of the total variation distance:
$$\alpha_\phi + \beta_\phi \ge 1 - TV(M(D), M(D')),$$
where the total variation distance $TV(M(D), M(D'))$ is the largest possible difference between the probabilities that the two probability distributions $M(D)$ and $M(D')$ can assign to the same event.
Therefore, the main objective of f-DP is to characterize the fine-grained trade-off between type I and type II errors, that is, to fix the type I error at any level and find the minimal achievable type II error. Before defining f-DP, we define the trade-off function.
Definition 12 (Trade-off function): For any two probability distributions P and Q on the same space, the trade-off function T(P, Q) : [0, 1] → [0, 1] is defined as $T(P, Q)(\alpha) = \inf\{\beta_\phi : \alpha_\phi \leq \alpha\}$, where the infimum is taken over all (measurable) rejection rules.
The larger the value of T(P, Q)(α), the harder it is to distinguish P from Q, i.e., the more similar the two distributions are.
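In practice, it is difficult to verify this definition directly. Thus, the following proposition presents a necessary and sufficient condition for a function to be a trade-off function.
Proposition [54]: A function f : [0, 1] → [0, 1] is a trade-off function if and only if f is convex, continuous, non-increasing, and f(x) ≤ 1 − x for all x ∈ [0, 1].
Definition 13 (f-DP): Let f be a trade-off function. A mechanism M satisfies f-DP if $T(M(D), M(D')) \geq f$ for all neighboring datasets D and D′. This definition is explained as follows. Given two distributions P and Q such that f = T(P, Q), a mechanism M satisfying f-DP means that distinguishing M(D) from M(D′) is at least as difficult as distinguishing P from Q.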
The Gaussian mechanism is an example of a mechanism that satisfies the f-DP definition, where f is the trade-off function of two normal distributions. To be more specific, let µ > 0 and G_µ := T(N(0, 1), N(µ, 1)).
An explicit expression of the trade-off function G_µ is: $G_\mu(\alpha) = \Phi(\Phi^{-1}(1 - \alpha) - \mu)$, where Φ is the standard Gaussian cumulative distribution function and Φ⁻¹ is its inverse function. Hence, GDP is defined as follows:
Definition 14: Given two neighboring datasets D and D′ in the dataset universe $\mathcal{D}^n$, a mechanism M satisfies µ-Gaussian DP (µ-GDP) if it is G_µ-DP, i.e., $T(M(D), M(D')) \geq G_\mu$.
This definition gives a necessary and sufficient condition for a mechanism M to satisfy µ-GDP. The Gaussian mechanism satisfies µ-GDP, and this is the tightest possible privacy bound of the Gaussian mechanism, see Section II-E, Definition 6.
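As a small numerical illustration, the Gaussian trade-off function can be evaluated with the standard library. The sketch assumes the closed form G_µ(α) = Φ(Φ⁻¹(1 − α) − µ) reported by Dong et al. [54]; the function name `gaussian_tradeoff` is hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # N(0, 1)


def gaussian_tradeoff(alpha: float, mu: float) -> float:
    """Minimal type II error at type I error level alpha for N(0,1) vs N(mu,1)."""
    if alpha <= 0.0:
        return 1.0   # never rejecting gives type II error 1
    if alpha >= 1.0:
        return 0.0   # always rejecting gives type II error 0
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(1.0 - alpha) - mu)


# A larger mu makes the two hypotheses easier to tell apart, so the best
# achievable type II error at a fixed type I error decreases.
for mu in (0.0, 0.5, 1.0, 2.0):
    print(mu, gaussian_tradeoff(0.05, mu))
```

For µ = 0 the two distributions coincide and the function reduces to 1 − α, the trade-off of perfect privacy.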
Using this new definition based on a function (i.e., f-DP) allows determining a tighter bound on the privacy leakage due to composition. For the case of the Gaussian mechanism, the authors in [54] proved that:
Corollary 2: The composition of k queries, each satisfying µ-GDP, is (√k µ)-GDP.
In [52], the authors prove that RDP guarantees a tighter bound compared to [64]. In the rest of this subsection, we compare the bounds of RDP and f-DP. For that purpose, we first have to answer the following question: what is the privacy leakage ε of a mechanism M satisfying µ-GDP?
Starting from [50], we conclude: Thus, we can calculate the privacy leakage ε for a mechanism M satisfying µ-GDP using the following expression: From this equation, we can calculate the privacy leakage due to the composition of M mechanisms, each of which satisfies µ-GDP: In Figure 6, we compare the privacy leakage ε (i.e., Eq. (14)) of M RDP mechanisms and the privacy leakage ε_G (i.e., Eq. (24)) of M GDP mechanisms. We use the same parameters as [52], i.e., σ = 20, δ = 10⁻⁵. According to this figure, f-DP allows us to achieve a smaller privacy bound, by a difference of up to 1.025. This result has an important impact on private DL algorithms, especially FL, as it allows for more training epochs for the same privacy budget ε, e.g., 141 more training epochs for any ε larger than 7.
Thus, in practice (e.g., training an FL model), given a predefined privacy budget ε that must not be exceeded, we may use Eq. (24) to determine the optimal privacy budget of each training epoch. Then, using Eq. (22), we calculate the variance σ² of the Gaussian distribution from which we generate the noise to add.
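The conversion from µ-GDP to an (ε, δ) guarantee can be sketched numerically. The code below does not reproduce Eq. (22) or Eq. (24); it relies instead on two results from Dong et al. [54], taken here as assumptions: a Gaussian mechanism with noise σ on a sensitivity-1 query is (1/σ)-GDP, and a mechanism is µ-GDP exactly when it is (ε, δ(ε))-DP for all ε ≥ 0 with δ(ε) = Φ(−ε/µ + µ/2) − e^ε Φ(−ε/µ − µ/2). The function names are hypothetical.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard Gaussian CDF


def gdp_delta(eps: float, mu: float) -> float:
    """delta(eps) implied by mu-GDP (assumed duality from Dong et al.)."""
    return _PHI(-eps / mu + mu / 2.0) - math.exp(eps) * _PHI(-eps / mu - mu / 2.0)


def gdp_epsilon(mu: float, delta: float, tol: float = 1e-10) -> float:
    """Smallest eps with delta(eps) <= delta, found by bisection."""
    if gdp_delta(0.0, mu) <= delta:
        return 0.0
    lo, hi = 0.0, 1.0
    while gdp_delta(hi, mu) > delta:   # delta(eps) decreases with eps
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gdp_delta(mid, mu) > delta:
            lo = mid
        else:
            hi = mid
    return hi


def composed_mu(sigma: float, k: int, sensitivity: float = 1.0) -> float:
    """mu after k Gaussian queries: each is (sensitivity/sigma)-GDP, then sqrt(k) scaling."""
    return math.sqrt(k) * sensitivity / sigma


mu_total = composed_mu(sigma=20.0, k=100)
print("composed mu:", mu_total)
print("epsilon at delta = 1e-5:", gdp_epsilon(mu_total, 1e-5))
```

Running the search in the other direction (fixing ε and δ, then solving for the largest admissible k or the smallest σ) gives the per-epoch noise level for a predefined budget.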
In summary, RDP and f-DP are two different new definitions of DP, where RDP is based on the Rényi divergence (parameterized by (α, ε)) and f-DP is based on hypothesis testing (parameterized by a trade-off function f). Comparing RDP with f-DP in terms of privacy leakage due to composition, f-DP allows determining a tighter bound than RDP. Thus, in FL and for the same privacy budget ε, f-DP allows for more training epochs than RDP. We refer the reader to [65] for more details about the relationship between RDP, f-DP, and (ε, δ)-DP.

III. CENTRAL DIFFERENTIAL PRIVACY FOR DEEP LEARNING
In this section, we present the recent research works based on CDP for protecting the users' privacy in DL. We divide these works into three categories:
A. PP learning models. The main idea of these approaches is to add static or dynamic noise to the coefficients of the objective function.
B. PP query results. These approaches can be divided into two types:
i) The works that add noise to the query result after running the query on the original dataset.
ii) The works that partition the dataset, run the query on each part of the dataset, and then add noise to the subquery results.
C. PP datasets. These approaches add noise to the original dataset to produce a new PP dataset.
Table 3 summarizes the works presented in this section. This table illustrates the main idea and the final objective of each contribution, along with the type of probability distribution used in the DP mechanism.
A. PRIVACY-PRESERVING LEARNING MODEL
DL models are threatened by inversion attacks [81], [82]. An attacker can reveal some sensitive information about an individual by interrogating the learning model and using background information about this individual. Information disclosure is done by linking the target features with the model outcomes. In the rest of this subsection, we present the recent research work that handles the problem of privacy in DL. These works apply DP during the training to produce PP models.
Pan et al. [66] present the adaptive differentially private regression (ADPR) mechanism, a dynamic privacy noise allocation mechanism that takes into account the relevance of the input attributes to the outputs. The mechanism consists of adding Laplace noise drawn from Lap(Δf/ε_j) to the polynomial coefficients of the objective function, where Δf is the sensitivity and ε_j (the privacy budget of attribute j) is calculated according to the relevance R_j(D) of the input feature. Thus, less noise is added to attributes that highly impact the learning model and vice versa. Although this approach gives better accuracy compared to [67]-[69], [83], it is costly in terms of computation, as it has to run a pre-processing learning step to determine the relevance of each attribute. The approaches of [66] and [67] are the same, except that [67] adds noise with the same privacy budget to all attributes, which may decrease the model's accuracy.
Fang et al. [69] decompose the objective function into monomial terms and add noise to each monomial term according to its sensitivity Δf_i and the privacy budget ε_i. The privacy budget ε_i is dynamic, is updated at each iteration, and should satisfy ε_1 + ... + ε_d = ε, where ε is the total privacy budget and d stands for the number of terms of the polynomial objective function.
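The idea of perturbing each coefficient of the objective function with its own budget can be sketched as follows. This is only an illustration under stated assumptions: the total budget ε is split proportionally to a relevance score (one possible allocation rule, not necessarily the one used in [66] or [69]), the per-term sensitivities are given, and Laplace noise is sampled as the difference of two exponential variables. All function names are hypothetical.

```python
import random


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Lap(scale) as the scaled difference of two Exp(1) variables."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def allocate_budget(total_eps: float, relevance: list) -> list:
    """Assumed rule: budget eps_i proportional to relevance, summing to total_eps."""
    total = sum(relevance)
    return [total_eps * r / total for r in relevance]


def perturb_coefficients(coeffs, sensitivities, budgets, rng):
    """Add Lap(sensitivity_i / eps_i) noise to each polynomial coefficient."""
    return [c + laplace(s / e, rng)
            for c, s, e in zip(coeffs, sensitivities, budgets)]


rng = random.Random(0)
coeffs = [2.0, -1.5, 0.3]            # coefficients of the objective's monomials
sensitivities = [1.0, 1.0, 1.0]      # assumed per-term sensitivities
budgets = allocate_budget(1.0, relevance=[5.0, 3.0, 2.0])
print("per-term budgets:", budgets)
print("noisy coefficients:", perturb_coefficients(coeffs, sensitivities, budgets, rng))
```

Terms with a larger share of the budget receive noise with a smaller scale, which matches the intuition that highly relevant attributes should be perturbed less.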
Katrina et al. [70] propose a noise reduction framework for learning models trained with empirical risk minimization (ERM). The framework consists of choosing the privacy budget according to a predefined accuracy. It adds noise to the model parameters (i.e., the gradient) to generate a sequence of parameters, where each parameter corresponds to a privacy budget. Afterward, the framework selects the privacy budget that gives an accuracy higher than the predefined threshold. This approach is costly in terms of computational complexity because it has to go sequentially over all noisy parameters until it finds the privacy budget that gives the predefined accuracy.
Ultimately, as illustrated by our simulation results (see Figure 5), the accuracy decreases as the privacy leakage ε decreases (i.e., as more noise is introduced). That is to say, the accuracy decreases as the privacy protection increases. Thus, determining the amount of privacy leakage that guarantees both strong privacy protection and acceptable accuracy is challenging and depends on the application scenario. There are three categories of works in the literature. The first category consists of predefining an acceptable accuracy c and then determining the optimal privacy leakage that guarantees the highest privacy protection and an accuracy greater than c [53], [70]. The second category consists of predefining the privacy leakage that should be guaranteed and then determining the learning model parameters that maximize the accuracy [84]. The third category consists of adding noise based on the relevance of each input feature to the outputs [66], [67].

B. PRIVACY-PRESERVING QUERY RESULTS
Privacy leakage may occur even if an adversary does not have direct access to the dataset but can perform some count or summation queries on it [85], [86]. Figure 7 illustrates how DP is used at the data curator to protect the clients' privacy. Specifically, the process involves three main steps. In step 1, users send their personal information to the data curator. In step 2, users' data is protected and aggregated in a database. In step 3, data analysts interact with the database via queries or request the whole dataset for training purposes. Before responding to the data analysts' queries, the data curator guarantees the users' privacy by adding random noise either to the query results or to the values of the attributes of the dataset.
Earlier, Hay et al. [75] propose to add Laplace noise Lap(1/ε) to the set of query results q and to send the noisy results q̃ to the data analyst. The resulting outputs are evaluated against a set of constraints to guarantee the consistency of the results. When the noisy results are inconsistent, a post-processing step called constrained inference is added to calculate q̄, a new consistent set of query results. q̄ is the minimum L2-norm solution, i.e., the set closest to q̃ that satisfies the predefined constraints. Xiao et al. [76] propose two algorithms, 1) a cell-based algorithm and 2) a k-d tree-based algorithm, for partitioning the dataset and then adding Laplace noise to the result of the query on each partition before aggregating these results and responding to the main query. Cormode et al. [87] generalize the approaches proposed in [75], [76] and propose to add non-uniform noise to the query results. That is to say, instead of adding noise with the same privacy budget to each partition, the authors propose to define a specific privacy budget for each partition.
There are many other works [77], [78], [88]-[90] that handle the problem of privacy when responding to a batch of queries. They use ε-DP to introduce Laplace noise and perturb the query results to protect the users' privacy. For example, Huang et al. [78] decompose the original query set Q into orthogonal query subsets to construct another query set Q̃, such that each query from Q can be represented by elementary queries q̃_i from Q̃. The Laplace noise is added to the query set Q̃ instead of Q, which reduces the noise variance and thus leads to better efficiency. Li et al. [77] represent the main query as a set of linear base queries on the dataset's attributes. These linear queries are represented as a matrix where each row contains the coefficients of a linear query. Then, the authors apply DP to the matrix to get a PP result of the main query.
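The partition-then-perturb strategy can be illustrated with a one-dimensional histogram. This is a simplified sketch, not the cell-based or k-d tree algorithm of [76]: it assumes a fixed, data-independent grid of cells over a known range, counting queries of sensitivity 1, and the same budget ε for every cell (the cells are disjoint, so the whole release is assumed to cost ε by parallel composition). Function names are hypothetical.

```python
import random


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Lap(scale) as the scaled difference of two Exp(1) variables."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_cell_counts(values, lo, hi, num_cells, eps, rng):
    """Count values per cell of a fixed grid on [lo, hi), then add Lap(1/eps)."""
    width = (hi - lo) / num_cells
    counts = [0] * num_cells
    for v in values:
        if lo <= v < hi:
            cell = min(int((v - lo) / width), num_cells - 1)
            counts[cell] += 1
    return [c + laplace(1.0 / eps, rng) for c in counts]


def range_count(noisy_counts, first_cell, last_cell):
    """Answer a range query by summing the noisy counts of the covered cells."""
    return sum(noisy_counts[first_cell:last_cell + 1])


rng = random.Random(42)
ages = [23, 25, 31, 38, 41, 44, 47, 52, 58, 63, 67, 71]
noisy = noisy_cell_counts(ages, lo=20, hi=80, num_cells=6, eps=1.0, rng=rng)
print("noisy counts per decade:", noisy)
print("noisy count for ages 40-59:", range_count(noisy, 2, 3))
```

A range query that covers many cells accumulates the noise of each cell, which is the motivation for the hierarchical and non-uniform budget allocations discussed above.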

C. PRIVACY-PRESERVING DATASETS
Nowadays, many organizations release their clients' datasets to third parties for training DL models that help in making decisions [91]-[93]. However, providing a dataset to a third party may violate users' privacy and breach privacy laws [94], [95]. Therefore, there is a great need for mechanisms that allow releasing datasets for analysis without revealing users' private or sensitive information. DP has been proven to provide strong privacy protection while allowing dataset analysis. This is a hot research topic where research work can be divided into two categories: 1) The first category is the works producing pre-processed datasets that balance accuracy and privacy for a specific learning model, such as frequent itemset mining models [71], [96], [97], and classification and clustering models [53], [72], [98], [99]. 2) The second category is the works producing a PP version of the original dataset [56], [73], [100], [101]. Next, we present the relevant research work based on DP to protect users' privacy before publishing a dataset.
Wang et al. [53] propose a differentially private approach for cluster analysis on heterogeneous datasets.⁵ The original dataset D is pre-processed using a clustering algorithm to get an initial cluster structure D*. Then, the authors apply DP to the new dataset D* to get the anonymized dataset D′. Sun et al. [72] normalize the dataset rows, which makes the dataset distribution more concentrated. Afterward, the authors use a classification and regression tree (CART) [102] to apply DP as a function of the relevance/impact of each attribute on the classification results.
Lee et al. [71] propose an approach to release a noisy dataset for frequent itemset learning. First, the algorithm takes an integer k and identifies the top k most frequent itemsets by running a frequent itemset mining algorithm [103]. After that, the algorithm builds an ε-differentially private FP-tree [104] that is released to the analyzer. The privacy allocation is based on two phases: 1) perturbing the threshold τ, given by the support σ_k of the k-th itemset, to obtain τ̃ = σ_k + Lap(·), and 2) adding Laplace noise to the originally calculated support σ(X).
Fioretto et al. [56] handle the problem of releasing a dataset of a large population without leaking sensitive information about individuals. The original dataset is restructured into a tree T of levels and groups (e.g., level 1 may designate the country, level 2 the state, and so on, and a group may be the households owning three cars). The DP mechanism consists of adding noise that should satisfy three conditions. i) Consistency: The sum of the groups' sizes of a specific level r and group s after adding the noise should be equal to the sum before. ii) Validity: The size of a specific level r and a group s after adding the noise should be a non-negative integer. iii) Faithfulness: The group sizes at each level l of the hierarchy should be equal to the total count of groups G. The problem is solved using three approaches: 1) a direct optimization-based mechanism, 2) a dynamic programming mechanism, and 3) a polynomial-time mechanism that exploits the structure of the cost tables.
Tang et al. [73] present a stronger privacy protection approach called differentially private latent tree (DPLT). It consists of generating a new synthetic dataset from vertically partitioned data (i.e., the dataset is shared between many data curators, where each one holds some attributes of the dataset and all data curators share a common identifier attribute). The approach is based on the latent tree model (LTM) [100] and contains three main steps: 1) generating latent attributes by condensing original attributes and adding Laplace noise to guarantee ε-DP, 2) quantifying the correlation (i.e., mutual information) between any two latent attributes, and finally, 3) constructing the latent tree T based on the previously calculated correlations. The authors assume that each data curator uses the same privacy budget ε, which may decrease the accuracy and/or leak private data. In fact, the privacy budget may depend on the attributes; hence, determining the best privacy budget for each curator is yet to be explored. Mohammed et al. [101] handle the same problem; however, the proposed approach is heavy in terms of computation and communication and is only applicable to two data curators.
There are many works that specifically handle the case of sequential datasets⁶ [74], [105], [106]. The most relevant one is [74], which handles the problem of releasing sequential datasets while guaranteeing privacy. It produces a new sequential dataset D̃ based on the n-gram model, which provides a good trade-off between storage and accuracy; n-gram models rely on the Markov independence assumption to estimate the probability of the new node in the leaf. Privacy is guaranteed by adding adaptive Laplace noise (with regard to the privacy budget and the length of a root-to-leaf path) to each node in the tree.
Fan et al. [79] propose FAST, a framework to collect time-series statistics. First, the time-series data is sent from clients (i.e., end-users) to a trusted server, and then the trusted server sends the collected data to third parties. The trusted server guarantees data privacy before sharing the data with third parties. The FAST framework is based on a filtering component that consists of adding Laplace noise to the data points received from end-users in real time (called sampling points), or predicting the data points when they are not received from end-users (called non-sampling points). This framework is not efficient when the response time is crucial. In addition, the privacy of users is not fully protected, as the server has access to users' information.
Huang et al. [80] propose an approach to generate a PP dataset. The authors handle the special case of social networks, where the dataset is represented by a graph G(V, E). They apply a classification algorithm based on K-means clustering to classify the graph into T groups.
Afterward, the authors apply four different privacy protection algorithms to protect nodes, edges, degrees, and structure of the graph:
1) The first algorithm protects the graph's structure by adding noise to the original graph after the clustering to get a new PP graph denoted by G′(V′, E′). The resulting graph is different from the original one.
2) The second algorithm protects nodes by adding Laplace noise to each group. The noise is randomly added to the nodes.
3) The third algorithm disturbs the degree sequence to protect specific nodes from identification by an attacker.
4) The fourth algorithm is a post-processing step that consists of adding noise to nodes with a small degree to protect edges.
This approach [80] provides a higher privacy protection level and data availability compared to the approaches proposed in [107] and [108]; however, the authors did not evaluate the impact of the proposed approach on the accuracy.

IV. LOCAL DIFFERENTIAL PRIVACY FOR DEEP LEARNING
Local DP (LDP) has been proposed to ensure users' privacy when individuals do not trust the central data curator (i.e., the cloud or the central server). Each client adds noise to, or falsifies with a certain probability, their answer before sending it to the central data curator [109]. LDP was implicitly introduced in [38], [50] and first formalized in [37]. Although the idea of LDP is relatively old, it has only recently seen many real applications, such as privately collecting data [33], [35], [36] and privately training FL models [110]. In the rest of this section, we present the relevant works and ideas based on LDP to protect users' privacy from inference and inversion attacks in FL. In addition, we go over the three real applications implementing LDP for privately collecting data from end-users. Figure 8 illustrates how LDP collects data from end-users while ensuring users' privacy. Specifically, the process involves three main steps. In step 1, users send their personal data after introducing noise to their responses. In step 2, the data curator collects the PP data from users and stores it in the database. In step 3, the data analysts can interrogate the database by launching direct queries or requesting the whole dataset. Unlike CDP, where noise is added by the data curator, in LDP the data curator does not add any noise, as the noise has already been added at the users' level in step 1.
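The "falsify the answer with a probability" idea corresponds to the classical randomized response technique, sketched below for a single yes/no attribute. The sketch assumes the standard ε-LDP variant in which each user reports the true bit with probability e^ε/(1 + e^ε); the function names are hypothetical, and the curator-side estimator only debiases the aggregate and never sees the true bits.

```python
import math
import random


def truth_probability(eps: float) -> float:
    """Probability of reporting the true bit under eps-LDP randomized response."""
    return math.exp(eps) / (1.0 + math.exp(eps))


def randomized_response(true_bit: int, eps: float, rng: random.Random) -> int:
    """User side: keep the bit with probability p, flip it otherwise."""
    if rng.random() < truth_probability(eps):
        return true_bit
    return 1 - true_bit


def estimate_fraction(reported_mean: float, eps: float) -> float:
    """Curator side: unbiased estimate of the true fraction of 1s."""
    p = truth_probability(eps)
    return (reported_mean - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(7)
true_bits = [1] * 3000 + [0] * 7000            # 30% of users hold the attribute
reports = [randomized_response(b, eps=1.0, rng=rng) for b in true_bits]
reported_mean = sum(reports) / len(reports)
print("raw reported mean:", reported_mean)
print("debiased estimate:", estimate_fraction(reported_mean, eps=1.0))
```

The variance of the debiased estimate grows as ε decreases, which is the local counterpart of the accuracy and privacy trade-off observed in the central setting.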

A. PRIVACY-PRESERVING FEDERATED LEARNING
FL [110] is a machine learning setting where many devices (e.g., mobile devices, laptops, organizations, etc.) collaboratively train a learning model under the orchestration of a central server. The main property of FL is that the dataset is not centralized, i.e., each device trains the model on its local dataset and then sends the updated parameters (e.g., the gradient) to the central server. We refer the reader to [121], which presents in detail FL's characteristics, challenges, and research directions.
FL provides a significant privacy improvement, as the local data of users is not explicitly sent to the central server. However, this is not enough for strong protection of users' privacy, since a malicious user (or the server) could reconstruct the users' local dataset using only the local gradient sent to the server [10], [122], [123]. There are many recent research works tackling the problem of users' privacy in FL. Table 4 summarizes the works presented in this subsection, showing the key idea and the final objective of each contribution along with the type of probability distribution (i.e., type of noise). We divide these works into three main categories.
The first category is the works combining DP with another method/tool (e.g., homomorphic encryption, secure multiparty computing, etc.) to protect the clients' privacy. Gong et al. [118] propose a framework based on DP and homomorphic encryption for protecting the privacy of clients participating in an FL model. DP is used to protect the gradients of clients from the central server by adding noise drawn from a Laplace distribution. The privacy budget increases dynamically with iterations (i.e., epochs): in the first epoch the privacy budget is ε_min, and in subsequent epochs it is increased to ε_min + c × (ε_max − ε_min)/γ, where c is the current epoch and γ is the number of epochs needed to reach ε_max. Homomorphic encryption is used by each client to encrypt the gradient sent to the server. It protects the gradients of clients from a malicious server; in this case, all clients should share the same encryption key. Hence, even when the server colludes with a client and gets the key, it will not be able to retrieve the true values of the clients' gradients, as they are already protected using DP. However, the authors in [9] were recently able to infer useful information about the original dataset even when gradients were protected using DP.
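The dynamic budget schedule described above can be sketched as follows. The sketch assumes that epochs are counted from c = 0 (so the first epoch uses ε_min), that the budget is capped at ε_max once it is reached, and that gradients have a known sensitivity; the cap, the function names, and the example values are illustrative assumptions rather than details taken from [118].

```python
def epoch_budget(c: int, eps_min: float, eps_max: float, gamma: int) -> float:
    """Privacy budget of epoch c: linear ramp from eps_min to eps_max in gamma epochs."""
    if gamma <= 0:
        raise ValueError("gamma must be a positive number of epochs")
    step = (eps_max - eps_min) / gamma
    return min(eps_max, eps_min + c * step)   # assumed cap at eps_max


def laplace_scale(c: int, eps_min: float, eps_max: float, gamma: int,
                  sensitivity: float = 1.0) -> float:
    """Scale of the Laplace noise added to the gradient at epoch c."""
    return sensitivity / epoch_budget(c, eps_min, eps_max, gamma)


for c in range(0, 12, 2):
    print(c, epoch_budget(c, 0.1, 1.0, 9), laplace_scale(c, 0.1, 1.0, 9))
```

Under this schedule the noise is largest in the early epochs and decreases as training proceeds, at the cost of a larger cumulative leakage that still has to be tracked through composition.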
Li et al. [120] propose a privacy-preserving FL framework based on secure multiparty computing called chain-PPFL. This approach is similar to [31], except that here the authors rely on the principle of secure multiparty computing instead of DP. First, the server sends the global model to all clients and initializes a token τ ∈ R^d (where d is the dimension of the gradient) with a random value. This token is sent to one client chosen among all clients. This client updates the token by adding its gradient to it and sends the updated token to a client randomly chosen from its neighbors, and so on, until the last client sends the token to the server. The server subtracts the value initially assigned to the token and calculates the global weights of the next round. This process is repeated until the learning model converges. Using this approach, the privacy of clients is protected from their neighbors and also from the server, since the server receives the aggregated local gradients and is not able to distinguish the local gradient of each client. Compared with other privacy-preserving approaches based on DP [34], [111], [112], chain-PPFL provides strong privacy (equivalent to FL with 0-DP) if clients do not collude with the server to attack a specific client, which is not always guaranteed. In terms of accuracy, chain-PPFL provides higher accuracy, as the noise added to the token at the beginning is subtracted at the end when the server receives the aggregated local gradients. The major issue of this approach is that the clients should trust each other and not collude to attack one of them. In addition, using this decentralized strategy to aggregate local gradients makes the FL network vulnerable to label-flipping and data poisoning attacks [124], [125]; besides, it is difficult for the server to distinguish malicious from legitimate clients.
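The token-passing aggregation can be sketched in a few lines. This is a simplified, single-process illustration of the masking idea only: the network, the random choice of neighbors, and any collusion model are left out, and the mask range and function name are arbitrary assumptions.

```python
import random


def chain_aggregate(gradients, rng: random.Random):
    """Sum client gradients through a masked token, as seen by the server."""
    dim = len(gradients[0])
    # Server: initialize the token with a random mask known only to itself.
    mask = [rng.uniform(-1e3, 1e3) for _ in range(dim)]
    token = list(mask)
    # Clients: each one adds its local gradient and forwards the token.
    for grad in gradients:
        token = [t + g for t, g in zip(token, grad)]
    # Server: remove the mask to recover the sum of all local gradients.
    return [t - m for t, m in zip(token, mask)]


rng = random.Random(3)
local_gradients = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
total = chain_aggregate(local_gradients, rng)
print("aggregated gradient:", total)
print("average gradient:", [x / len(local_gradients) for x in total])
```

Each intermediate client only sees the mask plus the partial sum of its predecessors, while the server only sees the final sum, which is why no noise remains in the aggregated result.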
Wu et al. [116] propose to use DP for multi-task learning models. As the multi-task learning paradigm [126], [127] leverages useful knowledge in multiple tasks to improve the generalization performance of all tasks, the authors propose that each device in FL learns a task-specific parameter ω_i with the objective function f_i. The parameter ω_i is sent to the global server for learning the global task parameter ω_{M+1}, which in turn is sent back to the clients, and so on until the algorithm converges. To ensure privacy, each client perturbs its own parameter with Gaussian noise whose variance depends on Δf and ε, where Δf is the sensitivity of the average of the local gradients and ε is the privacy budget. The advantage of this approach is its ability to learn over multiple clients holding heterogeneous datasets; however, it is vulnerable to label-flipping attacks and also to model inversion attacks [9].
The second category is the works based on DP and the structure of the FL network (e.g., adding a proxy server, using a decentralized architecture) to protect the privacy of clients. Cyffers et al. [31] propose a new relaxation of LDP that allows analyzing data belonging to various devices while achieving a good trade-off between utility and privacy. FL is done by peer-to-peer communication from one node to another without a central server handling the communication. The proposed approach is a fully decentralized protocol where participants have only a local view of the studied system. The learning is carried out by a token τ transiting in the network. The token is updated sequentially by the device receiving it. Before releasing the token, each node adds random noise to its contribution to ensure differential privacy. This process is repeated K (a predefined value) times before getting the final model. The main issue of this contribution is its vulnerability to label-flipping and data poisoning attacks. An attacker could easily infiltrate the network and ruin the learning process.
Tran et al. [119] propose a framework, called Secure Decentralized Training Framework (SDTF), to protect the privacy of clients participating in training a decentralized FL model. The clients train a model without a server; however, at each epoch, they elect a master node (one of them) which calculates the global gradient and sends it to all nodes, and so on until the algorithm converges. Each client perturbs its local gradient before sending it to the master node to protect its privacy. This framework achieves good accuracy since the master node, before sending the updated global gradient, estimates the sum of all noises added by clients and removes it from the global gradient. However, this framework cannot protect the clients' privacy from the master node, and it cannot protect the privacy of a client if all other nodes collude against that client.
Through several experiments, Zhao et al. [34] illustrate that sharing only part of the gradient's parameters may almost achieve the accuracy of sharing all the parameters. Based on these results, the authors propose a PP learning approach which consists of sharing only some parameters of the local gradient and adding Gaussian noise to these parameters before sharing them with the server. The proposed approach relies on [64] to determine the amount of noise to add and also to control the privacy leakage through synchronization rounds (i.e., composition). Besides, the authors propose to add a proxy between the clients and the server to ensure the anonymity of clients, so that the server cannot tell which client a given gradient comes from. The authors propose a strong method for protecting the clients' privacy; however, it would be of great importance to evaluate the robustness of the proposed approach against inference and model inversion attacks [9].
Yin et al. [112] propose a PP approach that combines functional encryption and Bayesian differential privacy. The authors use functional encryption (FE) [128] to protect the communication between clients and the server. The proposed FE mechanism, called Multi-Input Function Encryption (MIFE), requires the help of a trusted third party (TTP) that provides a public key to clients and a private key to the server, used respectively to encrypt and decrypt the gradient. In addition, the authors rely on Bayesian DP [129] to 1) protect the privacy of clients by adding Gaussian noise to the gradient depending on the data distribution, and 2) track the privacy leakage due to the synchronization rounds of FL. In order to reduce the communication cost between the clients and the server and also reduce the computation cost at the server side, the authors propose a method called sparse differential gradient where clients, at each synchronization round, send the gradient to the server only if it has changed substantially (i.e., by more than a predefined threshold) compared with the gradient of the previous synchronization round. Although this approach secures the communication and protects the gradient from the server, it requires a trusted third party that may launch model inversion attacks using the gradient received from clients.
The third category is the works whose objective is to protect privacy using DP and at the same time reduce resource consumption, such as energy and communication overhead. Liu et al. [114] handle the problem of communication overhead and data privacy in federated edge learning for edge computing in the Industrial Internet of Things (IIoT). The authors propose: 1) an asynchronous model update to reduce the time that edge nodes wait for global model aggregation. Edge nodes send their gradients once they finish the local training without waiting for the next synchronization round by the server, which enhances the communication efficiency. 2) the use of LDP to mitigate gradient leakage attacks. The LDP mechanism is deployed at the edge nodes to protect the gradient. 3) a cloud-side detection method to detect malicious nodes. The detection of malicious nodes relies on the accuracy of the model parameters (e.g., the gradient) sent by these nodes to the server. A node is characterized as malicious if the accuracy of its model parameters is lower than a dynamically calculated threshold ρ_th. However, the proposed malicious node detection mechanism may exclude some legitimate nodes from participating in the learning process. Indeed, the accuracy of these legitimate clients may be lower than ρ_th in the first training rounds.
Sonee et al. [115] handle the problem of privacy and communication for training a federated stochastic gradient descent (SGD) model where the communication between the clients and the server takes place over a multiple access channel (MAC). The problem is formulated as an optimization problem aiming to determine the transmission rates allocation for the clients in the MAC to achieve the maximum convergence rate while satisfying the privacy and communication constraints.
Hu et al. [84] tackle the problem of resource constraints, accuracy, and privacy using FL in the Internet of Things (IoT). The authors' contribution is based on the assumption that each device should perform multiple local training epochs before sending the model updates to the server, instead of sending the model updates after each training epoch. This reduces the number of communication rounds and hence reduces the communication overhead. The problem has been formulated as an optimization problem to find the best model parameters that satisfy the constraints on t and ε, where t is the number of epochs an IoT device should perform before sending the updates to the central server and ε is the minimum achievable privacy.
Mahawaga et al. [117] address privacy in the case where multiple clients train a DL model based on a convolutional neural network (CNN). The first convolutional layers of the model are placed at the clients' side along with a new layer called LATENT. This new layer is responsible for protecting the privacy of clients via LDP. The authors propose an approach called modified optimized unary encoding (MOUE) [130], which randomizes the 1 bits and the 0 bits of the binary vector differently before sending the output to the server. The last layers are placed at the cloud server, which communicates with the clients via software-defined networks (SDN) and network function virtualization (NFV). The simulation results show that the proposed approach achieves high accuracy (up to 90%) with a privacy budget of ε = 0.5. However, the authors did not evaluate the privacy leakage due to composition.
Kim et al. [32] study the trade-off between the privacy budget, utility, and communication rate for an SGD FL model. The authors characterize the Gaussian noise variance σ² required to guarantee a target privacy budget ε after T rounds of weight updates between the clients and the server. The authors compare their work to [49], [61], [62], [64] and find that their approach requires the smallest noise variance for the same privacy budget ε.
Wei et al. [113] improve on the work of [64] for controlling the privacy leakage of DP under sequential composition. They provide an explicit expression for the standard deviation σ_i of the Gaussian distribution that client i should use to guarantee a privacy leakage of (ε_i, δ_i)-LDP at the end of T synchronization rounds. The expression for σ_i depends on the sampling ratio q (i.e., the ratio of clients chosen randomly by the server to participate in synchronization round t) and on T, the total number of communication rounds. The authors also propose an algorithm called communication rounds discounting (CRD) that allows the server to adjust, during training, the total number of communication rounds T to an optimal value that yields better convergence performance. Although the authors obtain important results in controlling the privacy leakage under composition and enhancing convergence performance, they did not evaluate their contribution against model inversion attacks.
Yu et al. [111] use DP to protect the privacy of Internet-of-Things (IoT) devices while collaboratively training an FL model for content popularity prediction. The proposed approach, called FL-based Cooperative Hierarchical Caching (FLCH), keeps data local and trains the model using Fog Access Points (F-APs) together with their connected IoT devices. The F-AP is responsible for constructing the global model by applying the weighted averaging method [110] to the gradients received from the IoT devices. These IoT devices add Gaussian noise to their gradients before sending them, in order to protect their privacy. However, applying DP alone is not sufficient to protect the privacy of IoT devices, as an attacker could recover the original data from the noisy gradients sent to the F-AP [9].
Finally, Farokhi et al. [131] prove an important result about the relationship between the privacy budget, the size of the dataset, and the loss function. The authors study the cost of privacy when training differentially private models asynchronously, with asynchronous communication with the different clients. The cost is defined as the mean of a loss function f(θ) that captures the distance between the output of the ML model M(x; θ) and the true output y. The authors find that the cost is inversely proportional to the square of the combined size of the training datasets and to the square of the privacy budgets.

B. PRIVACY-PRESERVING STATISTICS COLLECTION
LDP consists of applying DP at the user devices: the user protects the privacy of his/her data before sending it to the data curator. LDP can therefore be used for collecting data from end-users who do not trust the data curator. LDP can be implemented either by adding noise drawn from a probability distribution that satisfies ε-DP (such as Laplace, Gaussian, etc.), as stated in Section II, or by implementing the randomized response (RR) [29], [132], [133] technique.
The RR technique consists of flipping the true answer of the user with a certain probability before sending it to the data curator. For example, suppose a social scientist wants to collect statistics from users about drug addiction while maintaining privacy. Before responding to the question, the user tosses a coin: 1) if the coin comes up heads, he/she responds truthfully; otherwise, 2) the user tosses another coin and responds truthfully if it comes up heads, or flips his/her response if it comes up tails. Specifically, the RR technique is defined in terms of a first coin b as follows:
• if b = 0, then answer truthfully,
• otherwise, if b = 1, then toss another coin and, depending on its outcome, either answer truthfully or flip the response, as in the example above.
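The two-coin procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any cited system: the function names, the fair-coin probabilities, and the simulated population are assumptions made for the example. Under fair coins a report is truthful with probability 3/4, so the curator can recover the true "yes" rate π from the observed rate, since P(report = yes) = 1/4 + π/2.

```python
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Two-coin randomized response with fair coins (assumed)."""
    if rng.random() < 0.5:   # first coin: heads -> answer truthfully
        return truth
    if rng.random() < 0.5:   # second coin: heads -> answer truthfully
        return truth
    return not truth         # second coin: tails -> flip the answer

def estimate_true_rate(reports: list[bool]) -> float:
    """Invert P(report = yes) = 1/4 + pi/2 to estimate the true rate pi."""
    observed = sum(reports) / len(reports)
    return 2.0 * (observed - 0.25)

# Simulated population (illustrative values only).
rng = random.Random(0)
true_rate = 0.30
answers = [rng.random() < true_rate for _ in range(50_000)]
reports = [randomized_response(a, rng) for a in answers]
estimate = estimate_true_rate(reports)
```

With fair coins, the ratio between the probability of reporting "yes" when the truth is "yes" (3/4) and when it is "no" (1/4) is 3, which corresponds to ε = ln 3 for this particular choice of coins.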
RR is proved to satisfy ε-DP [132] and has recently been used in many applications [33], [35], [36]. In the rest of this subsection, we explain three practical implementations of ε-LDP along with RR by major technology organizations, namely Apple [36], Microsoft [33], and Google [35].

1) APPLE
Starting from the premise that ''understanding how people use their devices often helps in improving the user experience'', Apple implements LDP to learn how its users use their devices. It started by studying frequencies per element [36], specifically, estimating typed emojis per web domain. When a user types an emoji, this record is privatized via one of the three algorithms explained below and stored locally in a list on the user's device. After some time, the user's system randomly selects some records (i.e., emojis) from the stored list and sends them to the server. Before analysis, the server strips the privatized records of their IP addresses and any additional sensitive information. The three proposed algorithms for ε-LDP are detailed below:
• Private Count Mean Sketch (CMS) [134]: outputs a histogram of counts for a dataset of n records over a domain D. This algorithm is divided into two parts:
- The client encodes its response d as a vector v^(i) of size m using one of k hash functions (defined beforehand at the clients and the server). Before sending the response, each bit of v^(i) is flipped with probability 1/(1 + exp(ε/2)) to guarantee ε-differential privacy.
- The server introduces some noise to the i-th client's response ṽ^(i) before constructing the sketch matrix M, where each row j represents the sum of the responses of the users who selected the hash function indexed by j. Finally, the server estimates the count for the entry d ∈ D by debiasing the counts and averaging over the corresponding entries in M.
On the client side, privacy is guaranteed by applying a privacy budget to the hash function of size 256 bits (used to hash the word), in addition to another privacy budget applied to a substring (of length 2, constructed from the original string) that is appended to the original output. The server calculates a frequency oracle for the words and a frequency oracle for the substrings, then it calculates the heavy hitters to get the hash function h and forms the Cartesian product across the hash function and the substrings to create the dictionary.
Although the three proposed algorithms can guarantee strong privacy, they are greedy in terms of resources (i.e., computation and bandwidth). In addition, data elements (i.e., responses) with low frequency may not appear in the statistics at the server.
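The CMS client and server steps can be illustrated with the following Python sketch. It is a simplified reading of the description above, not Apple's implementation: the SHA-256-based `bucket` function, the ±1 encoding, all parameter values, and the emoji names are assumptions chosen for the example, and the server-side processing is written here as a debiasing transform followed by a collision correction.

```python
import hashlib
import math
import random

def bucket(j: int, d: str, m: int) -> int:
    """Hypothetical j-th hash function: maps element d to a bucket in [0, m)."""
    digest = hashlib.sha256(f"{j}:{d}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

def client_report(d: str, eps: float, k: int, m: int, rng: random.Random):
    """Pick one of k hash functions, encode d as a +/-1 vector, flip each bit."""
    j = rng.randrange(k)
    v = [-1] * m
    v[bucket(j, d, m)] = 1
    p_flip = 1.0 / (1.0 + math.exp(eps / 2.0))  # flip probability from the text
    noisy = [-b if rng.random() < p_flip else b for b in v]
    return noisy, j

def build_sketch(reports, eps: float, k: int, m: int):
    """Accumulate debiased reports into the k x m sketch matrix M."""
    c = (math.exp(eps / 2.0) + 1.0) / (math.exp(eps / 2.0) - 1.0)
    M = [[0.0] * m for _ in range(k)]
    for noisy, j in reports:
        row = M[j]
        for i, b in enumerate(noisy):
            # k * (c*b + 1) / 2 is an unbiased estimate of the unflipped indicator
            row[i] += k * (c * b + 1.0) / 2.0
    return M

def estimate_count(M, d: str, n: int, k: int, m: int) -> float:
    """Average the entries hit by d over all rows, then correct for collisions."""
    avg = sum(M[l][bucket(l, d, m)] for l in range(k)) / k
    return (m / (m - 1)) * (avg - n / m)

# Illustrative run with made-up data.
rng = random.Random(1)
eps, k, m = 4.0, 8, 128
data = ["smile"] * 2000 + ["heart"] * 1000 + ["star"] * 700 + ["cat"] * 300
reports = [client_report(d, eps, k, m, rng) for d in data]
M = build_sketch(reports, eps, k, m)
smile_estimate = estimate_count(M, "smile", len(data), k, m)
```

Changing the client's element changes at most two coordinates of the ±1 vector, and each coordinate contributes a likelihood ratio of at most exp(ε/2), which is why a flip probability of 1/(1 + exp(ε/2)) yields ε-LDP in this encoding.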

2) MICROSOFT
LDP does not guarantee strong privacy when data are collected repeatedly from the same individuals, for example when studying application usage behavior over several days to improve the user experience. The more often the same statistic is collected, the more an attacker can learn about the real values. For example, if the same statistic is collected at T time stamps, the privacy leakage increases from ε to T × ε (see Theorem 1). The Microsoft Research team, Bolin et al. [33], handle this problem and propose new LDP mechanisms for mean and histogram estimation. The first method is called the 1-bit LDP mechanism for mean estimation. It is inspired by [135]-[138], with more efficient communication and stronger protection for repeated data collection. The main idea is to send only one bit b_i(t) at time t for each response x_i(t), which takes values between 0 and m, where b_i(t) is independently drawn from the distribution:
$$ b_i(t) = \begin{cases} 1, & \text{with probability } \dfrac{1}{e^{\epsilon}+1} + \dfrac{x_i(t)}{m} \cdot \dfrac{e^{\epsilon}-1}{e^{\epsilon}+1}, \\ 0, & \text{otherwise.} \end{cases} $$
Therefore, the data collector calculates an estimate of the mean σ(t), for n individuals, as
$$ \sigma(t) = \frac{m}{n} \sum_{i=1}^{n} \frac{b_i(t)\,(e^{\epsilon}+1) - 1}{e^{\epsilon}-1}. $$
The above mechanism is proved to satisfy ε-LDP [33]. Based on the same principle, the authors propose another method, called the d-bit mechanism, for histogram estimation. In addition, the authors introduce the memoization technique to mitigate privacy leakage for continuously collected statistics. Memoization consists of memorizing the calculated 1-bit response for each specific counter value. At data collection time, the client sends the memorized responses without recalculating the 1-bit responses for counter values it has already encountered. Hence, an attacker will not learn much information even if he/she collects the client's responses over a very long time. This mechanism was first implemented in the Windows 10 Fall Creators Update to collect the number of seconds that a user has spent using a particular application.
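The 1-bit mechanism and its memoization can be sketched as follows. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, the mean estimator is the one obtained by inverting the response probability (each bit is rescaled so that its expectation equals x_i(t)), and memoization is reduced to a plain per-value cache as described in the text.

```python
import math
import random

def one_bit_response(x: float, m: float, eps: float, rng: random.Random) -> int:
    """Return 1 with probability 1/(e^eps+1) + (x/m)*(e^eps-1)/(e^eps+1), else 0."""
    e = math.exp(eps)
    p_one = 1.0 / (e + 1.0) + (x / m) * (e - 1.0) / (e + 1.0)
    return 1 if rng.random() < p_one else 0

def estimate_mean(bits: list[int], m: float, eps: float) -> float:
    """Rescale each bit so its expectation equals the hidden value, then average."""
    e = math.exp(eps)
    return (m / len(bits)) * sum((b * (e + 1.0) - 1.0) / (e - 1.0) for b in bits)

class MemoizingClient:
    """Hypothetical client that caches one response per counter value."""

    def __init__(self, m: float, eps: float, rng: random.Random):
        self.m, self.eps, self.rng = m, eps, rng
        self._memo: dict[float, int] = {}

    def respond(self, x: float) -> int:
        if x not in self._memo:  # randomize only the first time x is seen
            self._memo[x] = one_bit_response(x, self.m, self.eps, self.rng)
        return self._memo[x]

# Illustrative run with made-up usage counters in [0, m].
rng = random.Random(2)
m, eps = 100.0, 1.0
values = [rng.uniform(0.0, m) for _ in range(20_000)]
bits = [one_bit_response(x, m, eps, rng) for x in values]
true_mean = sum(values) / len(values)
mean_estimate = estimate_mean(bits, m, eps)

client = MemoizingClient(m, eps, rng)
repeated = [client.respond(42.0) for _ in range(10)]  # same bit every time
```

Because a repeated counter value always produces the cached bit, observing the client for many rounds reveals no more about that value than a single response does.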

3) GOOGLE
Earlier, Google [35] used ε-LDP in its algorithm called Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR). RAPPOR is used for privately collecting various types of client statistics, such as frequencies and histograms. However, the Microsoft approach [33] is less expensive in terms of computation and communication overhead.
In RAPPOR, first, the client's response v is hashed onto a Bloom filter B [139], [140] of size k using a hash function h. Second, each bit i in the Bloom filter B is flipped with a certain probability to obtain a noisy response vector B′, where p is a user-tunable parameter that controls the level of privacy protection. The resulting vector B′ is memoized and reused for all future responses equal to v. This memoization step is essential for protecting the user's privacy when data are collected repeatedly. Third, the client initializes a new bit array S to 0 and sets each bit in S with a probability that depends on the corresponding bit of B′. Fourth, the client sends the new response S to the data curator. It is worth mentioning that steps 3 and 4 allow RAPPOR to guarantee strong privacy protection even when the data are collected repeatedly over a long time. Even if an attacker gets access to all the individual responses, he/she can learn only the randomized response B′, not the true response B. Although RAPPOR provides a strong privacy guarantee, it is costly in computation and communication overhead. In addition, RAPPOR is not able to detect responses with low frequencies: when the number of distinct responses increases, their frequencies decrease proportionally and they become hard to detect.
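The four client-side steps can be sketched as below. This is an illustrative reading of the description, not Google's implementation: the class name, the SHA-256-based Bloom filter, and the parameter names `f_permanent`, `p_zero`, and `q_one` are hypothetical, and the specific form of the two randomization steps (keep, force to 1, or force to 0 in the memoized step; a bit-dependent probability in the per-report step) is an assumption about the probabilities left unspecified in the text.

```python
import hashlib
import random

def bloom_bits(value: str, k: int, num_hashes: int) -> list[int]:
    """Step 1: hash the value onto a Bloom filter of k bits."""
    bits = [0] * k
    for j in range(num_hashes):
        digest = hashlib.sha256(f"{j}:{value}".encode()).digest()
        bits[int.from_bytes(digest[:8], "big") % k] = 1
    return bits

class RapporClient:
    """Hypothetical client: memoized randomization plus per-report randomization."""

    def __init__(self, k, num_hashes, f_permanent, p_zero, q_one, rng):
        self.k, self.num_hashes = k, num_hashes
        self.f, self.p, self.q = f_permanent, p_zero, q_one
        self.rng = rng
        self._memo: dict[str, list[int]] = {}

    def permanent(self, value: str) -> list[int]:
        """Step 2: compute the noisy vector B' once per value and memoize it."""
        if value not in self._memo:
            noisy = []
            for bit in bloom_bits(value, self.k, self.num_hashes):
                u = self.rng.random()
                if u < self.f / 2:      # force the bit to 1
                    noisy.append(1)
                elif u < self.f:        # force the bit to 0
                    noisy.append(0)
                else:                   # keep the true Bloom filter bit
                    noisy.append(bit)
            self._memo[value] = noisy
        return self._memo[value]

    def report(self, value: str) -> list[int]:
        """Steps 3 and 4: build a fresh array S from B' and return it for sending."""
        b_prime = self.permanent(value)
        return [1 if self.rng.random() < (self.q if b else self.p) else 0
                for b in b_prime]

# Illustrative parameters only.
client = RapporClient(k=32, num_hashes=2, f_permanent=0.5,
                      p_zero=0.5, q_one=0.75, rng=random.Random(3))
first = client.report("example.com")
second = client.report("example.com")  # new S, but derived from the same B'
```

The design point the sketch tries to show is the separation of the two layers: B′ is drawn once and reused, so long-term observation converges on B′ rather than on B, while the fresh randomness in S makes individual reports harder to link.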

V. OPEN ISSUES AND FUTURE DIRECTIONS
A. COMPOSITION
One of the major shortcomings of differential privacy is that privacy degrades under composition. We can distinguish two scenarios:
1) Sequential querying. The privacy of a fixed pair of neighboring datasets D, D′ decreases under the composition of interactive queries: by launching several queries, an attacker could learn with some certainty whether an individual belongs to a dataset. The composition of k queries, each of which is (ε, δ)-differentially private, is at least (kε, kδ)-differentially private [61], [62], [141], [142]. Thus, sequential querying degrades privacy. This issue has been handled by Kairouz et al. (Theorem 9) [62], who answer the question: how much privacy is guaranteed after a k-fold composition experiment (i.e., after k database accesses)? Given k, the authors in [62] define a sequence of privatization mechanisms that guarantees an upper bound on the overall privacy level after the k queries. However, the remaining open questions concern scalability and consistency:
• Does the privacy leakage control proposed by [62] guarantee strong privacy protection, especially when k takes a large value?
• How does this approach impact the accuracy?
2) Stochastic gradient descent. The privacy protection of a learning model degrades with each stochastic gradient descent iteration. As with sequential queries, a composition of k SGD iterations, each of which is (ε, δ)-differentially private, is at least (kε, kδ)-differentially private [61], [62]. The released (i.e., trained) learning model becomes vulnerable to model inversion attacks [81] when the amount of privacy loss (i.e., kε) is large. One of the earliest works that handle this problem is [64], where Abadi et al. propose a method called moments accountant (MA) as a tool for tracking the privacy loss across multiple iterations. The MA approach uses Rényi differential privacy (RDP) [52], in which composition has a simple linear form. In this approach, the privacy budget of the SGD iteration is determined using RDP and is afterward mapped back to the standard (ε, δ)-DP by determining ε and δ via the relationship between DP and RDP (Theorem 2, [64]). However, this solution is loose, i.e., it does not provide a tight upper bound on the privacy budget. Asoodeh et al. [143] derive an approximation of the optimal DP parameters that should guarantee a given level of privacy for about 100 SGD iterations. However, this approach cannot provide strong privacy beyond 100 iterations, whereas the greatest need is for a solution that guarantees strong privacy while maintaining good accuracy regardless of the number of SGD iterations.
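To make the difference between the (kε, kδ) bound and RDP-based accounting concrete, the following Python sketch compares the two for k applications of a Gaussian mechanism. It is a simplified illustration, not the moments accountant of [64]: it assumes sensitivity 1, no subsampling, the commonly used Gaussian RDP bound α/(2σ²) at order α, and the usual conversion ε = ε_RDP(α) + ln(1/δ)/(α − 1); all function names and parameter values are chosen for the example.

```python
import math

def naive_composition(eps_step: float, delta_step: float, k: int):
    """Basic sequential composition: k steps cost (k*eps, k*delta)."""
    return k * eps_step, k * delta_step

def gaussian_rdp(alpha: float, sigma: float) -> float:
    """RDP of order alpha for a Gaussian mechanism with sensitivity 1 (assumed)."""
    return alpha / (2.0 * sigma ** 2)

def rdp_epsilon(sigma: float, k: int, delta: float, alphas) -> float:
    """Compose k steps linearly in RDP, then convert to (eps, delta)-DP."""
    best = math.inf
    for alpha in alphas:
        total_rdp = k * gaussian_rdp(alpha, sigma)       # linear composition
        eps = total_rdp + math.log(1.0 / delta) / (alpha - 1.0)
        best = min(best, eps)
    return best

# Illustrative parameters only.
sigma, k, delta = 10.0, 100, 1e-5

# Per-step epsilon from the classical Gaussian-mechanism calibration
# sigma = sqrt(2 ln(1.25/delta)) / eps, which applies for eps < 1.
eps_step = math.sqrt(2.0 * math.log(1.25 / delta)) / sigma
naive_eps, naive_delta = naive_composition(eps_step, delta, k)

alphas = [1.0 + 0.1 * i for i in range(1, 1000)]         # grid of RDP orders
rdp_eps = rdp_epsilon(sigma, k, delta, alphas)
```

Under these assumptions the RDP route yields a much smaller ε than the basic bound for the same number of iterations, and it does so at the original δ rather than kδ; with subsampled SGD the accounting is more involved than this sketch.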

B. EVALUATING DIFFERENTIAL PRIVACY RESISTANCE
Recently, Ren et al. [9] proposed using a Generative Regression Neural Network (GRNN) to attack privacy (i.e., to recover the original dataset) in FL using only the gradients that clients share with the server. They found that DP is the strongest approach for protecting privacy. The proposed attack fails to recover the original image when a high level of noise is added to the gradient; however, it succeeds in recovering the original image when only a small amount of noise (a noise scale of 0.01) is added. Nevertheless, adding a high level of noise leads to poor accuracy. Therefore, the most important question that needs to be answered is: what privacy budget gives good accuracy while guaranteeing strong privacy protection?

VI. CONCLUSION
In this paper, we provided a detailed survey on differential privacy and its applications. Differential privacy and local differential privacy guarantee strong protection of users' privacy in deep learning, federated learning, and data collection. However, differential privacy still suffers from the drawbacks of sequential composition: privacy degrades as the number of compositions increases. This has given rise to new variants of ε-differential privacy and opens new research directions for tracking the privacy leakage while ensuring a high level of accuracy.