Federated Learning-Based Privacy-Preserving Data Aggregation Scheme for IIoT

The Industrial Internet of Things (IIoT) is a key technology of Industry 4.0. The combination of machine learning and IIoT has spawned a thriving smart industry. Machine learning models are trained and make predictions on raw data that contains sensitive information, and sharing this data can leak it, so data security and privacy protection in IIoT face serious challenges. We therefore propose a federated learning-based privacy-preserving data aggregation scheme (FLPDA) for IIoT. Data aggregation protects individual users' model changes in federated learning against reverse analysis attacks by the industry management center. In each round of data aggregation, the PBFT consensus algorithm selects an IIoT device from the aggregation area as the initialization and aggregation node. The Paillier cryptosystem and secret sharing are combined to realize data fault tolerance and secure sharing. Security analysis and performance evaluation show that the scheme effectively protects data privacy and resists various attacks, with lower communication, computation, and storage overhead than existing schemes.


I. INTRODUCTION
Industrial devices are interconnected through the Internet of Things to form the Industrial Internet of Things (IIoT), one of the key technologies for achieving Industry 4.0. With the rapid development of wireless communication and the wide application of big data, artificial intelligence (AI), 5G, and other technologies, the IIoT is becoming intelligent, which has dramatically improved industrial productivity and efficiency [1]. A variety of advanced mobile devices have emerged in the IIoT, such as smart gateways, smart watches, and smartphones. The industrial data generated by the IIoT is exploited by AI through machine learning and applied to fields such as smart healthcare, autonomous driving, smart cities, smart homes, and live games [2], [3], [4]. Smart devices generate industrial data containing sensitive information that is transmitted, shared, and stored in the IIoT [5]. For example, in autonomous driving, smart devices provide users with navigation and emergency avoidance by sensing road conditions and real-time vehicle status [6]. If data security and user privacy are not guaranteed, attackers can tamper with private data, cause traffic accidents, and endanger personal safety. Therefore, it is crucial to ensure the data security and privacy of smart devices in the IIoT.
Machine learning offers advantages such as real-time prediction, improved industrial automation, and savings in time and cost. The combination of machine learning and the IIoT has spawned a booming intelligent industry. Traditional machine learning often requires collecting large amounts of raw data from IIoT devices (such as gateways) for training, requiring all IIoT devices to upload their data to a central server that then trains the learning models. The rapid development of machine learning brings convenience but also carries significant security risks. Machine learning models are trained and make predictions on raw data containing sensitive information, and sharing this data can leak it; yet without access to rich shared data, it is difficult to train high-precision models. Security and privacy concerns have become stumbling blocks for machine learning. How to use data in real time without disclosing its privacy is a problem that the Industrial Internet of Things must solve.
Federated Learning (FL) was proposed to address this issue [7]. It lets devices in the IIoT collaboratively learn a common model by exchanging model parameters with a central server rather than raw data. Training data in federated learning remains stored locally, eliminating the need to access local data directly and reducing the risk of privacy breaches. FL aggregates local model parameters from many IIoT devices to train a global model, significantly improving model accuracy [8]. However, during the transmission of federated learning models, adversaries can exploit the correlation between model outputs and sensitive data features to predict sensitive information from background knowledge and published models, increasing the risk of privacy leakage [9]. A single user usually has only a small amount of data to train the local model, and by comparing the trained and untrained local models, an adversary can easily infer the training datasets containing sensitive information. Thus, FL is vulnerable to reverse analysis attacks, and it remains a challenge to design FL solutions that meet the requirements of individual privacy and information security.
To solve the above problems, we propose a federated learning-based privacy-preserving data aggregation scheme (FLPDA) for IIoT, which combines federated learning with secret sharing to aggregate data. In this scheme, each round of federated learning elects the initialization and aggregation node through the PBFT consensus algorithm, which removes the dependence of existing schemes on trusted entities and dramatically reduces the risk of information leakage. Data aggregation is an effective method to protect data privacy and reduce communication overhead [10], [11], [12]; here it is used to protect individual users' model changes in FL and prevent reverse analysis attacks from the central server. The Paillier cryptosystem and secret sharing are combined to achieve fault tolerance, data security, and sharing.
The main contributions of this paper are as follows: (1) A federated learning-based privacy-preserving data aggregation scheme for IIoT is proposed. Through the PBFT consensus algorithm, one of the smart devices in the IIoT is selected as the initialization and aggregation node, so the scheme does not rely on a trusted authority or trusted third party. (2) Data aggregation is adopted to protect the model changes of industrial devices in FL and to resist reverse analysis attacks from the industry management center. Secret sharing is applied to share IIoT data while achieving fault tolerance. (3) The lower computation, communication, and storage overhead of FLPDA means better efficiency and higher execution speed, making it well suited for data aggregation in IIoT.

The remainder of this paper is organized as follows: Section II discusses related work. Sections III and IV present the relevant preliminaries and the proposed system model, respectively. Section V describes FLPDA in detail. Sections VI and VII carry out the security analysis and performance evaluation, respectively. Section VIII concludes this paper.

II. RELATED WORK
In recent years, the research on the privacy protection of federated learning and its application in IIoT has attracted extensive attention in the academic community.
Ma et al. [13] proposed a privacy-preserving multi-party framework for federated learning. Bilinear aggregate signatures are used to verify the correctness of the aggregation results, and all participants take part in the verification; therefore, the computational cost increases significantly with the number of participants. In [14], the authors combine the Harn-Gong key with data aggregation to propose a machine learning data aggregation scheme with non-interactive keys. The results show that the scheme can mask private data; however, it relies on a trusted authority to distribute the keys. The authors in [15] present a hybrid privacy-preserving federated learning scheme, which combines secure multi-party computation with differential privacy to reduce the noise injection that grows with the number of participants without sacrificing privacy. In [16], a privacy-preserving federated neural network learning scheme is proposed to solve the problem of neural network evaluation and privacy-preserving training in an N-square federated learning environment. Bonawitz et al. [17] designed a federated learning secure aggregation scheme for high-dimensional data, which allows the central server to securely compute the sum of data vectors from mobile devices with communication efficiency and fault robustness. Song et al. [18] proposed an efficient secret-sharing privacy-preserving FL data aggregation mechanism, which can aggregate user-trained models without revealing them, with efficient fault tolerance and resistance to reverse attacks. Schemes [15], [16], and [17] have high computation and communication costs and cannot resist replay attacks; scheme [18] also fails to defend against replay attacks.
Tian et al. [19] proposed a blockchain-based machine learning framework for IIoT. The scheme builds a new smart contract, which uses an aggregation strategy to verify and aggregate the model parameters to ensure the accuracy of the decision tree model. However, miners are assumed to be honest and trustworthy nodes.
All the above schemes proposed theoretical frameworks but did not design specific privacy-preserving federated learning methods.
Kong et al. [20] proposed a privacy-preserving model aggregation scheme based on a federated learning navigation framework. A homomorphic threshold cryptosystem combines the skip list and the bounded Laplace mechanism to protect locally trained model updates. Zhou et al. [21] proposed a blockchain-based federated data sharing scheme: a federated extreme gradient boosting learning algorithm is constructed to solve the data isolation problem, and a data sharing mechanism is designed to ensure secure on-demand controlled data sharing. Scheme [20] relies on a trusted authority (TA), and scheme [21] assumes that the key generation center (KGC) is a trusted entity. In practice, a fully trusted entity rarely exists and there is a risk of information-source leakage, so these schemes are impractical.
We propose a scheme combining PBFT, data aggregation, and federated learning. It removes the reliance of existing data aggregation solutions on trusted entities while resisting the reverse analysis attacks that afflict federated learning solutions.

III. PRELIMINARIES

A. FEDERATED LEARNING
FL is a distributed machine learning paradigm. The model is trained on the user side to protect the user's privacy, and the model update is then transmitted to the central manager for aggregation; raw data remains in local storage. During local model training, the central manager can access all local model updates and share model data with other servers in the aggregation area. The global model in the central manager is then updated and shared back to the local devices for further training [22]. FL is a typical example of machine learning and analytics on mobile wearable devices over 5G and later wireless networks, and it has been deployed in sensitive healthcare applications [23].
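The train-locally, aggregate-centrally loop described above can be sketched as follows. This is a minimal illustration, not the paper's configuration: the one-parameter least-squares model, learning rate, and client data are all made-up assumptions.

```python
# Minimal sketch of one federated learning round: clients train locally,
# only model updates (not raw data) reach the central manager, which
# averages them into a new global model. Models are plain lists of weights.

def local_update(global_model, local_data, lr=0.1):
    """Toy local training: one pass of gradient steps for a 1-D model."""
    w = list(global_model)
    for x, y in local_data:
        pred = w[0] * x
        w[0] -= lr * 2 * (pred - y) * x  # gradient of (pred - y)^2 w.r.t. w[0]
    return w

def federated_average(client_models):
    """Central manager aggregates local updates by coordinate-wise averaging."""
    n = len(client_models)
    return [sum(m[i] for m in client_models) / n
            for i in range(len(client_models[0]))]

global_model = [0.0]
clients = [[(1.0, 2.0)], [(2.0, 4.0)], [(1.0, 2.1)]]  # raw data stays local
updates = [local_update(global_model, data) for data in clients]
global_model = federated_average(updates)
```

Note that the manager only ever sees the `updates` list, never `clients`; the attacks discussed later target exactly these exposed per-client updates.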

B. PBFT
The practical Byzantine fault tolerance (PBFT) consensus algorithm is a distributed voting mechanism. Its purpose is to solve the consensus problem in an N-node network with F concurrently faulty nodes, where N satisfies N ≥ 3F + 1 [24].
PBFT proceeds in three phases: pre-prepare, prepare, and commit, which solve the Byzantine Generals Problem [25].
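The fault-tolerance bound N ≥ 3F + 1 and the resulting voting quorum can be checked with a small sketch; the function names are ours, not from [24].

```python
# Toy illustration of PBFT fault-tolerance bounds: a network of N nodes
# reaches consensus despite F Byzantine nodes only if N >= 3F + 1, and the
# prepare/commit phases each require a quorum of 2F + 1 matching votes.

def max_faulty(n):
    """Largest F such that n >= 3F + 1."""
    return (n - 1) // 3

def quorum(n):
    """Votes needed in the prepare/commit phases: 2F + 1."""
    return 2 * max_faulty(n) + 1

def consensus_reached(n, votes):
    """A request commits once matching votes meet the quorum."""
    return votes >= quorum(n)

# With 7 nodes the network tolerates 2 Byzantine nodes and needs 5 votes.
```

In FLPDA, each aggregation area runs this vote over its IIDs to elect the SN, so no pre-designated trusted node is needed.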

C. SECRET SHARING
Secret sharing is an additively homomorphic scheme that uses Shamir's technique to split a secret into multiple shares allocated to different members. An adversary that steals only some of the shares gains no information about the secret; the whole secret can be recovered only from at least L shares, where L is the threshold [26].
To split a secret, the following polynomial is chosen:

E(y) = α + p_1·y + p_2·y^2 + · · · + p_{L−1}·y^{L−1}

where α is the secret and L is the threshold value. Given L shares E(y_j), the Lagrange interpolation polynomial yields the coefficients γ_{y_j} = Π_{k≠j} y_k/(y_k − y_j), and α is then calculated as:

α = Σ_{j=1}^{L} E(y_j)·γ_{y_j}
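The split-and-reconstruct process above can be sketched over a prime field; the modulus and evaluation points y = 1..N are illustrative assumptions.

```python
# Sketch of Shamir (L, N) secret sharing, matching the polynomial E(y) and
# the Lagrange coefficients gamma_{y_j} defined above.
import random

Q = 2**61 - 1  # an illustrative prime modulus for the field

def split(secret, threshold_L, n_shares):
    """Evaluate E(y) = secret + p_1*y + ... + p_{L-1}*y^{L-1} at y = 1..N."""
    coeffs = [secret] + [random.randrange(Q) for _ in range(threshold_L - 1)]
    def E(y):
        return sum(c * pow(y, i, Q) for i, c in enumerate(coeffs)) % Q
    return [(y, E(y)) for y in range(1, n_shares + 1)]

def reconstruct(shares):
    """Recover the secret as sum_j E(y_j) * gamma_{y_j} (interpolation at 0)."""
    secret = 0
    for j, (y_j, e_j) in enumerate(shares):
        gamma = 1
        for k, (y_k, _) in enumerate(shares):
            if k != j:
                gamma = gamma * y_k * pow(y_k - y_j, -1, Q) % Q
        secret = (secret + e_j * gamma) % Q
    return secret

shares = split(42, threshold_L=3, n_shares=5)
recovered = reconstruct(shares[:3])  # any 3 of the 5 shares suffice
```

Any L of the N shares reconstruct α while fewer than L reveal nothing; this is also the basis of the fault tolerance analyzed in Section VI, where reconstruction succeeds from the shares of the surviving devices.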

IV. SYSTEM MODEL

A. COMMUNICATION MODEL
As shown in Figure 1, the hierarchical structure of industrial data communication is composed of IIoT devices (IID) and Industrial Management Center (IMC). The following presents the details of each component.
(1) IID: Each user is equipped with a wearable smart device for industrial data collection; each user corresponds to one IID. IIDs simultaneously and regularly collect IIoT data. P2P communication is used among all IIDs in each aggregation area. Each aggregation area adopts the PBFT consensus algorithm to select one IID from all IIDs as the system initialization and data aggregation node (SN). IIDs may stop reporting due to failure and reset later. IIDs are assumed to be honest but curious.

B. ADVERSARY MODEL
In our model, IMC and IIDs are semi-trusted. IMC or an IID does not arbitrarily tamper with its own industrial data, but it may want to gain access to other people's private industrial information and sell it to interested entities. Internal attackers may conspire to access other users' sensitive private industrial information. An external attacker may attempt to impersonate a legitimate entity (i.e., a smart industrial device in the aggregation area) and send relevant data on its behalf. In addition, external eavesdroppers may eavesdrop on network traffic to obtain industrial data and attempt to modify and forward it.

C. DESIGN GOALS
(1) Data security. It can resist various attacks. Even if the aggregated ciphertext of IIoT data collected by smart devices is intercepted, the IIoT data of a single smart device cannot be recovered.
(2) Privacy. Industrial data can be securely aggregated against internal and external attacks. No entity can obtain the industrial data of a single smart device.
(3) Fault tolerance. If a smart device cannot collect industrial data due to external malicious damage or failure, the utility of the system would be significantly compromised without fault tolerance. Even if some smart devices cannot collect or send industrial data, the system should still effectively aggregate the IIoT data of the remaining smart devices.

V. THE PROPOSED SCHEME
In this section, we introduce a federated learning-based privacy-preserving data aggregation scheme for IIoT. The notations are listed in Table 1.

A. INITIALIZATION
Suppose each aggregation area has N IIDs, recorded as a set P_g = {IID_1, IID_2, . . . , IID_N}. Some users do not participate in IIoT data aggregation because of specific concerns or because their IIDs malfunction. It is assumed that at least L IIDs are online and participate in aggregation; these IIDs constitute P_on ⊆ P_g. Through the PBFT consensus algorithm, each round of data aggregation selects one IID from the data aggregation area P_g as the system initialization and data aggregation node (SN).
B. ROUND 0 (KEY GENERATION)
IID: (2) IID_j selects s_j ∈ Z*_q as its private key and computes the corresponding public key P_j = s_j·g_0, then sends P_j to IMC. IMC: (1) IMC collects at least L messages from the IID_j of the previous round.
(2) Determine the number N of IIDs and the threshold value L.
(3) Broadcast the list of received public keys to the IIDs in P_on.

C. ROUND 1 (SHARES GENERATION)
IID: (1) Receive the global parameters broadcast by the server. Verify that |P_on| ≥ L.
(2) Generate secret shares: IID_j generates its polynomial E(y_j) = α + p_1·y_j + p_2·y_j^2 + · · · + p_{L−1}·y_j^{L−1} and γ_{y_j} = Π_{k≠j} y_k/(y_k − y_j), then sends Ts‖E(y_j)‖γ_{y_j} to IMC. IMC: (1) Forward the received shares to the IIDs in P_on.

D. ROUND 2 (CIPHERTEXT GENERATION AND SIGNATURE VERIFICATION)
IID: (1) IID_j generates IIoT data d_j at timestamp Ts and computes H_0(Ts), then selects r_j ∈ Z*_N to generate the ciphertext C_j = g_1^{d_j} × r_j^N × H_0(Ts)^{E(y_j)·γ_{y_j}} mod N^2. (2) IID_j generates the signature σ_j = s_j · H_2(C_j‖P_j‖H_1(Ts‖E(y_j)‖γ_{y_j})).
(3) IID_j sends C_j‖P_j‖H_1(Ts‖E(y_j)‖γ_{y_j})‖σ_j to IMC and SN. IMC: (1) After receiving C_j‖P_j‖H_1(Ts_j‖E(y_j)‖γ_{y_j})‖σ_j, IMC batch-verifies the signatures by checking e(Σ_{j=1}^{L} σ_j, g_0) = Π_{j=1}^{L} e(H_2(C_j‖P_j‖H_1(Ts_j‖E(y_j)‖γ_{y_j})), s_j·g_0). (2) Forward the batch signature verification result to SN.
E. ROUND 3 (CIPHERTEXT AGGREGATION)
SN: (1) SN aggregates the received ciphertexts into C. (2) SN sends C to IMC. IMC: (1) After the batch signature verification passes, IMC arbitrarily chooses L shares of E(y_j)γ_{y_j} from the received N shares to reconstruct the secret.

F. ROUND 4 (CIPHERTEXT DECRYPTION)
IMC uses the private key (λ, µ) to decrypt C and obtain the aggregated data of the IIDs.
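The aggregate-then-decrypt step can be sketched with textbook Paillier over toy primes. This is an illustration only: real key sizes are far larger, the H_0(Ts) masking exponent of the scheme is omitted, and all parameter values are made up.

```python
# Sketch of Paillier's additive homomorphism: SN multiplies ciphertexts,
# and IMC's decryption with (lambda, mu) yields the sum of the plaintexts.
from math import gcd

p, q = 17, 19
n = p * q                                      # public modulus N
n2 = n * n
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lambda = lcm(p-1, q-1)
L = lambda u: (u - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)            # mu = L(g^lambda mod n^2)^-1 mod n

def encrypt(m, r):
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    return L(pow(c, lam, n2)) * mu % n

data = [5, 7, 11]                              # per-device readings d_j
cts = [encrypt(d, r) for d, r in zip(data, [2, 3, 5])]
agg = 1
for c in cts:                                  # SN aggregates by multiplication
    agg = agg * c % n2
total = decrypt(agg)                           # IMC decrypts the sum via (lambda, mu)
```

Decrypting the single product `agg` reveals only the sum 5 + 7 + 11, which is exactly why IMC learns the aggregate but not any individual d_j.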

VI. SECURITY ANALYSIS

A. DATA INTEGRITY
The FLPDA scheme adopts the BLS short signature to sign the private data and aggregated data of IIDs.
For the message C_j‖P_j‖H_1(Ts_j‖E(y_j)‖γ_{y_j})‖σ_j sent by IID_j, IMC first checks P_j and H_1(Ts_j‖E(y_j)‖γ_{y_j}), and then verifies the integrity of the message by checking whether e(Σ_{j=1}^{L} σ_j, g_0) = e(Σ_{j=1}^{L} s_j·H_2(C_j‖P_j‖H_1(Ts_j‖E(y_j)‖γ_{y_j})), g_0) holds. Every element of the message is involved in validation, and any tampering with a message makes the batch validation fail. Therefore, IMC can verify the integrity of the messages sent by IID_j.
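The batch check can be illustrated in a toy discrete-log model. This is a sketch only: real BLS uses elliptic-curve pairings, whereas here each group element is represented by its discrete log modulo a prime q, so e(a·g_0, b·g_0) becomes a·b mod q and the product in the target group becomes a sum; the prime, keys, and messages are made-up assumptions, not a secure implementation.

```python
# Toy arithmetic model of BLS batch verification: one pairing equation
# checks all L signatures sigma_j = s_j * H2(m_j) at once.
import hashlib

q = 2**61 - 1                    # illustrative group order
g0 = 1                           # generator, represented by its discrete log

def H2(msg):                     # hash-to-group, modeled as hash-to-exponent
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % q

def pairing(a, b):               # e(a, b) in the discrete-log model
    return a * b % q

keys = [123456789, 987654321, 555555555]            # private keys s_j
msgs = [b"C_1||P_1||H_1(...)", b"C_2||P_2||H_1(...)", b"C_3||P_3||H_1(...)"]
sigs = [s * H2(m) % q for s, m in zip(keys, msgs)]  # sigma_j = s_j * H2(m_j)

# Batch check: e(sum_j sigma_j, g0) == prod_j e(H2(m_j), s_j * g0)
# (the target-group product is a sum in this exponent representation)
lhs = pairing(sum(sigs) % q, g0)
rhs = sum(pairing(H2(m), s * g0 % q) for s, m in zip(keys, msgs)) % q
```

Altering any single σ_j shifts the left-hand side and breaks the equality, which is what makes tampering detectable in a single batch verification.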

B. PRIVACY-PRESERVATION
Attackers can be divided into internal attackers and external attackers. The internal attackers include IMC and IIDs in this aggregated area who seek to compromise the privacy of other IIDs. External attackers are entities that are not in this aggregated area.
Theorem 1: FLPDA scheme is resistant to external attacks, i.e., it is computationally infeasible for an external adversary to obtain d j from C j .
Proof: When an external attacker compromises an IID, it can obtain the ciphertext C_j = g_1^{d_j} × r_j^N × H_0(Ts)^{E(y_j)·γ_{y_j}} mod N^2 sent by that IID. Since the attacker knows neither the decryption key λ nor the shared keys of the other L−1 users, it cannot obtain the plaintext. Theorem 2: The FLPDA scheme is resistant to internal attacks, i.e., it is computationally infeasible for an internal adversary (IID or IMC) to extract d_j from C_j.
Proof: Another user IID_k (k ≠ j) cannot extract d_j from C_j because it does not know H_0(Ts)^{E(y_j)·γ_{y_j}}. Even if malicious IIDs obtained H_0(Ts)^{E(y_j)·γ_{y_j}} of IID_j, they do not know the decryption key λ of the Paillier encryption algorithm, so they still cannot obtain the plaintext of IID_j. IMC does not know H_0(Ts)^{E(y_j)·γ_{y_j}} of IID_j and cannot derive the real-time data of IID_j; it can only obtain the aggregated data of all IIDs. Therefore, the FLPDA scheme resists internal attacks.
Theorem 3: FLPDA scheme can resist the reverse analysis attack.
Proof: Suppose IMC is a potential adversary trying to obtain the individual data of each device. After IMC receives the aggregated ciphertext C sent by SN, decryption via (λ, µ) yields Σ_{j=1}^{N} d_j. IMC cannot obtain the ciphertext C_j of a single device, so it cannot obtain d_j. IMC only obtains the aggregated data of all IIDs and does not learn the individual data of any IID. Therefore, the FLPDA scheme resists the reverse analysis attack.
Theorem 4: FLPDA scheme can resist the conspiracy attack.
Proof: If SN obtains the decryption key from IMC and tries to acquire the plaintext of IID_j, the privacy of IID_j is still preserved because SN does not know H_0(Ts)^{E(y_j)·γ_{y_j}}. Moreover, if at least L IIDs conspire, they can obtain Σ_{j=1}^{L} E(y_j)γ_{y_j}. Since they do not know the decryption key λ, they still cannot obtain the data of a single IID.

C. FAULT TOLERANCE
Some devices may malfunction and fail to send industrial data to SN. SN only knows which group an IID belongs to, according to β_j. IMC uses H_1(Ts_j‖E(y_j)‖γ_{y_j}) to locate the malfunctioning IID_j while masking the IID's identity.
First, IMC compares this group's hash table, constituted by H_1(Ts_j‖E(y_j)‖γ_{y_j}), with those of other complete groups to find the malfunctioning IID. Then, it selects an IID from another group with the same hash value H_1(Ts_j‖E(y_j)‖γ_{y_j}) to replace IID_j. Therefore, if IID_j malfunctions, its data is not considered: IMC arbitrarily chooses L shares of E(y_j)γ_{y_j} from the received (N−1) shares to reconstruct the secret.
Assume IID_j (1 ≤ j ≤ L) fails to transmit d_j to SN; then SN aggregates the ciphertexts of the remaining IIDs.
IMC obtains the aggregated industrial data of all IIoT devices except IID_j. As a result, IMC still gets the correct aggregation result in this case, and thus our scheme achieves the fault-tolerance property.

D. FEATURE COMPARISON
In Table 2, the comparison between our scheme and related schemes [15], [17], [18], [20], [21] shows that our scheme requires no trusted entity, resists the considered attacks, meets the privacy protection requirements, and achieves resistance to reverse analysis attacks.

VII. PERFORMANCE EVALUATION

A. COMPUTATION COMPLEXITY
To facilitate the comparison of FLPDA scheme with schemes [18] and [20], the execution time of each algorithm is listed in Table 3.
According to Table 3, the FLPDA scheme is more efficient than schemes [18] and [20]. Due to batch signature verification, the signature verification cost of FLPDA is only 1/N of that of schemes [18] and [20]. Since schemes [18] and [20] each generate N key pairs, the computational cost of key generation in the FLPDA scheme is half that of scheme [18] or [20]. FLPDA requires no key agreements, whereas schemes [18] and [20] each require 2N key agreements.
As shown in Figure 2, the computation cost of the FLPDA scheme is lower than that of schemes [18] and [20], which improves response speed and enhances practicality.

B. COMMUNICATION OVERHEAD
The comparison of communication costs between FLPDA and schemes [18] and [20] is shown in Table 4. As Table 4 shows, FLPDA incurs lower communication costs than schemes [18] and [20], especially in Round 1. Adopting the FLPDA scheme therefore ensures lower latency and improves practicality.
N indicates the number of IIDs, and M bits indicates the data length. Figure 3 compares the communication costs of FLPDA, [18], and [20] at M = 512. Figure 4 compares the communication costs of FLPDA, [18], and [20] at N = 1000.
As shown in Figures 3 and 4, the FLPDA scheme has higher communication efficiency than schemes [18] and [20]; the longer the data and the larger the number of IIDs, the more communication cost is saved.

C. STORAGE OVERHEAD
The storage costs of the FLPDA scheme and schemes [18] and [20] are listed in Table 5. N represents the number of IIDs, and M bits represents the storage cost of each data item to be sent. Table 5 compares the per-round storage cost of FLPDA with that of schemes [18] and [20]. In round 0, each IID needs to store an updated data request, so the storage cost of FLPDA is NM bits. In round 1, each IID needs to hold the public key P_j, so the storage cost of FLPDA is NM bits. In round 2, each IID needs to keep Ts‖E(y_j)‖γ_{y_j}, so the storage cost of FLPDA is 2NM bits. In round 3, each IID needs to store C_j‖P_j‖H_1(Ts‖E(y_j)‖γ_{y_j})‖σ_j, so the storage cost of FLPDA is 4NM bits. In round 4, each IID needs to store C_j and SN needs to store C, so the storage cost of FLPDA is (NM+M) bits.
As shown in Table 5, the total storage cost of FLPDA is (8NM+M) bits, while that of scheme [18] is (15NM−6M) bits. The FLPDA scheme thus outperforms scheme [18] in terms of storage cost.

VIII. CONCLUSION
In this paper, a federated learning-based privacy-preserving data aggregation scheme for IIoT is proposed. Data aggregation is used to resist reverse analysis attacks and protect individual users' model changes in federated learning. The PBFT consensus algorithm is adopted so that the scheme does not depend on any trusted entity. Combining the Paillier cryptosystem with secret sharing realizes secure data sharing and alleviates the data island problem. The analysis shows that the proposed scheme resists various attacks and meets all design goals. Compared with existing schemes, our scheme has lower communication, computation, and storage costs. In the future, we will focus on improving the federated learning model to study multidimensional IIoT data collection from the perspective of the physical layer.