End-Edge Collaborative Lightweight Secure Federated Learning for Anomaly Detection of Wireless Industrial Control Systems

With the wide application of industrial wireless network technologies, the industrial control system (ICS) is evolving from wired and centralized to wireless and distributed, which makes eavesdropping and attacks serious problems. To guarantee the security of wireless and distributed ICSs, this article establishes an end-edge collaborative lightweight secure federated learning (LSFL) architecture and proposes an LSFL anomaly detection strategy. Specifically, we first design a residual multihead self-attention convolutional neural network for local feature learning, where the variability and dependence of spatial-temporal features can be sufficiently evaluated. Then, to reduce the wireless communication cost of parameter exchange in edge federated learning, we propose a dynamic parameter pruning algorithm that evaluates the contribution of each parameter based on the information entropy gain. Furthermore, to ensure parameter security during wireless transmission in the open radio environment, we propose an adaptive key generation algorithm for parameter encryption. Finally, the proposed strategy is experimentally validated on representative datasets, including Smart Meter, NSL-KDD, and UNSW-NB15. Experimental results demonstrate that the proposed strategy achieves above 99% accuracy on all three datasets, reduces the wireless communication cost by at least 89.6%, and defends against tampering and injecting attacks.


I. INTRODUCTION
The industrial control system (ICS) plays an irreplaceable role in key infrastructures, such as manufacturing factories, power systems, and nuclear power plants [1]. However, ICSs face increasingly serious security and privacy problems when they are interconnected through the open Internet. Thus, anomaly detection becomes critical in protecting the security of ICSs. By detecting abnormal behaviors, potential attacks can be identified in advance, further guaranteeing the reliability of ICSs.
Existing anomaly detection strategies are mainly based on machine learning algorithms, including support vector machine, random forest, and decision tree [2]. Zhou et al. [3] proposed a variational long short-term memory (LSTM) learning network for intelligent anomaly detection based on reconstructed feature representation. Kaur et al. [4] proposed a Bayesian method integrated with the convolutional neural network (CNN), where the Bayesian component was used to distinguish network physical intrusion from normal events in binary and multiclass settings, while the CNN was used to process the high-dimensional feature space before the intrusion classification task. Ahakonye et al. [5] proposed an agnostic Chi-square feature selection and prepruned decision tree for intrusion detection in SCADA systems. Chen et al. [6] proposed an adaptive method of information-enhanced countermeasure domain, and constructed a feature extractor based on a CNN and bidirectional LSTM architecture. These strategies are mainly oriented toward centralized ICSs and can achieve high-accuracy detection when there is enough training data. However, when a single node in the centralized ICS is attacked and fails, serious privacy leakage and data security problems arise, which can significantly reduce the performance of anomaly detection and impact the security and reliability of the ICS.
As a consequence, distributed anomaly detection is gaining popularity for its fault tolerance, scalability, and privacy protection in distributed ICSs [7]. Rather than relying on a single centralized algorithm, the distributed strategy employs multiple interconnected nodes that cooperatively perform anomaly detection. Liu et al. [8] designed an attention-based CNN and an LSTM network for distributed anomaly detection, where the gradient is compressed. Li et al. [9] proposed a distributed anomaly detection strategy based on the CNN and the gated recurrent unit (GRU) network, and designed a secure communication protocol based on the Paillier cryptosystem to protect the security and privacy of network parameters. Huong et al. [10] proposed a hybrid network based on the variational autoencoder and LSTM to deal with distributed anomaly detection of time-series data. Zhai et al. [11] proposed a distributed intrusion detection method based on the CNN and GRU under the federated learning architecture. Khan et al. [12] proposed a distributed intrusion detection system combined with the simple recurrent unit.
In particular, federated learning is emerging as a promising distributed strategy that has gained wide attention and applications in academia and industry [13], [14]. With federated learning, nodes share only parameters rather than raw data or intermediate results. Therefore, the risk of data leakage can be effectively reduced. However, most federated learning-based anomaly detection strategies simply transmit the raw parameters and do not fully consider the features and importance of the parameters. As a result, the communication cost of parameter exchange is still non-negligible, especially for the wireless ICS, whose communication resources are always limited. More importantly, when the parameters are wirelessly exchanged in an open radio environment, it is easier for eavesdroppers and attackers to intrude into and damage the wireless ICS. If a trusted third party is employed, additional communication cost is introduced to the wireless ICS, which further reduces the performance of anomaly detection. Thus, how to design an effective federated learning strategy with low communication cost and high security remains an open issue. Motivated by this, this article proposes an end-edge collaborative lightweight secure federated learning (LSFL) anomaly detection strategy for the wireless ICS.
The major contributions are summarized as follows.

II. END-EDGE COLLABORATIVE LSFL ARCHITECTURE
The ICS contains an edge server and $I$ end devices denoted as $ED_i$ ($i = 1, 2, \ldots, I$). The edge server, which has sufficient computation resources, is responsible for end-edge collaborative task scheduling and data training. End devices, which are task customized with limited computation resources, can be sensors, controllers, and actuators distributed along the production process. Each end device continuously collects local data, e.g., monitoring information and control commands. The edge server and end devices are wirelessly connected by a highly reliable and strongly real-time industrial wireless network such as industrial 5G [15]. Due to the limited spectrum resources, the data volume that the wireless ICS can transmit is limited. Moreover, the transmitted data may be eavesdropped since the radio environment is open to everyone. In this way, attackers can send invalid requests and malicious data to the wireless ICS, and further tamper with the data to compromise end devices. This may cause serious problems, such as production interruptions and device failures that prevent industrial production lines from running normally.
To ensure the security of the wireless ICS, we propose the end-edge collaborative LSFL architecture shown in Fig. 1. End devices first perform feature learning with the raw data collected locally, then prune the parameters to reduce the wireless communication cost, encrypt the pruned parameters for secure wireless communication, and finally transmit the encrypted pruned parameters to the edge server for federated learning. Specifically, the process of end-edge collaborative LSFL is given as follows.

A. RAW DATA COLLECTION
Each end device first collects raw data and forms a local dataset $X_i$ ($i = 1, 2, \ldots, I$) for training. Due to the limited computation resources, each end device collects sequences of the same length. Thus, the local dataset $X_i$, which contains $M$ sequences of equal length with $N$ features each, is given as
$$X_i = \begin{bmatrix} x^i_{1,1} & \cdots & x^i_{1,N} \\ \vdots & \ddots & \vdots \\ x^i_{M,1} & \cdots & x^i_{M,N} \end{bmatrix} \tag{1}$$
where $x^i_{m,n}$ denotes the $n$th feature of the $m$th sequence of the local data collected by $ED_i$.

B. PARALLEL FEATURE LEARNING
With the collected raw data, all end devices perform parallel training for feature learning using the proposed residual multihead self-attention CNN (RMS-CNN) described in Section III-A, where the input training data of $ED_i$ is $X_i$. The initial training models of all end devices are the same, and the initial training parameter set of $ED_i$ is denoted as $Y_i$ ($i = 1, 2, \ldots, I$). In detail, $Y_i$, which contains $J$ parameters, is given as
$$Y_i = \{y_{i,1}, y_{i,2}, \ldots, y_{i,J}\} \tag{2}$$
where $y_{i,j}$ denotes the $j$th initial training parameter of $ED_i$. Then, according to the local training results, each end device updates $Y_i$ into a new parameter set $\tilde{Y}_i$ ($i = 1, 2, \ldots, I$), which also contains $J$ parameters and is given as
$$\tilde{Y}_i = \{\tilde{y}_{i,1}, \tilde{y}_{i,2}, \ldots, \tilde{y}_{i,J}\}. \tag{3}$$
By training, the parameter set of $ED_i$ is updated as
$$Y_i = \tilde{Y}_i. \tag{4}$$

C. PARAMETER PRUNING
To reduce the wireless communication cost of parameter exchange and realize lightweight federated learning, each end device further prunes the parameters using the proposed dynamic parameter pruning algorithm described in Section III-B. The parameter set $Y_i$ is pruned as $Z_i$ ($i = 1, 2, \ldots, I$) with $J$ parameters, which is given as
$$Z_i = \{z_{i,1}, z_{i,2}, \ldots, z_{i,J}\} \tag{5}$$
where $z_{i,j}$ denotes the $j$th pruned parameter.

D. PARAMETER ENCRYPTION AND SECURE TRANSMISSION
To avoid being eavesdropped and attacked during wireless communication in the open radio environment, all pruned parameters are encrypted for secure transmission. The encryption key for $Z_i$ is denoted as $K_i$, which is generated by the proposed adaptive key generation algorithm described in Section III-C. Then, each end device encrypts the pruned parameters and transmits the encrypted parameters to the edge server for federated learning.

E. PARAMETER AGGREGATION
With the encrypted pruned parameter sets from the $I$ end devices, the edge server decrypts each pruned parameter set and aggregates the pruned parameters to get a new parameter set with $J$ parameters. Herein, the new parameter $z_j$ is calculated as the weighted mean of the parameters $z_{i,j}$ uploaded by the $I$ end devices, where the weight $\alpha_{i,j} \in [0, 1]$ indicates the importance of the pruned parameter $z_{i,j}$. In this way, we formulate a new parameter set $Z = \{z_1, z_2, \ldots, z_J\}$.
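The aggregation step can be illustrated with a small NumPy sketch. The normalization by the sum of the weights is an assumption made here so that the result is a proper weighted mean; the function name `aggregate` and the toy values are hypothetical and not taken from the article.

```python
import numpy as np

def aggregate(pruned_sets, importance):
    """Weighted mean of pruned parameters across end devices.

    pruned_sets: shape (I, J), row i holds z_{i,1..J} uploaded by ED_i.
    importance:  shape (I, J), alpha_{i,j} in [0, 1].
    Assumption: weights are normalized per parameter index j.
    """
    z = np.asarray(pruned_sets, dtype=float)
    alpha = np.asarray(importance, dtype=float)
    weight_sum = alpha.sum(axis=0)
    # If no device assigns any weight to parameter j, the result stays 0.
    safe = np.where(weight_sum > 0, weight_sum, 1.0)
    return (alpha * z).sum(axis=0) / safe

# Two end devices (I = 2), three parameters (J = 3).
Z_devices = np.array([[0.2, 0.0, 1.0],
                      [0.4, 0.6, 0.0]])
alpha = np.array([[1.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0]])
Z_global = aggregate(Z_devices, alpha)
```

In this toy case a parameter pruned to zero on one device receives zero weight, so it does not drag the aggregated value toward zero.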

F. PARAMETER UPDATE
With the new parameter set $Z$, the edge server sends it to all end devices, and each end device updates its parameter set for the next-round training, namely
$$Y_i = Z, \quad i = 1, 2, \ldots, I. \tag{8}$$
With multiround interactions for parameter exchange and training, we can obtain an accurate model for anomaly detection. After completing the offline anomaly detection training, the end devices are ready for online anomaly detection.

III. LSFL ANOMALY DETECTION STRATEGY
With the established end-edge collaborative LSFL architecture, we further specify the LSFL anomaly detection strategy in this section. In detail, we first design RMS-CNN for feature learning to enhance the capability of anomaly detection. Then, we propose the dynamic parameter pruning algorithm to reduce the wireless communication cost of parameter exchange. Finally, we propose the adaptive key generation algorithm for parameter encryption and secure transmission in the open radio environment.

A. RMS-CNN FOR LOCAL FEATURES LEARNING
To enhance the feature learning capability for anomaly detection, we design RMS-CNN for adequately extracting, integrating, and capturing the variability and dependence of spatial-temporal features. The RMS-CNN structure is illustrated in Fig. 2.

1) SPATIAL-TEMPORAL FEATURE EXTRACTION
In the wireless ICS, both normal and abnormal protocol data are collected cyclically for training, most of which are spatial-temporal sequences. To fully extract the spatial features of the sequences, we employ a 1-D CNN (1D-CNN) to extract the local information according to the length of the input data. Thus, we have
$$X^{\text{spatial}}_i = f\left(W_{i,l} * X_i + V_{i,l}\right), \quad 1 \le l \le L. \tag{9}$$
Herein, $X_i$ and $X^{\text{spatial}}_i$ are the input sequence and the output feature of the convolution layer, respectively; $f(\cdot)$ is the activation function; $*$ denotes the convolution operation; $W_{i,l}$ and $V_{i,l}$ are the weight factor and bias factor of the $l$th ($1 \le l \le L$) convolution kernel in the convolution layer with a total of $L$ convolution kernels.
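After spatial feature extraction, the dimension of $X^{\text{spatial}}_i$ is usually very high. Thus, it is necessary to reduce the redundant information. We apply two pooling layers after the 1D-CNN to reduce the feature dimensions while retaining important spatial feature information. Specifically, we divide each line of $X^{\text{spatial}}_i$ into $\lceil N/S \rceil$ parts, where $\lceil \cdot \rceil$ is the ceiling function. In this way, each part has $S$ features. Each part is first input to the maximum pooling layer to obtain the maximum value of every $S$ features, i.e.,
$$x^{i,\max}_{m,n} = \max\left\{x^{i,\text{spatial}}_{m,(n-1)S+1}, \ldots, x^{i,\text{spatial}}_{m,\min(nS,N)}\right\} \tag{10}$$
where $x^{i,\max}_{m,n}$ denotes the $n$th feature of the $m$th data. In this way, a sequence with $N$ dimensions is shortened to a new sequence with $\lceil N/S \rceil$ dimensions.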
Then, $X^{\max}_i$ is input to the global average pooling layer to obtain the average value of the features of each sequence, which forms the output $X^{\text{avg}}_i$ with entries $x^{i,\text{avg}}_{m,n}$ denoting the $n$th feature of the $m$th data (11). In this way, each max-pooled sequence is shortened to one with only one dimension.
Furthermore, to learn the temporal features of $X^{\text{avg}}_i$, we apply the GRU to extract long-memory dependencies with its unique memory property, i.e.,
$$X^{\text{temporal}}_i = g\left(X^{\text{avg}}_i\right) \tag{12}$$
where $g(\cdot)$ denotes the activation function used in the GRU.
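The two pooling steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the last part is padded with negative infinity when $N$ is not a multiple of $S$, and the function names `segment_max_pool` and `global_avg_pool` are hypothetical.

```python
import math
import numpy as np

def segment_max_pool(x_spatial, S):
    """Maximum over consecutive groups of S features in each row."""
    M, N = x_spatial.shape
    parts = math.ceil(N / S)
    pad = parts * S - N
    # Pad the tail with -inf so that padding never wins the maximum.
    padded = np.pad(x_spatial, ((0, 0), (0, pad)), constant_values=-np.inf)
    return padded.reshape(M, parts, S).max(axis=2)

def global_avg_pool(x_max):
    """Average of all max-pooled features of each sequence."""
    return x_max.mean(axis=1, keepdims=True)

# M = 2 sequences with N = 5 features, grouped with S = 2.
x_spatial = np.array([[1.0, 5.0, 2.0, 4.0, 3.0],
                      [0.0, -1.0, 7.0, 2.0, 6.0]])
x_max = segment_max_pool(x_spatial, S=2)   # shape (2, 3)
x_avg = global_avg_pool(x_max)             # shape (2, 1)
```

With $N = 5$ and $S = 2$, each row is reduced to $\lceil 5/2 \rceil = 3$ features by max pooling and then to a single value by global average pooling.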

2) FEATURE DEPENDENCE CAPTURE
To reduce information loss during spatial-temporal feature extraction, we further capture the dependence of spatial-temporal features by multihead self-attention. To maintain the relative consistency of the original spatial-temporal features, we first add a residual connection after the spatial-temporal feature extraction, wherein the residual connection integrates the features extracted by the 1D-CNN and the GRU. In this way, both the enhanced features and the unmodified original input features provided by the 1D-CNN and the GRU are fully considered. Mathematically, the extracted features $X^{\text{spatial}}_i$ and $X^{\text{temporal}}_i$ from the 1D-CNN and the GRU are added together, i.e.,
$$X^{\text{res}}_i = X^{\text{spatial}}_i + X^{\text{temporal}}_i.$$
Then, we capture the dependence in $X^{\text{res}}_i$ by calculating the importance of each location with respect to the other locations. Specifically, we map the input features to different subspaces through multiple linear transformations to obtain more diverse information. Furthermore, by calculating multiple attention heads that attend to different location information and dependencies in parallel, we can get more comprehensive and accurate global information.
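Mathematically, $X^{\text{res}}_i$ is linearly transformed by learnable parameter matrices to obtain the query vector $Q_i$, key vector $K_i$, and value vector $V_i$. Each vector is further divided into $H$ attention heads. Then, the self-attention of the $h$th ($1 \le h \le H$) head is calculated as
$$X^{\text{attn}}_{i,h} = \mathrm{softmax}\left(\frac{Q_{i,h} \cdot K_{i,h}^{T}}{\sqrt{d}}\right) V_{i,h}$$
where $Q_{i,h} = X^{\text{res}}_i U^{Q}_{i,h}$, $K_{i,h} = X^{\text{res}}_i U^{K}_{i,h}$, and $V_{i,h} = X^{\text{res}}_i U^{V}_{i,h}$ are the $h$th query vector, key vector, and value vector, respectively; $U^{Q}_{i,h}$, $U^{K}_{i,h}$, and $U^{V}_{i,h}$ are the learnable parameter matrices for linear transformation; $\mathrm{softmax}(\cdot)$ is the normalization function; $d$ is the dimension of $Q_{i,h}$ and $K_{i,h}$, so that $\sqrt{d}$ scales the scores; and $\cdot$ is the dot product.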
To get all features captured through multihead self-attention, we connect the output of each attention head to form a large matrix, which is then multiplied by the weight matrix for the final linear transformation. That is,
$$X^{\text{mattn}}_i = c\left(X^{\text{attn}}_{i,1}, X^{\text{attn}}_{i,2}, \ldots, X^{\text{attn}}_{i,H}\right) U^{O}$$
where $c(\cdot)$ is the connection function and $U^{O}$ is the learnable parameter matrix.
After calculating the multihead self-attention matrix, the output feature $X^{\text{mattn}}_i$ together with the original $X^{\text{res}}_i$ is again used to learn the enhanced features denoted as $X^{\text{mres}}_i$, i.e.,
$$X^{\text{mres}}_i = X^{\text{mattn}}_i + X^{\text{res}}_i.$$
Finally, we use a general flatten layer and a fully connected layer to obtain the parameters and perform anomaly detection. Note that the initial or updated parameters are utilized throughout the aforementioned process for training.
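The multihead self-attention with the second residual connection can be sketched in NumPy as below. This is an illustrative forward pass only, assuming standard scaled dot-product attention with one projection matrix per head; the shapes, the random weights, and the function names are hypothetical, and no training is performed.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_self_attention(x_res, U_Q, U_K, U_V, U_O):
    """x_res: (T, D); U_Q, U_K, U_V: (H, D, d); U_O: (H * d, D)."""
    heads = []
    for U_q, U_k, U_v in zip(U_Q, U_K, U_V):
        Q, K, V = x_res @ U_q, x_res @ U_k, x_res @ U_v
        d = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d))   # (T, T), each row sums to 1
        heads.append(weights @ V)                 # (T, d)
    # Connect all heads and apply the final linear transformation.
    x_mattn = np.concatenate(heads, axis=-1) @ U_O
    # Residual connection: add the attention input back to the output.
    return x_mattn + x_res

rng = np.random.default_rng(0)
T, D, H, d = 6, 8, 2, 4
x_res = rng.normal(size=(T, D))
U_Q, U_K, U_V = (rng.normal(size=(H, D, d)) for _ in range(3))
U_O = rng.normal(size=(H * d, D))
x_mres = multihead_self_attention(x_res, U_Q, U_K, U_V, U_O)
```

The output keeps the shape of the input, which is what allows the residual addition in the last line of the function.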

B. DYNAMIC PARAMETER PRUNING FOR LIGHTWEIGHT FEDERATED LEARNING
As a large number of parameters are generated after RMS-CNN training, the wireless communication cost of parameter exchange in federated learning increases dramatically. However, the communication resources (e.g., bandwidth and transmit power) are very limited in the industrial wireless network, which cannot support frequent exchange of massive parameters. Thus, we propose the dynamic parameter pruning algorithm to reduce the communication cost of the wireless ICS. Specifically, at the end of each training round, each end device dynamically prunes its parameters by fully considering the features and contributions of the parameters.

1) PARAMETER CONTRIBUTION EVALUATION
To evaluate the contribution of each parameter after training, we propose to calculate the information entropy gain. First, we discretize the parameter $y_{i,j}$. Then, we calculate the information entropy of each parameter as
$$e_{i,j} = -p(y_{i,j}) \log p(y_{i,j})$$
where $p(y_{i,j})$ is the ratio of parameter $y_{i,j}$ to the total parameter set $Y_i$. Similarly, the parameter after training $\tilde{y}_{i,j}$ is also discretized, and its information entropy is calculated as
$$\tilde{e}_{i,j} = -p(\tilde{y}_{i,j}) \log p(\tilde{y}_{i,j}).$$
Then, we calculate the information entropy gain of each parameter as
$$\Delta e_{i,j} = e_{i,j} - \tilde{e}_{i,j}.$$
With the information entropy gain, we can evaluate the contribution of each parameter, where the greater the information entropy gain, the higher the contribution of the parameter. In this way, we can enhance the training performance.
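A minimal sketch of the entropy gain computation is given below. The discretization rule is not recoverable from the text, so rounding each value to a grid of width `step` is an assumption made only for illustration, and $p(\cdot)$ is taken as the relative frequency of the discretized value within the parameter set. All function names are hypothetical.

```python
import math
from collections import Counter

def discretize(values, step=0.1):
    """Assumed discretization: index of the nearest multiple of step."""
    return [round(v / step) for v in values]

def entropy_terms(values, step=0.1):
    """Per-parameter term -p log p, with p the relative frequency of its bin."""
    bins = discretize(values, step)
    counts = Counter(bins)
    n = len(bins)
    return [-(counts[b] / n) * math.log(counts[b] / n) for b in bins]

def entropy_gain(before, after, step=0.1):
    """Entropy before training minus entropy after training, per parameter."""
    e_before = entropy_terms(before, step)
    e_after = entropy_terms(after, step)
    return [eb - ea for eb, ea in zip(e_before, e_after)]

y_before = [0.1, 0.1, 0.5, 0.9]   # parameters before local training
y_after = [0.1, 0.1, 0.1, 0.9]    # parameters after local training
gain = entropy_gain(y_before, y_after)
```

In this toy case the third parameter moves into the bin shared by the first two, so the first three gains are positive, while the last parameter keeps the same relative frequency and has zero gain.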

2) DYNAMIC PARAMETER THRESHOLD SETTING
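To further measure the volatility of the parameters and ensure the reliability of the contribution, we calculate the standard deviation of the information entropy gain as
$$\sigma_i = \sqrt{\frac{1}{J}\sum_{j=1}^{J}\left(\Delta e_{i,j} - \overline{\Delta e}_i\right)^2}$$
where $\overline{\Delta e}_i = \frac{1}{J}\sum_{j=1}^{J} \Delta e_{i,j}$ is the average value of the information entropy gain.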
As different parameters have different contributions, we set the parameter pruning threshold based on the standard deviation of the information entropy gain. In this article, we mainly consider the parameters with respect to weight, bias, and gradient since they significantly impact the training performance. Specifically, three pruning thresholds are set for each end device with respect to weight, bias, and gradient, as given in (21)-(23), where $\phi$, $\zeta$, and $\psi$ are the weight factor, bias factor, and gradient factor, respectively. In this way, we can select different thresholds for parameter pruning.

3) PARAMETER PRUNING AND RECONSTRUCTION
With the parameter pruning thresholds calculated based on the information entropy gain, we prune the parameters dynamically. Specifically, according to the parameter characteristics with respect to weight, bias, and gradient, we prune the parameters according to (21)-(23), respectively. If a parameter is larger than the calculated threshold, the parameter is reserved; otherwise, the parameter is pruned to 0, namely, it becomes inactive. In this way, we reduce the data volume of redundant parameters with low contributions.
However, after parameter pruning, the value range of the parameters is compressed, which may influence the performance of federated learning. Thus, we further enlarge the value range of the parameters. The scaling factor is defined as the ratio of the sum of the parameters' absolute values before pruning to that after pruning, namely
$$\frac{\sum_{j=1}^{J} |y_{i,j}|}{\sum_{j=1}^{J} |z_{i,j}|}. \tag{24}$$
Then, we multiply the pruned parameters by (24) to restore their value range. In this way, we reconstruct the pruned parameters for federated learning.
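The thresholding, pruning, and reconstruction steps can be sketched together as follows. Two points are assumptions made here because the exact threshold formulas are not available: the threshold is taken as a factor times the standard deviation of the entropy gains, and a parameter is kept when its entropy gain exceeds that threshold. The function name `prune_and_rescale` and the toy numbers are hypothetical.

```python
import numpy as np

def prune_and_rescale(params, gains, factor):
    """Zero low-contribution parameters, then rescale the survivors so that
    the sum of absolute values matches the value before pruning."""
    params = np.asarray(params, dtype=float)
    gains = np.asarray(gains, dtype=float)
    sigma = gains.std()                  # population standard deviation
    threshold = factor * sigma           # assumed form of the threshold
    keep = gains > threshold             # assumed comparison on the gain
    pruned = np.where(keep, params, 0.0)
    after = np.abs(pruned).sum()
    # Ratio of absolute sums before and after pruning.
    scale = np.abs(params).sum() / after if after > 0 else 1.0
    return pruned * scale, keep

weights = [0.5, -0.2, 0.05, 1.0]
gains = [0.3, 0.1, 0.0, 0.4]
z, keep = prune_and_rescale(weights, gains, factor=1.0)
```

Here the standard deviation of the gains is about 0.158, so the second and third parameters are pruned to zero and the remaining two are scaled by 1.75 / 1.5.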

C. ADAPTIVE KEY GENERATION FOR PARAMETER ENCRYPTION
When the pruned parameters are exchanged between the end devices and the edge server for federated learning, the risk of attack increases substantially in the open radio environment. Thus, to avoid being eavesdropped and attacked, we propose the adaptive key generation algorithm to encrypt the parameters for secure wireless communication.

1) KEY NEGOTIATION
The key is generated by each communication pair, namely an end device and the edge server, and no third party is used in the distributed ICS. Hence, to establish a secure transmission channel for each communication pair, we randomly generate key pairs for all communication pairs. Each key pair includes a private key $K^r_i$ and a public key $K^u_i$, which is a point on the elliptic curve $y^2 = x^3 + ax + b$ subject to $4a^3 + 27b^2 \neq 0$. Then, each side of the communication pair retains its private key $K^r_i$ and sends its public key $K^u_i$ to the other side, which ensures that the two sides are the only participants who can decrypt the exchanged parameters. With the received $K^u_i$, the communication pair performs key negotiation. Specifically, $ED_i$ and the edge server each combine their own private key $K^r_i$ with the received public key to obtain a shared key $K^s_i$, which is calculated independently by both sides of the communication pair and should be the same. By key negotiation, the two participants over the same communication channel have the same key $K^s_i$, while those over different communication channels have different keys.
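The negotiation can be illustrated with a toy elliptic-curve Diffie-Hellman exchange in pure Python. This is a sketch under stated assumptions: the public key is taken as the private scalar times a base point, which is the usual construction but is not spelled out in the text; the tiny textbook curve $y^2 = x^3 + 2x + 2$ over GF(17) with base point (5, 1) is chosen only for readability and offers no security; and the private keys are fixed rather than random so that the example is reproducible.

```python
# Toy curve y^2 = x^3 + 2x + 2 over GF(17); base point G = (5, 1) has order 19.
P_MOD, A, B = 17, 2, 2
G = (5, 1)
ORDER = 19

# Non-singularity condition 4a^3 + 27b^2 != 0 (mod p).
assert (4 * A**3 + 27 * B**2) % P_MOD != 0

def point_add(p1, p2):
    """Add two curve points; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        slope = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        slope = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (slope * slope - x1 - x2) % P_MOD
    y3 = (slope * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def scalar_mult(k, point):
    """Double-and-add computation of k * point."""
    result, addend = None, point
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result

# Hypothetical private keys; in practice they are drawn at random in [1, ORDER - 1].
private_ed, private_edge = 3, 7
public_ed = scalar_mult(private_ed, G)
public_edge = scalar_mult(private_edge, G)

# Each side combines its own private key with the received public key.
shared_ed = scalar_mult(private_ed, public_edge)
shared_edge = scalar_mult(private_edge, public_ed)
```

Both sides arrive at the same point without ever transmitting a private key, which is the property the key negotiation relies on.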

2) KEY CONVERSION
To ensure the security of parameter exchange, we employ the widely used AES algorithm to encrypt the pruned parameters. We first prepare the input key material $K^m_i$ by adding a random number to the negotiated key $K^s_i$. However, $K^m_i$ cannot be directly applied to encrypt the pruned parameters since the length of $K^m_i$ is much longer than the key length supported by the AES algorithm. Thus, we need to convert the format of $K^m_i$ so that its length is supported by the AES algorithm.
As a Hash function maps a key of any length to one of a fixed length, we employ a Hash function to convert the format of the key. Meanwhile, this also enhances the security of the key, since the Hash function is one way: its output is deterministic and practically irreversible, so the attacker cannot obtain the original key by inverting the Hash value. Furthermore, to increase the complexity of the key, we execute the Hash operation multiple times, where $\mathrm{HASH}(K^m_i, T_i)$ denotes applying the Hash function $T_i$ times to $K^m_i$. The number of operations $T_i$ is determined by $L_i$ and $L^{\text{hash}}_i$, where $L_i$ is the key length supported by the AES algorithm (i.e., 128 bits, 192 bits, or 256 bits), and $L^{\text{hash}}_i$ is the output length of the Hash function.
Then, we concatenate the outputs of the Hash operations until the resulting key $K_i$ reaches the length supported by the AES algorithm. In this way, we obtain the key for parameter encryption.
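A possible reading of the key conversion is sketched below with Python's standard `hashlib`. Several details are assumptions, since the formulas are not available: the number of Hash operations is taken as the ceiling of the AES key length over the Hash output length, each operation hashes the previous output, and the concatenation is trimmed to the exact AES key length. SHA-224 is used only because its 224-bit output makes a 256-bit key require two operations; the function name and key material are hypothetical.

```python
import hashlib
import math

AES_KEY_BITS = (128, 192, 256)

def derive_aes_key(key_material: bytes, aes_bits: int = 256,
                   hash_name: str = "sha224") -> bytes:
    """Hash repeatedly, concatenate the outputs, and trim to the AES key length."""
    if aes_bits not in AES_KEY_BITS:
        raise ValueError("AES supports 128-, 192-, or 256-bit keys")
    hash_bits = hashlib.new(hash_name).digest_size * 8
    rounds = math.ceil(aes_bits / hash_bits)        # assumed form of T_i
    digest, outputs = key_material, []
    for _ in range(rounds):
        # Each operation hashes the output of the previous one.
        digest = hashlib.new(hash_name, digest).digest()
        outputs.append(digest)
    return b"".join(outputs)[: aes_bits // 8]

# Hypothetical key material: a shared secret followed by a random number.
key_material = b"shared-secret" + b"\x01\x02\x03\x04"
K_i = derive_aes_key(key_material, aes_bits=256)
```

With SHA-224 and a 256-bit target, two Hash outputs (56 bytes) are concatenated and the first 32 bytes are kept.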

3) PARAMETER ENCRYPTION
As the volume of industrial data is generally very large, we employ the counter (CTR) mode of AES to quickly split the large data into small blocks and encrypt the parameters. Specifically, CTR generates the key stream based on a counter and $K_i$, and performs the exclusive OR operation with the plaintext parameters to obtain the encrypted parameters. Decryption applies the same exclusive OR operation with the same key stream to recover the plaintext. It is worth noting that with CTR mode, multiple parameter blocks can be encrypted and decrypted simultaneously, which speeds up the encryption process.
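The structure of counter-mode encryption can be sketched with the standard library alone. Note the substitution: Python has no built-in AES, so the block below uses SHA-256 of the key, nonce, and counter as a stand-in key stream generator. It illustrates only the counter and exclusive OR structure, is not AES, and should not be used to protect real data; all names and values are hypothetical.

```python
import hashlib
import struct

BLOCK_BYTES = 32  # digest size of the SHA-256 stand-in

def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    # Stand-in for the AES encryption of (nonce || counter); NOT AES.
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()

def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with the counter-derived key stream; the same call decrypts."""
    out = bytearray()
    for index, offset in enumerate(range(0, len(data), BLOCK_BYTES)):
        stream = keystream_block(key, nonce, index)
        block = data[offset:offset + BLOCK_BYTES]
        out.extend(b ^ s for b, s in zip(block, stream))
    return bytes(out)

key = hashlib.sha256(b"demo key material").digest()
nonce = b"\x00" * 8   # fixed only to keep the demo reproducible
pruned = [0.58, 0.0, 0.0, 1.17, -0.25, 0.0, 0.0, 0.0, 0.9, 0.0]
plaintext = struct.pack(f"<{len(pruned)}d", *pruned)    # 80 bytes, 3 blocks
ciphertext = ctr_xor(key, nonce, plaintext)
recovered = struct.unpack(f"<{len(pruned)}d", ctr_xor(key, nonce, ciphertext))
```

Because each key stream block depends only on the counter value, the blocks can be processed independently, which is what makes parallel encryption and decryption possible.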
Table 1 compares the proposed algorithm with the basic AES algorithm. By dynamically generating different key pairs for parameter exchange, the proposed algorithm is more secure than the basic AES algorithm, at the cost of some additional complexity.

D. SUMMARY OF THE PROPOSED STRATEGY
With the aforementioned proposed RMS-CNN, dynamic parameter pruning, and adaptive key generation algorithms, we summarize the LSFL anomaly detection strategy as Algorithm 1 corresponding to the process depicted in Fig. 1.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. EXPERIMENT SETTINGS
All experiments are conducted on TensorFlow-GPU 2.7.0 with Python 3.9, running on an Intel i7-11700 CPU and an NVIDIA RTX4060 16 GB GPU. To fully evaluate the proposed strategy, we select three typical datasets for experimental validation, namely Smart Meter [16], NSL-KDD [17], and UNSW-NB15 [18]. The fundamental characteristics of these datasets are described in Table 2. Each dataset is divided into two parts: 80% is used for training, while the remaining 20% is used for testing.
Furthermore, the proposed LSFL anomaly detection strategy with RMS-CNN, denoted as RMS-CNN-LSFL, is compared with two benchmark strategies denoted as CNN-FL and MLP-FL. Herein, CNN-FL is a distributed federated learning anomaly detection strategy based on the 1D-CNN with GRU [11], while MLP-FL is a similar strategy based on the MLP network [16].
Algorithm 1: LSFL anomaly detection strategy.
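To evaluate and compare the performance of different strategies, we calculate four performance metrics, namely Accuracy, Precision, Recall, and the harmonic mean F-score, which are calculated as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{28}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{29}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{30}$$
$$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{31}$$
where true positive (TP), false positive (FP), false negative (FN), and true negative (TN) are defined in Table 3.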
Accuracy indicates the proportion of all correctly detected samples to the total samples, as given by (28). The higher the accuracy, the more effective the anomaly detection strategy. Precision indicates the proportion of true abnormal samples among the predicted abnormal samples, as given by (29). Recall indicates the proportion of correctly detected abnormal samples among the true abnormal samples, as given by (30). The harmonic mean F-score comprehensively measures precision and recall, as given by (31). The higher the precision, recall, and harmonic mean, the fewer false alarms and missed detections the anomaly detection strategy produces.
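The four metrics can be computed directly from the confusion counts, as in the short sketch below. It assumes the standard definitions with abnormal samples as the positive class; the function name and the counts are hypothetical and are not results from the article.

```python
def detection_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

# Hypothetical counts for 1000 test samples, 95 of which are abnormal.
metrics = detection_metrics(tp=90, fp=10, fn=5, tn=895)
```

The guards against zero denominators only matter for degenerate cases, such as a detector that never predicts an abnormal sample.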

B. PERFORMANCE COMPARISON
Fig. 3 first verifies the effectiveness of the three federated learning strategies on the Smart Meter dataset by evaluating the accuracy versus the communication round (i.e., the number of parameter exchanges). We can observe that the accuracy increases with the number of communication rounds and finally remains stable. That is to say, all strategies converge, indicating that the federated training is effective. Herein, the accuracy of RMS-CNN-LSFL remains higher than those of CNN-FL and MLP-FL, indicating the advantage of the proposed RMS-CNN-LSFL. Moreover, RMS-CNN-LSFL converges faster than CNN-FL and MLP-FL for different numbers of end devices. This is because the combination of multihead self-attention and residual connection in RMS-CNN-LSFL speeds up the process of feature learning. Meanwhile, the residual connection makes the model easier to optimize by mitigating the vanishing gradient problem. Therefore, RMS-CNN-LSFL can converge more quickly and stably.
More specifically, Table 4 comprehensively compares the performance with respect to accuracy, precision, recall, and F-score for different strategies with different numbers of end devices. When $I = 4$, the accuracy, precision, recall, and F-score of the proposed RMS-CNN-LSFL strategy on the Smart Meter dataset are 99.982%, 99.981%, 99.983%, and 99.982%, respectively. These values are better than those of CNN-FL with 98.031%, 97.972%, 97.634%, and 97.800%, and those of MLP-FL with 97.801%, 97.815%, 97.778%, and 97.796%. Similarly, the performance evaluations on NSL-KDD and UNSW-NB15 also indicate that RMS-CNN-LSFL achieves better accuracy, precision, recall, and F-score than CNN-FL and MLP-FL. In detail, the accuracy of RMS-CNN-LSFL is above 99%, while those of CNN-FL and MLP-FL are generally below 99%. The main reason is that the proposed RMS-CNN-LSFL with the multihead self-attention network can capture more spatial-temporal features for data with long-term dependencies.
Fig. 4 compares the runtime of the proposed RMS-CNN-LSFL strategy with and without dynamic parameter pruning on different datasets. Note that the runtime includes all the time for feature learning, parameter pruning, encryption, and exchange as described in Section III. We can observe that, with dynamic parameter pruning, the runtime is reduced by 21.2%, 15.6%, and 13.7% on the three datasets. This is because fewer parameters are exchanged after pruning, while the accuracy is not degraded. Fig. 5 depicts the processed data volume at different stages by RMS-CNN-LSFL, CNN-FL, and MLP-FL on different datasets. It is observed that the parameter volume produced by federated learning is only 4%, 8%, and 13% of the raw data volume on the three datasets. Furthermore, with dynamic parameter pruning, the parameters are reduced to only 2.4%, 5.9%, and 10.4% of the raw data. That is to say, our proposed strategy saves at least 89.6% of the wireless communication cost of parameter exchange.
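The 89.6% figure follows from the largest of the three reported ratios. As a worked check using only the percentages quoted above, with the second line giving the share of the parameter volume removed by pruning alone on each dataset:

```latex
\begin{align*}
\text{saving relative to raw data} &\ge 1 - \max(0.024,\ 0.059,\ 0.104) = 1 - 0.104 = 0.896,\\
\text{reduction from pruning alone} &= \frac{4 - 2.4}{4} = 40\%,\quad
\frac{8 - 5.9}{8} = 26.25\%,\quad
\frac{13 - 10.4}{13} = 20\%.
\end{align*}
```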
Fig. 6 evaluates how tampering attacks impact the performance of different strategies on the three datasets. As the tampering attack intensifies, namely as more and more parameters become ineffective, the accuracy of CNN-FL and MLP-FL gradually decreases since they do not perform parameter encryption. In this case, once the parameters are tampered with, the features and distribution of the parameters cannot be accurately captured, which decreases the accuracy of anomaly detection by federated learning. In contrast, the accuracy of RMS-CNN-LSFL does not decrease and remains the highest, since RMS-CNN-LSFL performs adaptive key generation to encrypt the parameters and thus prevents the tampering attacks. Furthermore, Fig. 7 studies the influence of injecting attacks on the accuracy of different strategies. By evaluating on different datasets, we can observe that the accuracy of all strategies decreases as more malicious data are continuously injected. However, our proposed strategy still maintains higher accuracy than CNN-FL and MLP-FL.
Comparing Fig. 7 with Fig. 6, we can also observe that the accuracy of all strategies significantly decreases under the injecting attack. This is because the tampering attack and the injecting attack are different kinds of attacks, which influence the valid parameters for federated learning in different ways. The tampering attack directly modifies the content of the parameters, which can make unencrypted parameters invalid or even destructive. In this way, the proposed strategy with parameter encryption can protect the pruned parameters from the tampering attack. In contrast, the injecting attack does not destroy the existing parameters but adds more invalid or even destructive parameters. In this way, the ratio of valid parameters is decreased, which decreases the accuracy of all strategies.

V. CONCLUSION
In this article, we established an end-edge collaborative LSFL architecture and proposed the LSFL anomaly detection strategy for the wireless ICS. First, the RMS-CNN structure was designed for local spatial-temporal feature learning at end devices. Then, the dynamic pruning algorithm based on information entropy gain was proposed to reduce the wireless communication cost of parameter exchange. Furthermore, the adaptive key generation algorithm was presented to encrypt the pruned parameters for edge federated learning. Extensive experiments were performed on three representative datasets, namely Smart Meter, NSL-KDD, and UNSW-NB15, during which two benchmark strategies were compared. The results showed that the proposed LSFL anomaly detection strategy achieves above 99% accuracy on different datasets, reduces the communication cost by at least 89.6%, and defends against tampering and injecting attacks.
To summarize, the proposed anomaly detection strategy simultaneously considers the high computation resource requirement of federated learning, the low communication cost requirement of end-edge collaborative computing, and the high security requirement of parameter exchange in the open radio environment of the wireless ICS. This is different from existing federated learning-based anomaly detection strategies, where only the communication cost or the security issue is considered. In the future, we will further consider the joint computation and communication allocation for LSFL in the end-edge collaborative architecture.

1) A novel lightweight secure collaborative federated learning architecture: We establish an end-edge collaborative LSFL architecture for the wireless ICS, where multiple end devices perform local feature learning and share parameters with an edge server for federated learning.
2) Enhanced CNN for feature learning: We design a residual multihead self-attention convolutional neural network (RMS-CNN) for spatial-temporal feature learning. Herein, we employ a multihead self-attention network to learn the dependence among features and use a residual connection to retain and integrate variability features at different layers. In this way, we can obtain more comprehensive information with fewer layers, and thus enhance the detection capability.
3) Dynamic parameter pruning for lightweight wireless communication: We develop a dynamic parameter pruning algorithm based on information entropy gain. Herein, we evaluate the contribution of each parameter by calculating the information entropy gain, and dynamically set the pruning threshold. In this way, we can prune the parameters and reduce the wireless communication cost of parameter exchange.
4) Adaptive key generation for secure parameter exchange: We propose an adaptive key generation algorithm to encrypt the pruned parameters. Herein, end devices together with the edge server dynamically generate different key pairs, adaptively adjust keys by multiple Hash operations, and encrypt the pruned parameters by the advanced encryption standard (AES) algorithm. In this way, we ensure the parameter security during wireless transmission in the open radio environment.
5) Extensive experiments: We perform extensive experiments on representative datasets including Smart Meter, NSL-KDD, and UNSW-NB15, and compare the proposed strategy with two benchmark strategies. The experimental results demonstrate that the proposed strategy achieves above 99% accuracy on different datasets, while reducing the communication cost by at least 89.6% and ensuring security.
The rest of this article is organized as follows. Section II presents the end-edge collaborative LSFL architecture. Section III specifies the LSFL anomaly detection strategy in detail. Section IV evaluates the performance of the proposed strategy by extensive experiments, and finally, Section V concludes this article.

FIGURE 3. Accuracy versus communication round for different strategies with different numbers of devices.

FIGURE 5. Comparison of communication costs on different datasets.

FIGURE 6. Accuracy for tampering attack on different datasets.

FIGURE 7. Accuracy for injecting attack on different datasets.