Anomaly Detection for Insider Attacks from Untrusted Intelligent Electronic Devices in Substation Automation Systems

In recent decades, cyber security issues in IEC 61850-compliant substation automation systems (SASs) have become growing concerns. Many researchers have developed various strategies to detect malicious behaviours of SASs during the system operational stage, such as anomaly-based detection. However, most existing anomaly-based detection methods identify an abnormal behaviour by checking every single network packet without any association. These traditional methods cannot effectively detect “stealthy” attacks which modify legitimate messages slightly while imitating patterns of benign behaviours. In this paper, we present feature selection and extraction methods to generalise and summarise critical features when detecting insider attacks triggering from untrusted control devices within SASs. By applying a sliding window-based sequential classification mechanism, our detection method can detect anomalies across multiple devices without the need to learn datasets collected from all devices. Firstly, to generalise critical features and summarise systems’ behaviours so that it is unnecessary to collect all datasets, we selected and extracted six critical network features from generic object-oriented substation events (GOOSE) messages and seven summarised physical features based on the general architecture of the primary plant of distribution substations. After that, to improve detection accuracy and reduce computational costs, we applied sliding window algorithms to divide datasets into different overlapped window-based snippets. Then we applied a sequential classification model based on Bidirectional Long Short-Term Memory networks to train and test those datasets. As a result, our method can detect insider attacks across multiple devices accurately with a false-negative rate of less than 1%.


I. INTRODUCTION
In recent decades, cyber security issues in IEC 61850-compliant substation automation systems (SASs) have become growing concerns. Many researchers have developed various strategies to detect malicious behaviours of SASs during the system's operational stage. Some of them focused on knowledge-based detection by specifying legitimate rules and abnormal conditions based on expertise [1][2][3][4]. However, these knowledge-based methods can only detect known attacks [5]. Others targeted anomaly-based detection by applying machine learning algorithms to learn systematic behaviours [6][7][8]. Most existing anomaly-based detection methods identify an abnormal behaviour by checking every single network packet without any association between them. However, these traditional methods cannot effectively detect "stealthy" attacks which modify legitimate messages slightly while imitating patterns of benign behaviours. For instance, due to the special nature of substation operation, stealthy attacks which only alter the Boolean control signals from "opening circuit breakers" to "closing circuit breakers" when a short circuit event happens, can still have severe consequences. Traditional methods cannot detect such attacks accurately as these attacks might be misclassified as benign behaviours. Thus, a bespoke anomaly-based detection model for detecting such attacks is required.
In this paper, we present feature selection and extraction methods, and sliding window-based sequential classification algorithms to detect "stealthy" attacks triggering from untrusted control devices within SASs. Our method targets such insider attack scenarios and can detect anomalies across multiple devices without the need to learn datasets collected from all devices. Both feature selection and extraction methods were applied as they both help generalise critical features and summarise systems' behaviours so that it is unnecessary to learn all datasets [6]. Additionally, sequential classification algorithms learn the contexts of the current behaviour, and thus, such additional information improves accuracy when detecting stealthy attacks [9]. Furthermore, sliding window algorithms divide the entire sequence of data into small snippets so that only recent contexts of the current behaviour are considered, and thus, improve detection accuracy and reduce computational costs [10].
The overall methodology of this paper is as follows. We: 1) collected datasets of both benign and malicious behaviours from a software-based simulation testbed and labelled datasets based on various behaviours; 2) selected and extracted critical network features from GOOSE messages and generic physical features from sensor data, and prepared datasets for machine learning; 3) used sliding window algorithms to divide datasets into different overlapped window-based snippets, and applied a sequential classification algorithm based on the bidirectional long short-term memory (BiLSTM) to train and test those datasets; 4) evaluated the approach by comparing experimental results of different anomaly-based detection methods, including decision tree and support vector machine algorithms; and 5) determined the preferred window size and step size in sliding window algorithms experimentally.
Three main contributions in this paper are listed below:
• We presented feature selection and extraction methods to generalise and summarise a total of 13 critical features when detecting stealthy insider attacks. Such methods help detect anomalies across multiple devices when only learning behaviours of one typical device.
• Compared to traditional detection methods, our detection algorithms combined a BiLSTM sequential classification algorithm and sliding window algorithms and improved detection accuracy by decreasing the false-negative rate from 30% to 1% approximately.
• Based on various experiments, we provide recommended settings for the window size and step size in sliding window algorithms for anomaly detection within SASs. The suggested settings balance the tradeoff between detection accuracy and detection time.

II. BACKGROUND
With the rapid advancement of information and communication technology, modern power grids have been experiencing a digitisation process during recent decades. Substation automation systems (SASs), also called the secondary plant, are critical components in power grids that monitor and protect the primary plant (e.g., transformers) in substations. Legacy SASs involved numerous electromechanical components with intricate hardwiring in a centralised topology, significantly increasing operational complexity, configuration and maintenance costs, and potential safety hazards [11,12]. Therefore, according to the international standard IEC 61850 [13], new SASs have been developed continuously to satisfy contemporary high-level requirements regarding interoperability, maintainability, and flexibility [14]. However, IEC 61850-compliant SASs are vulnerable for various reasons. Firstly, when the IEC 61850 standard was first introduced, cyber security problems were not the main concerns as these issues were addressed later in another standard, IEC 62351 [15]. Nevertheless, many control devices from different vendors do not support IEC 62351 [16]. Secondly, as a new convenient feature for system administrators, remote access and control also increase risks of systems being penetrated since additional access portals are introduced [17]. Thirdly, the IEC 61850-compliant SASs support various communication protocols, including legacy protocols (DNP3, Modbus), and new protocols (MMS, GOOSE, SV). However, both legacy and new protocols are insecure due to improper authentication, lack of encryption, poor access control, and lack of integrity checks [18].
Lastly and importantly, protection relays, as a major component in SASs, also called intelligent electronic devices (IEDs), usually come from third-party vendors and may not be fully trusted by utility companies. According to utility companies' shared concerns, from the design specification stage to the deployment and operational stage, a control device (e.g., a protection relay) may become untrustworthy at any point. For instance, the device may acquire various vulnerabilities during the design and implementation stages. Although most vendors test and validate their new products before manufacturing, the assessment process may not be rigorous or standardised [19]. A vulnerable device is untrustworthy as it may become a weak point for attackers. Additionally, hidden malware and stealthy hardware Trojans may be introduced during the manufacturing process, either accidentally or deliberately [20]. Similarly, a validation engineer from a third-party supplier may install a hidden backdoor for future remote access which, even though introduced for altruistic reasons, could be exploited as part of an attack [21]. Finally, during system operation, random faults, misconfiguration or mistakes made during software or firmware upgrades can lead to a previously reliable device becoming untrustworthy.
If the selected features are redundant, it will increase the diagnosis time when detecting anomalies, and thus add communication latency to normal system operation. Conversely, if the selected features are not sufficient, detection accuracy will decrease. Even worse, with insufficient features, it may not be possible to detect "stealthy" attacks as the altered features may be ignored. Furthermore, different machine learning algorithms are suitable for various application scenarios, and it is important to select appropriate ones for the best performance.

III. RELATED WORK
This section contains a comprehensive literature review. Since GOOSE is the most critical communication protocol within SASs, we firstly describe several network features of GOOSE messages. Then, we introduce physical features from sensor data, and review the possibility of combining physical features and network features to detect anomalies within SASs. After that, we point out the issue of promoting the utilisation of limited datasets into general and systematic problems, and provide potential feature extraction methods to overcome such issues. Lastly, four typical machine learning algorithms for anomaly detection are reviewed. We also summarise related works of applying sliding window algorithms to improve the accuracy of machine learning classification in various applications.

A. NETWORK FEATURES OF GOOSE
There are two types of proprietary features of GOOSE: dynamic features and static features [28]. Dynamic features are usually calculated based on the statistical trends of traffic volume and traffic frequency [28]. Kwon, et al. [28] defined three particular features relating to GOOSE messages based on RFM analysis in business and marketing research. They defined the last GOOSE arrival time as Recency, the mean time interval of GOOSE arrival time (also known as the heartbeat) as Frequency, and the total GOOSE arrival count as Monetary. On the other hand, static features are often filtered and extracted from different fields in a single GOOSE packet [29]. Based on the standardised GOOSE structure defined in the IEC 61850-8-1 standard [30], static features of GOOSE include MAC address, "APPID", "gocbRef", "stNum", "sqNum", Boolean control signals from the "allData" field, etc.
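The three RFM-style dynamic features described above can be illustrated with a minimal sketch. It assumes the arrival timestamps of a single GOOSE stream are already available as an ascending list of seconds; the function name `goose_rfm` is hypothetical and is not taken from Kwon, et al. [28].

```python
from statistics import mean


def goose_rfm(arrival_times):
    """Return (recency, frequency, monetary) for one GOOSE stream.

    arrival_times: ascending packet arrival timestamps in seconds
    (an assumed input format, used for illustration only).
    """
    if not arrival_times:
        raise ValueError("at least one arrival time is required")
    # Recency: the last GOOSE arrival time.
    recency = arrival_times[-1]
    # Frequency: mean interval between consecutive arrivals (the heartbeat).
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    frequency = mean(gaps) if gaps else None
    # Monetary: the total GOOSE arrival count.
    monetary = len(arrival_times)
    return recency, frequency, monetary
```

With a single packet no interval exists, so the sketch returns `None` for Frequency rather than guessing a value.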
Many researchers have applied both dynamic features and static features to monitor and identify abnormal GOOSE messages in SASs [1,3,4,28,29,31,32]. However, some of them failed to provide detailed statistical results, such as a false-positive rate (FPR) and false-negative rate (FNR) [28,29] while the others did not consider or failed to detect stealthy attacks [1,3,4,32]. Therefore, more features are required to improve the accuracy of detecting stealthy attacks.
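For reference, the two statistics named above follow the standard definitions FPR = FP / (FP + TN) and FNR = FN / (FN + TP). The sketch below computes both from per-sample labels; the label encoding (1 for attack, 0 for benign) and the function name `error_rates` are assumptions made for illustration.

```python
def error_rates(y_true, y_pred):
    """Return (false-positive rate, false-negative rate).

    Assumed encoding: 1 = attack (positive), 0 = benign (negative).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    # Guard against empty classes to avoid division by zero.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```

A high FNR is the more serious failure for stealthy attacks, because it counts attacks that were classified as benign.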

B. PHYSICAL FEATURES FROM SENSOR DATA
From an electrical engineering perspective, based on Kirchhoff current and voltage laws, Valdes, et al. [7] used sensor data, such as current and voltage, to classify three distinct states corresponding to normal operation, non-malicious fault, and false measurement injection in SASs. However, their approach is limited to detecting attacks on SV messages and the accuracy is not satisfactory. Additionally, Kreimel, et al. [8] target communication among Remote Terminal Units (RTUs) based on the DNP3 protocol. They selected both dynamic features (round-trip-time of packets) and static features (packet length, TCP window size) as well as sensor data (voltage) collected from solar panels. They achieved high detection accuracy on man-in-the-middle (MITM) drop attacks, but low accuracy on stealthy attacks which only change the transmitted measurement data by a small amount. This low accuracy can be attributed to the fact that the false data injection attack (FDIA) did not deviate much from the values of normal behaviour [8]. As a result, we conjectured that a similar method can be applied to detect stealthy attacks targeting GOOSE messages by carefully selecting both critical network features from GOOSE messages and generic physical features from sensor data.

C. COMBINING BOTH NETWORK AND PHYSICAL FEATURES
In our previous work [33], we identified and selected one dynamic feature (the GOOSE heartbeat), nine static features (e.g., MAC, APPID, gocbRef, allData), and two types of physical features (circuit physical values and circuit breaker status). By including two additional features, Boolean control data from the "allData" field and various physical features, our method improved the accuracy of detecting stealthy attacks by decreasing the false-negative rate from 25% to 5% approximately.
However, we also observed an issue when introducing physical features to anomaly detection in SASs. Within SASs, multiple instances of the same type of IED may be applied to protect different sections of the primary plant. When IEDs of the same type are compromised, IEDs protecting different parts will generate datasets with different network and physical features, for instance, when IED1 provides overcurrent protection to transformer1 while IED2 protects transformer2. When both compromised devices IED1 and IED2 are triggered to conduct the same FDIA respectively, network features generated from both IED1 and IED2 may have the same patterns. However, physical features generated from sensor data are different as the two attacks impact different physical parts. The attack from IED1 only influences physical values and statuses around transformer1 without interfering with transformer2, while the attack from IED2 does the opposite. Therefore, even though IED1 and IED2 are the same type of device, if an anomaly-based detection model which applies both network features and physical features only learns attack datasets from IED1, it cannot detect attacks from IED2 accurately. For this reason, the detection model must learn all attack datasets generated from all devices within an SAS. However, since SASs are complex systems that involve numerous control devices, it is impractical to collect attack datasets which contain anomalies occurring on every single device [34]. Thus, an additional feature extraction method is required to generalise and summarise critical features to detect anomalies across multiple devices while only learning behaviours of one typical device [35].

D. FEATURE EXTRACTION
Some researchers have proposed several feature extraction methods, and those methods might be useful to overcome the issue mentioned above. Ouyang, et al. [6] presented a hierarchical time series feature extraction method to detect anomalies in power consumption. They defined four types of features from daily power consumption readings: summary features, shift features, transform features, and decompose features. The summary features are time-windowed statistical variables, including mean, median, and standard deviation of daily power consumption. Qiu, et al. [36] also introduced trend indicators to detect anomalies for power consumption. The trend indicators are calculated based on the average values of the time series. Although these features are based on daily power consumption, a similar method can be applied to summarise physical features in our previous method [33] to detect anomalies across multiple devices. Furthermore, Gomes, et al. [37] summarised two general feature extraction methods for streaming data: summarisation sketches and dimensionality reduction. Summarisation sketches combine any sketches of individual streams in a space-efficient way while dimensionality reduction converts original input data into a simplified form without compromising relevant patterns of the input data. These two methods can also be helpful to promote the utilisation of limited datasets into general and systematic problems.
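The time-windowed summary features mentioned above (mean, median, and standard deviation) can be sketched as follows. This is only an illustration under stated assumptions: the windows are non-overlapping, a trailing partial window is dropped, and the population standard deviation is used. None of these choices is confirmed by Ouyang, et al. [6], and the function name `summary_features` is hypothetical.

```python
from statistics import mean, median, pstdev


def summary_features(values, window):
    """Summarise a numeric series over consecutive non-overlapping windows.

    Assumptions (illustrative): a trailing partial window is dropped and the
    population standard deviation (pstdev) is used.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    features = []
    for start in range(0, len(values) - window + 1, window):
        chunk = values[start:start + window]
        features.append({
            "mean": mean(chunk),
            "median": median(chunk),
            "std": pstdev(chunk),
        })
    return features
```

Because each window is reduced to a few statistics, readings taken at different locations can be compared on the same scale, which is the property the text relies on when generalising across devices.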

E. TYPICAL MACHINE LEARNING ALGORITHMS FOR ANOMALY DETECTION
Generally, when applying machine learning algorithms for anomaly detection, there are two main types: supervised learning and unsupervised learning. Although unsupervised clustering algorithms do not require upfront effort to label datasets appropriately, their detection accuracy is usually lower than supervised classification algorithms [38]. For supervised learning, according to different classification outputs, there are commonly two types: 1) classifying each sample with one label, and 2) classifying a consecutive sequence of samples with one label. The former one usually applies traditional algorithms, such as K-Nearest Neighbour (KNN), Support Vector Machine (SVM), and Decision Tree, while the latter one adopts sequential classification algorithms, including Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM).
The KNN algorithms classify a new abnormal behaviour by a majority vote from its near neighbours [37]. The SVM algorithms find a hyperplane to clearly distinguish data points into two groups based on various features [10]. The decision tree algorithms improve the accuracy of classification through various abnormal factors in a tree structure [39]. These traditional algorithms usually do not consider time-serial correlations among various features, and may not accurately detect stealthy attacks in which messages are only modified marginally [8].
On the other hand, sequential classification algorithms, such as LSTM, have feedback connections which can learn the contexts of the current behaviour, and thus provide more accurate results when detecting stealthy attacks [9]. Meanwhile, LSTM is better for a short time duration of time-serial data and can reduce the system complexity [9]. Since most transient behaviours in substations are short-term events with tangled logical processes, LSTM is suitable for identifying such systematic behaviours. Furthermore, bidirectional LSTM (BiLSTM) models evolved from LSTM and consist of two layers: a forward LSTM layer and a backward LSTM layer. By learning both the forward flow and the backward flow, BiLSTM can effectively understand the correlation patterns among the previous behaviour, the current behaviour, and the next behaviour [40].

F. SLIDING WINDOW ALGORITHMS
In recent decades, sliding window algorithms have become popular when learning time-series streaming data. They can effectively reduce learning computational costs and improve detection accuracy when detecting anomalies from time-series streaming data [34]. The sliding window is defined as a sequence of data which represents the most recently arrived tuples [41]. It covers W number of samples and pushes forward S number of samples every window. W is called the window size while S is called the step size. Generally, there are two types of sliding windows: quantity-based (count-based) and time-based [34,42]. Quantity-based defines the window size based on a number of samples while time-based specifies the window size based on a period of time.
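A quantity-based sliding window with window size W and step size S can be sketched in a few lines. The function name `sliding_windows` is hypothetical, and the sketch assumes that a trailing partial window is discarded, so N samples yield floor((N - W) / S) + 1 windows when N >= W.

```python
def sliding_windows(samples, window_size, step_size):
    """Split a sequence into count-based windows of W samples, advancing S samples.

    Windows overlap whenever step_size < window_size. A trailing partial
    window is discarded (an illustrative assumption).
    """
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    last_start = len(samples) - window_size
    return [
        samples[start:start + window_size]
        for start in range(0, last_start + 1, step_size)
    ]
```

For example, ten samples with W = 4 and S = 2 give four windows, each sharing two samples with its neighbour.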
Many researchers have applied sliding window algorithms to various applications. In the field of power systems, some researchers utilised sliding window algorithms for power supply load forecasting [43], transient stability prediction [44], faulty equipment detection based on image recognition [45], and IED defect classification based on text mining [40]. In the field of anomaly detection, researchers applied sliding window algorithms to detect anomalies in different applications, such as IoT networks [9,10,46], and in-vehicle networks [47,48]. However, none of them has applied sliding window algorithms to detect anomalies within SASs.
Furthermore, according to research from both fields, it is widely believed that sliding window size is an important factor in sliding window algorithms, and that selecting an appropriate sliding window size is application-specific. In the field of anomaly detection, researchers also emphasised that there is a trade-off between the detection accuracy and detection time when choosing different sliding window sizes [49]. A larger window size gives higher accuracy but a longer detection time (the time between the start of an attack and its detection), and vice versa [34,47,48]. However, there is no empirical procedure or standard which can help us determine a preferred window size in the application field of anomaly detection. Therefore, the preferred window size must be selected based on security requirements considering both accuracy and detection time.

IV. DATASET COLLECTION
In our research, the datasets were collected from a simulation testbed. Although datasets generated from real operational systems are better than simulated datasets, such datasets usually lack attack scenarios or do not indicate if they contain malicious behaviours [11]. Furthermore, datasets for commercially sensitive critical infrastructure such as SASs are difficult to obtain. On the other hand, with a certain level of fidelity, simulation testbeds can generate more varieties of datasets, including both benign behaviours and various types of malicious behaviours. Thus, in our work, based on the IEC 61850 standard, a cost-efficient software-based simulation testbed was implemented. We generated and collected a total of 31 datasets based on various scenarios, including 15 benign scenarios and 16 attack scenarios.

A. SIMULATION TESTBED
The testbed runs on Oracle VirtualBox with five virtual machines (VMs). One VM simulates a small-scale primary plant of a distribution substation using MATLAB/Simulink. According to the general architecture of distribution substations, the simulated primary plant consists of a 66 kV high-voltage line, two transformers, a 22 kV low-voltage line, four feeders, and many circuit breakers (shown in Figure 1). Ten short-circuit fault blocks were set up in Simulink for generating non-malicious events at ten different locations. The remaining four VMs represent different types of IEDs simulated using OpenPLC, including three instantaneous overcurrent protection IEDs, IED_PIOC_TRSF1 (IED1), IED_PIOC_TRSF2 (IED2), and IED_PIOC_FDR, and one circuit breaker failure protection IED, IED_BFP. Communication among the VMs, such as GOOSE trip messages between IEDs and the primary plant, was implemented in C/C++ using the libiec61850 library. Additionally, in each VM, various interface programs were written to link OpenPLC, MATLAB/Simulink, and the "libiec61850" library. As shown in Figure 2, the interface program in VM-IEDs reads analogue values from Simulink in VM-Primary-Plant via UDP packets, and passes these values to OpenPLC. Meanwhile, the program also reads digital signals from OpenPLC and passes these signals to the "libiec61850" program to assemble GOOSE packets. After the "libiec61850" program in VM-Primary-Plant receives those GOOSE packets, the interface program reads digital signals from decoded packets, and passes them to Simulink via UDP packets. Although the testbed only includes a limited number of IEDs as a prototype, this can be extended with more functionalities in the future. Figure 2 illustrates the communication architecture of the testbed. In particular, the central process bus is used by IEDs to communicate with actuators (e.g., circuit breakers) in the primary plant via GOOSE messages.

B. DATASET GENERATION
Generally, there are two different benign behaviours in substation operation: 1) normal operation when no unusual events happen; and 2) emergency operation when non-malicious events (e.g., short-circuit faults) happen. Attacks can occur in both benign behaviours, which introduces two types of malicious behaviours: 1) an attack under normal operation to disrupt energy transmission, and 2) an attack under emergency operation to stop protection mechanisms or trigger undesirable protection operations.
Based on these four types of behaviours, a total of 31 datasets were generated. Each dataset consists of five network packet capture files collected from five VMs, and four sensor data records collected from four IEDs. Since the testbed only models instantaneous overcurrent protection and circuit breaker failure protection, the sensor data only contains circuit current values from sensors at different locations and the operational status of various circuit breakers. Table I describes each scenario. Datasets of both normal operation and emergency operation are benign behaviour datasets which contain 7447 and 12457 individual samples respectively. Attack datasets include attack scenarios from both IED1 and IED2 with 8015 and 9902 individual samples respectively. All benign behaviour datasets and attack datasets of IED1 were used for training only while attack datasets of IED2 were used for testing only.
As mentioned before, this paper mainly focuses on FDIA and replay attacks. We created eight different attack scenarios regarding GOOSE messages from IED1 and replicated these eight scenarios to GOOSE messages from IED2. IED1 and IED2 are the same type of protection relays that protect transformer1 and transformer2 respectively. All these eight attack scenarios can be classified as FDIA. According to Ahmed and Pathan [50], FDIA consists of three forms in general: 1) deletion of data from the original message; 2) modification of data in the original message, and 3) addition of fake data or fake messages. We cover the last two forms in this paper and implemented four attack scenarios for both the normal operation and the emergency operation respectively. For the normal operation, four attack scenarios include: 1) two message injection attacks that inject additional GOOSE trip messages to mislead circuit breakers into opening; and 2) two message modification attacks that change the payload of GOOSE messages from non-trip to trip to mislead circuit breakers into opening. For the emergency operation when a phase-to-phase fault happens, four attack scenarios contain: 1) two message injection attacks that a) inject additional GOOSE non-trip messages to stop protection mechanisms, and b) inject additional GOOSE trip messages of another IED to trigger unnecessary and unexpected protection mechanism; and 2) two message modification attacks that a) change the payload of GOOSE messages from trip to non-trip to stop the protection mechanism, and b) change the payload of GOOSE messages of another IED from non-trip to trip to trigger unnecessary and unexpected protection mechanisms.

V. DATASET PRE-PROCESSING
After collecting all the datasets, we conducted comprehensive dataset pre-processing, and prepared various datasets for different controlled trials later. Firstly, we handled all datasets with three processes: format conversion, data merging, and data normalisation. Then we selected nine critical distinguishing features from GOOSE messages and various physical features (circuit current values and circuit breaker statuses) from sensor data needed for precise anomaly detection of insider attacks. Meanwhile, we applied a feature extraction method to generalise and summarise critical features from both network and physical features. Six network features and seven summarised physical features were extracted. After that, we labelled each sample based on its behaviours. Since sequential classification algorithms classify a sequence of samples with one label, we applied the worst-case principle to generate labels for each sequence. Lastly, we applied sliding window algorithms to divide datasets into different overlapped window-based snippets. Both the non-window-based datasets and window-based datasets were generated to satisfy traditional machine learning algorithms and sequential classification machine learning algorithms respectively.
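The worst-case labelling of sequences can be sketched as follows. The sketch assumes that "worst-case" means a window is labelled malicious as soon as it contains at least one malicious sample; this interpretation, the label encoding (0 for benign, 1 for attack), and the function name `label_windows` are assumptions rather than details stated in this section.

```python
def label_windows(sample_labels, window_size, step_size):
    """Assign one label per sliding window using a worst-case rule.

    Assumed encoding: 0 = benign, 1 = attack. A window takes the maximum
    (worst) label among its samples, so a single attack sample marks the
    whole window as an attack.
    """
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    last_start = len(sample_labels) - window_size
    return [
        max(sample_labels[start:start + window_size])
        for start in range(0, last_start + 1, step_size)
    ]
```

With overlapping windows, one malicious sample can mark several consecutive windows, which gives the classifier more than one opportunity to flag the same event.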

A. DATA HANDLING PROCESSES
Before selecting critical features from datasets, three necessary dataset handling processes were required. These include format conversion, data merging, and data normalisation, which are demonstrated below. Some processes are proprietary to the simulation environment in this paper and may require slight adjustment to apply to other cases.
Format conversion: Since original datasets contain network packet capture (PCAP) files which cannot be directly used, we firstly converted network packet files to comma-separated values (CSV) files. As stealthy attacks targeting GOOSE messages are our main concern, only GOOSE packets were extracted. We wrote a Python program using Scapy to elicit all static features from the GOOSE packets as well as the packet received timestamp, and exported this data to CSV files.
Data merging: In this step, we merge all CSV files into one CSV file and remove redundant data. Firstly, we merged all converted network CSV files into one file and removed redundant network packets. Then, to link packet transmission events to physical sensor data, according to the packet received timestamp of each packet, we wrote a macro to find the closest timestamp from four sensor data records, and added corresponding physical features after that packet. Finally, we generated one CSV file which contains both network features and physical features.
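The nearest-timestamp matching described above can be sketched in Python (the original work used a macro, so this is an illustrative substitute, not the authors' implementation). The sketch assumes that sensor timestamps are sorted in ascending order and that packets and sensor rows are plain dictionaries; the names `nearest_index` and `attach_sensor_rows` and the `time` key are hypothetical.

```python
from bisect import bisect_left


def nearest_index(sorted_times, t):
    """Return the index of the timestamp in sorted_times closest to t."""
    if not sorted_times:
        raise ValueError("sorted_times must not be empty")
    pos = bisect_left(sorted_times, t)
    if pos == 0:
        return 0
    if pos == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[pos - 1], sorted_times[pos]
    # Ties go to the earlier record (an arbitrary illustrative choice).
    return pos if (after - t) < (t - before) else pos - 1


def attach_sensor_rows(packets, sensor_times, sensor_rows):
    """Append the closest sensor record to each packet record."""
    merged = []
    for packet in packets:
        idx = nearest_index(sensor_times, packet["time"])
        merged.append({**packet, **sensor_rows[idx]})
    return merged
```

Binary search keeps each lookup logarithmic in the number of sensor records, which matters when every packet must be matched.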
Data normalisation: Some data may be missing or invalid and some data may be difficult to recognise by machine learning algorithms, thus requiring a data normalisation process. Firstly, we removed all missing and invalid data. Secondly, all non-numerical values were converted to unique numerical values to reduce computational costs. For instance, the MAC address "20:17:01:16:F0:99" was simplified to 99, gocbRef "Testbed/PIOC$TRSF1$CBStval" was changed to 1111, and Boolean control value [1, 1, 1, 0] in the "allData" field was converted to a binary number "1110" first, then converted to the corresponding decimal number "14".
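The Boolean-to-decimal conversion in the example above can be reproduced with a short sketch. The `CategoryEncoder` class is a hypothetical stand-in for the mapping of non-numerical values to unique numbers: it assigns codes in order of first appearance, which differs from the specific codes (99, 1111) chosen in the paper.

```python
def booleans_to_decimal(bits):
    """Read a list of Boolean values as a binary number, most significant bit first."""
    value = 0
    for bit in bits:
        value = (value << 1) | (1 if bit else 0)
    return value


class CategoryEncoder:
    """Map each distinct non-numerical value to a unique integer code.

    Codes are assigned in order of first appearance, starting at 1
    (an illustrative scheme, not the one used in the paper).
    """

    def __init__(self):
        self._codes = {}

    def encode(self, value):
        if value not in self._codes:
            self._codes[value] = len(self._codes) + 1
        return self._codes[value]
```

For instance, `booleans_to_decimal([1, 1, 1, 0])` reads the bits as "1110" and returns 14, matching the example in the text.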

B. FEATURE SELECTION
After finishing data handling processes, we selected nine critical network features and two types of physical features, which are described below. All network features were selected to detect various specific attack scenarios, especially stealthy attacks including FDIA and replay attacks. Two types of physical features were selected to indicate physical systems' behaviours, and accordingly, help determine network behaviours.

1) Heartbeat (network):
An important dynamic feature in GOOSE message communications. For packets with the same "gocbRef", we calculated the time interval between two adjacent packets as the heartbeat. This dynamic feature normally indicates if non-malicious events happen, but can also be used to detect DoS attacks, FDIA, and time delay attacks [2].

2) MAC-src (network):
The source MAC address, which can be used to identify the publisher to detect MITM attacks and FDIA [1].

3) APPID (network):
GOOSE application identification, which can be used to verify the type of application and detect if an illegal application is sent from the publisher.

4) Length (network):
The length of the GOOSE header and APDU, which can be used to detect if a GOOSE message is invalid or modified.

5) gocbRef (network):
The GOOSE control block reference, which contains all information of a pre-defined control block. Since the "datSet" field and the "goID" field are included in the "gocbRef" field, only the "gocbRef" field was selected.

6) Dif-st (network):
The differential value of the state number. For packets with the same "gocbRef", we calculated the differential value between the current stNum and the previous stNum. This normally indicates if non-malicious events happen, but also can be used to detect FDIA by highlighting large jumps in the number sequence [4].

7) Dif-sq (network):
The differential value of the sequence. Similar to Dif-st.

8) numDatSetEntries (network):
Indicates the amount of data in the "allData" field and can be used to detect if additional data is included in payloads.

9) Dec-allData (network):
The decimal number of converting all Boolean values in the "allData" field, which normally invokes the control commands, and can be used to detect FDIA targeting Boolean control value.

10) Circuit physical values (physical):
Normally consists of current and voltage readings observed from various sensors, which imply various power systems' behaviours. By checking if the circuit current values are normal or not, it helps distinguish between malicious behaviours (e.g., the relay was attacked to make circuit breakers open) and benign behaviours (an actual short-circuit fault occurred).

11) Circuit breaker statuses (physical):
The statuses of different circuit breakers, usually collected from system logs in each IED, also help determine systems' behaviours.
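The three dynamic network features (heartbeat, Dif-st and Dif-sq) are all differences between a packet and the previous packet with the same "gocbRef". The following Python sketch shows one way to compute them. The dictionary keys "time", "stNum" and "sqNum" and the choice to leave the features undefined (None) for the first packet of each control block are assumptions made for illustration.

```python
def add_dynamic_features(packets):
    """Add heartbeat, dif_st and dif_sq to each packet.

    packets: list of dicts sorted by arrival time, each with the
    (assumed) keys 'time', 'gocbRef', 'stNum' and 'sqNum'.
    """
    last_seen = {}  # most recent packet per gocbRef
    rows = []
    for packet in packets:
        previous = last_seen.get(packet["gocbRef"])
        row = dict(packet)
        if previous is None:
            # First packet of this control block: no earlier packet to compare with.
            row["heartbeat"] = None
            row["dif_st"] = None
            row["dif_sq"] = None
        else:
            row["heartbeat"] = packet["time"] - previous["time"]
            row["dif_st"] = packet["stNum"] - previous["stNum"]
            row["dif_sq"] = packet["sqNum"] - previous["sqNum"]
        last_seen[packet["gocbRef"]] = packet
        rows.append(row)
    return rows
```

A large value of dif_st between consecutive packets of one control block is the kind of jump in the number sequence that the Dif-st feature is meant to expose.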

C. FEATURE EXTRACTION
In practice, it is costly and time-consuming to generate all attack datasets targeting each individual IED, as there are many IEDs on the process bus of an IEC 61850-compliant SAS. As mentioned in the literature review, the anomalous behaviours of one IED may differ from those of another, even if they are the same type of device. These differences are reflected in both network features (e.g., MAC address and APPID) and physical features (e.g., current values around transformer1 and transformer2). Therefore, it is important to generalise our detection methods so that they suit general cases. We need to detect anomalies from all IEDs in a small-scale simulation environment while only learning datasets from one typical IED. Accordingly, with minor adjustments, our methods can be extended and applied to large-scale real systems. To achieve this objective, in this work we applied feature extraction methods to generalise critical features from both network features and physical features.
Firstly, we excluded three network features which indicate the identity of a particular device or a particular message application. These three network features include MAC-src, APPID, and gocbRef. Without such identities, the network behaviours of an insider attack are generalised to arbitrary devices, and thus, the detection model only needs to learn malicious behaviours from one typical device. However, without such identities, we can only detect if there is a malicious message, but cannot identify where this message comes from directly. Nonetheless, after detecting the malicious message, we can still re-extract such identities from the message payload, and trace the sources of the anomaly.
Secondly, we summarised various physical readings and reduced the number of physical features from 18 to 7. Although there are only two types of physical features, various sensors and system logs still generate numerous features which indicate the physical statuses of different zones within a substation. Based on the general architecture of the primary plant of distribution substations, we extracted six summarised features from the various circuit physical readings. Each feature is the average value of one horizontal level of the primary plant as shown in Figure 3. If one of these physical features is abnormal, something is wrong at that horizontal level, which in turn indicates that the IEDs protecting that level may have been compromised to launch attacks. Furthermore, we also summarised all circuit breaker statuses into one feature. Since most circuit breakers only have two statuses, open (1) and closed (0), we compiled all circuit breaker statuses following a specific sequence and showed them as a binary number.
As a result, we generalised and extracted a total of 13 features which are listed below. With a limited volume of datasets that only contain malicious behaviours from one typical IED, these 13 features can help detect anomalies from all IEDs. Additionally, reducing the number of features simplifies the neural network to be trained, which decreases training and detection time while increasing detection accuracy.
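The two extraction steps described above can be sketched in Python as follows. This is an illustrative sketch only: the sample layout, the feature keys, the grouping of sensors into levels and the breaker ordering are all hypothetical inputs supplied by the caller, and the breaker statuses are encoded here as the integer whose binary digits follow the given sequence.

```python
# Network features that identify a particular device or application (removed).
IDENTITY_FEATURES = ("mac_src", "appid", "gocbRef")


def extract_features(sample, levels, breaker_order):
    """Generalise one sample.

    sample: dict with (assumed) sub-dicts 'network', 'physical', 'breakers'.
    levels: maps a summarised feature name to the sensor names of one
            horizontal level of the primary plant (hypothetical grouping).
    breaker_order: breaker names in the fixed sequence used for encoding.
    """
    # Step 1: drop the identity features from the network features.
    features = {
        name: value
        for name, value in sample["network"].items()
        if name not in IDENTITY_FEATURES
    }
    # Step 2a: one averaged physical feature per horizontal level.
    for level_name, sensors in levels.items():
        readings = [sample["physical"][sensor] for sensor in sensors]
        features[level_name] = sum(readings) / len(readings)
    # Step 2b: all breaker statuses (1 = open, 0 = closed) as one binary number.
    status = 0
    for breaker in breaker_order:
        status = status * 2 + sample["breakers"][breaker]
    features["cb_status"] = status
    return features
```

Because the identity features are removed rather than overwritten in the input, the original sample can still be consulted after detection to trace the source of an anomaly.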

D. LABELLING
According to the various scenarios shown in Table I, different labels were given. Label 0 indicates the normal operation scenario when no event happens. Label 1 shows the emergency operation scenario when a non-malicious event happens. Labels 901 to 908 represent different stealthy attack scenarios. Owing to the publisher-to-subscriber communication model, GOOSE network traffic involves repeated network packets published from different IEDs. Since our datasets contain both network features and physical features, there are three types of labelling methods which are illustrated in Table II.
Label type 1 focuses on network features. Only the packet from the impacted IED is labelled as an emergency or attack, and the rest are labelled as normal. Label type 2 emphasises physical features, labelling all packets as 0 under normal operation, 1 under emergency operation, and 901 to 908 under various attacks. Label type 3 is the combination of type 1 and type 2 which was used in this paper. Under normal operation, all packets are labelled as 0. Under emergency operation, all packets are labelled as 1. Under attacks, only the packet from the impacted IED is labelled as an attack, and the rest of the packets are labelled as emergency operations due to abnormal physical features.
Datasets for sequential classification algorithms require different labelling methods as such algorithms classify a sequence of samples with one label. Therefore, within a sequence of samples, different labels need to be integrated into one label. Baldini [48] labelled a sequence of samples as an attack if it contained at least one malicious packet. We refined this labelling method by applying the worst-case principle, labelling a sequence of samples with its highest-priority label (attacks higher than emergency, emergency higher than normal). Table II illustrates examples of labelling a sequence. If a sequence of samples involves any attack labels (label 901 to 908), it will be labelled as the corresponding attack. If a sequence of samples does not contain attack labels, but has label 1, it will be labelled as 1. Only a sequence of samples with all labels 0 will be labelled as normal.
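The worst-case labelling rule can be expressed compactly. The Python sketch below is illustrative; how a window containing several different attack labels is resolved is not stated above, so returning the first attack label encountered is an assumption.

```python
def priority(label):
    """Attack labels (901 to 908) outrank emergency (1), which outranks normal (0)."""
    if 901 <= label <= 908:
        return 2
    if label == 1:
        return 1
    return 0


def label_sequence(labels):
    """Return the single label assigned to a window of per-packet labels.

    Assumption: if several different attack labels occur in one window,
    the first one encountered is returned (max keeps the first maximum).
    """
    return max(labels, key=priority)
```

For example, a window with packet labels [0, 1, 903, 1] is labelled 903, while [0, 0, 1, 0] is labelled 1.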

E. SLIDING WINDOW PROCESS
In this step, we divided the whole sequence of samples into different overlapped window-based snippets. We applied both the quantity-based and the time-based sliding window algorithms.
For the quantity-based sliding window algorithm (shown in Algorithm 1), each window-based snippet has the same number of packets. The window size is set to w packets while the step size is set to s packets. Firstly, we extracted the first snippet of w packets, from X1 to Xw. Then we slid the window forward by s packets and obtained the second snippet, also of w packets, from X1+s to Xw+s. After that, we repeated the process until we reached the last packet.
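A minimal Python sketch of the quantity-based procedure is given below. It is not a transcription of Algorithm 1; in particular, discarding a trailing group of fewer than w packets is an assumption.

```python
def quantity_windows(samples, w, s):
    """Split samples into overlapping windows of exactly w packets, sliding by s packets.

    Assumption: a trailing remainder shorter than w packets is discarded.
    """
    if w <= 0 or s <= 0:
        raise ValueError("window size and step size must be positive")
    # Window starts are 0, s, 2s, ... while a full window of w packets still fits.
    return [samples[start:start + w] for start in range(0, len(samples) - w + 1, s)]
```

With ten packets, w = 4 and s = 2, the windows start at packets 1, 3, 5 and 7, and every window holds exactly four packets.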
For the time-based sliding window algorithm (shown in Algorithm 2), each snippet has the same time interval. The window size is set to w seconds while the step size is set to s seconds. Similar to the quantity-based algorithm, we extracted the first window by finding the first w seconds of packets, then stepped forward s seconds each time to obtain the remaining windows until the last packet was read.
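The time-based variant can be sketched in Python in a similar way. This is a simplified illustration rather than Algorithm 2 itself: window starts are placed on a fixed grid beginning at the first timestamp, each window covers the half-open interval [start, start + w), and trailing partial windows are kept. All three choices are assumptions.

```python
from bisect import bisect_left


def time_windows(times, samples, w, s):
    """Split samples into windows of w seconds, sliding by s seconds.

    times: arrival times in seconds, sorted ascending, one per sample.
    Assumptions: windows start at times[0] + j * s, cover [start, start + w),
    and windows near the end may contain fewer packets.
    """
    if w <= 0 or s <= 0:
        raise ValueError("window size and step size must be positive")
    windows = []
    if not times:
        return windows
    j = 0
    while True:
        start = times[0] + j * s
        if start > times[-1]:
            break
        lo = bisect_left(times, start)      # first packet at or after start
        hi = bisect_left(times, start + w)  # first packet at or after the window end
        if lo < hi:
            windows.append(samples[lo:hi])
        j += 1
    return windows
```

Unlike the quantity-based windows, the snippets produced here can contain different numbers of packets, because the packet rate within a fixed time span is not constant.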

F. FINAL DATASET SUMMARY
After processing all the datasets, we generated a total of six groups of datasets for later evaluation which are shown in Table III. In each group, we divided a total of 31 datasets into training datasets and testing datasets. Training datasets contain 15 benign scenarios (one normal and 14 emergency) and eight insider attack scenarios triggered from IED1. Testing datasets include eight insider attack scenarios triggered from IED2 which were never used for training purposes. The training datasets involve a total of 27919 individual samples, while the testing datasets have 9902 individual samples. The window-based datasets contain the same individual samples, but the number of window-based samples depends on the window size and step size. For instance, if the window size is 8 packets and the step size is 1 packet, the number of training window-based samples is 27912. If the window size is 8 packets and the step size is 2 packets, the number of samples is 27912 / 2 = 13956.
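The two counts in this example follow from the usual formula for the number of full windows in a sequence of n samples, floor((n - w) / s) + 1. The short Python check below treats the training samples as a single sequence, as the example does; this is a simplifying assumption, and the function name is chosen only for illustration.

```python
def window_count(n, w, s):
    """Number of full windows of size w, sliding by s, over one sequence of n samples."""
    if w <= 0 or s <= 0:
        raise ValueError("window size and step size must be positive")
    if n < w:
        return 0
    return (n - w) // s + 1


# The example from the text: 27919 training samples, window size 8.
print(window_count(27919, 8, 1))
print(window_count(27919, 8, 2))
```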
Datasets with two different numbers of features were created, in order to compare the performance between applying feature extraction and without feature extraction. For Groups 1 to 3, each individual sample consists of nine network features extracted from a GOOSE packet, 18 physical features from various sensor data when the packet is received, and the label based on labelling method type 3. For Groups 4 to 6, each individual sample consists of six network features, seven summarised physical features, and the label.
Algorithm 2 (time-based sliding window, excerpt):
find Tp in T such that Tp - Tk ≤ w ∧ Tp+1 - Tk ≥ w
Yj = {Xk, Xk+1, Xk+2, …, Xk+p-1}
find Tr in T such that Tr - Tk ≤ s ∧ Tr+1 - Tk ≥ s
k = k + r
end for
Meanwhile, to compare the performance of quantity-based sliding window algorithms, time-based sliding window algorithms, and no sliding window algorithm, we further divided the datasets into three groups. Groups 1 and 4 did not apply sliding window algorithms. Groups 2 and 5 applied time-based sliding window algorithms. Groups 3 and 6 applied quantity-based sliding window algorithms.
All groups of datasets have 10 labels. Label 0 indicates everything is under normal operation. Label 1 means there is a non-malicious event that happens under the emergency operation. Label 901 to label 908 represent various insider attack scenarios described in Table I. By doing this, our detection model can distinguish different types of attacks as well as two benign behaviours. Furthermore, we also implemented a shuffle process to generate holdout samples to avoid overfitting problems in machine learning. For dataset Groups 1 and 4, all individual samples from all scenarios in both training datasets and testing datasets were merged into two separate data sheets respectively and shuffled into random orders. For dataset Groups 2, 3, 5 and 6, all individual samples from each scenario were divided into window-based snippets first, then all snippets from all scenarios were merged into one data sheet and shuffled with random orders.

VI. MACHINE LEARNING PROCESS
At this stage, we created a total of five machine learning models for evaluation: three traditional machine learning models and two sequential classification machine learning models, which are listed in Table IV. On the one hand, traditional machine learning models usually classify each individual sample with a predefined label. We selected three of the most popular traditional machine learning models with high detection performance: support vector machine (SVM), K-nearest neighbour (KNN), and decision tree. On the other hand, sequential classification machine learning models identify a sequence of samples with one particular label. Most sequential classification machine learning models are based on recurrent neural networks (RNN), and we chose an improved RNN, the bidirectional long short-term memory (BiLSTM) network. To compare the performance with and without sliding window algorithms, we created two corresponding models. All models were created based on MATLAB's existing libraries. The settings of critical hyperparameters from these libraries are shown in Table IV. Additionally, for model 5, we implemented a cross-validation mechanism during the training process to improve the training performance. We divided the original training datasets into 80% training datasets and 20% validation datasets. Also, we set the validation patience to 5 epochs to avoid overfitting: the training process stops if the validation loss does not improve for 5 epochs. A detailed analysis of the quality of the learning process in BiLSTM is presented in Section IX.
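The 80%/20% split and the validation-patience rule can be illustrated with a short, library-independent Python sketch. It does not reproduce MATLAB's training routine; the function names, the fixed random seed and the interpretation of "does not improve" as "is not strictly lower than the best loss so far" are assumptions.

```python
import random


def split_train_validation(samples, validation_fraction=0.2, seed=0):
    """Shuffle the samples and hold out a fraction of them for validation."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def stopping_epoch(validation_losses, patience=5):
    """Return the 1-based epoch at which training stops, or None if it never does.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None
```

In this sketch, a later epoch with a lower loss is never reached once the patience limit has been hit, which is the intended behaviour of early stopping.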

VII. EVALUATION RESULTS
We conducted a total of five experiments to train and test different groups of datasets using various machine learning models. In each experiment, the main objective was to train all benign behaviours and stealthy attack behaviours only triggered from IED1, and then to detect the same insider attack behaviours generating from IED2. IED1 and IED2 are the same type of protection relays but protect different sections of the primary plant. As shown in Figure 1, IED1 monitors the status of the three-phase transformer1, detects any abnormal current readings and provides over-current protection to transformer1. Similarly, IED2 protects transformer2. However, if these two devices are compromised during manufacture or after deployment via a software update, they could be used to launch "insider" attacks within the substation's own infrastructure. Furthermore, the design objective of each experiment is shown in Table V. The detailed results are discussed below.

A. TRADITIONAL MODELS WITHOUT FEATURE EXTRACTION
In this experiment, we applied three traditional machine learning models (KNN, SVM, and decision tree) to classify different individual samples. For each model we repeated the training and testing processes three times and calculated the average performance, which is shown in Table VI. FP is the number of false-positive errors, when non-malicious samples are misclassified as malicious ones. FN is the number of false-negative errors, when malicious samples are misclassified as non-malicious ones. TP is the number of true positives, when malicious samples are identified successfully. TN is the number of true negatives, when non-malicious samples are classified successfully. "Others" is the number of non-critical errors, when either one benign sample is misclassified as another benign type (e.g., label 0 is misclassified as 1), or one malicious sample is misclassified as another malicious type (e.g., label 908 is misclassified as 907). The false-positive rate (FPR), also called fall-out, is the proportion of FPs among all non-malicious samples. The false-negative rate (FNR), also called the miss rate, is the proportion of FNs among all malicious samples.
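The definitions above can be made concrete with a small Python sketch for the ten-label setting. The denominators follow the wording above (all non-malicious samples for the FPR and all malicious samples for the FNR, including those counted under "Others"); this reading, and the function names, are assumptions rather than the exact procedure used to produce the tables.

```python
def is_attack(label):
    """Labels 901 to 908 are attack scenarios; 0 and 1 are benign."""
    return 901 <= label <= 908


def evaluate(true_labels, predicted_labels):
    """Count TP, TN, FP, FN and Others, then derive the FPR and FNR."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0, "Others": 0}
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == predicted:
            counts["TP" if is_attack(truth) else "TN"] += 1
        elif is_attack(truth) and not is_attack(predicted):
            counts["FN"] += 1  # attack classified as benign
        elif not is_attack(truth) and is_attack(predicted):
            counts["FP"] += 1  # benign classified as attack
        else:
            counts["Others"] += 1  # wrong label within the same class group
    malicious = sum(1 for truth in true_labels if is_attack(truth))
    benign = len(true_labels) - malicious
    counts["FNR"] = counts["FN"] / malicious if malicious else 0.0
    counts["FPR"] = counts["FP"] / benign if benign else 0.0
    return counts
```

Under this reading, a sample with true label 908 that is predicted as 907 counts as "Others" and does not raise the FNR, since the attack is still flagged as malicious.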
From Table VI, without the feature extraction method, all three traditional models show a very high FNR from 51.921% to 94.22% such that most stealthy attacks cannot be detected. This proves that existing methods cannot effectively detect insider attack scenarios triggering from all other IEDs when only learning attacks from one typical IED.

B. BILSTM AND TWO SLIDING WINDOW ALGORITHMS WITHOUT FEATURE EXTRACTION
In this experiment, we still used the original 27-feature datasets without feature extraction, and applied the sequential classification model (BiLSTM) and the two sliding window algorithms to classify a sequence of samples. The training and testing process was repeated four times with different settings of window size and step size.
For time-based sliding window algorithms, the window size and step size are defined as a certain number of seconds. Since GOOSE packets are sent periodically over the process bus, the heartbeats of GOOSE messages are normally one second during normal operation. This means that in a cycle of one second, all IEDs publish their GOOSE messages at least once. Thus, the window size and step size are preferably an integer number of seconds, so that every window-based snippet contains a full cycle of information from all GOOSE packets. The evaluation results are shown in Table VII. From Table VII, the FNRs range from 55% to 67.762%. Thus, without feature extraction, the BiLSTM model with time-based sliding window algorithms generally performs better than the traditional machine learning models, whose FNRs in experiment A reached 94.22%.
For quantity-based sliding window algorithms, the window size and step size were defined as a certain number of packets. The evaluation results are shown in Table VIII. From Table VIII, the FNRs range from 37.402% to 55.340%. These results indicate that the BiLSTM model with quantity-based sliding window algorithms is also better than traditional machine learning models.
Additionally, from Table VII and Table VIII, the FNRs when applying quantity-based sliding window algorithms range from 37.402% to 55.340%. These are generally lower than the FNRs of 55% to 67.762% obtained when applying a time-based algorithm. Therefore, we conclude that the quantity-based sliding window algorithms perform better than the time-based ones in the application of detecting stealthy attacks within SASs. Since a 37.402% FNR is not acceptable, applying only BiLSTM and sliding window algorithms is not enough. Thus, additional feature extraction methods are required.

C. TRADITIONAL MODELS WITH FEATURE EXTRACTION
In this experiment, we still applied three traditional machine learning models to classify different individual samples. However, we also applied a feature extraction method in which a total of 13 features were extracted from the original 27 features. Similar to experiment A, we also repeated training and testing processes three times for each model. The evaluation results are shown in Table IX.
From Table IX, all three traditional models still produce high FNRs. However, compared to the results in experiment A, after applying the feature extraction method, the FNRs of all three models dropped sharply. For instance, the FNR in the KNN model reduced from 94.22% to 31.748% while the FNR in the decision tree model decreased from 77.815% to 30.261%. Therefore, it is believed that the feature extraction method helps generalise critical features, and thus improves the accuracy of detecting stealthy attacks among multiple devices when only attack datasets from one typical device are trained.
Furthermore, the FNRs in experiment C are from 30.261% to 39.619%, which are better than the FNRs in experiment B from 37.402% to 55.340%. From these results, it is inferred that when detecting stealthy insider attacks with a limited volume of datasets, applying the feature extraction method is more important than improving the machine learning model.

D. BILSTM AND FEATURE EXTRACTION WITHOUT SLIDING WINDOW
In this experiment, we only applied feature extraction and BiLSTM. The whole sequence of the dataset was trained using BiLSTM without dividing it into different window-based snippets. Since time-series patterns of datasets are important during the training process, the datasets were not shuffled. Unlike the sliding window approach, the machine learning model here still classifies each sample with a particular label. The training and testing process was repeated three times.
The evaluation results are shown in Table X. From Table X, the FNR is always 100%. Therefore, without sliding window algorithms, the BiLSTM model cannot detect any stealthy attack scenarios, even with feature extraction. The reason may be that almost 97% of samples in the training datasets were labelled as benign behaviours, and the model ignores the remaining 3% of samples with malicious behaviours during the learning process. Because every IED in an SAS publishes GOOSE messages repeatedly, the system behaviours of SASs are usually unbalanced, with most samples being benign, unless all IEDs are compromised. Therefore, given the unbalanced network traffic when FDIAs occur within SASs, it is essential to apply sliding window algorithms for sequential classification to detect stealthy attacks accurately.

E. BILSTM AND TWO SLIDING WINDOW ALGORITHMS WITH FEATURE EXTRACTION
Finally, we applied feature extraction, BiLSTM, and two sliding window algorithms to classify a sequence of samples. The training and testing process was repeated four times with different settings of window size and step size. Table XI displays the evaluation results of applying time-based sliding window algorithms. From Table XI, the FNR is reduced to 22.984% when the window size is three seconds and the step size is one second. This FNR is better than the 55% FNR when applying time-based sliding window algorithms in experiment B, and even better than the 37.402% FNR when applying quantity-based sliding window algorithms in experiment B. This result shows the importance of applying feature extraction with BiLSTM and sliding window algorithms to detect stealthy attacks when only learning attack datasets from one typical device.
Table XII displays the evaluation results of applying quantity-based sliding window algorithms. From Table XII, when the window size is 12 packets, and the step size is one packet, the FNR is reduced to 5.385%. This is the best result among all the experiments. This result again proves that the quantity-based sliding window algorithms perform better than the time-based ones when detecting stealthy attacks in SASs.
Furthermore, in Table VIII and Table XII, when applying BiLSTM with quantity-based sliding window algorithms in two different experiments, the smallest FNR is always obtained under the same configuration, in which the window size is 12 packets and the step size is one packet. Therefore, different configurations of window size and step size influence the detection accuracy. Additionally, from the results, it is assumed that a larger window size and a smaller step size may give better performance. Thus, a comprehensive analysis of the performance of various window sizes and step sizes was conducted and is discussed in the next section.
In conclusion, these experiments show that applying feature extraction, BiLSTM, and quantity-based sliding window algorithms can effectively detect stealthy attacks triggered from similar untrusted IEDs when only learning malicious behaviours from one typical IED in the process bus. Although the 5.385% FNR is still high for anomaly detection in critical infrastructure, it is greatly improved from 51.921% in experiment A, when only traditional machine learning models were applied without feature extraction.

VIII. RECOMMENDED WINDOW SIZE AND STEP SIZE
In this section, we conducted two additional experiments to determine the recommended setting of window size and step size in sliding window algorithms. Since previous experimental results show that quantity-based algorithms are better than time-based algorithms, we only focused on the settings in quantity-based sliding window algorithms.

A. FIXED WINDOW SIZE AND VARIOUS STEP SIZE
For experiment F, the step size was investigated. We applied BiLSTM with quantity-based sliding windows to train and test dataset Group 6. With a fixed window size of 24 packets and various step sizes, we observed how different step sizes influence the detection performance. The number of testing samples and the average detection time per sample were recorded. We collected the total testing time in milliseconds, then divided it by the number of testing samples to get the average detection time per sample. Table XIII shows the performance of various configurations. From Table XIII, when the step size increases from 1 to 12, the number of testing samples reduces from 9718 to 813, the FNR increases from 1.666% to 43.304%, and the average detection time per sample increases from 2.6358 to 9.4726 milliseconds. The same pattern appears when the window size is 16. Therefore, regarding the FNR and average detection time, the preferred step size is the smallest, i.e., one packet. The reason is related to the number of training samples. From Table XIII, when the step size is doubled, the number of testing samples is almost halved. Accordingly, the number of training samples is also almost halved. Without a sufficient number of training samples, the training process does not perform well, which leads to a high FNR.

B. FIXED STEP SIZE AND VARIOUS WINDOW SIZE
For experiment G, the window size was investigated. With a fixed step size and various window sizes, we observed how different window sizes influence the detection performance. Based on the findings of the previous experiment, the step size was set to one packet to get the lowest FNR. Table XIV shows the performance of various configurations.
From Table XIV, when the window size increases from 8 to 30, the number of testing samples reduces slightly, and the average detection time per sample increases. However, the FNRs vary without an obvious pattern due to the uncertainty and randomisation of the learning process. We illustrated the FNRs for different window sizes in Figure 4. From Figure 4, the FNR generally improves when the window size increases. When the window size is from 20 to 30, all the FNRs are below 4%, which is assumed to be the preferred configuration range. Most importantly, when the window size is 22, the FNR is 0.372%, which is the lowest among all the experiments. Furthermore, we also illustrated the detection times for different window sizes in Figure 5. From Figure 5, the detection time keeps increasing when the window size increases. Since IEC 61850-compliant SASs have a strict response time requirement of 3 milliseconds, the detection time should also be less than 3 milliseconds. According to this requirement, when the window size is larger than 28, the detection time exceeds 3 milliseconds, which is not acceptable for anomaly detection within SASs. Therefore, the window size is suggested to be no larger than 28. As a result, in our simulation environment, considering both the detection time and detection accuracy, the recommended window size is from 20 to 28 packets while the preferred step size is one packet.

IX. DISCUSSION
Based on our experimental results, we can conclude that by applying feature extraction, BiLSTM, and quantity-based sliding window algorithms, our approach can effectively detect stealthy attacks triggering from similar untrusted IEDs when only learning malicious behaviours from one IED on the process bus. When the window size is 22 packets and the step size is 1 packet, we achieved the best results in which the FNR reduced to 0.372%. Furthermore, we conducted a further assessment of our anomaly detection methods, such as the time efficiency of algorithms, unbalanced dataset issue, and the quality of the learning process.

A. TIME EFFICIENCY OF ALGORITHMS
Similar to other researchers' findings, our experimental results also indicate that there is a trade-off between detection accuracy and detection time when selecting different window sizes. When the window size increases, the waiting time for obtaining all samples of one window-based sequence increases, thus the detection time increases accordingly. Figure 5 demonstrates this pattern. Due to the strict requirement of 3 millisecond response time in SASs, the window size should be limited to satisfy this requirement even though a larger window size gives more accuracy.
Additionally, we trained and tested all datasets offline. After observing the total testing time, we divided it by the total number of testing samples to get the average testing time per sample. This average testing time reflects the average detection time when the detection model is online. However, all training and testing processes were run in a simulation environment with a single CPU. If the detection system is deployed in a dedicated computer, it is believed that the detection time could be less than our current results.

B. UNBALANCED DATASETS
According to the results in experiment D, it is important to apply sliding window algorithms for sequential classification algorithms, especially when datasets are unbalanced between benign behaviours and malicious behaviours. Generally, due to the specific behaviours of the IEC 61850 compliant substation that various IEDs publish GOOSE messages to the process bus, the ratio of benign packets and malicious packets is usually unbalanced. Unless all IEDs have been compromised, the number of benign packets is much more than malicious ones. Therefore, the detection model needs to learn unbalanced datasets to detect anomalies in real systems' environments.

C. BiLSTM OR LSTM
Compared to the LSTM model which only learns datasets from the forward direction, the BiLSTM model learns datasets from both forward and backward directions. Therefore, BiLSTM usually has more complex neural networks than LSTM, and accordingly, requires more time and resources for the training and testing process. However, it is important to apply BiLSTM to detect stealthy insider attack scenarios within SASs as it gives higher detection accuracy.
The BiLSTM model learns the contexts from two directions. It helps the detection algorithms to understand what happens before an anomaly and what happens after an anomaly. Knowing what happens before an anomaly helps predict the following behaviours, while knowing what happens after an anomaly helps validate the current classification, and thus provides more accurate results. In power systems, when a non-malicious fault happens, the instantaneous behaviours might be similar to those of a stealthy attack scenario that mimics a non-malicious fault. However, their subsequent behaviours may be different. Thus, with backward learning, the BiLSTM can identify such stealthy attacks within SASs accurately.
Furthermore, in experiment H, we repeated experiment G and only changed the BiLSTM model to the LSTM model. The evaluation results are shown in Table XV. From Table XV, the FNRs are from 13.331% to 27.338%, which are all higher than the worst result of 11.594% in Table XIV. Therefore, BiLSTM performs better than LSTM when detecting stealthy insider attack scenarios within SASs.

D. TIME-BASED OR QUANTITY-BASED
Based on the results from experiments B and E, it is obvious that the quantity-based sliding window algorithms perform better than the time-based ones in the application of detecting insider attacks within SASs. There are three reasons why we selected quantity-based sliding window algorithms.
Firstly, according to our findings, the smallest step size produces the best accuracy with the lowest FNR. When the time-based sliding window is applied, the smallest step size is one second which involves at least n packets where n is the number of IEDs in the process bus. On the other hand, the smallest step size for the quantity-based sliding window is one packet which is obviously smaller than n packets. Thus, when both algorithms choose the smallest step size, quantity-based shows lower FNR than time-based.
Secondly, regarding the complexity of the two sliding window algorithms, the time-based approach needs to find the edges of each window by calculating the packet arrival time, and thus requires more time to generate window-based datasets than the quantity-based approach.
Thirdly, during training, the sequential classification model usually groups the training data into mini-batches and pads the sequences to ensure they have the same length. For quantity-based sliding window algorithms, every window has the same number of samples, so no additional padding is required. In contrast, for time-based sliding window algorithms, each window has the same time interval but a different number of samples, thus requiring additional padding at a higher computational cost.
Due to these three reasons, quantity-based sliding window algorithms are better than time-based when applying sequential classification algorithms to detect anomalies within SASs.
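The padding argument in the third reason can be illustrated with a small Python sketch. It is a generic illustration, not the padding routine of any particular library; the function name and the choice of zero as the padding value are assumptions.

```python
def pad_batch(windows, pad_value=0):
    """Pad every window in a mini-batch to the length of the longest one.

    Returns the padded batch and the number of padding entries that were added.
    """
    if not windows:
        return [], 0
    longest = max(len(window) for window in windows)
    padded = [list(window) + [pad_value] * (longest - len(window)) for window in windows]
    added = sum(longest - len(window) for window in windows)
    return padded, added


# Quantity-based windows all have the same length, so nothing is added.
print(pad_batch([[1, 2, 3], [2, 3, 4], [3, 4, 5]])[1])
# Time-based windows may differ in length, so padding entries are needed.
print(pad_batch([[1, 2, 3], [4], [5, 6]])[1])
```

Each padding entry is an extra time step the sequential model has to process or mask, which is the additional computational cost referred to above.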

E. THE QUALITY OF LEARNING PROCESS
Overfitting is a common issue when applying machine learning. The detection results are too close to a particular dataset, and may fail to fit unseen datasets. In this paper, we applied three techniques to mitigate the risk of overfitting.
Firstly, we applied cross-validation during the learning process, and the validation datasets were randomly generated from the whole training datasets. The validation datasets differed between individual training processes. Secondly, we shuffled both the training datasets and validation datasets before the training process started. This "holdout" process disorders the sequence of datasets and helps avoid overfitting. Thirdly, we set the validation tolerance to 5 epochs. This means that the training process stops if the validation loss does not improve for 5 epochs. This strategy also mitigates the risk of overfitting.
Furthermore, we selected proper hyper-parameters by trial and error. Table XVI shows the performance when choosing different numbers of hidden units in the BiLSTM layer. The window size is 10 packets, and the step size is one packet. From Table XVI, when the number of hidden units increases, the FNR decreases and the detection time per sample increases. However, when the number of hidden units increases from 200 to 400, the FNR does not show an obvious improvement. Therefore, considering both the accuracy and the detection time, the preferred number of hidden units is 200. Similarly, we set the mini-batch size to 256 based on the trade-off between the accuracy and the detection time.
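The shuffling and validation-tolerance steps described in this section can be sketched in a framework-independent way. The function names, the validation fraction, and the loss values below are hypothetical and serve only to illustrate the stopping rule; they are not taken from our experiments.

```python
import random
from typing import List, Sequence, Tuple


def shuffled_split(samples: Sequence, val_fraction: float, seed: int) -> Tuple[list, list]:
    """Shuffle the samples and hold out a random fraction for validation."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_fraction)
    return items[n_val:], items[:n_val]  # (training set, validation set)


def stopping_epoch(val_losses: List[float], tolerance: int = 5) -> int:
    """Return the epoch index at which training stops.

    Training stops once the validation loss has failed to improve on its
    best value for `tolerance` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= tolerance:
                return epoch
    return len(val_losses) - 1  # tolerance never exhausted


train, val = shuffled_split(range(100), val_fraction=0.2, seed=0)

# Hypothetical validation losses: best value at epoch 2, no improvement afterwards.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.75, 0.73, 0.74, 0.6]
stop = stopping_epoch(losses, tolerance=5)
```

With these illustrative losses, training stops at epoch 7, before the later value of 0.6 is ever reached; this is the price of a finite tolerance.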

F. MITIGATION OF DRAWBACKS WHEN ONLY LEARNING DATASETS FROM A SAMPLING DEVICE
In this paper, we presented a feature extraction method to solve a generalisation issue to support detecting anomalies from all IEDs while only learning datasets from one typical sampling IED. However, after removing some features, we also lose identity information. Thus, the detection model can only indicate there is an anomaly within an SAS, but cannot directly identify which devices triggered the anomaly. This drawback can be mitigated in two different ways.
Firstly, after the initial detection stage when an anomaly is detected, we can implement an additional process to re-extract the identity information from the abnormal packets. Then, we can discover which devices caused the anomalies. Secondly, by extracting the average value of each level in the primary plant, we can observe if any of these levels involves anomalies. Similarly, we can extract additional physical features to indicate which specific parts of the primary plant have been impacted, such as the standard deviation value of the same level. This aspect of the problem will be considered further in future work.
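Both mitigation ideas can be sketched as a post-processing step. This is only an illustration under stated assumptions: the packet field `src`, the device names, the level names, and the measurement values are hypothetical, and the tallying heuristic is one possible way of re-extracting identity information rather than a method we have evaluated.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def suspect_devices(flagged_windows: List[List[dict]]) -> List[Tuple[str, int]]:
    """Tally how often each source identity appears in windows flagged as abnormal.

    Assumes each packet is a dict that still carries a hypothetical `src`
    field, i.e. the identity information removed before classification.
    """
    counts = Counter(pkt["src"] for window in flagged_windows for pkt in window)
    return counts.most_common()


def level_summary(readings: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Mean and population standard deviation of the measurements at each level."""
    return {level: (mean(vals), pstdev(vals)) for level, vals in readings.items()}


# Hypothetical windows flagged by the detection model.
flagged = [
    [{"src": "IED-3"}, {"src": "IED-1"}, {"src": "IED-3"}],
    [{"src": "IED-3"}, {"src": "IED-2"}],
]
ranking = suspect_devices(flagged)

# Hypothetical per-level measurements; a large spread hints at the affected level.
summary = level_summary({
    "feeder": [10.0, 10.0, 10.0],
    "incomer": [10.0, 10.0, 40.0],
})
```

A level whose standard deviation is unusually large relative to the others would then be a candidate for closer inspection.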

G. ERROR TYPE "OTHERS"
Lastly, in each experiment, we reported the number of special classification errors, called "Others". This type of error is non-critical and has two cases. In Case 1, a benign sample is misclassified as another benign type, e.g., label 0 (normal operation status) is misclassified as label 1 (emergency operation status). In Case 2, a malicious sample is misclassified as another malicious type, e.g., label 908 is misclassified as 907. Case 1 affects the systems' normal operation because it misreports the systems' statuses. Case 2, however, has no impact on the systems' normal operation because the anomaly is still detected, though it may interfere with mitigation decisions because the anomaly type is misclassified.
To investigate the impact of these "Others" errors, we analysed the detailed results of each experiment in Table XIV. Across all window sizes, we found that Case 1 occurs at most once, and all remaining errors are Case 2. Therefore, these "Others" classification errors are of little consequence overall. Nonetheless, this type of error will be investigated further in future work.
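The distinction between critical errors and the two "Others" cases can be expressed as a small decision function. The sketch below assumes that labels 0 and 1 are the only benign labels and that every other label is malicious; this label set and the category names are assumptions for illustration.

```python
# Assumption: labels 0 and 1 are benign; every other label denotes an attack type.
BENIGN_LABELS = frozenset({0, 1})


def error_type(true_label: int, predicted_label: int, benign=BENIGN_LABELS) -> str:
    """Categorise a single prediction against its true label."""
    if true_label == predicted_label:
        return "correct"
    true_benign = true_label in benign
    pred_benign = predicted_label in benign
    if not true_benign and pred_benign:
        return "false_negative"   # attack reported as benign
    if true_benign and not pred_benign:
        return "false_positive"   # benign behaviour reported as attack
    # Both labels fall on the same side: the "Others" category.
    return "others_case_1" if true_benign else "others_case_2"


examples = [(0, 1), (908, 907), (908, 0), (1, 907), (907, 907)]
categories = [error_type(t, p) for t, p in examples]
```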

X. CONCLUSION
In this paper, we presented an anomaly detection model to detect insider attacks triggered from untrusted control devices within SASs. Our model combines feature selection and extraction methods, sequential classification algorithms, and sliding window algorithms. By selecting and extracting six critical network features and seven summarised physical features, our model can effectively detect insider attacks from any IED even though malicious behaviours from only one typical IED were learnt. Compared to traditional individual sample classification methods, our method combines BiLSTM and quantity-based sliding window algorithms, and improves detection accuracy by reducing the FNR from 30.261% to 0.372%. In future work, we will extend this method to protect SV communication and also the communication in the high-level station bus of IEC 61850-compliant SASs.