Physical Layer Detection of Malicious Relays in LTE-A Network Using Unsupervised Learning

For the Long Term Evolution Advanced (LTE-A) network, although many studies focus on improving performance with relays, security issues are often neglected. Due to the broadcast nature of wireless channels, relay nodes in the LTE-A network may act maliciously, disrupt communication, reduce quality, and cause delays. Recently, physical (PHY) layer security has attracted researchers as a means of providing secure communication and data privacy. In this study, we propose using an unsupervised learning approach at the destination node to detect malicious relay attacks in a cooperative LTE-A network based on the received source signal in the PHY layer. Outlier detection algorithms such as one-class support vector machine (OCSVM), local outlier factor (LOF), and isolation forest (iForest) are applied to detect various malicious relay behaviors such as garbling, regenerative, and false data injection type attacks. As input to these algorithms, feature vectors are constructed from the amplitude, phase, and relative phase information of modulated baseband symbols. The performance of the outlier detectors is evaluated with respect to precision, accuracy, and area under the curve (AUC) measures under changing signal-to-noise ratio (SNR) levels, different modulation types, allocated number of resource blocks (RBs), and varying data size. The results demonstrate the effectiveness of our proposed outlier detection approach for detecting malicious relays in the LTE-A network. Accuracy and precision of the algorithms are observed to be above 90% for SNR levels of 10 dB and larger for the relay attack scenarios considered here. AUC values for all algorithms at all SNR levels are also above 0.9 for the attack detection cases, and the LOF algorithm, with AUC values of 0.95 and above, is superior to the other algorithms.
The results verify the contribution of this study, which is the demonstration of the effectiveness of one class outlier detection approaches for detecting malicious relays in the LTE-A network.


I. INTRODUCTION
LTE-A standard provides increased coverage and lower delays in cellular systems that use higher carrier frequencies at the same transmission power and higher bandwidths [1]- [4]. LTE-A supports deployment of femtocells, picocells, relays, and remote radio heads in a macrocell arrangement to provide wider coverage and higher capacity at low cost to target areas with low power nodes [5]- [7].
As cell edge performance becomes more critical, LTE-A relay technology becomes a prominent solution with low cost and high performance. LTE relays are easy to set up for their applications to improve network capacity. The main benefits of the deployment of relay nodes are to provide enhanced coverage in LTE-A targeted areas for an efficient heterogeneous network and to avoid interference in surrounding areas at low cost. However, although many outstanding studies have been done on ways to improve relay performance in the LTE-A network, security issues are often neglected.
The associate editor coordinating the review of this manuscript and approving it for publication was Adnan M. Abu-Mahfouz.
Due to the broadcast nature of wireless channels, even though the diversity obtained from the relays is of high quality, these relays may act maliciously by garbling the message transmitted from the source, injecting false data, or regenerating a new signal [8]. Therefore, cooperative systems are naturally vulnerable to these types of attacks. Recently, PHY layer security has attracted researchers as a means of providing secure communication and data privacy. By detecting whether a relay node is secure before cooperating with it, it may be possible to prevent potential attacks and their costs.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In [9]-[12], malicious node detection for relay networks is based on statistical approaches that distinguish two signals with very similar distributions by using threshold values for their detectable or undetectable states. In [12], attack detection problems have been evaluated as statistical learning problems with channel state information (CSI) for different attack scenarios where measurements are observed collectively or online.
Studies in [13]-[15] have applied supervised machine learning techniques with statistical CSI to provide security in relay networks, using artificial neural networks and decision trees with past data on the network.
Within PHY layer security for relay networks using supervised and unsupervised learning, supervised algorithms are the more widely used [16]. Although supervised learning is an effective method, it is costly due to the need for balanced data sets, and it can detect only the categories it was trained on [17]-[20].
In [17], [18], relay signals have been categorized into two known classes, reliable and unreliable, by using features of I/Q constellations such as amplitude and phase in supervised learning algorithms. In [19], identity fraud has been detected through reliable channel features with a semi-supervised learning method. Anomaly detection has been studied in [20] using clustering and semi-supervised algorithms with well-known data sets, i.e., categorizing the samples on predefined data and model. Outlier detection is one of the most important approaches in all data-driven scientific disciplines. Data that do not match expected behavior often have interesting features and can help researchers better understand the problem at hand [21], [22].
In this study, we focus on detecting malicious relay attacks in a cooperative LTE-A network in the PHY layer via an unsupervised machine learning approach, which is more inclusive and powerful than statistical and supervised machine learning algorithms. We consider outlier detection as the unsupervised learning method.
In our study, one-class support vector machine (OCSVM) [23], local outlier factor (LOF) [24], and isolation forest (iForest) [25] are used as outlier detection algorithms to detect anomalies and unreliable behaviors of relays using secure data samples from the source.
These three algorithms were chosen because they have been shown to be effective even with a small number of training samples as compared to other detection methods. Another reason is that these algorithms are representative of different categories of outlier detection methods, i.e., LOF is distance-based, iForest is isolation-based, and OCSVM is domain-based. This allows us to show the contribution of outlier detection algorithms to the problem solution [22]. Conventional outlier detection approaches utilize the statistics of the received signals and may detect low-density signals as outliers. Therefore, they may not be able to detect malicious relays whose signal statistics are indistinguishable from those of secure relays, as discussed in Section III-A.
Although supervised learning methods may achieve successful results with a balanced training dataset, it may not always be feasible to predict the attack models and find a sufficient number of examples for training the algorithms in real-time applications. Applying a supervised learning approach to our problem is not feasible, since malicious relays must be detected as their signals are received at the destination node, before further processing for diversity purposes. Therefore, in this article, we compare the performance of the aforementioned unsupervised learning approaches for detecting malicious relay attacks.
The contributions of this article are as follows:
1) We show that unsupervised learning algorithms such as OCSVM, LOF, and iForest, which represent a broad category of outlier detection approaches, are capable of detecting malicious relay attacks with high accuracy and precision in the physical layer of the LTE-A network.
2) We propose feature vectors as inputs to the above algorithms based on baseband modulated signal characteristics in the OFDM system and show the effectiveness of this feature selection for the problem definition in the LTE-A relay network.
3) We explore the effects of different factors, such as modulation type, attack type, and signal bandwidth, on the attack detection performance of the OCSVM, LOF, and iForest algorithms.
The remainder of the article is organized as follows: We explain the system model and problem definition in Section II. In Section III, we define the feature set and explain the algorithms adapted to our problem. Section IV presents the experimental setup, and the performance evaluation is given in Section V. Finally, the conclusion is given in Section VI.

II. SYSTEM MODEL
In this section, we discuss the system model along with a problem definition of cooperative communication technology in LTE-A. In LTE Rel. 10, it has been specified that user equipment (UE) or a mobile station can communicate with the eNB, which is referred to as the LTE-A donor eNB, via a relay node. Radio relay stations can be deployed in different architectures in the network, as standardized in 3GPP. Potentially useful relay deployment scenarios discussed in 3GPP for LTE Rel. 10 are summarized in Table 1 [26]. Relay technology supports one-hop relaying for fixed-position and mobile relay stations, and provides a feasible solution to expand the coverage area in all scenarios.
In this study, we consider the deployment of Layer 1 relay nodes in the downlink of the LTE-A network, which is known as the amplify-and-forward (AF) type of radio relay technology, as illustrated in Figure 1. Layer 1 relay implementation is simple and has short processing delays, which makes it relatively low-cost when compared to other layer relay implementations.
The aim of the Layer 1 relay is to increase coverage in mountainous, urban, and indoor areas and also to increase the data rate in highly populated urban centers. However, this implementation has the disadvantage of amplifying the noise as well as the desired signal. The relay communication consists of two consecutive time periods. During time period 1 (1st hop), the macro base station (eNB) broadcasts the signal, and the UE and relay node (RN) receive it. Then, during time period 2 (2nd hop), the RN amplifies the signal received from the eNB and transmits it to the UE.
In the LTE-A downlink channel, the input data stream is carried by Orthogonal Frequency Division Multiplexing (OFDM) symbols over multiple frequency subcarriers. The block diagram of the OFDM signal for the LTE-A downlink data transmission channel and our attack detection approach is illustrated in Figure 2. The input data stream is converted from serial to parallel through a specific modulation scheme such as Quadrature Phase Shift Keying (QPSK), 16-Quadrature Amplitude Modulation (16-QAM), or 64-QAM. The modulation scheme depends on the physical channel mapped on the resource grid. The complex representation of a modulated symbol is carried by the signal $x_i$ on the $i$-th subcarrier. The modulated symbols are then mapped onto the resource elements (REs) of an LTE-A RB. According to LTE-A standards, one RE consists of a subcarrier of 15 kHz in the frequency domain and a duration of 66.7 µs in the time domain. Moreover, one RB consists of 84 REs, i.e., 12 subcarriers in the frequency domain and 7 symbols in the time domain. Mathematically, it can be represented as, where $P$ is the number of RBs assigned to the user. The number of RBs depends on the bandwidth. For instance, for a 10 MHz bandwidth, 9 MHz is used for data transmission and 1 MHz is reserved for the guard band.
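The resource grid arithmetic above can be checked with a few lines of Python. This is only a sketch: the 90% occupied-bandwidth ratio is an assumption generalized from the 10 MHz example in the text (it does not hold for the 1.4 MHz LTE bandwidth), and the function names are hypothetical.

```python
SUBCARRIER_SPACING_HZ = 15_000  # one RE spans 15 kHz in frequency
SUBCARRIERS_PER_RB = 12         # subcarriers per resource block
SYMBOLS_PER_RB = 7              # OFDM symbols per resource block (one slot)
OCCUPIED_FRACTION = 0.9         # ASSUMPTION: 9 MHz of data in a 10 MHz channel


def resource_blocks(bandwidth_hz: float) -> int:
    """Number of RBs that fit in the occupied part of the channel."""
    rb_bandwidth_hz = SUBCARRIERS_PER_RB * SUBCARRIER_SPACING_HZ  # 180 kHz
    occupied_hz = int(round(bandwidth_hz * OCCUPIED_FRACTION))
    return occupied_hz // rb_bandwidth_hz


def resource_elements(num_rbs: int) -> int:
    """Number of REs (modulated symbols) carried by num_rbs RBs in one slot."""
    return num_rbs * SUBCARRIERS_PER_RB * SYMBOLS_PER_RB


p = resource_blocks(10e6)
print(p, resource_elements(p))  # 50 4200
```

For the 10 MHz example, this gives P = 50 RBs and 50 × 84 = 4200 REs per slot.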
At the transmitter side, the Inverse Fast Fourier Transform (IFFT) is applied to convert the parallel frequency-domain symbols into serial samples to generate a composite time-domain signal. The composite transmitted signal $x(t)$ is further divided into samples. To prevent inter-symbol interference (ISI), we add a guard interval called the cyclic prefix (CP), which is a copy of the tail of the symbol.
The LTE-A network downlink transmission model is considered in Figure 3, where S is the eNB, R is the RN, and D is the UE when referred to Figure 1. In this system model, R is assumed to be a Layer 1 relay which implements the AF relaying protocol. The source node (S) communicates with the destination node (D) via direct transmission and a relay node (R). The source transmits an OFDM signal $x(t)$ within a period $T_1$ that is equal to a half subframe in the LTE-A standard. There are a total of $N$ transmitted symbols from S within a half subframe, i.e., $\mathbf{x} = [x[1]\ x[2]\ \dots\ x[N]]^T$. During the transmission time period 1, the received signal vectors at R and D are given by,
During $T_2$, the relay node R amplifies the received signal $y_{sr}$ and then retransmits it to D as given by, In the above equations, the vectors $\mathbf{y}$ and $\mathbf{n}$ are constructed from time samples, which can be written as, where $\rho = 1$ is the relay power, $P_{sr}$ is the received signal power at the relay node, and $\sigma^2_{sr}$ is the variance of the S-R link channel. In Equations (2) through (6), $n_{sr}$, $n_{sd}$, and $n_{rd}$ are i.i.d. additive white Gaussian noise (AWGN) signals with variance $\sigma^2$ in the S-R, S-D, and R-D links, respectively. The complex channel coefficients are denoted by $h_{sr}$, $h_{sd}$, and $h_{rd}$ in the S-R, S-D, and R-D links, respectively, and include the effects of path loss and fading in the wireless channel. The wireless channels over the S-R, S-D, and R-D links are assumed to be unchanged during the half subframe period.
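As an illustration of the two-hop AF model described above, the following sketch simulates one half subframe over flat channels. It assumes the usual linear forms $y_{sr} = h_{sr}x + n_{sr}$, $y_{sd} = h_{sd}x + n_{sd}$, and $y_{rd} = \beta h_{rd} y_{sr} + n_{rd}$. The amplification factor `beta`, which normalizes the relay output power to $\rho$, is an assumption of this sketch and not necessarily the expression used in this article; all function names are hypothetical.

```python
import cmath
import math
import random


def awgn(n, variance, rng):
    """n i.i.d. complex Gaussian noise samples with total variance `variance`."""
    s = math.sqrt(variance / 2.0)
    return [complex(rng.gauss(0.0, s), rng.gauss(0.0, s)) for _ in range(n)]


def qpsk(n, rng):
    """n unit-power QPSK symbols."""
    return [cmath.rect(1.0, math.pi / 4 + rng.randrange(4) * math.pi / 2)
            for _ in range(n)]


def af_relay_link(x, h_sr, h_sd, h_rd, noise_var, rng, rho=1.0):
    """Simulate the S-R, S-D, and R-D links for one block of symbols x."""
    n = len(x)
    n_sr, n_sd, n_rd = (awgn(n, noise_var, rng) for _ in range(3))
    # Time period 1: the source broadcasts to the relay and the destination.
    y_sr = [h_sr * xi + w for xi, w in zip(x, n_sr)]
    y_sd = [h_sd * xi + w for xi, w in zip(x, n_sd)]
    # ASSUMED gain: scale the relay output so its average power equals rho.
    p_x = sum(abs(xi) ** 2 for xi in x) / n
    beta = math.sqrt(rho / (abs(h_sr) ** 2 * p_x + noise_var))
    # Time period 2: the relay amplifies y_sr and forwards it to D.
    y_rd = [h_rd * beta * yi + w for yi, w in zip(y_sr, n_rd)]
    return y_sr, y_sd, y_rd, beta


rng = random.Random(0)
x = qpsk(8, rng)
y_sr, y_sd, y_rd, beta = af_relay_link(x, 0.8 + 0.3j, 1.0 + 0.0j, 0.5 - 0.2j, 0.1, rng)
print(len(y_sd), len(y_rd))
```

The channel coefficients are held constant over the block, matching the assumption that the channels do not change during the half subframe period.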
In a conventional cooperative network, the signals received from the source node and the relay node are combined at the receiving node using different diversity techniques to improve signal quality. However, in some cases, the relay node may manipulate the signal received from the source and intentionally send false information to the destination node. Such a relay is referred to as a malicious relay. Cooperative combining techniques ignore the malicious behavior of relay nodes and simply try to use the diversity provided by them. Given that the transmission from source S to destination D is received correctly and reliably, the critical issue is to determine whether the transmission from the relay is secure before combining its signal for diversity. The aim of this study is to detect whether the relay is secure under various types of relay attacks. Due to the nature of the communication, it is not possible to enumerate all the malicious behaviors that may arise from the relay, so we consider possible attack scenarios that may occur in the physical layer of LTE-A relay communication. Three different malicious relay attacks are considered, as described below [8].

B. GARBLING ATTACK MODEL (A1)
The malicious relay node generates a new random permutation of the modulated symbols in the constellation. The original symbols $y_{sr}[1], \dots, y_{sr}[N]$ are repositioned by a random permutation function $P$, which generates a sequence by drawing uniformly distributed random indices between 1 and $N$.

C. REGENERATIVE ATTACK MODEL (A2)
The relay node intentionally ignores the original signal $x$ and replaces it with a newly created signal $\tilde{x}$. The new signal generated at R has the same modulation properties as the original signal $x$.

D. FALSE DATA INJECTION ATTACK MODEL (A3)
In this attack type, R deliberately injects a random signal vector $a$ into the received signal $x$ from the source before transmitting it to D, as given by,
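The three attack models can be sketched as simple transformations of the symbol vector at the relay. The code below is illustrative only: the function names are hypothetical, the Gaussian distribution of the injected vector in A3 is an assumption (the text only states that it is random), and channel and amplification effects are left out.

```python
import random


def garbling_attack(y_sr, rng):
    """A1: forward the received symbols in a uniformly random order."""
    garbled = list(y_sr)
    rng.shuffle(garbled)
    return garbled


def regenerative_attack(n, constellation, rng):
    """A2: ignore the source signal and draw n new symbols from the
    same constellation, so the modulation properties are preserved."""
    return [rng.choice(constellation) for _ in range(n)]


def injection_attack(y_sr, scale, rng):
    """A3: add a random complex vector a to the received signal.
    ASSUMPTION: a is zero-mean Gaussian with per-component std `scale`."""
    a = [complex(rng.gauss(0.0, scale), rng.gauss(0.0, scale)) for _ in y_sr]
    return [yi + ai for yi, ai in zip(y_sr, a)]


rng = random.Random(1)
qpsk_points = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
received = [rng.choice(qpsk_points) for _ in range(6)]
print(garbling_attack(received, rng))
print(regenerative_attack(len(received), qpsk_points, rng))
print(injection_attack(received, 0.5, rng))
```

Note that A1 and A2 keep the set of constellation points intact, which is consistent with the later observation that amplitude statistics alone are not sufficient to separate secure and attacked relays.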

III. ATTACK DETECTION USING MACHINE LEARNING METHODS
In this section, we discuss how relay attack detection can be achieved through machine learning techniques with unsupervised learning methods. Attack detection with unsupervised learning algorithms is also called outlier detection in the literature [22]. Unsupervised learning identifies outlying data by using the score of each unlabeled nominal signal sample, whereas supervised learning needs labeled samples. In supervised learning, a training set $Z = \{(z_i, l_i)\}$ of tuples, where the class label $l_i$ marks sample $z_i$ as a secure or attack signal, is used to construct a model that predicts the label of a new sample. Supervised learning methods can achieve more successful results with a balanced training dataset; however, in real-world applications, it may not always be feasible to predict the attack models and find a sufficient number of examples for training the algorithms. When the imbalanced class distribution of attack models is considered, this requirement is particularly costly and affects the efficiency of supervised learning algorithms [27]. Unsupervised learning, in contrast, does not require balanced and labeled data sets. For this reason, given the nature of our problem of detecting malicious relays, it is faster and more independent of attack models to classify the signals received from the relay with unsupervised learning. In addition, outlier detection techniques may be better suited to cases in which attacks are difficult to model. Constructing the decision function with only nominal unlabeled samples significantly affects the effectiveness of the system against unknown attack models.

A. STATISTICAL PROPERTIES OF THE RECEIVED SIGNAL
We first examine the statistical distributions of the received signal at D. Illustrative probability density functions (PDFs) of the received signal amplitudes $|y_{sd}|$ and $|y_{rd}|$ for different relay attack types, secure relay, and source signals are shown in Figure 4. [...] the resultant distribution to define the normality in the data space [12]. The opposite perspective of the CDF is the complementary CDF (CCDF), which shows the probability that the signal exceeds a particular level, called the exceedance probability. As seen in Figure 5, the CCDFs of signal amplitudes have close exceedance probabilities and are too similar to be used in statistical attack detection. On the other hand, there are some differences in the exceedance probabilities of signal phases.
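For reference, an empirical CCDF of the kind compared in Figure 5 can be computed directly from received amplitude or phase samples. The sketch below uses made-up sample values purely for illustration; the function name is hypothetical.

```python
def ccdf(samples, levels):
    """Empirical exceedance probability P(sample > level) for each level."""
    n = len(samples)
    return [sum(1 for s in samples if s > level) / n for level in levels]


# Illustrative amplitudes only, not data from the simulations.
amplitudes = [0.2, 0.5, 0.9, 1.1, 1.4]
print(ccdf(amplitudes, [0.0, 1.0, 2.0]))  # [1.0, 0.4, 0.0]
```

Two signal sets whose CCDF curves nearly coincide at every level cannot be separated by a threshold on that statistic, which is the situation described above for the amplitudes.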
Conventional outlier detection approaches utilize the statistics of the received signals and may detect low-density signals as outliers. Therefore, they cannot detect malicious relays whose signal statistics are indistinguishable, as shown in Figures 4 and 5. In order to detect such relays, this study explores the feasibility of applying ML algorithms, which need properly defined features of signal characteristics. For this reason, we seek new approaches with multiple features and a robust solution for the attack detection problem [29]. In the next section, we define the signal features and propose using outlier detection algorithms to provide a possible solution to this problem. The presented detection models have many benefits compared with traditional probabilistic detection models: with our proposed detection model, we can create more complex decision functions and search for hidden relations between consecutive symbols.

B. FEATURE EXTRACTION
Let $y_{n,\eta\to D}$ be the $n$-th sample of the complex baseband received symbol over the $\eta \to D$ link at destination D in Figure 3, where $\eta \in \{S, R\}$. Transmitted OFDM signals from the source S are passed through the wireless channel, and therefore $y_{n,\eta\to D}$ can be treated as a modified OFDM signal. OFDM symbols at S are based upon digitally modulated symbols such as QPSK, 16-QAM, and 64-QAM in the LTE-A network. These symbols can be distinguished by their amplitudes and phases in the signal constellation. Also, phase differences or relative phases of the symbols provide information about modulated symbol patterns.
These modulated symbols can be defined as points in the constellation with different amplitudes and phases. Therefore, the amplitude, phase, and relative phase of $y_{n,\eta\to D}$ may provide a useful tool for capturing any unusual effect on this signal. For this reason, we define the following features as inputs to the outlier detection algorithms to detect relay attacks:
$$f^{(1)}_{n,\eta\to D} = |y_{n,\eta\to D}|,$$
$$f^{(2)}_{n,\eta\to D} = \angle y_{n,\eta\to D},$$
$$f^{(3)}_{n,\eta\to D} = \angle y_{n,\eta\to D} - \angle y_{n+1,\eta\to D},$$
where $|(\cdot)|$ and $\angle(\cdot)$ denote the amplitude and phase operators. The features defined in the above equations are depicted in Figure 6 for 16-QAM modulation. The feature vector of symbol $y_{n,\eta\to D}$ is then given by,
$$\mathbf{f}_{n,\eta\to D} = [f^{(1)}_{n,\eta\to D}\ \ f^{(2)}_{n,\eta\to D}\ \ f^{(3)}_{n,\eta\to D}].$$
For the $m$-th half subframe, i.e., slot, we have $N$ consecutive symbols during the learning phase, and therefore the feature vector can be written as,
$$\mathbf{z}_m = [\mathbf{f}_{1,\eta\to D}\ \ \mathbf{f}_{2,\eta\to D}\ \ \dots\ \ \mathbf{f}_{N,\eta\to D}]^T.$$
Since there are $M$ half subframes, the dataset matrix in the learning phase is constructed by merging the feature vectors, which is given by,
$$Z = [\mathbf{z}_1\ \ \mathbf{z}_2\ \ \dots\ \ \mathbf{z}_M],$$
where the size of $Z$ is $L \times M$ and $L = 3N$.
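A minimal sketch of this feature extraction is given below. Two points are assumptions of the sketch rather than statements from the text: the relative phase of the last symbol in a slot is computed against the first symbol (wrap-around) so that every slot yields exactly 3N features, and the phase difference is left unwrapped. The dataset is stored with one slot per row, i.e., the transpose of the L×M matrix Z described above.

```python
import cmath


def symbol_features(y):
    """Return [amplitude, phase, relative phase] for each symbol in y."""
    n = len(y)
    features = []
    for i in range(n):
        nxt = y[(i + 1) % n]  # ASSUMPTION: the last symbol wraps to the first
        amplitude = abs(y[i])
        phase = cmath.phase(y[i])
        relative_phase = phase - cmath.phase(nxt)  # left unwrapped
        features.append([amplitude, phase, relative_phase])
    return features


def slot_vector(y):
    """Flatten the per-symbol features of one slot into a vector of length 3N."""
    return [value for triple in symbol_features(y) for value in triple]


def dataset_matrix(slots):
    """One feature vector per slot: M rows of length L = 3N."""
    return [slot_vector(y) for y in slots]


slot = [1 + 0j, 1j, -1 + 0j]
print(slot_vector(slot))
print(len(dataset_matrix([slot, slot])), len(slot_vector(slot)))
```

Storing one slot per row matches the sample-per-row layout that most machine learning libraries expect.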

C. ONE CLASS SUPPORT VECTOR MACHINE
OCSVM, introduced by Schölkopf et al. [23], is a special version of the SVM algorithm for outlier detection. The hypersphere or decision function in OCSVM is constructed for a small region wherein positive target samples are labeled as +1.
The decision boundary is the smallest hypersphere that encloses most of the training data mapped by a kernel function. However, data points are allowed to lie outside the boundary through the slack variables $\xi_i$, $i = 1, \dots, M$.
The number of data points located outside the boundary is controlled by a penalty factor of $1/(vM)$. Choosing a small penalty factor causes more data points to be located outside the boundary. Figure 7 illustrates a non-linear mapping of the original data samples with the radial basis function (RBF) kernel in OCSVM [20]. Data samples are separated from the origin with maximum margin in the kernel space until the smallest hypersphere that contains all positive samples is found.
The OCSVM optimization model for our problem definition is formulated as follows:
$$\min_{w,\,\xi,\,\rho}\ \frac{1}{2}\|w\|^2 + \frac{1}{vM}\sum_{m=1}^{M}\xi_m - \rho \quad \text{subject to} \quad w \cdot \phi(z_m) \ge \rho - \xi_m,\ \ \xi_m \ge 0,\ \ m = 1, \dots, M,$$
where $\phi$ is the non-linear mapping function (kernel) of the feature vectors in the training set, $w$ represents the weight vector in the model, $\xi_m$ is the slack variable for regulating weights, $\rho$ represents the maximum deviation from the limit, and $v \in (0,1]$ specifies an upper limit for the fraction of abnormal values and a lower limit for the fraction of support vectors. The distance and center of the data samples $z_m$ are controlled by $\xi_m$ in the optimization model. In the test phase of the model, we determine whether the sample vector $z_j$ falls outside of the hyperplane via a decision function given by,
$$f(z_j) = \operatorname{sgn}\Big(\sum_{m=1}^{M}\mu_m K(z_m, z_j) - \rho\Big),$$
where $\mu_m$ is the Lagrange multiplier and $K(z_m, z_j)$ is the RBF given by,
$$K(z_m, z_j) = \exp\big(-\gamma\,\|z_m - z_j\|^2\big),$$
where $\|\cdot\|$ is the Euclidean norm operator and $\gamma$ is the ''spread'' parameter of the kernel, which sets the width of the bell-shaped curve.
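The test-phase decision rule can be illustrated with a few lines of code. The sketch assumes the standard one-class SVM form, in which a sample is accepted when the kernel-weighted sum over the support vectors, $\sum_m \mu_m K(z_m, z_j)$, is at least $\rho$, with the RBF kernel $K(z_m, z_j) = \exp(-\gamma\|z_m - z_j\|^2)$. The support vectors, multipliers, and offset below are hand-picked hypothetical values, not the result of solving the optimization problem.

```python
import math


def rbf_kernel(z_m, z_j, gamma):
    """RBF kernel exp(-gamma * ||z_m - z_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z_m, z_j))
    return math.exp(-gamma * sq_dist)


def ocsvm_decision(z_j, support_vectors, multipliers, rho, gamma):
    """Return +1 (nominal) or -1 (outlier) for the test vector z_j."""
    score = sum(mu * rbf_kernel(z_m, z_j, gamma)
                for z_m, mu in zip(support_vectors, multipliers)) - rho
    return 1 if score >= 0 else -1


# HYPOTHETICAL model parameters, chosen by hand for illustration.
support_vectors = [[0.0, 0.0], [1.0, 0.0]]
multipliers = [0.5, 0.5]
rho, gamma = 0.5, 1.0

print(ocsvm_decision([0.5, 0.0], support_vectors, multipliers, rho, gamma))  # 1
print(ocsvm_decision([5.0, 5.0], support_vectors, multipliers, rho, gamma))  # -1
```

A larger $\gamma$ narrows the kernel, so the accepted region shrinks around the support vectors and more test samples are flagged as outliers.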

D. LOCAL OUTLIER FACTOR
The LOF method estimates the local density deviation of a given data sample with respect to its neighbors. LOF is a score-based method in which samples whose densities are significantly different from those of their neighbors are marked as outliers, i.e., LOF ≈ 1 indicates no outlier and LOF ≫ 1 indicates an outlier. Local density cells are formed by reaching out from each sample until $k$ neighbor samples in the training data set are captured. The decision function in LOF is given by [24], where $v_k$ is the average of local density cells, and $z_m \in Z$, $z_j \in Z$ are the data vectors in (18). LOF searches for the neighbors of a point to find its density and then compares it with the density of the other points. While a small $k$ has a more local focus by looking only at nearby points, a large $k$ can miss local outliers. If there is too much noise in the data, inaccurate results can be obtained with a small $k$.
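The LOF score can be sketched in plain Python following the usual definition via reachability distance and local reachability density. This is an illustrative simplification, not the implementation used in the simulations: it takes exactly k neighbors per point (ties at the k-distance are not expanded) and assumes all points are distinct.

```python
import math


def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_outlier_factors(points, k):
    """LOF score of every point with respect to its k nearest neighbors."""
    n = len(points)
    # (distance, index) of the k nearest neighbors of each point.
    neighbours = []
    for i in range(n):
        ranked = sorted((_dist(points[i], points[j]), j)
                        for j in range(n) if j != i)
        neighbours.append(ranked[:k])
    k_distance = [nb[-1][0] for nb in neighbours]

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors relative to the point's own density.
    return [sum(lrd[j] for _, j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
            for i in range(n)]


points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
print(local_outlier_factors(points, k=2))
```

In this toy example, the four clustered points score close to 1, while the isolated point at (10, 10) scores far above 1.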

E. ISOLATION FOREST
The iForest algorithm converts the dataset into an ensemble of subtrees. It recursively isolates random individual instances by creating binary tree structures. The average path lengths of these random trees are the normality measure in the decision function. For our problem, the score of the iForest algorithm is given by [25],
$$s(z_j, Z) = 2^{-E(h(z_j))/c(Z)},$$
where $E(h(z_j))$ denotes the average of the path length $h(z_j)$ over the trees, and $c(Z)$ is the average path length for the data set $Z$. A data vector $z_j$ with a score greater than 0.5 is marked as an outlier. The most important parameter of this algorithm is the number of subtrees. If the number of subtrees is too high, the model overfits and the number of false alarms increases. A large number of subtrees also increases the execution time; therefore, there must be a compromise between detection performance and execution time, and the subtree parameter is chosen according to peak performance.
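The score normalization can be made concrete as follows. The sketch uses the normalization constant from the original isolation forest formulation, $c(n) = 2H(n-1) - 2(n-1)/n$ with $H(i) \approx \ln(i) + 0.5772$, where $n$ is the number of samples used to build each tree. The mean path length is taken as given (tree construction is omitted), and the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n): average path length of an unsuccessful search in a binary
    search tree built from n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s = 2^(-E(h)/c(n)); values above 0.5 indicate an outlier."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


n = 256
print(anomaly_score(average_path_length(n), n))  # 0.5
print(anomaly_score(2.0, n) > 0.5)               # True
```

A sample that is isolated after only a few splits has a short mean path length and therefore a score close to 1, while a sample with an average path length scores exactly 0.5.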

IV. SIMULATION SETUP
Simulations are performed to evaluate the performance of the algorithms. We investigate the performance of the proposed attack detection scheme by constructing the signal model in the LTE-Advanced system. The signal model is generated in MATLAB and used as an input to the outlier detection algorithms, which are run using the ''Scikit Learn'' machine learning library [30]. After creating the signal model in (2) through (8), we extract features from consecutive symbols as described by (13) through (18). During the training phase of the algorithms, the data set is generated for a 5 ms duration (half frame period) in the LTE-A system model. Attack scenarios such as garbling (A1), regenerative (A2), and false data injection (A3) type attacks are created as described in Section II. Monte Carlo simulations are performed 20 times for the same signal setup conditions (i.e., SNR, modulation type, bandwidth) and the same selected algorithm. In each run, the data set is obtained from randomly generated information bits and channel conditions. Properties of the generated data set and algorithm parameters are summarized in Table 2. The parameters of the OCSVM, LOF, and iForest algorithms have been selected to prevent overfitting. For OCSVM, we employ the RBF kernel and the parameter $v$ is set to 0.3, which is an upper bound on the fraction of margin errors and a lower

V. PERFORMANCE EVALUATION
For the performance evaluation of the outlier detection algorithms in our problem, we are concerned not only with the detection of relay attacks but also with the correct identification of secure relay cases [31]. We consider precision, accuracy, and AUC as the performance evaluation metrics. The confusion matrix for attack and secure relay cases is defined in Table 3.
Precision is the proportion of correctly classified attack cases out of all detected attack cases, which is given by,
$$\text{Precision} = \frac{TP}{TP + FP},$$
where $TP$, $FP$, $TN$, and $FN$ denote the numbers of true positives, false positives, true negatives, and false negatives in the confusion matrix. In our problem, detection of an attacked relay is considered as outlier detection. On the other hand, accuracy measures the correctly classified attack and secure cases among all cases,
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
The confusion matrix can also be used to derive evaluation measures of the diagnostic ability of one-class classification, such as the receiver operating characteristic (ROC) curve. The ROC curve can be derived from the CDF of the true positive rate (TPR) with respect to the false positive rate (FPR). An ROC curve summarizes all of the confusion matrices that each threshold produces and shows TPR vs. FPR at different classification thresholds. By lowering the classification threshold, more items are classified as positive, thus increasing both false positives and true positives. TPR and FPR are defined as follows,
$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}.$$
TPR is also referred to as recall. The AUC is the entire two-dimensional area underneath the ROC curve, i.e., it provides an aggregate measure of performance across all possible classification thresholds. The statistical meaning of AUC is the probability that the model ranks a random positive sample more highly than a random negative sample. It therefore measures how well predictions are ranked, rather than their absolute values. AUC can help us decide which classification method is better.
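These measures are straightforward to compute. The sketch below derives precision, accuracy, TPR, and FPR from confusion-matrix counts and estimates AUC directly from its ranking interpretation, i.e., the fraction of (attack, secure) pairs in which the attack sample receives the higher outlier score, with ties counted as one half. The counts and scores are made-up illustrative numbers, not results from this study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Precision, accuracy, TPR (recall), and FPR from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


def auc_from_scores(attack_scores, secure_scores):
    """Probability that a random attack sample outranks a random secure one."""
    wins = 0.0
    for a in attack_scores:
        for s in secure_scores:
            if a > s:
                wins += 1.0
            elif a == s:
                wins += 0.5
    return wins / (len(attack_scores) * len(secure_scores))


# Illustrative values only.
print(confusion_metrics(tp=90, fp=10, tn=80, fn=20))
print(auc_from_scores([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

In the second call, eight of the nine (attack, secure) pairs are ranked correctly, so the estimated AUC is 8/9.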
Below, we present the results regarding the performances of aforementioned OCSVM, LOF, and iForest algorithms in terms of precision, accuracy and AUC measures with regard to SNR level, modulation type and data size (M ).

A. EFFECT OF ATTACK TYPES, MODULATION TYPES, AND ALLOCATED RESOURCE
When the performance of the algorithms is examined with regard to attack types, the positive effect of an increase in SNR level is observed. Even under low SNR conditions, all algorithms achieve at least a 90% precision level (Figures 12, 14, and 16). The effect of SNR on precision is more pronounced for the secure relay detection case than for the attack detection case. From Figure 10, we observe that the precision level of iForest varies between 0.4 and 0.8, while that of OCSVM and LOF varies between 0.5 and 1.
In secure relay detection with varying SNR level, LOF and OCSVM exhibit nearly the same precision performance, and the precision of iForest is lower than that of these algorithms. Also, the effect of modulation type on the precision of the algorithms for secure relay detection is insignificant. We see that a 70% precision level can be attained at nearly 10 dB SNR with OCSVM and LOF, and at 22 dB SNR with iForest.
Precision represents the performance of detecting only relays with attacks (true positives). On the other hand, accuracy measures the correctness of detecting both relay attacks and secure relays (true positives and true negatives). Hence, in order to assess the overall detection performance, we have also obtained the accuracy of the algorithms.
The accuracy values for secure relay detection in Figure 11 are close to the precision values in Figure 10. However, the accuracy of the iForest algorithm is worse than its precision, e.g., at 20 dB SNR its precision and accuracy are about 70% and 60%, respectively. Overall, the effect of SNR variation on accuracy and precision is remarkable in secure relay detection.
The performance results of the algorithms in case of relay attack are presented in Figures 12 through 17. Comparing the precision and accuracy results of the garbling attack model (A1) presented in Figure 12 and Figure 13, it is seen that accuracy is lower than precision, especially under low SNR conditions (10 dB and below), for all algorithms. This is because under low SNR conditions a secure relay may be flagged as an attack (false alarm), and this affects detection accuracy. However, even with these false alarms, an accuracy of 80% and above is achieved. Considering only precision, over 90% performance can be achieved by all algorithms for the A1 attack model. In A1 and A2 attack detection, OCSVM is more affected than LOF and iForest up to the 15 dB SNR level and is less precise. With regard to detection of the regenerative attack (A2), as shown in Figure 14, while the LOF and iForest algorithms have precision above 95% for all SNR values, OCSVM achieves this only for SNR greater than approximately 20 dB. When the accuracy of the algorithms is examined for detection of the regenerative (A2) attack in Figure 15, the LOF algorithm performs above 95% for all SNR values, whereas OCSVM and iForest achieve the same performance only for SNR values greater than 20 dB.
For the detection of the false data injection attack (A3), the precision and accuracy of the algorithms are presented in Figure 16 and Figure 17, respectively. All algorithms have precision and accuracy above 95% and 90%, respectively. In general, all algorithms show more than 90% precision in A1, A2, and A3 attack detection. In terms of accuracy, it is seen that the iForest algorithm is affected by low SNR levels, i.e., 10 dB and below, especially when detecting A1 and A2 attacks.
For attack detection, LOF is able to estimate density more accurately than OCSVM and iForest when the number of neighbors is high enough to detect a large number of outliers. High accuracy is also observed for OCSVM, which maps the data to a high-dimensional hypersphere. iForest has good average performance even in high dimensionality compared with LOF and OCSVM, but it is less accurate due to a high false alarm rate caused by overfitting. For secure relay detection, OCSVM is a good candidate in our benchmark, as it performs well with a small amount of data without requiring significant tuning.
When the effect of the modulation type (QPSK, 16-QAM, 64-QAM) on the performance of the algorithms is examined, the accuracy is not affected considerably for the same attack type, as indicated in Table 5. This can be regarded as an indicator of the robustness of the attack detection model proposed in this study.
In the presented results, the parameters of the algorithms specified in Table 2 were used. However, large values of the parameters v, k, and base estimator cause overfitting, which increases the false alarm rate for all three algorithms and decreases attack detection accuracy. It is therefore important to select the parameters of the one-class outlier detection algorithms at the values that yield the most accurate results.
Another parameter used to evaluate the performance of the system is the data size M. The results for the effect of data size are summarized in Table 6. No remarkable effect of data size on attack detection is observed. The data size matters mainly in the secure relay detection case: as M is increased from 6 to 100, the accuracy of the algorithms increases.
The AUC performance of the algorithms is presented in Figure 18. The AUC of all algorithms is above 0.9 for the attack detection cases. For all attack types, the LOF algorithm, with AUC values of 0.95 and above, is superior to the other algorithms.
Regarding secure relay detection, the AUC values of the LOF, OCSVM, and iForest algorithms at the 15 dB SNR level are 0.81, 0.78, and 0.58, respectively. The iForest algorithm is therefore not very successful in detecting secure relays. The LOF algorithm has the largest AUC, and the feasible results obtained with the OCSVM algorithm indicate that these two algorithms can be used to detect malicious and secure relays even under low SNR levels around 10 dB.
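For reference, the AUC can be interpreted as the probability that a randomly chosen attack sample receives a higher outlier score than a randomly chosen secure sample. The following minimal sketch computes it directly from that pairwise definition. The function name, the label convention (1 = attack, 0 = secure), and the toy scores are illustrative assumptions.

```python
def auc_from_scores(scores, labels):
    """AUC via pairwise comparison of attack (1) and secure (0) outlier scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # attack sample ranked above secure sample
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical outlier scores: higher means "more likely an attack".
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
print(auc_from_scores(toy_scores, toy_labels))
```

Under this interpretation, an AUC of 0.5 corresponds to scores that do not separate the two classes at all, which is why a value such as 0.58 indicates weak separation of secure relays.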
In Table 4, the performance of the outlier detection methods is compared with that of conventional and supervised learning methods. The conventional methods, Least-Squares Anomaly detection (LSA) [32] and Probabilistic Principal Component Analysis (PPCA) [33], fail to distinguish secure and attack samples and produce a high false alarm rate. The accuracy of these methods is 0.5 because they classify all secure relay cases as attack cases. As the statistical properties described earlier indicate, the data set is not distinguishable with such conventional approaches. For malicious relay detection, the accuracy of the supervised learning methods, Neural Networks (NN) [34] and Support Vector Machine for Regression (SMOreg) [35], trained with a labeled and predefined dataset, remains below that of the unsupervised methods.
Based on the results presented above, two concluding remarks can be made about the malicious relay detection problem in the LTE-A network. The first observation concerns the comparison of the unsupervised learning (outlier detection) methods among themselves. High accuracy is observed for OCSVM and LOF. However, because overfitting causes a high false alarm rate, iForest has only average accuracy with high-dimensional data. LOF has higher accuracy and precision than OCSVM and iForest in detecting relay attacks, but its performance depends on selecting a sufficiently large number of neighbors. Thus, OCSVM is a good candidate for the malicious relay detection problem, since it performs well with a small amount of data and does not require significant tuning in our benchmark. The second observation concerns the comparison of the studied unsupervised learning methods with supervised learning and conventional methods. When the average detection accuracy of these methods is considered, LOF and OCSVM outperform the supervised learning methods (NN, SMOreg) and the conventional methods. The reason is data sparsity in the training phase, which is an important challenge for the performance comparison of the proposed approach.

B. COMPLEXITY COMPARISON
Another aspect of performance assessment when applying the algorithms is complexity analysis. In this study, the time and hardware requirements of the machine learning algorithms applied to attack detection are analyzed. The experiments are performed on a computer with an Intel(R) Core(TM) i7-7500U CPU at 2.90 GHz and 8 GB of RAM.
The training time for varying data size (M) is obtained for all three algorithms applied in this study and plotted in Figure 19. As expected, a larger data size leads to a longer training time, and OCSVM is observed to be the fastest algorithm in terms of training time. The time complexity of the algorithms during the prediction phase, in which the decision is made whether the test data is attack or secure, is shown in Figure 20. The prediction time is observed to be approximately the same for all three algorithms. The memory usage of the algorithms during the training phase is also analyzed with respect to varying data size and is given in Figure 21. As the data size increases, the amount of memory used by the algorithms increases, but even in the worst case it does not exceed 1 MB. The OCSVM algorithm consumes the most memory. Since the memory usage of the algorithms in the prediction phase is small enough to be ignored, it is not shown in the plots.
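Measurements of this kind can be reproduced in outline with standard profiling tools. The sketch below records the wall-clock time and peak memory of a single training call for several data sizes. It is not the measurement code used in this study: `toy_fit` is a hypothetical stand-in for a detector's training step, `profile_call` is an illustrative helper, and the intermediate data sizes are arbitrary choices between 6 and 100.

```python
import time
import tracemalloc


def profile_call(fn, *args):
    """Return (result, elapsed_seconds, peak_bytes) for one call of fn(*args)."""
    tracemalloc.start()                      # begin tracking Python allocations
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, elapsed, peak


def toy_fit(samples):
    # Hypothetical stand-in for training: stores all pairwise distances.
    return [[abs(a - b) for b in samples] for a in samples]


for m in (6, 25, 50, 100):                   # data sizes M (illustrative values)
    data = [float(i) for i in range(m)]
    _, seconds, peak = profile_call(toy_fit, data)
    print(f"M={m:3d}  train_time={seconds:.6f} s  peak_mem={peak / 1024:.1f} KiB")
```

Note that memory tracing itself slows execution, so in practice time and memory are better measured in separate runs and averaged over several repetitions.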
Considering the memory consumption of the algorithms, applying them even on small-capacity devices appears feasible. In terms of the practicability of the relay attack detection system, the memory and time requirements of the detection algorithms imply that they are applicable to small-capacity user equipment. As illustrated in Figures 19 and 21, the training data for 6 RBs with 1.4 MHz bandwidth requires little memory and time to train the model.
The complexity of the algorithms in terms of training and prediction times is given in Figures 19 and 20. In terms of training time, which is the dominant factor in the time complexity of the algorithms and is relatively larger than the prediction time, OCSVM outperforms the other methods. The other unsupervised methods, i.e., LOF and iForest, have smaller training times than supervised methods, i.e., NN. Unsupervised methods have larger prediction times than supervised and conventional methods. The memory usage of supervised and conventional methods is noticeably worse than that of unsupervised methods, as seen in Figure 21. The memory requirement of outlier detection methods is much lower (nearly 10 times) than that of conventional methods. The iForest algorithm has exponential complexity, which is greater than the complexity of the LOF algorithm but less than that of the OCSVM algorithm. The PPCA method has the largest complexity. LSA and NN have similar complexity, but it is still higher than that of the outlier detection methods. Table 7 summarizes the dependence of the training time, prediction time, and memory usage of the algorithms on the data size M and the feature size L. When the number of features L is increased, its effect on training time, prediction time, and memory usage remains low for all three algorithms. On the other hand, the effect of data size varies depending on the algorithm.
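One simple way to characterize how a measured cost depends on the data size M is to fit a power law, cost ≈ c·M^a, by least squares on a log-log scale; the exponent a then summarizes the dependence. The sketch below illustrates this under the assumption that such a power-law model is adequate. The function name and the sample timings are hypothetical and are not measurements from this study.

```python
import math


def growth_exponent(sizes, costs):
    """Least-squares slope of log(cost) versus log(size), i.e. a in cost ~ c * size**a."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Hypothetical timings that quadruple whenever the data size doubles.
sizes = [10, 20, 40]
timings = [1.0, 4.0, 16.0]
print(growth_exponent(sizes, timings))
```

An exponent near 1 would indicate roughly linear growth in M and an exponent near 2 roughly quadratic growth; the same fit can be applied to the feature size L or to memory usage.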

VI. CONCLUSION
This study has addressed the problem of detecting malicious relays in cooperative LTE-A networks. The attacks were detected without applying demodulation and channel estimation to the received signal in the physical layer. Detecting an attack in the first step for cooperative relay systems, i.e., before further processing the received relay signal at the destination, provides a significant gain in wireless communication systems. A physical layer detection approach based on unsupervised machine learning techniques has been applied to this problem. More specifically, we have considered outlier detection as an unsupervised learning method. We have proposed using a one-class approach with the OCSVM, LOF, and iForest algorithms, which can distinguish between malicious and secure relays. The effects of SNR, modulation type, and data size on the precision, accuracy, and AUC of the algorithms have been investigated for relay attacks such as garbling, false data injection, and regenerative attacks, as well as for secure relay cases. This study has shown that malicious relay attacks can be detected with the one-class method with high precision and accuracy even under low SNR conditions. The AUC values of all algorithms were above 0.9 for the attack detection cases. For all attack types, the performance of the LOF algorithm has been found to be superior to that of the OCSVM and iForest algorithms. The accuracy of the algorithms in detecting malicious relays has not been affected considerably by changing the modulation type (QPSK, 16-QAM, 64-QAM) in the LTE-A network. Also, the effect of changing the data size M on the attack detection performance was observed to be insignificant. This study has shown that the minimum resources in the training phase, i.e., the lowest number of RBs that can be carried in a frame with the lowest bandwidth in LTE-A systems, can be used to detect malicious relay attacks effectively.
The results also imply that applying these algorithms even on small-capacity devices is feasible.
The major contribution of this work is the demonstration of the feasibility of applying unsupervised learning with an outlier detection approach for effective detection of malicious relay nodes in the LTE-A network. We have also shown that the baseband signal characteristics of malicious and secure relays in the physical layer, which are quite difficult to distinguish using conventional statistical approaches, can be distinguished by proper selection of features in ML-based approaches.