Robust Automatic Modulation Recognition Through Joint Contribution of Hand-Crafted and Contextual Features

Automatic modulation recognition (AMR) has become increasingly important in the field of signal processing, especially with the advancement of intelligent communication systems. Deep learning (DL) technologies have been incorporated into the AMR field and have outperformed conventional AMR methods. The robustness of DL-based AMR methods under varying noise regimes is one of the major concerns for the widespread utilization of this technology. Furthermore, most existing works have neglected the contribution of hand-crafted features (HCFs) in boosting the classification performance of DL-based AMR methods. To address these technical challenges, a novel and robust DL-based AMR method is proposed that leverages the benefits of both contextual features (CFs) and HCFs for a specific range of signal-to-noise ratio (SNR). A novel feature selection algorithm is also proposed to search for the optimal sets of HCFs in order to reduce the dimensions of the feature vectors without losing any important and relevant features. Simulation studies are performed to investigate the feasibility of the proposed method in classifying 11 types of modulation schemes. Extensive performance analyses revealed the superiority of the proposed method over the baseline method in terms of classification performance, as well as the excellent capability of the proposed feature selection algorithm in determining an optimal subset of HCFs.


I. INTRODUCTION
For the communication systems widely used in both military and civilian applications, radio signals are encoded by predefined adaptive modulation schemes chosen with respect to the specification of the transmission channel. The receiver needs to correctly identify the type of modulation scheme adopted in order to ensure successful demodulation. Most often, in blind detection, the receiver has no prior information about the modulation scheme of the received signals. Automatic modulation recognition (AMR) is a popular technique used to provide blind recognition of the modulation scheme. In the literature, existing AMR techniques are designed and implemented based on two main approaches: (1) likelihood-based (LB) and (2) feature-based (FB) [1]. Despite being able to achieve the optimum recognition rate, most LB approaches suffer from technical drawbacks such as high computational complexity and strong dependency on prior information about the received signal [1]. In contrast to LB approaches, FB approaches are considered the more practical solutions due to their ease of implementation in real-life scenarios and their ability to produce sub-optimal solutions. It is notable that the performance of FB approaches relies heavily on the hand-crafted features (HCFs) manually extracted from the received signals. Different combinations of HCFs are also found to play crucial roles in recognizing different types of modulation schemes. The incorporation of more advanced feature-learning methods for deep feature extraction is necessary to develop more robust LB techniques for solving AMR problems.
The associate editor coordinating the review of this manuscript and approving it for publication was Xiaofan He.
Recently, different deep learning (DL) models such as the convolutional neural network (CNN), recurrent neural network (RNN) and deep neural network (DNN) have shown promising advantages in tackling key emerging issues in the research fields of computer vision, healthcare, robotics, internet of things, and communication systems. For instance, CNN is widely used as an image recognition method to solve multiple-object detection problems due to its ability to automatically extract multiple levels of features from its inputs. Motivated by the successes of DL in different areas, numerous DL-based AMR techniques have been proposed based on CNN, RNN, DNN and other network architectures in recent years, and they have been observed to achieve significant results [1].
Jdid et al. [1] presented a comprehensive survey on existing AMR methods developed with different machine learning (ML) and DL models. They compared these ML-based and DL-based AMR methods in terms of their network structures, training and testing hyperparameters, noise conditions, modulation pools and performances. O'Shea and West [2] employed GNU Radio to generate a synthetic dataset known as the RadioML 2016.10A that contains eleven types of digital and analog modulations. A new AMR method was developed by O'Shea and Corgan [3] by using CNN and its performance was investigated with the dataset proposed in [2]. O'Shea et al. [4] amended the tool described in [2] to generate RadioML 2018.01A and more details of this new dataset will be summarized in Section II-B. Their previous works [3], [5], [6] with adaptation of CNNs for AMR were extended to solve the new dataset, where in-depth investigation in terms of the design parameters, channel impairment and training dataset parameters were studied through various simulations including over-the-air measurement of AMR performance.
Another CNN-based AMR method was proposed by Zhang et al. [7] by employing the Short-Time Fourier Transform (STFT) to convert the input signal spectrograms into an image representation. Similarly, Sun et al. [8] adopted an image dataset with 10 different modulation schemes to build a CNN-based AMR method from a popular deep learning model known as VGG-16 [9]. Zhang et al. [10] applied both the Wigner-Ville and Born-Jordan distributions to convert the received signals into two types of images. These converted images were then fed into two CNNs for feature extraction, and the extracted features were fused with the selected HCFs in order to solve the classification problem.
The work reported by Peng et al. [11] exploited constellation diagrams together with a popular CNN model called AlexNet [12] for AMR. Similarly, constellation diagrams were employed as the inputs of the feature extractor in [13], whereas a support vector machine (SVM) was used as the classifier for AMR. Given that the quality of datasets is crucial to the performance of AMR, pre-processing techniques such as data augmentation were employed by Tang et al. [14] to ensure that sufficient data were available to construct the datasets required for the training and testing of DL-based AMR methods. In particular, constellation diagrams were converted into contour stellar images in [14] to obtain more color features than those of [11]. Several issues encountered in the training process, such as the divergence of the generator, overfitting of the discriminator and mode collapse, can degrade the overall performance of traditional generative adversarial nets (GANs) [15]. Numerous measures were proposed to address the poor performance issues of GANs, and simulation studies revealed that the classification accuracies can be improved through the extension of datasets.
Most existing FB-AMR approaches proposed in earlier work [16] were designed based on the extraction of HCFs. Meanwhile, the DL-based AMR approaches proposed in the recent literature rely heavily on deep neural network architectures for feature extraction [17]. A large body of previous research in image processing has shown the advantages of concatenating contextual features (CFs) and HCFs [18]-[20]. In particular, a proper combination of HCFs with CFs is able to boost performance through a more diverse representation of features. To the best of the authors' knowledge, limited investigations have been performed to search for the optimal combinations of CFs and HCFs. This is because the quality of CFs is governed by the efficiency and abundance of the datasets used. Meanwhile, the value of each HCF may change with different signal-to-noise ratio (SNR) values. It is also noteworthy that some HCFs tend to have the same values for different types of modulation. These undesirable characteristics of HCFs tend to jeopardize the accuracy of classifiers through the distortion of their hyperplanes. In order to address these technical drawbacks, it is crucial to design an efficient feature selection algorithm that is able to build a robust feature set.
As mentioned earlier, the performance of DL-based AMR methods is governed by the quality of the datasets, hence the latter should be constructed carefully to include a variety of real-life effects. While a larger number of training examples can lead to better performance of DL-based AMR methods, the training processes involved tend to be time-consuming. This issue may be resolved by splitting the original large dataset into multiple subsets of data for the training process. The hyperplanes of classifiers also tend to be distorted when dealing with datasets that have very wide ranges of SNR. Although training a classifier over a wide range of SNR can eliminate the need for SNR estimation, CNNs with highly complex network structures are required to extract the relevant features. The required memory storage of classifiers could increase significantly due to the tremendous number of training parameters involved in their feature extractors. In this paper, a novel DL-based AMR method is proposed to tackle the aforementioned issues, where the original SNR range is split into multiple ranges in order to train well-performing classifiers with lower complexity and smaller storage requirements.
The main purpose of AMR is to configure the parameters and modulation schemes of received signals in order to achieve a successful demodulation process. Greater accuracy of the AMR process is expected to be achieved through the incorporation of DL due to its excellent feature learning capability. This paper is devoted to studying the feasibility of combining a robust set of HCFs with CFs to boost the classification performance. Furthermore, this work also aims to investigate the differences in classification accuracy between simple CNNs trained with smaller SNR ranges and a complex CNN trained with a wider SNR range. For the former case, a set of HCFs is first constructed and the original SNR range is split into two smaller SNR ranges based on a predefined threshold. Then, these two sets of HCFs are concatenated with the CFs learned by the two CNNs deployed in their respective SNR ranges.
The main contributions of this paper are summarized as follows:
• Introduce new compositions of High-Order Cumulants (HOCs) that can be used to provide a good complement for CFs in order to boost the classification performance.
• Propose a novel algorithm that is able to determine the optimal threshold and most relevant features for splitting the wide-range SNR.
• Introduce a novel criterion known as the Classification Confusion Index (CCI), which is the core of the proposed algorithm that selects an optimal subset of HCFs to be used for solving the classification tasks.
• Propose a novel DL-based AMR method that is able to achieve high classification accuracy without compromising the desirable characteristics of low computational complexity and storage requirement.
The findings of this paper are also expected to provide constructive insights for the following key questions:
• Is it efficient to split a wide range of SNR into two smaller ranges?
• Can a narrower range of SNR reduce the complexity and depth of the CNN architecture in learning more meaningful CFs?
• Is the presence of HCFs able to boost the classification performance of a DL-based AMR method?
This work focuses on employing simple CNN architectures for learning CFs and concatenating these features with a set of HCFs in a wide range of SNR that is split into two different ranges. The outline of this paper is summarized as follows. The background of AMR and the specifications of the RadioML 2018.01A dataset considered in the current study are presented in Section II. Other required information such as the signal parameters and channel characteristics is also covered in this section. Section III explains the algorithm used to extract and select an optimal subset of HCFs, whereas the extraction process of CFs using CNNs is elaborated in Section IV. The overall mechanisms of the proposed DL-based AMR method are then explained in Section V and the performance comparisons with baseline methods are presented in Section VI. Finally, the concluding remarks of the current study are presented in Section VII.

II. AMR BACKGROUND
AMR is an essential process used to identify the type of modulation scheme adopted by received signals before recovering the received information through the demodulation process. In this section, a brief background of FB-AMR is first presented. Then, we wrap up this section by providing an overview of the most popular signal dataset in the AMR field, RadioML 2018.01A [4].

A. FB-AMR BACKGROUND
Fig. 1 shows the general block diagram of conventional FB-AMR methods, which consist of three main phases:
• Pre-processing phase: Several variables such as Carrier Frequency Offset (CFO), baud rate, Phase Offset (PO), SNR and timing offsets can be obtained from the input signals via different pre-processing techniques. The quality of the received signal can be enhanced in this phase before proceeding to the feature extraction phase.
• Feature extraction phase: Feature extraction is first performed in this phase to obtain the HCFs or CFs. These extracted features can also be categorized as instantaneous time features, wavelet features and statistical features. A feature selection algorithm is subsequently used to choose the most relevant features in order to reduce the dimension of the feature vector.
• Classification phase: This phase is responsible for deciding the type of modulation scheme adopted by the received signal with one or multiple classifiers that can be broadly categorized as traditional classifiers (e.g., classification tree), ML-based classifiers (e.g., SVM) and DL-based classifiers (e.g., CNN).

B. SIGNALS DATASET RadioML 2018.01A
The RadioML 2018.01A dataset proposed by O'Shea et al. [4] is considered in this study because it is one of the most challenging datasets for the AMR problem. This dataset consists of two different compositions of classes spread across a wide range of SNR values. The ''Normal'' classes of the RadioML 2018.01A dataset consist of 11 modulation schemes commonly seen in impaired environments, whereas the ''Difficult'' classes contain 24 digital and analog modulation schemes. It is also notable that the RadioML 2018.01A dataset contains more than 2.5 million frames of modulation signals along with different synthetic simulated channel effects such as CFO, symbol rate offsets, delay spread and thermal noise. In this paper, the performance of the proposed DL-based AMR method will be evaluated using the ''Normal'' classes. For completeness, the parameters of the RadioML 2018.01A dataset are presented in Table 1.

A. INSTANTANEOUS FEATURES
Several instantaneous features can be extracted by employing specific parameters such as the instantaneous amplitude $a_n$, instantaneous phase $\phi_{NL}$ and instantaneous frequency $f_N$. Table 2 presents the derivation of the five instantaneous features [26] considered in this work.
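As an illustration of how such features are obtained from complex baseband samples, the following is a minimal sketch, not the derivation of Table 2. It assumes the classical definition of $\sigma_{aa}$ as the standard deviation of the absolute value of the normalized-centered instantaneous amplitude; the function names are hypothetical.

```python
import cmath
import math


def instantaneous_amplitude_phase(iq):
    """Return the instantaneous amplitude and wrapped phase of complex samples."""
    amplitude = [abs(s) for s in iq]
    phase = [cmath.phase(s) for s in iq]  # wrapped to (-pi, pi]
    return amplitude, phase


def sigma_aa(iq):
    """Std of |a_cn|, where a_cn = a / mean(a) - 1 (assumed classical definition)."""
    amplitude, _ = instantaneous_amplitude_phase(iq)
    mean_amp = sum(amplitude) / len(amplitude)
    centered = [a / mean_amp - 1.0 for a in amplitude]
    mean_sq = sum(c * c for c in centered) / len(centered)
    mean_abs = sum(abs(c) for c in centered) / len(centered)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(mean_sq - mean_abs ** 2, 0.0))


# Constant-envelope samples (PSK-like) versus four amplitude levels (ASK-like).
constant_envelope = [cmath.exp(1j * k) for k in range(64)]
four_level = [complex(level, 0.0) for level in (1, 2, 3, 4)] * 16
print(sigma_aa(constant_envelope), sigma_aa(four_level))
```

A constant-envelope signal yields a value near zero, whereas a signal with several amplitude levels does not, which is why amplitude-based features of this kind help separate amplitude-modulated schemes from phase-modulated ones.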

B. HIGH ORDER STATISTICS FEATURES
Both the Higher-Order Moments (HOMs) and HOCs [27], [28] are reported to be the best candidates for signal recognition. Mathematically, the HOMs of a signal $x$ are defined as:
where $k$ is the order of the moment. The cumulant of order $k$ of the zero-mean signal $x$ is defined as:
The mathematical relation between cumulants and moments can be expressed as:
where $\varphi \in \{1, \ldots, n\}$, $v$ refers to the list member in the partition $\varphi$, and $c$ is the number of elements in the partition $\varphi$. For instance, the cumulants of up to the 6th order are defined as follows [26]:
It is evident from (9) to (15) that the estimation of moments will lead to the estimation of the cumulants as well. However, given a signal $x$ with $N$ samples, the moments are estimated as:
Without loss of generality, the normalized signal $x$ is assumed to have unit energy, i.e., $C_{21} = 1$. The normalization process is used to address the scaling issues of the estimators. In practice, the self-normalized HOMs and HOCs are calculated as:
Estimating a moment of order $k$ requires only around $N$ complex additions and $k \times N$ complex multiplications. Theoretically, when $k$ is close to $N$, the complexity of cumulant estimation is of order $N^2$. Practically, the cumulant order $k$ is far smaller than $N$ and, thus, the complexity of cumulant estimation will be of order $N$ [16]. Notably, the computational cost incurred by feature calculation is of the same order as that of the estimation of cumulants and moments. Therefore, the feature extraction process is considered to have a very low complexity of $O(N)$ in practical scenarios [16]. In this work, we introduce a total of ten new cumulant compositions. These compositions were found through simple experiments and they are defined as follows:
To this end, a total of 15 HCFs are obtained to construct the original HCFs set denoted as $F$, where:
$F = \{\ldots, \mu^{a}_{42}, \sigma_{aa}, \gamma_{\max}, P_1, P_2, P_3, P_4, P_5, P_6, P_7, P_8, P_9, P_{10}\}$ (29)
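The moment-to-cumulant route described above can be sketched as follows. This is an illustrative example rather than the paper's equations: it assumes the common mixed-moment notation $M_{pq} = E[x^{p-q}(x^*)^q]$ and the textbook relations for cumulants up to the fourth order; the function names are hypothetical, and the ten new compositions $P_1$ to $P_{10}$ are not reproduced.

```python
import math


def moment(x, p, q):
    """Sample estimate of M_pq = E[x^(p-q) * conj(x)^q]; one pass over N samples."""
    return sum(s ** (p - q) * s.conjugate() ** q for s in x) / len(x)


def normalize_power(x):
    """Scale the signal so that C21 = M21 = 1 (unit average power)."""
    scale = math.sqrt(moment(x, 2, 1).real)
    return [s / scale for s in x]


def cumulants_up_to_order_4(x):
    """Textbook cumulants of a zero-mean complex signal, built from its moments."""
    m20, m21 = moment(x, 2, 0), moment(x, 2, 1)
    m40, m41, m42 = moment(x, 4, 0), moment(x, 4, 1), moment(x, 4, 2)
    return {
        "C20": m20,
        "C21": m21,
        "C40": m40 - 3 * m20 ** 2,
        "C41": m41 - 3 * m20 * m21,
        "C42": m42 - abs(m20) ** 2 - 2 * m21 ** 2,
    }


# Noise-free ideal constellations, used only to illustrate the computation.
bpsk = normalize_power([3 + 0j, -3 + 0j] * 8)
qpsk = normalize_power([1 + 0j, 1j, -1 + 0j, -1j] * 8)
print(cumulants_up_to_order_4(bpsk)["C40"], cumulants_up_to_order_4(qpsk)["C40"])
```

Each moment is a single pass over the $N$ samples, which is consistent with the $O(N)$ complexity argument above. On these ideal constellations the fourth-order cumulants already differ between BPSK and QPSK, which is the property that makes them useful as HCFs.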

IV. CFs EXTRACTION USING CNNs
CNN is a popular deep learning architecture used to automatically extract CFs from the input data. A typical CNN architecture consists of several convolution (Conv) and pooling (Pool) layers connected in series along with at least one fully-connected (FC) layer. The depth of a CNN architecture increases with the number of convolution, pooling and fully-connected layers.
The convolution layers of a CNN play an essential role in the feature extraction of input data, where these features are learnt by a set of convolutional filters constructed from a group of neurons arranged as a rectangular grid. When 2-dimensional (2D) input data are passed forward through the CNN, a set of 2D activation maps is produced by sliding all convolution filters across the input data. During the training of the CNN, the weight values of the convolution filters are updated through forward propagation and backpropagation of the error. This enables the learned filters to activate when encountering the desired types of input signals.
Pooling layers are commonly located between single or multiple convolutional layers. Max pooling and average pooling are the two popular pooling layers adopted. The former takes the maximum value within each kernel, whereas the latter takes the average value within each kernel. The pooling layers of a CNN can avoid network overfitting to a certain extent by progressively reducing the dimensions of the features. The presence of pooling layers also enables the CNN to become invariant to small translations of the input, therefore improving its accuracy. Finally, the fully-connected layers are used as the classifier part of the CNN and they consist of several nodes connected to all activations received from the previous layer.
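The difference between the two pooling operations can be illustrated with a minimal one-dimensional sketch. The helper name `pool1d` and the choice of non-overlapping windows (stride equal to the kernel size) are assumptions made for illustration only.

```python
def pool1d(values, kernel, mode="max"):
    """Non-overlapping pooling over a 1-D feature map (stride equals kernel size)."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    windows = [values[i:i + kernel]
               for i in range(0, len(values) - kernel + 1, kernel)]
    if mode == "max":
        return [max(w) for w in windows]
    return [sum(w) / len(w) for w in windows]


feature_map = [1, 3, 2, 8, 5, 4]
print(pool1d(feature_map, 2, "max"))  # [3, 8, 5]
print(pool1d(feature_map, 2, "avg"))  # [2.0, 5.0, 4.5]
```

Both variants halve the length of the feature map here, which is the progressive dimension reduction mentioned above.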
An appropriate activation function needs to be defined for all CNN layers except for the input layer. Both the rectified linear unit (ReLU) and the parametric rectified linear unit (PReLU) are commonly used in CNNs, and they execute a threshold operation on each input element. Softmax is another activation function widely used to solve multi-class classification problems: it normalizes the input vector into a probability distribution whose entries are proportional to the exponentials of the input numbers, so that each entry lies in the (0, 1) interval.
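The three activation functions mentioned above can be written in a few lines. This is a minimal scalar sketch: the slope `alpha` of PReLU is a learned parameter in practice, and the value 0.25 used here is only an illustrative default.

```python
import math


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def prelu(x, alpha=0.25):
    """Parametric ReLU: negative inputs are scaled by alpha (learned in practice)."""
    return x if x > 0 else alpha * x


def softmax(logits):
    """Map a vector of scores to probabilities proportional to exp(score)."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
print(relu(-2.0), prelu(-2.0), probabilities)
```

The softmax outputs sum to one and preserve the ordering of the input scores, so the predicted class is simply the index of the largest probability.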

V. THE PROPOSED DL-BASED AMR TECHNIQUE
This section is organized in a top-down manner. An overview of the proposed DL-based AMR method is first presented, followed by explanations of the technical details of each constituent sub-system in the subsequent subsections. Fig. 2 shows the block diagram of the proposed DL-based AMR method. Initially, the received signals are fed into the feature extraction sub-system, where three main groups of HCFs are generated. The first group ($G_1$) is used for the SNR splitter, while the other two groups ($G_2$ and $G_3$) are used along with the CFs for modulation scheme classification. For the HCFs belonging to $G_1$ that are fed into the SNR splitter sub-system, a specific threshold (Thr) is defined to divide the overall SNR range ($R$) into two smaller ranges $R_1$ and $R_2$, where $R = R_1 \cup R_2$. Depending on the identified SNR range, the signals are then fed into the corresponding CNN for CFs extraction. The CFs extracted by $CNN_1$ and $CNN_2$ are then concatenated with the $G_2$ and $G_3$ groups, respectively. Finally, the classification of modulation schemes is performed by one of the two SVMs, denoted as $SVM_1$ and $SVM_2$, which are deployed for the SNR ranges $R_1$ and $R_2$, respectively.
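The data flow of Fig. 2 can be summarized with the following structural sketch. It is not the authors' implementation: every name (`Branch`, `recognize`, the callables) is hypothetical, and the feature extractors, CNNs and SVMs are represented by plain callables so that only the routing logic is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Features = List[float]
Signal = Sequence[complex]


@dataclass
class Branch:
    """One SNR-range branch: an HCF group, a CNN feature extractor and an SVM."""
    hcf_group: Callable[[Signal], Features]  # G2 for R1, G3 for R2
    cnn: Callable[[Signal], Features]        # CNN1 for R1, CNN2 for R2
    svm: Callable[[Features], str]           # SVM1 for R1, SVM2 for R2


def recognize(signal: Signal,
              splitter_hcfs: Callable[[Signal], Features],
              snr_splitter: Callable[[Features], str],
              branches: Dict[str, Branch]) -> str:
    """Route a signal through the SNR splitter, then classify in the chosen branch."""
    snr_range = snr_splitter(splitter_hcfs(signal))  # "R1" or "R2", decided from G1
    branch = branches[snr_range]
    fused = branch.cnn(signal) + branch.hcf_group(signal)  # concatenate CFs and HCFs
    return branch.svm(fused)
```

Only one branch is evaluated per received signal, which is why each branch can use a smaller CNN than a single model trained over the whole SNR range.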

A. NOVEL ALGORITHM FOR SNR WIDE-RANGE SPLITTING
In this subsection, we propose a novel algorithm to split the wide range of SNR values into two smaller ranges based on a set of HCFs and an SVM classifier. The SVM classifier that can be utilized as a SNR splitter is first presented, followed by the descriptions of novel algorithm used to select the most relevant HCFs by accurately splitting the overall SNR range based on the most effective splitting threshold. The proposed algorithm is then evaluated by using both ''Normal'' and ''Difficult'' classes. Finally, a new training method known as cross-SNR training is presented to reduce the classification errors produced by SNR splitter in order to enhance the classification accuracy of the proposed DL-based AMR method.

1) SNR SPLITTER
In practice, SVM classifiers offer good computational speed and memory efficiency, and they work relatively well when there is a clear margin of separation between a few classes. Therefore, a fast linear SVM classifier is adopted as the SNR splitter of the proposed DL-based AMR method. For each signal obtained from the dataset, a vector of HCFs denoted as $G_1$ is extracted. In offline training, all HCF vectors extracted from signals with SNR values lower than a predefined threshold (Thr) are labeled as $L_1$, whereas those with SNR values greater than Thr are labeled as $L_2$. After the SVM is trained, it is integrated into the proposed AMR method as the SNR splitter. It identifies the SNR range of the received signals before the HCFs and CFs extraction system. The complete pseudo-code of the SNR splitter is presented in Algorithm 1. It is worth mentioning that the same criterion is used to label the modulation schemes of all signals with $L_1$ and $L_2$.
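The labeling rule used for offline training of the splitter can be sketched as below. The function names are hypothetical, and assigning signals whose SNR equals Thr to $L_2$ is an assumption, chosen to be consistent with the ranges $R_1 = [-20, 0]$ dB and $R_2 = [2, 30]$ dB reported later for Thr = +2 dB.

```python
def label_for_splitter(snr_db, thr_db):
    """Label a training signal as L1 (below Thr) or L2 (at or above Thr)."""
    return "L1" if snr_db < thr_db else "L2"


def build_splitter_labels(snrs_db, thr_db=2.0):
    """Labels for a list of per-signal SNR values (thr_db = +2 dB is illustrative)."""
    return [label_for_splitter(s, thr_db) for s in snrs_db]


def splitting_accuracy(true_labels, predicted_labels):
    """Fraction of signals assigned to the correct SNR range."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


labels = build_splitter_labels([-20, 0, 2, 30])
print(labels)  # ['L1', 'L1', 'L2', 'L2']
```

The resulting labels, paired with the $G_1$ feature vectors, form the training set of the linear SVM; the accuracy helper mirrors the quantity that the splitter reports after testing.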

2) NOVEL HCFs SELECTION ALGORITHM FOR SPLITTING SNR RANGE
The total SNR range is divided into several consecutive ranges to investigate the variation of each HCF value over these ranges. This procedure aims to select the best features and threshold values that can split the overall SNR range into smaller ranges with higher splitting accuracy.
Train the SVM classifier using $L_{tr}$, then test it using $L_t$. Finally, store the accuracy in $A_c$ and the predicted labels in $L_p$.
The Coefficient of Variation (CV) measure is used to demonstrate the variation of each HCF value under varying SNR conditions. CV is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$, and it is used to measure the dispersion of data samples around the mean value of the population. CV is able to compare the degree of variation between different data series efficiently even though their mean values are drastically different from one another [29]. It is also notable that CV is sensitive to small changes in the mean if the latter has a near-zero value. For a group of signals denoted as $S_m$, the CV of the $k$-th feature ($f_k$), denoted as $CV_{km}$, is the ratio of the standard deviation to the mean of $f_k$ over $S_m$. For simplicity, we denote the minimum and maximum values of SNR as $r_1$ and $r_{L+1}$, respectively. Hence, the overall range of SNR denoted as $[r_1, r_{L+1}]$ is equally split into $L$ ranges as shown in Eq. (32). The group of signals for the specific SNR range of $[r_{m-1}, r_m]$ is denoted as $S_m$ as shown in Eq. (33). Hence, the signals are divided into a total of $L$ groups as:
where $R_m$ is calculated as:
Many signals with SNR value equal to $r_{m-1}$ are shared between any two consecutive groups $S_{m-1}$ and $S_m$. Under this circumstance, the values of the $k$-th feature $f_k$ in these two groups are approximately the same if the values of $CV_{k,m-1}$ and $CV_{km}$ are approximately the same. The second derivative of each $CV_k$ vector is hence calculated to determine the SNR range where the CV of an HCF starts to vary. In principle, positive and negative values of the second derivative imply that a curve is concave up and concave down, respectively. The SNR range can be detected where a feature value starts to change drastically, with an increasing rate of variation and a concave upward trend of the curve. Table 3 shows the $CV_{km}$ values of each $k$-th HCF for all groups $S_m$, where the highlighted cells indicate that the $CV_k$ of a particular feature is in a concave upward trend during $S_m$.
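The CV-based screening described above can be sketched as follows. This is a simplified illustration under stated assumptions: the population standard deviation is used, the second derivative is approximated by the discrete second difference of the CV vector over consecutive SNR groups, and the function names are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV = standard deviation / mean; unstable when the mean is close to zero."""
    return statistics.pstdev(values) / statistics.fmean(values)


def second_difference(series):
    """Discrete approximation of the second derivative at interior points."""
    return [series[i + 1] - 2 * series[i] + series[i - 1]
            for i in range(1, len(series) - 1)]


def concave_up_groups(cv_per_group):
    """0-based indices of interior SNR groups where the CV curve is concave upward."""
    d2 = second_difference(cv_per_group)
    return [i + 1 for i, value in enumerate(d2) if value > 0]


# Hypothetical CV values of one HCF over five consecutive SNR groups.
cv_vector = [1.0, 1.0, 1.0, 1.5, 3.0]
print(concave_up_groups(cv_vector))  # [2, 3]
```

Groups flagged by several HCFs at once are candidate locations for the splitting threshold, in line with the selection rule that follows.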
Notably, two consecutive groups of $CV_{k,m-1}$ and $CV_{km}$ should be selected to obtain a center value of SNR as the required threshold. In order to fulfill the objectives of SNR splitting, it is also desirable to define a threshold value that is able to split the signal dataset into two subsets of similar sizes. The SNR ranges in which most of the HCFs show a concave upward trend are then selected as the threshold value for splitting. Algorithm 2 presents the full pseudo-code of the HCFs and SNR threshold selection algorithm, which considers the following inputs:
• A 3D array of HCFs denoted as $F$, where the first dimension represents the size of the modulation pool ($l_p$), the second dimension represents the total number of signals per type ($N_t$) and the third dimension represents the HCFs count ($l_f$).
• Minimum value of SNR denoted as $S_n^{\min}$.
• Maximum value of SNR denoted as $S_n^{\max}$.
• Step size of SNR denoted as step.
• Percentage of signals used for testing denoted as $P_{ct}$.
The following variables are first initialized in Algorithm 2, i.e., number of SNR values ($N_{snr}$), number of testing signals per type per SNR ($N_t^s$), number of testing signals per type ($N_t$), number of training signals per type per SNR ($I_t^s$),

Algorithm 2 The Algorithm of HCFs and SNR Threshold Selection for SNR Splitter
Input: $F$, $N_t$, $S_n^{\min}$, $S_n^{\max}$, step, $P_{ct}$
Output: Selected HCFs group $G_1$ for the SNR range splitter; SNR splitter accuracy $A_c$; SNR splitter threshold Thr.

In order to prevent errors in SNR splitting, the current feature needs to be excluded if it has relative minima or maxima when the CV vector is in a concave upward trend, as indicated by the sign of the first derivative. Each pair of consecutive signal groups (i.e., $S_m$ and $S_{m+1}$) that indicates a change of the CV vector with an increasing rate of variation is then identified. Finally, the potential HCFs are fed into an SNR splitter function in order to determine the most relevant HCFs and the best threshold (Thr) that lead to the best splitting accuracy.

3) EVALUATION OF PROPOSED SNR-SPLITTING ALGORITHM
The proposed SNR-splitting algorithm is evaluated on both the ''Normal'' and ''Difficult'' classes of the adopted dataset. Tables 4 and 5 show some potential splitting thresholds obtained by the proposed algorithm together with their corresponding selected HCFs and splitting accuracy for the ''Normal'' and ''Difficult'' classes, respectively. For the ''Normal'' classes, Thr = +2 dB is identified as the best SNR splitting threshold, producing the selected HCFs of $G_1$ with the highest splitting accuracy of 99.49%. Fig. 3 shows the variation of the mean envelope of each selected HCF in $G_1$ over the overall SNR range for all modulation schemes considered in the ''Normal'' classes. Each selected HCF is observed to take different values on either side of the splitting threshold of Thr = +2 dB, which divides the total SNR range into two smaller ranges. When the whole SNR range is considered in the training process, the hyperplanes of classifiers tend to be distorted and a more complex CNN is needed to learn the most relevant CFs. On the other hand, the strategy of splitting the total SNR range into two smaller ranges can achieve higher accuracies by training the classifiers with more robust and relevant features extracted using CNNs with lower complexity. The feasibility of this method will be further investigated in the following sections.

4) CROSS-SNR TRAINING METHOD
When a cross-SNR training method is used to train both CNNs on two different SNR ranges, at least one of the SNR values can be shared between these two models. The shared SNR values can be selected by identifying the SNR ranges in which the splitter produces the majority of wrong labels.
The distributions of wrong labels are subsequently presented in the histograms of Fig. 4 to identify the SNR values that produce the largest numbers of wrong labels. In our case, $CNN_1$ and $CNN_2$ are trained on the SNR ranges of $R_1 \cup [+2, +4]$ dB and $R_2 \cup \{0\ \text{dB}\}$, respectively.
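The overlapping training ranges can be made concrete with a short sketch. It assumes a 2 dB SNR step, which is an assumption here since Table 1 is not reproduced in this text, together with the ranges $R_1 = [-20, 0]$ dB and $R_2 = [2, 30]$ dB; the variable names are hypothetical.

```python
def snr_grid(low_db, high_db, step_db):
    """Inclusive list of SNR levels from low_db to high_db."""
    return list(range(low_db, high_db + 1, step_db))


STEP_DB = 2  # assumed SNR step of the dataset
r1 = snr_grid(-20, 0, STEP_DB)
r2 = snr_grid(2, 30, STEP_DB)

# Cross-SNR training: each CNN also sees the levels just across the threshold.
cnn1_train_snrs = sorted(set(r1) | set(snr_grid(2, 4, STEP_DB)))
cnn2_train_snrs = sorted(set(r2) | {0})
shared_snrs = sorted(set(cnn1_train_snrs) & set(cnn2_train_snrs))
print(shared_snrs)  # [0, 2, 4]
```

Because both models are trained on the levels around the threshold, a signal that the splitter sends to the wrong branch near Thr is still processed by a CNN that has seen its SNR level during training.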

B. HCFs SELECTION FOR AMR
In this section, the characteristics of each HCF under different noise regimes are studied before designing a novel HCFs selection algorithm. The variance of each normalized HCF is first calculated for the overall SNR range. A careful study is then performed on the HCFs in the SNR ranges $R_1$ and $R_2$ in order to identify the best combinations of HCFs that are able to boost the performance of AMR in the corresponding SNR ranges.

1) THE VARIANCE OF THE NORMALIZED HCFs
In order to construct a more robust and relevant feature subset, the behavior of each feature stored in the original feature set $F$ is investigated under different SNR values. Denote $D_k$ as the variance of each $k$-th normalized HCF ($F_k$) for all modulation schemes under a specific SNR range, then:
where $M$ is the size of the modulation pool; $N_t$ is the number of signals per modulation for the total SNR range; $F_k(i, m)$ is the normalized feature; $\mu_k(m)$ is the mean value of the normalized feature $F_k$ for the $m$-th modulation and it is always equal to one. Both $F_k(i, m)$ and $\mu_k(m)$ are calculated using Eqs. (36) and (37), respectively. Table 6 shows the variance of each normalized HCF stored in $F$.
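One possible reading of this normalized variance is sketched below. It is an assumption rather than the paper's exact equations: each feature value is divided by the mean of that feature within its modulation class, so that the normalized mean equals one as stated above, and the squared deviations from one are averaged over all $M \times N_t$ samples. The function name is hypothetical.

```python
def normalized_variance(feature_by_modulation):
    """Variance of a mean-normalized HCF, pooled over all modulation classes.

    feature_by_modulation[m] holds the raw values of one HCF for modulation m.
    Assumes every per-modulation mean is non-zero.
    """
    squared_deviation = 0.0
    count = 0
    for values in feature_by_modulation:
        mean_m = sum(values) / len(values)
        normalized = [v / mean_m for v in values]  # mean is 1 by construction
        squared_deviation += sum((f - 1.0) ** 2 for f in normalized)
        count += len(normalized)
    return squared_deviation / count


stable_feature = [[2.0, 2.0, 2.0], [5.0, 5.0, 5.0]]
varying_feature = [[1.0, 3.0], [2.0, 2.0]]
print(normalized_variance(stable_feature), normalized_variance(varying_feature))
```

A feature that keeps the same value for every signal of a given modulation has zero normalized variance regardless of its scale, which is the sense in which a small $D_k$ indicates robustness across the SNR range.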

2) LIMITATIONS OF NORMALIZED VARIANCES FOR HCFs
A substantial body of literature has revealed that it is not desirable to consider features that change with SNR during the training process, due to their high tendency to distort the hyperplanes of classifiers and affect their accuracies [30]. Wu et al. [30] considered the variance of normalized features to select robust features that can be used under varying noise conditions by identifying the cluster with the minimum variance value. In contrast to [30], this paper does not consider only the minimum variance criterion in selecting the relevant features that lead to the best classification performance. From a practical point of view, some normalized features with relatively large variances can also be considered to boost the classification accuracy of the proposed method, because the value of an HCF can fall into different specific ranges when different modulation schemes are applied. In this context, it is also notable that the variance of normalized features [30] does not consider the mean value. Therefore, more investigations are required to construct a robust and relevant HCFs set.

3) CLASSIFICATION CONFUSION INDEX (CCI)
In practice, some features take varying values under a specific modulation scheme but remain constant under other schemes. If the variance of the normalized feature is the only criterion considered in selecting HCFs, some useful features that only demonstrate large variation in certain modulation schemes could be wrongly eliminated. In order to select the features that lead to the best modulation recognition rate, the variance $D_{\mu_k}$ of all normalized means $\mu_k(m)$ is calculated, where $E[\mu_k(m)]$ is the mean value of the normalized mean $\mu_k(m)$ of the k-th feature $F_k$ under the m-th modulation scheme. Fundamentally, a feature with a larger value of $D_{\mu_k}$ implies larger distances between modulation schemes. Meanwhile, a feature with a smaller variance $D_k$ tends to be more robust throughout the given SNR range. In other words, better classification accuracy can be achieved through a larger $D_{\mu_k}$ and a smaller $D_k$. For each k-th feature, the associated classification confusion index $CCI_k$ is calculated from these two quantities.

After splitting the overall SNR range of $[r_1, r_{L+1}]$ into two smaller ranges of $[r_1, r_{Thr}]$ and $[r_{Thr+1}, r_{L+1}]$, recalculating the variance of each normalized HCF under these two new SNR ranges can produce new clusters, because the behavior of each feature can change under different SNR ranges. Table 7 and Table 8 show the variance of each normalized HCF produced under the SNR ranges of $R_1 = [-20, 0]$ dB and $R_2 = [2, 30]$ dB, respectively. When K-means clustering is performed on all HCFs under the SNR range of $R_1$, four clusters are produced. Four clusters are likewise produced by K-means clustering in the SNR range of $R_2$. The $CCI_k$ of each k-th feature is calculated and the results of all features for the SNR ranges of $R_1 = [-20, 0]$ dB and $R_2 = [2, 30]$ dB are presented in Table 10 and Table 9, respectively. The outliers present in both SNR ranges are removed.

For the SNR range of $R_1$, the HCFs $\mu^a_{42}$, $P_5$ and $P_6$ are considered outliers and eliminated from the original feature set. K-means clustering is subsequently applied to the remaining features to produce four clusters. For the SNR range of $R_2$, $P_2$ and $\sigma_{aa}$ are identified as outliers and eliminated from the original feature set, and four clusters are again obtained from the remaining features by using K-means clustering. The complete pseudocode of HCFs selection based on the CCI criterion is summarized in Algorithm 3. Evidently, different clusters of HCFs can be produced under the SNR ranges of $R_1$ and $R_2$ when the criteria of CCI and normalized variance are considered. The performance of these HCFs selection algorithms is evaluated in the next sections.
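The CCI-based selection described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions, not the authors' implementation: the CCI is assumed to be the ratio of the within-range variance to the variance of the per-scheme normalized means (so that smaller is better, in line with selecting the minimum-CCI cluster), outliers are flagged with a median-based rule chosen here for illustration, and a tiny one-dimensional K-means is written out by hand. All function names, feature names and numbers are hypothetical.

```python
from statistics import mean, median, pvariance


def cci(norm_means, d_k):
    """Assumed CCI: variance D_k divided by the variance of per-scheme means."""
    d_mu = pvariance(norm_means)  # spread of the normalized means across schemes
    return d_k / d_mu


def drop_outliers(scores, factor=5.0):
    """Assumed rule: drop scores farther than factor * MAD from the median."""
    values = list(scores.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return dict(scores)
    return {k: v for k, v in scores.items() if abs(v - med) <= factor * mad}


def kmeans_1d(scores, n_clusters, n_iter=50):
    """Plain Lloyd iterations on scalar scores (n_clusters >= 2)."""
    names = sorted(scores, key=scores.get)
    # Spread the initial centroids evenly over the sorted scores.
    step = (len(names) - 1) / (n_clusters - 1)
    centroids = [scores[names[round(i * step)]] for i in range(n_clusters)]
    clusters = []
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for name in names:
            nearest = min(range(n_clusters),
                          key=lambda c: abs(scores[name] - centroids[c]))
            clusters[nearest].append(name)
        centroids = [mean([scores[n] for n in members]) if members else centroids[i]
                     for i, members in enumerate(clusters)]
    return clusters


def select_min_cci_cluster(scores, n_clusters=4):
    """Remove outliers, cluster the CCI values, keep the lowest-CCI cluster."""
    kept = drop_outliers(scores)
    clusters = [c for c in kmeans_1d(kept, n_clusters) if c]
    return min(clusters, key=lambda c: mean([kept[n] for n in c]))


# Hypothetical CCI values for nine placeholder features in one SNR range.
toy_scores = {"f1": 0.10, "f2": 0.12, "f3": 0.50, "f4": 0.55, "f5": 0.90,
              "f6": 0.95, "f7": 1.40, "f8": 1.45, "f9": 40.0}
print(select_min_cci_cluster(toy_scores))
```

In practice the same routine would be run once per SNR range, so that each range obtains its own subset of HCFs.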

A. SIMULATION SETTINGS
The RadioML 2018.01A dataset is selected for performance evaluation. The performances of the proposed HCFs selection algorithm and the proposed DL-based AMR method are compared with the methods used by O'Shea et al. [4]. The parameter settings required for the simulation studies are presented in Table 11.
The method introduced by Wu et al. [30], which selects a robust set of HCFs based on the normalized variance, is used to validate the performance of the proposed HCFs selection algorithm. In addition, the baseline (BL) AMR method presented in [4], which considers a total of 28 HCFs including HOMs, HOCs and other features, is selected to validate the AMR performance of the proposed method built on the proposed HCFs selection algorithm and SNR splitter. Finally, both the CNN/VGG and RN methods introduced in [4] are also considered for performance validation of the proposed DL-based AMR method.

Algorithm 3 Pseudo Code of HCFs Selection Based on the CCI Criterion
Remove outliers from $CCI_1$ and $CCI_2$.
Apply K-means clustering on both $CCI_1$ and $CCI_2$.
Select the cluster $K_1$ with minimum values in $CCI_1$.
Select the cluster $K_2$ with minimum values in $CCI_2$.

B. SIMULATION RESULTS FOR HCFs SELECTION ALGORITHM
The effectiveness of the proposed HCFs selection algorithm in identifying the most relevant features is investigated in this section. First, a fair comparison is made between the performance of AMR based on the proposed HCFs selection algorithm and that based on the normalized variance used in [30]. The proposed CCI criterion and the normalized variance [30] are used separately to select the HCFs subsets on the split SNR ranges $R_1$ and $R_2$, which are then fed into $SVM_1$ and $SVM_2$, respectively. The compared configurations include:
• HCFs-CCI-1cl: Adopt the proposed HCFs selection algorithm with one cluster of HCFs, i.e., $A_3$ and $A_4$ for $R_1$ and $R_2$, respectively.
• HCFs-VAR-1cl: Adopt the variance of normalized HCFs for HCFs selection [30] with one cluster of HCFs, i.e., $A_1$ and $A_2$ for $R_1$ and $R_2$, respectively.
The performances of the aforementioned methods are also compared with the BL method presented in [4]. Table 12 presents the dimension of the HCFs vector produced by each compared HCFs selection algorithm for both SNR ranges of $R_1$ and $R_2$, whereas the classification performances of all compared methods are illustrated in Fig. 5.

It is observed that the proposed HCFs selection algorithm with two clusters of HCFs (i.e., HCFs-CCI-2cl) achieves the best classification accuracy among all compared methods. In addition, the proposed HCFs selection algorithm outperforms the compared HCFs selection algorithms based on normalized variance [30]. For instance, the classification accuracy of HCFs-CCI-2cl exceeds 81% at an SNR of 2 dB, whereas HCFs-VAR-2cl only reaches an accuracy of around 60%. It is also notable that HCFs-CCI-2cl produces a maximum classification accuracy of 97% in the high SNR region, whereas the best classification accuracies obtained by HCFs-VAR-2cl and BL [4] are 96.7% and 94.6%, respectively. Similar observations are made when only one cluster of HCFs is considered, where the classification accuracy of HCFs-CCI-1cl is around 8% better than that of HCFs-VAR-1cl throughout the SNR ranges. Referring to the simulation results presented in Table 12 and Fig. 5, it can be concluded that the proposed HCFs selection algorithm is more competitive, because it selects the most relevant HCFs and produces high classification accuracy with a lower-dimensional HCFs vector.
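The way a selected HCF subset is applied per SNR range before classification can be sketched as follows. This is a minimal illustration under assumptions: the range boundaries are the $R_1$ and $R_2$ values quoted in the text, the SNR of a sample is taken as already known or estimated, the feature names and subset contents are hypothetical placeholders rather than the clusters reported in the paper, and the classifier itself is left out.

```python
R1 = (-20, 0)  # low-SNR range from the text, in dB
R2 = (2, 30)   # high-SNR range from the text, in dB

# Hypothetical subsets standing in for the clusters selected on each range.
SELECTED = {"R1": ["f1", "f4"], "R2": ["f2", "f3", "f5"]}


def snr_range(snr_db):
    """Map an SNR value (dB) to the label of the range that contains it."""
    if R1[0] <= snr_db <= R1[1]:
        return "R1"
    if R2[0] <= snr_db <= R2[1]:
        return "R2"
    raise ValueError(f"SNR {snr_db} dB lies outside both ranges")


def reduced_vector(features, snr_db):
    """Keep only the HCFs selected for the sample's SNR range."""
    label = snr_range(snr_db)
    return label, [features[name] for name in SELECTED[label]]


# Placeholder feature values for one sample.
sample = {"f1": 1.0, "f2": 2.0, "f3": 3.0, "f4": 4.0, "f5": 5.0}
print(reduced_vector(sample, -4))
print(reduced_vector(sample, 10))
```

The returned label indicates which of the two range-specific classifiers would receive the reduced vector.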

C. SIMULATION RESULTS FOR THE PROPOSED DL-BASED AMR TECHNIQUE
FIGURE 5. Performance comparison of the proposed HCFs selection algorithm, the normalized variance, and the baseline method in [4].

In this section, simulations are conducted to compare the proposed DL-based AMR method with both the CNN/VGG and RN methods presented in [4]. The same parameter settings summarized in Table 11 are used in this performance analysis. Table 14 presents the network structures of the two CNN models, denoted as $CNN_1$ and $CNN_2$, which extract the contextual features from signals belonging to the SNR ranges of $R_1$ and $R_2$, respectively. Notably, both $CNN_1$ and $CNN_2$ have fewer trainable parameters than both CNN/VGG and RN employed in [4], as shown in Table 13.

The cross-entropy (CE) is adopted as the loss function, considering that AMR is essentially a multi-class classification task. The CE function is given by:

$$CE = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

where $y_c$ is the ground-truth vector obtained through one-hot encoding of the sample label, $\hat{y}_c$ is the predicted vector, and $C$ is the number of sample types.

The classification accuracies produced by the proposed DL-based AMR method and the other benchmark methods are presented in Fig. 6. The results indicate that the presence of HCFs contributes to the performance boost, because the networks that consider the concatenation of HCFs and CFs achieve higher classification accuracy than those that only consider the CFs. The proposed method also produces better accuracy than the other benchmark methods regardless of the presence of HCFs. In particular, the classification accuracy of the proposed method is more than 87% when SNR ≥ −2 dB and more than 95% when SNR ≥ 0 dB. It is also notable that the proposed method achieves the maximum classification accuracy of 100% for SNR ≥ 6 dB.
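To make the loss and the feature fusion concrete, the following Python sketch evaluates the standard cross-entropy for a one-hot label and concatenates an HCF vector with a CF vector. It is an illustration only: the helper names (`one_hot`, `softmax`, `cross_entropy`, `fuse`) and all numbers are hypothetical, and the small `eps` guard against `log(0)` is an added assumption that is not taken from the paper.

```python
import math


def one_hot(index, n_classes):
    """Ground-truth vector with a single 1 at the labelled class."""
    return [1.0 if c == index else 0.0 for c in range(n_classes)]


def softmax(logits):
    """Turn raw network outputs into a probability vector."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred, eps=1e-12):
    """CE = -sum_c y_c * log(y_hat_c), with predictions clipped at eps."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def fuse(hcfs, cfs):
    """Concatenate hand-crafted and contextual features into one vector."""
    return list(hcfs) + list(cfs)


# Placeholder example: three classes, the true class is index 1.
label = one_hot(1, 3)
prediction = softmax([0.2, 1.5, -0.3])
print(cross_entropy(label, prediction))
print(fuse([0.4, 1.1], [0.9, -0.2, 0.3]))
```

Because the label is one-hot, only the predicted probability of the true class contributes to the loss, so the CE falls towards zero as that probability approaches one.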
The classification performance of the proposed DL-based AMR method in handling each individual modulation scheme is further illustrated in Fig. 7. Specifically, the modulation schemes with lower information rates and unique structures, such as AM and FM, can be classified by the proposed method with high accuracy in the low SNR region. For instance, both the AM and FM schemes are identified by the proposed method with accuracy greater than 88% for SNR = −4 dB and greater than 99.8% for SNR ≥ −2 dB. Meanwhile, the proposed method identifies the low-order schemes such as GMSK, 4ASK and BPSK with a classification accuracy of around 99% when SNR = −2 dB. The modulation schemes with more sophisticated structures and higher orders require higher SNR for robust performance. From Fig. 7, it is notable that the proposed method classifies all modulation schemes with an accuracy of 100% when SNR ≥ 8 dB.

Six confusion matrices produced by the proposed DL-based AMR method across all 11 classes for multiple SNR values are presented in Fig. 8 in order to further analyze its classification performance. On the one hand, the majority of classification errors occur among the phase-shift keying schemes (QPSK, 8PSK, OQPSK), especially under substantial noise. This may be due to the lack of information at low SNR values [31]. In addition, these schemes have similar symbol structures and statistical information when the features are computed from the IQ samples. On the other hand, signals with lower information rates and unique structures, such as AM and FM, are more readily classified at low SNR.
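The per-scheme accuracies and confusion matrices discussed above can be computed from predicted and true labels as in the following Python sketch. The function names are hypothetical, and the labels are made-up placeholders chosen only to show a PSK-to-PSK confusion pattern; they are not results from the paper.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows index the true class, columns index the predicted class."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def per_class_accuracy(matrix):
    """Diagonal count divided by the row total for each true class."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Placeholder labels at a single SNR value (illustrative only).
classes = ["QPSK", "8PSK", "AM"]
truth = ["QPSK", "QPSK", "8PSK", "8PSK", "AM", "AM"]
guess = ["QPSK", "8PSK", "8PSK", "QPSK", "AM", "AM"]

matrix = confusion_matrix(truth, guess, classes)
print(matrix)
print(per_class_accuracy(matrix))
```

Repeating the computation for the samples at each SNR value would yield one matrix per SNR, which is the form in which Fig. 8 is organized.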

VII. CONCLUSION
In this paper, a novel DL-based AMR method that performs robustly under varying noise regimes is proposed by addressing the key issues of feature extraction and feature selection criteria. Three groups of HCFs are first extracted from the received signals by using the proposed algorithms. The first group of HCFs is used to identify the SNR range of the received signals, whereas the remaining two groups are concatenated with the CFs extracted by two corresponding CNNs in order to solve the modulation classification tasks. A novel algorithm is also introduced to select the best SNR threshold and the HCFs that can split the total SNR range into two smaller ranges. Furthermore, an HCFs selection algorithm is proposed to select the most relevant features in order to reduce the dimension of the feature vector.
Extensive simulation studies have verified the reliability and effectiveness of the proposed HCFs selection algorithm in removing redundant features without compromising the classification performance. These desirable characteristics enable the proposed method to be implemented with simpler CNNs, and the method is verified using 11 modulation schemes. The competitive classification performance demonstrated by the proposed method shows that it is more efficient to split the total SNR range, which contains both positive and negative values, into two smaller ranges. When training over the total SNR range, the hyperplanes of the classifiers tend to be distorted, which compromises their classification accuracy. Furthermore, the presence of HCFs is also shown to boost the classification performance of the proposed DL-based AMR method. The proposed DL-based AMR method is expected to make a significant contribution to wireless communication because it solves the AMR tasks with higher classification accuracy while incurring lower computational complexity due to the adoption of a simpler CNN architecture. The current research work can be extended by incorporating more optimized CNN architectures into the proposed DL-based AMR method to handle larger sets of modulation schemes.