Cross Model Deep Learning Scheme for Automatic Modulation Classification

Deep Neural Networks (DNNs) have achieved remarkable accuracy improvements for automatic modulation classification. However, the employed networks often have millions of parameters and require substantial computation, which makes it difficult to deploy these models on portable devices with limited resources. We propose a cross model deep learning scheme to build a lightweight deep network for accurate modulation classification. Firstly, a large Hybrid DNN (HDNN) composed of convolutional and recurrent layers is constructed and trained for automatic and accurate classification of signals. Then we build a smaller Layered Resnet Network (LRN) with shallow layers and few nodes. The HDNN and the LRN are taken as a Teacher Model (TM) and a Student Model (SM) respectively. Finally, a knowledge distillation method is proposed to guide the learning of the SM, by formulating a teaching loss from the prediction of the TM to train the SM. The performances of the proposed HDNN and LRN are investigated on the public RadioML2016.10a and RadioML2016.10b data sets. The experimental results show that the trained HDNN presents state-of-the-art classification results and that the LRN trained in this scheme takes only about a sixth of the HDNN's inference time and consumes only 472.3 KB of storage, with a slight accuracy decrease compared with the large HDNN.

Traditional AMC methods can be roughly divided into two groups: feature-based methods and likelihood-based methods [7]. Feature-based methods [8]-[10] seek handcrafted features to distinguish different types of signals, such as higher-order moments [11], instantaneous frequency [7], instantaneous phase [11] and cyclic cumulants [11]. However, finding reliable features relies heavily on manual selection, resulting in unstable results [7], [12]. In likelihood-based methods, likelihood functions of different hypotheses are first calculated from the received signals; the results are then compared with a certain threshold to make the final classification decision [12]. Compared with feature-based methods, likelihood-based methods treat both the noise and the channel model, which reflects the propagation characteristics of signals, as prior information; nevertheless, channel models are usually unavailable in practice [12]. Consequently, likelihood-based methods cannot adapt to a dynamic or unknown channel [7], and their computational complexity is very high [7].
In [24], Rajendran et al. introduced an LSTM-RNN model for AMC and indicated that LSTM-RNN models outperform CNN models on oversampled received signals at small or medium scales. Later, Swami and Sadler [11] designed a new LSTM-RNN model comprised of an LSTM-RNN layer and two Fully-Connected (FC) layers, which achieves high accuracy in automatically classifying six types of digital modulation signals under varying noise. Moreover, Sainath et al. [25] showed that CNNs are good at reducing frequency variations while LSTM-RNNs are good at temporal modeling, so the two are complementary for sequential data processing. Building on this finding, West and O'Shea [26] proposed a model comprised of inception modules and LSTM-RNN to identify the modulation types of signals, and reported remarkable accuracy improvements over both CNN models and LSTM-RNN models. In [27], a model composed of CNN, LSTM-RNN and Gated Recurrent Unit Recurrent Neural Network (GRU-RNN) layers achieved state-of-the-art performance. In [28], Sharan et al. employed a combined CNN and LSTM-RNN model for AMC and investigated the feasibility and effectiveness of deep learning algorithms for this task. Existing models combining CNN and LSTM-RNN utilize either the original CNN structure [11], [28] or the inception structure [26]. However, these CNN structures are difficult to train [29] and suffer from vanishing gradients during training.
On the other hand, the available networks often have a large number of parameters, which makes them difficult to deploy on portable devices with limited resources. For example, the network in [28] contains 313,603 parameters with one LSTM-RNN layer, four convolution layers, and two fully connected layers. In addition, the time complexity of models comprised of CNN and LSTM-RNN layers is high in both training and prediction, as the LSTM-RNN operation is time-consuming [30]. Thus, these networks take a long time to automatically predict the types of signals.
Deep Residual networks (Resnet) [29] have been applied extensively in the field of computer vision. Resnet simplifies the training of deep networks through its shortcut connections [29]. In order to limit the training complexity, in this paper we utilize 1-D Resnet and LSTM-RNN layers to build a Hybrid Deep Neural Network (HDNN) for AMC. The HDNN presents promising results on multiple AMC data sets. Moreover, in order to reduce the storage and computational cost of the HDNN for real-time applications, we propose a Cross Model Deep Learning (CMDL) scheme to build a lightweight deep model for accurate prediction. We first construct a smaller Layered Resnet Network (LRN) with shallow layers and few nodes. Then, inspired by Knowledge Distillation (KD), which builds a small and efficient model with reasonable performance degradation from a large and complex model, we define the HDNN and the LRN as a Teacher Model (TM) and a Student Model (SM) respectively. A KD method is proposed to guide the learning of the SM by formulating a teaching loss from the prediction of the TM to train the SM. In the training of the LRN, the inter-class similarity learned and revealed by the HDNN is used to develop a more reliable LRN.
Compared with the available works, the contributions of our work can be summarized as follows: • We propose an HDNN composed of Resnet and LSTM-RNN layers. We employ 1-D Resnet to reduce the training complexity of this model. To the best of our knowledge, this is the first attempt to combine 1-D Resnet and LSTM-RNN for AMC.
• In order to enable rapid inference, we construct a lightweight deep model, the LRN. This model is comprised of only three Resnet stacks and one FC layer.
• We propose a CMDL scheme to make the LRN achieve accurate prediction, where we utilize a KD method to guide the learning of the LRN, by formulating a teaching loss from the prediction of HDNN to train the LRN.
We analyze the performance of the trained LRN on the public RadioML2016.10a and RadioML2016.10b data sets. The experimental results show that the trained HDNN presents state-of-the-art classification results, while the trained LRN suffers only a slight accuracy decrease compared with the HDNN and is well suited to rapid signal classification with limited resources. The trained LRN takes only about a sixth of the HDNN's inference time and occupies 472.3 KB of storage. To compress the LRN further, the model is also quantized in our experiments; the quantized LRN consumes only 301.2 KB of storage, with the same performance or a slight accuracy increase compared with the trained LRN.
The remainder of this paper is organized as follows. In Section II, CMDL is described in detail. Experimental results and related analysis are presented in Section III. Finally, a conclusion is drawn in Section IV.

II. CROSS MODEL DEEP LEARNING
In this section, a large HDNN and a small LRN are first constructed as the TM and the SM respectively. Then, the CMDL scheme is proposed to train the LRN from the HDNN.

A. CONSTRUCTIONS OF TM AND SM
In this subsection, a large HDNN is first constructed, which consists of three Resnet stacks, one LSTM-RNN layer, and one FC layer. An LSTM-RNN cell, which contains an input gate (i_t), a forget gate (f_t), an output gate (o_t) and a cell gate (C̃_t), is illustrated in Fig. 1. Given the input x_t, the previous hidden state h_{t-1} and the previous memory C_{t-1}, the gates are calculated as follows:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i),   (1)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f),   (2)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o),   (3)
C̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c),   (4)

where σ(·) is the sigmoid function and the W, U and b terms are trainable weights and biases. The LSTM-RNN cell maintains the memory (C_t) and the hidden state (h_t) along with the four gates. The memory C_t and the hidden state h_t at time t are updated as follows:

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t,   (5)
h_t = o_t ⊙ tanh(C_t),   (6)

where ⊙ denotes element-wise multiplication. Utilizing this gating mechanism, LSTM-RNN cells can preserve information for a longer duration, thereby extracting temporal features.
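For concreteness, the following is a minimal NumPy sketch of the cell update in (1)-(6), assuming the parameters of the four gates are stacked into single matrices W, U and a vector b; this stacking and all shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W (4n x d), U (4n x n) and b (4n,) hold the
    stacked parameters of the four gates (input, forget, cell, output)."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b        # pre-activations of all gates, (4n,)
    i = sigmoid(z[0*n:1*n])             # input gate  i_t, Eq. (1)
    f = sigmoid(z[1*n:2*n])             # forget gate f_t, Eq. (2)
    g = np.tanh(z[2*n:3*n])             # cell gate (candidate), Eq. (4)
    o = sigmoid(z[3*n:4*n])             # output gate o_t, Eq. (3)
    c_t = f * c_prev + i * g            # memory update, Eq. (5)
    h_t = o * np.tanh(c_t)              # hidden state, Eq. (6)
    return h_t, c_t
```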
In addition, 1-D convolution, which has low computational cost, is also employed as a complement to LSTM-RNN to reduce the frequency variations of signals for AMC. The operation of the q-th 1-D convolution kernel in a 1-D convolution layer can be described by

y_q[i] = g( Σ_{l=0}^{k-1} w_q[l] · s[i + l] + b_q ),   (7)

where s denotes the input of the 1-D convolution kernel and its length is n; k is the size of this convolution kernel and is generally odd; w_q[l] and b_q indicate the weights and the bias of this convolution kernel respectively; and g(·) is an activation function, for which the rectifier (ReLU) of Glorot et al. [31] is used in our models. The output of the 1-D convolution layer is

y = [y_1, y_2, ..., y_p],   (8)

where p is the number of its channels. The number of trainable parameters in a 1-D convolution kernel is k + 1, which is only about 1/k of that in a 2-D convolution kernel of size k × k (a 2-D kernel has k × k + 1 trainable parameters).
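The kernel operation in (7) can be sketched as follows; the valid-mode sliding window is an assumption, and the ReLU activation follows the description above:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv1d_kernel(s, w_q, b_q, g=relu):
    """Apply the q-th 1-D kernel of size k = len(w_q) to input s
    (valid mode, as in (7)). Trainable parameters: k weights + 1 bias."""
    k = len(w_q)
    return np.array([g(np.dot(w_q, s[i:i + k]) + b_q)
                     for i in range(len(s) - k + 1)])
```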
Based on LSTM-RNN cells and 1-D convolution, the TM, an HDNN, is designed for AMC. As shown in Fig. 2 (a), a signal of size [2,128] is first fed into this model, where the signal is a 128-sample complex (baseband I/Q) time-domain vector. We then employ three ResNet stacks (Res1, Res2, and Res3), one LSTM-RNN layer (LSTM1) comprised of 100 LSTM cells, and one FC layer (FC1) to build the TM. The structure of a ResNet stack composed of 1-D convolution layers is presented in Fig. 2 (c). The standard cross-entropy loss is used for the training of the TM. It can be described as

L_ce(x, l) = -log( softmax(TM(x))[l] ),   (9)

where x ∈ X, X is the training data set, l is the true label of x, TM(x) denotes the output of the TM, and softmax(·) is the softmax function.
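A minimal Keras sketch of the TM is given below; the filter counts, kernel sizes and exact residual wiring are illustrative assumptions (Fig. 2 specifies the actual structure), and the [2,128] input is transposed to Keras's (time, channels) layout:

```python
from tensorflow.keras import layers, models

def res_stack(x, filters, name):
    """A minimal 1-D residual stack: two Conv1D layers plus a
    projected shortcut connection (sizes are assumptions)."""
    shortcut = layers.Conv1D(filters, 1, padding='same', name=name + '_proj')(x)
    y = layers.Conv1D(filters, 3, padding='same', activation='relu',
                      name=name + '_c1')(x)
    y = layers.Conv1D(filters, 3, padding='same', name=name + '_c2')(y)
    y = layers.Add(name=name + '_add')([shortcut, y])
    return layers.Activation('relu', name=name + '_out')(y)

def build_tm(num_classes):
    inp = layers.Input(shape=(128, 2))             # I/Q signal, time-major
    x = res_stack(inp, 32, 'Res1')
    x = res_stack(x, 32, 'Res2')
    x = res_stack(x, 32, 'Res3')
    x = layers.LSTM(100, name='LSTM1')(x)          # 100 LSTM cells
    out = layers.Dense(num_classes, name='FC1')(x) # class logits
    return models.Model(inp, out)
```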
In addition, as shown in Fig. 2, two ResNet stacks of the trained TM, Res1 and Res2, are directly transferred to build the SM. They are then followed by one pooling layer (Pooling1) and one FC layer (FC2).
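Continuing the sketch above, the SM can be assembled by rebuilding Res1 and Res2 and copying their trained weights from the TM; the pooling type and all layer sizes are assumptions:

```python
from tensorflow.keras import layers, models

def build_sm(tm, num_classes):
    """Build the SM from the trained TM: reuse Res1 and Res2, then
    add Pooling1 and FC2 (specific layer choices are assumptions)."""
    inp = layers.Input(shape=(128, 2))
    x = res_stack(inp, 32, 'Res1')
    x = res_stack(x, 32, 'Res2')
    x = layers.GlobalAveragePooling1D(name='Pooling1')(x)
    out = layers.Dense(num_classes, name='FC2')(x)
    sm = models.Model(inp, out)
    # transfer the trained weights of Res1 and Res2 from the TM by name
    for lyr in sm.layers:
        if lyr.name.startswith(('Res1', 'Res2')) and lyr.weights:
            lyr.set_weights(tm.get_layer(lyr.name).get_weights())
    return sm
```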

B. CROSS MODEL DEEP LEARNING
In fact, it is particularly difficult for the SM to obtain performance similar to the TM's when trained only with the standard cross-entropy loss (9), as the SM, the LRN, has only 84,939 parameters, whereas higher accuracy normally requires more parameters and more layers.
In the CMDL scheme, we extend KD to train the SM from the TM, the HDNN. We employ the knowledge learned by the TM, namely its prediction TM(x) and its feature distribution F_TM(x), to improve the performance of the SM.
Based on KD, a loss into which TM(x) is introduced is designed to train the SM, and it can be described by

L_sm(x, l) = -Σ_{j=1}^{num} softmax(TM(x)/T)[j] · log( softmax(SM(x)/T)[j] ),   (10)

where SM(x) denotes the prediction of the SM, num is the number of the modulation types to be identified, and T is a temperature value. L_sm(x, l) is an inconsistency loss defined by the standard cross-entropy between the (softened) outputs of the SM and the TM, and it is utilized to make the SM learn from TM(x).
Hinton et al. [32] proved that softmax(·/T) with a higher temperature value T produces a softer probability distribution over classes than with T = 1. The softer probability distribution makes the distillation pay more attention to matching the negative logits below the average, which is beneficial for training an efficient model [32]. With a high value of T, the derivative w.r.t. SM(x) is calculated in [32] as

∂L_sm(x, l) / ∂SM(x)[j] ≈ (1 / (num · T²)) · ( SM(x)[j] - TM(x)[j] ).   (11)

It is obvious from (11) that minimizing L_sm(x, l) drives SM(x) toward TM(x); thus, the SM is trained to approximate the TM. In our scheme, the HDNN is taken as the TM and the LRN as the SM. The HDNN is trained first; then we utilize TM(x) to guide the training of the SM.

FIGURE 2. x is a signal fed as the input of the models and its size is [2,128]; num is the number of the modulation types to be identified; BN is a batch normalization layer; LSTM_100 denotes an LSTM-RNN layer comprised of 100 LSTM cells.
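A minimal TensorFlow sketch of the soft-target loss (10), with the T² gradient rescaling suggested in [32]; the function name and signature are ours:

```python
import tensorflow as tf

def distillation_loss(tm_logits, sm_logits, T=10.0):
    """Cross-entropy between teacher and student outputs softened at
    temperature T, as in (10); scaled by T^2 so the gradient magnitude
    stays comparable across temperatures [32]."""
    p = tf.nn.softmax(tm_logits / T)          # softened teacher targets
    log_q = tf.nn.log_softmax(sm_logits / T)  # softened student log-probs
    return -tf.reduce_sum(p * log_q, axis=-1) * (T ** 2)
```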
In addition, in order to better learn knowledge from the TM, the SM is trained to match the feature distribution of the TM, since features extracted by shallow layers are more general [33]. This feature loss is defined by

L_sf(x) = || F_SM(x) - F_TM(x) ||²,   (12)

where F_TM(x) and F_SM(x) denote the feature distribution of the TM and that of the SM respectively. The derivative w.r.t. F_SM(x) is

∂L_sf(x) / ∂F_SM(x) = 2 ( F_SM(x) - F_TM(x) ),   (13)

from which we can find that minimizing L_sf(x) makes F_SM(x) approximate F_TM(x).
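A corresponding one-line sketch of (12), assuming the squared-error form reconstructed above:

```python
import tensorflow as tf

def feature_loss(f_tm, f_sm):
    """Squared-error feature-matching loss (12); its gradient w.r.t.
    the student features is 2*(F_SM - F_TM), as in (13)."""
    return tf.reduce_sum(tf.square(f_sm - f_tm), axis=-1)
```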
In this paper, the teaching loss to train the SM is defined as

L_s(x, l) = L_ce(x, l) + L_sm(x, l) + L_sf(x),   (14)

where L_ce(x, l) is the standard cross-entropy loss of (9) computed on the SM's prediction. The CMDL is a multi-stage training process: the TM is trained first, and then the SM is trained with the knowledge generated by the trained TM (a sketch of the combined loss is given after the consideration below). One important consideration in CMDL is summarized as follows.
• Constructing and training a new TM is not mandatory: existing models with high accuracy for AMC can be taken as the TM.
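Putting the pieces together, the following sketch combines the three terms of (14), reusing distillation_loss and feature_loss from the sketches above; the equal weighting of the terms is our assumption:

```python
import tensorflow as tf

def teaching_loss(labels, tm_logits, sm_logits, f_tm, f_sm, T=10.0):
    """Teaching loss (14): hard-label cross-entropy on the SM plus the
    distillation term (10) and the feature-matching term (12)."""
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        labels, sm_logits, from_logits=True)
    return (hard
            + distillation_loss(tm_logits, sm_logits, T)
            + feature_loss(f_tm, f_sm))
```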

III. SIMULATIONS, RESULTS, AND DISCUSSION
In this section, several experiments on the RML2016.10a and RML2016.10b data sets [23] are conducted to show the performance of the proposed CMDL. The test platform is an HP Z840 workstation with an Intel E5-2600 3.2 GHz CPU, 128 GB of memory and two GTX 1080 GPUs. All training processes are performed on the two GPUs and all testing processes on one GPU. All experiments are implemented in Python 2.7 on the Keras framework. The training log can be downloaded at this link, and all code and data will be made available.
In our experiments, all models are trained in an end-to-end manner with the Adam optimization algorithm. The batch size is set to 384, the learning rate to 0.01, and the learning rate decay to 0. The exponential decay rates of Adam follow those provided in the original paper [34]. The number of training epochs is set to 300, and T is set to 10.
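As a reference, these hyperparameters map onto Keras as sketched below; `model`, `x_train`, `y_train`, `x_test` and `y_test` are placeholders for one of the networks above and the prepared data:

```python
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

# Adam with lr = 0.01 and zero decay; beta_1/beta_2 keep the defaults of [34]
opt = Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(
                  from_logits=True),   # the sketches above output logits
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=384, epochs=300,
          validation_data=(x_test, y_test))
```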

A. DATASETS
The RML2016.10a data set, generated with GNU Radio, is adopted to evaluate the modulation recognition task. It consists of 220,000 signals at SNRs of −20∼18 dB with 11 classes of modulations (8 digital and 3 analog types: 8PSK, AM-DSB, AM-SSB, BPSK, CPFSK, GFSK, PAM4, QAM16, QAM64, QPSK, and WBFM). The task is to identify the modulation scheme of a signal, represented by a 128-sample complex (baseband I/Q) time-domain vector, out of the 11 possible classes. Each sample is fed into the models as a 2 × 128 vector.
In order to better evaluate the proposed CMDL, a larger version of the RML2016.10a data set, the RML2016.10b data set, is also employed for AMC in our experiments. The RML2016.10b data set consists of 1,200,000 signals at SNRs of −20∼18 dB with 10 classes of modulations (7 digital and 3 analog types: 8PSK, AM-DSB, BPSK, CPFSK, GFSK, PAM4, QAM16, QAM64, QPSK, and WBFM).
For ease of comparison, all data sets are downloaded from https://www.deepsig.io/datasets. We utilize the official code from https://github.com/radioML/examples to split the data sets, where 90% of the data in each set is taken as the training subset and 10% as the testing subset.
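The split can be sketched as follows; the pickle file name and key layout follow the public RML2016.10a release, while the fixed random seed is our assumption:

```python
import pickle
import numpy as np

# RML2016.10a is distributed as a pickled dict keyed by (modulation, SNR)
with open('RML2016.10a_dict.pkl', 'rb') as f:
    data = pickle.load(f, encoding='latin1')  # encoding needed under Python 3

keys = sorted(data.keys())
X = np.vstack([data[k] for k in keys])                     # (220000, 2, 128)
y = np.concatenate([[k[0]] * len(data[k]) for k in keys])  # modulation labels

rng = np.random.default_rng(2016)          # fixed seed (our assumption)
idx = rng.permutation(len(X))
n_train = int(0.9 * len(X))                # 90% / 10% split
train_idx, test_idx = idx[:n_train], idx[n_train:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```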

B. THE TRAINING PROCESS
In this subsection, the models are first trained. Then, we analyze the SM's performance variation with the temperature value. Next, we analyze the performance of CMDL.

1) TRAINING OF MODELS
The proposed TM is first trained on the RML2016.10a training data subset with the standard cross-entropy loss. After each training epoch, the RML2016.10a testing data subset is employed to test the model, and the model with the best accuracy on the testing data subset is saved as the trained model for prediction. As shown in Table 1, the TM achieves 62.98% accuracy on the testing data. The proposed SM is then trained on the RML2016.10a training data subset with the standard cross-entropy loss, reaching an accuracy of 55.20% on the testing data. Next, the SM is trained on the RML2016.10a training data subset with the proposed teaching loss, and the trained SM obtains a performance (62.41%) similar to that (62.98%) of the trained TM. Finally, in order to compress the model further, all parameters of the SM trained with the teaching loss are encoded in float16. The quantized SM obtains the same accuracy as the model without parameter quantization. The training loss variations of the TM and the SMs are illustrated in Fig. 3.
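A minimal sketch of this float16 encoding: parameters are rounded to half precision (and cast back for computation), emulating float16 storage; the helper below is ours, not the paper's code:

```python
import numpy as np

def quantize_float16(model):
    """Round every trainable parameter to float16 precision,
    emulating float16 storage of the trained SM."""
    for layer in model.layers:
        weights = layer.get_weights()
        if weights:
            layer.set_weights([w.astype(np.float16).astype(np.float32)
                               for w in weights])
    return model
```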
In order to show the SM's performance variation with the temperature value, the SM is trained on the RML2016.10a training data subset with T set to 1, 3, and 10, respectively. As illustrated in Table 1, the prediction accuracy on the RML2016.10a testing data subset increases with T; when T is 10, the SM achieves the best performance. Hence, we use T = 10 for the remaining experiments.

FIGURE 5. The visualization results of features from the last layers in the SM and the TM. * denotes a model trained with the standard cross-entropy loss and # denotes a model trained with the teaching loss. Compared with the feature difference between the SM* and the TM*, the feature difference between the SM# and the TM* is small, and their features change approximately in the same trend.

2) PERFORMANCE ANALYSIS OF CMDL
In fact, we expect the performance of the trained TM to be better than that of the SM trained with the standard cross-entropy loss. As shown in Table 1, the trained TM exhibits a significant performance improvement over the SM trained with the standard cross-entropy loss. Nevertheless, it is worth noting that the training time of the SM is only about a quarter of the TM's, and the testing time of the SM is only about one-sixth of the TM's, as the LSTM-RNN operation is very time-consuming [30]. Moreover, as illustrated in Table 1, the teaching loss brings the SM a performance improvement of 7.21%, which shows that the proposed CMDL is effective for boosting the performance of a lightweight model. One important reason is that the teaching loss employs knowledge learned by the trained TM to guide the training of the SM. However, an unexpected observation is that the quantized SM takes the same inference time as the SM without parameter quantization, which may be due to implementation issues; in theory, the quantized SM can cut the computation time in half.
We utilize the trained models to predict all signals in the RML2016.10a testing data subset and plot the per-class accuracies of the proposed models. As shown in Fig. 4, the TM trained with the standard cross-entropy loss achieves better performance than the SM trained with the standard cross-entropy loss for eight modulation types (8PSK, AM-DSB, AM-SSB, BPSK, CPFSK, GFSK, PAM4 and QAM64), while the SM trained with the standard cross-entropy loss performs better for the other three (QAM16, QPSK and WBFM); this shows the distinct characteristics of the two models in predicting different modulation types. In addition, the SM trained with the teaching loss also obtains better performance than the SM trained with the standard cross-entropy loss for the same eight modulation types, and its per-class behavior changes to resemble that of the trained TM. This proves that the knowledge obtained by the TM is learned effectively by the SM through the proposed teaching loss.
In addition, when we use the trained models to test randomly selected signals from the RML2016.10a testing data subset, the features from the last layers of the trained SMs and the trained TM are visualized in Fig. 5. The feature difference between the SM trained with the teaching loss and the trained TM is small, and their features change approximately in the same trend, which further proves that the knowledge obtained by the TM is effectively learned by the SM through the teaching loss and that the features extracted by the HDNN are beneficial to the performance improvement of the LRN for AMC.
C. COMPARISON WITH EXISTING MODELS
The experimental results are shown in Fig. 6. It can be noticed that the trained TM and the SM trained with the teaching loss achieve much higher accuracy than Resnet, Inception, Densenet, FCNN, and C2LDNN on testing signals at SNRs of −5∼18 dB, and similar performance to the other methods at the remaining SNRs.
In terms of average accuracy, the trained TM shows the best performance, and the SM trained with the teaching loss shows either superior or equal performance compared with its counterparts.

D. THE GENERALITY OF THE PROPOSED CMDL
In this subsection, a Novel TM (NTM) and a Novel SM (NSM) are designed to show the generality of the proposed CMDL scheme on the RML2016.10b data set. The structures of the NTM and the NSM are shown in Fig. 7.
To further evaluate the performance of the models, the NTM and the NSM are first trained with the standard cross-entropy loss. Then, the NSM is trained with the teaching loss. Finally, the trained NSM is quantized. The experimental results are shown in Table 2. Compared with the NSM trained with the standard cross-entropy loss, the NSM trained with the teaching loss obtains a performance boost of 5.8%, which illustrates the feasibility and the generality of the proposed CMDL on this model. The trained NSM, with reasonable performance degradation from the trained NTM, also saves considerable time in both training and testing. In addition, the quantized NSM performs slightly better than the original NSM trained with the teaching loss, probably because parameter quantization improves the generalization of the model.

IV. CONCLUSION AND FUTURE WORK
In this paper, in order to better train a lightweight model for AMC, we propose a novel scheme, CMDL. Firstly, we construct a large HDNN for AMC, and this model achieves state-of-the-art performance compared with its counterparts. Then a lightweight model, the LRN, is built. Next, a KD method is proposed that formulates a teaching loss from the prediction of the HDNN to train the LRN. The trained LRN, which consumes only 472.3 KB of storage, achieves a great performance improvement over the LRN trained with the standard cross-entropy loss, and delivers either superior or equal performance compared with its counterparts, which proves that the proposed CMDL scheme is beneficial for training a lightweight model. To compress the lightweight LRN further, parameter quantization is employed in our experiments, reducing the model size to 301.2 KB with either higher or equal accuracy.