IRLNet: A Short-Time and Robust Architecture for Automatic Modulation Recognition

Automatic modulation recognition with deep learning (DL) is challenging in distinguishing high-order modulation modes and in balancing complexity against recognition accuracy. In this paper, we propose a novel dual-path modulation recognition framework named IRLNet, which consists of improved residual stacks (IRS) and long short-term memory (LSTM). The IRS retains more of the initial residual information, learns signal features at both deep and shallow levels, and achieves various degrees of feature extraction. The model learns from the time-domain I/Q, amplitude, and phase information presented in the training data. Simulation results on RadioML2016.10B show that IRLNet trains stably and has low spatial-temporal complexity. It also achieves a recognition accuracy of more than 93% at high signal-to-noise ratios (SNRs). Transfer learning is introduced to improve the efficiency of retraining, and robustness is demonstrated by transferring the model to RadioML 2018.01A and HisarMod 2019.1. The simulation results show that introducing transfer learning shortens the training time on RadioML 2018.01A by about 25.6%. Moreover, IRLNet mitigates the confusion among high-order modulations and achieves a recognition accuracy of more than 90% at high SNRs.


I. INTRODUCTION
With the remarkable development of wireless communication, signal modulation modes are becoming more complicated and diversified, and the radio environment is becoming increasingly harsh [1]. As a critical step between signal detection and demodulation, automatic modulation recognition (AMR) can also detect physical threats such as pilot jamming, deceptive jamming, and Sybil attacks [2]. In the civilian field, it is primarily used in spectrum management for signal authentication and interference identification. In the military field, it is one of the important means in the struggle for control of electromagnetic power. Besides, it is a prerequisite for intercepting enemy signals, destroying and suppressing enemy communication, and implementing electromagnetic interference.
Traditional AMR methods are generally divided into two categories: likelihood-based (LB) methods and feature-based
(FB) methods [3]. Considering AMR as a multiple composite hypothesis testing problem, LB methods rely on the correct modeling of unknown quantities. The number of hypotheses equals that of the classified modulations [4]. The classification result is given by comparing the likelihood ratio of each possible hypothesis against a threshold. LB methods are optimal in the Bayesian sense, as they maximize the probability of correct classification given full knowledge of the channel conditions [5]. However, LB methods depend heavily on prior knowledge and parameter estimation, not to mention their high computational complexity. Moreover, LB methods are not robust to model mismatch such as phase and frequency offsets, residual channel effects, timing errors, and non-Gaussian noise distributions [6]. Instead of using the probability distribution function (PDF), FB classifiers calculate certain statistical features of the signal samples [7]. FB methods generally consist of three steps: data preprocessing, feature extraction, and classification. In the feature extraction process, various statistical features such as high-order cumulants [8], the wavelet transform [9], and cyclostationary features [10] are widely used. Generally, FB methods are sub-optimal in the Bayesian sense, but they are popular due to their easy implementation [11]. However, they require manual extraction of expert features from plenty of samples, which leads to high computational complexity [12]. Furthermore, FB methods are generally designed for a specific modulation mode in a specific environment, making them difficult to extend. Due to the poor versatility and high complexity of traditional techniques, there is an emerging demand for the quick discrimination of modulation modes.
Recently, machine learning techniques such as deep learning (DL) and deep reinforcement learning (DRL) have achieved great success in the fields of computer vision [13], natural language processing [14], and network resource allocation [15], [16]. To a great degree, they have facilitated research on signal modulation classification. O'Shea et al. [17] first introduced convolutional neural networks (CNN) for classifying modulation types and proposed a 4-layer CNN2 model. Ramjee et al. [18] repeatedly adjusted the depth and filter settings of the CNN2 in [17] and proposed a 6-layer CNN4 model. The results showed that neural networks outperform traditional methods not only in accuracy but also in the flexibility of identifying various modulation types. However, as the network deepens, a CNN suffers from vanishing or exploding gradients, which causes a decline in recognition accuracy and leads to confusion between higher-order modulations. Subsequently, other neural networks such as convolutional long short-term deep neural networks (CLDNN), the inception architecture, and the residual network (ResNet) were applied to modulation recognition. By combining the advantages of CNN and long short-term memory (LSTM) modules, West and O'Shea [19] proposed the CLDNN. Besides, an inception architecture was also constructed by connecting three different CNN networks in parallel. In [20], O'Shea et al. proposed the ResNet by adding bypass connections that create identity mappings. Although recognition accuracy has been improved to a certain extent, the confusion between high-order modulation modes remains unresolved for some models.
With the development of fifth-generation (5G) wireless communication technology, the number of communication devices has increased sharply, and spectrum resources have become increasingly scarce. As a practical solution to improve spectrum efficiency, high-order modulations have been widely used in wireless communication systems. It is known from open protocols and standards [21] that high-order modulations such as 64QAM and 256QAM have been applied to 5G mobile communications. With the prospect of dynamic spectrum access in 5G mobile communication, high-order modulation types possess even more practical significance. To solve the issue of confusion between high-order modulations, various studies have been conducted. Data preprocessing is a common method to improve recognition accuracy. Rajendran et al. [22] preprocessed the I/Q signal into amplitude and phase information before using an LSTM for feature extraction, which improved the recognition accuracy to a certain extent but increased the execution delay of the classifier. Wang et al. [23] merged two CNNs with different structures to improve the overall average classification accuracy. The second CNN used constellation diagrams as input, which significantly improved the recognition accuracy of high-order modulations (such as 16QAM and 64QAM) while causing long execution times. Teng et al. [24] proposed an AMR method based on accumulated polar features, supplemented by a channel compensation mechanism. The model used constellation diagrams with polar feature transformation and temporal accumulation as input. It greatly enhanced recognition performance and robustness in time-varying fading channels, but at the same time increased the time cost.
The existing modulation recognition methods may perform poorly, especially in identifying high-order modulations, and may suffer from high complexity. To address these issues, we propose a new modulation recognition model that hybridizes a CNN and a recurrent neural network (RNN) to extract the spatial and temporal characteristics of signals. Besides, to enrich the feature dimension, we carry out feature conversion on the input of the LSTM by transforming the baseband I/Q signal into amplitude and phase information. The main contributions of this paper are summarized as follows:
1) An improved residual stack (IRS) is proposed. The IRS retains more of the initial residual information, learns signal features at both deep and shallow levels, and achieves various degrees of feature extraction.
2) An AMR architecture named IRLNet is designed based on the IRS and LSTM, which achieves high recognition accuracy with low complexity. Even high-order modulations that are easily confused can be properly recognized.
3) The robustness of IRLNet is demonstrated by testing its recognition performance for diverse modulation modes and different channel environments.
The remainder of the paper is organized as follows. Section II states the problem of modulation recognition. Section III describes the IRLNet used for AMR. Section IV introduces the experimental setting. Section V presents the classification results in detail and discusses the advantages of the proposed model. Finally, we conclude the work in Section VI.

II. PROBLEM STATEMENT
Modulation is an important part of wireless communication, which can expand the signal bandwidth and improve the anti-fading and anti-interference capabilities. The modulation process transforms baseband signals into high-frequency signals, allowing accurate and low-noise data transmission between distant transmitters and receivers. The modulation recognition module is generally deployed at the receiving side of the communication system to provide modulation information for the demodulator, as shown in Fig. 1. In a wireless communication system, the received signal is generally expressed as [22]

r(t) = s(t) * c(t) + n(t)    (1)

where s(t) is the modulated signal transmitted by the transmitter, c(t) is the impulse response of the wireless channel, n(t) is additive white Gaussian noise (AWGN) with zero mean and variance σ_n^2, and r(t) is the received signal, which is commonly expressed in I/Q format and sampled n times by the analog-to-digital converter at a rate f_s = 1/T_s. The real and imaginary parts of r(t) represent the I and Q components, respectively. Specifically, r(t) can be modeled as [25]

r(t) = α(t) e^{j(2π f_0 t + θ_0(t))} s(t) + n(t)    (2)

where α(t) denotes the Rayleigh fading channel gain, and f_0 and θ_0(t) represent the frequency and phase offsets introduced by the disparate local oscillators and the Doppler effect, respectively. The transmitted signal has different mathematical expressions under different modulation modes. The modulation modes presented in [1] are considered to investigate the differences. If the signal is modulated by amplitude-shift keying (ASK), frequency-shift keying (FSK), or phase-shift keying (PSK), s(t) is expressed as

s(t) = A_m Σ_n a_n g(t − nT_s) cos(2π(f_c + f_m)t + φ_0 + φ_m)

where A_m is the modulated amplitude, a_n is the symbol sequence, g(·) is the signal pulse, and T_s is the symbol period. f_m is the modulated frequency, and f_c is the carrier frequency. φ_0 represents the initial phase, and φ_m represents the modulation phase.
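As a concrete illustration of Eq. (2), the following sketch passes one QPSK frame through a flat channel with frequency and phase offsets and AWGN. The frame length, offset values, SNR, and the use of a constant θ_0 are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128                                   # samples per frame (illustrative)
# Unit-power QPSK symbol sequence standing in for s(t)
symbols = (rng.choice([-1.0, 1.0], n) + 1j * rng.choice([-1.0, 1.0], n)) / np.sqrt(2)
t = np.arange(n)                          # normalized sample index
f0, theta0 = 0.01, 0.3                    # frequency/phase offsets (assumed values)
alpha = 1.0                               # flat channel gain; a Rayleigh draw in general
snr_db = 10
noise_var = 10 ** (-snr_db / 10)          # signal power is 1, so sigma_n^2 = 10^(-SNR/10)
noise = np.sqrt(noise_var / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
r = alpha * np.exp(1j * (2 * np.pi * f0 * t + theta0)) * symbols + noise  # Eq. (2)
iq = np.stack([r.real, r.imag])           # 2 x n I/Q array, the input format fed to networks
```

The resulting 2 x n real array matches the I/Q representation the datasets in Section IV provide.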
However, quadrature amplitude modulation (QAM) differs from the aforementioned modulation modes. Specifically, it has two orthogonal carriers modulated by a_n and b_n, and s(t) is expressed as

s(t) = Σ_n [a_n cos(2π f_c t) − b_n sin(2π f_c t)] g(t − nT_s)

Modulation recognition is generally considered as an N-class classification problem, where N denotes the number of modulation modes for the transmitted signals. The aim of the modulation recognition task is to classify these modulation categories blindly from the n-sample received symbol vector Y = [y_0, y_1, . . . , y_{n−1}]^T. As the most common evaluation metric for classification problems, accuracy is used to measure model performance. Denote S = [s_0, s_1, . . . , s_{N−1}] as the pool of N known candidate modulations. The actual adopted modulation and the predicted modulation for the i-th sample are labeled as s_i and ŝ_i, respectively, where i = 1, 2, . . . , K and K is the number of samples in the testing set. The accuracy indicator for the i-th recognition is defined as

δ_i = 1 if ŝ_i = s_i, and δ_i = 0 otherwise

The overall recognition accuracy for K testing samples is then measured as

Acc = (1/K) Σ_{i=1}^{K} δ_i
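The accuracy metric above amounts to averaging per-sample indicator variables; a minimal sketch (the function name is ours):

```python
import numpy as np

def recognition_accuracy(true_labels, pred_labels):
    """Overall accuracy over K testing samples: the mean of the per-sample
    indicators (1 when the predicted modulation matches the actual one,
    0 otherwise)."""
    delta = np.asarray(true_labels) == np.asarray(pred_labels)
    return delta.mean()
```

For example, three correct predictions out of four samples give an accuracy of 0.75.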

III. MODEL DESCRIPTION
A. SIGNAL PREPROCESSING STAGE
Since the features extracted from the time-domain I/Q signals are limited, we preprocess the input data of the LSTM by converting the received signals from Cartesian coordinates to polar coordinates to extract the features better and more comprehensively. Learning features from the r-θ domain encodes specific communication-system information and makes the following network more resilient to fading channels. After the conversion, the I/Q signals become the corresponding amplitude and phase values. Fig. 2 shows the constellation diagrams of five common modulation modes at SNR = 10 dB under an AWGN channel, before and after the coordinate transformation. For rectangular and polar coordinates, the horizontal and vertical axes represent I/Q and radius/theta, respectively. As depicted for 16QAM and 64QAM in Fig. 2, the pattern in the I-Q plane is more regular, while that in the r-θ plane is more diverse. The 16QAM constellation can be seen as a sub-picture of the 64QAM constellation in the I-Q plane, which is referred to as nested modulations. The shared constellation points in I/Q-based images cause many misclassifications. Leveraging existing expert knowledge in communications, we transform the I-Q domain data into the r-θ domain before sending it to the LSTM in our simulations. To construct the relation between the I and Q components, we use polar coordinates, which substitute the r-θ axes for the I-Q axes. Algorithm 1 summarizes the transformation procedure, where I and Q denote the real and imaginary parts of the received complex symbols, r and θ represent the transformed polar coordinates of radius and theta, and n is the symbol length.

Algorithm 1 Polar Feature Transformation
After the conversion, the constellation diagrams of 16QAM and 64QAM show a significant difference.
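A minimal NumPy sketch of the transformation summarized in Algorithm 1 (the function name is ours):

```python
import numpy as np

def polar_transform(i_samples, q_samples):
    """Map Cartesian I/Q samples to polar (radius, theta) features."""
    i = np.asarray(i_samples, dtype=float)
    q = np.asarray(q_samples, dtype=float)
    r = np.sqrt(i ** 2 + q ** 2)      # amplitude
    theta = np.arctan2(q, i)          # phase in (-pi, pi]
    return r, theta
```

`arctan2` is used rather than `arctan(q/i)` so that the phase lands in the correct quadrant and i = 0 is handled safely.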

B. CLASSIFICATION STAGE
1) IRS
As a variant of CNN, ResNet improves the accuracy of feature extraction by adding skip connections between the inputs and outputs of different convolutional layers, as well as by performing feature operations at multiple scales and depths. The typical basic unit of ResNet, called the residual stack (RS), is depicted in Fig. 3(a). Each RS is made up of one convolutional layer, two residual units, and a max-pooling layer.
To retain more of the initial residual information, learn signal features at both deep and shallow levels, achieve various degrees of feature extraction, and keep the network from becoming overly complex, we design the IRS, as shown in Fig. 3(b). The IRS adds an extra skip connection between the input and output of the two residual units. As in the RS, each residual unit in the IRS is made up of two convolutional layers and a ξ-times skip connection. A λ-times skip connection is also added between the input of the former residual unit and the output of the latter residual unit.
The inputs of RS and IRS are both x, and the outputs are y and y', respectively. Denoting the output of the first convolutional layer by u_0 and the outputs of the two residual units by u_1 and u_2, we have [27], [28]

u_0 = f(x; W_1, b_1)
u_1 = g(u_0; W_2, b_2) + ξ u_0
u_2 = h(u_1; W_3, b_3) + ξ u_1
y = maxp(u_2)
y' = maxp(u_2 + λ u_0)

where W_i and b_i are the weights and biases of the convolutional layers, f(·), g(·), and h(·) denote the functions to be learned, ξ and λ are the multiplier coefficients of the skip connections, and maxp(·) represents the max-pooling operation.
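The skip-connection algebra of the two stacks can be sketched with placeholder callables standing in for the convolutional mappings f, g, and h; this illustrates only the data flow, not actual convolutions, and the function names are ours:

```python
def residual_stack(x, f, g, h, pool, xi=1.0):
    """Plain RS: first conv f, two residual units, then pooling."""
    u0 = f(x)
    u1 = g(u0) + xi * u0              # first residual unit
    u2 = h(u1) + xi * u1              # second residual unit
    return pool(u2)

def improved_residual_stack(x, f, g, h, pool, xi=1.0, lam=1.0):
    """IRS: identical, plus a lambda-times skip from the first
    residual unit's input to the second unit's output."""
    u0 = f(x)
    u1 = g(u0) + xi * u0
    u2 = h(u1) + xi * u1 + lam * u0   # extra skip retains the early feature u0
    return pool(u2)
```

With identity placeholders for f, g, h, and the pooling, the IRS output differs from the RS output by exactly the retained λ·u_0 term.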

2) LSTM
RNN is a branch of deep learning that has significant advantages in processing time-series data. Different from feedforward neural networks (e.g., CNN), an RNN pays more attention to feedback and uses its internal state to process sequence data. However, due to vanishing or exploding gradients, the length of the sequences an RNN can process is limited. As a variant of RNN, LSTM can largely eliminate the problem of gradient vanishing or explosion thanks to its unique cell structure, shown in Fig. 4. Each LSTM cell contains three types of gates: the input gate (i), the forget gate (f), and the output gate (o) [29]. In the forward computation, the forget gate (f_t) decides whether or not to keep the cell state memory (c_t) at time t. Given the input x_t at time t and the output h_{t−1} at time t−1, the forget gate is constructed according to equation (13):

f_t = σ(W_f [h_{t−1}, x_t] + b_f)    (13)
where σ represents the sigmoid activation function, and W_f and b_f denote the associated weight and bias between the input (x) and the forget gate (f). The input gate (i_t) specifies the cell state (C_t) to update after f_t decides which memories to forget, as shown in equations (14) and (15):

i_t = σ(W_i [h_{t−1}, x_t] + b_i)    (14)
C̃_t = tanh(W_C [h_{t−1}, x_t] + b_C)    (15)
where tanh stands for the tanh activation function, W_i and b_i denote the associated weight and bias between the input (x) and the input gate (i), and W_C and b_C denote the associated weight and bias between the input (x) and the cell memory (C). Using the forget gate (f_t) and the input gate (i_t), the old cell state (C_{t−1}) is changed to the new cell state (C_t) as per equation (16):

C_t = f_t · C_{t−1} + i_t · C̃_t    (16)
where · is the Hadamard product. Finally, based on the cell state (C_t), the output (h_t) is produced through the output gate (o_t) according to equations (17) and (18):

o_t = σ(W_o [h_{t−1}, x_t] + b_o)    (17)
h_t = o_t · tanh(C_t)    (18)
where W o and b o denote the associated weight and bias between the input (x) and the output gate (o).
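Equations (13)-(18) can be traced in a single NumPy step; stacking the four gate blocks into one weight matrix is an implementation convenience here, not the paper's notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4H, H + D) and stacks the forget,
    input, candidate, and output blocks; b has shape (4H,)."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.size
    f_t = sigmoid(z[0:H])                  # forget gate, Eq. (13)
    i_t = sigmoid(z[H:2 * H])              # input gate, Eq. (14)
    c_tilde = np.tanh(z[2 * H:3 * H])      # candidate memory, Eq. (15)
    c_t = f_t * c_prev + i_t * c_tilde     # state update, Eq. (16)
    o_t = sigmoid(z[3 * H:4 * H])          # output gate, Eq. (17)
    h_t = o_t * np.tanh(c_t)               # hidden output, Eq. (18)
    return h_t, c_t
```

Because o_t lies in (0, 1) and tanh(C_t) in (−1, 1), each hidden output component is strictly bounded in magnitude by 1.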

3) IRLNet
Considering the strong representational ability of the IRS and its role in alleviating the vanishing gradient problem, as well as the advantages of the LSTM in processing temporal data, we propose IRLNet to make the network more lightweight while ensuring high recognition accuracy. The architecture of IRLNet is shown in Fig. 5. To simplify the figure, we omit the batch normalization (BN) and rectified linear unit (ReLU) operations in the IRS. BN follows each convolutional layer in the IRS. Except for the first convolutional layer, which uses a linear activation, the remaining convolutional layers all use the ReLU activation. The kernel size of every convolutional layer is set to 5, and the number of kernels is set to 32 as required. The BN operation is defined as [30]:

x̂ = (x − µ_B) / √(σ_B² + ε),  y = σ_l x̂ + µ_l

where µ_B is the mean and σ_B² is the variance of the mini-batch samples, µ_l and σ_l are the offset and scale factors that are updated continuously during training, and ε is a small constant for numerical stability. The ReLU function is expressed as [24], [31]:

ReLU(x) = max(0, x)

One input layer, two IRSs, one LSTM, one fully connected (FC) layer, and one softmax layer make up the network. As a result, there are ten convolutional layers, two max-pooling layers, one LSTM, and three FC layers (the softmax layer is also an FC layer). The structure with two IRSs and one LSTM is selected after a detailed analysis of the influence of the depths on the recognition accuracy in Section V-A. I/Q samples are sent to the IRSs and amplitude-phase samples are sent to the LSTM, respectively. Studies have shown that LSTMs are unable to extract meaningful representations from I/Q samples but perform well on amplitude-phase samples, while amplitude-phase samples supplied to a CNN model do not show better performance than I/Q samples [22]. A concatenate layer then merges the parallel branches. The data further passes through an FC layer with 256 units and ReLU activation, which is followed by a dropout layer with a drop probability of 0.6.
The final layer is also an FC layer with softmax activation, where the number of neurons is consistent with the number of considered modulations. The BN layers are used to reduce training time and prevent gradients from exploding or vanishing [30]. To alleviate overfitting and compel the nodes to be more independent than usual, the ''dropout'' regularization strategy is used, which prevents the weights of some nodes from being updated [24]. The detailed structure of IRLNet is shown in Table 1.
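The BN and ReLU operations used throughout the IRS can be sketched directly in NumPy (ε is the usual small stability constant; `gamma`/`beta` play the roles of σ_l and µ_l):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature by the mini-batch mean and variance,
    then apply the learned scale (sigma_l) and offset (mu_l) factors."""
    mu_b = x.mean(axis=0)                      # mini-batch mean
    var_b = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)  # normalized activations
    return gamma * x_hat + beta

def relu(x):
    """ReLU(x) = max(0, x)."""
    return np.maximum(0.0, x)
```

With gamma = 1 and beta = 0, the normalized output has (approximately) zero mean per feature, which is what stabilizes training.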

IV. EXPERIMENTAL BACKGROUND
A. DATASETS
We first use the well-known DeepSig dataset RadioML2016.10B to train and test the performance of IRLNet and to compare it with state-of-the-art (SoA) models. More information about how this dataset was generated can be found in [32]. Then, we consider transfer learning [33] on two more challenging datasets, RadioML2018.01A [20] and HisarMod2019.1 [34]. All considered modulations in the datasets are split equally. The label of each input sample includes the SNR and the modulation mode. The modulations used and the parameter lists of the three datasets can be found in Table 2, Table 3, and Table 4. In the training phase, 70% of the data is drawn at random, with 25% of that portion used for validation. Apart from the training and validation data, the remaining 30% is reserved for testing. The details of the division of the three datasets are shown in Table 5.

B. SOA MODELS
For comparison, we utilize the following representative SoA DL AMR models:
• CNN2: a two-dimensional (2D) CNN model with two 2D convolutional layers and two FC layers, as described in [17];
• CNN4: a 2D CNN model with four 2D convolutional layers and two FC layers, as described in [18];
• CLDNN: a CLDNN model with three 2D convolutional layers, one LSTM layer, and two FC layers, as given in [19];
• Inception: an inception model with three parallel 2D convolutional branches (five convolutional layers in total), concatenated and followed by two FC layers, as given in [19];
• ResNet: a ResNet model with six RSs and three FC layers, as described in [20];
• LSTM: an LSTM model with two LSTM layers and two FC layers, as given in [22].

C. THE TRAINING AND INFERENCE ALGORITHMS
The training aims to optimize the parameters of the network based on the training dataset to achieve good performance on the training set while also attempting to generalize to other data. Building the training set is the first step in network training. The training set for the IRLNet is D = [X_train, Y_train], where X_train = [x^(1), x^(2), . . . , x^(m)], Y_train = [y^(1), y^(2), . . . , y^(m)], and the number of samples in the training set is given by m. The training set also includes N modulation modes. The inference part of Algorithm 2 proceeds as follows:

for i = 1, 2, . . . , n do
    P_max = 0;
    for j = 1, 2, . . . , N do
        Compute the softmax output of each class P_j;
        if P_max < P_j then
            P_max = P_j;
        end if
    end for
    ŷ^(i) = P_max;
end for
Output: Ŷ_test = [ŷ^(1), ŷ^(2), . . . , ŷ^(n)].
The loss function is the key to training, and cross-entropy is the most commonly used loss function for classification tasks. For a mini-batch consisting of N_B samples, we define the loss function as [24]:

L = −(1/N_B) Σ_{i=1}^{N_B} Σ_{j=1}^{N} y_j^(i) log ŷ_j^(i)

where y represents the true label and ŷ represents the predicted label. Adaptive moment estimation (Adam) [35] has been proved to have a faster convergence rate than the commonly used stochastic gradient descent (SGD) method [36] in both theory and practice, and it is more suitable for large datasets.
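The mini-batch cross-entropy loss above can be sketched as follows (the clipping is our addition for numerical safety):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over a mini-batch of N_B samples:
    y_true is one-hot (N_B x N); y_pred holds softmax probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
```

A perfectly confident correct prediction yields zero loss, while a uniform two-class guess yields ln 2.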
As an example, in iteration t, the updating process for a weight w^l to be optimized is as follows [37]:
(1) Calculate the gradient of the loss function with regard to w^l:

g_t^l = ∂L/∂w^l

(2) Update the biased first moment estimate m_t^l and the biased second raw moment estimate V_t^l:

m_t^l = β_1 m_{t−1}^l + (1 − β_1) g_t^l
V_t^l = β_2 V_{t−1}^l + (1 − β_2) (g_t^l)²

(3) Correct the biases of the two moment estimates and update the weight:

m̂_t^l = m_t^l / (1 − β_1^t),  V̂_t^l = V_t^l / (1 − β_2^t)
w^l ← w^l − lr · m̂_t^l / (√V̂_t^l + ε)
where l indexes the layer containing the weight w^l, lr = 0.001 is the learning rate, and β_1 = 0.9 and β_2 = 0.999 are the exponential decay rates. According to the above analysis, the training and inference algorithms for IRLNet can be expressed as Algorithm 2.
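One Adam update for a single weight tensor can be sketched with the paper's hyperparameters as defaults (ε and the bias correction follow the standard algorithm [35]):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update weight w at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

On the first iteration with a unit gradient, the bias correction makes the effective step size exactly the learning rate (up to ε).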

D. TRANSFER LEARNING
One DL model is generally trained for a specific task, such as identifying several candidate modulation modes in a certain environment. Once the scenario or modulation set changes, the model needs to be retrained with new training data. Since full model training requires considerable time and effort, transfer learning is adopted in this paper to reduce the computational overhead. The concept of transfer learning is motivated by the fact that people can tackle new issues faster or obtain better solutions by applying previously learned knowledge [33]. For the modulation recognition problem, we perform transfer learning for the following two cases.
1) Diverse modulation modes: Depending on the requirements of different classification tasks, some modulation modes need to be added or removed. In this case, in addition to changing the training data, the model only needs to adjust the output layer of the network according to the number of modulation modes. That is to say, all the network parameter weights except those in the top layer can be loaded as the initialization instead of being set randomly.
2) Different channel environments: Real-world channels vary over time, and one-time training covering all kinds of channel environments seems unrealistic. For a new channel environment, the model needs to be retrained from scratch, but the model structure can be completely retained or only fine-tuned.
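Case 1) amounts to copying every compatible weight tensor and re-initializing only the output layer. A sketch with a plain dict of NumPy arrays; the layer names and the dict format are illustrative, not IRLNet's actual variables:

```python
import numpy as np

def transfer_weights(source, target_shapes, skip=("fc_out",)):
    """Build an initialization for the target model: reuse source tensors
    whose layer is not skipped and whose shape matches; randomly
    re-initialize the rest (e.g. the softmax layer sized to the new
    number of modulation modes)."""
    rng = np.random.default_rng(0)
    init = {}
    for name, shape in target_shapes.items():
        if name not in skip and name in source and source[name].shape == tuple(shape):
            init[name] = source[name].copy()                # transferred weights
        else:
            init[name] = 0.01 * rng.standard_normal(shape)  # trained from scratch
    return init
```

For the RadioML2018.01A transfer described in Section V-B, only the final FC layer would fall into the re-initialized branch, with 24 output neurons.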

E. IMPLEMENTATION DETAILS
The experiment consists of two components. The first component optimizes the performance of IRLNet on RadioML2016.10B and compares IRLNet with the SoA models of Section IV-B. In the second component, the model is transferred to RadioML2018.01A and HisarMod2019.1 to verify its robustness to new modulation modes and transformed channels. For all of our experiments, we use Keras with TensorFlow as the backend. We have two computers; one, equipped with an AMD Ryzen 7 3800X CPU, 16 GB RAM, and an NVIDIA GeForce RTX 2070 SUPER, is used for training. The training stage is set at 200 epochs, and the gradient update batch size is set at 1024. An early stopping strategy is also adopted: if the validation loss does not decrease within five epochs, we multiply the learning rate by a factor of 0.01 to improve training performance; if the validation loss still does not decrease in the next ten epochs after this adjustment, the training stops. In the testing stage, we split the testing set by SNR. The model calculates the probability that each sample belongs to each modulation mode, and the highest-probability category is taken as the predicted label of the corresponding sample. The recognition accuracy is calculated by comparing the real and predicted labels.

V. RESULTS AND ANALYSIS
A. PERFORMANCE ON RADIOML2016.10B
1) NETWORK DEPTH
The ability of large neural network models to accurately capture complex features is influenced by the model size. Too shallow a model results in poor performance, while too deep a model brings high complexity without a significant performance benefit. To achieve the best performance with the lowest complexity, we measure the classifier performance while varying the numbers of IRSs and LSTMs. Based on the recognition results plotted in Fig. 6, classification accuracy increases as we introduce more IRSs and LSTMs into the network architecture (i.e., make the network deeper).
Surprisingly, when the number of LSTMs increases to four, the performance of the model suddenly declines, which may be overfitting caused by the continuous deepening of the network. Adding more IRSs or LSTMs hardly improves (and may even decrease) the recognition accuracy; that is to say, the model containing two IRSs and one LSTM is the best among the explored combinations for RadioML2016.10B.
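The early-stopping and learning-rate schedule described in Section IV-E (cut the learning rate by a factor of 0.01 after five non-improving epochs, stop after ten more) can be sketched as a small controller; the class name and interface are ours:

```python
class EarlyStopper:
    """Track validation loss; step() returns True when training should stop."""
    def __init__(self, lr=1e-3, patience_lr=5, patience_stop=10, factor=0.01):
        self.best = float("inf")
        self.wait = 0                 # epochs since the last improvement
        self.lr = lr
        self.reduced = False          # whether the one-time LR cut happened
        self.patience_lr = patience_lr
        self.patience_stop = patience_stop
        self.factor = factor

    def step(self, val_loss):
        if val_loss < self.best:      # improvement: reset the counter
            self.best = val_loss
            self.wait = 0
            return False
        self.wait += 1
        if not self.reduced and self.wait >= self.patience_lr:
            self.lr *= self.factor    # multiply the learning rate by 0.01
            self.reduced = True
            self.wait = 0
            return False
        return self.reduced and self.wait >= self.patience_stop
```

Calling `step()` once per epoch with the validation loss reproduces the schedule: the learning rate is cut exactly once, and training halts after the second patience window expires.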

2) TRAINING PERFORMANCE
The accuracy and loss in the training stage are used to evaluate the training performance of the models. All seven models completed their training in fewer than 200 epochs. The training accuracy and loss can be seen in Fig. 7(a) and Fig. 7(b), respectively. The training and validation results show that, in the early training stage, IRLNet achieves the best training effect at the fastest convergence speed, followed by LSTM. ResNet has a strong early training effect, but there is noticeable jitter during the training process. The training effects of the other four models are not ideal, and even with more training epochs they cannot achieve the same effect as IRLNet.

3) RECOGNITION PERFORMANCE
The performance of the seven models is measured using accuracy, the most common evaluation metric for classification problems. As shown in Fig. 8, except for the LSTM model, the classification accuracy of IRLNet is significantly better than that of the other five models. When the SNR is greater than 0 dB, the recognition accuracy of IRLNet always exceeds 90%, and the average recognition accuracy over the SNR range of 0-18 dB is higher than that of the other five models by 3%-13%. However, the recognition accuracies of all the models at low SNRs (e.g., -20 dB) are approximately identical. This is likely because, at low SNRs, a significant amount of interference overwhelms much of the signal characteristics, resulting in low recognition accuracy for most modulation modes and thereby lowering the overall recognition accuracy. Following further exploration, we find that the main reason the five models are less effective in modulation recognition is their difficulty in recognizing high-order modulations. As shown in Fig. 9, even at a higher SNR (SNR = 18 dB), models other than IRLNet and LSTM show a certain degree of confusion in recognizing 16QAM and 64QAM; some are even unable to distinguish them at all. There is another area of confusion between WBFM and AM-DSB in the figures. Both are continuous (analog) modulations, so the variations between them on the complex plane are limited. Furthermore, both are generated by sampling analog audio signals that contain silent intervals, which exacerbates the confusion between WBFM and AM-DSB.

4) COMPUTATIONAL COMPLEXITY
We assess the computational complexity of the models in addition to their recognition accuracy. Both spatial and temporal complexity are considered in the computational complexity analysis, and the comparison results among the seven models are shown in Table 6. Spatial complexity refers to the number of trainable parameters. As indicated by the table, except for CLDNN and ResNet, IRLNet has the smallest number of trainable parameters: about 96.9% fewer parameters than Inception, 90.2% fewer than CNN4, 42.1% fewer than CNN2, and 12.2% fewer than LSTM.
We measure the temporal complexity by the training and inference time. The comparison among the seven models is shown in Table 6 and Fig. 10. IRLNet takes approximately 1589 seconds for model training, longer than CNN2 and ResNet but shorter than the other four models. CLDNN has the longest training time. Since the model is always trained offline and, once deployed, is not often retrained, the inference time after model deployment is more critical. IRLNet takes about 4.05 seconds to infer the 360,000 samples in the testing set, which is about 75% faster than LSTM. IRLNet requires the least inference time except for CNN2, yet it achieves the best results. In contrast, LSTM consumes about four times the inference time of IRLNet; while it achieves comparable accuracy, such latency is intolerable for real applications.

B. TRANSFER LEARNING PERFORMANCE
Transfer learning aims at enhancing the performance of target learners on target domains by migrating knowledge from distinct but related source domains [38]. In this section, we treat signal classification on a different dataset as a transfer learning problem. As shown in Section IV-A, the channel environments of RadioML2018.01A and RadioML2016.10B are the same, but the modulation modes are different. Therefore, for RadioML2018.01A, we load the network parameter weights of all layers except the last fully connected layer as the initialization, and the number of neurons in the last fully connected layer is set to 24 according to the number of modulation modes. As shown in Fig. 11 and Fig. 12, the loss curve descends faster than when retraining from scratch, and the training time of the network is significantly shortened. As for HisarMod2019.1, the model needs to be retrained due to the completely different channel environments. In addition, the modulation modes are also different; therefore, the number of neurons in the last fully connected layer also needs to be adjusted, i.e., to 26.
To verify the transfer learning ability of the IRLNet model, we consider diverse modulation modes and different channel environments. In addition, we compare it with the LSTM model, which achieved similar recognition accuracy to IRLNet on RadioML 2016.10B in Section V-A.

1) CLASSIFIER PERFORMANCE BY SAMPLE LENGTH
Since the sample length of the RadioML 2016.10B dataset is limited, we discuss the impact of sample length on recognition accuracy on the two new datasets. Fig. 13 shows how the model performance varies with sample length. The performance of the network steadily improves as the length of the training samples increases. This trend can be explained as follows: the longer the training samples are, the more signal features are fed to the network, and the higher the model accuracy becomes. When the frame size decreased from 1024 to 512 at 0 dB, the recognition accuracy on the two datasets dropped by 3.24% and 4.80%, respectively. As the sample length decreases further, the accuracy continues to fall. Although a longer sample length brings higher recognition accuracy, it comes at the cost of more computation and memory; thus, the trade-off between the two is critical for real applications. In the following experiments, we use a sample length of 1024.
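A minimal sketch of the frame-slicing step implied above: a long recording is cut into fixed-length training frames. The I/Q stream is represented here by a plain list of placeholder values rather than real captured samples.

```python
def segment_frames(iq_stream, frame_len):
    """Split a long I/Q recording into fixed-length training frames,
    discarding any incomplete tail."""
    n_frames = len(iq_stream) // frame_len
    return [iq_stream[i * frame_len:(i + 1) * frame_len]
            for i in range(n_frames)]

stream = list(range(4096))                  # placeholder for a captured I/Q stream
frames_1024 = segment_frames(stream, 1024)  # fewer, longer frames
frames_512 = segment_frames(stream, 512)    # more, shorter frames
```

Halving the frame length doubles the number of training frames from the same recording, but each frame carries fewer signal features, which is the trade-off observed in Fig. 13.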

2) DIVERSE MODULATION MODES
As communication technology advances, modulation modes are multiplying rapidly, so a modulation recognition model needs to be capable of reliably recognizing new modulation modes. RadioML 2018.01A and HisarMod 2019.1 include several new modulation modes, among them high-order modulations and analog modulations that are easy to confuse. Fig. 14 and Fig. 15 show the performance of the classifier for each modulation mode in RadioML 2018.01A and HisarMod 2019.1. Almost all modulation modes in the two datasets achieve robust performance around SNR = 10 dB. For better visualization, the results of each dataset are grouped into three categories: RadioML 2018.01A is divided into PSK + APSK, ASK + QAM, and low-order + analog; HisarMod 2019.1 is divided into PSK + FSK, QAM + PAM, and analog.
In RadioML 2018.01A, IRLNet recognizes all signals with lower information rates better and converges to the maximum accuracy faster than it does for high-order modulation modes. High-order modulations are strongly affected by various channel impairments, and their recognition rate continues to decrease as the order increases, especially for high-order QAMs. For instance, when the QAM order increased from 32 to 64, 128, and 256, the recognition accuracy at the same SNR (i.e., +10 dB) decreased significantly, by 20.05%, 10.10%, and 23.33%, respectively. Unexpectedly, IRLNet performs poorly on AM-SSB-WC. Although its recognition accuracy reaches 77.66% at an SNR of -2 dB, the performance clearly degrades as the SNR increases. This is due to confusion between AM-SSB-WC and AM-SSB-SC, as shown in Fig. 17(a). Both are single-sideband AM, so the differences between them on the complex plane are negligible. Furthermore, the simulated AM-SSB-WC and AM-SSB-SC data in the dataset were created by sampling analog audio signals, which contain silence periods, worsening the issue. By fine-tuning the network, for example by deepening it, the accuracy of the high-order modulations and AM-SSB-WC is improved to varying degrees. As shown in Fig. 16, when the number of IRSs increases from 2 to 4, the accuracy of 64QAM, 128QAM, and 256QAM increases by 6.31%, 5.91%, and 12.91% at SNR = 16 dB. As for AM-SSB-WC, it improves the most at an SNR of 4 dB, increasing by 66.41%.
In HisarMod 2019.1, the recognition accuracy of all modulation modes increases with the SNR, and when the SNR is around 10 dB, all modulation modes reach an accuracy of 100%. At very low SNR (i.e., -20 dB), some modulation modes (2FSK, 4FSK, BPSK, 8PSK, 16PSK, 4PAM, 8PAM, 8QAM, 16QAM, 256QAM, AM-DSB-SC, FM) achieve greater than 50% recognition accuracy, and, even more surprisingly, some modulation modes (such as 32PSK and 32QAM) achieve 100% accuracy. Fig. 17 and Fig. 18 show the confusion matrices of IRLNet at SNR = 16 dB. Except for slight confusion among some high-order QAMs and analog modulations, IRLNet attains a recognition accuracy of over 90% for almost all classes in RadioML 2018.01A. IRLNet performs worst on AM-SSB-WC, misjudging 50% of AM-SSB-WC samples as AM-SSB-SC, as depicted in Fig. 17(a). This may be because the signal contains only a small carrier component, making it more difficult to determine whether the carrier is suppressed or not. Unsurprisingly, there is confusion among the high-order QAMs (64QAM, 128QAM, and 256QAM): 3% and 10% of 64QAM samples are misjudged as 128QAM and 256QAM, respectively, and 256QAM is misjudged as 64QAM and 128QAM with 9% probability. This is likely because higher-order QAMs are more susceptible to severe channel conditions. After fine-tuning, these confusions are clearly reduced, as shown in Fig. 17(b). Surprisingly, in HisarMod 2019.1 no confusion among modulation modes occurs, as shown in Fig. 18: all modulation modes, whether high-order or analog, are identified correctly at 100%. Overall, IRLNet is highly robust across various modulation modes.
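The row-normalized confusion matrices discussed above can be computed with a short sketch like the following. The toy labels merely mimic the AM-SSB-WC/AM-SSB-SC confusion for illustration; they are not the actual dataset predictions.

```python
from collections import Counter

def confusion_matrix(true_labels, pred_labels, classes):
    """Row-normalized confusion matrix: rows are true classes, columns
    are predicted classes, and entries are fractions of each true class."""
    counts = Counter(zip(true_labels, pred_labels))
    matrix = []
    for t in classes:
        total = sum(counts[(t, p)] for p in classes) or 1
        matrix.append([counts[(t, p)] / total for p in classes])
    return matrix

# Toy example: half the AM-SSB-WC samples are misjudged as AM-SSB-SC.
true = ["AM-SSB-WC"] * 4 + ["AM-SSB-SC"] * 4
pred = ["AM-SSB-WC", "AM-SSB-SC", "AM-SSB-SC", "AM-SSB-WC"] + ["AM-SSB-SC"] * 4
cm = confusion_matrix(true, pred, ["AM-SSB-WC", "AM-SSB-SC"])
```

Reading the matrix row by row makes systematic confusions (such as between the two single-sideband AM variants) immediately visible, which is how the misjudgment percentages above are obtained.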

3) DIFFERENT CHANNEL ENVIRONMENTS
Real-world channels vary over time, and designing a network model explicitly for each channel is impractical given the high cost of model design and optimization. The robustness of the model when dealing with time-varying channels is therefore particularly significant. RadioML 2018.01A and HisarMod 2019.1 include over-the-air captured data and Nakagami-channel simulated data, respectively, neither of which appears in RadioML 2016.10B. Since LSTM has been shown to achieve recognition accuracy similar to IRLNet on RadioML 2016.10B, we use it as a baseline for robustness tests on the two datasets. In addition, we compare against the ResNet and CNN models proposed by the dataset generators for RadioML 2018.01A and HisarMod 2019.1, respectively.
As shown in Fig. 19, IRLNet delivers better recognition performance on both datasets. In RadioML 2018.01A, IRLNet shows clear advantages over ResNet and LSTM at both low and high SNRs. As Fig. 19(a) shows, IRLNet improves the recognition accuracy over ResNet and LSTM by 20.5% and 29% at SNR = 10 dB, and its maximum improvements over the two models are 22.2% and 29%, respectively. In HisarMod 2019.1, IRLNet performs excellently, whereas LSTM is almost unable to recognize the modulations; the maximum improvement of IRLNet over LSTM reaches 69.33%. Compared with CNN, IRLNet improves the recognition accuracy by 22%-60% over the SNR range from -20 dB to 0 dB. In other words, IRLNet is extremely robust across a variety of channel environments.

VI. CONCLUSION
To solve the signal modulation recognition problem, we first add an extra skip connection to the basic RS to construct the new IRS. The IRS retains more of the initial residual information, learns signal features at both deep and shallow levels, and achieves various degrees of feature extraction.
Then, we connect several IRSs and an LSTM in parallel to form the modulation recognition model IRLNet. Compared with traditional algorithms, IRLNet automatically extracts deep and shallow features from the data without manual feature engineering. Compared with state-of-the-art DL algorithms, its training process is more stable and its spatial-temporal complexity is lower, which makes the model much easier to deploy in a variety of real scenarios. In addition, IRLNet shows excellent recognition accuracy. First, on RadioML 2016.10B, the average recognition accuracy reaches 93% from 0 dB to 18 dB of SNR. Second, it is strongly robust: when transferred to RadioML 2018.01A and HisarMod 2019.1, the recognition accuracy reaches 95% and 100% at high SNR, respectively. Finally, IRLNet also performs well on high-order modulations, with a recognition accuracy of more than 90% on both datasets, e.g., for 128QAM at SNR = 10 dB. Although we have demonstrated the robustness of IRLNet for various channel environments, real-world channels change constantly over time, which may degrade the performance of the model. Therefore, in future work, we plan to design a neural-network-based channel estimator to recover the distorted channel.