Automatic Modulation Classification Based on Hierarchical Recurrent Neural Networks With Grouped Auxiliary Memory

As a valuable topic in wireless communication systems, automatic modulation classification has been studied for many years. In recent years, recurrent neural networks (RNNs), such as long short-term memory (LSTM), have been used in this area and have achieved good results. However, these models often suffer from the vanishing gradient problem when the temporal depth and spatial depth increase, which diminishes their ability to latch long-term memories. In this paper, we propose a new hierarchical RNN architecture with grouped auxiliary memory to better capture long-term dependencies. The proposed model is compared with LSTM and gated recurrent unit (GRU) on the RadioML 2016.10a dataset, which is widely used as a benchmark in modulation classification. The results show that the proposed network yields a higher average classification accuracy under varying signal-to-noise ratio (SNR) conditions ranging from 0 dB to 20 dB, even with far fewer parameters. The performance advantage is also confirmed on a dataset with signals of variable length.


I. INTRODUCTION
Automatic modulation classification (AMC) is the process of identifying the modulation used by the transmitter, based on observations of the received signal. It is becoming increasingly important in cooperative communications, especially since the advent of the software-defined autonomous radio [1]. Furthermore, AMC plays a crucial role in many applications, such as identifying interference signals and jammers and tracking the activities of specific users [2]-[5].
An automatic modulation classifier can be defined as a system that automatically identifies the modulation type of the received signal, given that the signal exists and that its parameters are distributed in a known range. This requires a universal modulation recognizer capable of classifying a comprehensive list of modulation schemes. Researchers have presented a variety of approaches, which can be roughly divided into two categories: likelihood-based (LB) approaches and feature-based (FB) approaches [6]. LB methods are based on the likelihood function of the received signal, wherein the decision is made by comparing the likelihood ratio against a threshold [2]. Even though LB methods usually exhibit high accuracy and minimize the probability of misclassification, they suffer from high latency or require complete a priori knowledge, such as the clock frequency offset. LB approaches are also prone to model mismatch when the theoretical system model is applied to an actual scene. FB approaches usually extract certain features from the received signals; then, selected classifiers are used to classify different modulation signals [6]. Traditional features mainly include instantaneous time features, statistical features (moments, cumulants, and cyclostationarity), and transform features. Combined with an efficient classifier, these features have achieved satisfactory AMC performance.
The associate editor coordinating the review of this manuscript and approving it for publication was Matti Hämäläinen.
In recent years, deep neural networks have received much attention in many application domains, such as computer vision [7] and natural language processing [8]. Among them, recurrent neural networks (RNNs) occupy an extremely important position in sequence processing and classification because they are efficient feature extractors and classifiers. In most cases, the features extracted by the neural network are more effective than those extracted manually. Moreover, manually extracting features from data may cause the loss of information that is necessary for the classifiers [9]. Therefore, many researchers have tried to use deep neural networks such as convolutional neural networks (CNNs) and long short-term memory (LSTM) [10] for AMC problems.
CNNs exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers. CNNs share weights among all neurons in a particular feature map, and each neuron is connected only to a subset of the input. This helps to reduce the number of parameters in the whole system and makes the computation more efficient. Despite their great achievements in spatial feature extraction, CNNs do not perform well in modeling time-series changes. To learn the changes in time series, error signals must travel a long temporal distance when backpropagating through time. The difficulties arising from the large temporal lengths of RNNs are significantly alleviated by LSTMs [11]. However, training LSTMs that are deep in both time and space still poses a challenge, and shortcomings such as the vanishing gradient problem hinder the learning of long-term temporal changes.
In this paper, we propose a modified hierarchical recurrent neural network with a grouped auxiliary memory (GAM-HRNN). We use a hierarchical structure by stacking the layers, with a shortcut connection from the grouped auxiliary memory (GAM) [11] to each layer. Recurrent structures such as LSTM, gated recurrent unit (GRU) [12], and update gate recurrent neural network (UGRNN) [13] can be used in the HRNN. The contributions of this paper are as follows. First, we develop a new recurrent-model-based deep learning solution; the experimental results show high accuracy on a standard dataset compared with LSTM, GRU, recurrent highway networks (RHNs) [14], and recurrent highway networks with grouped auxiliary memory (GAM-RHNs) [11]. Second, even with fewer parameters, the model achieves comparable results. Third, we explore the norm of the gradients of the mentioned models during the training process. Finally, we find that the proposed model also provides an advantage when dealing with signals of variable length. The rest of the paper is organized as follows. Section II introduces related studies. Section III formulates the problem and introduces the standard and modified datasets. We introduce the LSTM model, GRU model, and RHNs in Section IV and explain our GAM-HRNN in Section V. Section VI presents the experiments and analysis, and Section VII concludes the paper.

II. RELATED WORKS
In this section, a brief introduction of the works using traditional methods is provided first. Then, we review the work that relates to our method in detail.

A. TRADITIONAL METHODS
Traditional methods can be primarily divided into two categories: LB and FB approaches. Chavali et al. used LB algorithms for modulation classification in fading non-Gaussian channels [15]; they modeled the additive noise with a Gaussian mixture distribution model. FB methods use instantaneous time features, statistical features, transform features, and other features, including constellation shape. Yuan et al. [16] developed an algorithm using wavelet transform and pattern recognition for analog and digital modulation classification. The wavelet transform was used to estimate the symbol rate of the received signals to separate analog signals from digital signals. Ananthram et al. proposed a method based on elementary fourth-order cumulants [17]. The cumulant-based classification is particularly effective when used in a hierarchical scheme. In [18], the authors used various statistical moments of the signal amplitude, phase, and frequency with a fuzzy classifier; their technique performed well at low signal-to-noise ratios (SNRs).
Researchers have also used artificial neural networks (ANNs) as classifiers. In [19], a common ANN achieved good performance in analog and digital modulation-type classification. Although ANNs achieved success in modulation recognition, their overdependence on training samples and their tendency to settle into a local optimum restricted their performance and application. In [20], the authors proposed a hierarchical support vector machine (SVM)-based structure and used higher order moments and cumulants for AMC, which improved the performance of the recognizer. Aslam et al. [21] explored the use of genetic programming (GP) in combination with K-nearest neighbor (KNN) for AMC; KNN was used to evaluate the fitness of GP individuals during the training phase. Each FB method has its own advantages, and traditional machine learning methods such as KNN, SVM, and ANN have all been used as classifiers. As mentioned previously, in many domains, automatically learned features are more effective than manually extracted ones, and separating feature extraction from the classifier tends to cause information loss. This led researchers to consider deep neural network methods.

B. DEEP NEURAL NETWORK METHODS
With the experimental conditions generally standardized, many researchers have used deep neural network methods, mainly CNNs and LSTMs, for the AMC problem and achieved excellent performance. Deep neural network methods depend on training data. O'Shea et al. built the RadioML 2016.10a dataset in 2016 [22] and achieved good performance on it with a simple CNN model. Peng et al. indicated that their proposed CNN-based model achieved good accuracy compared with SVM for AMC, without the need for manual feature selection [23]. They also used two well-known CNN-based models, AlexNet and GoogLeNet, for AMC and converted complex signals into data formats with a grid-like topology, e.g., images, which facilitated the use of prevalent deep neural network models and frameworks for classification [24]. The experiments indicated that CNN models show a significant performance advantage and application feasibility. In [25], teacher and student networks were used to shrink the size of the model, with a slight decrease in accuracy. Huang et al. used a compressive CNN for modulation classification [5]. Furthermore, CNN models were also used to classify the modulation types in an orthogonal frequency-division multiplexing system, wherein the modulation classification accuracy was limited [26].
VOLUME 8, 2020
Temporal dependencies are important in AMC problems, and an LSTM can learn those features effectively. Rajendran et al. proposed an LSTM for AMC on the standard RadioML 2016.10a dataset, without requiring expert features like higher order cyclic moments [27]. The simple LSTM model yielded an average classification accuracy of approximately 90% under varying SNR conditions ranging from 0 dB to 20 dB, and it serves as a baseline for comparison herein. They showed that an LSTM-based model can learn good representations of the time-domain sequences for the AMC problem. However, the forget bias of the model had to be set to 1.0 manually, and many parameters were necessary to achieve high accuracy. Pascanu et al. presented methods for constructing deep RNNs [28]. Simply increasing the recurrence depth yielded RHNs [14], which alleviated the vanishing gradient problem in spatial depth to some extent, but their representation capability was low. In a hierarchical structure, each layer has its own states, which promotes variety and improves the representation capability of the model. However, LSTM also faces problems that degrade its performance, for example, the vanishing gradient problem as the model grows deeper in both time and space. In [29], neurons in the same layer are independent of each other and are connected across layers. Hu and Zheng modified the memory update method of the LSTM network and achieved good results on prediction tasks [30]. In [31], tensorized LSTMs were employed to model the temporal patterns, and an adaptive shared memory was used to help the networks learn the relatedness among tasks.
Combining the CNN with LSTM in parallel mode and serial mode achieved better performance compared with independent networks [9]. Huang et al. applied a few data augmentation methods, such as rotation, flip, and Gaussian noise, on the RadioML Dataset. Their experiments showed that the rotation method yielded the best accuracy [32], and needed fewer data to achieve relatively good results. For the convenience of application, exploring lightweight networks is also very important for AMC problems. Wang et al. proposed a pruning method for networks and obtained a model with fewer parameters and comparable accuracy [33].

III. PROBLEM FORMULATION AND DATASET INTRODUCTION
A. COMMUNICATION SIGNAL FORMULATION
Consider digital modulation as an example. At the receiver side, the received signal $y(t)$ can be formulated as follows:
$$y(t) = x(t) * c(t) + n(t)$$
Here, $x(t)$ is the modulated signal from the transmitter, $*$ denotes convolution, $c(t)$ represents the time-varying impulse response of the wireless channel, and $n(t)$ denotes the additive white Gaussian noise. The generation of $x(t)$ is shown in Fig. 1: the modulating signal $v(t)$ is mixed with a carrier signal of amplitude $A_c$ and carrier frequency $f_c$ to obtain the modulated signal $x(t)$. The aim of AMC is to use the received signal $y(t)$ to find the category that maximizes $P(x(t) \in N_i \mid y(t))$, where $N_i$ is the $i$-th category of all the modulation types.
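As a rough illustration of the signal model above, the following Python sketch generates a received sequence by passing a modulated signal through a channel and adding white Gaussian noise. The discrete-time form, the cosine carrier, the fixed (time-invariant) impulse response, the function names, and the SNR-based noise scaling are all assumptions made for illustration; they are not taken from the dataset generator.

```python
import math
import random


def modulate(v, a_c, f_c, f_s):
    """Mix the modulating samples v with a cosine carrier (assumed passband form)."""
    return [a_c * v_n * math.cos(2.0 * math.pi * f_c * n / f_s)
            for n, v_n in enumerate(v)]


def apply_channel(x, c):
    """Causal convolution of x with impulse response c, truncated to len(x)."""
    y = []
    for n in range(len(x)):
        y.append(sum(c_k * x[n - k] for k, c_k in enumerate(c) if n - k >= 0))
    return y


def received_signal(x, c, snr_db, rng):
    """Return y[n] = (x * c)[n] + noise[n], with the noise power set by snr_db."""
    faded = apply_channel(x, c)
    signal_power = sum(s * s for s in faded) / len(faded)
    noise_std = math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))
    return [s + rng.gauss(0.0, noise_std) for s in faded]


# Hypothetical example: constant modulating signal, two-tap channel, 10 dB SNR.
x = modulate([1.0] * 128, a_c=1.0, f_c=10.0, f_s=200.0)
y = received_signal(x, c=[1.0, 0.3], snr_db=10.0, rng=random.Random(0))
```

A time-varying channel would replace the fixed list `c` with taps that change per sample; the sketch keeps them constant for brevity.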
The standard RadioML 2016.10a dataset [22] consists of 11 modulations: 8 digital and 3 analog modulations, with 4 samples/symbol and a sample length of 128 samples. All are widely used in wireless communications systems globally. Radio channel effects are relatively well-characterized. Realistic non-ideal effects, such as thermal noise, oscillator drift, symbol timing offset, sample rate offset, carrier frequency offset, and phase difference, are reflected in the data. These parameters are shown in Table 1.

2) VARIABLE LENGTH SEQUENCES
To explore the model's ability to handle the modulation of different parameters with variable lengths, we used the RML2018.01a, which was first used in [34] to obtain variable length signals. In this dataset, each sequence had 1024 samples, with SNR ranging from −20 dB to 30 dB. We took the first 11 modulation types in the dataset. Then, we randomly selected the sequences to split them into lengths of 128, 256, and 512 with SNR ranging from −20 dB to 20 dB. The number of samples was 99,000 in both the training and testing sequences.
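One possible way to build such variable-length examples is sketched below in Python. The exact splitting procedure is not specified here, so the function name, the choice of one random length per source sequence, and the use of non-overlapping segments are assumptions for illustration only.

```python
import random


def split_variable_length(sequence, lengths=(128, 256, 512), rng=None):
    """Cut one long sequence into non-overlapping segments of a randomly chosen length."""
    rng = rng or random.Random()
    seg_len = rng.choice(lengths)
    return [sequence[i:i + seg_len]
            for i in range(0, len(sequence) - seg_len + 1, seg_len)]


# Hypothetical 1024-sample sequence (integers stand in for IQ samples).
segments = split_variable_length(list(range(1024)), rng=random.Random(0))
```

Because 1024 is a multiple of 128, 256, and 512, no samples are discarded under this scheme.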

IV. RELATED MODELS
Notation: Boldface letters are used for vectors and matrices; $\mathbf{1}$ denotes a vector of ones. $\sigma$ denotes the sigmoid activation function, $\tanh$ denotes the hyperbolic tangent function, and $\odot$ represents element-wise multiplication.

A. LONG SHORT-TERM MEMORY
LSTM is one of the most popular RNN models. It was first proposed by Hochreiter and Schmidhuber in 1997 [10]. The core idea of LSTM is to protect the integrity of messages by controlling how memories are written. LSTM uses three mechanisms to achieve this: write control, read control, and forget control. Write control uses some units to cancel out useless information, read control cancels out irrelevant information, and forget control selectively forgets the least relevant old information. The mechanisms are realized by the three gates shown in Fig. 2 and, in the standard form, are formulated as follows:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
Candidate values:
$$\tilde{C}_t = \tanh(W_C x_t + U_C h_{t-1} + b_C)$$
Drop the useless information and add new information:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \quad h_t = o_t \odot \tanh(C_t)$$
LSTM first gives the candidate write $\tilde{C}_t$, then uses the forget gate and input gate to update the state. Finally, it uses the output gate to provide the output of the model. However, LSTM still has problems, such as write conflicts and read conflicts [10]. This hinders the model from keeping the memory for long time steps.

B. GATED RECURRENT UNIT
The GRU was introduced in [12]. GRU explicitly links the state and coordinates the writes and forgets, as presented in Fig. 3. In the standard form, the formulation is as follows:
$$z_t = \sigma(W_z x_t + U_z s_{t-1} + b_z), \quad r_t = \sigma(W_r x_t + U_r s_{t-1} + b_r)$$
$$\tilde{s}_t = \tanh(W_s x_t + U_s (r_t \odot s_{t-1}) + b_s)$$
$$s_t = z_t \odot s_{t-1} + (\mathbf{1} - z_t) \odot \tilde{s}_t$$
Instead of performing selective writes and selective forgets, GRU foregoes some expressiveness and selectively overwrites by setting the forget gate equal to 1 minus the write gate. The update gate $z_t$ plays the role of the forget gate $f_t$ of the prototype LSTM, and the input gate is computed as $\mathbf{1} - z_t$. This works because it turns $s_t$ into an element-wise weighted average of $s_{t-1}$ and $\tilde{s}_t$, which is bounded if both $s_{t-1}$ and $\tilde{s}_t$ are bounded. GRU is an alternative to LSTM; in [35], GRU outperformed LSTM on nearly all tasks except language modeling with the naive initialization.

C. RECURRENT HIGHWAY NETWORK
The recurrent highway network (RHN) was first proposed in 2017 by Zilly et al. [14].
Many sequential processing tasks require complex nonlinear transitions from one step to the next, and recurrent neural networks with deep transition functions remain difficult to train, even when using LSTM networks. RHN extends the LSTM architecture to allow step-to-step transition depths larger than one. It can use generic RNN structures such as UGRNN, LSTM, and GRU for the networks. For example, as mentioned above, LSTM receives $c_{t-1}$ and $h_{t-1}$ from its last time step to compute the current $c_t$ and $h_t$ and hands them over to the next time step. However, as shown in Fig. 11, when we use an LSTM kernel for an $L$-layer RHN model, $c^1_t$ and $h^1_t$ are initialized to zero for the first layer, and $x_t$ is the input of the first layer. For the second layer, $c^2_{t-1}$ and $h^2_{t-1}$ are taken from the previous layer ($c^1_t$ and $h^1_t$) at the same time step $t$. Then, $c^2_t$ and $h^2_t$ are calculated and delivered to the next layer. At the last layer, $c^L_t$ and $h^L_t$ are passed to the next time step. In the detailed formulation (with LSTM used as an example), $l$ denotes the current layer.
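The difference between recurrence depth (RHN) and hierarchical stacking can be sketched in a few lines of Python. This is a minimal scalar illustration under stated assumptions: `gru_style_kernel` is a simplified GRU-like update with the reset gate omitted and hypothetical scalar weights, and `rhn_step` reflects one reading of the description above, in which the last layer's state is handed to the first layer at the next time step.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_style_kernel(inp, s_prev, w_z=1.0, w_s=1.0):
    """Scalar GRU-style update: weighted average of the old state and a candidate."""
    z = sigmoid(w_z * (inp + s_prev))          # update gate, acts as the forget gate
    s_tilde = math.tanh(w_s * (inp + s_prev))  # candidate state
    return z * s_prev + (1.0 - z) * s_tilde    # the input gate is 1 - z


def rhn_step(x_t, s_prev_time, kernels):
    """Recurrence depth: a single state passes through all layers within one time step."""
    s = s_prev_time
    for layer, kernel in enumerate(kernels):
        inp = x_t if layer == 0 else 0.0  # only the first layer sees x_t
        s = kernel(inp, s)
    return s  # handed to the first layer at the next time step


def hierarchical_step(x_t, states_prev_time, kernels):
    """Hierarchical stacking: every layer keeps its own state across time."""
    new_states = []
    layer_input = x_t
    for kernel, s_prev in zip(kernels, states_prev_time):
        s_new = kernel(layer_input, s_prev)
        new_states.append(s_new)
        layer_input = s_new  # the output of this layer feeds the next layer
    return new_states


# Run both structures over a short, made-up input sequence.
kernels = [gru_style_kernel, gru_style_kernel]
rhn_state, layer_states = 0.0, [0.0, 0.0]
for x_t in [0.5, -0.2, 0.1]:
    rhn_state = rhn_step(x_t, rhn_state, kernels)
    layer_states = hierarchical_step(x_t, layer_states, kernels)
```

In the RHN loop only one state survives each time step, whereas the hierarchical loop carries one state per layer, which is the property the hierarchical design relies on.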

V. HIERARCHICAL RECURRENT NEURAL NETWORKS WITH GROUPED AUXILIARY MEMORY
In this section, we propose an HRNN with grouped auxiliary memory, named GAM-HRNN. The main body of the model is built with a hierarchical structure using the generic kernels mentioned above. Shortcut reading blocks use the output of the last layer as the key to read from the auxiliary memory and obtain information as the input to the next layer. The details are presented in Fig. 5, where GAM (grouped auxiliary memory) denotes the auxiliary module and K represents generic RNN structures, such as LSTM, GRU, and UGRNN. The input $x$, the previous states, and the auxiliary memory are first written into the GAM. The memory module $m_t$ is partitioned equally into $N$ groups, which is favorable for dealing with long short-term information, and each of these groups is a vector of length $S$ (the structure of GAM is shown in Fig. 6). We denote the softmax-over-groups activation as $\zeta_{S \times N}$. In the update of $m_t$, $s^1_{t-1}$ denotes the state of the first layer, $m_t$ serves as a 'candidate' state for the calculation, and $G_w(\cdot)$ is an attentional mechanism implemented using the softmax-over-groups activation $\zeta$. Each element of $a_t$ lies between 0 and 1, and $A$ is an affine transformation. $h^w_t$ is the vector for writing and is reused by $R_1$ through the shortcut connectivity. The duplicating matrix $U_{S \times N} \in \mathbb{R}^{N_m \times N}$ used in (12a) is given in (15a). Then, $s_t$ is updated; this can comprise generic layers of RNN structures, such as LSTM, GRU, and UGRNN. In (16a), $[\cdot]$ denotes the concatenation of the elements in it, $V_{S \times N}$ is defined as the transpose of $U_{S \times N}$ given in (15a), and K denotes generic RNN kernels. The reading blocks use $h_t$ as the key to get useful information from GAM. In (16c), when $l > 1$, the previous state $s^{l-1}_{t-1}$ can be an alternative key for the reading blocks.
In the networks, information is sparsely written in the GAM based on the group mechanism and delivered to the next time step directly. This means that only a very limited portion of states in the GAM is overwritten in each time step. Thus, the information can be efficiently maintained. Subsequently, for each layer, the required information is also read from GAM sparsely using the new state of the current layer as the key. Finally, the required information is sent to the next layer, along with the output of the previous layer. In the backpropagation view, the networks offer a shortcut for the error signals that back-propagate in both temporal and spatial dimensions.
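The sparse, group-wise write described above can be illustrated with a short Python sketch. It is only a sketch under assumptions: the attention is taken to be a softmax across the N groups, the duplicating matrix is taken to repeat each group's attention value S times, and the write is taken to be a convex blend of the old memory and a candidate. The function names and the blend form are hypothetical and are not the paper's exact equations.

```python
import math


def softmax(scores):
    """Numerically stable softmax; outputs lie in (0, 1) and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def duplicate(group_values, group_size):
    """Effect of the duplicating matrix U: repeat each group value group_size times."""
    return [v for v in group_values for _ in range(group_size)]


def pool(memory, group_size):
    """Effect of V = U^T: sum the entries inside each group."""
    return [sum(memory[i:i + group_size])
            for i in range(0, len(memory), group_size)]


def gam_write(m_prev, m_candidate, group_scores, group_size):
    """Blend the candidate into memory group by group (assumed convex-combination form)."""
    gate = duplicate(softmax(group_scores), group_size)
    return [(1.0 - g) * old + g * new
            for g, old, new in zip(gate, m_prev, m_candidate)]


# Two groups of size two; the attention strongly prefers the first group,
# so the second group is left almost untouched.
m_new = gam_write([1.0, 1.0, 2.0, 2.0], [9.0, 9.0, 9.0, 9.0], [50.0, 0.0], 2)
```

Because the softmax concentrates its mass on a few groups, most of the memory is carried over unchanged from one time step to the next, which is the behavior the paragraph above attributes to the group mechanism.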

VI. EXPERIMENTS AND DISCUSSION
In this section, we evaluate the performance of the proposed model on both the RadioML 2016.10a dataset and the variable length dataset. As shown in Table 1, we equally divided the standard dataset into training and testing sets with the data in IQ format. Using the IQ data, we derived the amplitude and phase vectors. The amplitude vectors were L2 normalized, and the phase vectors, which were in radians, were normalized between -1 and 1. Finally, we obtained an $\mathbb{R}^{2 \times 128}$ matrix for a single signal sequence. A single pair of amplitude and phase values was given to the model at each timestep. Details are shown in Fig. 7.
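The preprocessing described above can be sketched in Python as follows. The function names are hypothetical, and dividing the phase by π to reach the range [-1, 1] is an assumption, since the exact scaling is not stated.

```python
import math


def iq_to_amplitude_phase(i_samples, q_samples):
    """Convert IQ samples to a 2 x T feature matrix [amplitude, phase]."""
    amplitude = [math.hypot(i, q) for i, q in zip(i_samples, q_samples)]
    l2_norm = math.sqrt(sum(a * a for a in amplitude))
    if l2_norm > 0.0:
        amplitude = [a / l2_norm for a in amplitude]
    # atan2 returns radians in [-pi, pi]; dividing by pi maps them to [-1, 1]
    # (assumed scaling).
    phase = [math.atan2(q, i) / math.pi for i, q in zip(i_samples, q_samples)]
    return [amplitude, phase]


def timesteps(features):
    """Return one (amplitude, phase) pair per timestep, as fed to the recurrent model."""
    return list(zip(features[0], features[1]))


# Hypothetical two-sample example.
features = iq_to_amplitude_phase([3.0, 0.0], [4.0, 0.0])
pairs = timesteps(features)
```

For a 128-sample example, `features` has shape 2 × 128 and `pairs` contains 128 timesteps.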
For all the models, we used the Adam optimizer [36] with a minibatch size of 400 vectors. The learning rate was set to 0.001. Weights were initialized via the Xavier uniform initializer [37], and the models were implemented using TensorFlow [38]. The standard backpropagation through time algorithm was used for training the RNNs. For all the models, a dropout of 0.2 was used for each layer. For LSTM, we initialized the forget gate bias to 1.0. This implied that we encouraged the LSTM to write the information into memory at the start. For the GAM-RNN model, the GAM size was set to 2 × 20, while the size of the vector for writing ($h^w_t$ in (12c)) was 20. We also added a dropout of 0.3 for the auxiliary module. GAM-HRNN used 128 units for each layer, and different numbers of units per layer were used for the other models so that all models had roughly the same number of parameters. For the independently recurrent neural network (IndRNN), the recurrent weight is constrained to the range $|u_n| \in (0, \sqrt[T]{2})$ [29], where $T$ represents the number of time steps of the sequence. Furthermore, we denote the GAM-HRNN model with a GRU kernel as GAM-HRNN-GRU, and this notation is used for all other models with kernels. The initialized forget bias of 1.0 was not used for the GAM-HRNN-LSTM model. The classification accuracies at all SNRs are presented in Fig. 8 and Fig. 9. The number of model parameters and the average accuracy for the SNR range from 0 dB to 20 dB are shown in Table 2. As we can see, the 2-layer GAM-GRU exhibited the best accuracy at all SNRs compared with the other models, and the 2-layer GRU and LSTM models obtained similar results on the dataset. All the models have insufficient capability for classifying the modulation types at low SNRs. At -18 dB (the lowest SNR), the accuracy is nearly 1/11, which shows that the model returns an almost random category at this SNR (there are 11 modulation types in total).
The proposed 2-layer GAM-HRNN-GRU model achieved an average accuracy of 92.2% in the SNR range from 0 dB to 20 dB, with fewer model parameters than the 2-layer LSTM model. In addition, the single-layer GAM-HRNN-GRU obtained 91.6% average accuracy with fewer parameters than the LSTM, which is a positive aspect when considering practical applications. Meanwhile, the accuracy of the single-layer GAM-HRNN-GRU is about 8% better than that of the single-layer LSTM. For RHN models, the two-layer RHN-GRU model achieved accuracy slightly lower than that of the two-layer GRU. The two-layer RHN-LSTM achieved the worst accuracy, which indicates that the recurrent highway structure was not suitable for this task. The results showed that the hierarchical structures were more efficient than the RHN structures for this problem. The accuracy of the 2-layer GAM-HRNN-GRU was higher than that of the 2-layer GAM-RHN-GRU with about the same number of parameters, and GAM-HRNN-GRU was more stable. In summary, simple RNN models, such as LSTM or GRU, could not achieve excellent accuracy, which was probably due to long-term memory problems. The GAM-HRNN model provided the best results, as GAM could efficiently utilize the temporal features of the sequence and then latch information for a long time period. Furthermore, the hierarchical structure with reading shortcuts was more efficient.
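For reference, an average accuracy over an SNR range can be computed as in the following Python sketch. It assumes an unweighted mean over the SNR levels inside the range, which matches a per-sample average only if every SNR level has the same number of test examples. The function name and the numbers in `example` are hypothetical and are not results from this paper.

```python
def average_accuracy(accuracy_by_snr, snr_min=0, snr_max=20):
    """Unweighted mean of per-SNR accuracies for snr_min <= SNR <= snr_max (in dB)."""
    selected = [acc for snr, acc in accuracy_by_snr.items()
                if snr_min <= snr <= snr_max]
    if not selected:
        raise ValueError("no SNR level falls inside the requested range")
    return sum(selected) / len(selected)


# Hypothetical per-SNR accuracies, for illustration only.
example = {-4: 0.55, 0: 0.80, 10: 0.90, 18: 1.00}
mean_0_to_20 = average_accuracy(example)

# Accuracy expected from random guessing among 11 modulation types.
chance_level = 1.0 / 11.0
```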
We also present confusion matrices at three SNRs to further explain the performance of the proposed model. The results showed that at an SNR of -8 dB (Fig. 10), the model begins to return the correct category, but the accuracy is still too low for practical application. The model shows excellent performance when the SNR is greater than 0 dB (Fig. 11); nonetheless, the model could not separate AM-DSB and WBFM signals very well, even at high SNRs (Fig. 12). This is mainly due to the small observation window (0.64 ms of modulated speech per example) and the low information rate with frequent silence between words [39]. Distinguishing between QAM16 and QAM64 also suffered from short-time observations over only a few symbols. However, once the constellations were of higher order and shared common points [39], the proposed model performed well. Comparing QAM16 and QAM64 at the three SNRs shows that the order of the modulation type is an influential factor for classification accuracy. This was true for all investigated methods.

B. GRADIENTS ANALYSIS
To further explain the memory mechanism in GAM-HRNN, we analyzed the gradient curves during the training process. The key to learning long-term dependencies lies in controlling $\partial L / \partial s_t$, which implies keeping the gradients within an appropriate range. In RNNs, the gradient suffers from the vanishing problem in both temporal and spatial dimensions. To further explain the experimental results, we compared the norm of $\partial L / \partial s_t$ for a 1-layer GAM-HRNN-LSTM and a 1-layer LSTM during the training process. The forget biases of LSTM were initialized to 0 for both models.
The curves are presented in Fig. 13. At the beginning of the training, the gradient norms of GAM-HRNN-LSTM were at an acceptable level. After 300 updates, GAM-HRNN-LSTM still maintained an appropriate gradient magnitude along the timesteps, but LSTM suffered from the vanishing gradient problem. Hence, the GAM module can adaptively maintain the gradients at a suitable level for the RNN model. This benefits the training process and helps the model eventually reach a satisfactory result.
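The vanishing effect itself can be reproduced with a toy example. The Python sketch below is not the experiment reported above: it uses a hypothetical scalar RNN, $s_t = \tanh(w s_{t-1} + x_t)$, and computes the magnitude of the gradient of the final state with respect to each earlier state by the chain rule.

```python
import math


def state_gradient_magnitudes(inputs, w, s0=0.0):
    """Return |d s_T / d s_t| for t = 1..T in the scalar RNN s_t = tanh(w * s_{t-1} + x_t)."""
    states = []
    s = s0
    for x in inputs:
        s = math.tanh(w * s + x)
        states.append(s)
    grads = [0.0] * len(states)
    g = 1.0  # d s_T / d s_T
    for t in range(len(states) - 1, -1, -1):
        grads[t] = abs(g)
        # Chain rule: d s_t / d s_{t-1} = w * (1 - s_t^2)
        g *= w * (1.0 - states[t] ** 2)
    return grads


# With |w| < 1, every factor is below 1, so the gradient reaching early
# timesteps shrinks geometrically with the temporal distance.
grads = state_gradient_magnitudes([0.5] * 20, w=0.5)
```

Plotting `grads` against the timestep gives a curve of the same kind as a per-timestep gradient norm plot, with early timesteps receiving almost no error signal.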

C. MODIFIED DATASET
We further evaluated the model on the modified variable length dataset. For AMC, the ability of the model to deal with variable length signals with different parameters is also important. In this subsection, we evaluate the 2-layer GAM-HRNN-GRU and the 2-layer LSTM on the variable length dataset. The details of the dataset are presented above. The configurations of the models were the same as those in subsection A. The models were trained for SNRs ranging from -20 dB to 20 dB, and input sample lengths varied from 128 to 512 samples. The results are presented in Fig. 14.
At low SNRs, both models had poor capability to return the correct category for the dataset. As the SNR increased, the proposed model performed better than LSTM at all lengths. The accuracy increased as the data obtained by the model accumulated, because the model could learn temporal dependencies from the information. Moreover, this requires the model to retain memory for a long time, which is critical for this problem.

VII. CONCLUSION AND FUTURE WORK
In this paper, we proposed a recurrent structure named GAM-HRNN for the AMC problem. Subsequently, we evaluated the GAM-HRNN model on the standard and variable length datasets. Experiments verified that the proposed model exhibited excellent performance on this problem. Our 1-layer model also gave competitive results and outperformed other 2-layer models with far fewer parameters. The model had sufficient ability to handle variable length signal inputs with different parameters. We also emphasized the importance of maintaining the gradient within an appropriate range during the training process for obtaining good results.
There are some limitations to the proposed model. The proposed model has insufficient capability to deal with inputs at low SNRs. In addition, the parameters can be further reduced by modifying the structure of the model or using some pruning methods.