DeepSIG: A Hybrid Heterogeneous Deep Learning Framework for Radio Signal Classification

Deep learning has recently been widely used in automatic modulation classification (AMC). Most deep learning-based AMC methods use a single network model to process radio signals in a single input format. In this paper, we propose a hybrid heterogeneous modulation classification architecture named DeepSIG, which integrates Recurrent Neural Network (RNN), Convolutional Neural Network (CNN) and Graph Neural Network (GNN) models in a single framework to process radio signals in heterogeneous input formats, i.e., in-phase (I) and quadrature (Q) sequences, images mapped from IQ signals and graphs converted from IQ signals, so as to extract and integrate features from different perspectives. A fusion training mechanism is presented to train DeepSIG. We use three different radio signal datasets for simulations. Results show that DeepSIG achieves the best classification accuracy compared with the three methods based on a single input, i.e., sequence, image or graph, and the performance gain is larger in few-shot scenarios.


Kunfeng Qiu, Shilian Zheng, Luxin Zhang, Caiyi Lou, and Xiaoniu Yang

I. INTRODUCTION
With the rapid development of the Internet of Things (IoT), 5G and other related technologies in the field of wireless communications, the electromagnetic environment has become increasingly complex, which makes the shortage of spectrum resources more serious [1], [2]. Therefore, improving spectrum utilization is becoming increasingly important. Cognitive radio is an effective technology for improving spectrum efficiency, as it can dynamically manage the use of spectrum resources. Modulation classification facilitates spectrum sensing, which is an indispensable technology in cognitive radio [3], [4].
The modulation recognition of radio signals can be divided into two parts: feature extraction and classification. In traditional modulation classification methods, the features are extracted according to manually designed procedures [5], [6], [7]. Researchers need to specially design corresponding algorithms to calculate the features, and then input the feature vectors into a classifier such as a Support Vector Machine (SVM) [8] to realize modulation classification. These manual feature extraction methods generally require expert knowledge, and the final classification accuracy may not meet requirements in complex environments [9], [10]. With the rapid development of artificial intelligence, many Automatic Modulation Classification (AMC) methods based on deep learning have emerged [11], [12], [13]. These methods automatically learn and extract features and classify through deep neural network layers, realizing end-to-end modulation classification [14], [15], [16], [17].

Manuscript received 26 November 2022; revised 13 April 2023; accepted 22 May 2023. Date of publication 8 June 2023; date of current version 9 January 2024. This work was supported in part by the National Natural Science Foundation of China under Grant U20B2038. The associate editor coordinating the review of this article and approving it for publication was J. Xu. (Corresponding authors: Shilian Zheng; Xiaoniu Yang.) The authors are with the Science and Technology on Communication Information Security Control Laboratory, Jiaxing 314033, China (e-mail: yexijoe@163.com; lianshizheng@126.com; lxzhangMr@126.com; loucycetc@163.com; yxn2117@126.com).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TWC.2023.3281896.

Digital Object Identifier 10.1109/TWC.2023.3281896
According to the input format, deep learning-based AMC methods can be roughly divided into three types: AMC based on sequence input, image input and graph input. Among them, AMC based on sequence input is the most common. The input of this type of method is generally the in-phase (I) and quadrature (Q) components of the received signal, and the neural network used for classification is generally a Recurrent Neural Network (RNN) model, which can capture the time-dependent information of the IQ signal. In addition, there are also some specially designed Convolutional Neural Network (CNN) models that classify sequence-input IQ signals. As for AMC methods based on image input, the IQ signals are mapped into corresponding images according to fixed rules or trainable neural network layers, and then CNN models with excellent performance in the field of image classification can be directly used for modulation classification. This type of method uses images to represent the original radio signals, and then applies techniques from the image field to extract the potential image representation features in the original signals. However, in the process of mapping radio signals into images, if the conversion method is designed unreasonably, some information hidden in the original signals may be lost, which may affect the accuracy of modulation recognition. Similarly, AMC methods based on graph input also convert the radio signals into another type of data for classification. Specifically, this type of method maps the signals into graphs with the help of existing graph mapping algorithms, and then classifies the graphs to realize radio modulation classification. In the field of complex networks, there are several types of methods for classifying graph data. For example, graph embedding methods [18], [19] can extract the feature vectors corresponding to the graphs, which can then be classified with machine learning classifiers. We can also directly take the graph data as input and use Graph Neural Network (GNN) [20] models to realize end-to-end classification. A graph is a data structure with topological relationships, composed of nodes and edges connecting the nodes. By using graphs to represent the original radio signals for classification, the signal sampling values at different time points can be correlated through edge relationships, and potential graph representation features can be extracted for subsequent classification. However, at present, the conversion speed of graph mapping algorithms based on fixed rules is slow, and existing trainable mapping algorithms need to tune hyperparameters to control the graph structure, so both have certain disadvantages in processing radio signals of long duration.
Generally speaking, current deep learning-based AMC methods are basically based on a single input format, i.e., sequence, image or graph, and are superior to traditional modulation classification methods in terms of accuracy. Compared with classification methods based on the original IQ signals, although the methods based on images and graphs may lose or distort some implicit temporal information in the original signal data, they also expose potential information from other perspectives, and the hidden features from these perspectives can be extracted in subsequent processing. In short, AMC methods based on different input formats can generally extract features of the original signals from different perspectives. For better classification performance, we can use inputs of all three formats at the same time. In order to make full use of the potential information contained in these forms of data, in this paper we propose a hybrid heterogeneous framework named DeepSIG (S for Sequence, I for Image and G for Graph), which can be divided into four parts: a sequence network, an image network, a graph network and a fusion classification network. The sequence network is the modulation recognition branch based on IQ signal input. In this paper, a Long Short-Term Memory (LSTM) [21] network is used to process the input IQ signals and extract time-related feature information. The image network is the classification branch based on image input, which includes an image mapping algorithm and a CNN model. In order to retain the implicit information in the original signal data as much as possible and extract potential information at different scales, we propose a new image mapping method that maps IQ signals into dual-channel narrow-edge images by means of Decomposition and Reconstruction of the Discrete Wavelet Transform (DWT), which we call DWTDR [22]. After obtaining the images from the IQ signals, we use the ResNet18 [23] model to extract the potential image representation features. The graph network is the classification branch based on graph input, including our proposed graph mapping algorithm, called Limited Fixed Graph Mapping (LFGM), and a GNN model. In order to reduce the computational complexity required for mapping IQ signals into graphs, LFGM gives the graphs representing different signals the same structure, and in order to preserve the information of the IQ signals as much as possible, the oversampling ratio of the signal is taken as the limiting condition when determining the edges of the mapped graphs. After obtaining the graphs, we use DiffPool [24] to extract the potential graph representation features. Notably, the networks of the above three parts do not contain final classification layers; they only extract features from the corresponding input format. The last part is the fusion classification layer, which operates on the feature vectors obtained from the previous three networks. After identical normalization and splicing (concatenation), the fusion classification layer uses a fully connected layer for modulation classification.
In conclusion, the main contributions of this paper are as follows.

• We conduct simulations on three datasets and their corresponding few-shot scenarios to verify the performance of the proposed DeepSIG. Compared with classification methods based on a single sequence, image or graph input, our proposed DeepSIG, which integrates all three, performs the best.

The rest of this paper is organized as follows. In Sec. II, we review related modulation classification work based on sequence, image and graph inputs. In Sec. III, we introduce the architecture of DeepSIG in detail. In Sec. IV, we provide the simulation results, and finally, in Sec. V, we conclude the paper.

II. RELATED WORK

A. AMC Methods With Sequence Input
In the current field of radio modulation classification, AMC methods based on sequence input, i.e., IQ signals, are the most common. O'Shea et al. [25] used a narrow 2D-CNN model to process IQ signals and realize modulation classification. In addition, O'Shea et al. [26] also designed a 1D-CNN model according to the VGG [27] architecture, and proposed a modified model based on ResNet [23] which achieved good modulation classification performance. Liu et al. [28] proposed a seven-stage CNN model by tuning the CNN architecture proposed in [29], and they designed two models for radio modulation classification based on ResNet and DenseNet [30], respectively. Furthermore, they proposed a classification model called Convolutional Long Short-term Deep Neural Network (CLDNN) which achieved higher classification accuracy. Jariwala and Captain [31] proposed a deep learning model based on a CNN architecture for AMC which has a regularization effect during training and can reduce over-fitting by making use of a Gaussian noise [32] layer after the convolution layers. Hong et al. [33] proposed a structure based on a two-layer Gated Recurrent Unit (GRU) network which was able to make full use of the temporal sequence characteristics of radio signals. Rajendran et al. [21] designed a data-driven model for AMC based on LSTM. In general, these methods need manually designed, specialized classification network models for the signal data.

B. AMC Methods With Image Input
In addition to AMC methods with raw IQ sequence input, there are also AMC methods based on image input. Converting radio signals into images is one of the key steps, and the conversion methods can be divided into two categories: image mapping methods based on fixed rules and image mapping methods based on trainable neural network layers. At present, the former are more common. For example, Peng et al. [34] converted raw IQ signals into images with a grid-like topology by means of the constellation diagram and used a slightly modified AlexNet [35] model for modulation classification. Furthermore, Peng et al. [36] proposed three other methods to map signals into images based on the constellation diagram, namely Gray Image, Enhanced Gray Image and Three-Channel Image, and then used modified AlexNet and GoogLeNet [37] models to implement AMC. Teng et al. [38] proposed a new polar-based deep learning architecture with a channel compensation network which maps the complex symbols of radio signals to images in polar coordinates through their polar feature transformation method. Guo et al. [39] introduced a spatial transformation to map signals into pixelated sample distribution density images, and used a custom eight-layer CNN model to realize AMC. Wang and Oates [40] proposed two methods, Gramian Angular Fields (GAF) and Markov Transition Fields (MTF), to encode time series as different types of images, which can be applied to radio signals. Liang et al. [41] used the short-time Fourier transform (STFT) [42] to convert received signals into time-frequency spectrograms, and then used a modified ResNeXt [43] with two dual-attention modules and a customized classification module to perform AMC. In addition, there are also methods that map signals into images through trainable network layers. Chen et al. [44] proposed a modulation classification framework, SigNet, in which the proposed trainable signal-to-matrix (S2M) operator maps signals into images. Chen et al. [45] proposed the Convolution Mapping Method (CMM), composed of a convolution layer and a maximum pooling layer, to convert radio signals into images. Generally speaking, current image mapping methods based on fixed rules tend to lose some information of the original radio signal in the conversion process, while image mapping methods based on trainable neural network layers need to be manually designed in combination with specific network models.

C. AMC Methods With Graph Input
In addition, there are also AMC methods based on graph input. Among them, the method of mapping signals into graphs is important and indispensable. Lacasa et al. [46] introduced a graph mapping algorithm called the Visibility Graph (VG) which can transform time series into graphs. Based on VG, Luque et al. [47] proposed the Horizontal Visibility Graph (HVG) algorithm, which needs fewer statistics and simpler limiting rules in the process of transforming time series into graphs. Ning-De and Bin [48] proposed a graph mapping method called the Limited Penetrable Visibility Graph (LPVG), which performed well in the classification of oil-gas-water three-phase flow. All the above graph mapping methods can be applied to radio signals, and based on them, Xuan et al. [49] proposed a new graph mapping method called the Circular Limited Penetrable Visibility Graph (CLPVG), which achieves good performance in modulation classification. Xuan et al. [50] also proposed an end-to-end automatic modulation classification framework, AvgNet, in which the proposed Adaptive Visibility Graph (AVG) algorithm can adaptively map radio signals into graphs. It is undeniable that the existing graph mapping methods share a common problem: they cannot be applied well to long signals.

III. THE PROPOSED DEEPSIG FRAMEWORK

A. Problem Formulation
In the field of wireless communications, modulation classification aims to infer the modulation scheme of the transmitted baseband signal s(n) through analysis of the received signal r(n). The relationship between the received signal r(n) and the transmitted signal s(n) is

r(n) = e^{j(2\pi \Delta f n + \theta)} \left[ s(n) * g(n) * h(n) \right] + w(n), \quad n = 0, 1, \ldots, N-1,

where g(n) represents the pulse-shaping filter response, h(n) is the wireless channel response, ∆f is the carrier frequency deviation caused by Doppler shift or clock synchronization error between the receiver and the transmitter, θ is the phase deviation, w(n) is additive white Gaussian noise (AWGN), N is the signal length, and * denotes the convolution operation.
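This signal model can be exercised with synthetic data. The following NumPy sketch generates a received signal under the model above; the function and parameter names are ours, not from the paper, and the SNR handling is a common convention rather than something the paper specifies.

```python
import numpy as np

def received_signal(s, g, h, delta_f, theta, snr_db, rng=None):
    """Sketch of r(n) = e^{j(2*pi*delta_f*n + theta)} [s(n)*g(n)*h(n)] + w(n).

    s: baseband symbols, g: pulse-shaping filter, h: channel response.
    Returns the I(n) and Q(n) components of r(n).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.convolve(np.convolve(s, g), h)            # s(n) * g(n) * h(n)
    n = np.arange(len(x))
    x = x * np.exp(1j * (2 * np.pi * delta_f * n + theta))  # CFO and phase offset
    p_sig = np.mean(np.abs(x) ** 2)
    p_noise = p_sig / 10 ** (snr_db / 10)            # noise power from SNR (dB)
    w = np.sqrt(p_noise / 2) * (rng.standard_normal(len(x))
                                + 1j * rng.standard_normal(len(x)))
    r = x + w
    return np.real(r), np.imag(r)                    # I(n), Q(n)
```

In practice, a dataset generator would call this once per sample with a randomly drawn modulation scheme, frequency offset and SNR.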
In the modulation classification problem, the modulation scheme of the received signal belongs to the set

\mathcal{M} = \{ M_1, M_2, \ldots, M_C \},

where C is the number of modulation schemes. Therefore, modulation classification can be regarded as a classification problem with C categories. To facilitate subsequent operations, the received baseband signal r(n) in complex form is generally given in the form of IQ components:

I(n) = \mathrm{real}(r(n)), \quad Q(n) = \mathrm{imag}(r(n)),

where real(·) and imag(·) represent the operations of extracting the real and imaginary parts of a complex number, respectively, and I(n) and Q(n) are the IQ components of the signal r(n).

B. Overall Framework
Our proposed DeepSIG framework realizes end-to-end radio signal modulation recognition and is mainly composed of three different types of neural network architectures: RNN, CNN and GNN. The overall framework is shown in Fig. 1.
Different from existing AMC models, DeepSIG operates on three forms of input: sequence, image and graph. The sequence input is the IQ signal; the image input is obtained by decomposing and reconstructing the IQ signal with the DWT [22]; and the graph input is obtained by transforming the IQ signal with our proposed graph mapping method. LSTM [21], ResNet18 [23] and DiffPool [24] are used in DeepSIG to handle the three inputs, and a fully connected layer serves as the classification layer. The training process of DeepSIG can be divided into two steps. First, with the IQ signal sequences, the image data obtained by DWTDR and the graph data obtained by LFGM, we train LSTM, ResNet18 and DiffPool, respectively. From the three models saved after this preliminary training, three feature vectors F_S, F_I and F_G can be extracted for each radio signal; they are the input vectors of the last fully connected layer in the LSTM, ResNet18 and DiffPool models, respectively. After the same normalization is applied to F_S, F_I and F_G, the fused feature vector F_H is obtained by splicing, and it is used as the input to train the newly added fully connected layer, which produces the final classification result. In particular, during fusion training, the newly added fully connected layer is trained with a relatively large learning rate, while the other neural network layers are fine-tuned with a small learning rate. When performing modulation classification inference after training, DeepSIG includes the steps of mapping signals into images and graphs, so end-to-end inference is completed simply by inputting IQ signals. Our proposed hybrid heterogeneous deep learning framework makes full use of different types of neural network models and extracts feature vectors from input data in different forms, fusing heterogeneous features as effectively as possible to obtain superior classification performance.

C. Sequence Network
In the sequence network part of DeepSIG, the input data is the IQ signal, which is a kind of time series, and the network model used is LSTM. At present, RNNs are commonly used when processing time series with deep learning. Among them, LSTM is the most prevalent, as it can effectively address the long-term dependence problem of RNNs. In general, LSTM controls the flow and loss of feature information extracted from time series through three gate mechanisms, the forget gate, memory gate and output gate, so as to mitigate the vanishing and exploding gradient problems of ordinary RNNs when training on long time series.
The structure of the LSTM cell is shown in Fig. 2. First, after receiving the cell state V_{t−1} from the previous LSTM cell, the forget gate chooses which content to retain and which to discard. Specifically, the choice of information is determined by multiplying a forgetting factor f_t with the last cell state V_{t−1}, where f_t is calculated from the short-term memory g_{t−1} and the input data i_t at time t as follows:

f_t = \sigma\left( W_f [g_{t-1}, i_t] + b_f \right),

where W_f and b_f represent the trainable weights and biases of the current neural network layer, respectively, and σ(·) is the Sigmoid function. Then, the memory gate determines the new feature information and stores it in the cell state:

u_t = \sigma\left( W_u [g_{t-1}, i_t] + b_u \right), \quad Y_t = \tanh\left( W_Y [g_{t-1}, i_t] + b_Y \right), \quad V_t = f_t \odot V_{t-1} + u_t \odot Y_t,

where W_u, b_u, W_Y and b_Y represent the weights and biases of the corresponding neural network layers, Y_t is the cell state update value, and u_t controls which features of Y_t are used to update V_{t−1}. After the updated cell state V_t is obtained, the output gate determines the output value g_t based on V_t. The output gate uses a Sigmoid function to determine which part of the cell state V_t needs to be output, then processes V_t through a Tanh layer and finally obtains the desired output through a multiplication operation:

o_t = \sigma\left( W_o [g_{t-1}, i_t] + b_o \right), \quad g_t = o_t \odot \tanh(V_t),

where W_o and b_o represent the weights and biases of the current neural network layer, respectively. We use two LSTM layers and one fully connected layer to classify IQ signal data normalized to [−1, 1] in the sequence network. The normalization process is shown in Algorithm 1. The first LSTM layer consists of 128 LSTM cells and the second has 4 cells. The final fully connected layer is the classification layer, which maps the output features of the second LSTM layer to the number of modulation categories of the dataset.
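The gate equations above can be sketched directly in NumPy using the paper's notation (V for cell state, g for the hidden output, i_t for the input). This is an illustrative single step, not the authors' implementation; the weight shapes are our assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(i_t, g_prev, V_prev, W_f, b_f, W_u, b_u, W_Y, b_Y, W_o, b_o):
    """One LSTM step; each W has shape (hidden, hidden + input)."""
    z = np.concatenate([g_prev, i_t])        # [g_{t-1}, i_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    u_t = sigmoid(W_u @ z + b_u)             # memory (input) gate
    Y_t = np.tanh(W_Y @ z + b_Y)             # candidate cell update
    V_t = f_t * V_prev + u_t * Y_t           # new cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    g_t = o_t * np.tanh(V_t)                 # new hidden output
    return g_t, V_t
```

A full sequence network would scan this cell over all time steps and feed the final hidden output to the next layer.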

D. Image Network
In this paper, we propose an image mapping method that processes IQ signal data through the DWT to obtain dual-channel narrow-edge image data, and then use the ResNet18 model for the modulation recognition task.
1) Algorithm: DWT is a time-frequency analysis method with multi-resolution analysis characteristics, offering higher frequency-domain resolution in the low-frequency region and higher time resolution in the high-frequency region. DWT can characterize the local characteristics of a signal in both the time and frequency domains. The hierarchical structure of decomposing a signal with DWT is shown in Fig. 3.

[Algorithm 1: The Process of Normalizing IQ Signals — take an IQ signal as input, record the maximum absolute value in the IQ sequence as MAX_abs, and return the IQ signal normalized to [−1, 1].]
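Algorithm 1's normalization of IQ signals to [−1, 1] can be sketched as follows; `normalize_iq` is our name, and the zero-signal guard is our addition.

```python
import numpy as np

def normalize_iq(iq):
    """Normalize an IQ signal to [-1, 1] by its maximum absolute value.

    iq has shape (2, N): row 0 is the I channel, row 1 the Q channel.
    """
    max_abs = np.max(np.abs(iq))     # MAX_abs in Algorithm 1
    if max_abs == 0:                 # guard against an all-zero signal
        return iq
    return iq / max_abs
```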
The operations of the i-th order DWT decomposition can be represented as

S_{i,L}(n) = \sum_{k=0}^{K-1} g(k)\, S_{i-1,L}(2n - k), \quad S_{i,H}(n) = \sum_{k=0}^{K-1} h(k)\, S_{i-1,L}(2n - k),

where S_{0,L}(n) = S(n), g(k) and h(k) represent the low-pass filter and high-pass filter, respectively, and K is their length. In particular, K is determined by the wavelet function used in the DWT; different wavelet functions have different lengths K. For example, the length K of the Haar wavelet is 2. The Daubechies wavelet family has different orders with corresponding lengths: the length K of the P-th order Daubechies wavelet is 2P. Similarly, the Coiflets wavelet family also has several orders, with a corresponding length of 6P for the P-th order. For a sequence S(n) of length N, the corresponding approximation component S_{1,L}(n/2) and detail component S_{1,H}(n/2) can be obtained by using the low-pass and high-pass filters, respectively. These two series are both down-samplings of the signal S(n) by a factor of 2, where S_{1,L}(n/2) represents the low-frequency information and S_{1,H}(n/2) represents the high-frequency information of S(n).
For the approximation component S_{1,L}(n/2) obtained from the first decomposition, a second DWT decomposition can be performed in the same way to obtain the second-order approximation component S_{2,L}(n/4) and detail component S_{2,H}(n/4). In this way, any approximation component of length greater than 1 can be further decomposed. It is worth noting that each time a DWT decomposition is performed, the length of the two resulting components is halved, so we can perform at most log_2 N orders of DWT decomposition on a sequence of length N. In practice, for many kinds of sequence data, including radio signals, the low-frequency component is significant and often contains the characteristics of the signal, while the high-frequency component captures the details of the signal. The process of DWT reconstruction based on an approximation component and its corresponding detail component can be expressed as

S_{i-1,L}(n) = \sum_{k=0}^{K-1} \left[ g(k)\, \hat{S}_{i,L}(n - k) + h(k)\, \hat{S}_{i,H}(n - k) \right],

where \hat{S}_{i,L} and \hat{S}_{i,H} denote the components up-sampled by a factor of 2. Two components of the same length can thus be converted into a component of twice the length through one DWT reconstruction. Therefore, based on the log_2 N high-frequency components obtained by log_2 N orders of decomposition and the low-frequency component obtained through the log_2 N-th order decomposition, a reconstructed signal of length N can be obtained after log_2 N DWT reconstruction operations. We call this process a signal reconstruction operation in the following. It is worth noting that, unlike the traditional complete signal reconstruction described above, the signal reconstruction operation in our proposed DWTDR includes certain preprocessing operations on the components.
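For the Haar wavelet (K = 2), one level of decomposition and reconstruction reduces to simple pairwise sums and differences. A minimal sketch (our own helper names, not the paper's code):

```python
import numpy as np

def haar_dwt(s):
    """One Haar DWT level: approximation and detail, each of length N/2."""
    s = np.asarray(s, dtype=float)
    approx = (s[0::2] + s[1::2]) / np.sqrt(2)   # low-pass + downsample
    detail = (s[0::2] - s[1::2]) / np.sqrt(2)   # high-pass + downsample
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse Haar DWT: rebuild a sequence of twice the length."""
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out
```

Because the Haar transform is orthogonal, `haar_idwt(haar_dwt(x))` recovers `x` exactly, which is the perfect-reconstruction property the text relies on.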
In this paper, the process of mapping an IQ signal to an image is shown in Algorithm 2, and the reconstruction process of channel I is shown in Fig. 4. For the I-channel data I with length N, we perform log_2 N DWT decompositions as shown in Fig. 3 to obtain log_2 N high-frequency components and one low-frequency component. Based on these high- and low-frequency components, we perform a total of (1 + log_2 N) signal reconstruction operations. Specifically, in the j-th signal reconstruction operation (1 ≤ j ≤ 1 + log_2 N), we keep the j-th component unchanged and set the other components to zero before reconstructing, and the (1 + log_2 N) reconstructed sequences form the matrix M_I with shape N × (1 + log_2 N). By processing the Q-channel data in the same way, we can get the corresponding matrix M_Q with shape N × (1 + log_2 N). The mapped dual-channel narrow-edge image M_IQ is obtained by concatenating M_I and M_Q.
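The DWTDR mapping of one channel can be sketched as follows, using the Haar wavelet. This is our reading of Algorithm 2 (fully decompose, then reconstruct 1 + log_2 N times, each time keeping one component and zeroing the rest); the function names are ours.

```python
import numpy as np

def _haar_dwt(s):
    return (s[0::2] + s[1::2]) / np.sqrt(2), (s[0::2] - s[1::2]) / np.sqrt(2)

def _haar_idwt(a, d):
    out = np.empty(2 * len(a))
    out[0::2] = (a + d) / np.sqrt(2)
    out[1::2] = (a - d) / np.sqrt(2)
    return out

def dwtdr_channel(x):
    """Map one channel of length N (a power of 2) to an N x (1 + log2 N) matrix."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    levels = int(np.log2(N))
    comps = []                      # [detail_1, ..., detail_levels, approx]
    a = x
    for _ in range(levels):
        a, d = _haar_dwt(a)         # full log2(N)-level decomposition
        comps.append(d)
    comps.append(a)                 # final low-frequency component
    cols = []
    for j in range(levels + 1):     # one reconstruction per kept component
        kept = [c if i == j else np.zeros_like(c) for i, c in enumerate(comps)]
        a_rec = kept[-1]
        for d in reversed(kept[:-1]):
            a_rec = _haar_idwt(a_rec, d)
        cols.append(a_rec)
    return np.stack(cols, axis=1)   # shape (N, 1 + log2 N)
```

Because the transform is linear, summing the columns recovers the original channel, so no information is lost in the mapping.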
After obtaining the narrow-edge image M_IQ corresponding to each IQ signal, we use the ResNet18 network for modulation classification. In theory, the deeper the neural network, the more information can be obtained and the richer the features. In practice, however, when the number of network layers exceeds a certain level, continuing to deepen the network causes gradient explosion or vanishing, which adversely affects optimization and degrades performance. ResNet was proposed in response to this problem, and its core is the residual block. Fig. 5 shows the structure of the residual block. The input feature x is propagated forward through a shortcut connection across the network layers. Stacking multiple residual blocks forms a deep network without gradient explosion or vanishing problems.
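The residual block's identity shortcut can be illustrated with a tiny two-layer sketch (plain NumPy, not the actual ResNet18 convolutions; the linear layers stand in for the block's conv layers):

```python
import numpy as np

def residual_block(x, W1, W2):
    """F(x) + x with F a two-layer transform; the shortcut lets gradients
    flow through the identity path even when F contributes little."""
    h = np.maximum(0, x @ W1)            # first layer + ReLU
    return np.maximum(0, h @ W2 + x)     # add identity, then activation
```

With `W2` near zero the block reduces to (roughly) the identity, which is why stacking many such blocks does not degrade optimization the way plain deep stacks do.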
2) Complexity Analysis: We now analyze the complexity of DWTDR, taking the Haar wavelet, whose filter length is K = 2, as an example. The entire DWTDR process can be divided into discrete wavelet decomposition and reconstruction. First, performing one Haar discrete wavelet decomposition on one channel of an IQ signal of length N requires a total of N/2 addition and N/2 multiplication operations, so its time complexity is O(N); for an IQ signal with 2 channels, the time complexity is O(2N). In the DWTDR process, a total of log_2 N decompositions are required, and the sequence length of each decomposition is half that of the previous one. Therefore, the time complexity of the entire decomposition process for an IQ signal in DWTDR is

O\left( 2 \sum_{i=1}^{\log_2 N} \frac{N}{2^{i-1}} \right) = O(4N) = O(N).

Second, consider the reconstruction process of DWTDR for one channel sequence of an IQ signal with length N. Based on the one low-frequency component and log_2 N high-frequency components obtained from the above decomposition, log_2 N DWT reconstruction operations must be performed recursively to obtain a sequence of length N, i.e., one signal reconstruction. The time complexity of each DWT reconstruction is the same as that of the corresponding DWT decomposition, i.e., linear in the length of the reconstructed sequence. Therefore, the time complexity of performing one signal reconstruction operation on a channel sequence is

O\left( \sum_{i=1}^{\log_2 N} \frac{N}{2^{i-1}} \right) = O(2N) = O(N).

For an IQ signal there are two channel sequences, and each channel sequence requires (1 + log_2 N) signal reconstruction operations throughout the entire DWTDR reconstruction process. Therefore, the time complexity of DWTDR reconstruction for an IQ signal is

O\left( 2 \times (1 + \log_2 N) \times 2N \right) = O(N \log_2 N),

and the overall time complexity of DWTDR is

O(4N) + O(4N(1 + \log_2 N)) = O(N \log_2 N).

Similarly, the space complexity of DWTDR can also be divided into two parts: decomposition and reconstruction. Taking one channel sequence of an IQ signal as an example, during the decomposition process, (1 + log_2 N) arrays are needed to store the one low-frequency component and the log_2 N high-frequency components obtained after decomposition, whose total length is N. Therefore, the space complexity of an IQ signal in the DWTDR decomposition process is O(2N). During the reconstruction process, we need an array to store the component obtained from the current DWT reconstruction, whose length is at most N, and this array can be reused across the log_2 N DWT reconstructions within a signal reconstruction and across the (1 + log_2 N) signal reconstructions of a DWTDR reconstruction. Therefore, the space complexity of an IQ signal during the DWTDR reconstruction process is O(2N). Finally, the space complexity of DWTDR is O(2N) + O(2N) = O(N).

E. Graph Network
In addition to mapping IQ signals into images, there are also methods that characterize IQ signals with graphs. However, the existing graph mapping methods are time-consuming, especially when dealing with long signals, so we propose a new and simple method called LFGM to map IQ signals into graphs. Through LFGM, different IQ signals are represented by graphs of the same structure; the differences lie only in the feature vectors of the nodes. The process of mapping an IQ signal to a graph is shown in Algorithm 3. Specifically, for a radio signal sample S_IQ of length N, we convert it into the corresponding graph G = ⟨V_N, E⟩, where V_N and E represent the node set and edge set of graph G. In graph G, the N nodes correspond to the time sampling points of the IQ signal, and the attribute of each node is a vector composed of the two IQ-channel values at the corresponding sampling point. The edges of graph G are determined by the distance between time sampling points: sampling points with an interval of less than k form an edge, where k is a hyperparameter that controls the sparseness of the connection relationships of graph G. The larger k is, the denser the generated graph. In order to better preserve the implicit information of the original IQ signal, we set k to the oversampling ratio of the radio signal.
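The LFGM construction described above can be sketched in a few lines; `lfgm` is our name for the routine, and the node/edge-list representation is one common choice among several.

```python
def lfgm(iq, k):
    """Limited Fixed Graph Mapping sketch: node i carries [I(i), Q(i)];
    nodes i < j are linked when j - i < k (k = oversampling ratio).

    iq is a pair of equal-length sequences (I, Q); returns node features
    and an undirected edge list.
    """
    I, Q = iq
    N = len(I)
    nodes = [(I[i], Q[i]) for i in range(N)]            # node attribute vectors
    edges = [(i, j) for i in range(N)
             for j in range(i + 1, min(i + k, N))]      # interval |j - i| < k
    return nodes, edges
```

Since the edge set depends only on N and k, the graph structure is identical for all signals of the same length, which is what lets LFGM skip the per-signal structure computation of visibility-graph methods.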
After mapping each IQ signal into a corresponding graph, existing GNNs for graph classification can be directly used for radio signal modulation classification. In general, graph classification with a GNN learns an embedding vector for each node according to the topological connection relationship, and then aggregates the embedding vectors of the nodes to obtain a representation of the entire graph. In this paper, we choose DiffPool to classify the mapped graphs. DiffPool is a differentiable graph pooling module that can generate hierarchical representations of graphs and can be combined with various graph neural network architectures in an end-to-end manner. The architecture of DiffPool is shown in Fig. 6: it learns differentiable cluster assignments for the nodes in each layer, maps the nodes to a set of clusters, and then forms the coarsened input for the next GNN layer.
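The coarsening step at the heart of DiffPool can be sketched in a few lines. In DiffPool the assignment matrix S is produced by a GNN and a softmax; here a hand-made S and a tiny path graph stand in for the learned quantities, purely for illustration.

```python
import numpy as np

# One DiffPool-style coarsening step:  A' = S^T A S,  X' = S^T X.
# In the real module, S is a learned soft cluster-assignment matrix whose
# rows sum to 1; the fixed 0/1 assignment below is an illustrative stand-in.

def diffpool_step(adj, feats, assign):
    """Coarsen a graph: adj (n x n), feats (n x d), assign (n x c).

    Returns the pooled adjacency (c x c) and pooled features (c x d).
    """
    pooled_adj = assign.T @ adj @ assign
    pooled_feats = assign.T @ feats
    return pooled_adj, pooled_feats


# 4-node path graph pooled into 2 clusters: {0, 1} and {2, 3}.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
feats = np.arange(8, dtype=float).reshape(4, 2)
assign = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
pooled_adj, pooled_feats = diffpool_step(adj, feats, assign)
# off-diagonal entries of pooled_adj count the edges crossing the clusters
```

Stacking such steps, each followed by a GNN layer on the coarsened graph, yields the hierarchical representation used for classification.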
In addition, we analyze the time complexity and space complexity of LFGM. In the process of mapping an IQ signal of length N into a graph, an attribute vector of length 2 must be assigned to each node of the graph, so the time complexity of LFGM is O(2N). In general, the edge connections of a graph are represented by an N × N matrix. To save storage space, we can instead use S_E binary tuples to represent the edge set, where S_E is the number of edges in the graph. According to Algorithm 3, S_E = (k − 1)N − k(k − 1)/2, and thus the space complexity of LFGM is O(2S_E).
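The closed-form edge count can be checked numerically against direct enumeration, again assuming the "interval of less than k" rule:

```python
# Check of the edge count S_E = sum_{d=1}^{k-1} (N - d)
#                             = (k - 1) N - k (k - 1) / 2,
# assuming nodes i and j are connected iff 0 < |i - j| < k.

def edge_count_formula(n, k):
    return (k - 1) * n - k * (k - 1) // 2

def edge_count_enumerated(n, k):
    return sum(1 for i in range(n) for j in range(i + 1, n) if j - i < k)

# Signal length 128 with oversampling ratio k = 8, as in RML2016.10a.
assert edge_count_formula(128, 8) == edge_count_enumerated(128, 8)
```

Since S_E grows linearly in N for fixed k, the edge-list representation is far cheaper than the O(N²) adjacency matrix for long signals.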

F. Fusion Training
After training the above three AMC models with their corresponding input formats separately, we perform fusion training based on the three saved trained models. Through the three trained models, the feature vectors F_S, F_I and F_G of the classification models whose input formats are sequences, images and graphs can be obtained respectively:

F_S = Γ_s(S_IQ), F_I = Γ_i(M_IQ), F_G = Γ_g(G),

where Γ_s, Γ_i and Γ_g represent the trained LSTM, ResNet18 and DiffPool models without the last classification layer, respectively. Afterwards, the same normalization is applied to the three feature vectors F_S, F_I and F_G, and the three normalized vectors are spliced to obtain the fused feature vector F_H:

F_H = [N(F_S), N(F_I), N(F_G)],

where N(·) represents the normalization operation

N(v) = [v_1/(∥v∥_p + ϵ), v_2/(∥v∥_p + ϵ), ..., v_l/(∥v∥_p + ϵ)], with ∥v∥_p = (Σ_{i=1}^{l} |v_i|^p)^{1/p},

where v is the one-dimensional feature vector to be processed, l is the length of v, v_i is the i-th element of v, ϵ is a small value used to avoid division by zero, ∥v∥_p denotes the L-p norm of v, p is the exponent in the norm formulation, and |·| is the absolute value operation. Subsequently, a fully connected layer for classification is added, and the fused feature vector F_H is used as its input:

O = L_FC(F_H),

where L_FC represents the added fully connected layer and O is the predicted label. Finally, we remove the last fully connected layer of each of the three networks based on sequence, image and graph, and integrate the three networks with the newly added fully connected layer for fusion training.
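The normalization and splicing steps can be sketched as follows. The toy vectors stand in for the real LSTM, ResNet18 and DiffPool outputs; their lengths and the epsilon value are illustrative choices, not the paper's settings.

```python
# Sketch of the fusion step: L-p normalize each backbone's feature vector,
# then splice (concatenate) the three results into F_H.
# The toy 4-element vectors below stand in for the real backbone outputs.

def normalize(v, p=2, eps=1e-12):
    """N(v): divide every element of v by (||v||_p + eps)."""
    norm = sum(abs(x) ** p for x in v) ** (1.0 / p)
    return [x / (norm + eps) for x in v]

def fuse(f_seq, f_img, f_graph, p=2):
    """F_H = [N(F_S), N(F_I), N(F_G)]: splice the normalized features."""
    return normalize(f_seq, p) + normalize(f_img, p) + normalize(f_graph, p)

f_s = [1.0, 1.0, 1.0, 1.0]   # stand-in for the LSTM feature vector F_S
f_i = [2.0, 2.0, 2.0, 2.0]   # stand-in for the ResNet18 feature vector F_I
f_g = [3.0, 0.0, 0.0, 0.0]   # stand-in for the DiffPool feature vector F_G
f_h = fuse(f_s, f_i, f_g)    # each normalized segment has unit L2 norm
```

Normalizing before splicing keeps any one backbone's feature scale from dominating the fused vector fed to the final classification layer.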
It is worth noting that during fusion training, the newly added fully connected layer needs to be trained with a larger learning rate, while the other network layers Γ_s, Γ_i and Γ_g need to be fine-tuned with a smaller learning rate.

A. Datasets
We demonstrate the performance of our proposed DeepSIG using three radio signal datasets: RML2016.10a [25], SigData-18 and SigData-36. The first is a public dataset and the other two are generated by ourselves. Their basic information is shown in Table I.
Among them, RML2016.10a has 11 modulation schemes. The signal-to-noise ratio (SNR) range for each modulation is −20 dB to 18 dB in 2 dB intervals. The length of each signal is 128 and the oversampling ratio is 8. This dataset has a total of 220,000 samples. In the simulation, we take 800 samples from each scheme at each SNR as the training samples and use the rest as the test samples.
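The split sizes implied by the numbers above are easy to verify; the per-(modulation, SNR) count of 1000 is inferred from the 220,000 total:

```python
# Sanity check of the RML2016.10a split implied by the figures above:
# 11 modulations x 20 SNR levels (-20 to 18 dB in 2 dB steps) = 220 cells,
# so 220,000 total samples means 1000 samples per cell (inferred);
# taking 800 per cell for training leaves 200 per cell for testing.

mods, snrs, per_cell = 11, 20, 1000
total = mods * snrs * per_cell
train = mods * snrs * 800
test = total - train
```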
SigData-18 contains 18 modulation schemes.The signal length is 512 and the oversampling ratio is 8. Raised cosine filtering with roll-off factor randomly chosen from the range [0.2, 0.7] is used for pulse shaping.The SNR range is −20 dB to 30 dB, and the interval is also 2 dB.We randomly pick 800 signals from this dataset under each SNR for each modulation scheme as the training set, and merge the remaining samples into the test set.

B. Simulation Setting
All simulations are run on an NVIDIA Tesla T4 GPU. The deep learning framework we use for training is PyTorch [51]. In particular, we use PyTorch Geometric [52] to build the GNN model DiffPool. The optimizer and loss function are Adam [53] and cross-entropy, respectively. When training the models, we use a fixed-step learning rate decay strategy: the initial learning rate is 0.001, and every 5 epochs the learning rate decays to 80% of its previous value. In addition, when performing fusion training on DeepSIG, the initial learning rate of all network layers except the final classification layer is set to 0.00001. When training on datasets RML2016.10a, SigData-18 and SigData-36, the mini-batch size is set to 128, 128 and 32, respectively. The wavelet function used when mapping IQ signals into images through DWTDR is Daubechies (db1), namely Haar [54]. When training DeepSIG on the 0.5%, 1% and complete training sets of RML2016.10a, the exponent p in Eq. 28 is set to 50, 1 and 1, respectively; in all other cases p is set to 2.
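The fixed-step decay described above has a simple closed form; it is the schedule that, for example, `torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)` would apply:

```python
# Fixed-step learning rate decay: starting from 0.001, multiply the rate
# by 0.8 every 5 epochs.

def learning_rate(epoch, base_lr=1e-3, step=5, gamma=0.8):
    return base_lr * gamma ** (epoch // step)

# Epochs 0-4 use the initial rate; epoch 5 triggers the first decay.
rates = [learning_rate(e) for e in (0, 4, 5, 12)]
```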

C. Performance of DWTDR and LFGM
To verify the performance of our proposed image mapping method DWTDR and graph mapping method LFGM, both of which are based on fixed transformation rules, we use GAF [40] mentioned in Sec. II-B and AVG [50] mentioned in Sec. II-C for comparison. GAF can convert IQ signals into images by a fixed mapping rule, and according to the simulation results in [45], it generally performs better than other image mapping methods based on fixed conversion rules when used in modulation classification tasks. Therefore, we choose GAF to compare with our proposed image mapping method DWTDR. As for the existing graph mapping methods based on fixed conversion rules, such as VG [46], HVG [47] and LPVG [48], they need a lot of time to process large datasets with long signals, which makes them impractical. Therefore, we choose the trainable neural-network-based mapping method AVG to compare with our proposed LFGM.
For fairness, after mapping signals into images with GAF and DWTDR, the ResNet18 model is used for classification; AVG and LFGM are both combined with the DiffPool model for classification. The results are shown in Table II and Fig. 7, in which "FLOPs" and "Parameters" in Table II denote the floating point operations and the number of stored parameters, respectively, when inferring one IQ signal sample. We can see that, compared with GAF, the modulation classification method based on our proposed DWTDR achieves higher accuracy on all three datasets, especially on RML2016.10a and SigData-18. Because the narrow edge matrix obtained by DWTDR is much smaller than the matrix obtained by GAF, the overall computational and space complexity of the DWTDR based modulation recognition method is significantly smaller than that of the GAF based method. Comparing AVG, which uses a trainable mapping algorithm, with LFGM, which uses fixed mapping rules, we see that our proposed LFGM performs better on the complex long-signal dataset SigData-36, with an accuracy increase of about 4%, while its average classification accuracy on RML2016.10a and SigData-18 is about 0.4% and 1.4% lower, respectively. Compared with AVG, which can learn and adjust the mapped graph structure through neural network layers, LFGM is slightly less accurate on the two datasets with relatively short signals, but it is simpler and more efficient: it directly uses a graph of the same structure to represent different signals and does not spend extra resources and time on the neural network layers used to learn the graph structure. The "FLOPs" column in Table II shows that the computational complexity of the LFGM based classification method is lower than that of the AVG based method.

TABLE III CLASSIFICATION ACCURACY OF DIFFERENT METHODS

It is worth noting that the "Parameters" of the AVG based classification method in Table II do not include the graph edge information, because as an adaptive mapping algorithm, AVG generally produces different numbers of edges for different IQ signals. The "Parameters" of the LFGM based classification method include both the edge information and the model parameters. In the simulations on the three datasets above, when processing the same radio signal, the mapped graph obtained by AVG generally has more edges than the one obtained by LFGM, so in fact the LFGM based classification method has fewer parameters than the AVG based one. In general, our proposed DWTDR and LFGM perform well in terms of both classification accuracy and feasibility.

D. Results on Complete Datasets
We first discuss the modulation classification performance of DeepSIG on the above three datasets. We compare DeepSIG with the three classification methods based on sequences, images and graphs described in Sec. III-C, Sec. III-D and Sec. III-E, respectively. The modulation recognition accuracies on the test sets are shown in Table III. Among the three methods using data in a single input form, the graph-based, image-based and sequence-based methods perform best on datasets RML2016.10a, SigData-18 and SigData-36, respectively, while our proposed hybrid framework DeepSIG outperforms all three methods on all three radio signal datasets. Especially on dataset SigData-36, which has the largest number of categories, DeepSIG improves the classification accuracy by nearly 2.5%. Furthermore, Fig. 8 shows the modulation classification accuracy at different SNRs. It is clear that on dataset RML2016.10a, our proposed DeepSIG has a slight advantage at low SNRs such as −12 dB and −8 dB, and its performance at other SNRs is basically the same as that of the other methods. According to the detailed results on dataset SigData-18 shown in Fig. 8(b), the performance of DeepSIG is significantly better than that of the other three separate classification networks, especially at high SNRs above 8 dB. It can be seen from Fig. 8(c) that DeepSIG, which integrates the three separate classification networks, also improves the modulation classification accuracy on dataset SigData-36, especially in the SNR region between −12 dB and 2 dB. To a certain extent, this can be explained as follows: DeepSIG extracts temporal information by processing IQ sequence data through the LSTM, and it also mines the potential image and graph representation features of the original signals by processing the mapped images and graphs. It can make full use of the features at these three different scales to improve the modulation classification accuracy.
To further analyze the recognition performance of these methods on different modulation types, we plot the confusion matrices when SNR = 10 dB as shown in Fig. 9.
As can be seen from Fig. 9(a) to Fig. 9(d), on dataset RML2016.10a, which has a relatively small number of modulation categories, the classification accuracy of these methods is similar, and the differences among the confusion matrices are not obvious. All of the methods tend to misidentify WBFM as AM-DSB. Fig. 9(e) to Fig. 9(h) are the confusion matrices on dataset SigData-18. It can be observed that, compared with the sequence network and the graph network, the image network and DeepSIG can better distinguish 8PAM and 16PAM. The image network tends to misclassify OFDM-QPSK as OFDM-16QAM; in contrast, the graph network and DeepSIG recognize OFDM-QPSK more accurately. As for dataset SigData-36, we can see from Fig. 9(i) to Fig. 9(l) that, compared with the other three separate classification networks, the hybrid heterogeneous DeepSIG improves the recognition of 64QAM, 128QAM and 256QAM. Except for the graph network, the other three classification networks achieve high classification accuracy on OFDM-BPSK.
In general, a deep learning based modulation recognition network can essentially be divided into feature extraction and classification. The former automatically learns a feature representation of the input signal, while the latter performs classification based on the extracted feature vector. Generally speaking, the better the extracted feature representation, the easier the classification. Our proposed hybrid heterogeneous framework DeepSIG includes three different types of neural networks, which handle three different input formats representing the same signal sample and extract three feature vectors that capture the implicit information of the original signal from different aspects. Fusing these three feature vectors yields a richer and more comprehensive feature representation, allowing better classification performance.

E. Results in Few-Shot Scenarios
We now analyze the modulation classification performance of DeepSIG in few-shot scenarios. The simulations are also based on the above three datasets, except that we only use part of the data in the training set to train the network models. We consider three few-shot cases in which training samples are randomly extracted from the original training set with proportions of 0.5%, 1% and 5%, respectively. The simulation results are shown in Table IV and Fig. 10.
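To make the proportions concrete, the few-shot training-set sizes for RML2016.10a (176,000 training samples, 800 per modulation per SNR) work out as follows; exact counts may differ slightly depending on how the random extraction rounds:

```python
# Approximate few-shot training-set sizes for RML2016.10a, using the
# 176,000-sample training set (800 per modulation per SNR) as an example.

full_train = 11 * 20 * 800
subsets = {frac: round(full_train * frac) for frac in (0.005, 0.01, 0.05)}
```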
From Table IV, we can see that in the few-shot scenarios of all three datasets, DeepSIG achieves the best classification performance. Since the number of categories in dataset RML2016.10a is small and it is relatively easy to distinguish these modulation schemes, the improvement in classification accuracy on RML2016.10a is limited. On the other two datasets, i.e., SigData-18 and SigData-36, DeepSIG performs significantly better in the few-shot scenarios in terms of classification accuracy. In the three few-shot scenarios of dataset SigData-18, DeepSIG improves classification accuracy by at least 2% compared with the three separate classification networks; on dataset SigData-36, the performance gain is at least 4%. As can be seen from Fig. 10, DeepSIG achieves higher classification accuracy than the other models at almost all SNRs on datasets SigData-18 and SigData-36. These results show that DeepSIG can fully mine and exploit feature information from different perspectives to improve modulation classification performance in few-shot learning on complex radio signals.
In real modulation recognition scenarios, training samples are often in short supply. In that case, we can represent the same signal sample with the three different input formats described in this paper, and then extract, fuse and exploit signal feature representations from different perspectives to maximize modulation recognition accuracy despite the limited training samples.

V. CONCLUSION
In this paper, we have proposed a hybrid heterogeneous deep learning framework for radio modulation classification, namely DeepSIG. DeepSIG receives input data representing IQ signals in three different formats: sequences, images and graphs. Moreover, DeepSIG integrates network models from the RNN, CNN and GNN families. Simulation results on three radio signal datasets show that our proposed DeepSIG achieves better classification performance than the models based on any single format, and its performance gain is especially obvious in few-shot scenarios.
In the future, we will try other mapping methods to map IQ signals into images and graphs and investigate other RNN, CNN and GNN architectures in DeepSIG.Furthermore, we will conduct over-the-air experiments to further verify the performance of the proposed method.

Fig. 1. The structure of the proposed classification framework DeepSIG.

Fig. 4. The process of getting the matrix by signal reconstruction based on the decomposed components of channel I.

Algorithm 3 The Process of Our Proposed Graph Mapping Method LFGM
Input: The hyperparameter k, IQ signal S_IQ = [(I(0), Q(0)), (I(1), Q(1)), ..., (I(N − 1), Q(N − 1))].
Output: Graph G = ⟨V_N, E⟩.
1 Let the N time sampling points of the IQ signal form the node set V_N of graph G; the node corresponding to the h-th time sampling point is numbered h.
2 for i = 0 to N − 1 do
3 Set the attribute value of the node numbered i to (I(i), Q(i)).
4 for i = 0 to N − 1 do
5 for j = i + 1 to min(i + k − 1, N − 1) do
6 Let the i-th node and the j-th node form an undirected edge.
7 Let all undirected edges form the edge set E of graph G.
8 Return the mapped graph G = ⟨V_N, E⟩.
empty lists wave_I and wave_Q for storing components.
Output: Dual channel narrow edge image M_IQ.
6 Add S^I_{log2 N, L}(n) to list wave_I, and add S^Q_{log2 N, L}(n) to list wave_Q.
7 for j = 1 to 1 + log2 N do
8 Let all components in wave_I and wave_Q except the j-th component be zero vectors, and obtain the sequences subwave^I_j and subwave^Q_j after signal reconstruction.
9 Obtain M_I and M_Q by respectively concatenating subwave^I_j and subwave^Q_j, j ∈ [1, 1 + log2 N].
10 Return the mapped dual channel narrow edge image M_IQ by concatenating M_I and M_Q.

TABLE II COMPARISON OF MODULATION CLASSIFICATION METHODS BASED ON DIFFERENT MAPPING ALGORITHMS

TABLE IV CLASSIFICATION ACCURACY IN FEW-SHOT SCENARIOS