Fully Complex Deep Learning Classifiers for Signal Modulation Recognition in Non-Cooperative Environment

Deep learning (DL) classifiers have significantly outperformed traditional likelihood-based and feature-based classifiers for signal modulation recognition in non-cooperative environments. Despite these improvements, conventional DL classifiers still handle the received signal with its in-phase and quadrature components separated. Although the two components may appear uncorrelated, they are the real and imaginary parts of a signal sample in the complex domain. It may therefore help a classifier to treat the modulated signal as a single complex data array, which carries the mutual information between the two real data arrays. In this paper, we propose two fully complex classifiers, based on a convolutional neural network (CNN) and a residual neural network (ResNet), that operate on the complex data instead of the two separated arrays. First, we organize and define the core complex operations needed to implement complex DL classifiers. Next, we realize the architectures of the proposed classifiers by applying structural optimizations and a regularization technique. Then, we analyze the classifiers' performance from several aspects: (a) the effectiveness of complex signal handling, (b) an in-depth evaluation of classification accuracy, (c) the impact of the data size on performance, and (d) the computational complexity. The experimental results show that the proposed classifiers provide faster learning, higher optimization, and better generalization at an acceptable increase in learning and classification costs. In particular, our approach markedly improves performance on the mutually correlated modulation types (i.e., the phase-related modulations PSK, APSK, and QAM), even on smaller datasets.


I. INTRODUCTION
Signal modulation recognition (SMR) is a technique that classifies the modulation type of an observed signal. It was initially developed for military purposes and is still widely used in various applications, including jamming attacks and radar signal reconnaissance [1], [2]. In civilian applications, SMR is also regarded as one of the important technologies for transmission optimization and interference mitigation in cognitive radio [...] of the frequency spectrum resources [6], [7]. Moreover, modern wireless communication systems have become increasingly complicated by techniques for anti-multipath fading, frequency selection, time-varying channels, etc. Accordingly, a more accurate and robust SMR method is required to cope with the upcoming harsh environments.
The associate editor coordinating the review of this manuscript and approving it for publication was Renato Ferrero.
A few years ago, deep learning (DL) classifiers emerged as a new alternative for SMR. Several DL classifiers [8]-[10] showed promising feasibility by applying well-known DL architectures whose performance had been validated in other fields (e.g., computer vision). In addition, more advanced DL classifiers have outperformed both the traditional likelihood-based (LB) and feature-based (FB) classifiers in noisy environments [11], [12]. Furthermore, the studies in [13]-[16] have shown that their classifiers can discriminate reliably even when the modulation types are complicated and diverse.
However, despite their performance enhancement, the above-mentioned DL classifiers have an overlooked problem in how they handle the dataset. In general, released datasets mostly consist of modulated signals with separated in-phase (I) and quadrature (Q) channel data. Although the two channels may seem unrelated to each other, their values are theoretically the real and imaginary parts of complex signals [17]. Take quadrature phase-shift keying (QPSK) modulation as an example. From Fig. 1, which represents the separated I and Q channels of a QPSK modulated signal, we can identify only straightforward information such as the signal's amplitude, shape, and period; as a result, it is not easy to determine at a glance whether the modulation type is QPSK. On the other hand, if the two separated channels are treated as complex data, we can effectively recognize the modulation type as QPSK through the shape of the constellation diagram or the phase variation, as shown in Figs. 1b and 1c. Unfortunately, conventional classifiers have overlooked the benefits of complex-valued information.
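To make this point concrete, the following minimal sketch (an illustration, not part of the experiments) synthesizes noiseless QPSK symbols, stores them as separate I and Q arrays, and then recombines them into complex samples. The symbol count, the random seed, and all variable names are assumptions chosen for illustration. Viewed separately, each channel takes only two amplitude levels, whereas the recombined phase takes four distinct values, which is what identifies QPSK.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Assumed toy signal: 1,000 unit-energy QPSK symbols with phases pi/4 + k*pi/2.
k = rng.integers(0, 4, size=1000)
symbols = np.exp(1j * (np.pi / 4 + k * np.pi / 2))

# Separated representation, as stored in typical datasets.
i_channel = symbols.real
q_channel = symbols.imag

# Each channel alone shows only two amplitude levels (+/- 1/sqrt(2)).
levels_i = np.unique(np.round(i_channel, 6))
levels_q = np.unique(np.round(q_channel, 6))

# Recombining I and Q exposes the phase, which takes four distinct values.
z = i_channel + 1j * q_channel
phases_deg = np.unique(np.round(np.degrees(np.angle(z)), 3))
```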
In this context, our major motivation is the insight that a classifier's accuracy may be further enhanced if two seemingly unrelated data arrays are regarded as one mutually correlated, meaningful quantity (i.e., a complex number). Another motivation is the set of positive effects reported for complex-valued data handling; the studies in [18], [19] have shown that it can provide faster learning, higher optimization, and better generalization. Although these results were obtained with simple and shallow neural networks, they suggest that the same positive effects may be obtainable in DL.
In this paper, we propose two fully complex-valued classifiers based on convolutional neural network (CNN) and residual neural network (ResNet) architectures. We also aim to provide comprehensive experimental results and careful explanations of how complex-valued data handling benefits DL classifiers.
The main contributions of this study are summarized as follows:
• We newly define the complex-valued (a) max pooling and (b) softmax operations to realize the proposed classifiers; also, (c) the complex-valued gradient-weighted class activation map (Grad-CAM) is introduced to verify the benefit of complex data handling.
• We propose the architectures of the two complex-valued classifiers improving the SMR capability by using the structural optimization and regularization techniques.
• We provide thoughtful analyses and detailed descriptions about the performance of the suggested classifiers in various aspects, such as the effectiveness of the complex signal handling, the in-depth evaluation of the classification accuracy, the impact of the data size on performance, and the computational complexity.
The rest of this paper is organized as follows: Section II introduces recent research trends in SMR. In Section III, the core complex-valued operations are defined. Based on these operations, we propose the complex-valued CNN and ResNet classifiers in Section IV. Then, Section V presents the experimental results and discussions. Finally, the conclusion is given in Section VI.

II. RELATED WORK
Technically, SMR comprises a pre-processing step and a classification step. The purpose of pre-processing is to make a received signal suitable (e.g., by normalization) for the next step. The classification step then extracts significant features and determines the best candidate among the possible modulation types. In general, SMR algorithms are categorized into LB, FB, and DL approaches according to how the classification step is carried out.

A. LIKELIHOOD-BASED SMR
LB SMR is an algorithm that conducts a multiple hypothesis test [7] based on the likelihood function of the observed signal $r$, as shown in Fig. 2. The algorithm assumes a set $\mathcal{M}$ of $C$ candidate modulation types. The LB classifier then estimates the likelihood $L(r \mid H_{M^{(c)}})$ under the hypothesis $H_{M^{(c)}}$ for each modulation type $M^{(c)}$, where $1 \le c \le C$. The decision step compares all the estimated likelihood values and makes a classification decision. An intuitive and straightforward approach is to choose the modulation type $\hat{M}$ with the maximum likelihood, as follows:
$$\hat{M} = \arg\max_{M^{(c)} \in \mathcal{M}} L\left(r \mid H_{M^{(c)}}\right) \tag{1}$$
The advantage of the LB classifier is that it can provide the optimal solution under the following ideal assumptions: (a) all modulation properties are already known, and (b) the estimation of the wireless channel model is perfect. However, employing the LB classifier in a realistic environment is challenging because these ideal conditions cannot be achieved in non-cooperative environments. Moreover, such a classifier inevitably suffers from high computational complexity as the number of classification types increases.
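The maximum-likelihood decision described above can be sketched as follows. This is a minimal illustration under strong assumptions that are not taken from the paper: an AWGN channel with known noise variance, perfect synchronization, equiprobable symbols, and a hypothetical candidate set of three unit-energy constellations. The function names `log_likelihood` and `ml_classify` are likewise illustrative.

```python
import numpy as np

def log_likelihood(r, constellation, noise_var):
    """Log-likelihood of samples r under one hypothesis (equiprobable symbols, AWGN)."""
    # Squared distance between every sample and every constellation point.
    d2 = np.abs(r[:, None] - constellation[None, :]) ** 2
    exponent = -d2 / noise_var
    # Numerically stable log-mean-exp over the constellation points.
    m = exponent.max(axis=1, keepdims=True)
    per_sample = m[:, 0] + np.log(np.mean(np.exp(exponent - m), axis=1))
    return np.sum(per_sample - np.log(np.pi * noise_var))

def ml_classify(r, candidates, noise_var):
    """Return the candidate name with the maximum likelihood and all scores."""
    scores = {name: log_likelihood(r, pts, noise_var) for name, pts in candidates.items()}
    return max(scores, key=scores.get), scores

# Hypothetical candidate set with C = 3 unit-energy constellations.
candidates = {
    "BPSK": np.array([1.0 + 0j, -1.0 + 0j]),
    "QPSK": np.exp(1j * (np.pi / 4 + np.arange(4) * np.pi / 2)),
    "8PSK": np.exp(1j * np.arange(8) * np.pi / 4),
}

rng = np.random.default_rng(1)
noise_var = 0.1  # assumed known; 10 dB SNR for unit-energy symbols
tx = rng.choice(candidates["QPSK"], size=500)
noise = np.sqrt(noise_var / 2) * (rng.standard_normal(500) + 1j * rng.standard_normal(500))
decision, scores = ml_classify(tx + noise, candidates, noise_var)
```

Each additional candidate adds one more full likelihood evaluation over all samples, which illustrates why the cost grows with the number of classification types.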

B. FEATURE-BASED SMR
The FB classifier typically has lower computational complexity than the LB classifier. However, its classification accuracy may be sub-optimal because the features of specific modulation types are handcrafted depending on the expert's experience. Besides, as the number of classification types grows, the system complexity unavoidably increases because more classification processes must run simultaneously. Consequently, it may be challenging to apply the FB classifier in a real-world environment where various modulation types need to be added promptly and classified accurately.

C. DEEP LEARNING SMR
Recently, classifiers based on DL techniques have overcome the limitations of the LB and FB classifiers. Unlike those two classifiers, the DL classifier merges the feature extraction (or likelihood) and decision steps into a single process, as shown in Fig. 2c. The fusion of the two steps in DL provides the following benefits: (a) multi-layer learning from low-level features to high-level abstract features [30] and (b) the capability to handle factors of variation [31].
The early DL approaches in SMR tried to apply well-known architectural concepts that had shown impressive performance in other fields. For example, the DL classifiers in [8]-[10] adopted various models, such as the visual geometry group (VGG) network, convolutional long short-term deep neural network (CLDNN), ResNet, densely connected network (DenseNet), and long short-term memory (LSTM). Although these classifiers were relatively simple, their results were sufficient to show a positive prospect for SMR using DL.
Subsequently, several studies have demonstrated that DL classifiers outperform the traditional LB and FB classifiers in sensitive environments. In [11], the suggested LSTM-based classifier showed better accuracy than the LB classifier over impaired channels (e.g., with frequency offset). Also, the CNN-based classifier in [12] surpassed the HoS-based FB classifiers in the incoherent environment.
VOLUME 10, 2022
Meanwhile, other studies have attempted to handle more signal modulation types accurately. In [13], serial or parallel fusion classifiers based on CNN and LSTM were proposed; these classifiers could handle 11 different modulation types. In addition, a CNN-based classifier capable of recognizing 15 different modulation types was proposed in [14]. Furthermore, [15] proposed a ResNet-based classifier that discriminated among 19 different modulation types. Recently, the notable radioML2018.01a dataset [16] was released. Using this dataset, the ResNet- and CNN-based classifiers suggested in [16] could recognize 24 different modulation types.

III. COMPLEX-VALUED OPERATIONS
This section mathematically represents the core complex-valued operations needed to realize the complex-valued classifiers. In Sub-sections III-B to III-F, the five pre-defined complex-valued operations are organized: the complex-valued full connection (CFC), convolution (CCONV), batch normalization (CBN), ReLU (CReLU), and average pooling (CAP). After that, we propose two new operations, complex-valued max pooling (CMP) and softmax (CSoftmax), in Sub-sections III-G and III-H.

A. COMPLEX DATA REPRESENTATION
In the subsequent sub-sections, an arbitrary complex data array (or matrix) $r$ is represented by a real data array (or matrix) $h$ composed of the real and imaginary parts of $r$, as follows:
$$h = \big[\Re(r),\ \Im(r)\big] \tag{2}$$
Using this representation, all complex-valued operations can be conducted with real data and already well-defined real functions (e.g., matrix multiplication, max, argmax, etc.).

B. COMPLEX-VALUED FULL CONNECTION
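The full connection is one of the most important operations in a DL model. The concept of the real-valued full connection (RFC), described as the product of an input and a trainable parameter (also known as a weight), can be extended to CFC. Let $z$ and $w$ be a complex input and a complex weight, respectively. Then, the complex output $\tilde{z}$ is given by
$$\tilde{z} = wz = \big(\Re(w)\Re(z) - \Im(w)\Im(z)\big) + j\big(\Re(w)\Im(z) + \Im(w)\Re(z)\big) \tag{3}$$
Equation (3) indicates that CFC can be carried out as real-valued matrix multiplications on the real-valued components of $z$ and $w$.
Using the complex data representation in Sub-section III-A, $\tilde{z}$ can be expressed as follows:
$$\big[\Re(\tilde{z}),\ \Im(\tilde{z})\big]^{T} = \begin{bmatrix} \Re(w) & -\Im(w) \\ \Im(w) & \Re(w) \end{bmatrix} \big[\Re(z),\ \Im(z)\big]^{T} \tag{4}$$
where the symbol $T$ denotes the transpose of a data array (or a matrix). The bias terms in (3) and (4) are omitted to simplify the equations without loss of generality.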
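The equivalence between a complex product and real-valued matrix multiplications can be checked numerically. The sketch below is illustrative only: it assumes that the real and imaginary parts are stacked into one real vector, and the helper names `to_real`, `to_complex`, and `cfc` are hypothetical.

```python
import numpy as np

def to_real(r):
    """Stack real and imaginary parts: complex (n,) -> real (2n,)."""
    return np.concatenate([r.real, r.imag])

def to_complex(h):
    """Inverse of to_real: real (2n,) -> complex (n,)."""
    n = h.shape[0] // 2
    return h[:n] + 1j * h[n:]

def cfc(w, z):
    """Complex full connection (bias omitted) using only real matrix products."""
    w_re, w_im = w.real, w.imag
    # Real block matrix that acts on the stacked vector [Re(z); Im(z)].
    block = np.block([[w_re, -w_im],
                      [w_im,  w_re]])
    return to_complex(block @ to_real(z))

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))  # 5 inputs -> 3 outputs
z = rng.standard_normal(5) + 1j * rng.standard_normal(5)

out = cfc(w, z)  # agrees with the native complex product w @ z
```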

C. COMPLEX-VALUED CONVOLUTION
The real-valued convolution (RCONV) is another vital operation for extracting the key features of modulated signals. CCONV extends the RCONV idea, which can be described as a kernel scanning across the input. For $C$ channels, let $z_{i,c}$ be the $i$-th element in the $c$-th channel of an $I \times C$ complex input $z$, where $i = 1, 2, \ldots, I$ and $c = 1, 2, \ldots, C$. Also, for $N$ kernels of length $K$, a convolutional kernel $w$ is defined as an $N \times K \times C$ complex matrix, in which $w_{n,k,c}$ denotes the $k$-th element in the $n$-th kernel of the $c$-th channel, where $n = 1, 2, \ldots, N$ and $k = 1, 2, \ldots, K$. Then, $\tilde{z}_{o,m}$, the $o$-th complex output element in the $m$-th channel, can be given as (5) [32], [33], where $o = 1, 2, \ldots, O$ and $m = 1, 2, \ldots, N$; $P$ and $S$ indicate the padding and stride sizes, respectively; and $O$ is determined from $\frac{I - K + 2P}{S} + 1$. To simplify (5) without loss of generality, the bias components are ignored. Similar to CFC, CCONV can also be realized using matrix multiplications on real data.
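A minimal sketch of how a complex convolution can be assembled from four real-valued convolutions is given below. It assumes the cross-correlation form commonly used in DL (no kernel flip) and zero padding; these conventions, as well as the function names `rconv1d` and `cconv1d`, are assumptions made for illustration.

```python
import numpy as np

def rconv1d(x, w, stride=1, padding=0):
    """Real-valued 1-D convolution (cross-correlation form).
    x: (I, C) input, w: (N, K, C) kernels -> (O, N) output."""
    x = np.pad(x, ((padding, padding), (0, 0)))  # zero padding along the length axis
    n_kernels, k_len, _ = w.shape
    n_out = (x.shape[0] - k_len) // stride + 1
    out = np.zeros((n_out, n_kernels))
    for o in range(n_out):
        window = x[o * stride : o * stride + k_len, :]  # (K, C)
        out[o] = np.einsum("kc,nkc->n", window, w)
    return out

def cconv1d(z, w, stride=1, padding=0):
    """Complex-valued convolution built from four real-valued convolutions."""
    re = rconv1d(z.real, w.real, stride, padding) - rconv1d(z.imag, w.imag, stride, padding)
    im = rconv1d(z.real, w.imag, stride, padding) + rconv1d(z.imag, w.real, stride, padding)
    return re + 1j * im

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 2)) + 1j * rng.standard_normal((16, 2))      # I = 16, C = 2
w = rng.standard_normal((4, 3, 2)) + 1j * rng.standard_normal((4, 3, 2))  # N = 4, K = 3
out = cconv1d(z, w, stride=2, padding=1)  # O = (16 - 3 + 2*1) // 2 + 1 = 8
```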

D. COMPLEX-VALUED BATCH NORMALIZATION
The batch normalization allows a DL model to be trained stably by using the input normalization. Following [33], CBN can be developed from the notion of the real-valued batch normalization (RBN).
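Suppose that $Z = (z_1, \ldots, z_b, \ldots, z_B)$ is the input batch set with batch size $B$, where $z_b$ is the $b$-th complex input of the batch. Then, $\tilde{z}_b$ (i.e., the $b$-th complex output of the batch) can be defined as follows [33]:
$$\tilde{z}_b = \gamma\, V^{-\frac{1}{2}} (z_b - \mu) + \beta \tag{6}$$
where $b = 1, 2, \ldots, B$. In (6), $\mu$ denotes the complex mean of $Z$, and the covariance matrix $V$ is defined by
$$V = \begin{bmatrix} \mathrm{Cov}(\Re(Z), \Re(Z)) & \mathrm{Cov}(\Re(Z), \Im(Z)) \\ \mathrm{Cov}(\Im(Z), \Re(Z)) & \mathrm{Cov}(\Im(Z), \Im(Z)) \end{bmatrix} \tag{7}$$
$\beta$ is the complex trainable shift parameter, and $\gamma$ is the $2 \times 2$ complex trainable scaling matrix, as follows:
$$\gamma = \begin{bmatrix} \gamma_{re,re} & \gamma_{re,im} \\ \gamma_{im,re} & \gamma_{im,im} \end{bmatrix} \tag{8}$$
For the complex normalized output with mean 0 and variance 1, $\gamma$ and $\beta$ are initialized to $\gamma_{re,re} = \gamma_{im,im} = \frac{1}{\sqrt{2}}$, $\gamma_{re,im} = \gamma_{im,re} = 0$, and $\Re(\beta) = \Im(\beta) = 0$.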
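The whitening step of CBN can be sketched as follows for a batch of complex scalars. This is an illustrative sketch under assumptions not stated in the text: the scaling matrix is taken as a real 2 x 2 matrix at its initial value, a small `eps` is added for numerical stability, and the inverse square root of the covariance matrix is computed by eigendecomposition. The function name `cbn` is hypothetical.

```python
import numpy as np

def cbn(z, gamma=None, beta=0j, eps=1e-5):
    """Complex batch normalization over a batch of complex scalars z with shape (B,)."""
    if gamma is None:
        gamma = np.eye(2) / np.sqrt(2)  # gamma_re,re = gamma_im,im = 1/sqrt(2)
    mu = z.mean()
    centered = np.stack([(z - mu).real, (z - mu).imag])  # (2, B)
    # 2x2 covariance matrix of the real and imaginary parts.
    v = centered @ centered.T / z.shape[0] + eps * np.eye(2)
    # Inverse matrix square root via the eigendecomposition of the symmetric V.
    eigval, eigvec = np.linalg.eigh(v)
    v_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    out = gamma @ (v_inv_sqrt @ centered)
    return out[0] + 1j * out[1] + beta

# Toy batch with correlated, shifted real and imaginary parts.
rng = np.random.default_rng(0)
re = rng.standard_normal(4096)
z = (3 * re + 1) + 1j * (0.8 * re + 0.3 * rng.standard_normal(4096) + 2)
out = cbn(z)  # zero mean, decorrelated parts, total variance close to 1
```

With the initial scaling, the real and imaginary parts each have a variance of about 1/2, so the complex output has a variance of about 1.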

E. COMPLEX-VALUED ReLU
One of the most successful activations is the rectified linear unit (ReLU), which retains only the positive values of the input. This real-valued ReLU (RReLU) principle can be used to design CReLU.
Let $z_i$ be the $i$-th element of a complex input $z$ of length $I$, where $i = 1, 2, \ldots, I$. Then, for $z_i$, the $o$-th complex output element $\tilde{z}_o$ can be determined by (9) [32], [33].
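As an illustration, the sketch below assumes the common split form of CReLU, in which a real-valued ReLU is applied separately to the real and imaginary parts; the function name `crelu` and the sample values are chosen for illustration.

```python
import numpy as np

def crelu(z):
    """Apply a real-valued ReLU separately to the real and imaginary parts."""
    return np.maximum(z.real, 0.0) + 1j * np.maximum(z.imag, 0.0)

# One sample per quadrant of the complex plane.
z = np.array([1 + 2j, -1 + 2j, 1 - 2j, -1 - 2j])
out = crelu(z)
```

Under this assumption, only a sample in the first quadrant passes unchanged, while negative real or imaginary parts are set to zero.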

F. COMPLEX-VALUED AVERAGE POOLING
Average and max pooling are among the most representative operations in DL. To realize CAP in the complex domain, the real-valued average pooling (RAP) concept, which averages all entries within the pooling size, can be applied. As in CReLU, let $z_i$ be the $i$-th element of a complex input $z$ of length $I$, where $i = 1, 2, \ldots, I$; also, the input channel size is assumed to be one. Then, given the pooling size $G$ and the stride size $S$, the $o$-th complex output element $\tilde{z}_o$ can be defined as (10) [32], where $o = 1, 2, \ldots, O$, and $O$ can be derived from $\frac{I - G}{S} + 1$. Equation (10) permits CAP to derive the final output from separate real-valued averages. For example, given the input in Fig. 3a, the CAP result becomes the red dots in Fig. 3b.
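A minimal single-channel sketch of CAP, computed from separate real-valued averages, is shown below. The function name `cap1d` and the sample values are illustrative assumptions.

```python
import numpy as np

def cap1d(z, pool, stride):
    """Complex average pooling: average real and imaginary parts per window."""
    n_out = (z.shape[0] - pool) // stride + 1
    re = np.array([z.real[o * stride : o * stride + pool].mean() for o in range(n_out)])
    im = np.array([z.imag[o * stride : o * stride + pool].mean() for o in range(n_out)])
    return re + 1j * im

z = np.array([1 + 1j, 3 - 1j, -2 + 4j, 0 + 0j, 5 + 5j, 1 - 3j])  # I = 6
out = cap1d(z, pool=2, stride=2)  # O = (6 - 2) // 2 + 1 = 3
```

Because averaging is linear, the result equals the complex mean of each window.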

G. PROPOSED: COMPLEX-VALUED MAX POOLING
The real-valued max pooling (RMP) chooses only the maximum value among the pooling entries. When considering the complex extension of RMP, a simple approach is to apply max functions separately to the real and imaginary parts of the complex data. However, CMP with separated max functions results in unavoidable data distortion, as shown in Fig. 3c; i.e., the outputs do not appear to have statistical meaning. To achieve a stable CMP, we utilize a physically meaningful quantity of the complex data: the magnitude of a complex number.
Assume the complex input, the pooling size, and the stride size are identical to those of CAP. Then, a complex output element $\tilde{z}_o$ can be represented as follows:
$$\tilde{z}_o = z_{i^{\ast}}, \quad i^{\ast} = \arg\max_{(o-1)S+1 \,\le\, i \,\le\, (o-1)S+G} |z_i| \tag{11}$$
where $o = 1, 2, \ldots, O$. As shown in Fig. 3d, the magnitude-based CMP output does not introduce signal distortion; also, it corresponds to the original intention of RMP.
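The difference between the magnitude-based CMP and the separated max functions can be seen in a small sketch. The function names `cmp1d` and `separate_max1d` and the sample values are illustrative assumptions.

```python
import numpy as np

def cmp1d(z, pool, stride):
    """Complex max pooling: keep the entry with the largest magnitude per window."""
    n_out = (z.shape[0] - pool) // stride + 1
    out = np.empty(n_out, dtype=complex)
    for o in range(n_out):
        window = z[o * stride : o * stride + pool]
        out[o] = window[np.argmax(np.abs(window))]
    return out

def separate_max1d(z, pool, stride):
    """Naive alternative: independent max over real and imaginary parts."""
    n_out = (z.shape[0] - pool) // stride + 1
    re = np.array([z.real[o * stride : o * stride + pool].max() for o in range(n_out)])
    im = np.array([z.imag[o * stride : o * stride + pool].max() for o in range(n_out)])
    return re + 1j * im

z = np.array([3 + 0j, 0 + 2j, -4 - 1j, 1 + 1j])
by_magnitude = cmp1d(z, pool=2, stride=2)       # entries taken from the input
by_parts = separate_max1d(z, pool=2, stride=2)  # may create values absent from the input
```

In this example, the magnitude-based pooling returns 3 and -4 - j, both of which are input samples, whereas the separated version returns 3 + 2j for the first window, a value that never occurred in the input.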

H. PROPOSED: COMPLEX-VALUED SOFTMAX
Typically, the softmax is the last step of a DL model to normalize the prediction results into the probability distribution; consequently, each output entry of the softmax has a probability between 0 and 1, and the sum of all entries is 1.
We extend the real-valued softmax (RSoftmax) to CSoftmax by applying the magnitude of the complex data as in CMP.
Suppose $z_i$ is the $i$-th element of a complex input $z$ of length $I$, where $i = 1, 2, \ldots, I$. Then, the $o$-th complex output element $\tilde{z}_o$ can be given by (12), where $o = 1, 2, \ldots, I$.
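One plausible reading of a magnitude-based softmax is sketched below. It assumes that the real-valued softmax is applied directly to the magnitudes of the complex entries; this form, like the function name `csoftmax`, is an assumption made for illustration and may differ in detail from the definition used in the experiments.

```python
import numpy as np

def csoftmax(z):
    """Softmax over the magnitudes of a complex vector (assumed form)."""
    mag = np.abs(z)
    e = np.exp(mag - mag.max())  # subtract the maximum for numerical stability
    return e / e.sum()

z = np.array([3 + 4j, -1 + 0j, 0 - 2j])  # magnitudes 5, 1, and 2
p = csoftmax(z)
```

Under this assumption, each output lies between 0 and 1, the outputs sum to 1, and the entry with the largest magnitude receives the highest probability.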

IV. FULLY COMPLEX-VALUED DL CLASSIFIERS
The architectures of the new DL classifiers are described in this section. First, the baseline classifiers are introduced: (a) real-valued CNN (R-CNN) and (b) real-valued ResNet (R-ResNet). Next, we suggest our fully complex-valued classifiers: (c) complex-valued CNN (C-CNN) and (d) complex-valued ResNet (C-ResNet).

A. REAL-VALUED BASELINE CLASSIFIERS
We adopt the R-CNN and R-ResNet suggested in [16] as the baseline classifiers, based on the preliminary comparison of recent real-valued classifiers in Appendix VI. The architecture of R-CNN follows the VGG principle and consists of convolutional and fully connected parts, as shown in Fig. 4a. The convolutional part extracts features of the input signals, from low-level elements to high-level abstractions. Then, the fully connected part maps the extracted features to the modulation types. The other baseline, R-ResNet, is shown in Fig. 4b. According to the ResNet principle, shortcuts are included in the convolutional part; they can mitigate the gradient vanishing or explosion problems that may occur as the number of layers increases [34]. It should be noted that R-CNN and R-ResNet are slightly modified versions of their original structures; in the fully connected parts, RReLUs are employed instead of the real-valued scaled exponential linear units (RSELUs) with alpha dropouts. The reason for the substitution is that extending RSELU to the complex domain is still challenging; a complex version of RSELU has not yet been clearly developed or evaluated, and further studies are needed. On the other hand, CReLU can be defined as described in Sub-section III-E.
As a result of compiling the baseline classifiers, R-CNN and R-ResNet have 159,832 and 165,144 parameters, respectively.

B. COMPLEX-VALUED CNN CLASSIFIER
Next, we propose the C-CNN classifier, which utilizes the mutually correlated information of the dataset more efficiently. As shown in Fig. 4c, C-CNN shares the fundamental architecture of R-CNN. However, it should be noted that the two classifiers handle data in entirely different ways; i.e., the complex-valued operations described in Section III are employed instead of all the real-valued operations.
Also, we adopt two additional structural changes to improve the SMR capability. First, we optimize C-CNN by replacing the max poolings with average poolings (i.e., CAPs) in the convolutional part. In general, the pooling operation provides computational efficiency by reducing the intermediate data sizes through a statistical summary. However, since pooling inevitably entails information loss, this loss may affect the performance of a DL model in various ways, depending on the complicated relationships among the dataset characteristics, the extracted features, and the architecture [35]. To choose a suitable pooling, we conducted a preliminary experiment comparing the classification performance of CMP and CAP on the radioML2018.01a dataset. C-CNN with CAPs showed slightly better performance; accordingly, CAPs are adopted. Second, we apply a regularization technique to C-CNN through batch normalization. Batch normalization provides not only training stability but also a regularization effect [36], because the stochastic jittering (i.e., the small changes in variance and mean) of each batch acts as a regularizer, which helps mitigate overfitting. We place CBNs before the CReLUs that provide the non-linearity (i.e., between CCONV (or CFC) and CReLU). From the newly added CBNs, we can expect better regularization.
In architecture compilation, the number of parameters for C-CNN is 322,800, which is about twice that of R-CNN.

C. COMPLEX-VALUED ResNet CLASSIFIER
This sub-section proposes the other complex-valued classifier: C-ResNet. The detailed architecture of C-ResNet is given in Fig. 4d. C-ResNet also follows the basic architecture of R-ResNet; however, like in C-CNN, all the real-valued operations are replaced by the complex-valued operations.
Moreover, C-ResNet includes four structural modifications to enhance SMR performance. First, as in C-CNN, the pooling optimization is adopted: CAPs. Second, the regularization technique is also applied: CBNs. Third, we further optimize the C-ResNet architecture by modifying the complex-valued residual block (CRB). To find a more effective structure than the real-valued residual block (RRB) in Fig. 4b, we conducted another preliminary experiment on five different types of CRBs, which are the complex versions of the RRB structures illustrated in [37]. According to the comparison result, the CRB structure in Fig. 4d (i.e., CCONV → CBN → CReLU → CCONV → CBN → Addition → CReLU) showed the most stable performance. Lastly, the iterative RCONV(1, 32)s in the convolutional part of R-ResNet are replaced by a single CCONV(1, 32) in C-ResNet. Except for the first CCONV(1, 32) for dimension matching, the repeated ones are no longer required since the two inputs to the summation within the CRBs already have equal dimensions. Removing these unnecessary operations helps reduce the number of learnable parameters.
As an architecture compilation result, C-ResNet requires 324,784 parameters, almost twice that of R-ResNet.

D. COMPLEX-VALUED TRAINABILITY ANALYSIS
The training of a DL model generally consists of two steps: back-propagation and gradient descent. In the back-propagation step, the gradient of the loss function with respect to a weight is determined via the chain rule. Then, in the gradient descent step, the weight is updated using the computed gradient to minimize the loss. This real-valued training rule can be extended to the complex domain by employing the complex-valued chain rule and derivatives of the Wirtinger calculus [38]. A detailed analysis of complex-valued trainability is described in Appendix VI. From the analysis results, we confirm that the proposed classifiers are trainable.
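As a toy illustration of gradient descent in the complex domain, the sketch below fits a single complex weight with a squared-error loss, using the Wirtinger derivative with respect to the conjugate weight as the descent direction. The loss, the learning rate, the target weight, and all variable names are assumptions chosen for illustration and are unrelated to the classifiers' actual training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(256) + 1j * rng.standard_normal(256)  # complex inputs
w_true = 0.5 - 1.5j                                           # hypothetical target weight
t = w_true * z                                                # noiseless targets

w = 0j    # initial complex weight
lr = 0.1  # illustrative learning rate
for _ in range(200):
    err = w * z - t
    # Wirtinger derivative of mean |err|^2 with respect to conj(w).
    grad_conj = np.mean(err * np.conj(z))
    w = w - lr * grad_conj

loss = np.mean(np.abs(w * z - t) ** 2)
```

The real-valued loss decreases along this direction, and the weight converges to the target, which mirrors the real-valued update rule.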

V. EXPERIMENTAL RESULTS AND DISCUSSIONS
In this section, we analyze the performance improvements of the proposed classifiers in various aspects and provide in-depth explanations. Also, thoughtful discussions about the issues of our approaches are presented.

2) TRAINING SETUP
For training the baseline and proposed classifiers, the Adam optimizer [42] with a default learning rate of 0.001 is used. In addition, the default batch size and number of epochs are set to 1,024 samples and 100, respectively.

3) DATASET
As the experimental dataset, radioML2018.01a is employed; this dataset is based on a more practical propagation model with channel impairments such as additive white Gaussian noise (AWGN), carrier frequency offset, symbol rate offset, multipath fading (e.g., Rayleigh fading), and thermal noise [16]. Table 1 shows the attributes of this dataset; to the best of our knowledge, it contains the most diverse signal modulation types, the largest number of signals, and the widest signal-to-noise ratio (SNR) range so far. In this study, we divide the dataset into three sub-datasets: (a) a training dataset, (b) a validation dataset, and (c) a test dataset. We randomly choose the signal samples for each sub-dataset so that all modulation types and SNR values are evenly distributed. The three sub-datasets share no signal samples. By default, each sub-dataset includes 240,000 (240 K) signal samples.
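A balanced, non-overlapping split of this kind can be sketched as follows. The toy label arrays, the group sizes, and the function name `stratified_split` are hypothetical and serve only to illustrate the sampling idea, not the actual dataset layout.

```python
import numpy as np

def stratified_split(mod, snr, per_group, rng):
    """Draw three disjoint index sets with `per_group` signals per (modulation, SNR) pair."""
    train, val, test = [], [], []
    for m in np.unique(mod):
        for s in np.unique(snr):
            idx = rng.permutation(np.flatnonzero((mod == m) & (snr == s)))
            train.extend(idx[:per_group])
            val.extend(idx[per_group : 2 * per_group])
            test.extend(idx[2 * per_group : 3 * per_group])
    return np.array(train), np.array(val), np.array(test)

# Hypothetical toy labels: 4 modulation types x 3 SNR values x 50 signals each.
mod = np.repeat(np.arange(4), 3 * 50)
snr = np.tile(np.repeat(np.array([-10, 0, 10]), 50), 4)

train, val, test = stratified_split(mod, snr, per_group=10, rng=np.random.default_rng(0))
```

Because each group is shuffled once and then sliced, the three index sets cannot overlap, and every modulation type and SNR value is equally represented in each of them.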

B. EFFECTIVENESS OF COMPLEX SIGNAL HANDLING
We experimentally analyze the effectiveness of the complex signal handling of the complex-valued classifiers, including convergence speed, optimization, and generalization performance. Fig. 5 and Table 2 show the training and validation results of the baseline and proposed classifiers on the default training and validation datasets.
First, the proposed classifiers converge faster than the baseline classifiers. As shown in Fig. 5, the validation losses of C-CNN and C-ResNet decrease rapidly and already reach the best loss values of R-CNN and R-ResNet after about ten epochs. Second, the proposed classifiers provide higher optimization results. In general, a DL model with a low loss value tends toward a more optimized solution while avoiding local minima. From Table 2, we can confirm that C-CNN and C-ResNet show lower training and validation loss values. Lastly, the proposed classifiers also offer better generalization to new and unseen data. According to the higher validation accuracies in Table 2, C-CNN and C-ResNet outperform their respective baselines on unseen signals that were never observed during the training process.

C. IN-DEPTH ANALYSES OF SMR ACCURACY
This sub-section presents the detailed SMR performance on the default test dataset, according to SNR and modulation type variations. Fig. 6 shows the test accuracy results of each classifier by SNR; also, Table 3 summarizes these results. First of all, the proposed classifiers show meaningful improvements in the SMR accuracies of more than 5 % on average over all SNRs compared to the baseline classifiers; the accuracy improvements are more noticeable, with gains of about 7 % on average, in the relatively high-quality environment above 0 dB SNR. In addition, C-CNN and C-ResNet outperform the baseline classifiers even in the Top3 accuracies, demonstrating the potential to enhance the classification accuracy. Furthermore, the higher F1 scores indicate that the proposed classifiers provide more balanced performance in multiple signal classification problems.
Next, looking at the test results for each modulation type in Fig. 7, the proposed classifiers show significantly improved performance on the M-PSK, M-APSK, and M-QAM types, which are sensitive to complex-valued information: the phase. The detailed SMR improvements of C-ResNet for the phase-related modulation types are presented in Table 4; refer to Appendix VI for the detailed accuracy plots according to modulation type and the receiver operating characteristic (ROC) curves; also, due to the space limitation, we only present the representative test results of C-ResNet and R-ResNet. When looking at the accuracy results for all SNRs, it can be seen that C-ResNet shows more robust capabilities on the relatively sophisticated modulation types with large M. For example, in QPSK, C-ResNet improves the accuracy of R-ResNet by 2.26 %. On the other hand, the improvement reaches 12.65 % in 32-PSK. Similarly, this trend can be confirmed in the APSK and QAM types with M > 32. In addition, the higher F1 scores of C-ResNet imply that it obtains more stable accuracies; also, the larger area under curve (AUC) values within the ROC plots for all phase-related modulation types show better distinguishability in multiple classification problems.
To further prove how the complex-valued information provides beneficial effects, we newly introduce the gradient-weighted class activation map (Grad-CAM) method in the complex domain: the complex-valued Grad-CAM (C-Grad-CAM). The detailed derivation and analyses of C-Grad-CAM are described in Appendix VI. From the C-Grad-CAM results, we can intuitively confirm that the phase-related elements of the complex signal affect the final decision more significantly, since the newly introduced information is adequately propagated within C-ResNet.
The SMR improvement trends of proposed classifiers also can be seen in the confusion matrix, as shown in Fig. 8. We can see that the confusion matrix diagonals of C-CNN and C-ResNet are more vivid than those of R-CNN and R-ResNet in phase-related types. It implies that the proposed classifiers more accurately distinguish those classes with less confusion.

D. IMPACT OF DATASET SIZE ON PERFORMANCE
The dataset size is a primary factor affecting the performance of a DL model. We would expect a DL model to perform better when more diverse and extensive data are available. However, since it is not easy to obtain enough samples in the real world, a performance analysis according to the dataset size can provide helpful insight for practical situations. For this analysis, we train the proposed and baseline classifiers with three different dataset groups: (a) small, (b) medium, and (c) large. Each group consists of two datasets: the 60 K and 120 K datasets in the small group; the 240 K and 500 K datasets in the medium group; and the 1,000,000 (1 M) and 2 M datasets in the large group. Fig. 9 shows the test results of the proposed and baseline classifiers for the various dataset sizes. C-CNN and C-ResNet achieve better accuracies than R-CNN and R-ResNet for all dataset sizes. In particular, the performance improvement on relatively insufficient datasets is impressive. In the small and medium dataset groups, the proposed classifiers increase the accuracies over all SNRs by 10.32 % and 3.64 % on average, respectively; when considering only SNRs above 0 dB, the average improvements are 16.17 % and 5.22 %, respectively.
Furthermore, we can confirm that the proposed classifiers require only about half the dataset size to attain performance similar to that of the baseline classifiers. For example, the accuracies of C-CNN and C-ResNet at 240 K are comparable to those of R-CNN and R-ResNet at 500 K. We interpret these results in terms of the feature space. According to [43], the required data size of a DL model is closely related to the dimensionality of the feature space; as the dimension of the feature space increases, the volume of the space grows rapidly; thus, since the available data become sparse, the DL model requires more data to achieve acceptable performance. From the data processing point of view, the proposed classifiers may have a lower-dimensional feature space than the baseline classifiers because they extract features in the complex domain instead of from the separate real and imaginary parts of the complex data. Consequently, we believe that the smaller feature space makes the suggested classifiers demand less data. This effect can offer a significant benefit in real-world environments where sufficient data are challenging to obtain.

E. COMPUTATIONAL COMPLEXITY
We analyze the computational complexity of the proposed and baseline classifiers in terms of the learning and classification costs. Table 5 shows the measured learning costs, including the peta floating-point operations (PFLOPs) and the time required to train the classifiers. Looking at the PFLOPs, the proposed classifiers require about four times more computation than the baselines for all datasets. These values seem reasonable when considering that the primary complex-valued operations, such as CFC or CCONV, demand four real multiplications for each complex multiplication. In addition, our classifiers need approximately six times more learning time on average. Nevertheless, we do not regard these increments as a severe drawback, since the learning cost is mainly incurred during the initial training process.
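The factor of four can be made concrete with a short sketch. It expands one complex multiplication into real operations and counts the real multiplications of a dense layer; the function names are hypothetical, and the layer count assumes that the real and complex layers have the same number of input and output units, which need not match the exact configurations behind Table 5.

```python
def complex_mul_real_ops(a_re: float, a_im: float, b_re: float, b_im: float):
    """Multiply (a_re + j*a_im) by (b_re + j*b_im) using only real arithmetic.

    Uses four real multiplications and two real additions/subtractions.
    """
    re = a_re * b_re - a_im * b_im
    im = a_re * b_im + a_im * b_re
    return re, im


def dense_real_mults(n_in: int, n_out: int, complex_valued: bool) -> int:
    """Real multiplications in one dense layer (bias and activation ignored)."""
    per_weight = 4 if complex_valued else 1
    return per_weight * n_in * n_out


if __name__ == "__main__":
    print(complex_mul_real_ops(1.0, 2.0, 3.0, 4.0))  # compare with (1+2j)*(3+4j)
    print(dense_real_mults(128, 64, True) / dense_real_mults(128, 64, False))
```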
On the other hand, the classification costs can be a critical burden because they directly affect the signal recognition interval in practical applications. Table 6 shows the classification costs: the mega floating-point operations (MFLOPs) and the time per randomly selected signal. As expected, the classification FLOPs of the proposed classifiers are about four times higher than those of the baseline classifiers. As with the learning cost, the increase in the classification cost cannot be avoided because of the additional complex-valued operations.
However, the measured classification time needs to be assessed carefully from the viewpoint of the applicability of the proposed classifiers to the real world. We believe that a few milliseconds is enough time to recognize the modulation types in real time for the following reasons: (a) the minimum allowable measurement period of the wireless link quality (WLQ) is as long as 2 milliseconds in long term evolution (LTE) [44], which is one of the fastest and most sophisticated modern communication systems, and (b) at least in LTE, a few milliseconds of SMR time may be enough to recognize dynamically changing modulation types without omission. Consequently, it can be said that the proposed classifiers are suitable for most real-time SMR applications.

F. DISCUSSIONS
So far, we have described that the proposed data handling approach gives the following significant benefits: (a) faster learning speed due to the improved optimization, (b) higher recognition accuracy along with better generalization, and (c) lower dataset demand. However, despite these benefits, several issues are worth discussing:

1) NO ACCURACY GAIN AT LOW SNR
As shown in Fig. 6, the proposed classifiers show no gain over the baseline classifiers at low SNR (i.e., approximately below 0 dB). However, it should be noted that this limitation is not unique to our case but is one of the main challenges of SMR. For signals below 0 dB SNR, the noise is stronger than the pure modulated signal; such corrupted data inevitably disrupt the proper training process of the classifier. In the same context, the proposed classifiers cannot easily overcome the unsolved problem of low accuracy in noisy environments, even when using complex numbers. A possible improvement approach may be to block the incoming noise in the pre-processing stage (as shown in Fig. 2c) by adopting techniques such as frequency filtering or common-mode time noise reduction.
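The meaning of the 0 dB threshold can be checked with a small calculation. The sketch below uses the standard definition SNR(dB) = 10 log10(P_signal / P_noise); the function name is hypothetical, and the unit signal power is chosen only for illustration.

```python
def noise_power(signal_power: float, snr_db: float) -> float:
    """Noise power implied by a signal power and an SNR given in dB,
    from SNR(dB) = 10 * log10(signal_power / noise_power)."""
    return signal_power / (10.0 ** (snr_db / 10.0))


if __name__ == "__main__":
    # At 0 dB the noise power equals the signal power; below 0 dB it exceeds it.
    for snr_db in (10.0, 0.0, -10.0):
        print(snr_db, noise_power(1.0, snr_db))
```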

2) NO ACCURACY GAIN IN AM TYPES
The proposed classifiers do not show a noticeable performance gain for the amplitude modulation (AM) types (i.e., AM-DSB-WC, AM-DSB-SC, AM-SSB-WC, and AM-SSB-SC), as shown in Fig. 7. The root cause of this issue seems to be the underlying nature of AM. In detail, only the amplitude (not the frequency or phase) of the carrier wave is varied in proportion to that of the message signal [45]; as a result, a pure modulated AM signal is composed of only real parts without any imaginary parts. Consequently, it is no wonder that the real-valued-only modulation types do not benefit from the proposed complex data handling. To overcome this limitation, we need to consider a different perspective: hybrid learning of both the time and frequency domains. In other words, a complex time/frequency-based classifier may be a helpful alternative. We plan to address it as a future research topic.
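The contrast between a real-valued AM baseband and a phase-modulated one can be sketched as follows. This is a simplified illustration restricted to the double-sideband with-carrier case under ideal carrier and phase recovery and without noise (assumptions that the dataset signals do not satisfy exactly); all function names and the modulation index are hypothetical.

```python
import cmath
import math


def am_dsb_wc_baseband(message, mod_index=0.5):
    """Ideal AM-DSB-WC complex baseband: a real envelope 1 + m * message(t)."""
    return [complex(1.0 + mod_index * m, 0.0) for m in message]


def qpsk_baseband(symbol_indices):
    """Ideal QPSK symbols at phases pi/4 + k * pi/2, for k in {0, 1, 2, 3}."""
    return [cmath.exp(1j * (math.pi / 4 + k * math.pi / 2)) for k in symbol_indices]


def imag_energy_fraction(samples):
    """Share of the total signal energy carried by the imaginary (quadrature) part."""
    total = sum(abs(s) ** 2 for s in samples)
    return sum(s.imag ** 2 for s in samples) / total


if __name__ == "__main__":
    message = [math.sin(2 * math.pi * n / 16) for n in range(64)]
    print(imag_energy_fraction(am_dsb_wc_baseband(message)))   # no quadrature energy
    print(imag_energy_fraction(qpsk_baseband([0, 1, 2, 3])))   # about half the energy
```

In this idealized setting the quadrature component of the AM signal carries no energy, so joint processing of the two components has nothing extra to exploit, whereas QPSK splits its energy evenly between them.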

3) ACCURACY GAIN SATURATION AT LARGE DATASET GROUP
In the average accuracy comparison result (Fig. 9), the performance gap between the proposed and baseline classifiers gradually decreases as the dataset size increases. This result can be explained by the performance convergence effect of DL models [46], in which continuous growth of the dataset size leads to accuracy convergence. Observing the accuracy results of the baseline classifiers, the performance improves as the dataset size increases up to about 500 K. On the other hand, in the large dataset group (1 M and 2 M), the accuracies for SNRs above 0 dB roughly saturate at about 0.9. Following the trends of the baseline classifiers, similar saturation also occurs in the proposed complex-valued classifiers, which share the fundamental structures of the baselines. However, it should be noted that our classifiers outperform the baseline classifiers on all the dataset sizes, although the accuracy gains become smaller. Furthermore, the purpose of presenting the performance saturation outcomes is to provide a more in-depth analysis that considers diverse user circumstances. In practice, it may be much more challenging, or nearly impossible, to collect datasets large enough to reflect the real world. Once again, we emphasize that our classifiers show a larger improvement on normal-scale datasets under practical circumstances.

VI. CONCLUSION
In this study, we demonstrated that proper complex-valued handling of wireless signal data can provide an opportunity to improve SMR performance. In particular, the proposed classifiers significantly improved the recognition accuracies even with relatively small datasets, within an acceptable classification time. Given the diverse experimental analyses and the carefully reasoned explanations of their causes, we believe that the proposed classifiers can play an important role in non-cooperative SMR applications, such as recent sophisticated wireless environments. Finally, our study re-emphasizes the well-known insight that a careful and deep understanding of how data are handled is highly important and may be a prerequisite for an optimal solution.

APPENDIX A PERFORMANCE COMPARISON OF RECENT DL CLASSIFIERS
Based only on the accuracy results given in [11]-[16], it is hard to compare the conventional DL classifiers directly, since their performance was tested in different experimental environments (e.g., signal modulation types, signal length, and signal quality). To achieve overall consistency, we present a performance comparison of the following four representative structures among the conventional DL classifiers: LSTM-based [11], fusion-based [13], CNN-based [16] (R-CNN), and ResNet-based [16] (R-ResNet) classifiers. During training and testing of each classifier, identical training, validation, and test datasets are employed; these sub-datasets are selected from the radioML2018.01a dataset. According to the accuracy comparison result in Fig. 10, the ResNet-based classifier shows the best performance, followed by the CNN-based classifier.

APPENDIX B COMPLEX-VALUED BACK-PROPAGATION AND GRADIENT DESCENT
The Wirtinger calculus provides a general partial derivative method and chain rule for a differentiable or non-differentiable complex-valued function. Suppose a complex variable z = x + jy and its complex conjugate $z^*$.
Then, the differentiation of a complex-valued function f is defined as the partial derivative pair (i.e., the R-derivative and the conjugate R-derivative) with respect to z and $z^*$ as in (B-1) [38].
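For reference, the partial derivative pair in its standard textbook form is given below. This is the usual statement of the Wirtinger derivatives for z = x + jy; it is offered as a reminder of the standard definition, and the notation of (B-1) may differ from it.

```latex
\frac{\partial f}{\partial z}
  = \frac{1}{2}\left(\frac{\partial f}{\partial x} - j\,\frac{\partial f}{\partial y}\right),
\qquad
\frac{\partial f}{\partial z^{*}}
  = \frac{1}{2}\left(\frac{\partial f}{\partial x} + j\,\frac{\partial f}{\partial y}\right)
```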
Also, the complex-valued chain rule for a composite function f(g), where f and g are complex-valued functions, is given in (B-2) [38].
In the training stage of the proposed complex-valued classifiers, we employ a real-valued loss function such as the cross-entropy. The use of the real-valued loss function is rational for the following reasons: (a) the last layer of the proposed classifiers is Csoftmax (i.e., its final prediction outputs are real-valued probabilities between 0 and 1); (b) the true labels of the dataset are one-hot encoded real-valued data; thus, (c) the loss function using these prediction and truth data also becomes real-valued.
According to [47], the complex-valued gradient of the real-valued loss function L for z can be defined as in (B-3). In (B-3), it is significant to recognize that the gradient of L is related only to the conjugate R-derivative instead of the partial derivative pair. In other words, this property indicates that the multi-layer back-propagation for the complex-valued training can be achieved in the form of a one-sided derivative (i.e., the conjugate R-derivative).
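The role of the conjugate R-derivative in gradient descent can be checked numerically with a minimal sketch. It estimates both Wirtinger derivatives by central differences and then minimizes a real-valued quadratic loss using only the conjugate derivative. The function names, the toy loss, and the step size are hypothetical; the factor 2 in the update is one common convention (it can be absorbed into the learning rate) and is an assumption, not necessarily the scaling used in (B-3).

```python
def wirtinger_derivatives(f, z, h=1e-6):
    """Central-difference estimates of (df/dz, df/dz*) at the point z.

    Uses df/dz = (df/dx - j*df/dy)/2 and df/dz* = (df/dx + j*df/dy)/2.
    """
    df_dx = (f(z + h) - f(z - h)) / (2 * h)
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    d_z = 0.5 * (df_dx - 1j * df_dy)
    d_zconj = 0.5 * (df_dx + 1j * df_dy)
    return d_z, d_zconj


def quadratic_loss(z, target=1 + 2j):
    """Toy real-valued loss |z - target|^2 with its minimum at `target`."""
    return abs(z - target) ** 2


def descend(z0, lr=0.1, steps=200):
    """Gradient descent that uses only the conjugate derivative of the loss."""
    z = z0
    for _ in range(steps):
        _, d_zconj = wirtinger_derivatives(quadratic_loss, z)
        z = z - lr * 2 * d_zconj  # assumed convention: gradient = 2 * dL/dz*
    return z


if __name__ == "__main__":
    print(wirtinger_derivatives(lambda z: abs(z) ** 2, 1 + 2j))
    print(descend(0j))
```

For the loss |z|^2 the two estimates approximate z* and z, respectively, and the descent converges to the target even though only the conjugate derivative is used.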
In detail, the complex-valued back-propagation can be derived from the complex-valued chain rule of (B-2). Assume that a complex-valued function with a complex weight w is z = g(z; w) = r(x, y; $\Re(w)$, $\Im(w)$) + js(x, y; $\Re(w)$, $\Im(w)$), where r and s are real-valued functions. Then, the partial derivative of L with respect to $z^*$ can be determined as:
convolutional layer and the gradient from the feature map to the last class prediction. We can extend the concept of the real-valued Grad-CAM in [48] to the complex domain. For any class c, we can define the C-Grad-CAM $L^c_{\text{C-Grad-CAM}}$ as follows:
where $A^n$ is the complex-valued output at the n-th channel of the target convolutional layer. Also, $\alpha^c_n$ denotes the gradient mean of the complex-valued softmax input magnitude $|y^c|$ with respect to $A^n$ as in (D-2).
where $A^n_i$ and Z are the i-th element and the length of $A^n$, respectively. Fig. 14 shows the Grad-CAM results on the last convolutional layers of R-ResNet and C-ResNet under QPSK input signals of 30 dB and 4 dB SNRs, respectively. The four dotted circles (in purple) in the input constellation diagram indicate the QPSK symbols related to the phase information. In the Grad-CAM results, we highlight in red the elements of the input signal that affect the final decision; the stronger the influence, the deeper the color.
As shown in Fig. 14a, R-ResNet and C-ResNet both identify the QPSK type correctly with high accuracies (99.88 % and 99.98 %, respectively) for the high-quality (i.e., 30 dB SNR) signal. However, for the relatively low-quality (i.e., 4 dB SNR) signal, as shown in Fig. 14b, it is hard to conclude that R-ResNet learns the features of the input signal sufficiently, since it misclassifies the QPSK input signal as 8-PSK with 73.82 % accuracy. On the other hand, the input elements related to the phase information of QPSK are emphasized more strongly in C-ResNet; that is, they influence the final decision more powerfully.
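The channel weighting behind these maps can be sketched for the real-valued Grad-CAM of [48], which the complex version extends. The sketch covers only the real-valued form (channel weights as gradient means, followed by a weighted sum and a ReLU) on a one-dimensional feature map; the equations of the complex extension are not reproduced here, and the function name and the plain-list inputs are hypothetical, with gradients assumed to be supplied by some external framework.

```python
def grad_cam_1d(feature_maps, gradients):
    """Real-valued Grad-CAM on 1-D feature maps.

    feature_maps[n][i]: activation of channel n at position i.
    gradients[n][i]:    d(class score) / d(feature_maps[n][i]).
    Returns one non-negative importance value per position.
    """
    length = len(feature_maps[0])
    # Channel weight: mean of the gradients over the channel's positions.
    alphas = [sum(g) / len(g) for g in gradients]
    cam = []
    for i in range(length):
        weighted = sum(a * fm[i] for a, fm in zip(alphas, feature_maps))
        cam.append(max(weighted, 0.0))  # ReLU keeps only positive evidence
    return cam


if __name__ == "__main__":
    maps = [[1.0, 2.0], [3.0, -4.0]]
    grads = [[0.5, 0.5], [1.0, -1.0]]
    print(grad_cam_1d(maps, grads))
```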