POLSAR Target Recognition Using a Feature Fusion Framework Based on Monogenic Signal and Complex-Valued Nonlocal Network

With the continuous development of synthetic aperture radar (SAR) systems, multipolarization information has been increasingly applied in numerous fields, and automatic target recognition (ATR) in polarimetric SAR (POLSAR) has been recognized as a vital problem. SAR recognition methods fall primarily into handcrafted feature-based algorithms and deep learning algorithms. The former exhibit excellent interpretability but insufficient generalization; the latter achieve stronger representational ability but rely on a considerable number of samples. To solve the above problems, a feature fusion framework is proposed in this article based on the monogenic signal and a complex-valued nonlocal network (CVNLNet) for POLSAR target recognition. The proposed feature fusion framework effectively uses the complementarity of handcrafted features and deep features, while making up for the disadvantages of single feature-based methods. First, a Mono-BOVW model is proposed based on the monogenic signal and the bag-of-visual-words (BOVW) model to extract handcrafted features, which can more fully mine the information contained in POLSAR data in multiscale space. Moreover, CVNLNet is built for deep feature extraction to use both the amplitude and phase contained in POLSAR data. Next, a kernel discriminant correlation analysis (KDCA) algorithm is proposed to jointly analyze and transform the two features, so as to remove redundant information while retaining effective and discriminative information. Experiments on the MSTAR dataset and the GOTCHA dataset show that the proposed framework has superior performance on single polarimetric and fully polarimetric datasets.


I. INTRODUCTION
SYNTHETIC aperture radar (SAR) plays a vital role in real-time earth observation due to excellent characteristics such as all-day and all-weather operation, and is widely used in tasks such as disaster monitoring [1], environmental protection [2], resource detection [3], meteorological observation [4], and others [5]-[8]. Compared with single polarimetric SAR, the fully polarimetric SAR (POLSAR) system measures not only the amplitude of the image but also the relative phase between different polarization channels [9]. Accordingly, POLSAR has been widely applied in various earth observation applications (e.g., target detection [10] and terrain classification [11]). Besides, several studies [12]-[14] have found that POLSAR has considerable application potential and value in stationary ground target recognition.
In the field of SAR automatic target recognition (SAR ATR), it is of critical significance to design a set of well-performing feature extraction and classification algorithms [15]. First, extracting features that effectively characterize the target is the premise of correct subsequent classification. Generally, the features can be divided into two types: handcrafted features and deep features [16].
Handcrafted features are extracted from images by experts based on human perception and accumulated experience. They generally have specific physical meanings, such as computer vision features [17]-[21], electromagnetic characteristics [22], polarization characteristics [23], [24], and some special features like the monogenic signal. However, due to speckle noise and other factors, many vision features may perform poorly when directly transferred to POLSAR images [25], while electromagnetic or polarization characteristics tend to focus on different specific scattering mechanisms and generally require a combination of features [23]. The monogenic signal [26], an extended representation of the analytic signal in higher dimensions, has attracted rising attention. Dong et al. [27]-[29] and Zhou et al. [30] used the multiscale components extracted from the monogenic signal for SAR recognition, and both achieved better recognition accuracy than traditional features. In brief, handcrafted features are interpretable and therefore less affected by the number of samples, but they over-rely on human experience and lack generalization. For POLSAR data with complex scattering mechanisms, it is a challenge to manually design excellent features.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Unlike handcrafted features, deep networks can automatically learn effective features from data, which also means that they rely on a considerable number of samples. Deep networks can be grouped into real-valued (RV) networks and complex-valued (CV) networks. RV networks are based on RV representations and calculations (e.g., A-ConvNet [31], DCC-CNNs [32], and other networks [33]-[37]), which only use the amplitude while ignoring the phase of SAR data. Therefore, researchers introduced CV networks [38]. Zhang et al. [39] proposed a CV-CNN, Mullissa et al. [40] designed PolSARNet, and Tan et al. [41] and Zhang et al. [42] used CV-3D-CNN for hierarchical feature extraction from POLSAR data. These networks are designed for POLSAR terrain classification rather than POLSAR target recognition. For SAR target recognition, Yu et al. [43] constructed a CV fully convolutional neural network (CV-FCNN), and Scarnati et al. [44] evaluated the performance of several different complex-valued neural networks (CVNNs). In brief, deep features show great advantages in expressing deep abstract semantics and spatial structure, but they generally lack interpretability and are easily affected by the number of samples. However, the sample size of SAR datasets is generally small due to the difficulty of collection. Accordingly, how to apply deep learning robustly to POLSAR target recognition is worth studying in depth.
In order to make up for the deficiencies of a single feature under diverse and complex conditions, researchers have found that a reasonable combination of different types of features can greatly enhance the performance of image processing [45]. The fusion strategies based on multiple handcrafted features [46]-[51] or multiscale deep features [52]-[55] have been confirmed to be effective. Furthermore, some fusion strategies based on handcrafted features and deep features, which have been shown to have certain complementary properties, have also emerged [56]-[58]. Jia et al. [59] fused the features from principal component analysis (PCA) and CNN. Zhang et al. [60] fused the features from the electromagnetic scattering center and MVGGNet. Feng et al. [61] developed a fusion method based on an integration parts model and a deep learning algorithm.
Some challenges remain in the fusion process of different types of features.
1) Different features may have different spatial dimensions.
2) The original feature information may be destroyed in the fusion process.
3) With the increase of feature types, the computation increases after fusion.
Accordingly, a suitable feature fusion algorithm is critical in multifeature fusion [62]. The classic fusion algorithms are serial fusion and parallel fusion [63], which are simple to operate but fail to maximize the complementary advantages of features and are prone to redundancy. Thus, fusion algorithms based on linear transformation have been proposed (e.g., PCA [64], linear discriminant analysis [65], canonical correlation analysis [66], and discriminant correlation analysis (DCA) [60]). Furthermore, to analyze the nonlinear relationship between different features, the kernel function [67] is introduced into the linear transformation (e.g., KPCA [68], KFDA [69], and KCCA [70]).
As revealed by the above works, it is difficult to characterize the POLSAR target comprehensively and accurately using only handcrafted feature-based algorithms, while using only end-to-end deep learning algorithms is highly sensitive to the number of samples. Inspired by the above fusion strategies, we propose an efficient feature fusion framework for POLSAR target recognition to fully use the complementary advantages of handcrafted features and deep features. First, we construct a Mono-BOVW model to extract handcrafted features based on the intrinsic properties of images, which are robust and less affected by the number of samples. The monogenic signal model is capable of using a multiscale space to extract richer information (amplitude, orientation, and phase) from SAR data, but its direct use often leads to excessive computation due to the high feature dimension. Thus, we introduce a bag-of-visual-words (BOVW) model to obtain stable low-dimensional features. Meanwhile, we construct a complex-valued nonlocal network (CVNLNet) to extract deep features with stronger representational ability, which uses both the amplitude and phase contained in SAR data. Furthermore, the CV nonlocal block is capable of capturing long-range dependencies. The advantages of the extracted handcrafted features and deep features compensate for the shortcomings of each other. Lastly, to avoid redundancy in the fusion features while retaining as much effective and highly discriminative information as possible, we propose a fusion algorithm based on kernel discriminant correlation analysis (KDCA). The kernel function is introduced into the DCA algorithm to facilitate linear correlation analysis and dimension reduction of nonlinearly correlated features. Thus, in the fusion process, the features from the same category of targets have the most significant correlation, and the features from different categories have the most significant distinction.
On that basis, more discriminative and more robust fusion features with lower dimensions can be obtained to increase the accuracy of SAR target recognition. The effectiveness and superiority of the proposed framework are verified on the MSTAR dataset and the GOTCHA dataset.
The main contributions and innovations of the proposed feature fusion framework are elucidated as follows.
1) A Mono-BOVW model is proposed for handcrafted feature extraction.
2) A CVNLNet is proposed for deep feature extraction.
3) A KDCA algorithm is proposed for feature fusion.
The rest of this article is organized as follows. Section II introduces the proposed feature fusion framework for POLSAR target recognition. Section III presents experiments and discussions. Section IV succinctly concludes this article.

II. FEATURE FUSION FRAMEWORK BASED ON MONOGENIC SIGNAL AND COMPLEX-VALUED NONLOCAL NETWORK
Fig. 1 presents the overall architecture of the proposed feature fusion framework, which mainly comprises three parts: preprocessing, extraction and fusion of handcrafted features and deep features, and classification.
In the first part, for better analysis and classification, the SAR image is preprocessed by normalization [39]. The Z-score function serves as the normalization algorithm, defined as $x^{*} = (x - \bar{x})/\sigma$, where $x$ denotes an image, and $\bar{x}$ and $\sigma$ denote the mean and the standard deviation of $x$, respectively. For POLSAR, each channel is normalized independently.
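The per-channel Z-score step can be sketched as follows. This is a minimal illustration only: the function name `zscore_per_channel` and the (height, width, channel) layout are assumptions, and the text does not state whether the statistics are computed on complex samples or on amplitudes. Here NumPy's definition is used, in which the standard deviation of complex data is taken over absolute deviations.

```python
import numpy as np

def zscore_per_channel(image):
    """Normalize each channel of an (H, W, C) array independently.

    Works for real or complex input (assumption: statistics are taken
    directly on the stored samples, one channel at a time).
    """
    mean = image.mean(axis=(0, 1), keepdims=True)   # per-channel mean
    std = image.std(axis=(0, 1), keepdims=True)     # per-channel std (real, >= 0)
    return (image - mean) / std

# Hypothetical fully polarimetric chip with three complex channels
rng = np.random.default_rng(0)
chip = rng.normal(size=(8, 8, 3)) + 1j * rng.normal(size=(8, 8, 3))
normalized = zscore_per_channel(chip)
```

After this step every channel has zero mean and unit standard deviation, so channels with very different dynamic ranges contribute on a comparable scale.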
The second part includes three modules: handcrafted feature extraction based on the Mono-BOVW model, deep feature extraction based on CVNLNet, and feature fusion based on KDCA. First, we use the monogenic components generated by decomposition in different scale spaces to express the target scattering mechanism of SAR images, and then extract lower dimensional mid-level semantic features with the BOVW model. Meanwhile, we construct CVNLNet by inserting the CV nonlocal block into the CV residual network (ResNet) to extract deep features. Subsequently, we use the KDCA algorithm to perform correlation analysis and transformation on the extracted handcrafted features and deep features for better fusion.
In the third part, the fusion features are fed into a classifier for training and classification. Since the support vector machine (SVM) classifier achieves high performance with a small number of samples, it serves as the classifier in this article.

A. Handcrafted Feature Extraction Based on Mono-BOVW
In the present section, the architecture of the Mono-BOVW model proposed for handcrafted feature extraction is elucidated, as presented in Fig. 2.
First, multiscale monogenic features are extracted from all training images. The monogenic signal is a two-dimensional (2-D) analytic signal, which describes the local amplitude, local orientation, and local phase information of the image in a rotation-invariant manner. It is based on the Riesz transform, which is a 2-D extension of the Hilbert transform that retains the important properties of the 1-D analytic signal. The Riesz transform spatial kernel function $(h_x, h_y)$ at any point $(x, y)$ in the 2-D signal space could be expressed as follows:

$$(h_x, h_y) = \left(\frac{x}{2\pi (x^2 + y^2)^{3/2}},\ \frac{y}{2\pi (x^2 + y^2)^{3/2}}\right) \tag{1}$$

Since the Fourier spectrum period of the image is infinitely long, it is necessary to extend the input image infinitely by means of a bandpass filter, and then perform the Riesz transform. This article uses a Log-Gabor filter to achieve bandpass filtering. If the input image is $I_0$, the general form of the 2-D monogenic signal $I_M$ could be expressed as follows:

$$I_M = (I, I_x, I_y) = (I,\ h_x * I,\ h_y * I), \qquad I = I_0 * F^{-1}(G(\omega)) \tag{2}$$

where $I$ denotes the extension of $I_0$; $I_x$ and $I_y$ denote the Riesz transform of $I$ in the x and y direction, respectively. The operator "$*$" denotes convolution, $F^{-1}$ denotes the inverse Fourier transform, and $G(\omega)$ denotes the frequency response of the Log-Gabor filter, which could be defined as follows:

$$G(\omega) = \exp\left\{-\frac{[\log(\omega/\omega_0)]^2}{2[\log(\sigma)]^2}\right\}, \qquad \omega_0 = \left(\lambda_{\min}\,\mu^{s-1}\right)^{-1}, \quad s = 1, \dots, S \tag{3}$$

where $s = 1, \dots, S$ indexes the scale space of the monogenic signal, $\omega_0$ denotes the central frequency, $\sigma$ denotes the broadband proportional factor, $\lambda_{\min}$ denotes the minimum wavelength, and $\mu$ denotes the wavelength multiplication coefficient. Next, the local amplitude $A$, the local orientation $\theta$, and the local phase $P$ of the input image could be defined as follows:

$$A = \sqrt{I^2 + I_x^2 + I_y^2}, \qquad \theta = \arctan\left(\frac{I_y}{I_x}\right), \qquad P = \operatorname{atan2}\left(\sqrt{I_x^2 + I_y^2},\ I\right)$$

where $A$, $\theta$, and $P$ contain the local energy information, the local geometric information, and the local structure information, respectively.
Based on the S-scale Log-Gabor filter, $\{I_M^1, I_M^2, \dots, I_M^S\}$ denotes the monogenic signal at different scales, and the corresponding monogenic components are as follows:

$$\{A^1, A^2, \dots, A^S\}, \qquad \{\theta^1, \theta^2, \dots, \theta^S\}, \qquad \{P^1, P^2, \dots, P^S\}$$

When $S = 3$, a SAR image can be characterized by three local amplitude maps, three local orientation maps, and three local phase maps. Then, we expand these feature maps into a long vector to form a monogenic feature vector, as shown in Fig. 2.
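The multiscale decomposition described above can be sketched in the frequency domain as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the Riesz frequency response is taken as $-j(u, v)/\sqrt{u^2 + v^2}$ (sign conventions vary between references), the input is assumed to be a real-valued amplitude image, and `np.arctan2` is used for the orientation to avoid division by zero. The default parameters follow the values reported later in the experimental setup (S = 3, σ = 0.48, λ_min = 8, μ = 2.5).

```python
import numpy as np

def monogenic_components(image, n_scales=3, min_wavelength=8.0, mult=2.5, sigma=0.48):
    """Return [(A, theta, P), ...] for each scale of a real 2-D image."""
    rows, cols = image.shape
    u = np.fft.fftfreq(cols)[None, :]          # horizontal frequencies
    v = np.fft.fftfreq(rows)[:, None]          # vertical frequencies
    radius = np.sqrt(u ** 2 + v ** 2)
    radius[0, 0] = 1.0                          # avoid log(0) and 0/0 at DC
    riesz_x = -1j * u / radius                  # assumed sign convention
    riesz_y = -1j * v / radius
    spectrum = np.fft.fft2(image)

    components = []
    for s in range(n_scales):
        w0 = 1.0 / (min_wavelength * mult ** s)             # central frequency
        log_gabor = np.exp(-np.log(radius / w0) ** 2 / (2 * np.log(sigma) ** 2))
        log_gabor[0, 0] = 0.0                                # band-pass: no DC
        band = spectrum * log_gabor
        i_bp = np.fft.ifft2(band).real                       # band-passed image I
        i_x = np.fft.ifft2(band * riesz_x).real              # Riesz part, x direction
        i_y = np.fft.ifft2(band * riesz_y).real              # Riesz part, y direction
        amplitude = np.sqrt(i_bp ** 2 + i_x ** 2 + i_y ** 2)
        orientation = np.arctan2(i_y, i_x)
        phase = np.arctan2(np.sqrt(i_x ** 2 + i_y ** 2), i_bp)
        components.append((amplitude, orientation, phase))
    return components

def monogenic_feature_vector(image, **kwargs):
    """Flatten all amplitude, orientation, and phase maps into one long vector."""
    comps = monogenic_components(image, **kwargs)
    return np.concatenate([m.ravel() for triple in comps for m in triple])

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32))                 # stand-in for an amplitude chip
vec = monogenic_feature_vector(img)             # 9 maps of 32 x 32 values
```

With three scales the vector holds nine maps per image, which illustrates why the raw monogenic feature is high-dimensional and motivates a subsequent dimension-reducing step.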
Subsequently, the BOVW model is adopted to perform statistical analysis on the distribution of monogenic feature vectors as the input feature descriptors (shown in Fig. 2). First, we make clusters from the descriptors. In the specific implementation, K-means is selected as the clustering algorithm. The center of each cluster will serve as a word of the visual dictionary. Next, for each image, a frequency histogram is built according to the visual vocabulary and the frequency of the words contained in this image. Then, the histogram is encoded to form the final feature vector.
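The dictionary-building and histogram-encoding steps can be sketched as below. It is a toy illustration under several assumptions: a plain Lloyd-style K-means written in NumPy stands in for whatever K-means implementation is actually used, each image is assumed to contribute a set of descriptor vectors, and the histogram is simply L1-normalized because the encoding scheme is not specified in the text. The names `kmeans` and `bovw_histogram` are hypothetical.

```python
import numpy as np

def _sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def kmeans(descriptors, k, n_iter=20, seed=0):
    """Very small K-means; returns the k cluster centers (the visual words)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[start].copy()
    for _ in range(n_iter):
        labels = _sq_dists(descriptors, centers).argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):                     # keep old center if cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bovw_histogram(descriptors, centers):
    """Assign each descriptor to its nearest word and count word frequencies."""
    words = _sq_dists(descriptors, centers).argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()                     # assumed L1 normalization

rng = np.random.default_rng(1)
train_descriptors = rng.normal(size=(200, 5))    # pooled from all training images
dictionary = kmeans(train_descriptors, k=8)
image_feature = bovw_histogram(train_descriptors[:50], dictionary)
```

The length of the resulting feature equals the dictionary size, so the dimension of the handcrafted feature is controlled by k rather than by the image size.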

B. Deep Feature Extraction Based on CVNLNet
In the present section, a CV network named CVNLNet is proposed to extract deep features, as illustrated in Fig. 3. It mainly includes two parts: the CV ResNet and the CV nonlocal block. Here, the CV nonlocal block, as a separate module, can be inserted at any position in the CV ResNet to form CVNLNet (represented by the dotted lines with arrows in Fig. 3). The network performance for different insertion positions is discussed in the experiments to determine the optimal network configuration.
To exploit both amplitude and phase in the POLSAR data, the basic modules in CVNLNet, such as convolutional layers, pooling layers, activation layers, and batch normalization layers, are substituted with their CV versions. These CV modules are elucidated in [38] and [39]. Notably, in the output layer, the complex features are transformed into real features by calculating the absolute value before the softmax classification, since softmax is not applicable to complex values.
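The output-layer conversion mentioned above (absolute value first, softmax second) can be illustrated with a few lines of NumPy. The function name `magnitude_softmax` and the example scores are hypothetical.

```python
import numpy as np

def magnitude_softmax(complex_scores):
    """Softmax over the magnitudes of complex class scores (last axis)."""
    magnitude = np.abs(complex_scores)                        # complex -> real
    shifted = magnitude - magnitude.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.array([3 + 4j, 1 + 0j, 0 - 2j])   # magnitudes 5, 1, 2
probs = magnitude_softmax(scores)
```

Only the magnitude of each class score influences the prediction here; the phase is used inside the network but discarded at this final step.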
1) CV ResNet: The main structure of CVNLNet is a deep CV ResNet. From [71], it is known that ResNet can address the degradation problem, thus achieving higher accuracy from considerably increased depth. Therefore, to prevent the vanishing gradients that may occur during deep network training, CV residual blocks (convolution blocks and identity blocks) are exploited. Within the CV convolution blocks, the stride of the first convolutional layer and of the shortcut connection layer is set to 2, so that the input and output feature maps have different sizes, unlike in the CV identity block. CV ResNet comprises three convolutional layers and four residual block groups (with two residual blocks each), as depicted in Fig. 3. In this network, inspired by [43], two convolutional layers replace the fully connected layer to prevent overfitting and increase nonlinearity.
2) CV Nonlocal Block: Inspired by [72], a CV version of nonlocal block with high efficiency, as shown in Fig. 3, is presented and applied to our CV ResNet. It is capable of capturing long-range relationships, thus making up for the insufficiency that convolution operations deal with one local area each time [72]. The CV nonlocal block introduces global information by computing a weighted sum of the features at all positions in the feature maps, thereby making the target region more weighted and more prominent, to enhance the recognition performance.
The generic nonlocal operation in the RV domain could be expressed mathematically as follows:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \tag{8}$$

where $x$ and $y$ respectively denote the input and output feature maps with the same size, $i$ denotes a position on the feature map while $j$ enumerates all possible positions, and $C(x)$ is a normalization function. $g$ computes a representation of $x_j$ by a 1 × 1 convolution: $g(x_j) = W_g x_j$. $f$ computes the relationship between $x_i$ and $x_j$, which is implemented in this article as an embedded Gaussian function, defined in the RV domain as follows:

$$f(x_i, x_j) = e^{\theta(x_i)^{T} \phi(x_j)} \tag{9}$$

where $\theta(x_i) = W_\theta x_i$ and $\phi(x_j) = W_\phi x_j$ are embeddings computed by 1 × 1 convolutions, and $C(x) = \sum_{\forall j} f(x_i, x_j)$. Then, (8) at all positions $i$ is expressed as follows:

$$y = \operatorname{softmax}(\sigma(x))\, g(x) \tag{10}$$

$$\sigma(x) = \theta(x)\, \phi(x)^{T} \tag{11}$$

where $\theta(x)$, $\phi(x)$, and $g(x)$ denote the corresponding matrices calculated from $x$ at all positions. A weight matrix is obtained from $\sigma(x)$ with the use of the softmax function. Subsequently, this weight matrix is multiplied by the matrix $g(x)$ to obtain the weighted feature map $y$ with the target area highlighted. However, data are complex (denoted by a subscript c) in the CV network. Thus, the conjugate transpose of $\phi(x_c)$ replaces its transpose in (11), which is improved in the CV domain as follows:

$$\sigma(x_c) = \theta(x_c)\, \phi(x_c)^{H} \tag{12}$$

where the superscript H denotes the conjugate transpose operation. Next, the absolute value of the complex $\sigma(x_c)$ is calculated for softmax. On that basis, (10) is modified as follows:

$$y_c = \operatorname{softmax}(|\sigma_c|)\, g(x_c) \tag{13}$$

where $|\sigma_c|$ denotes the absolute value of $\sigma(x_c)$, defined as follows:

$$|\sigma_c| = \sqrt{\Re(\sigma(x_c))^2 + \Im(\sigma(x_c))^2} \tag{14}$$

To decrease computation, a downsampling trick is adopted, which is implemented as a pooling operation. Thereby, $\phi(x_c)$ in (12) and $g(x_c)$ in (13) are replaced by $\hat{\phi}(x_c)$ and $\hat{g}(x_c)$, which are the corresponding downsampled versions.
According to the above derivation, the general CV nonlocal block in the CV domain could be lastly defined as follows:

$$z_c = W_z\, y_c + x_c \tag{15}$$

where $W_z$ denotes a weight matrix in the 1 × 1 convolution. Moreover, to further decrease the computation, the number of channels in the convolutional layers containing $W_\theta$, $W_\phi$, or $W_g$ is decreased to half of that of $x_c$. Next, the number of channels in the convolutional layer containing $W_z$ is required to match that of $x_c$. Then, the output of the convolutional layer containing $W_z$ is added to the input $x_c$ by a residual connection. Moreover, a CReLU activation is applied to $z_c$ to enhance nonlinearity.
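A forward pass of such a block can be sketched in NumPy as follows. This is a sketch under stated assumptions rather than the network's actual code: 1 × 1 convolutions are written as per-position matrix products, the downsampling is assumed to be 2 × 2 average pooling (the pooling type is not specified; average pooling is linear, so applying it before the 1 × 1 convolution is equivalent to applying it after), and CReLU is assumed to apply ReLU separately to the real and imaginary parts. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def crelu(z):
    """ReLU applied separately to real and imaginary parts (assumed definition)."""
    return np.maximum(z.real, 0) + 1j * np.maximum(z.imag, 0)

def cv_nonlocal_block(x, w_theta, w_phi, w_g, w_z, pool=2):
    """x: complex (H, W, C); w_theta, w_phi, w_g: (C, C//2); w_z: (C//2, C)."""
    h, w, c = x.shape
    flat = x.reshape(h * w, c)                                # all N = H*W positions
    # Downsample the positions used for phi and g (average pooling, assumption)
    pooled = x.reshape(h // pool, pool, w // pool, pool, c).mean(axis=(1, 3))
    pooled = pooled.reshape(-1, c)                            # M = N / pool^2 positions
    theta = flat @ w_theta                                    # (N, C/2)
    phi = pooled @ w_phi                                      # (M, C/2)
    g = pooled @ w_g                                          # (M, C/2)
    sigma = theta @ phi.conj().T                              # conjugate transpose, (N, M)
    weights = softmax(np.abs(sigma), axis=1)                  # real attention weights
    y = weights @ g                                           # weighted sum over positions
    z = y @ w_z + flat                                        # restore channels + residual
    return crelu(z).reshape(h, w, c)

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
def cweights(shape):
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

x = cweights((H, W, C))
out = cv_nonlocal_block(x, cweights((C, C // 2)), cweights((C, C // 2)),
                        cweights((C, C // 2)), cweights((C // 2, C)))
```

Because the attention weights are real and each row sums to one, every output position is a convex combination of the (complex) embedded features at all pooled positions, which is how global context enters the feature map.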

C. Feature Fusion Algorithm Based on KDCA
To effectively fuse handcrafted features and deep features, a fusion algorithm based on KDCA is developed (shown as Fig. 4). KDCA performs feature selection and dimensionality reduction through projective transformation to avoid feature redundancy and fully uses the complementary advantages of the two features. First, the kernel function efficiently maps the original feature space to a higher dimensional feature space during the different feature fusion process, so that the nonlinear relationship is converted into linear relationship for subsequent correlation analysis. Moreover, the effects of different dimensions of the original features are suppressed due to mapping to the same spatial dimension. Then, DCA is capable of maximizing the correlation between features from the same category of targets in two feature sets and eliminating the correlation of features from different categories in the respective feature set.
First, the two features are normalized before fusion so that they share the same scale. The two feature matrices are assumed as $X \in \mathbb{R}^{p \times n}$ and $Y \in \mathbb{R}^{q \times n}$, where $n$ denotes the number of samples, and $p$ and $q$ denote the dimensions of the two features. Moreover, all samples originate from $c$ separate categories, and $n_i$ denotes the number of samples belonging to the ith category. Thus, $X_{ij}$ and $Y_{ij}$ denote the feature vectors from the jth sample of the ith category. With the feature matrix $X$ as an example, the mean of the ith category and of the entire feature set could be written as $\bar{X}_i$ and $\bar{X}$, respectively

$$\bar{X}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij} \tag{16}$$

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{c} n_i \bar{X}_i \tag{17}$$

The between-class scattering matrix $S_{bx}$ could be defined as follows:

$$S_{bx} = \sum_{i=1}^{c} n_i (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})^{T} = \Theta_x \Theta_x^{T}, \qquad \Theta_x = \left[\sqrt{n_1}(\bar{X}_1 - \bar{X}), \dots, \sqrt{n_c}(\bar{X}_c - \bar{X})\right] \tag{18}$$

In this step, considering the nonlinear relationship between different features, it is not ideal to perform linear analysis directly, so we introduce the kernel function to map the feature $F$ to a linearly separable high-dimensional space using a nonlinear mapping $\Phi(\cdot)$: $\hat{F} = \Phi(F)$. The corresponding kernel function matrix $K$ is constructed as follows:

$$K = \Phi(F)^{T} \Phi(F)$$

Therefore, $S_{bx}$ in (18) is improved by the kernel function as follows:

$$\hat{S}_{bx} = \Phi_x(\Theta_x)\, \Phi_x(\Theta_x)^{T}, \qquad K_x = \Phi_x(\Theta_x)^{T} \Phi_x(\Theta_x)$$

where $\Phi_x(\cdot)$ denotes a nonlinear mapping for $x$, and $K_x$ denotes the kernel function matrix of $\Phi_x(\Theta_x)$. When samples from different categories are strongly discriminative, $\hat{S}_{bx}$ would be a diagonalizable matrix

$$P^{T} K_x P = \Lambda$$

where $\Lambda$ denotes the diagonal matrix of real and non-negative eigenvalues sorted in decreasing order. Moreover, $P$ is a matrix consisting of the $m$ most significant eigenvectors corresponding to the $m$ largest eigenvalues. We can then decrease the dimension of $X$ from $p$ to $m$ by the transformation $W_{bx} = \Phi_x(\Theta_x) P \Lambda^{-1/2}$, which converts the scattering matrix $\hat{S}_{bx}$ to an identity matrix $I$: $W_{bx}^{T} \hat{S}_{bx} W_{bx} = I$. This means that after this transformation, the correlation between different categories is minimized, i.e., the categories are separated.
Accordingly, $X$ could be mapped by $W_{bx}$ to $X'$

$$X' = W_{bx}^{T}\, \Phi_x(X), \qquad X' \in \mathbb{R}^{m \times n}$$

Likewise, we could get a transformation $W_{by}$ which decreases the dimension of $Y$ from $q$ to $m$

$$Y' = W_{by}^{T}\, \Phi_y(Y), \qquad Y' \in \mathbb{R}^{m \times n}$$

Then, in order that only the features from the same category of targets in the two feature sets have nonzero correlation, the between-set covariance matrix $S'_{xy} = X' Y'^{T}$ needs to be diagonalized. The singular value decomposition is adopted for the diagonalization operation

$$S'_{xy} = U \Sigma V^{T} \ \Rightarrow\ U^{T} S'_{xy} V = \Sigma$$

where $\Sigma$ is a diagonal matrix. We let the two transformations be $W_{cx} = U \Sigma^{-1/2}$ and $W_{cy} = V \Sigma^{-1/2}$, respectively. Next, the two feature matrices can be transformed as follows:

$$\tilde{X} = W_{cx}^{T} X', \qquad \tilde{Y} = W_{cy}^{T} Y'$$

Through the two-stage transformation derived above, the two feature sets, with the largest distinction between different categories in the same set and the largest correlation of corresponding features between different sets, are lastly obtained. Then, the two transformed feature sets are concatenated to form a fusion feature set: $F = [\tilde{X}; \tilde{Y}]$.
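The two-stage structure of the transformation (between-class whitening within each set, then diagonalization of the between-set covariance) can be checked numerically with the linear case. The sketch below deliberately omits the kernel mapping, so it illustrates plain DCA rather than KDCA, and all names are hypothetical. One further assumption is flagged in the code: the whitening matrix is scaled by the inverse eigenvalues so that the identity check holds numerically, which may differ from the normalization convention used in the text.

```python
import numpy as np

def whiten_between_class(X, labels):
    """Return (W, Theta) with W.T @ (Theta @ Theta.T) @ W = I. X is (dim, n)."""
    classes = np.unique(labels)
    grand_mean = X.mean(axis=1, keepdims=True)
    theta = np.hstack([
        np.sqrt(np.sum(labels == c))
        * (X[:, labels == c].mean(axis=1, keepdims=True) - grand_mean)
        for c in classes
    ])                                               # (dim, c)
    eigvals, P = np.linalg.eigh(theta.T @ theta)     # small c x c eigenproblem
    keep = np.argsort(eigvals)[::-1][: len(classes) - 1]   # rank is at most c - 1
    eigvals, P = eigvals[keep], P[:, keep]
    W = (theta @ P) / eigvals                        # scaling chosen so the check passes
    return W, theta

def dca_fuse(X, Y, labels):
    """Linear DCA fusion sketch: returns transformed sets and their concatenation."""
    Wbx, _ = whiten_between_class(X, labels)
    Wby, _ = whiten_between_class(Y, labels)
    Xp, Yp = Wbx.T @ X, Wby.T @ Y                    # first-stage projections
    U, s, Vt = np.linalg.svd(Xp @ Yp.T)              # between-set covariance
    Wcx, Wcy = U / np.sqrt(s), Vt.T / np.sqrt(s)     # U S^(-1/2), V S^(-1/2)
    Xt, Yt = Wcx.T @ Xp, Wcy.T @ Yp
    return Xt, Yt, np.vstack([Xt, Yt])

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)                 # 3 classes, 10 samples each
X = rng.normal(size=(6, 30)) + 3 * rng.normal(size=(6, 3))[:, labels]
Y = rng.normal(size=(5, 30)) + 3 * rng.normal(size=(5, 3))[:, labels]
Xt, Yt, fused = dca_fuse(X, Y, labels)
```

In this linear form each transformed set has at most c - 1 dimensions, which is the limitation that the kernel mapping in KDCA is intended to relax.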

III. EXPERIMENTS AND DISCUSSIONS
In the present section, the effectiveness and superiority of the proposed recognition approach are verified on two public datasets: the MSTAR dataset (single polarimetric) and the GOTCHA dataset (fully polarimetric). This article utilizes LIBSVM [73] to build the SVM classifier.
In the MSTAR dataset, ten categories of targets at the depression angle of 17° are selected for training, and ten categories of targets at 15° are selected for testing. The details of the sample data are listed in Table I.
2) Experimental Setup for Handcrafted Feature Extraction:
First, the amplitude of the complex data is used to form a gray image in MSTAR. Subsequently, the three-scale monogenic features are extracted from the gray image according to the monogenic signal model, as shown in Fig. 6. In (3), the parameters are specifically set as follows: S = 3, σ = 0.48, λ min = 8, μ = 2.5.
Next, the monogenic features extracted from each image are arranged as a long feature vector, and 99% of these feature vectors are selected for K-means clustering. The parameter k in the K-means clustering algorithm is set as 2280. Then the centers of all clusters obtained serve as visual words. Finally, the visual words contained in each image are counted to form a visual word histogram, which is further encoded to obtain a 2279-dimensional handcrafted feature vector.
In order to verify the effectiveness of the proposed Mono-BOVW model, we compare it with the single-scale monogenic signal with BOVW model (single-scale Mono-BOVW) and the multiscale monogenic signal model in terms of the classification overall accuracy (OA) and the feature dimension, as listed in Table II. Accuracy is the most commonly used performance metric in machine learning, defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{28}$$

where TP denotes the number of true positives, TN denotes the number of true negatives, FP denotes the number of false positives, and FN denotes the number of false negatives. It can be seen that, compared with the single-scale monogenic signal, the multiscale monogenic signal contains more information and therefore achieves a higher accuracy. Moreover, introducing the BOVW model into the multiscale monogenic signal greatly decreases the feature dimension and lifts the low-level semantic features to mid-level semantic features with stronger representational ability, as manifested by a 7.63% increase in classification accuracy.
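As a small illustration of the metric, the snippet below computes the accuracy from the four counts and, for the multiclass case, the overall accuracy as the fraction of samples on the diagonal of a confusion matrix. The function names and the toy confusion matrix are hypothetical and are not taken from the reported results.

```python
import numpy as np

def binary_accuracy(tp, tn, fp, fn):
    """Accuracy from true/false positive and negative counts."""
    return (tp + tn) / (tp + tn + fp + fn)

def overall_accuracy(confusion):
    """Fraction of correctly classified samples (rows: true, columns: predicted)."""
    confusion = np.asarray(confusion, dtype=float)
    return np.trace(confusion) / confusion.sum()

toy_confusion = [[5, 1],
                 [2, 2]]                         # made-up two-class example
oa = overall_accuracy(toy_confusion)             # (5 + 2) / 10
acc = binary_accuracy(tp=5, tn=2, fp=2, fn=1)    # same counts, class 0 as positive
```

For two classes the two functions agree, since the diagonal of the confusion matrix holds exactly TP and TN.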

3) Experimental Setup for Deep Feature Extraction:
For ease of analysis, experiments are performed on the MSTAR dataset to verify the effectiveness of CVNLNet and to determine the optimal network model. Table III lists the detailed configuration of CVNLNet with all CV layers. The input layer takes a SAR image of size 128 × 128 with one polarization channel. The first layer is a convolutional layer and the second layer is an average pooling layer, followed by four stages (Res1, Res2, Res3, and Res4). Here, Res1 comprises two identity blocks, while Res2, Res3, and Res4 each comprise an identity block and a convolution block; a CV nonlocal block can be inserted before the last block of any stage. After the four stages, there are two convolutional layers. Notably, complex features should be converted to real features by calculating the absolute value before softmax in the output layer. The hyperparameters in CVNLNet are set as follows: the number of epochs is 100, the batch size is 32, the cross-entropy loss serves as the loss function, and the Adam algorithm serves as the optimizer with an initial learning rate of 0.001. Furthermore, a dropout layer is employed to prevent overfitting.
Compared with the baseline network without the CV nonlocal block, the accuracy of CVNLNet with the CV nonlocal block inserted at any stage is enhanced. However, the further back the insertion position is, the smaller the improvement in classification accuracy. The optimal classification accuracy occurs in Res1+, reaching 99.50%. This result is probably achieved because the deeper the network goes, the smaller the feature maps and the smaller the role of the CV nonlocal block will be. Fig. 7 presents the feature maps before and after the insertion at the Res1 stage. Notably, the target area in the map is highlighted while the background interference and noise are suppressed after inserting the CV nonlocal block, which improves the accuracy of target recognition by 1.38%.
Thus, the CV nonlocal block is inserted at the Res1 stage of CV ResNet to construct the optimal model of CVNLNet.
Next, the deep features are extracted using the optimal CVNLNet model. Specifically, the input feature maps of the output layer are selected and transformed into a 512-dimensional deep feature vector.

4) Recognition Based on the Feature Fusion Framework:
The KDCA algorithm is adopted to fuse the extracted handcrafted feature vector and deep feature vector to form a more discriminative fusion feature vector. Next, this vector is fed into the SVM classifier for target classification.
For the KDCA fusion algorithm, a Gaussian kernel is selected as the kernel function, so that the handcrafted feature vector and deep feature vector can be analyzed and fused in a suitable high-dimensional space. Lastly, we get a smaller 494-dimensional fusion feature vector, which avoids the redundancy problem in the fusion of the two feature vectors and decreases the subsequent computation. Correspondingly, for the SVM, the Gaussian kernel is also selected as the kernel function.
Besides the Mono-BOVW model and CVNLNet, state-of-the-art SAR ATR methods are also cited for comparison, so as to examine the effectiveness and superiority of the proposed feature fusion framework. They include handcrafted feature-based methods, such as the moment method [21], the attributed scattering center model [22], and joint sparse representation (JSR) of monogenic components [30], as well as end-to-end neural networks, such as A-ConvNet [31], CV-CNN [39], CV-FCNN [43], and RVNLNet with the same architecture as CVNLNet. Furthermore, a fusion framework based on multiple handcrafted features [47], MKSFF-CNN based on fusion of multiscale deep features [55], and FEC based on fusion of handcrafted features and deep features [60] are also used for comparison. Table V lists the classification accuracy of different methods. The confusion matrix of the proposed feature fusion framework is illustrated in Fig. 8, where each row represents the true category of the samples, each column represents the predicted category, and each cell lists the number of samples predicted as the corresponding category. Moreover, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) are also used to objectively evaluate the classification performance, as illustrated in Fig. 9. The ROC curve is an important evaluation metric in machine learning, which describes the relationship between the true positive rate (TPR) and the false positive rate (FPR). AUC is usually used to assist ROC to further evaluate the performance (generally speaking, the larger the AUC, the better the performance of the classification method) [55]. Similar to (28), the specific definitions of TPR and FPR are as follows:

$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$

Moreover, experiments are performed on smaller training datasets sampled from the original MSTAR dataset, to verify the ability of the proposed feature fusion framework to adapt to small datasets. The sampling proportions of the small training datasets relative to the original MSTAR dataset are 1/3, 1/5, 1/7, and 1/10, respectively, and the corresponding classification accuracy is illustrated in Fig. 10.
As depicted in Table V, the proposed feature fusion framework achieves the highest accuracy of 99.71% in the MSTAR target recognition task. And from the specific classification results illustrated in Fig. 8, it can be seen that in the 10-category dataset containing 2425 samples, only 7 samples in total are misclassified. Moreover, the ROC curves and AUC values in Fig. 9 also show that the classification performance of the proposed feature fusion framework is higher than other methods.
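As a quick consistency check of the two figures quoted above (2425 test samples, 7 misclassified), the overall accuracy works out as

$$\text{OA} = \frac{2425 - 7}{2425} = \frac{2418}{2425} \approx 0.9971,$$

which agrees with the reported 99.71%.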
First, as can be seen from Table V and Fig. 9, the proposed feature fusion framework significantly outperforms the single feature-based methods, especially compared with the proposed Mono-BOVW and CVNLNet. On the one hand, as depicted in Table V and Fig. 9, compared with CVNLNet, which extracts high-level semantic features through automatic learning, the Mono-BOVW model based on manually designed low-level semantic features may be less adaptable to SAR data. However, the proposed fusion framework based on these two features exploits their complementary characteristics to the maximum extent through the KDCA algorithm, so as to obtain more discriminative fusion features, which improves the classification accuracy by 1.11% and 0.21% compared with Mono-BOVW and CVNLNet, respectively. On the other hand, as depicted in Fig. 10, the classification accuracy of CVNLNet decreases much faster than that of Mono-BOVW when training samples are greatly reduced. This is because the learning process of deep neural networks is more easily affected by the number of samples, and it is prone to overfitting when samples are insufficient. Accordingly, the proposed fusion framework can better make up for the deficiency of CVNLNet by introducing handcrafted features (extracted by Mono-BOVW), which are less affected by the number of samples due to their interpretability. Therefore, the classification performance can maintain a certain stability when the number of samples is small. To sum up, when the samples are sufficient, the proposed feature fusion framework makes the most of the automatic learning and strong representational ability of deep features and combines them with the characteristics of handcrafted features, to obtain highly discriminative features that characterize the target more comprehensively; when the samples are insufficient, the recognition performance can be kept stable to a certain extent thanks to the handcrafted features.
Table V and Fig. 9 also reveal that, in terms of fusion feature-based methods, compared to [47] and [55], paper [60] and the proposed framework have higher accuracy and AUC values, which can be attributed to the latter's utilization of the complementary advantages of handcrafted features and deep features [60], while the former only fuse handcrafted features or deep features. The proposed feature fusion framework is slightly better than [60], which is due to the more effective use of the phase information of the SAR data through the monogenic signal and CVNLNet. Moreover, the accuracy of CVNLNet is 1.15% higher than that of RVNLNet, because the real-valued images used as the input of RVNLNet only contain amplitude, while the complex-valued images used as the input of CVNLNet contain both amplitude and phase, which are effectively utilized.
In addition, to verify the performance of the proposed KDCA fusion algorithm, the classical serial fusion algorithm and the advanced DCA fusion algorithm are used for comparison. The feature dimension, the classification accuracy, and the overall classification runtime (representing computational complexity) are listed in Table VI. The serial algorithm, which directly concatenates the features, is too simple: the fusion features are redundant and fail to exploit the complementary advantages of the different features, and the excessively high dimension imposes a heavy computational burden on the classifier. The DCA algorithm has the shortest runtime, but its accuracy is not improved compared with CVNLNet. This is because the direct linear transformation of DCA retains only features of dimension c - 1 (where c denotes the number of categories) and thus loses many effective features, so the losses outweigh the gains. In contrast, compared with the serial algorithm, the proposed KDCA reduces the dimension of the fusion features from 2791 to 494 by projective transformation, thereby shortening the runtime by 22.6 s. Compared with DCA, KDCA introduces a kernel function for nonlinear mapping before the discriminant correlation analysis, which prevents the dimension of the transformed fusion features from being too low and retains the effective discriminant information needed for higher accuracy. To sum up, KDCA balances low overall computational complexity with high classification accuracy.
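The dimension trade-off discussed above can be illustrated with a minimal sketch. Serial fusion is modeled as plain concatenation, and the DCA figure uses the c - 1 dimension stated in the text. The helper names and the toy vectors are hypothetical, and the sketch does not implement DCA or KDCA themselves.

```python
def serial_fusion(handcrafted, deep):
    """Serial fusion: concatenate the two feature vectors end to end."""
    return list(handcrafted) + list(deep)


def dca_retained_dim(num_classes):
    """Dimension retained by DCA as stated in the text: c - 1."""
    return num_classes - 1


# Toy vectors (hypothetical values) showing that dimensions simply add up.
fused = serial_fusion([0.2, 0.5, 0.3], [1.0, -1.0])
print(len(fused))  # prints 5

# MSTAR has 10 categories, so DCA keeps only 9 dimensions.
dca_dim = dca_retained_dim(10)

# Ratio of the KDCA dimension to the serial dimension quoted in the text.
kdca_ratio = 494 / 2791
print(dca_dim, round(kdca_ratio, 3))  # prints 9 0.177
```

Under these figures, KDCA keeps roughly 18% of the serially fused dimension while remaining far above the 9 dimensions left by DCA.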
B. Experiments on the GOTCHA Dataset
1) GOTCHA Dataset: The GOTCHA dataset is a fully polarimetric dataset (including the HH, HV, VH, and VV polarization modes) collected by AFRL. It comprises eight complete circular passes (each covering 360° of azimuth) at different depression angles [75]. The scene contains numerous calibration targets and civilian ground vehicles. The scene area of interest, marked with nine categories of targets, is shown in Fig. 11. The optical images and names of the nine categories of vehicle targets are presented in Fig. 12.
For imaging, the complete circular aperture (360°) is divided into subapertures, each spanning 4° of azimuth. Thus, 90 scene images are obtained from each of the eight passes. Then, according to the target locations provided with the GOTCHA dataset, 50 × 50 pixel images of the nine categories of vehicle targets are cropped from the scene images. We select images from passes 1, 3, 5, and 7 for training (360 samples per category) and images from passes 2, 4, 6, and 8 for testing (360 samples per category), as listed in Table VII.
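The sample counts above follow from the aperture division and the pass split. The short sketch below reproduces that arithmetic; it assumes that each target category contributes exactly one image chip per scene image, and all constant and function names are chosen here for illustration.

```python
FULL_APERTURE_DEG = 360   # complete circular aperture
SUBAPERTURE_DEG = 4       # azimuth span of each subaperture
TRAIN_PASSES = [1, 3, 5, 7]
TEST_PASSES = [2, 4, 6, 8]

# Number of subaperture scene images obtained from one circular pass.
images_per_pass = FULL_APERTURE_DEG // SUBAPERTURE_DEG


def samples_per_category(passes):
    """Chips per category, assuming one chip per category per scene image."""
    return len(passes) * images_per_pass


train_per_category = samples_per_category(TRAIN_PASSES)
test_per_category = samples_per_category(TEST_PASSES)
print(images_per_pass, train_per_category, test_per_category)
# prints 90 360 360
```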

2) Experimental Setup for Feature Extraction:
As with the MSTAR dataset, the Mono-BOVW model is used for handcrafted feature extraction on the GOTCHA dataset, and CVNLNet (with the CV nonlocal block inserted into the Res1 stage) is used for deep feature extraction. The difference is that, because GOTCHA is a fully polarimetric dataset, each image has four polarization channels, which describe the target more comprehensively.
Before the Mono-BOVW model is applied, according to the reciprocity principle (HV ≈ VH), the amplitudes of three polarization channels (HH, HV, VV) of each complex-valued image are taken to form a gray image. The parameters of the monogenic signal model are set as follows: S = 3, σ = 0.48, λ_min = 8, and μ = 2.5. Then, we select 99% of the extracted monogenic feature vectors for K-means clustering (with the parameter k set to 3168). According to the BOVW model, a 3167-dimensional handcrafted feature vector is finally obtained.
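The BOVW encoding step can be sketched as follows. This is a toy illustration, not the authors' implementation: it assumes hard assignment of each local descriptor to its nearest codeword and L1 normalization of the resulting histogram, neither of which is specified in this section. The codebook and descriptors are made-up two-dimensional values standing in for the K-means centers and the monogenic feature vectors.

```python
import math


def nearest_word(descriptor, codebook):
    """Index of the codeword closest to the descriptor (Euclidean distance)."""
    distances = [math.dist(descriptor, word) for word in codebook]
    return distances.index(min(distances))


def bovw_histogram(descriptors, codebook):
    """L1-normalized histogram of codeword occurrences for one image."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]


# Toy codebook with 3 visual words (in practice produced by K-means).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
# Toy local descriptors extracted from a single image.
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (3.8, 4.1)]

hist = bovw_histogram(descriptors, codebook)
print(hist)  # prints [0.25, 0.5, 0.25]
```

In this sketch the histogram has one bin per codeword, so the image-level feature is a fixed-length vector regardless of how many local descriptors the image yields.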
For CVNLNet, the structure is similar to that used for the MSTAR dataset. However, because the input images in the GOTCHA dataset are smaller (50 × 50) and have four polarization channels, the convolution kernel size in the first convolutional layer of the network is set to 5 and the stride is set to 1. The network is then trained (with the same hyperparameters as for MSTAR), and a 512-dimensional deep feature vector is extracted for subsequent fusion.
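To make the effect of the first-layer settings concrete, the sketch below computes the spatial output size of a convolution and applies a single-channel complex-valued "valid" convolution (implemented as cross-correlation) using Python's built-in complex type. The padding is not given in this section and is assumed to be zero here; the single input channel, the constant test image, and the purely imaginary kernel are likewise illustrative assumptions and do not reproduce the actual CVNLNet layer.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (n + 2 * padding - kernel) // stride + 1


def complex_conv2d(image, kernel, stride=1):
    """'Valid' 2-D cross-correlation on complex values (square kernel)."""
    k = len(kernel)
    out_h = conv_output_size(len(image), k, stride)
    out_w = conv_output_size(len(image[0]), k, stride)
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0j
            for u in range(k):
                for v in range(k):
                    # Complex product mixes real and imaginary parts, so both
                    # amplitude and phase of the input affect the response.
                    acc += image[i * stride + u][j * stride + v] * kernel[u][v]
            row.append(acc)
        output.append(row)
    return output


# 50 x 50 complex image (constant 1+1j) and a 5 x 5 purely imaginary kernel.
image = [[complex(1, 1)] * 50 for _ in range(50)]
kernel = [[complex(0, 1)] * 5 for _ in range(5)]

features = complex_conv2d(image, kernel, stride=1)
print(len(features), len(features[0]))  # prints 46 46
print(features[0][0])                   # prints (-25+25j)
```

With zero padding, a kernel size of 5 and a stride of 1 shrink a 50 × 50 input to 46 × 46; a padding of 2 would preserve the 50 × 50 size.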
3) Recognition Based on the Feature Fusion Framework: Similar to the experiments on the MSTAR dataset, a 356-dimensional fusion feature vector, extracted by the KDCA fusion algorithm, is fed into an SVM classifier for classification, in accordance with the proposed feature fusion framework.
To verify the effectiveness and superiority of the proposed feature fusion framework on the GOTCHA dataset, several state-of-the-art methods are used for comparison in addition to Mono-BOVW and CVNLNet. For fully polarimetric SAR data, unlike single polarimetric SAR data, the commonly used methods mainly include polarization decomposition algorithms and deep neural networks, which better exploit the important phase relationship between the multichannel data. The compared methods are tensor local discriminant embedding (TLDE) based on multiple polarization decomposition [24], real-valued networks such as A-ConvNet [31] and RVNLNet (the real-valued counterpart of CVNLNet), and complex-valued networks such as CV-FCNN [43] and CV-3D-CNN [41]. The classification accuracies (per class and overall) of these methods are listed in Table VIII, and the ROC curves and AUC values are illustrated in Fig. 13.
As can be seen from Table VIII, the proposed feature fusion framework achieves the highest overall accuracy, 99.63%, and the highest accuracy in almost every category. Fig. 13 also confirms the superior performance of the proposed framework. In particular, the classification accuracy of the proposed feature fusion framework is 1.61% and 0.19% higher than that of Mono-BOVW and CVNLNet, respectively, which verifies that it effectively exploits the complementarity of handcrafted features and deep features to obtain more discriminative fusion features. Together with Fig. 14, it can be seen that when there are enough training samples, deep features play the key role in the framework and yield higher accuracy, whereas when there are fewer samples, handcrafted features come into play and prevent the performance of the framework from degrading rapidly. Therefore, the proposed feature fusion framework maintains relatively stable and excellent recognition performance.
In addition, Table VIII shows that CVNLNet achieves higher accuracy than RVNLNet because it can fully utilize the phase relationship between the different polarization channels in fully polarimetric data. CVNLNet also performs better than the other CV networks (CV-FCNN and CV-3D-CNN) because the nonlocal block draws more attention to the target region within the whole image, which enhances the feature extraction ability of the network.

IV. CONCLUSION
In this article, a feature fusion framework based on the monogenic signal and a complex-valued nonlocal network is proposed for POLSAR target recognition. It effectively uses the complementary advantages of handcrafted features and deep features to make up for the limited representational ability of a single feature. First, a Mono-BOVW model is employed to extract robust handcrafted features, and CVNLNet is constructed to extract deep features with strong representational ability. Subsequently, the two features are analyzed and transformed by the proposed KDCA algorithm, which removes redundancy and forms more discriminative fusion features of lower dimension. On both the single polarimetric MSTAR dataset and the fully polarimetric GOTCHA dataset, the proposed framework achieves high classification accuracy and exhibits good adaptability to small sample datasets. These results indicate that the proposed feature fusion framework is promising and of considerable significance for SAR-ATR.