SAR Target Classification Based on Multiscale Attention Super-Class Network

The convolutional neural network (CNN) is widely used in synthetic aperture radar (SAR) target recognition, but conventional CNNs mainly adopt single-scale convolutional kernels, which loses part of the target feature information and pays insufficient attention to significant features. In addition, conventional CNN approaches only assign fine-class labels to SAR targets, ignoring the high-level semantic information shared by similar categories, which reduces the feature differences between categories and the generalization ability of the model. Therefore, this article proposes a multiscale attention super-class CNN (MSA-SCNN) for SAR target classification. First, MSA-SCNN combines multiscale feature fusion with the attention module to improve the integrity of SAR target feature representation. The attention module includes channel and spatial attention modules, which realize the weighted enhancement of features at different scales. Additionally, MSA-SCNN introduces super-class labels to increase the feature difference between categories. The classification stage consists of a fine-class branch and a super-class branch, and the features trained on the super-class branch are fused into the fine-class branch to improve the network's fine classification ability. Experiments on the moving and stationary target acquisition and recognition dataset and the FUSAR-Ship dataset show that the proposed MSA-SCNN outperforms many existing state-of-the-art methods.


I. INTRODUCTION
Synthetic aperture radar (SAR) can obtain high-resolution images all day and in all weather. Compared with other imaging methods such as optical and infrared, SAR can acquire information about targets covered by clouds and vegetation. Nowadays, SAR is widely used in numerous fields, e.g., military reconnaissance and geological exploration [1]. Among SAR applications, automatic target recognition (ATR) has important research and application value, particularly in military reconnaissance, so it has received considerable attention [2]. Generally, a standard SAR ATR system consists of three stages: detection, discrimination, and classification recognition [3]. In the detection stage, the region of interest (ROI) is prescreened according to the local grayscale statistics in the SAR images [4]. The discrimination stage removes the false-alarm clutter and retains the real targets [5]. Finally, the target features are extracted, and a classifier is designed to classify the SAR targets.
SAR target classification is one of the vital stages of SAR ATR processing. Generally, the classification methods are divided into two categories: template-based methods [6] and model-based methods [7]. The template-based method needs to build a template database, and the extracted SAR target features are matched against the template database during the recognition stage. The classification accuracy depends on the manually designed templates, and building the template database takes a lot of time [8]. The model-based method adopts three-dimensional (3-D) modeling and electromagnetic calculations to simulate SAR images, and iteratively adjusts the model in the process of predicting SAR target chips. Based on these two mainstream methods, many SAR target recognition algorithms have been proposed in recent years, such as principal component analysis [9], linear discriminant analysis [10], support vector machine [11], adaptive boosting [12], conditionally Gaussian model (CGM) [13], and iterative graph thickening [14]. These methods usually need to extract specific features from SAR images and predesign complex target recognition algorithms, which brings tremendous challenges to practical applications.
With the rapid development of deep learning [15], convolutional neural networks (CNNs) have been applied to various computer vision tasks such as image classification [16], object detection [17], and semantic segmentation [18], and have achieved superior performance. A CNN directly extracts low-level and high-level features from raw images through convolutional and pooling layers, providing an effective solution for SAR target classification. Many recent works have shown deep neural networks to be powerful tools for SAR target classification [19], [20], [21].
Most CNN methods only use a single-scale convolutional kernel, resulting in some feature representation loss of the SAR targets. Ai et al. [22] proposed a novel CNN model based on multikernel-size feature fusion (MKSFF-CNN), which uses convolutional kernels of different sizes to extract the multiscale deep features, and then concatenates the features extracted by the convolutional layers of different dimensions to achieve the finest classification. Although MKSFF-CNN improves the completeness of the SAR target feature representation, two important problems remain to be solved for SAR target classification. First, multiscale features have information redundancy, and the network needs to automatically focus on important features and suppress unnecessary features to increase the representation power of multiscale features. Second, since SAR images are sensitive to observation conditions, the network needs to introduce prior knowledge to increase the difference of multiscale features between different categories, thereby improving the generalization ability of the model.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In order to increase the feature difference between categories, Zhang et al. [23] designed a two-stream deep network and introduced SAR domain knowledge such as target azimuth and phase into the CNN to assist in classification. For SAR classification tasks, existing methods mainly introduce prior knowledge and fuse features at the network input, ignoring the category labels that actually guide the classification at the network output. These methods only have one kind of fine-class label, and the misclassifications between any two classes are treated equally. In contrast, when humans create categories, nonparallel semantic relations are established between the categories [24], and categories with communal features belong to a super-class label, which can assist in fine classification. For example, there are many different types of tanks, and they all belong to one super-class label, Tanks. When classifying a tank of an unknown type, the network should tend to classify it under the tank super-class label rather than identify it as an armored vehicle or another class label. Therefore, super-class labels can improve the generalization ability of the model to unknown classes.
To sum up, the SAR target features extracted by most networks are incomplete and redundant, important features receive too little attention, and the feature differences between categories are small. To solve these problems, this article proposes a multiscale attention super-class CNN (MSA-SCNN) for SAR target classification. In the extraction stage, MSA-SCNN combines multiscale feature fusion with the attention module to improve the integrity of SAR target feature representation. The attention module includes channel and spatial attention modules, which focus on important features and suppress unnecessary features. In the classification stage, MSA-SCNN has a super-class branch and a fine-class branch, which correspond to the super-class and fine-class labels. The super-class branch focuses on the communal features of SAR categories so that the feature difference between super-classes increases, whereas the fine-class branch focuses on more refined category features. Finally, features extracted by the super-class branch are fused into the fine-class branch to assist in fine classification. The main contributions of this article can be summarized as follows.
1) The multiscale features of SAR targets are analyzed and a new network structure, MSA-SCNN, is proposed, which uses convolutional kernels of different sizes to extract feature information of different scales and fuses these features at each layer. It greatly improves the integrity of the SAR target feature representation.
2) MSA-SCNN uses spatial and channel attention for multiscale feature weighted enhancement, so that the network focuses on target features, suppresses the background clutter, and avoids information redundancy.
3) MSA-SCNN introduces the prior knowledge of super-class labels into the CNN and creates two classification branches. The high-level semantic features trained on the super-class branch are fused into the fine-class branch for final classification. The assistance of super-class labels increases the feature difference between the categories and further improves the generalization ability of the model.
The rest of this article is organized as follows. Section II introduces SAR multiscale features and super-class labels. Section III elucidates the details of the MSA-SCNN model structure and training strategies. Section IV is the experimental part, where comparative experiments and ablation studies are designed and performed. The classification performance is validated on the moving and stationary target acquisition and recognition (MSTAR) dataset and the FUSAR-Ship dataset with a detailed evaluation. Finally, Section V concludes the article.

II. BACKGROUND KNOWLEDGE OF MSA-SCNN
In this section, the multiscale features of SAR targets and super-class labels will be discussed in detail. Meanwhile, the division of super-class and fine-class labels will be given. MSA-SCNN uses two methods to improve the integrity of the SAR target feature representation and the generalization ability of the model.

A. Multiscale Features of SAR Targets
Generally speaking, the deeper the network is, the stronger the representation and nonlinear fitting ability it will have. VGG [16] uses stacked small-size convolution kernels to increase the network depth, whereas ResNet [25] introduces skip connections to alleviate the gradient vanishing problem of deep networks, and the network is further deepened. However, the existing SAR target datasets are generally small, and the application of deeper networks can easily lead to overfitting. Therefore, apart from deepening the networks, other methods should be considered to improve the ability of target feature extraction.
GoogLeNet [26], proposed by Szegedy et al., won the ImageNet Large Scale Visual Recognition Challenge in 2014. This model uses a parallel network structure to obtain multichannel features through different convolution branches. The size of the convolutional kernel differs from branch to branch, which means that the input of the same layer is seen through multiple receptive fields of different sizes, so the model can extract multiscale features from images.
Inspired by this, convolutional kernels of different sizes are used to extract features from SAR images. Fig. 1 displays the feature maps after the activation function, binarized to facilitate observation and analysis. The 3×3 kernels highlight the local features of the SAR images and divide the target feature maps into multiple small regions, and the holes in the feature maps make the global contour features inconspicuous. As the size of the kernels increases, the global contour features of the target are gradually enhanced. However, as the hole area of the target feature map decreases, the local detail features are gradually lost. This phenomenon reveals that SAR targets have multiscale features. However, the number of multiscale feature channels without any processing is large, which leads to information redundancy. Therefore, this article further extracts important features and suppresses unnecessary features through attention mechanisms and super-class labels.

B. SAR Domain Super-Class Labels
SAR targets contain a great deal of prior knowledge in the SAR domain, such as azimuth angle and phase information [23]. Unlike optical images, SAR images are sensitive to the azimuth angle. Under the side-view imaging mode of SAR, some areas will lose echo signals due to the blocking of the target itself, and the attribute scattering center (ASC) that reflects the target structure also changes with the azimuth angle [26]. The SAR phase also contains additional information. However, the final focused image will have phase errors due to motion errors and terrain factors. Introducing these two kinds of prior knowledge can improve the recognition rate [27], but they are easily affected by various external factors and become unstable.
The above-mentioned prior knowledge in the SAR domain is affected by the signal-to-noise ratio of SAR images, and the extracted prior information may be contaminated. For the SAR image classification problem, the SAR target category labels are also prior knowledge. When recognizing unknown targets, humans use prior knowledge to determine the high-level super-class labels of objects first and then identify them finely [24]. Inspired by human recognition of objects, super-class labels are higher level semantic divisions of classes with similar features. The concept of the super-class was originally applied to datasets with unbalanced class distributions [28], where it helps minority classes benefit from abundant samples under the same super-class. SAR targets can also be abstracted into super-class labels, which are more stable than other prior information: when the SAR observation parameters and the scene change, the prior information of azimuth angle and phase may change, but the super-class labels will not. In this way, the super-class labels can be stably used for SAR target recognition regardless of the influence of external factors.
Taking the MSTAR dataset as an example, Fig. 2 shows the super-class labels into which the ten target categories are divided, in which the black labels under each optical image represent the original fine-class labels, and the red labels represent super-class labels. BMP2, BRDM2, BTR60, and BTR70 are all armored vehicles, so they share one super-class label, whereas T62 and T72 are different models of tanks, so they are grouped under another super-class label. The rest of the vehicles have no common characteristics, and each forms its own super-class. MSA-SCNN introduces super-class labels to guide the network to filter multiscale features so that the feature difference between categories increases and the generalization ability of the model is improved.

III. MSA-SCNN CLASSIFICATION METHOD
In this section, a novel SAR classification method, MSA-SCNN, is proposed. This approach combines multiscale feature fusion with the attention module to improve the integrity of SAR target feature representation. Meanwhile, it introduces super-class labels to increase the multiscale feature difference between categories. The basic structure of the MSA-SCNN model is described first. Then, the training configuration is given.

A. Structure of MSA-SCNN
The basic structure of the proposed MSA-SCNN is shown in Fig. 3, which adopts multichannel parallel convolutional layers for feature extraction, and each convolutional layer uses convolutional kernels of different sizes, where n is the number of channels, and the convolutional kernel size is Sn×Sn. MSA-SCNN extracts multiscale features and fuses them after each convolutional layer to ensure the integrity of the SAR targets. In addition, MSA-SCNN uses attention modules after convolutional layers, which focus on important features, suppress unnecessary features, and avoid information redundancy.
In the classification stage, MSA-SCNN is divided into the super-class branch and the fine-class branch. The subsequent convolutional layers adopt small 3 × 3 kernels to extract deeper high-level semantic features. Then, the features extracted by the super-class branch are fused into the fine-class branch, which increases the feature difference between categories and assists fine classification. Finally, the softmax classifier assigns a posterior probability to each target category. The outputs of the two branches correspond to fine-class labels and super-class labels, and the total loss of MSA-SCNN is the weighted sum of the losses of the two branches. The details of those layers and training operations are described in the rest of this section.

B. Multiscale Convolution and Pooling
The convolutional layer is the core block in our network, and it can automatically extract the multiscale features from the input SAR images. The small-size and large-size kernels can extract local feature information and global contour feature information, respectively. So MSA-SCNN uses parallel multiscale convolution kernels for feature extraction and fusion.
Each convolutional layer has n convolution kernels of different sizes. Let $f^{(l-1)}_{i}$ be the ith input feature map of the lth layer, and let $f^{(l)}_{j_n}$ be the jth output feature map of the nth convolutional channel in this layer. Suppose that $w^{(l)}_{ij_n}$ denotes the convolutional kernel mapping the ith input feature map to the jth output feature map in the nth convolutional channel, and $b^{(l)}_{j_n}$ is the jth bias. The forward propagation process in the convolutional layer can be expressed as

$$z^{(l)}_{j_n} = \sum_{i} w^{(l)}_{ij_n} * f^{(l-1)}_{i} + b^{(l)}_{j_n}, \qquad f^{(l)}_{j_n} = \sigma\left(z^{(l)}_{j_n}\right)$$

where $z^{(l)}_{j_n}$ denotes the convolutional result before nonlinear activation. The symbol * denotes the convolutional operation, and σ(·) is the nonlinear activation function. MSA-SCNN adopts the rectified linear unit (ReLU) [29] function as the nonlinear function, which alleviates the vanishing gradient problem. In addition, the calculation of ReLU is easy and efficient.
After multiscale convolution, the network fuses these multiscale features. To ensure that the feature information of each scale is not lost, MSA-SCNN concatenates them. Let $f^{(l)}_{n}$ be the output feature map of the nth channel in the lth convolutional layer and $F^{(l)}$ be the fused feature of the lth convolutional layer. The feature fusion can be expressed as

$$F^{(l)} = \mathrm{concat}\left(f^{(l)}_{1}, f^{(l)}_{2}, \ldots, f^{(l)}_{n}\right)$$

The 1 × 1 kernel does not help to enlarge the receptive field, and an overly large kernel repeatedly extracts the same feature information when it slides over the SAR image with a small stride [22]. Therefore, in the proposed MSA-SCNN, the number of convolutional channels is four, and the kernel sizes are $S_1 = 3$, $S_2 = 5$, $S_3 = 7$, and $S_4 = 9$. The multiscale features extracted by them are complementary in the fusion process and represent the SAR target features better.
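The parallel convolution and concatenation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the zero "same" padding (used so that all four branches keep the same spatial size), the stride of 1, the channel counts `c_in` and `c_out`, and the helper names `conv2d_same` and `multiscale_layer` are assumptions made for illustration only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_same(x, w, b):
    """Stride-1 convolution with zero 'same' padding (cross-correlation form).

    x: (C_in, H, W) input, w: (C_out, C_in, k, k) kernels with odd k,
    b: (C_out,) biases. Returns the pre-activation maps, shape (C_out, H, W).
    """
    k = w.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    # windows has shape (C_in, H, W, k, k): one k x k patch per output pixel
    windows = sliding_window_view(xp, (k, k), axis=(1, 2))
    return np.einsum("chwij,ocij->ohw", windows, w) + b[:, None, None]


def relu(z):
    return np.maximum(z, 0.0)


def multiscale_layer(x, params):
    """One kernel size per parallel branch, then concatenation on the feature axis."""
    branches = [relu(conv2d_same(x, w, b)) for w, b in params]
    return np.concatenate(branches, axis=0)


rng = np.random.default_rng(0)
kernel_sizes = (3, 5, 7, 9)      # S1..S4 as given in the text
c_in, c_out = 1, 8               # hypothetical channel counts
params = [
    (rng.normal(0, 0.1, (c_out, c_in, k, k)), np.zeros(c_out))
    for k in kernel_sizes
]
x = rng.normal(size=(c_in, 88, 88))   # one 88 x 88 SAR chip
fused = multiscale_layer(x, params)
print(fused.shape)  # (32, 88, 88)
```

Concatenation keeps every branch's maps intact, so the fused feature has four times as many channels as a single branch; this is the redundancy that the attention module is then asked to reduce.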
The pooling layer is generally connected after the convolutional layer to reduce the dimension of the feature map, thereby reducing the parameters of the entire network. MSA-SCNN adopts the maximum pooling [30] and calculates the maximum response of the pooling window for output.
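A small sketch of maximum pooling follows. The 2 × 2 non-overlapping window is an assumption, since the pooling window size is not stated here, and `max_pool2d` is a hypothetical helper name.

```python
import numpy as np


def max_pool2d(x, size=2):
    """Non-overlapping max pooling over a (C, H, W) array.

    H and W are trimmed to multiples of `size` before pooling.
    """
    c, h, w = x.shape
    h2, w2 = h // size, w // size
    x = x[:, :h2 * size, :w2 * size]
    # Split each spatial axis into (blocks, size) and take the block maximum.
    return x.reshape(c, h2, size, w2, size).max(axis=(2, 4))


x = np.arange(16, dtype=float).reshape(1, 4, 4)
pooled = max_pool2d(x)
print(pooled.shape)  # (1, 2, 2)
```

With a window of size 2, each pooling layer halves the height and width of the feature map, which is why the later convolutional layers operate on much smaller maps.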

C. Multiscale Attention Module
The SAR features after multiscale convolution have multiple channels, which leads to information redundancy and a lack of attention to significant features. Attention not only tells the network where to focus but also improves the representation of the regions of interest. MSA-SCNN adopts the convolutional block attention module (CBAM) [31] to focus on important features and suppress unnecessary ones. Fig. 4 shows the structure of CBAM, which includes channel attention and spatial attention modules. Given a multiscale feature $F \in \mathbb{R}^{C \times H \times W}$ as input, average-pooled features and max-pooled features are first generated in each channel. Both features are then forwarded to shared fully connected layers, and the output features are merged using elementwise summation. Finally, the channel attention weight $W_C \in \mathbb{R}^{C \times 1 \times 1}$ is obtained through the activation function. The channel attention process can be summarized as

$$W_C = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), \qquad F' = W_C \otimes F$$

where $F'$ is the channel attention feature, MLP(·) denotes the shared fully connected layers, the symbol ⊗ denotes elementwise multiplication, and σ(·) denotes the sigmoid function.
Different from the channel attention, the spatial attention focuses on the interspatial relationship of multiscale features, which is complementary to the channel attention. Average-pooled features and max-pooled features along the channel axis are first generated. Applying pooling operations along the channel axis has been shown to be effective in highlighting informative regions [32]. Then, both features are concatenated and forwarded to a convolution layer to produce the spatial attention weight $W_S \in \mathbb{R}^{1 \times H \times W}$. The spatial attention process can be summarized as

$$W_S = \sigma\big(\mathrm{Conv.@}1{\times}1\,([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')])\big), \qquad F'' = W_S \otimes F'$$

where $F''$ is the channel and spatial attention feature, and Conv.@1 × 1 represents a convolution operation with a filter size of 1 × 1.
After the introduction of the CBAM, the SAR multiscale features are further filtered along the channel dimension. Meanwhile, in the spatial dimension, the model focuses more on target features and suppresses the background clutter.
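The two attention steps can be sketched in NumPy as follows. This is an illustrative simplification rather than the authors' code: the shared fully connected layers are modeled as a two-layer perceptron without biases, the reduction ratio `r` and all weight values are hypothetical, and the spatial step uses the 1 × 1 convolution stated in the text, which reduces to a weighted sum of the two pooled maps.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def shared_mlp(v, w1, w2):
    """Shared two-layer perceptron applied to a pooled channel descriptor."""
    return w2 @ np.maximum(w1 @ v, 0.0)


def channel_attention(f, w1, w2):
    """f: (C, H, W); w1: (C // r, C); w2: (C, C // r). Returns the reweighted feature."""
    avg = f.mean(axis=(1, 2))                      # average-pooled descriptor, (C,)
    mx = f.max(axis=(1, 2))                        # max-pooled descriptor, (C,)
    weight = sigmoid(shared_mlp(avg, w1, w2) + shared_mlp(mx, w1, w2))
    return f * weight[:, None, None]               # channel weight broadcast over H, W


def spatial_attention(f, w, b):
    """w: (2,) weights of a 1 x 1 convolution over the [avg, max] maps; b: scalar bias."""
    stacked = np.stack([f.mean(axis=0), f.max(axis=0)])      # (2, H, W)
    weight = sigmoid(np.tensordot(w, stacked, axes=1) + b)   # (H, W)
    return f * weight[None, :, :]                  # spatial weight broadcast over channels


def cbam(f, w1, w2, ws, bs):
    """Channel attention followed by spatial attention."""
    return spatial_attention(channel_attention(f, w1, w2), ws, bs)


rng = np.random.default_rng(0)
C, H, W, r = 32, 44, 44, 4                         # hypothetical sizes and reduction ratio
f = rng.normal(size=(C, H, W))
w1 = rng.normal(0, 0.1, (C // r, C))
w2 = rng.normal(0, 0.1, (C, C // r))
out = cbam(f, w1, w2, rng.normal(size=2), 0.0)
print(out.shape)  # (32, 44, 44)
```

Because both weights pass through a sigmoid, every element of the output is a scaled-down copy of the input element; the module can only attenuate features, never amplify them.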

D. Super-Class and Fine-Class Branches
MSA-SCNN introduces the prior knowledge of super-class labels into the CNN and creates two classification branches. The structure of the branches is shown in Fig. 5. The outputs of the two branches correspond to the fine-class and super-class labels, respectively. Generally, the fine-class labels are the original labels of the SAR targets, and the super-class labels are the high-level classes reassigned to the SAR targets. If several fine classes have the same attribute characteristics, they are assigned one super-class label. Let [C1, C2, ..., Cm] be the fine-class labels of SAR targets and [S1, S2, ..., Sn] be the super-class labels, where m and n represent the number of fine-class and super-class labels. Then the number of super-class labels must be less than or equal to the number of fine-class labels (i.e., n ≤ m).
Due to the small feature size after max pooling, the subsequent convolutional layers use small size 3 × 3 kernels to extract deep features. The super-class branch extracts the common features of each super-class to increase the difference between classes, whereas the fine-class branch pays more attention to the fine features of different classes. Therefore, the CBAM attention module is also added after the convolutional layer to make the two branches pay different attention.
To increase the feature difference between categories and further improve the generalization ability of the model, the features extracted from the super-class branch are fused into the fine-class branch. Let $f_c$ be the feature extracted by the fine-class branch and $f_s$ be the feature extracted by the super-class branch. The feature fusion strategy is again concatenation along the feature channel, ensuring that the SAR target feature information is not lost. The final feature of the fine-class branch after feature fusion, $F$, can be expressed as

$$F = \mathrm{concat}\left(f_c, f_s\right)$$

The fused feature $F$ is sent to the fully connected layer and the softmax classifier for final classification.

E. Fully Connected Layer, Dropout, and Softmax
The fully connected layer is essentially equivalent to a spatial transformation of the feature. The feature information is converted into feature vectors and input into the fully connected layer. Let $F_v$ be the feature vector of the fused feature. The forward propagation process of the fully connected layer is expressed as

$$Z^{(l)} = W F_v + B$$

where $W$ and $B$ are the weights and bias of the fully connected layer, respectively, and $Z^{(l)}$ is the output of the lth fully connected layer. Since the fine-class branch fuses the features of the super-class branch, its flattened feature vector is longer, and its fully connected layer therefore has more neurons than that of the super-class branch.
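The branch fusion and the fully connected mapping can be sketched as below. The feature shapes, the layer widths `hidden_fine` and `hidden_super`, and the helper `fully_connected` are hypothetical, and the activation after the affine map is omitted in this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical branch outputs with shape (channels, height, width).
f_c = rng.normal(size=(16, 5, 5))    # fine-class branch feature
f_s = rng.normal(size=(16, 5, 5))    # super-class branch feature

# Fine-class branch: concatenate on the channel axis, then flatten to a vector.
fused = np.concatenate([f_c, f_s], axis=0)
f_v_fine = fused.reshape(-1)
# Super-class branch: flatten its own feature only.
f_v_super = f_s.reshape(-1)


def fully_connected(f_v, weights, bias):
    """Affine map Z = W F_v + B."""
    return weights @ f_v + bias


hidden_fine, hidden_super = 256, 128   # hypothetical layer widths
z_fine = fully_connected(
    f_v_fine, rng.normal(0, 0.01, (hidden_fine, f_v_fine.size)), np.zeros(hidden_fine)
)
z_super = fully_connected(
    f_v_super, rng.normal(0, 0.01, (hidden_super, f_v_super.size)), np.zeros(hidden_super)
)
print(f_v_fine.size, f_v_super.size)  # 800 400
```

The flattened fine-class vector is twice as long as the super-class one in this example, which is the reason given in the text for giving the fine-class branch a wider fully connected layer.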
When training samples are scarce, a deep network tends to overfit, resulting in poor performance on the test set. This article uses the dropout method [33] to suppress this problem. During training, dropout randomly sets the outputs of hidden units to zero. Since the fully connected layer contains a relatively large number of parameters, the proposed MSA-SCNN applies dropout in the fully connected layer and sets the dropout probability to 0.5.
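A minimal dropout sketch is given below. It uses the common "inverted" variant, in which surviving activations are rescaled by 1 / (1 - p) during training so that nothing needs to change at test time; whether MSA-SCNN uses this rescaling is an assumption, and the function name `dropout` is illustrative.

```python
import numpy as np


def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p and rescale the rest.

    At test time (training=False) the input is returned unchanged.
    """
    if not training or p == 0.0:
        return x
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x.shape) >= p      # True for units that survive
    return x * keep / (1.0 - p)


activations = np.ones(10000)
dropped = dropout(activations, p=0.5, rng=np.random.default_rng(0))
print(dropped.mean())   # close to 1.0, since the rescaling preserves the expectation
```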
The softmax classifier is often adopted for multiclass classification; it follows the last fully connected layer and provides the posterior probability of each category. Let the output of the last fully connected layer be $Z = [z_1, z_2, \ldots, z_C]$. Then the corresponding posterior probability for each class can be formulated as

$$p(y_i \mid Z) = \frac{\exp(z_i)}{\sum_{j=1}^{C} \exp(z_j)}$$

where $y_i$ denotes the ith target category and $C$ indicates the total number of categories. The output of the softmax classifier is a C-dimensional vector, representing the probability of each category. For the fine-class and super-class branches, the softmax output vector dimensions correspond to the number of fine-class labels and super-class labels, respectively. The two branches compute the posterior probabilities for their respective classes.
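The softmax mapping can be written in a few lines. Subtracting the maximum logit before exponentiation is a standard numerical-stability step that leaves the result unchanged; the example logits are arbitrary.

```python
import numpy as np


def softmax(z):
    """Posterior probabilities from the outputs of the last fully connected layer."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)


logits = np.array([2.0, 1.0, 0.1])   # arbitrary example outputs for C = 3 categories
probs = softmax(logits)
print(probs.argmax())  # 0
```

In MSA-SCNN the same function would be applied twice, once to the fine-class outputs and once to the super-class outputs, giving two probability vectors of different lengths.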

F. Cost Function and Backpropagation
The cost function for multiclass classification is the cross-entropy loss, which in its standard form is defined as

$$L(w, b) = -\sum_{i=1}^{C} t_i \log p(y_i \mid Z)$$

where $w$ and $b$ are the trainable parameters of the network, $p(y_i \mid Z)$ is the softmax posterior probability of the ith category, and $t_i$ equals 1 if the sample belongs to the ith category and 0 otherwise. In the MSA-SCNN model, there are two classification branches corresponding to two cost functions. Let $L_{fc}$ and $L_{sc}$ be the fine-class and super-class cost functions, each a cross-entropy loss computed over the m fine-class labels and the n super-class labels, respectively, and let $L$ be the total cost function. The total loss is the weighted sum of the super-class loss and the fine-class loss, where λ, the weight coefficient of the super-class cost function, lies in the range (0, 1).
If the value of λ is too large, the feature extraction will focus more on the super-class labels during training. The λ value of MSA-SCNN in this article is 0.5, considering that the super-class loss and the fine-class loss are equally crucial. Different values of λ affect the accuracy differently, as illustrated in the experimental section. The parameters w and b are optimized by continuously minimizing the cost function during training, which benefits the classification accuracy. Although the proposed MSA-SCNN has two branches, its parameters are trained in the same way as in one-branch approaches, and backpropagation [34] can still be used to compute gradients and update network parameters.
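The two-branch loss can be sketched as follows. The exact weighting is an assumption: this sketch uses the convex combination (1 - λ)·L_fc + λ·L_sc, which is one form consistent with λ lying in (0, 1) and with λ = 0.5 treating both losses equally; the published equation may instead leave the fine-class loss unweighted. The names `cross_entropy` and `total_loss` are illustrative.

```python
import numpy as np


def cross_entropy(probs, label):
    """Cross-entropy of one sample given softmax probabilities and its true class index."""
    return -np.log(probs[label] + 1e-12)   # small constant guards against log(0)


def total_loss(p_fine, y_fine, p_super, y_super, lam=0.5):
    """Assumed form: (1 - lam) * L_fc + lam * L_sc."""
    l_fc = cross_entropy(p_fine, y_fine)
    l_sc = cross_entropy(p_super, y_super)
    return (1.0 - lam) * l_fc + lam * l_sc, l_fc, l_sc


# Uninformative predictions over m = 10 fine classes and n = 6 super-classes.
p_fine = np.full(10, 1.0 / 10)
p_super = np.full(6, 1.0 / 6)
loss, l_fc, l_sc = total_loss(p_fine, 3, p_super, 2, lam=0.5)
print(loss)   # 0.5 * (ln 10 + ln 6) under the assumed weighting
```

Because both terms are differentiable functions of shared multiscale features, a single backward pass through the summed loss updates the shared layers with gradients from both branches.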

IV. EXPERIMENTS

A. Datasets and Random Cropping
The SAR image datasets used in this article are the MSTAR dataset [35] and the FUSAR-Ship dataset [36]. Both datasets are detailed in the following.
1) MSTAR Dataset: The dataset was acquired by Sandia National Laboratories, operating at X-band with a high resolution of 0.3 m and HH polarization. The MSTAR dataset includes ten different classes of military vehicles (rocket launcher: 2S1; armored carrier: BMP2, BRDM2, BTR60, and BTR70; bulldozer: D7; tank: T62 and T72; truck: ZIL131; and air defense unit: ZSU23/4), which are captured under different conditions, such as aspect angle, depression angle, and serial number. The optical images of the targets and their corresponding SAR images are displayed in Fig. 6. To comprehensively evaluate the classification performance of the proposed MSA-SCNN, the standard operating condition (SOC) and extended operating condition (EOC) [35] were used to test the algorithm.
[Table I: Number of training and test images for the SOC experimental setup.]
Under the SOC, the serial numbers and target configurations in the testing set are the same as those in the training set, but the aspect and depression angles differ. Table I lists a summary of the SOC experimental setup, showing that SAR images with depression angles of 17° and 15° belong to the training set and the testing set, respectively. The proposed MSA-SCNN adopts two kinds of labels (fine-class labels and super-class labels). According to the common attributes of the military vehicles, the ten fine-class labels are divided into six super-class labels, named rocket launcher (2S1), truck (ZIL131), tank (T62 and T72), armored carrier (BTR60, BTR70, BRDM2, and BMP2), air defense unit (ZSU23/4), and bulldozer (D7).
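The fine-class to super-class assignment listed above can be written as a small lookup table. The dictionary contents follow the text; the alphabetical index ordering and the helper name `encode_labels` are illustrative choices, not part of the original method.

```python
# Fine-class to super-class assignment for the ten MSTAR SOC targets.
FINE_TO_SUPER = {
    "2S1": "rocket launcher",
    "ZIL131": "truck",
    "T62": "tank",
    "T72": "tank",
    "BTR60": "armored carrier",
    "BTR70": "armored carrier",
    "BRDM2": "armored carrier",
    "BMP2": "armored carrier",
    "ZSU23/4": "air defense unit",
    "D7": "bulldozer",
}

FINE_CLASSES = sorted(FINE_TO_SUPER)                  # m = 10 fine-class labels
SUPER_CLASSES = sorted(set(FINE_TO_SUPER.values()))   # n = 6 super-class labels


def encode_labels(fine_name):
    """Return the (fine-class index, super-class index) pair used as training targets."""
    return FINE_CLASSES.index(fine_name), SUPER_CLASSES.index(FINE_TO_SUPER[fine_name])


print(len(FINE_CLASSES), len(SUPER_CLASSES))  # 10 6
```

Each training chip then carries two targets, one per classification branch, and the condition n ≤ m holds by construction because every super-class is built from at least one fine class.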
The EOC is closer to real battlefield situations, and this article selects the configuration-variant (EOC-C) and version-variant (EOC-V) datasets. The EOC-C refers to the addition or removal of discrete components on the target, such as removing the fuel barrel on the T72. The EOC-V refers to target version variation, which means that after some armored vehicles are finalized, they are upgraded, for example by adding state-of-the-art reactive armor or replacing the main gun with one of a larger caliber. The EOC dataset contains two BMP2 variants (9566 and c21) and ten T72 variants (812, S7, A04, A05, A07, A10, A32, A62, A63, and A64). Optical images and the corresponding SAR images of eight T72 targets are shown in Fig. 7. It can be seen that the T72 variants are almost indistinguishable, which brings challenges to SAR target recognition.
A summary of EOC-C and EOC-V for training and testing datasets is listed in Tables II and III. There are four target types (BMP2, BRDM2, BTR70, and T72) for EOC training sets with a depression angle of 17°. The EOC-C test set has two target types (BMP2 and T72) with seven different configuration variations, and EOC-V has one target type (T72) with five version variations. The EOC dataset is of great significance for evaluating the generalization ability of the MSA-SCNN model.

2) FUSAR-Ship Dataset: The dataset is constructed from 126 original Gaofen-3 images, covering a large variety of sea, land, coast, river, and island scenarios. It includes different classes of ship chips as well as samples of strong scatterer, bridge, coastal land, islands, sea, and land clutter. In this article, ten categories of samples are used to verify the effectiveness of the proposed MSA-SCNN model, and the SAR images of the ten categories are shown in Fig. 8. In addition, a summary of FUSAR-Ship for training and testing datasets is listed in Table IV. As with the MSTAR dataset, the fine-class labels are divided into several super-class labels, named ships (cargo, fishing, tanker, and other ships), lands (bridges, coastal lands, and land patches), sea (sea patches and sea clutter waves), and strong scatterer (strong false alarms).
In both the MSTAR and FUSAR-Ship datasets, image sizes differ between categories. To ensure that the SAR image size is the same as the network input size (88×88), this article adopts the random cropping method to process the data uniformly. Ten image slices are cropped from each training SAR image: one is the center crop, which forms the raw SAR image dataset, and the remaining nine are randomly cropped to form an expanded dataset. The randomly cropped SAR target may be incomplete, which helps to improve the generalization ability of the network while avoiding overfitting. Fig. 9 shows the result of randomly cropping one of the SAR images.
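The cropping scheme described above (one center crop plus nine random crops of 88 × 88 per training image) can be sketched as follows. The uniform sampling of crop positions, the 128 × 128 example chip, and the function name `ten_crops` are assumptions for illustration.

```python
import numpy as np


def ten_crops(image, size=88, n_random=9, rng=None):
    """Return one center crop followed by n_random uniformly placed random crops."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    if h < size or w < size:
        raise ValueError("image is smaller than the crop size")
    top, left = (h - size) // 2, (w - size) // 2
    crops = [image[top:top + size, left:left + size]]       # center crop first
    for _ in range(n_random):
        t = rng.integers(0, h - size + 1)                    # upper bound is exclusive
        l = rng.integers(0, w - size + 1)
        crops.append(image[t:t + size, l:l + size])
    return np.stack(crops)


chip = np.arange(128 * 128, dtype=float).reshape(128, 128)   # hypothetical 128 x 128 chip
crops = ten_crops(chip, rng=np.random.default_rng(0))
print(crops.shape)  # (10, 88, 88)
```

Test images would only need the center crop, so that evaluation stays deterministic while the training set grows tenfold.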

B. Results and Analysis Under SOC
In this experimental setup, the performance of the proposed architecture is evaluated on the SOC dataset. The super-class and fine-class labels are set according to Table I, and all the training SAR images are randomly cropped to expand the datasets. Each convolutional layer has four channels with kernel sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9. Meanwhile, considering that the super-class and fine-class losses are equally significant, the super-class loss weight λ is set to 0.5. Fig. 10 shows the fine-class classification performance of the proposed MSA-SCNN in the form of a confusion matrix for the SOC experiment. The confusion matrix is a visualization tool used to evaluate the target classification performance, whose rows correspond to the true category labels of the targets and whose columns represent the predicted category labels. The overall accuracy over the ten fine classes reaches 98.31%. In Fig. 10, the diagonal elements are much larger than the off-diagonal elements, which means MSA-SCNN has high classification accuracy for each target type in the SOC experiment. For the ZIL131 and ZSU23/4 categories, the accuracy even reaches 100%.
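How a confusion matrix and the accuracies derived from it are computed can be shown with a short sketch. The labels below are toy values rather than MSTAR results, and the function names are illustrative.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


def accuracies(cm):
    """Overall accuracy and per-class accuracy (each class needs at least one sample)."""
    per_class = np.diag(cm) / cm.sum(axis=1)
    overall = np.trace(cm) / cm.sum()
    return overall, per_class


# Toy labels for three classes, not taken from the experiments.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
overall, per_class = accuracies(cm)
print(cm)
```

The diagonal holds the correctly classified counts, so the overall accuracy is the trace divided by the total number of test samples, and each per-class accuracy is a diagonal entry divided by its row sum.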
MSA-SCNN also has classifier output in the super-class branch. Fig. 11 shows the super-class classification performance in the form of a confusion matrix on the SOC experiment. It can be seen that the classification accuracy under six super-class labels reaches 98.02%, which indicates that the features learned by the MSA-SCNN model can distinguish super-classes.
To comprehensively validate the superiority of the proposed MSA-SCNN, it is compared with a series of commonly used SAR target classification methods. These methods include MSRIHL-CNN [21], VDCNN [37], VGG-Net [16], Res-Net [25], ViT-B/16 [38], Swin-T [39], CA-MCNN [40], and MKSFF-CNN [22]. MSRIHL-CNN optimally fuses the deep features extracted by the CNN and the local edge features extracted by Haar-like templates. VDCNN is a multiview deep neural network that fuses SAR image features from multiple views of the same target layer by layer. VGG-Net and Res-Net are two commonly used deep neural network models. ViT and Swin-T are current state-of-the-art methods for image classification. CA-MCNN adopts the ASC model to extract SAR target component information and then fuses it with the deep features of the CNN to improve the recognition accuracy. MKSFF-CNN uses convolutional kernels of different sizes to extract multikernel-size deep features of the SAR target, and these features are then fused in an optimal way to obtain the lowest loss.
Table V displays the classification accuracy of these methods under the SOC experiment. MSRIHL-CNN, VDCNN, and CA-MCNN introduce prior knowledge from the SAR domain, which improves the completeness of the SAR feature representation. However, acquiring this prior knowledge is very complicated, and the prior information is easily changed by external factors. VGG-Net and Res-Net increase the network depth to extract the deep features of the SAR target and improve the classification accuracy. However, they use fixed-size convolutional kernels, which lose part of the scale features of SAR images. MKSFF-CNN uses convolutional kernels of different sizes to extract the multiscale features, but it does not pay enough attention to significant features and suffers from information redundancy. ViT and Swin-T replace the CNN backbone with a transformer structure, and the accuracy is improved.
The MSA-SCNN proposed in this article focuses on important features and suppresses unnecessary features by combining multiscale feature fusion with the attention module. Meanwhile, the prior knowledge of super-class labels is introduced to increase the multiscale feature difference between categories. The final accuracy reaches 98.31%, which is higher than that of the other methods. In particular, the 2S1, T62, ZIL131, and ZSU23/4 categories have higher accuracy. These categories belong to different super-class labels, which indicates that the model has indeed learned the distinct features of the super-classes. When these features are fused into the fine-class branch, the differences between the categories increase, and the recognition accuracy also improves.
To explain the effectiveness of MSA-SCNN more intuitively, the raw SAR test images and the output vectors of the fully connected layers of MSA-SCNN are mapped to a 2-D Euclidean space by the t-distributed stochastic neighbor embedding (t-SNE) [41] algorithm. t-SNE is a powerful dimensionality reduction algorithm that helps to study the distribution characteristics of high-dimensional data in a low-dimensional space. Fig. 12 illustrates the input SAR images and the fine-class classification output of Res-Net, MKSFF-CNN, and MSA-SCNN. It can be observed that the raw samples are mixed together and difficult to classify. The outputs of Res-Net and MKSFF-CNN are significantly better separated, but the feature distribution is uneven, so samples are easily misidentified. After being processed by MSA-SCNN, however, the samples with the same class label lie closer together and the distances between categories are larger, so the samples are easier to recognize. Fig. 13 illustrates the super-class classification output of MSA-SCNN. The introduction of super-class labels also makes the super-class samples more clearly separated from each other.

C. Results and Analysis Under EOC
In the EOC experiment, all training SAR images are also randomly cropped to expand the datasets. Each convolutional layer has four channels with kernel sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9. Meanwhile, considering that the super-class and fine-class losses are equally significant, the super-class loss weight λ is set to 0.5.
Figs. 14 and 15 show the fine-class and super-class classification performance of the proposed MSA-SCNN in the form of confusion matrices for the EOC-C and EOC-V experiments. It can be seen from the figures that both the fine-class accuracy and the super-class accuracy are around 97%. For the T72 variants A04, A32, and A62 in particular, the accuracy reaches 100%. Although the targets have many variants, introducing the super-class labels and the attention module allows the common features of the same super-class to be extracted, so that the targets can be well recognized. These results also substantiate that the proposed network can adapt to the classification of different target types and has a good generalization ability.
To verify the superiority of the proposed MSA-SCNN under the EOC experiment, this section compares MSA-SCNN with other methods. Tables VI and VII display the EOC-C and EOC-V classification accuracy of different methods. It can be seen that CGM has lower accuracy than the deep learning methods and cannot effectively extract the target features. Although Res-Net increases the network depth, it cannot fully represent SAR target features with fixed-size convolutional kernels alone. The performance of ViT and Swin-T on EOC is not as good as that on SOC, and their accuracy is similar to that of the CNN methods. VDCNN fuses multiview SAR target feature information, but it cannot effectively extract common features from variant targets, resulting in no significant improvement in classification accuracy. MSA-SCNN combines multiscale feature fusion with the attention module to obtain a more complete SAR target feature representation and introduces super-class labels for different category types. Therefore, the network learns the common features of the super-class, which assist the final fine classification. As a result, MSA-SCNN can adapt to different types and configuration variants of the target, with 98.46% accuracy on EOC-C and 99.63% accuracy on EOC-V. These results also indicate that the proposed MSA-SCNN outperforms the other methods.

D. Results and Analysis Under FUSAR-Ship
The FUSAR-Ship dataset has a variety of complex categories, including ships, lands, sea clutter waves, and strong scatterers, which poses a considerable challenge for SAR image classification. It is therefore also suitable for verifying the effectiveness of the proposed MSA-SCNN. Figs. 16 and 17 show the fine-class and super-class classification performance of the proposed MSA-SCNN in the form of confusion matrices. It can be seen from the figures that the fine-class accuracy and the super-class accuracy are 94.05% and 98.44%, respectively.
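The FUSAR-Ship grouping of ten fine classes into four super-classes, as listed in the dataset description, can be written as a lookup table that produces the targets for the super-class branch. The sketch below is illustrative: the string keys and the helper names are assumptions of this example, and the label names in the actual dataset files may differ.

```python
# Fine-class to super-class grouping as described for FUSAR-Ship in the text.
FINE_TO_SUPER = {
    "cargo": "ships",
    "fishing": "ships",
    "tanker": "ships",
    "other ships": "ships",
    "bridges": "lands",
    "coastal lands": "lands",
    "land patches": "lands",
    "sea patches": "sea",
    "sea clutter waves": "sea",
    "strong false alarms": "strong scatterer",
}


def to_super(fine_labels):
    """Map a sequence of fine-class names to their super-class names."""
    return [FINE_TO_SUPER[name] for name in fine_labels]


def accuracy(true_labels, pred_labels):
    """Fraction of positions where prediction and ground truth agree."""
    if not true_labels:
        return 0.0
    hits = sum(t == p for t, p in zip(true_labels, pred_labels))
    return hits / len(true_labels)
```

Under this grouping, confusing two ship types counts as a fine-class error but not as a super-class error, which is one plausible reason the super-class accuracy is higher than the fine-class accuracy.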
As in the previous experiments, this section compares MSA-SCNN with other methods, and the accuracy of each category is recorded in Table VIII. In terms of total accuracy, the proposed MSA-SCNN is more than 10% higher than the other CNN methods. ViT and Swin-T, two leading image classification methods, also have higher accuracy than the conventional CNN methods. In terms of individual categories, the cargo, fishing, and tanker categories have lower accuracy than the others for all methods, whereas MSA-SCNN introduces super-class labels, which increases the feature difference between categories and makes them easier to distinguish.
All the experiments show that the proposed MSA-SCNN has good recognition capability on both the MSTAR and FUSAR-Ship datasets and verify the superiority of the proposed framework.

E. Analysis of Super-Class Loss Weight λ
This section gives the classification performance of the proposed MSA-SCNN model under different super-class loss weights λ. The experiments are carried out under the SOC dataset, with ten fine-class and six super-class labels. All conditions follow the settings in Table I and only change the different super-class loss weights λ. Fig. 18 clearly shows the change in fine-class and super-class accuracies with a line graph.
It can be seen from the figure that as λ increases, the super-class and fine-class accuracies both first increase and then decrease. When λ is 0.5, the super-class and fine-class losses are considered equally important, and the fine-class accuracy reaches its maximum value of 98.31%. However, the super-class accuracy is not at its maximum at this point. When λ is 0.6, the super-class accuracy reaches its maximum value of 98.38%. A model trained with a larger λ pays more attention to the super-class features. When λ is too large, the fine-class branch has little effect on the network, and the learned features are not sufficient to distinguish fine classes. In this case, the total loss of the network is too large, and both the super-class and fine-class accuracies decrease.
When λ approaches 0, the network hardly pays attention to the features extracted by the super-class branch, which reduces the final feature difference of the target, so the super-class and fine-class accuracies are both low. These experiments also confirm the importance of introducing super-class labels. In practical applications, the value of λ can be adjusted as needed.
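The role of λ can be illustrated with a small sketch. The exact loss definition is not restated in this section, so the form below is an assumption: a convex combination (1 − λ)·L_fine + λ·L_super of two cross-entropy terms, chosen because it matches the behavior described above (equal weighting at λ = 0.5, no super-class contribution as λ approaches 0, little fine-class contribution when λ is large). The function names are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(max(probs[target], 1e-12))  # clamp to avoid log(0)


def total_loss(fine_probs, fine_target, super_probs, super_target, lam=0.5):
    """ASSUMED form: (1 - lam) * fine-class loss + lam * super-class loss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    fine = cross_entropy(fine_probs, fine_target)
    sup = cross_entropy(super_probs, super_target)
    return (1.0 - lam) * fine + lam * sup
```

With this form, sweeping `lam` over a grid such as 0.1 to 0.9 and recording both accuracies would reproduce the kind of curve shown in Fig. 18.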

F. Attention Visualization
To analyze the effect of the attention module CBAM, we apply the Grad-CAM [42] to the super-class and fine-class branches using SAR images from the SOC dataset. Grad-CAM is a recently proposed visualization method that calculates the importance of the spatial locations in convolutional layers. By observing the regions that MSA-SCNN has considered important for predicting a class, we attempt to look at how this network is making good use of multiscale features.
The best visualizations are often obtained after the deepest convolutional layer in the network, and localizations get progressively worse at shallower layers [42]. Therefore, this article selects the last CBAM layer features of the super-class and fine-class branches in MSA-SCNN to generate visualization results. Fig. 19 illustrates the CBAM visualization results of different branches of MSA-SCNN, and Grad-CAM results clearly show areas of interest. It can be seen that different branches pay different attention to SAR targets. The super-class branch focuses on the global contour features of the SAR target, whereas the fine-class branch pays more attention to the local detail features. Therefore, the two branches make full use of the multiscale features of the SAR target. In addition, the introduction of the attention module makes the model focus on the target features, and the background clutter features of the SAR image are suppressed. Finally, MSA-SCNN fuses the features of the two branches to further increase the multiscale feature difference between categories and improve the generalization ability of the model, which is well proved by the EOC and the FUSAR-Ship experimental results.
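For reference, the Grad-CAM map can be written compactly following the formulation in [42]. The notation is introduced here for illustration and is not taken from this article: A^k is the k-th feature map of the selected layer (here, the output of the last CBAM layer), y^c is the score of class c before the softmax, and Z is the number of spatial positions in the feature map.

```latex
\alpha_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A_{ij}^{k}},
\qquad
L_{\text{Grad-CAM}}^{c} = \operatorname{ReLU}\!\left( \sum_{k} \alpha_k^{c} A^{k} \right)
```

The weights α_k^c average the gradients over all spatial positions, and the ReLU keeps only the regions that contribute positively to class c. Applying the same expression to the super-class and fine-class branches yields the two sets of maps compared in Fig. 19.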

G. Classification Accuracy Evaluation Under Small-Size Training Datasets
To show that the proposed MSA-SCNN still performs well with a small sample size, a series of comparative experiments are carried out on incomplete SOC datasets. A certain proportion of samples are randomly selected from each category for training. Fig. 20 shows that no matter how the proportion of samples changes, the classification accuracy of MSA-SCNN is always higher than that of the other methods. Even if the proportion of samples is 0.2 in each category, the classification accuracy of MSA-SCNN can reach 87.21%. This can be explained by the introduction of the attention module and super-class labels, which make it easier for MSA-SCNN to focus on important features and to increase the feature differences between categories with small samples. Meanwhile, MSA-SCNN avoids overfitting and improves the generalization ability of the model.
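The reduced training sets keep a fixed proportion of the samples in each category. A minimal sketch of such a draw is given below, assuming that sampling is done independently per category without replacement and that at least one sample is kept per category. The function name and the rounding rule are choices made for this example.

```python
import random
from collections import defaultdict


def subsample_per_class(labels, proportion, seed=0):
    """Return sorted indices that keep `proportion` of each class."""
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must lie in (0, 1]")
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    kept = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Keep at least one sample so that no class disappears entirely.
        count = max(1, round(proportion * len(indices)))
        kept.extend(rng.sample(indices, count))
    return sorted(kept)
```

Sampling per category rather than over the pooled dataset keeps the class balance of the original training set at every proportion.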

H. Model Size and Computing Efficiency Evaluation
The computation time is an important indicator of the efficiency of a classification method. All experiments were run on the same workstation, which has an Intel Core i7-7700K CPU at 4.20 GHz, 32.0 GB of memory, and an Nvidia GeForce RTX 3090 GPU with 24.0 GB of memory. The number of parameters, the model size, and the computation time per batch of 64 SAR images are recorded for all models in Table IX. In terms of model size and number of parameters, MSA-SCNN is much smaller than the other models. This is because the designed network structure is simple and the number of network layers is small. Compared with MSA-SCNN, VGG-Net, Res-Net, and MKSFF-CNN have higher structural complexity, so these models have more parameters. The ViT model, which uses the transformer structure, has the largest number of parameters. Swin-T uses the tiny version, so it has fewer parameters than ViT. MKSFF-CNN fuses the multiscale features of each layer into the final fully connected layer, resulting in a sharp increase in the parameters of the fully connected layer. Compared with MKSFF-CNN, the proposed MSA-SCNN introduces an attention module to avoid the redundancy of multiscale feature information and greatly reduces the parameters of the fully connected layer. In addition, in terms of computation time, MSA-SCNN has a shorter inference time than the deeper VGG-Net and Res-Net. MKSFF-CNN has a parallel multiscale feature extraction network, so its computation time is comparable to that of MSA-SCNN. In general, the proposed MSA-SCNN model has a simpler structure and shorter computation time than the other methods.
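The parameter counts discussed above follow from standard formulas for convolutional and fully connected layers. The sketch below shows how kernel size and the width of the fully connected input drive those counts. The channel widths and feature sizes a caller passes in are hypothetical, since the layer configuration is not restated in this section.

```python
def conv_params(kernel, in_ch, out_ch, bias=True):
    """Weights of a kernel x kernel convolution, plus one bias per filter."""
    return (kernel * kernel * in_ch + (1 if bias else 0)) * out_ch


def fc_params(in_features, out_features, bias=True):
    """Weights of a fully connected layer, plus one bias per output unit."""
    return (in_features + (1 if bias else 0)) * out_features


def multiscale_conv_params(kernels, in_ch, out_ch_per_branch):
    """Total over branches that apply different kernel sizes to one input."""
    return sum(conv_params(k, in_ch, out_ch_per_branch) for k in kernels)


def size_megabytes(num_params, bytes_per_param=4):
    """Approximate storage for float32 parameters, in units of 10**6 bytes."""
    return num_params * bytes_per_param / 1e6
```

Because `fc_params` grows linearly with `in_features`, concatenating the multiscale features of every layer into the final fully connected layer inflates its parameter count, which is the effect attributed to MKSFF-CNN in the text.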

I. Ablation Experiment
The ablation experiments are designed under the SOC datasets to illustrate the superiority of MSA-SCNN objectively and comprehensively. The MSA-SCNN is divided into three parts: multiscale feature fusion, attention module CBAM, and super-class branch. Table X records the impact of each part on the SAR target classification performance.
According to Table X, multiscale feature fusion, the attention module, and the super-class branch all help to enhance classification accuracy. First, adding only the super-class branch improves the classification accuracy by about 2%, reaching 96.48%, which demonstrates the effectiveness of super-class labels in MSA-SCNN. Adding only the attention module also improves the accuracy. Then, after applying convolutional kernels of different sizes, the feature representation of SAR targets is more complete and the classification accuracy is better. When two convolutional kernels of different sizes are combined for feature extraction, the classification accuracy increases by about 3% to reach 97%. When three convolutional kernels of different sizes are combined, the accuracy is further improved. Finally, when convolutional kernels with sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9 are combined, as in the proposed MSA-SCNN, the classification accuracy increases by about 4% to reach 98.31%. The ablation experiment thus also shows that it is reasonable for the proposed MSA-SCNN to choose convolutional kernels with sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9 for feature extraction.

J. Combine With Detection Networks
The proposed MSA-SCNN model can be combined with the detector to achieve the task of SAR target detection. Object detection methods are usually divided into two-stage detectors and one-stage detectors. In two-stage detectors, e.g., Faster-RCNN [43] and FPN [44], the ROIs are generated by the region proposal module in the first stage. Then, the features of these proposals are processed by two branches of bounding box regression and classification. Since the bounding box regression and classification of the two-stage detector are separated, the proposed MSA-SCNN can replace the original classification branch. Fig. 21 shows the combination of MSA-SCNN and the two-stage detector.
In one-stage detectors, e.g., single shot detector (SSD) [45] and you only look once (YOLO) [46], the network directly predicts the locations and class labels of potential objects at several feature maps without ROI proposals. Therefore, if MSA-SCNN is to be combined with a one-stage detector, a separate branch needs to be added, and this work will be studied in the future.

V. CONCLUSION
This article proposes a novel network called MSA-SCNN for SAR target classification. First, this method combines multiscale feature fusion with the attention module, so that the network focuses on target features, suppresses the background clutter, and avoids information redundancy. Second, MSA-SCNN introduces super-class labels to extract the common features of the super-classes, which are fused into the fine-class branch to increase the feature difference between the categories. Finally, experiments on the MSTAR and FUSAR-Ship datasets show that MSA-SCNN can achieve better classification performance than the traditional CNN methods. Especially in EOC and FUSAR-Ship experiments, the generalization ability of MSA-SCNN is stronger. Future work will include research on other prior knowledge of SAR images and MSA-SCNN combined with the one-stage detector.