MdpCaps-Csl for SAR Image Target Recognition With Limited Labeled Training Data

Although convolutional neural networks (CNN) have shown excellent performance in many image recognition tasks, they commonly require a large amount of labeled data, and recognition results are frequently unsatisfactory when labeled training data are limited. In recent years, the capsule network (CapsNet) has been shown to achieve high recognition accuracy with a small set of training samples. In this study, a class separable loss based on cosine similarity is proposed to enhance the distinguishability of the extracted features. It is added as a regularization term to the original loss function to train the network, narrowing the intra-class difference and increasing the inter-class difference in each iteration. Meanwhile, a multi-dimensional parallel capsule module is established to obtain robust features and spatial relationships from the original images. Feature maps from convolutions at different levels are extracted as the input of this module. Structural features derived from low-level convolution and semantic features derived from high-level convolution are used for low-dimensional and high-dimensional capsule coding, respectively. In our experiments, the public moving and stationary target acquisition and recognition (MSTAR) database is used. The accuracy of the multi-dimensional parallel capsule network with class separable loss (MdpCaps-Csl) reaches 99.79% using all training samples, which is higher than that of most current recognition methods. More importantly, the accuracy remains 97.73% even when only 10% of the training samples are used, indicating that MdpCaps-Csl performs excellently with limited training samples.


I. INTRODUCTION
As an active microwave remote sensing imaging system, synthetic aperture radar (SAR) can penetrate clouds and vegetation to identify concealed information at high resolution and is little influenced by weather conditions [1]. SAR has been widely used in battlefield reconnaissance [2], environmental monitoring [3], geological survey [4], disaster assessment [5] and other fields. However, compared with optical images, SAR images contain strong speckle noise, so it is difficult for human observers to interpret SAR image targets accurately and effectively [6], [7]. Therefore, it is of great significance to achieve SAR image automatic target recognition (SAR-ATR) with more effective techniques.
Previous work on SAR image target recognition was mainly based on template matching [8], [9]. Such methods determine the type of a tested target by comparing it with a series of templates generated from training samples. However, increasing the number of recognition types and small deformations often produce obvious scattering changes, significantly reducing recognition accuracy. Hence some investigators proposed methods based on pattern recognition [10], which transform the original data into appropriate feature vectors and then use the extracted features to train a classifier that identifies the targets. Feature extraction can effectively reduce the feature dimension of SAR images, and its quality directly affects the performance of the subsequent classifiers. Commonly used feature extraction methods include principal component analysis (PCA) [11], [12], linear discriminant analysis (LDA) [13], independent component analysis (ICA) [14], fast Fourier transform (FFT) [15] and the ratio detector (RD) [16]. It is difficult to classify SAR image targets satisfactorily by using general classifiers directly, so classifier design is generally based on the extracted features. Support vector machines (SVM) [17], [18], adaptive boosting (AdaBoost) [19], k-nearest neighbor (KNN) [20] and sparse representation-based classifiers (SRC) [21], [22] are widely used. Although these methods are effective in SAR image target recognition, they typically require manual feature extraction, which is easily affected by human subjectivity. At the same time, it is difficult to ensure the effectiveness of the algorithm since the classifier provides no feedback to the feature extractor.
As a representation learning method that learns features automatically from large amounts of data, deep learning has been successfully applied in many fields, e.g., image classification [23], speech recognition [24] and natural language processing [25]. Different from traditional methods, deep learning can automatically extract more abstract and distinctive features through deep structures [26]. Recently, deep learning, especially methods based on convolutional neural networks (CNN) [27], has also made great achievements in SAR image target recognition. Chen et al. [28] used an unsupervised sparse autoencoder to initialize the convolution kernels and accelerate the CNN's feature learning, achieving an accuracy of 84.7% on 10 categories of moving and stationary target acquisition and recognition (MSTAR) target classification. Chen et al. [29] proposed all-convolutional networks (A-ConvNet) consisting only of sparsely connected layers without fully-connected layers, with an accuracy of 99.13%. Jiang et al. [30] designed a SAR image target recognition method based on hierarchical fusion of CNN and attributed scattering center (ASC) matching, which not only inherits the CNN's excellent recognition performance but also maintains model robustness through ASC. Wang et al. [31] proposed an enhanced squeeze and excitation network (ESENet) to reduce the impact on SAR-ATR performance of feature maps with little information that are automatically produced by CNN; the enhanced-SE module suppresses these feature maps by computing and allocating different weights to the corresponding maps. Shao et al. [32] designed a lightweight CNN model for SAR image target recognition, which greatly reduced iteration time and effectively alleviated the negative impact of data imbalance on recognition performance, achieving an accuracy of 99.54%.
Although classification accuracy has been greatly improved, CNN-based recognition methods often need multiple convolution kernels to perform the same convolution operation, requiring a large amount of labeled training data. The data available for SAR image target recognition are limited compared with optical images, since accurately acquiring SAR image data is expensive as well as time-consuming. Insufficient data makes it difficult to train the network effectively, thereby further limiting the development of SAR-ATR.
Data augmentation has been proposed to solve this problem. It changes parts of the data structures and their combinations in the existing data set, creating an ''expanded data set'' that is added to the original training set to increase the number of training samples [33].
However, building and optimizing a complex data augmentation combination to achieve higher recognition performance is a time-consuming manual task in practice. At the same time, some studies have found that certain linear composite images can reduce recognition accuracy rather than increase it [34]. The capsule network (CapsNet) [35] was proposed by Sabour et al. in 2017 and has been shown to outperform CNN when only a small number of training samples is available. The main difference between CapsNet and CNN is that CNN keeps adding layers to create deep networks, while CapsNet embeds neural layers within another layer. A capsule is a group of neurons that introduces more structure into the network and generates a vector to represent an object in the image [36], [37]. Most importantly, CapsNet integrates pose information and spatial attributes, allowing it to learn well from a small set of data. This approach is closer to the human brain's mode of thinking, and it better exposes the hierarchy of the internal knowledge representation in neural networks. As shown in Fig. 1, for a SAR image with several objects, the activity vector output by each capsule represents, through its length and orientation, the probability of the object's presence and its instantiation parameters. In this example, the blue capsule attempts to find the outline of the vehicle, and the red capsule attempts to find the shadow of the vehicle. Importantly, beyond these visible attributes, the image implicitly defines other properties of the object, including typical pose, accurate position, lighting conditions, deformation and other information.
Although CapsNet was proposed not long ago, it has been successfully applied to SAR images. Shah et al. [38] introduced a CapsNet structure composed of a convolution layer, two capsule layers and a decoder network for SAR image target recognition, whose accuracy reached 98.14% on tests of 10 classes of the MSTAR database.
Schwegmann et al. [39] applied CapsNet to the SAR ship detection task, exploiting its ability to detect smaller adjacent ships. Comer et al. [40] proposed a principal CapsNet (PCN) architecture for SAR image classification in the context of self-supervised learning (S³L). This architecture used invariant information clustering (IIC) and auto-encoding (AE) to learn from unlabeled data. Ma et al. [41] proposed an improved detection method for SAR images based on image mapping and CapsNet. In this work, the two heterogeneous images are first transformed and compared in the feature space. Then the classified images are sampled. Finally, the classification results are obtained by inputting the sampled results into CapsNet.
Therefore, CapsNet is studied in depth in the present work, building on the previous studies mentioned above. The original architecture is modified to work well for SAR image target recognition. The main contributions of our work can be summarized in the following three aspects:
1) In order to enhance the CapsNet's feature extraction capability for SAR image data, a class separable loss based on cosine similarity is added to the loss function. It is used as a regularization term, which reduces intra-class differences and increases inter-class differences during feature extraction.
2) A multi-dimensional parallel capsule module is proposed to learn the spatial features of the images at different dimensions and enhance the CapsNet's robustness under limited training data. The feature maps obtained by convolution at different levels are taken as inputs and capsule-encoded in this module to improve the recognition performance of the network.
3) The proposed multi-dimensional parallel capsule network with class separable loss (MdpCaps-Csl) is evaluated through extensive experiments on the MSTAR database. The results show that MdpCaps-Csl performs better than most existing methods, demonstrating good recognition performance whether using all training samples, part of the training samples, or even an extremely small number of training samples.
The proposed method is evaluated on the MSTAR database and validated through experiments. The remainder of this paper is organized as follows. The basic structure of CapsNet is introduced in Section II. The proposed class separable loss, multi-dimensional parallel capsule module, and MdpCaps-Csl for SAR image target recognition are described in Section III. The experimental results and discussion for all training samples, partial training samples, and fewer training samples are presented in Section IV. Finally, the conclusions are drawn in Section V.

II. CAPSULE NETWORK
CapsNet, a new capsule structure-based neural network, is robust to affine transformations. In CapsNet, a capsule represents various features of a specific entity in the image, e.g., position, size, orientation, hue and texture, and exists as a single logical unit. The data learned and predicted by a capsule are passed to higher-level capsules through the dynamic routing mechanism. A higher-level capsule is kept active when the predictions are consistent.
CapsNet has three advantages over the widely used CNN:
• The output of CapsNet is a vector with direction, while the output of CNN is a scalar. CapsNet can not only use statistical information to detect features but also understand them well. It can detect the same object in different orientations, thereby learning the underlying concept of the object.
• CNN requires the superposition of multiple convolution kernels to perform the same convolution operation, requiring a large number of training samples. In CapsNet, the model learns characteristic variables within the capsule to retain as much valuable information as possible, so it can use fewer training samples to infer the possible variants and achieve the same generalization as CNN.
• The pooling operation of CNN loses much important feature information, so the output is insensitive to small variations in the input. In contrast, the capsules of CapsNet carry different attributes, and each capsule carries a large amount of target information, so detailed pose information is preserved in the network. Therefore, CapsNet performs better than CNN at extracting target feature information.

Fig. 2 shows the process of transferring information from the low level to the high level through the dynamic routing mechanism in CapsNet. Similar to a general neuron, weighted summation and non-linear activation are applied during the computation of CapsNet. However, an additional matrix transformation operation is needed to transform low-level features into high-level features in CapsNet, considering the spatial and hierarchical relationships between objects. The working principle of CapsNet is described in detail as follows.
Firstly, the output u_i of the low-level feature capsule i is transformed into a prediction û_j|i of the high-level feature capsule j using the spatial transformation matrix W_ij. The conversion equation is:

û_j|i = W_ij u_i    (1)

Next, a weighted summation over the prediction capsules û_j|i yields the input vector s_j of the high-level feature capsule j:

s_j = Σ_i c_ij û_j|i    (2)

where c_ij is the coupling coefficient between the two capsules. Finally, as in CNN, an activation function is required to non-linearly activate the output during the capsule computation; the difference is that a capsule is a vector rather than a scalar, so the activation functions used in general neural networks are not suitable for capsules. The squash function keeps the length of the activated output vector v_j between 0 and 1 and ensures that v_j and the input vector s_j have the same direction:

v_j = (||s_j||² / (1 + ||s_j||²)) · (s_j / ||s_j||)    (3)

The first term of (3) normalizes according to the length of s_j: it is close to 1 for long s_j and close to 0 for short s_j. The second term of (3) is the unit vector of s_j, which keeps the direction of s_j unchanged with length 1. The length of the output vector v_j is taken as the probability that a specific entity is present.
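For concreteness, the squash activation described above can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' implementation):

```python
import numpy as np

def squash(s, eps=1e-8):
    # Scale the vector's length into [0, 1) while keeping its direction
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

s = np.array([3.0, 4.0])            # input vector of length 5
v = squash(s)
print(np.linalg.norm(v))            # 25/26 ≈ 0.9615: long vectors approach length 1
print(v / np.linalg.norm(v))        # same direction as s: [0.6, 0.8]
```

A short input vector (length well below 1) is shrunk toward 0 instead, so the output length can be read as a probability.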
Unlike the weighted summation of general neurons, the weights in the weighted summation of capsules are determined by the dynamic routing mechanism rather than the back-propagation algorithm; routing updates the coefficients c_ij that connect the low-level feature capsules to the high-level feature capsules. The algorithm flow is as follows:

Step 1: The connection logit b_ij between the low-level feature capsule i and the high-level feature capsule j is initialized to 0.

Step 2: The coupling coefficient c_ij varies with the capsule connections, indicating that the low-level feature capsules contribute differently to the high-level feature capsules. It is calculated by the softmax function:

c_ij = exp(b_ij) / Σ_k exp(b_ik)    (4)

Step 3: All the prediction capsules are weighted and summed using (2), and the output capsule v_j is obtained through the squash activation (3).

Step 4: The dot product of the prediction capsule û_j|i of each lower-layer capsule with the output capsule v_j updates the connection logit:

b_ij ← b_ij + û_j|i · v_j    (5)

Step 5: The loop ends when the predetermined number of routing iterations is reached; otherwise it returns to Step 2.

Except for the coupling coefficients c_ij, all other parameters are updated with the back-propagation algorithm to minimize the loss function. Assuming that a task contains P samples and K labels, the margin loss for the k-th capsule of the p-th sample (the number of high-level feature capsules equals the number of image labels) is:

L_k^p = T_k max(0, m⁺ − ||v_k^p||)² + λ_margin (1 − T_k) max(0, ||v_k^p|| − m⁻)²    (6)

where T_k = 1 if and only if the k-th capsule corresponds to the true label of the sample; otherwise T_k = 0. m⁺ and m⁻ control the learning intensity of the network. Ideally, the network not only learns the correct labels but also ensures that the output probability of the corresponding capsule is not less than m⁺ when the input carries the correct label.
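The routing loop above (Steps 1–5) can be sketched in NumPy; the shapes here (32 low-level capsules, 10 classes, 16-dimensional high-level capsules) are illustrative, not taken from the paper:

```python
import numpy as np

def squash(s, eps=1e-8):
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    # u_hat: prediction capsules, shape (num_low, num_high, dim_high)
    num_low, num_high, _ = u_hat.shape
    b = np.zeros((num_low, num_high))                 # Step 1: logits start at 0
    for _ in range(num_iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))  # Step 2: softmax over
        c = e / e.sum(axis=1, keepdims=True)          #         high-level capsules
        s = (c[..., None] * u_hat).sum(axis=0)        # Step 3: weighted sum
        v = squash(s)                                 #         and squash
        b = b + (u_hat * v[None]).sum(axis=-1)        # Step 4: agreement update
    return v                                          # Step 5: stop after num_iters

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(32, 10, 16))   # 32 low-level capsules, 10 classes
v = dynamic_routing(u_hat)
print(v.shape)                           # (10, 16): one output capsule per class
```

Only the logits b are updated inside the loop; the transformation matrices producing u_hat are learned by back-propagation, as the text notes.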
Similarly, the output probability should be no higher than m⁻ when the sample does not belong to the current label. The parameter λ_margin adjusts the loss contribution of capsules that do not correspond to the true label, which helps ensure model stability.
In addition, the network uses a decoder consisting of multiple fully-connected layers to reconstruct the input images. The squared Euclidean distance between the input image x_p and the reconstructed image x̂_p is defined as the reconstruction loss; fine-tuning the network parameters with it improves the accuracy of the final classification [42]. The reconstruction loss is:

L_rec^p = ||x_p − x̂_p||²    (7)
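A minimal sketch of the margin loss and reconstruction loss described above, using the m⁺, m⁻ and λ_margin values quoted later in the experimental settings (the function names and example values are ours):

```python
import numpy as np

def margin_loss(v_len, true_k, m_pos=0.9, m_neg=0.1, lam=0.1):
    # Margin loss summed over the K capsules of one sample; v_len[k] = ||v_k||.
    # lam is the paper's lambda_margin (0.1 in the experimental settings).
    T = np.zeros(len(v_len)); T[true_k] = 1.0
    return float(np.sum(T * np.maximum(0.0, m_pos - v_len) ** 2
                        + lam * (1 - T) * np.maximum(0.0, v_len - m_neg) ** 2))

def reconstruction_loss(x, x_rec):
    # Squared Euclidean distance between input and reconstruction
    return float(np.sum((x - x_rec) ** 2))

v_len = np.array([0.95, 0.05, 0.20])    # capsule output lengths for 3 classes
print(margin_loss(v_len, true_k=0))     # 0.1 * (0.20 - 0.1)^2 = 0.001
```

Here only the third capsule exceeds m⁻ while not being the true label, so it is the only one penalized.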

III. IMPROVEMENT OF CAPSULE NETWORK
The recognition performance of CapsNet is directly determined by its network structure. In this section, a class separable loss based on cosine similarity and a multi-dimensional parallel capsule module are proposed to enhance the feature extraction ability and improve the recognition performance of the network.

A. CLASS SEPARABLE LOSS
The original loss function primarily focuses on the overall information of the SAR images and ignores their differences, while SAR images contain a large amount of common information with small differences. Consequently, the original loss function enables CapsNet to learn the approximate distribution of the training samples, but makes it difficult to increase the difference between the sub-distributions of different image categories, which may ultimately limit the classification performance of CapsNet. Some researchers have tried to strengthen the class separability of features extracted by models using contrastive loss [43], [44], triplet loss [45]-[48] and multi-class n-pair loss [49]. Although these methods can improve classification accuracy to a certain extent, they do not make full use of the hierarchical relationship between classes. In fact, many image target recognition tasks can be divided into parent classes and subclasses. The parent class refers to the coarse classification of image targets, and the subclass refers to the fine classification of image targets. The same parent class can be subdivided into multiple subclasses. Here, we construct a class separable loss based on cosine similarity by using the hierarchical relationship between the parent classes and the subclasses of SAR images, and add it to the original loss function to enhance the recognition performance of CapsNet.
The cosine similarity C(x_i, x_j) of images x_i and x_j is:

C(x_i, x_j) = (x_i · x_j) / (||x_i|| ||x_j||)    (8)

The closer the value is to 1, the more similar the two images are. To determine the sample center of a subclass, we first calculate the theoretical average of all images in the subclass, and then select the image with the largest cosine similarity to this average as the sample center. As shown in Fig. 3, for the input SAR image x_p, the sample center of its own subclass is c_p^i. c_p^o is the sample center of the subclass closest to x_p within the same parent class as x_p:

C(x_p, c_p^o) = max_{j=1,…,n_o} C(x_p, c_j)    (9)

where n_o is the number of subclasses sharing the same parent class as x_p. Note that the sample center c_p^i of x_p's own subclass is not involved in this calculation. c_p^l is the sample center of the subclass closest to x_p but belonging to a different parent class:

C(x_p, c_p^l) = max_{j=1,…,n_l} C(x_p, c_j)    (10)

where n_l is the number of subclasses originating from parent classes different from that of x_p. Ideally, the similarity relationships among these images should satisfy:

C(x_p, c_p^i) − C(x_p, c_p^o) ≥ m_o,  C(x_p, c_p^i) − C(x_p, c_p^l) ≥ m_l    (11)

where m_l and m_o are hyper-parameters used to control the boundaries of the feature space, and m_l > m_o > 0. We hope the network satisfies these constraints during training. The hinge loss is a common loss function that is often used to solve maximum-margin problems [50], [51]. It is sparse, which reduces computational expense and enhances the feature extraction ability of the model on nonlinear problems. To strengthen the constraint of the similarity relationships during network training, we construct a class separable loss L_class^p based on the hinge loss:

L_class^p = max(0, m_o − (C(x_p, c_p^i) − C(x_p, c_p^o))) + max(0, m_l − (C(x_p, c_p^i) − C(x_p, c_p^l)))    (12)

This loss models the multi-level similarity relationship between the parent classes and subclasses of the training samples.
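The construction above can be sketched in NumPy. Since the published equations are only summarized here, the hinge form below is our reconstruction under the stated margin constraints (m_l > m_o > 0); the function names and test vectors are illustrative:

```python
import numpy as np

def cos_sim(a, b, eps=1e-8):
    # Cosine similarity; closer to 1 means more similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def class_separable_loss(x, c_own, same_parent_centers, other_parent_centers,
                         m_o=0.1, m_l=0.3):
    # Hinge-style reconstruction: the sample should be closer to its own
    # subclass centre c_own than to the nearest centre of the same parent
    # (margin m_o) or the nearest centre of a different parent (margin m_l).
    sim_own = cos_sim(x, c_own)
    sim_o = max(cos_sim(x, c) for c in same_parent_centers)   # nearest, same parent
    sim_l = max(cos_sim(x, c) for c in other_parent_centers)  # nearest, other parent
    return (max(0.0, m_o - (sim_own - sim_o))
            + max(0.0, m_l - (sim_own - sim_l)))

x = np.array([1.0, 0.1])
loss = class_separable_loss(x, c_own=np.array([1.0, 0.0]),
                            same_parent_centers=[np.array([0.0, 1.0])],
                            other_parent_centers=[np.array([-1.0, 0.0])])
print(loss)   # 0.0: both margin constraints are already satisfied
```

When a constraint is violated, the corresponding hinge term becomes positive and pulls the sample toward its own subclass centre during training.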
The new loss function L_total is the weighted sum of the three losses, obtained by adding L_class^p to the original loss function:

L_total = Σ_{p=1}^{P} ( Σ_{k=1}^{K} L_k^p + λ_c L_class^p + λ_r L_rec^p )    (13)

where K is the number of sample labels. λ_r is set to 0.0005 to scale down the reconstruction loss; its value hardly affects the overall loss function during training. Note that in a few recognition tasks image objects can only be divided into subclasses; these can be considered special cases of the above, in which each parent class contains only one subclass.
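A toy computation of the weighted sum above for a single sample, using the λ_r value given in the text and the λ_c value selected later in the experiments (the individual loss values are made up for illustration):

```python
# Hypothetical per-sample loss values
margin = 0.12          # sum of margin losses over the K capsules
l_class = 0.05         # class separable loss
l_rec = 40.0           # reconstruction loss (large before scaling)

lambda_c, lambda_r = 0.5, 0.0005
l_total = margin + lambda_c * l_class + lambda_r * l_rec
print(l_total)         # 0.12 + 0.025 + 0.02 = 0.165
```

The small λ_r keeps the pixel-level reconstruction term from dominating the classification terms.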

B. MULTI-DIMENSIONAL PARALLEL CAPSULE MODULE
Low-level convolutions and high-level convolutions extract low-level structural features and high-level semantic features in CNN, respectively. In order to preserve the spatial information of the images, only one scalar convolution layer is used for feature extraction in the original CapsNet, and the absence of high-level semantic information makes it perform poorly on complex classification tasks. Xiang et al. [52] proposed a multi-scale capsule network (MS-CapsNet) to obtain robust features from the original images. In the first stage of its multi-scale capsule coding unit, there are three feature extraction routes that are independent of each other. Although this improves recognition accuracy to a certain extent, the feature extraction routes of MS-CapsNet also bring many additional convolution operations. Besides, the first layer of MS-CapsNet retains the large scalar convolution layer used in the original CapsNet, which may lead to overfitting as the number of network parameters increases.
The multi-dimensional parallel capsule module proposed in our study fixes this problem. As shown in Fig. 4, the dotted part on the left represents the three scalar convolution layers in front of the module. branch_1, branch_2 and branch_3 represent the low-, medium- and high-dimensional capsule coding branches, respectively; s_1, s_2 and s_3 are the strides of the branches; c_1 × c_1, c_2 × c_2 and c_3 × c_3 are the convolution kernel sizes; and n_1, n_2 and n_3 are the numbers of feature maps. The inputs of branch_1 are the feature maps produced by Conv1, Conv2 and Conv3; the inputs of branch_2 are those produced by Conv1 and Conv2; and the inputs of branch_3 are those produced by Conv1 only. Each branch has its own stride and number of feature maps. The features obtained by convolutions at different levels are thus encoded at three dimensions in this module, enabling the network to obtain a multi-dimensional feature representation. Compared with the multi-scale capsule coding unit of MS-CapsNet, the multi-dimensional parallel capsule module does not need to add convolution operations. The overall output branch can be expressed as a concatenation of the three branches:

branch = concat (branch_1, branch_2, branch_3)    (14)
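The concatenation in (14) can be illustrated with the branch output shapes described in Section III-C — 8, 4 and 4 vector feature maps of size 12 × 12 with capsule length 8 (random tensors stand in for the real branch outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
branch1 = rng.normal(size=(12, 12, 8, 8))   # low-dimensional coding branch
branch2 = rng.normal(size=(12, 12, 4, 8))   # medium-dimensional coding branch
branch3 = rng.normal(size=(12, 12, 4, 8))   # high-dimensional coding branch

# Concatenate along the feature-map axis, as in Eq. (14)
branch = np.concatenate([branch1, branch2, branch3], axis=2)
print(branch.shape)   # (12, 12, 16, 8)
```

Flattened, this gives 12 × 12 × 16 = 2304 length-8 capsules entering the routing stage, without any extra convolution operations beyond the branches themselves.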

C. MdpCaps-CSL STRUCTURE
MdpCaps-Csl for SAR image target recognition is proposed in this paper. As shown in Fig. 5, the network includes an input layer, three scalar convolution layers, a multi-dimensional parallel capsule module, a digitcaps layer and an output layer. In addition, a decoder network with three fully-connected layers reconstructs the input images using the instantiation parameters of the digitcaps layer.
Each 96 × 96 input image generates 8 feature maps of size 48 × 48, 16 feature maps of size 24 × 24 and 32 feature maps of size 12 × 12 after passing through the first, second and third zero-padded convolutions, each with stride 2 and ReLU [53] as activation function. These feature maps of different sizes are used as the input of the next layer. The multi-dimensional parallel capsule module passes them through three zero-padded capsule branches with convolution kernel sizes of c_1 × c_1, c_2 × c_2 and c_3 × c_3 and strides of 1, 2 and 4, respectively, generating 8, 4 and 4 vector feature maps of size 12 × 12 with vector length 8. The concatenated feature maps are used as the input of the digitcaps layer, whose capsule length is 16 and whose capsule number equals the number of image labels. The weights between the multi-dimensional parallel capsule module and the digitcaps layer are updated by the dynamic routing mechanism. The output layer determines the input image type based on the length of the digitcaps layer's output vectors v_j.
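The feature-map sizes quoted above follow from zero-padded ("same") convolutions with stride 2, which halve the spatial size (out = ceil(in / stride)); a quick check:

```python
import math

def same_conv_out(size, stride=2):
    # Output spatial size of a zero-padded convolution with the given stride
    return math.ceil(size / stride)

sizes = [96]
for _ in range(3):          # Conv1, Conv2, Conv3
    sizes.append(same_conv_out(sizes[-1]))
print(sizes)                # [96, 48, 24, 12]
```

This matches the 48 × 48, 24 × 24 and 12 × 12 feature maps produced by the three scalar convolution layers.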

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. MSTAR DATABASE
The proposed method has been verified on the widely used MSTAR database [54]. The data were collected by the Sandia National Laboratories (SNL) X-band SAR sensor platform in spotlight imaging mode and were co-funded by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL). The resolution of the database is 0.3 m × 0.3 m and the aspect angle coverage is 0°-360°. It has been widely used in the examination and evaluation of SAR image target recognition algorithms.
The MSTAR database mainly consists of ground military vehicle and civilian vehicle images of different target types, aspect angles, depression angles, serial numbers, articulation and version variants. The database has 3 parent classes with 10 different subclasses (Artillery: 2S1 and ZSU234; Truck: BRDM2, BTR60, BMP2, BTR70, D7, and ZIL131; Tank: T62 and T72). Their SAR images and corresponding optical images are shown in Fig. 6. In our experiments, vehicle targets with depression angle of 17 • and 15 • are used as training samples and test samples, respectively. The detailed information of training set and test set are shown in Table 1.

B. EXPERIMENTAL SETTINGS
Most recognition models require input images of the same size. Most images in the database are 128 × 128, but some are larger, e.g., the image sizes of types 2S1 and T-62 are 158 × 158 and 172 × 173, respectively. To avoid the influence of background noise, we directly crop the center of all SAR images to 96 × 96 before inputting them into the models.
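The center-cropping step can be sketched as follows (a simple NumPy helper, not the authors' preprocessing code):

```python
import numpy as np

def center_crop(img, out=96):
    # Keep the central out x out patch to suppress background clutter
    h, w = img.shape[:2]
    top, left = (h - out) // 2, (w - out) // 2
    return img[top:top + out, left:left + out]

print(center_crop(np.zeros((128, 128))).shape)   # (96, 96)
print(center_crop(np.zeros((172, 173))).shape)   # (96, 96), e.g. a T-62 chip
```

Because the target sits near the image center in MSTAR chips, cropping discards mostly clutter pixels regardless of the original chip size.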
A computer with an Intel (R) Core (TM) i7 9800X @ 3.8GHz CPU, an NVIDIA GeForce RTX 2080Ti GPU and 16GB of memory is used in the experiments. The software environment is a 64-bit Ubuntu 16.04 operating system with CUDA 10.0.130, CuDNN 7.5.1, the TensorFlow deep learning framework and Python 3.6.5. Adam [55] is used as the gradient descent optimizer for training. After multiple trial-and-error experiments, the hyper-parameters are set as follows: λ_margin is 0.1, m⁺ is 0.9, m⁻ is 0.1, m_l is 0.3, m_o is 0.1, the number of training iterations is 50, the batch size is 16, the weight decay coefficient is 0.0001 and the dropout rate is 0.2.
The recognition accuracy and the confusion matrix are used as evaluation indicators in these experiments and are compared with current state-of-the-art methods. The recognition accuracy Ra is:

Ra = N_cor / N_sum    (15)

where N_sum is the total number of test samples of a given type, and N_cor is the number of correctly identified samples of that type. The larger the value of Ra, the better the classification performance. The confusion matrix is applied as a visualization method, where each row represents instances of an actual class and each column represents instances of a predicted class. It is calculated by comparing the actual classes of the images with the corresponding predicted ones.
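Both evaluation indicators can be computed in a few lines (an illustrative sketch with made-up labels):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Ra = N_cor / N_sum
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def confusion_matrix(y_true, y_pred, num_classes):
    # Rows: actual class; columns: predicted class
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2]
print(accuracy(y_true, y_pred))            # 0.8
print(confusion_matrix(y_true, y_pred, 3)) # one actual-0 sample predicted as 1
```

Off-diagonal entries of the confusion matrix directly show which target types are confused with one another.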

C. ANALYSIS OF EXPERIMENTAL PARAMETERS
In this section, following the method in [21], [56] and [57], we randomly select a proportion of the images with 17° depression angle as training samples and use all images with 15° depression angle as test samples, to evaluate the parameter λ_c, the convolution kernel size combination (c_1, c_2, c_3) of the capsule branches and the number of routing iterations under three representative training sample ratios: 10%, 50% and 100%. The optimal parameters are selected and applied in the subsequent experiments.

1) PERFORMANCE UNDER DIFFERENT PARAMETER λ_c
The initial value of the parameter λ_c has a significant impact on the final recognition performance of the network. When λ_c is 0, the loss function reduces to the original one. When λ_c is too large, the class separable loss weakens the influence of the original loss function on the overall parameters and the recognition performance declines. As shown in Fig. 7, the recognition accuracy of MdpCaps-Csl first increases and then decreases with increasing λ_c under all three training sample ratios, reaching its maximum when λ_c is 0.5. Therefore, λ_c is fixed at 0.5 in the subsequent experiments.

2) PERFORMANCE UNDER DIFFERENT CONVOLUTION KERNEL SIZES
The capsule branches of the multi-dimensional parallel capsule module are the core components of MdpCaps-Csl. These branches all learn from the preceding scalar convolution layers and can represent small entities in the SAR images. The convolution kernel size of each capsule branch has an important effect on its performance: a small kernel weakens the feature extraction capability of the network, while a large kernel may introduce much redundant information. We chose five combinations, (3, 3, 3), (3, 5, 5), (3, 5, 7), (5, 5, 5) and (7, 5, 3), to evaluate their impact on performance. As shown in Fig. 8, the recognition accuracy of the network is highest with the (5, 5, 5) scheme under all three training sample ratios. This scheme is applied in the subsequent experiments.

3) PERFORMANCE UNDER DIFFERENT ROUTE ITERATION NUMBERS
The coupling coefficient c_ij, an important parameter of the network, is updated by the dynamic routing mechanism. Choosing the right number of routing iterations helps obtain the best coupling coefficients: too few iterations cannot train the parameters effectively, while too many lead to overfitting and extra training time. Fig. 9 shows how the recognition accuracy of MdpCaps-Csl varies with the number of routing iterations under the three training sample ratios. The accuracy peaks in each case when the number of routing iterations is 3. Therefore, the number of routing iterations is set to 3 in the following experiments to achieve better recognition performance.

D. EXPERIMENTS ON ALL TRAINING SAMPLES
Firstly, all the training samples are used in the algorithm experiments. CapsNet, CapsNet-Csl and MdpCaps-Csl are compared to verify the class separable loss and the multi-dimensional parallel capsule module. CapsNet, the baseline, consists of an input layer, three scalar convolution layers, a primarycaps layer, a digitcaps layer, and an output layer, as shown in Fig. 10. Its loss function combines the margin loss and the reconstruction loss used in the initial proposal [35]. CapsNet-Csl is obtained from CapsNet by adding the class separable loss to the original loss function, as shown in (13). MdpCaps-Csl further adds the multi-dimensional parallel capsule module to CapsNet-Csl. All these models use the same experimental configuration and are fully trained.
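The margin loss used in the baseline, taken from the original CapsNet proposal [35], treats the length of each digit-capsule vector as the class probability. A minimal NumPy sketch with the standard hyperparameters (m+ = 0.9, m− = 0.1, λ = 0.5):

```python
import numpy as np

def margin_loss(v, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss from the original CapsNet paper.
    v: digit-capsule outputs, shape (batch, num_classes, dim)
    targets: one-hot labels, shape (batch, num_classes)."""
    lengths = np.linalg.norm(v, axis=-1)                       # ||v_k||
    # present classes: penalise capsule lengths below m_pos
    pos = targets * np.maximum(0.0, m_pos - lengths) ** 2
    # absent classes: penalise capsule lengths above m_neg, down-weighted by lam
    neg = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_neg) ** 2
    return np.sum(pos + neg, axis=-1).mean()
```

The full training objective of the baseline adds a reconstruction loss on top of this; CapsNet-Csl then adds the class separable loss as a further regularization term.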
The detailed test results of CapsNet, CapsNet-Csl and MdpCaps-Csl in Tables 2−4 show that the overall accuracy of MdpCaps-Csl is 99.79%, an improvement of 0.90% and 0.37% over CapsNet and CapsNet-Csl, respectively. Using all training samples, MdpCaps-Csl makes recognition errors in only 4 of the 10 vehicle target types, and the accuracy of each type exceeds 99%. Unlike CapsNet, CapsNet-Csl and MdpCaps-Csl, which add the class separable loss, make no errors at the parent-class level: every misidentified sample is misclassified as another subclass within its own parent class. This shows that the class separable loss based on cosine similarity can greatly enhance class separability. The feature vectors representing input instances in MdpCaps-Csl can be easily observed through image reconstruction. A SAR image reconstructed by multiple fully-connected layers is shown in Fig. 11. The reconstructed image retains the main features of the original target with good robustness and smooths out local noise compared with the original image. Table 5 compares the capsule structure-based MdpCaps-Csl with pattern recognition-based traditional methods and deep learning-based methods proposed in recent years.
The accuracy of MdpCaps-Csl is higher by 11.79%, 4.79%, 6.19%, 3.19%, 0.99%, 7.09%, 2.79% and 7.19%, respectively, than that of EMACH (extended maximum average correlation height) + PDCCF (polynomial distance classifier correlation filter) [58], IGT (iterative graph thickening) [59], SRC [22], MSS (monogenic scale space) [60], MPMC (modified polar mapping classifier) [61], AdaBoost [19], CGM (conditionally Gaussian model) [62], and BCS (Bayesian compressive sensing) + scattering centers [63]. MdpCaps-Csl is also slightly more accurate than other deep learning-based methods, e.g., CNN [28], ComplexNet [64], A-ConvNet [29], CNN + SVM [65], DCHUN [56], CNN-TL-bypass [66], CNN + ASC [30], LCNN + Visual Attention [32] and APCRLNet [57]. The above experiments show that MdpCaps-Csl performs well in recognition without data augmentation.

E. EXPERIMENTS ON PARTIAL TRAINING SAMPLES
Most reported SAR image target recognition methods use all training samples in their algorithm experiments, while training data manually labeled by professionals is typically limited in practice. We use the same method as in Section IV-C: a fraction of the images (10% to 100%) at 17° depression angle is randomly selected as training samples, and all images at 15° depression angle are used as test samples, to compare and test the recognition performance of MdpCaps-Csl with partial training samples. Fig. 12 shows the experimental results of MdpCaps-Csl, two traditional machine learning-based methods, SVM [21] and SRC [21], and six deep learning-based methods proposed in recent years: A-ConvNet [29], DCHUN [56], APCRLNet [57], probabilistic meta-learning (PML) [67], CapsNet and CapsNet-Csl. Notably, the results of A-ConvNet in this section are derived from reproduced code reported in a previous study [57]. Overall, the accuracies of SVM and SRC are much lower than those of the deep learning-based methods, and the fewer the training samples, the larger the difference among them. Using only 10% training samples, the accuracy of MdpCaps-Csl is much higher than that of A-ConvNet, APCRLNet, PML, CapsNet and CapsNet-Csl; the values are 97.73%, 73.44%, 78.10%, 89.0%, 89.11% and 93.48%, respectively. In addition, the accuracy of CNN-TL-bypass [66] is 97.15% using a total of 500 training samples, i.e., 50 samples randomly selected from each of the 10 target types. With only 10% of the training data (275 images selected as training samples), the accuracy of MdpCaps-Csl is 97.73%. This comparison shows that MdpCaps-Csl can recognize more accurately with fewer training samples than CNN-TL-bypass. The accuracies of the hierarchical fusion of CNN and the ASC matching method [30] and of the semi-supervised transfer learning model [68] are 87% and 91.36%, respectively, when using 20% of the training data.
The accuracy of MdpCaps-Csl under the same condition is 98.80%, far higher than the above two methods. Tables 6−8 show the detailed test results of CapsNet, CapsNet-Csl and MdpCaps-Csl using only 10% training samples. For different subclass targets within the same parent class, MdpCaps-Csl shows no confusion in ''Tank'' (yellow area), and only one artillery of type 2S1 is incorrectly classified as type ZSU-234 in ''Artillery'' (blue area); the main confusion lies in the parent class ''Truck'' (green area). In addition, among the different parent classes, CapsNet misclassified 49 SAR images, while CapsNet-Csl and MdpCaps-Csl, which use the class separable loss, misclassified only 13 and 8 SAR images, respectively. Accordingly, the coarse-level recognition accuracy of CapsNet is 97.98%, while those of CapsNet-Csl and MdpCaps-Csl reach 99.46% and 99.67%, respectively. This again proves that the class separable loss, built with the type and level information of different SAR images, helps extract discriminative information of different classes more efficiently, and plays an important role especially in reducing the confusion among different parent classes.

F. EXPERIMENTS ON FEWER TRAINING SAMPLES
To further verify the recognition performance of MdpCaps-Csl with fewer training samples, we randomly select 5, 10 and 20 images at 17° depression angle from each target type as training samples and use all images at 15° depression angle as test samples. Fig. 13 shows the experimental results of A-ConvNet, APCRLNet, CNN cascaded features and AdaBoost RoF (CCFAR) [69], CapsNet, CapsNet-Csl and MdpCaps-Csl. The accuracies of MdpCaps-Csl are 63.05%, 78.56% and 90.19%, respectively, much higher than those of the other five methods. Although training data with different rotation angles is lacking in this case [70], MdpCaps-Csl can still perform well.

V. CONCLUSIONS
Compared with CNN, CapsNet is very effective in capturing the pose information and spatial attributes of images and can learn well from a small amount of data. This approach resembles human reasoning and can effectively express the hierarchical relationships of internal knowledge in neural networks. In this study, a cosine similarity-based class separable loss is introduced as a regularization term of the original loss function, and a multi-dimensional parallel capsule module is used to improve CapsNet, which greatly enhances the feature extraction capability and robustness of the network. The recognition performance of the method is verified on the universal MSTAR database. The accuracy of MdpCaps-Csl is 99.79% using all training samples, higher than that of most pattern recognition-based traditional methods and deep learning-based methods. Even when only 10% of the training samples are used, the accuracy of MdpCaps-Csl reaches 97.73%, much higher than that of other methods. These results demonstrate that MdpCaps-Csl still performs well with fewer training samples, even though not all rotation angles are examined. The application of CapsNet to SAR image target recognition shows broad prospects, but it is still at an early stage and needs further exploration. In the future, we will focus on the network performance of deeper capsule structures.