Semi-Supervised Deep Transfer Learning Based on Adversarial Feature Learning for Label-Limited SAR Target Recognition

Data-driven convolutional neural networks (CNNs) have achieved great progress in synthetic aperture radar automatic target recognition (SAR-ATR) when trained on large-scale labeled samples. However, insufficient labeled SAR data often leads to over-fitting and significant performance degradation. To address this problem, this paper presents a semi-supervised transfer learning method based on generative adversarial networks (GANs). The discriminator of the GAN is redesigned as an encoder followed by a discriminative layer, so that it can learn feature representations of the input data in an unsupervised setting. Instead of training a deep neural network on the insufficient labeled data set directly, we first train a GAN on a variety of unlabeled samples to learn generic features of SAR images. The learned parameters are then reused to initialize the target network, transferring the generic knowledge to the specific SAR target recognition task. Lastly, the target network is fine-tuned on both labeled and unlabeled training samples with a semi-supervised loss function. We evaluate the proposed method on the MSTAR and OpenSARShip data sets with 80%, 60%, 40%, and 20% of the training set labeled, respectively. The results show that the proposed method achieves up to 23.58% accuracy improvement over the randomly initialized model.


I. INTRODUCTION
Deep convolutional neural networks (CNNs) have achieved progress in synthetic aperture radar (SAR) target recognition [1], owing to their robust ability to learn high-level features. However, this type of method requires a large labeled data set to train the model, whereas most SAR data sets are unlabeled or sparsely labeled, leading to severe overfitting when a deep CNN is trained. As manually labeling SAR data sets is laborious and time-consuming, improving the performance of CNNs trained on label-limited training sets has become a hot topic in this area.
Chen et al. [2] proposed an all-convolutional network (A-ConvNet) that replaces all fully connected layers with convolution layers. This method reduces overfitting by cutting model parameters and outperforms the general CNN in classifying the Moving and Stationary Target Acquisition and Recognition (MSTAR) benchmark dataset [3]. However, it remains a data-hungry method: its performance decreases significantly as the labeled training samples shrink. (The associate editor coordinating the review of this manuscript and approving it for publication was Nizam Uddin Ahamed.)
Transfer learning has emerged as a popular training paradigm for overcoming the label-limited difficulty: the model is initialized with parameters learned from a large data set [4]-[6] and then fine-tuned with small amounts of labeled samples from the target task. Reference [7] fine-tuned a CNN pre-trained on the CIFAR-10 optical data set for SAR image classification and achieved better performance than a randomly initialized CNN. Reference [8] pre-trained an unsupervised generative network, an Auto-Encoder (AE), to learn feature representations from large amounts of unlabeled SAR scene images; a classification layer was then added at the end of the encoder and the whole network was fine-tuned. This method enhanced the performance of a CNN trained on a small labeled training set, and the pre-training requires no additional labels. However, the above methods exploit only the labeled training samples in the target domain, while the information in the unlabeled target-domain samples is wasted.
Meanwhile, deep semi-supervised learning is a powerful framework that leverages unlabeled training samples to assist in training deep neural networks when labeled samples are insufficient [9]-[11]. A common implementation is to train a supervised CNN on the labeled samples to predict labels for the unlabeled ones; the predicted labels are then used to train the CNN on the unlabeled samples. Other methods employ GANs to achieve semi-supervised learning [12], which has been applied to enhance SAR target recognition under label limitation [13]. The essence of semi-supervised learning is to use unlabeled samples to estimate the data distribution. One critical assumption in many semi-supervised learning algorithms is the structure assumption: samples within the same structure (e.g., a cluster or a manifold) are likely to share the same label [14]. However, SAR images change significantly with azimuth and depression angles, so SAR images of the same class can be quite different, and some unlabeled training samples may not satisfy the structure assumption.
Given the above discussion, a semi-supervised transfer learning method is proposed to overcome the label-limited difficulty of SAR images. Since numerous unlabeled SAR images are available, an unsupervised generative network is first used to learn feature representations from the data, following the idea of [8]. We improve on this by learning features with a variant of the more advanced generative adversarial network (GAN) [15] instead of the Auto-Encoder, as shown in Figure 1. The learned parameters are then used to initialize a CNN, and an adaptive layer is added as the feature classifier. Lastly, the initialized model is fine-tuned on both labeled and unlabeled samples of the target task through a semi-supervised loss. The proposed method achieves great progress in classifying the MSTAR and OpenSARShip data sets when labeled samples are insufficient; even with only 20% of the training samples labeled, it still achieves over 90% accuracy on the MSTAR data set.
The rest of this letter is organized as follows. Section II introduces the implementation of our method. Section III describes the experiment settings. Section IV discusses the experimental results. Section V concludes the letter.

II. METHOD IMPLEMENTATION
According to the concept of transfer learning, the training data are split into a source domain and a target domain. In this paper, the source domain consists of a variety of unlabeled SAR data X_s that is different from, but related to, the data to be classified. The target domain is the specific SAR data set, which contains labeled samples {X_t, Y_t} and unlabeled samples X′_t awaiting classification. As illustrated in Figure 1, the overall architecture is divided into the source network and the target network. The source network is first pre-trained on both X_s and the target-domain images to learn feature representations. The target network is then initialized with the pre-trained parameters of the source network to transfer the learned features to the classification of the target domain. Lastly, the initialized model is fine-tuned on X′_t and {X_t, Y_t} through a semi-supervised loss function.

A. ADVERSARIAL FEATURE LEARNING
To learn features from the unlabeled samples, an unsupervised feature learning method based on the generative adversarial network (GAN) is adopted.
The typical GAN is known as a powerful framework for generating high-quality data through its adversarial training method. It learns a generative network G to ''fool'' the discriminative network D by mapping an arbitrary latent feature distribution C to data as close to the real data as possible. Meanwhile, the discriminative network D is trained to distinguish the generated data from the real data as reliably as possible. G and D check and balance each other by alternately optimizing the following two adversarial objective functions [15]:

max_D L_D = E_{x∼p_data}[log D(x)] + E_{c∼C}[log(1 − D(G(c)))],
min_G L_G = E_{c∼C}[log(1 − D(G(c)))],

where D produces a normalized scalar indicating the probability that the data is real. Numerous experiments have demonstrated that the adversarial training method helps GANs generate more realistic data and generalize better than the Auto-Encoder (AE). Nevertheless, the original GAN framework cannot map data to the latent feature space, since it lacks a component like the encoder in the AE. References [18] and [19] solve this problem by adding an additional encoder component to the original GAN, which requires more memory to store the whole network. Thus, we directly modify the general GAN architecture, decomposing the discriminator into an encoder E and a discriminative layer T, as shown in the source network of Figure 1. E maps the generated data G(c) and the real data x to feature vectors c′, where c′ has the same dimension as the latent feature vectors c ∼ C, and T discriminates whether c′ is extracted from the real data or the generated data. Since we expect to recover the data from c′, c′ should be highly correlated with c. Thus, we add a reconstruction loss term L_rec, which measures the correlation between c′ and c, to the original adversarial objective functions:

max_{E,T} L_AdvD = E_{x∼p_data}[log T(E(x))] + E_{c∼C}[log(1 − T(E(G(c))))] − λ1 L_rec,
min_G L_AdvG = E_{c∼C}[log(1 − T(E(G(c))))] + λ2 L_rec,

where λ1 and λ2 are hyper-parameters that control the weight of each term. In mathematics, the inner product is often used to reflect the angle between two vectors.
If and only if the angle between two vectors is zero does their normalized inner product reach its maximum; at this point, the two vectors are linearly dependent. Thus, we naturally use the inner product of the normalized vectors of c and c′ to represent the correlation between them. Given latent feature vectors of the form c = [c_1, c_2, · · · , c_n] and predicted feature vectors of the form c′ = [c′_1, c′_2, · · · , c′_n], the reconstruction loss can be defined as

L_rec = −E[ (c · c′) / (‖c‖₂ ‖c′‖₂) ],

where E[·] is the statistical average and ‖·‖₂ is the l2-norm. By this means, we can learn an encoder that maps the generated data G(c) into a feature space similar to C by minimizing L_rec. As a well-trained G can successfully confuse D, X_s, X_t, and G(c) are embedded into the same feature-space distribution by E. We can therefore assume that E extracts common features from X_s and X_t, which is of great significance for alleviating overfitting.
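As a concrete illustration, the normalized inner product used for L_rec is simply the cosine similarity between each latent code and its recovered counterpart, averaged over the batch and negated so that minimizing the loss maximizes the correlation. A minimal NumPy sketch (the small eps term is an illustrative numerical-stability addition):

```python
import numpy as np

def reconstruction_loss(c, c_hat, eps=1e-8):
    """Negative mean normalized inner product between latent codes c
    and recovered codes c_hat = E(G(c)).  Minimizing it drives each
    pair of vectors toward a zero angle (linear dependence)."""
    c = np.asarray(c, dtype=float)
    c_hat = np.asarray(c_hat, dtype=float)
    cos = np.sum(c * c_hat, axis=1) / (
        np.linalg.norm(c, axis=1) * np.linalg.norm(c_hat, axis=1) + eps)
    return -np.mean(cos)
```

Identical codes give a loss near −1 (maximum correlation), while orthogonal codes give 0.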

B. SEMI-SUPERVISED KNOWLEDGE TRANSFER
Current experimental results suggest that pre-trained models retain the generic features of the data even after many fine-tuning iterations [17]. Thus, we compose the target model of the pre-trained encoder E and a randomly initialized adaptive layer A to transfer the generic knowledge to the specific classification task, as shown in Figure 1. A outputs an n-dimensional vector normalized by the SoftMax function to represent the probability of each category, where n denotes the number of categories of the target task. We then fine-tune the whole model on both {X_t, Y_t} and X′_t to adapt it to the target task through a semi-supervised loss function:

L_semi = L_sup + λ(t) L_unsup,

where L_sup is the supervised loss, L_unsup is the unsupervised loss, and λ(t) is a coefficient balancing the two terms. L_sup can simply be defined as the cross-entropy:

L_sup = −E[ y · log A(E(x_t)) ],

where y is a one-hot encoding vector that records the real label of x_t. According to the ''cluster assumption'' [20], samples of the same cluster have a high probability of belonging to the same class, so the predicted category can be regarded as a soft label for the unlabeled samples. Meanwhile, the ''manifold assumption'' [14] suggests that unlabeled samples should make the data space of each class denser so that the decision function can better fit the data. Based on the above analysis, we use the information entropy as the unsupervised loss to regularize the unlabeled samples:

L_unsup = −E[ A(E(x′_t)) · log A(E(x′_t)) ],

where A(·) and E(·) denote the forward propagation functions of the adaptive layer and the encoder, respectively. The coefficient λ(t) should be tiny at the beginning to reduce the disturbance of the unlabeled samples to the supervised training, and should then increase gradually with the iterations to benefit from the unlabeled samples. Thus, λ(t) takes the form

λ(t) = λ3 / (1 + e^{−α(t−T)/T}),

where α and T are hyper-parameters controlling the horizontal scale and the translation of the function, respectively.
The upper bound λ3 is important for network performance: if it is overly large, it disturbs the training on labeled data; if it is too small, the semi-supervised learning cannot fully leverage the unlabeled data.
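The semi-supervised objective combines a cross-entropy term on labeled samples with an entropy term on unlabeled ones, weighted by a ramp-up coefficient. The NumPy sketch below illustrates this; the sigmoid ramp form and the default parameter values are illustrative assumptions (the text specifies only that λ(t) starts small and grows toward λ3, with α setting the horizontal scale and T the translation):

```python
import numpy as np

def cross_entropy(p, y_onehot, eps=1e-8):
    """Supervised loss on labeled samples: p holds SoftMax outputs
    (batch, n_classes), y_onehot the one-hot ground-truth labels."""
    return -np.mean(np.sum(y_onehot * np.log(p + eps), axis=1))

def entropy_loss(p, eps=1e-8):
    """Unsupervised loss: information entropy of SoftMax outputs on
    unlabeled samples; penalizing it sharpens their predictions."""
    return -np.mean(np.sum(p * np.log(p + eps), axis=1))

def ramp(t, lam3, alpha, T):
    """Assumed sigmoid ramp-up: near zero early, rising toward lam3
    around epoch T; alpha controls the steepness."""
    return lam3 / (1.0 + np.exp(-alpha * (t - T) / T))

def semi_supervised_loss(p_lab, y, p_unlab, t, lam3=0.5, alpha=5.0, T=100):
    """L_semi = L_sup + lambda(t) * L_unsup."""
    return cross_entropy(p_lab, y) + ramp(t, lam3, alpha, T) * entropy_loss(p_unlab)
```

At t = T the weight is exactly lam3/2; at t = 0 it is close to zero, so early training is dominated by the labeled samples.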

III. EXPERIMENT
This section describes the experiment settings used to verify the performance of our method.

A. NETWORK CONFIGURATIONS FOR EXPERIMENT
A DCGAN-based [21] architecture is adopted to build both the encoder E and the generator G of the GAN for the experiments. The hidden layers of both E and G therefore primarily consist of convolution kernels, whose forward propagation is

O_j(w, h) = Σ_i Σ_{m,n} F_i(w·s + m, h·s + n) · K_{i,j}(m, n) + b_j,

where O_j(w, h) denotes the output of the jth kernel at position (w, h), F_i is the ith input feature map, K_{i,j} is the convolution kernel connecting them, b_j is the bias, and s is the stride. For simplicity, the size of all convolution and transposed convolution kernels is set to 4 × 4, and the stride in each dimension is set to two pixels.
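As a concrete reference, the strided convolution forward pass above can be sketched in NumPy. This is a valid (unpadded) convolution; the padding scheme is not stated in the text, so it is omitted here:

```python
import numpy as np

def conv2d_forward(F, K, b, stride=2):
    """Valid strided convolution.  F is (C_in, H, W), K is
    (C_out, C_in, kH, kW), b is (C_out,).  Returns the output
    feature maps of shape (C_out, H_out, W_out)."""
    C_in, H, W = F.shape
    C_out, _, kH, kW = K.shape
    H_out = (H - kH) // stride + 1
    W_out = (W - kW) // stride + 1
    O = np.zeros((C_out, H_out, W_out))
    for j in range(C_out):
        for h in range(H_out):
            for w in range(W_out):
                # Inner sums over input channels and kernel positions
                patch = F[:, h*stride:h*stride+kH, w*stride:w*stride+kW]
                O[j, h, w] = np.sum(patch * K[j]) + b[j]
    return O
```

With 4 × 4 kernels and stride 2, each layer roughly halves the spatial resolution, which is why a 128 × 128 input can be reduced to a compact feature vector.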
The discriminative layer T is a densely connected layer producing a scalar activated by a sigmoid function. The target network consists of the encoder E and the adaptive layer A, whose outputs are normalized by the SoftMax function. The input size of both the source domain and the target domain is set to 128 × 128. The model architecture is provided in TABLE 1, where n_classes denotes the number of categories in the target SAR data set, and each convolution kernel is expressed as ''height × width @ channels''. The input codes c are 128-dimensional noise vectors whose elements are drawn from a Gaussian distribution c ∼ N(0, 1). The entire network contains 4,257,792 trainable parameters in total.
As GANs are relatively difficult to train, we exploit some empirical techniques to accelerate and stabilize training. First, LeakyReLU acts as the activation function of E to prevent activations from falling in the interval where the gradient is zero:

LeakyReLU(x) = x, if x > 0; βx, if x ≤ 0,

where β is a small positive slope. For comparison, the general ReLU function is

ReLU(x) = max(0, x).

Moreover, we add a BatchNormalization layer [22] between the convolution kernels and the activation functions to stabilize the training of GANs. The adaptive gradient-based Adam optimizer [23] is then utilized for training the model.
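The two stabilization tricks can be sketched in a few lines of NumPy. The negative slope β = 0.2 follows the common DCGAN setting and is an assumption, as the paper does not state its value:

```python
import numpy as np

def leaky_relu(x, beta=0.2):
    """LeakyReLU: keeps a small gradient beta for negative inputs,
    avoiding the zero-gradient region of plain ReLU."""
    return np.where(x > 0, x, beta * x)

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Inference-style batch normalization over the batch axis of a
    (batch, features) array, followed by an affine scale/shift, as
    applied between convolution and activation."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

After normalization, each feature has approximately zero mean and unit variance over the batch, which keeps the discriminator's gradients well scaled during adversarial training.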

B. SOURCE DOMAIN DATA SET
In this paper, we obtained large-scale unlabeled SAR images collected by the TerraSAR-X, Sentinel-1, and Gaofen-3 remote sensing satellites. These images compose the source domain data set of the transfer learning. To fit the input size, the original images are cropped into 128 × 128-pixel slices with overlap. Slices containing several strong scattering points are preferentially selected to form the source-domain training set, as they are more similar to SAR target images. The source domain contains more than 20,000 image slices covering multiple scenes and resolutions. Some samples of the source training set are presented in Figure 2.
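The slicing procedure might be sketched as follows. The stride value and the peak-amplitude rule for ''contains strong scattering points'' are illustrative assumptions, since the text states neither the overlap amount nor the selection criterion:

```python
import numpy as np

def crop_slices(image, size=128, stride=64, energy_thresh=None):
    """Cut a large SAR scene into overlapping size x size slices
    (stride < size produces the overlap).  If energy_thresh is given,
    keep only slices whose peak amplitude exceeds it, as a crude proxy
    for selecting slices with strong scattering points."""
    H, W = image.shape
    slices = []
    for top in range(0, H - size + 1, stride):
        for left in range(0, W - size + 1, stride):
            s = image[top:top+size, left:left+size]
            if energy_thresh is None or s.max() > energy_thresh:
                slices.append(s)
    return slices
```

A 256 × 256 scene with a 64-pixel stride, for example, yields a 3 × 3 grid of nine overlapping slices.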

C. TEST IN THE MSTAR DATA SET
One of the accuracy tests for the target task is performed on data from the airborne Moving and Stationary Target Acquisition and Recognition (MSTAR) system. To comprehensively assess performance, the proposed algorithm is tested under both standard operating conditions (SOC) and extended operating conditions (EOC). In the SOC setup, the experimental data set contains ten types of objects: 2S1, ZSU234, BMP2, BRDM2, BTR60, BTR70, D7, ZIL131, T62, and T72. Some samples of the MSTAR data set and their corresponding optical images are presented in Figure 3. The same class of targets in the training and test sets has the same serial number but differs in azimuth and depression angle. Training images are obtained at a 17° depression angle and test images at 15°. Each sample in the MSTAR data set consists of 128 × 128 pixels. SAR images are easily affected by the variance of depression angles. In the EOC setup, the proposed method is therefore assessed under large depression-angle variations. The open-source MSTAR data set provides four classes of targets (2S1, BRDM-2, T-72, and ZSU-234) that contain samples obtained at a 30° depression angle. The proposed method is tested on these samples, with the training set chosen from the corresponding four targets in the SOC training set.
Some existing methods use extensive data augmentation to enlarge the training set and crop the test set into small patches to achieve high performance. In the experiments of this paper, the raw MSTAR images are fed directly into the target network without any preprocessing or data augmentation to demonstrate the robustness of the proposed method. The proportions of labeled samples in the training set are set to 80%, 60%, 40%, and 20%, respectively.

D. TEST IN THE OPENSARSHIP DATA SET
We also assess the proposed method on the OpenSARShip data set [24]. This data set provides several types of ship chips collected by the Sentinel-1A/B satellites. Compared with MSTAR, it is obtained directly from natural scenes, whose more complex environment poses greater challenges to recognition algorithms. In the evaluation, four major classes of ships (cargo ship, fishing ship, tanker, and tug) are chosen, with 500 SAR ship chips collected for each class. As the fishing ship and tug categories contain only about 120 original samples each, their chips are translated and rotated to create additional data. The data set is then randomly split into training and test sets at a ratio of 3:2. As the original chips are of different sizes, they are resized to 128 × 128 pixels using bilinear interpolation. Figure 4 shows some samples of the four classes of ships in the OpenSARShip data set.
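The chip preparation described above (bilinear resize to 128 × 128 and a random 3:2 split) might be sketched as follows; the minimal bilinear routine is a stand-in for a library resize call, and the fixed seed is an illustrative choice:

```python
import numpy as np

def resize_bilinear(img, out_h=128, out_w=128):
    """Bilinear resize of a 2-D chip to out_h x out_w pixels."""
    H, W = img.shape
    ys = np.linspace(0, H - 1, out_h)
    xs = np.linspace(0, W - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # Blend the four neighbouring pixels of each output location
    a = img[np.ix_(y0, x0)]; b = img[np.ix_(y0, x1)]
    c = img[np.ix_(y1, x0)]; d = img[np.ix_(y1, x1)]
    return (a * (1 - wy) * (1 - wx) + b * (1 - wy) * wx
            + c * wy * (1 - wx) + d * wy * wx)

def split_train_test(samples, ratio=0.6, seed=0):
    """Random 3:2 split (60% training / 40% test)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    cut = int(len(samples) * ratio)
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]
```

In practice a library routine (e.g., an OpenCV or PIL resize) would replace the hand-rolled interpolation; the sketch only makes the arithmetic explicit.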

E. OTHER DETAILS
To elucidate the improvement of our algorithm, four controlled trials are set up: supervised learning (SL), semi-supervised learning (SSL), supervised transfer learning (STL), and our semi-supervised transfer learning (SSTL). The specific implementations are as follows:
1) Supervised learning (SL): training the randomly initialized target network using only labeled samples through the supervised loss L_sup.
2) Semi-supervised learning (SSL): training the randomly initialized target network using both labeled and unlabeled samples through the semi-supervised loss L_semi.
3) Supervised transfer learning (STL): training the target network initialized by the pre-trained source network using only labeled samples through the supervised loss L_sup.
4) Semi-supervised transfer learning (SSTL): training the target network initialized by the pre-trained source network using both labeled and unlabeled samples through the semi-supervised loss L_semi.
Furthermore, we also compare the performance of our method with state-of-the-art semi-supervised learning and transfer learning methods for SAR target recognition, including transfer learning based on the Auto-Encoder (TL-AE) [8] and semi-supervised learning based on GAN (SSGAN) [13]. These methods were originally implemented with different network architectures and tested on images cropped to different sizes. To make the comparison fair, we do not use their original networks but evaluate all methods with the same target network architecture as TABLE 1 presents, with all input images regulated to 128 × 128 pixels.
We set the learning rate for both L_AdvD and L_AdvG to 1e-4 for adversarial feature learning, and to 1e-5 for L_semi during generic knowledge transfer. The mini-batch size of each iteration is 64, owing to memory limitations. λ1 and λ2 can be adjusted while training the GAN: they should first ensure the convergence of L_rec, while remaining as small as possible to avoid slowing down GAN training. In this paper, they are set to λ1 = λ2 = 2. The hyper-parameters T and α primarily provide a smooth transition between supervised and semi-supervised learning to reduce fluctuation during training; we adjust them according to the change of L_semi and finally set T = 100 and α = 5. λ3 is the critical hyper-parameter for network performance and is related to the labeled rate. It is determined by cross-validation for each labeled rate, as shown in TABLE 2.

IV. RESULTS
The experimental results are presented in two parts. In the first part, the reliability of the features learned from the source domain is demonstrated by data reconstruction. In the second part, we examine the improvement of the proposed method on the MSTAR and OpenSARShip data sets, respectively.

A. RECONSTRUCT IMAGES WITH LEARNED FEATURES
A common way to verify the feasibility of features extracted from data is to check whether the data can be faithfully reconstructed from them. We feed the SAR image in Figure 5a into the encoder to produce a feature vector and feed that vector into the generator to reconstruct the image. The reconstructed image, given in Figure 5c, is almost identical to the real image. The encoder is also used to produce a feature vector for another image, Figure 5b. Images can then be generated from linear interpolations between these two vectors, as shown in Figures 5d-5j. The generated images represent the transition between Figure 5a and Figure 5b, which demonstrates the linear continuity of the extracted features.
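The interpolation experiment reduces to a few lines: the two encoder outputs are blended linearly, and each blended code is passed to the generator. The helper below only builds the blended codes; seven steps matches the seven transition panels:

```python
import numpy as np

def interpolate_codes(c_a, c_b, steps=7):
    """Linear interpolation between two latent feature vectors.
    Returns a (steps, dim) array whose rows move from c_a to c_b;
    feeding each row to the generator yields the transition images."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return (1 - alphas) * c_a[None, :] + alphas * c_b[None, :]
```

The first and last rows are exactly the two input codes, so the generated sequence is anchored at the two reconstructed images.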

B. ACCURACY ON MSTAR
In this subsection, the classification accuracies of semi-supervised transfer learning (SSTL) and the three comparison models are given. To ensure that all methods are adequately trained, we keep training the models until their accuracy on the labeled training samples stabilizes at 100% and the loss values reach dynamic equilibrium. Although the training accuracy is stable, the test accuracy still fluctuates during training owing to the different depression angles of the training and test sets [25]. To mitigate the impact of this fluctuation, each result is recorded as a mean and standard deviation pair (µ ± std) over the last twenty epochs. TABLE 3-6 present the SOC test accuracies of the four methods trained with 80%, 60%, 40%, and 20% of the training samples labeled, respectively. For intuition, Figure 6 shows a histogram of the test accuracy of each experiment.
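Summarizing the fluctuating test accuracy as µ ± std over the last twenty epochs, as recorded in the result tables, can be done with a small helper:

```python
import numpy as np

def report_accuracy(acc_history, window=20):
    """Return (mean, std) of the test accuracy over the last `window`
    epochs, smoothing out the depression-angle-induced fluctuation."""
    tail = np.asarray(acc_history[-window:], dtype=float)
    return tail.mean(), tail.std()
```

A perfectly stable run yields a zero standard deviation, so a large std flags a model whose test accuracy is still oscillating.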
The experimental results suggest that semi-supervised transfer learning (SSTL) outperforms the other three methods in all four experiments, and its superiority becomes more noticeable as the labeled training set shrinks. Even when only 20% of the training samples are labeled, our algorithm still achieves an accuracy of nearly 90%. The performance enhancement exceeds the sum of those achieved by semi-supervised learning (SSL) and supervised transfer learning (STL) individually.
The test accuracies for the EOC setup are listed in TABLE 7-10, and Figure 7 presents the average test accuracy of each EOC experiment. The accuracy of SSTL is again enhanced to some extent compared with the other methods. In this set of experiments, however, STL always outperforms SSL. This indicates that when the training set and the test set differ considerably, the performance enhancement of transfer learning is more significant than that of semi-supervised learning.

C. ACCURACY ON OPENSARSHIP
The test accuracies on the OpenSARShip data set are listed in TABLE 11-14, and Figure 8 visualizes the results of the four methods. From these results, it can be concluded that SSTL always achieves the best accuracy. In this study, the transfer learning method (SSTL) significantly outperforms the semi-supervised learning method (SSL), probably because the OpenSARShip data are obtained from the Sentinel-1 satellite, which also provides part of the source domain data. TABLE 15 shows the comparison between the proposed method and the state-of-the-art SSGAN and TL-AE; here SSL and STL are the semi-supervised learning and transfer learning methods proposed above. Under the same experimental settings (same target network architecture, input size, and source domain data), the accuracies of SSGAN and SSL show no significant difference. In most cases, the supervised transfer learning (STL) proposed in this paper is slightly better than TL-AE, presumably because the GAN can extract more representative features than the Auto-Encoder (AE). Furthermore, the combination of semi-supervised learning and transfer learning (SSTL) displays noticeable accuracy enhancements.

V. CONCLUSION
In this paper, a semi-supervised transfer learning method based on adversarial feature learning is proposed to address the limited-label difficulty in SAR-ATR. The method exhibits much higher performance on the MSTAR benchmark and the OpenSARShip data set than the randomly initialized model. Moreover, the proposed method outperforms methods based only on semi-supervised learning or only on transfer learning, and this advantage becomes more significant as the labeled training samples decrease.