Cross-Domain Few-Shot Learning Between Different Imaging Modals for Fine-Grained Target Recognition

Fine-grained target recognition in the synthetic aperture radar (SAR) or infrared imaging modality is an open problem in application scenarios where training samples are scarce. Transferring common features from visible optical (VO) samples is effective when SAR (infrared) samples are scarce. However, for fine-grained target recognition, transferring common features faces two issues. First, common features can be divided into fine-grained features and coarse-grained features; for the fine-grained target recognition task in the few-shot case, how to transfer fine-grained common features needs to be considered. Second, in the SAR (infrared) imaging modality, some samples carry much noise because of the limitation of the imaging mechanism, masking the subtle differences needed for fine-grained target recognition and making such fine-grained common features difficult to transfer, especially when training samples are scarce. To handle these issues, the following solutions are proposed in this article. First, the common-feature-contrastive loss is proposed to transfer fine-grained common features from VO samples. Second, based on the modeling of heteroscedastic uncertainty, the training strategy of sample quality evaluation is proposed to emphasize training samples with less noise. Experiments on three datasets, including MSTAR, P-openSARship, and P-VAIS, demonstrate the superiority of the proposed algorithm over the baseline and other popular cross-domain few-shot learning algorithms.


I. INTRODUCTION
Because of their robustness to complex weather conditions and illumination variation, synthetic aperture radar (SAR) and infrared imaging modalities play important roles in many applications, such as military reconnaissance and early warning. Among these applications, distinguishing targets of different subclasses is important for decision-making, which brings the fine-grained target recognition task. However, it is not easy to obtain sufficient SAR (infrared) samples of the concerned subclasses in such application scenarios, bringing the few-shot learning problem. Therefore, fine-grained target recognition in the few-shot case is an essential technology in many applications of the SAR (infrared) imaging modality. Different from traditional target recognition, fine-grained target recognition aims to distinguish subclasses of targets [1], e.g., the cargo and tug in Fig. 1. The core problem in fine-grained target recognition is to learn a fine-grained feature representation that captures the subtle differences among subclasses [2]. Further, compared with visible optical (VO) targets, which contain rich discriminative texture features, such fine-grained features are difficult to extract from SAR (infrared) targets. For example, the patterns or colors of VO targets vary across subclasses, but for SAR (infrared) targets, such features are similar among different subclasses. To learn such fine-grained features, contrastive [3] or triplet losses [4] are used in existing algorithms for the SAR (infrared) imaging modality [5], [6], [7], [8]. However, these algorithms are prone to overfitting in the few-shot case. For few-shot learning in the SAR (infrared) imaging modality, some algorithms are based on data augmentation [9] or use the metalearning training strategy [10], [11].
However, such algorithms need sufficient SAR (infrared) samples of related classes, which limits their application scenarios. Recently, Tai et al. [12] proposed to transfer "common features" between the SAR and VO imaging modalities from a complex source network to a small target network to enhance the target network's ability to extract common features, so that the requirement for SAR (infrared) samples is reduced. Common features are the features shared between the VO and SAR (infrared) samples, such as the shape feature of a ship in both modalities [12], and they enable the network to recognize SAR (infrared) samples. However, when transferring the common features, they did not consider the difference of common features among different subclasses, so the transferred common features cannot represent the subtle differences among fine-grained classes, decreasing performance on the fine-grained target recognition task.
Inspired by the work of Tai et al. [12], which enhanced the target network's ability to extract common features by transferring them from the source network with connection-free attention modules, this article deals with fine-grained target recognition in the SAR (infrared) imaging modality in the few-shot case by considering the following issues.
1) Learning common features that are discriminative among different fine-grained classes is critical to the fine-grained target recognition of the SAR (infrared) imaging modality in the few-shot case. According to previous works [5], [8], the contrastive loss is effective for learning fine-grained features by pushing the features of different subclasses away from each other. In the normal fine-grained recognition task, fine-grained features from different subclass samples are expected to be as far apart as possible, so the margin in the contrastive loss, which represents the expected distance between different subclasses, is simply set to a relatively large value. However, for the fine-grained target recognition task of the SAR (infrared) imaging modality in the few-shot case, fine-grained common features are expected to be transferred, so the margin should be neither too large nor too small. The common features learned in [12] are located in the middle layers of the network, where the transfer weight obtains a higher value. This phenomenon shows that common features exist in a specific area of the embedding space. If the margin is set too large, the fine-grained common features of different subclasses will be pushed too far apart to remain in that specific area of the embedding space, losing the ability to recognize SAR (infrared) targets. Conversely, if the margin is too small, most features cannot contribute to the loss, making fine-grained common features hard to transfer. Therefore, obtaining a suitable "margin" is the critical problem in transferring fine-grained common features.

2) In the SAR (infrared) imaging modality, some samples carry much noise because of the limitation of the imaging mechanism (see Fig. 1), which may mask the critical parts of the SAR (infrared) images that can be used to distinguish targets of different subclasses, making fine-grained common features hard to transfer, especially in the few-shot case.
Besides, noise in samples makes effective features more difficult to capture and decreases the predictive power of the convolutional neural network (CNN) [13]. Therefore, the noise problem needs to be considered when transferring fine-grained common features. For the first issue, instead of setting a fixed margin based on experience, we define the expected value of the margin as the common-feature boundary (CF-boundary) to keep the classification ability of the common features. Further, based on the CF-boundary, the CF-contrastive loss function is proposed to avoid choosing an unsuitable "margin" manually. Specifically, considering that common features between different imaging modalities could be more dissimilar than those within the same imaging modality, we first obtain the CF-boundary by calculating the mean distance of the corresponding features between VO and SAR (infrared) samples. Then, the "margin" is learned by minimizing its distance to the CF-boundary (see Fig. 2). Besides, additional connection-free attention modules, namely the push-attention modules, are constructed to connect each feature pair of the source and target networks when the inputs are from different subclasses. The original connection-free attention modules in [12] are renamed pull-attention modules for distinction. When the inputs of both the source and target networks are from the same subclass, the pull-attention module is learned to transfer common features. When the inputs are from different subclasses, the push-attention module is learned to make such common features discriminative between different fine-grained classes, so that the fine-grained common features are transferred. In general, the CF-contrastive loss is proposed to learn to transfer fine-grained common features for fine-grained target recognition in the few-shot case.
Note that the CF-boundary, which represents the expected distance between the VO and SAR (infrared) samples, provides additional supervision information for learning the fine-grained common features.
The second issue can be overcome by decreasing the contribution of samples containing much noise to the network training. To this end, the training strategy of "sample quality evaluation" (SQE) is proposed, which regards samples with much noise as low-quality samples and focuses on high-quality samples during training. Specifically, existing works [14], [15], [16] demonstrate that heteroscedastic uncertainty represents the noise inherent in samples, such as sensor noise or motion noise. Therefore, in SQE, the noise of training samples is measured by modeling the heteroscedastic uncertainty, and a lower learning rate is then given to the training samples with higher heteroscedastic uncertainty. Generally, VO samples are used to pretrain both networks to obtain the ability to extract common features, and SAR (infrared) samples are input into both networks to extract common features [12]. Then, the fine-grained common features are transferred to suitable positions of the target network to enhance its fine-grained common feature extraction. Therefore, both the VO and SAR samples are input into both the source and target networks. The SQE training strategy is proposed to transfer fine-grained common features by mitigating the influence of noise.

[Fig. 3. Overview of the proposed algorithm. Improvements compared with the baseline are represented in the red dotted box. 1) Push-attention modules and the CF-contrastive loss are introduced to learn fine-grained common features. 2) Based on the modeling of heteroscedastic uncertainty, the SQE training strategy focuses on training samples with less noise.]
In this article, two novel designs are proposed (contents in the red dotted box of Fig. 3): 1) both pull-attention and push-attention modules are established to learn to transfer fine-grained common features with the CF-contrastive loss; and 2) based on the modeling of heteroscedastic uncertainty, the training strategy of SQE is proposed.
The contributions are as follows.
1) The CF-boundary is proposed to represent the expected distance of fine-grained common features between different imaging modalities.
2) The CF-contrastive loss is proposed for learning to transfer fine-grained common features, in which the "margin" is learned based on the proposed CF-boundary.
3) Based on the modeling of heteroscedastic uncertainty, the training strategy of SQE is proposed to focus on training samples with less noise, decreasing the difficulty of transferring fine-grained feature representations.

II. RELATED WORK

A. Cross-Domain Few-Shot Learning (CDFSL)
CDFSL was first proposed by FWT [17], which learns representations by using featurewise transforms. In CDFSL, the auxiliary samples and the novel samples follow different distributions. Wang et al. [18] proposed CDFSL-ATA to alleviate the inductive bias problem of metalearning-based algorithms through task augmentation. Fu et al. [19] proposed a meta-FDMixup network that utilizes a few labeled target-domain samples to guide the model learning. However, these algorithms are limited to the visible imaging modality and are not suitable for the SAR (infrared) imaging modalities.

B. Fine-Grained Recognition for SAR (Infrared) Images
Metric-based algorithms are mainly used to cope with this problem. Lin et al. [20] proposed a task-driven dictionary learning framework based on SAR-HOG features. Lang et al. [21] proposed a wrapper feature selection framework to learn a joint feature. Wang et al. [22] utilized the geometry and scattering characteristics of the target. Margarit and Tabasco [23] used a fuzzy logic decision rule to combine scattering and geometric features to categorize ships. Xu et al. [24] proposed distribution shift metric learning by adding a distribution regularization term. He et al. [25] designed a task-specific backbone network to extract powerful deep features with the triplet loss [26]. To train the network well, they rearranged the OpenSARShip [27] dataset to acquire enough training samples.

C. Uncertainty in Deep Learning
Uncertainty can be captured with a Bayesian network [14]. Specifically, uncertainty consists of two main types: epistemic uncertainty and aleatoric uncertainty [16]. Epistemic uncertainty captures the unknown information from unseen data and can be explained away with more training data. On the other hand, aleatoric uncertainty captures uncertainty that cannot be explained by collecting more data. Further, aleatoric uncertainty consists of two subclasses: homoscedastic uncertainty and heteroscedastic uncertainty. Homoscedastic uncertainty does not depend on the input data; its value stays constant for inputs within the same task but varies between different tasks. Heteroscedastic uncertainty depends on the input and can be predicted as a model output. Recently, uncertainty has been used in many tasks, such as pixelwise depth regression [28], multitask learning [29], continual learning [30], face recognition [31], 3-D face reconstruction [32], image deconvolution [33], and autonomous driving [34]. In such tasks, uncertainty benefits the network from various aspects. However, the potential of uncertainty remains to be tapped in few-shot learning.

III. REVISITING BASELINE
We first review the baseline [12] for a complete introduction of the proposed algorithm. In general, similar to Fig. 3, the baseline transfers common features with pull-attention modules. A CNN with eight ResNet blocks [35] is used as the source network, and a Bayesian-CNN with four Bayesian ResNet blocks is used as the target network. We introduce three major contents of the baseline in the following sections.

A. Transfer Common Features With Pull-Attention Modules
In the baseline [12], S and T denote the source and target networks, respectively. Common features are transferred from S to T by matching each feature between them with learnable pull-attention modules.
Specifically, let x^t be the input image of the SAR (infrared) imaging modality and y^t be the corresponding class label. Then, x^t is fed into S and T to obtain features. Let S_m(x^t) be the mth-layer feature of S and T_n(x^t) be the nth-layer feature of T. The squared F-norm of the distance between S_m(x^t) and T_n(x^t) is minimized to transfer features. Meanwhile, the baseline weights the feature connection (S_m, T_n) and its cth channel by λ^{m,n} and w_c^{m,n}, respectively. Finally, the objective function is given as follows:

L(θ, φ_1) = Σ_{(m,n)∈Q} λ^{m,n} Σ_{c=1}^{C} f(F^{m,n}(x^t), w_c^{m,n}),  F^{m,n}(x^t) = (S_m(x^t) − z_θ^{m,n}(T_n^θ(x^t)))²   (1)

where Q represents all the feature connections, z^{m,n} is a 1 × 1 convolution used to match the channels of the corresponding feature connection (S_m, T_n), and f(a, b) performs pixelwise multiplication between each value in the cth channel of a and the corresponding value of b. Note that the dimension of F^{m,n}(x^t) is MN × C, where M and N are the width and height of T_n(x^t), respectively; w_c^{m,n} is a 1 × C vector and λ^{m,n} is a scalar, both produced by the pull-attention module parameterized by φ_1; and C is the number of channels of S_m(x^t). Note that θ denotes the parameters of both T and z, written T_θ and z_θ.
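To make the per-channel weighting in (1) concrete, the following sketch computes a weighted transfer loss for one feature connection. All names (`pull_loss`, the list-of-channels representation) are illustrative assumptions, and the 1 × 1 channel-matching convolution z is assumed to have been applied to the target feature already.

```python
def pull_loss(S_m, T_n, lam, w):
    """Hypothetical sketch of the per-connection pull objective:
    a channel-weighted squared distance between the source feature
    S_m and the (channel-matched) target feature T_n.
    S_m, T_n: lists of C channels, each a flat list of M*N pixel values;
    w: per-channel attention weights; lam: connection weight."""
    total = 0.0
    for c in range(len(S_m)):
        # squared F-norm distance for channel c
        d = sum((s - t) ** 2 for s, t in zip(S_m[c], T_n[c]))
        total += w[c] * d
    return lam * total

# toy 2-channel, 4-pixel features
S = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
T = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = pull_loss(S, T, lam=0.5, w=[1.0, 1.0])
assert abs(loss - 1.0) < 1e-9
```

Summing this quantity over all connections in Q, each with its own λ^{m,n} and w^{m,n}, gives the structure of (1).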

B. Accurately Updating Important Parameters
To find important parameters, a Bayesian-CNN [36] is constructed as the target network. Specifically, each parameter of the target network is initialized with a Gaussian distribution, with θ_i ~ N(μ_i, σ_i²) denoting the ith parameter. The importance of a parameter is taken to be inversely proportional to its standard deviation. Further, during the training process, the learning rate β_i of θ_i is set based on its standard deviation as

β_i = γ_i β   (2)

where γ_i = 1/log(1 + e^{σ_i}) and β is the initial learning rate.
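The rule above can be sketched directly: parameters with smaller standard deviation (deemed more important) receive larger updates. The function name `auip_lr` is illustrative.

```python
import math

def auip_lr(beta, sigmas):
    """Per-parameter learning rates beta_i = beta / log(1 + e^{sigma_i}):
    a smaller standard deviation (a more important parameter, per the
    text) yields a larger learning rate. A sketch of the described rule."""
    return [beta / math.log(1.0 + math.exp(s)) for s in sigmas]

lrs = auip_lr(0.1, [0.05, 0.5, 2.0])
# smaller sigma -> larger learning rate
assert lrs[0] > lrs[1] > lrs[2]
```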

C. Objective Function
The Bayesian network is learned by variational inference. Specifically, the Kullback-Leibler (KL) divergence between the parameters' posterior and a designed distribution q(θ|ι) parameterized by ι is minimized

L(ι) = KL[q(θ|ι) || p(θ|D)]   (3)

Further, N Monte Carlo samples [36] are used to approximate (3)

L(ι) ≈ Σ_{j=1}^{N} [log q(θ_j|ι) − log p(θ_j) − log p(D|θ_j)]   (4)

where j denotes the jth sampling operation. Finally, the objective function of the baseline combines the transfer loss (1) and the variational objective (4).

IV. PROPOSED ALGORITHM
A. Overview

Fig. 3 visualizes the overall structure of our algorithm. Compared with the baseline, the proposed algorithm focuses on the fine-grained target recognition task through the following operations (contents in the red dotted box): 1) push-attention modules and the CF-contrastive loss are proposed to learn fine-grained common features; 2) an additional fully connected layer is constructed to output the heteroscedastic uncertainty; and 3) based on the modeling of heteroscedastic uncertainty, the SQE training strategy is proposed to focus on training samples with less noise.

B. Learn Fine-Grained Common Features
For the fine-grained target recognition task of the SAR (infrared) imaging modality in the few-shot case, common features that are discriminative among different subclasses need to be transferred. Therefore, the CF-contrastive loss and push-attention modules are proposed to transfer fine-grained common features.
Specifically, when the inputs of the source and target networks are both from the pth subclass, (1) is minimized (the same as in the baseline). Further, when the inputs of the two networks are from the pth and qth subclasses, respectively, we push T_n^θ(x_q^t) away from S_m(x_p^t) to learn discriminative features by maximizing the following objective function:

L(θ, φ_2) = Σ_{(m,n)∈Q} λ_diff^{m,n} Σ_{c=1}^{C} f(F^{m,n}, w_{c,diff}^{m,n})   (5)

where λ_diff^{m,n} and w_{c,diff}^{m,n} have the same form as in (1) but are calculated by the push-attention modules, which have the same structure as the pull-attention modules but are parameterized by φ_2, F^{m,n} = (S_m(x_p^t) − z_θ^{m,n}(T_n^θ(x_q^t)))², and f is the same as in (1).
For a clear description, the per-connection parts of (1) and (5) are renamed as

L_same^{m,n}(θ, φ_1) = λ^{m,n} Σ_{c=1}^{C} f((S_m(x_p^t) − z_θ^{m,n}(T_n^θ(x̃_p^t)))², w_c^{m,n})   (6)

L_diff^{m,n}(θ, φ_2) = λ_diff^{m,n} Σ_{c=1}^{C} f((S_m(x_p^t) − z_θ^{m,n}(T_n^θ(x_q^t)))², w_{c,diff}^{m,n})   (7)

where x_p^t and x̃_p^t are two different samples of the pth subclass. Further, the contrastive loss is introduced to minimize L_same^{m,n} and maximize L_diff^{m,n} at the same time:

L_con^{m,n} = label · L_same^{m,n} + (1 − label) · max(margin − L_diff^{m,n}, 0)   (8)

where margin is the hyperparameter that controls the expected distance between S_m(x_p^t) and T_n^θ(x_q^t), and label indicates whether the inputs of the source and target networks belong to the same subclass. The value of label is 1 when the inputs belong to the same subclass; otherwise, it is 0.
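The label-switched combination above can be sketched in a few lines; the name `cf_contrastive` and the scalar inputs are illustrative stand-ins for the per-connection losses.

```python
def cf_contrastive(L_same, L_diff, margin, same_class):
    """Contrastive combination described in the text: pull features of
    the same subclass together (minimize L_same), and push different
    subclasses apart until their distance exceeds `margin`. A sketch
    with illustrative names."""
    label = 1 if same_class else 0
    return label * L_same + (1 - label) * max(margin - L_diff, 0.0)

# same subclass: only the pull term contributes
assert cf_contrastive(0.3, 0.0, 1.0, same_class=True) == 0.3
# different subclasses, still closer than the margin: push term active
assert cf_contrastive(0.0, 0.4, 1.0, same_class=False) == 0.6
# different subclasses already beyond the margin: no further push
assert cf_contrastive(0.0, 1.5, 1.0, same_class=False) == 0.0
```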

C. CF-Contrastive Loss
In the contrastive loss, the hyperparameter margin is difficult to choose manually. This is because a margin that is too small lets few common features contribute to the loss, and conversely, a margin that is too large pushes the target network's features too far to keep the ability to classify SAR (infrared) targets. Therefore, the CF-contrastive loss is proposed to avoid choosing an unsuitable "margin" manually.
Specifically, we give each feature pair a corresponding learnable margin υ^{m,n}, where (m, n) denotes the feature pair (S_m(x_p^t), T_n(x_q^t)). Thus, (8) becomes

L_con^{m,n} = label · L_same^{m,n} + (1 − label) · max(υ^{m,n} − L_diff^{m,n}, 0).   (9)

However, υ^{m,n} may become zero or even negative during learning. Thus, the CF-boundary is proposed to constrain the learning of the margin; it is designed to represent the expected distance by which a common feature should be pushed away from another one. Considering that common features between different imaging modalities could be more dissimilar than those within the same imaging modality, we obtain the CF-boundary by calculating the mean distance of the corresponding features between VO and SAR (infrared) samples (see Fig. 4). Based on the CF-boundary, the following objective function is given to learn the margin:

L_margin(υ) = Σ_{(m,n)∈Q} (υ^{m,n} − C̄F_m)²   (10)

where C̄F_m is the mean value of CF_m after 200 sampling times. Finally, based on (9) and (10), the CF-contrastive loss for learning fine-grained common features is proposed as

L_CF(θ, φ_1, φ_2, υ) = Σ_{(m,n)∈Q} L_con^{m,n} + L_margin(υ)   (11)

where Q represents all the feature connections.
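A minimal sketch of the boundary and margin objective, under the assumption that the CF-boundary is the plain mean of sampled cross-modality feature distances and that the margin is pulled toward it by a squared penalty (function names are illustrative):

```python
def cf_boundary(dists):
    """CF-boundary: mean distance between corresponding VO and SAR
    (infrared) features over many sampled pairs (200 in the article)."""
    return sum(dists) / len(dists)

def margin_loss(upsilon, cf_bar):
    """Squared penalty drawing the learnable margin upsilon^{m,n} toward
    the CF-boundary, which also keeps it from collapsing to zero or
    going negative. A reconstruction consistent with the text."""
    return (upsilon - cf_bar) ** 2

cf_bar = cf_boundary([0.8, 1.0, 1.2])
assert abs(cf_bar - 1.0) < 1e-9
# an under-sized margin (initial value 0.25 in the article) is penalized
assert margin_loss(0.25, cf_bar) == 0.5625
```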

D. Training Strategy of "SQE"
Existing research [13] shows that noise makes effective features more difficult to extract. However, some SAR (infrared) samples carry much noise because of the limitation of the imaging mechanism; they are regarded as "low-quality" samples in this article. Such low-quality samples make the network learn more information about the noise, reducing its ability to distinguish targets of different subclasses. Therefore, we aim to reduce the contribution of such samples to network training. To achieve this aim, the training strategy of "SQE" is proposed. In SQE, the heteroscedastic uncertainty is modeled in the target network to measure the quality of training samples because it captures the noise inherent in the samples [14]. Then, a lower learning rate is given to the samples with higher heteroscedastic uncertainty.
Specifically, heteroscedastic uncertainty can be regarded as an output of the network [16]; thus, an additional fully connected layer of the target network is constructed to output it (see Fig. 3): [ŷ, τ] = T_θ(x_q^t), where ŷ is the output class and τ is the heteroscedastic uncertainty of the input x_q^t. Besides, to ensure that the heteroscedastic uncertainty is a small positive number, we map τ to [0, 2] by τ' = |2 · arctan(τ)/π + 1|, namely the constrained heteroscedastic uncertainty (CHU). Further, considering that the parameters are updated with a minibatch of samples, we make the learning rates of the parameters inversely proportional to the mean value of the CHU in a minibatch:

β' = β / τ̄   (12)

where τ̄ is the mean CHU of the training samples in a minibatch. Besides, considering the AUIP training strategy of the baseline, the learning rate of the target Bayesian-CNN parameters is scaled according to both the heteroscedastic uncertainty of the minibatch training samples and the parameters' standard deviations. Specifically, in a minibatch, the learning rate β_i of the ith parameter θ_i is given as

β_i = β / (τ̄ log(1 + e^{σ_i}))   (13)

where β is the initial learning rate and σ_i is the standard deviation of the ith parameter.
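The CHU mapping and the combined learning-rate rule can be sketched as follows; the exact way the minibatch CHU and the AUIP factor are combined is a reconstruction from the text, and the function names are illustrative.

```python
import math

def chu(tau):
    """Constrained heteroscedastic uncertainty: maps the raw network
    output tau into (0, 2) via tau' = |2*arctan(tau)/pi + 1|."""
    return abs(2.0 * math.atan(tau) / math.pi + 1.0)

def sqe_lr(beta, sigma_i, taus):
    """Sketch of the combined SQE + AUIP rule: scale the base rate both
    by the parameter's standard deviation and by the mean CHU of the
    minibatch, so noisier minibatches update the network less."""
    tau_bar = sum(chu(t) for t in taus) / len(taus)
    return beta / (tau_bar * math.log(1.0 + math.exp(sigma_i)))

# the mapping is monotone and bounded in (0, 2), centered at 1
assert 0.0 < chu(-100.0) < chu(0.0) < chu(100.0) < 2.0
assert abs(chu(0.0) - 1.0) < 1e-12
# a noisier minibatch (higher uncertainty) yields a lower learning rate
assert sqe_lr(0.1, 0.5, [5.0, 5.0]) < sqe_lr(0.1, 0.5, [0.1, 0.1])
```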

E. Objective Function
This section shows the way of modeling the heteroscedastic uncertainty and gives the total objective function.
The heteroscedastic uncertainty can be modeled as a regularization term of the objective function [16]. Thus, in this article, the heteroscedastic uncertainty of the input samples is modeled in the target Bayesian-CNN by the following objective function:

L_u(θ) = (1/τ̄²) L_cls(θ) + log τ̄   (14)

Finally, the total objective function L_total is given as

L_total = L_CF + L_u + L(ι)   (15)

Algorithm 1 shows the training scheme, where φ = {φ_1, φ_2} represents the parameters of both the pull-attention and the push-attention modules.
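The uncertainty-regularized term can be sketched in the spirit of the cited work [16]; this is an assumed form, not necessarily the article's exact equation, and `uncertainty_loss` is an illustrative name.

```python
import math

def uncertainty_loss(task_loss, tau):
    """Uncertainty-weighted objective in the spirit of [16] (a sketch):
    high predicted uncertainty tau down-weights the task loss, while
    the log(tau) term penalizes predicting large uncertainty everywhere,
    so the network cannot simply ignore hard samples."""
    return task_loss / (tau ** 2) + math.log(tau)

# with unit uncertainty the objective reduces to the task loss
assert uncertainty_loss(1.0, 1.0) == 1.0
# raising tau trades a smaller weighted loss against the log penalty
assert uncertainty_loss(2.0, 2.0) < uncertainty_loss(2.0, 1.0)
```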

V. EXPERIMENTS

A. Data Preparation
There are two types of datasets: source domain and target domain datasets. All source domain data are used to pretrain both the source and the target networks. The training samples in the target domain datasets are used to train the target network. In this article, the source domain dataset is a VO dataset consisting of two subdatasets, and the three target domain datasets include two SAR datasets and one infrared dataset.

1) Source Domain Dataset:
The source domain dataset consists of a VO vehicle subdataset and a VO ship subdataset.
1) The VO vehicle subdataset consists of 2639 vehicle targets and 2754 background samples [37].
2) The VO ship subdataset consists of 1000 ship samples and 3000 background samples [38].

2) Target Domain Datasets: The MSTAR dataset [39] is the SAR vehicle dataset. Table I gives the details, and Fig. 5 shows several examples of the dataset.
1) P-OpenSARShip is the SAR ship dataset, which comes from the OpenSARShip dataset [27]. In total, five subclasses of targets are used to construct the P-OpenSARShip dataset, including Cargo, Dredging, Fishing, Tanker, and Tug. Details are given in Table II, and Fig. 6 shows several examples of the dataset. 2) P-VAIS is the infrared ship dataset, which comes from the VAIS [40] dataset. In total, seven subclasses of infrared targets are used to construct the P-VAIS dataset, including Cargo, Fishing, Passenger, Pleasure, Sailboat, Speedboat, and Tug. The specific number of each category is given in Table III, and Fig. 7 shows several examples of the dataset. Note that the categories of targets in the VO and SAR (infrared) datasets do not need to be the same but only need to lie in a similar coarse-grained class. For example, the VO dataset corresponding to the MSTAR dataset is a normal car dataset, which can easily be downloaded from the Internet. Besides, compared with other few-shot learning SAR ATR algorithms, which require samples of a similar distribution (dataset) to pretrain the model, the related VO training samples used in our algorithm are much easier to collect in real scenes. For example, the algorithms in [9], [10], and [11] set samples of three classes as the support set and use all the samples of the other seven classes of the MSTAR dataset to pretrain the model, which means they need to collect other military targets. In contrast, we only need to collect normal vehicle targets.

B. Compared Algorithms
For a comprehensive comparison, the compared algorithms consist of two types. The first type is mainstream metalearning-based algorithms, including FWT [17], CDFSL-ATA [18], and meta-FDMixup [19]. The second type is transfer-based algorithms, including L2T [41], DTLF [42], [43], NWPU [44] fine-tuning, and the baseline. These algorithms do not focus on the CDFSL problem but achieve competitive performance compared with the metalearning-based algorithms. Note that [42] and [43] describe the same algorithm, published in the Remote Sensing journal and a CVPR workshop, respectively.
C. Settings

1) Evaluation Setup: All the metalearning-based algorithms are pretrained on mini-ImageNet [45]. For the transfer-based algorithms (including the proposed one), the network is pretrained with the source domain dataset. Five-way one-shot and five-way five-shot are two popular cases in metalearning-based few-shot learning algorithms [46], [47], in which 15 query samples are used. Therefore, for a fair comparison with metalearning-based algorithms, we adopt the five-way one-shot and five-shot scenarios in this article. Specifically, in the five-way five-shot case for MSTAR, we randomly sample five classes among the total ten classes at a time, and we conduct such sampling 600 times. Each time, the sampled five classes are random and different, so that every one of the ten classes will be included. Further, we report the mean value and standard deviation of the accuracy over the 600 samplings. Note that, because the MSTAR dataset contains two sets (training and testing), the training samples (support samples) and testing samples (query samples) are sampled from the training and testing sets, respectively.
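The episode construction described above can be sketched as follows. The function names and the dictionary layout are illustrative assumptions; only the sampling scheme (5 ways, 5 support, 15 query, mean and standard deviation over episodes) follows the text.

```python
import random
import statistics

def sample_episode(class_ids, n_way=5, k_shot=5, n_query=15, seed=None):
    """Hypothetical sketch of one N-way K-shot evaluation episode:
    draw n_way classes, then note the support/query budget per class
    (support drawn from the training set, query from the testing set)."""
    rng = random.Random(seed)
    ways = rng.sample(class_ids, n_way)
    return {c: {"support": k_shot, "query": n_query} for c in ways}

def summarize(accs):
    """Mean and (population) standard deviation of episode accuracies,
    as reported over the 600 sampling repetitions."""
    return statistics.mean(accs), statistics.pstdev(accs)

episode = sample_episode(list(range(10)), seed=0)
assert len(episode) == 5
mean, std = summarize([0.9, 0.8, 1.0])
assert abs(mean - 0.9) < 1e-9
```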
2) Training: We train each model for 400 epochs with the Adam optimizer. For the parameters of both kinds of attention modules, the learning rate is 10^-6. For the parameters of the target network, the learning rate is 0.1. The initial value and learning rate of each margin are 0.25 and 0.01, respectively. The target network parameters follow a mixed Gaussian distribution with mean 0 and variance 0.0025. All samples are resized to 32 × 32.

D. Results

1) Comparative Results: Tables IV and V show the comparative results against other SOTA metalearning-based CDFSL algorithms and transfer-based algorithms. Our algorithm reaches the highest accuracy on all datasets. Specifically, compared with the metalearning-based CDFSL algorithms, the maximum accuracy improvement reaches 30.09%, 7.32%, and 12.92% for the MSTAR, P-OpenSARship, and P-VAIS datasets, respectively. Compared with transfer-based algorithms, the maximum accuracy improvement reaches 11.47%, 4.98%, and 7.00% for these datasets, respectively. Besides, to validate the influence of the number of query samples, we sample 35, 55, and 75 query samples per sampling in the five-way five-shot case of the MSTAR dataset. The results are shown in Table VI.
Besides, experiments in the three-way five-shot case are conducted on the MSTAR dataset to compare the proposed algorithm with other few-shot SAR ATR algorithms [9], [10], [11]. Note that such few-shot SAR ATR algorithms set three classes as the support set and use all the samples of the other seven classes to pretrain the model. Therefore, for a fair comparison, we adopt the same setting. Table VII tabulates the experimental results.
2) Ablation Experiments: Tables VIII and IX tabulate the results of the ablation experiments for the proposed designs. "SQE" denotes using the SQE training strategy without the CF-contrastive loss, and "CF-contrastive" denotes learning fine-grained common features with the CF-contrastive loss without the SQE strategy. Each design improves the performance, and the highest accuracy appears when both are used simultaneously, showing that both designs are effective. Tables X and XI tabulate the results of the ablation experiments for different manually set margins.
3) Results for Noisy Samples: To validate the robustness of our method to noisy samples, we compute the heteroscedastic uncertainty of each testing sample in the P-OpenSARShip dataset with the method of [48] and collect the first 50 samples of each class to form a new testing set, named the P-OpenSARShip-noise dataset. The results of the five-way five-shot case on the P-OpenSARShip-noise dataset are given in Table XII.
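A sketch of this subset construction, under the assumption that "first 50 samples" means the 50 samples ranked highest by estimated uncertainty; the function name and record layout are illustrative.

```python
def build_noise_subset(samples, per_class=50):
    """Hypothetical sketch of the P-OpenSARShip-noise construction:
    rank each class's test samples by estimated heteroscedastic
    uncertainty (descending, assumed) and keep the first `per_class`."""
    subset = {}
    for cls, items in samples.items():
        ranked = sorted(items, key=lambda s: s["uncertainty"], reverse=True)
        subset[cls] = ranked[:per_class]
    return subset

toy = {"Cargo": [{"id": i, "uncertainty": u}
                 for i, u in enumerate([0.1, 0.9, 0.5])]}
sub = build_noise_subset(toy, per_class=2)
# the two noisiest samples are kept, in uncertainty order
assert [s["id"] for s in sub["Cargo"]] == [1, 2]
```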

4) Visualization of Feature Distribution:
To better understand why the proposed algorithm works, Fig. 8 visualizes the features of our algorithm and the baseline with the t-SNE algorithm [49]. The features of our algorithm are more compact and separable, validating that the fine-grained common features are transferred in our algorithm.

5) Analysis of Feature-Connection Weight:
To explore the knowledge transfer, we record the feature-connection weight λ^{m,n}. Fig. 9 shows the results. Layer n in each row and column represents the nth layer of the source and target networks, respectively. In Fig. 9(a), the features of the second and third layers obtain a higher weight for transferring, which agrees with the conclusion of the baseline. Further, among all the weights λ_diff^{m,n} in Fig. 9(b), λ_diff^{2,4} = 1.75 and λ_diff^{4,2} = 1.71 are significantly higher than the other values, which indicates that the features in these two feature connections need to be discriminative.

E. Discussion
We analyze the proposed algorithm from aspects of performance and the training process.
1) Performance: According to the abovementioned experimental results, the proposed algorithm obtains the highest accuracy among all the compared algorithms on all datasets, and the ablation experiments show that the proposed designs are effective. We explain this as follows. 1) We distinguish samples of different fine-grained classes based on the proposed CF-contrastive loss so that the common features are discriminative enough, increasing the network's performance on the fine-grained target recognition task. 2) In the SAR (infrared) imaging modality, some training samples carry much noise, which is harmful to network training, especially in the few-shot case, but existing algorithms do not deal with this problem. This article decreases the contribution of low-quality samples to network training and strengthens the learning of target features, thereby increasing the performance of the network. Besides, there is a large difference in recognition accuracy between MSTAR and P-OpenSARship. This is because the P-OpenSARship dataset is very challenging: first, the sizes of its samples differ, which distorts the samples fed into the network; second, some samples contain more than two ships. In contrast, MSTAR is a well-arranged dataset whose samples have the same size, and each sample contains only one target. Therefore, the recognition accuracies on MSTAR and P-OpenSARship are very different.
2) Training Process: Fig. 10 compares the training processes of the baseline and our algorithm on the P-VAIS dataset. From Fig. 10, we can see that the proposed algorithm converges faster than the baseline but fluctuates more. This is because the low-quality training samples are fitted to a lower degree, which affects the loss value.
3) Complexity: Although the Bayesian network contains more parameters than the original CNN, the target network, with 9.8 M parameters, still has a scale advantage over some popular small networks, such as ResNet18 (11.2 M) [35], VGG19 (20.0 M) [50], and Xception (20.8 M) [51]. Besides, on a GeForce 2080Ti, the inference for each testing sample takes only 0.00078 s, which meets real-time requirements.

VI. CONCLUSION
A CDFSL-DIM algorithm for fine-grained target recognition is proposed in this article. The CF-contrastive loss is proposed to learn discriminative common features. Besides, we propose a novel training strategy, namely SQE, which adjusts the learning rates of the minibatch according to the heteroscedastic uncertainty of the training samples. Compared with other metalearning-based CDFSL algorithms, the maximum accuracy improvement reaches 30.09%, 7.32%, and 12.92% for the MSTAR, P-OpenSARship, and P-VAIS datasets, respectively. Compared with transfer-based algorithms, the corresponding accuracy improvement reaches 11.47%, 4.98%, and 7.00% for these datasets, respectively. In future work, algorithms for reducing the number of parameters of the target network are worth studying.