Convolutional Transformer-Based Few-Shot Learning for Cross-Domain Hyperspectral Image Classification

In cross-domain hyperspectral image (HSI) classification, the labeled samples of the target domain are very limited, and it is worthwhile to obtain sufficient class information from the source domain to categorize the target-domain classes (both the same classes and new unseen classes). This article investigates this problem by employing few-shot learning (FSL) in a meta-learning paradigm. However, most existing cross-domain FSL methods extract statistical features based on convolutional neural networks (CNNs), which typically consider only the local spatial information among features while ignoring the global information. To make up for these shortcomings, this article proposes a novel convolutional transformer-based few-shot learning (CTFSL) method. Specifically, FSL is first performed on the classes of the source and target domains simultaneously to build a consistent scenario. Then, a domain aligner is set up to map the source and target domains to the same dimensions. In addition, a convolutional transformer (CT) network is utilized to extract local-global features. Finally, a domain discriminator is applied that can not only reduce domain shift but also distinguish from which domain a feature originates. Experiments on three widely used hyperspectral image datasets indicate that the proposed CTFSL method is superior to state-of-the-art cross-domain FSL methods and several typical HSI classification methods in terms of classification accuracy.

A hyperspectral image (HSI) is a spectral-spatial image [1], [2], [3] that integrates the characteristics of images and spectra. HSIs contain abundant spectral and spatial information [4], [5], have been applied to land-use and land-cover classification, and have gained increasing attention [6], [7], [8], [9]. In HSI classification, sufficient labeled samples in the same scene are needed for the scene to be classified correctly. However, the labeling process is difficult for a newly collected HSI.
Cross-domain HSI classification was proposed to resolve the difficulty of classification caused by the scarcity of ground-cover labels [10], [11], [12], [13]. It aims to exploit the similarity of land covers between multiple HSIs to form classification and recognition criteria from an HSI with sufficient labeled pixels for model training and learning, which is called the source domain or source scene. The model is then used to identify and classify another HSI with a similar scene, called the target domain or target scene, that severely lacks labeled pixels or even has no labeled pixels available.
Inevitably, difficulties and challenges in cross-scene HSI classification tasks have followed. Restricted by factors such as sensor differences, imaging time, location, and atmospheric environment, the acquired HSIs are heterogeneous [14], [15], [16], [17]. Therefore, resolving the distribution differences between the source and target domains, i.e., the domain adaptation problem, is the key to cross-scene HSI classification. In recent years, a series of HSI classification approaches has been presented to achieve cross-scene learning tasks and solve domain adaptation problems; these can be roughly divided into two types: those addressing heterogeneity of feature distribution and those addressing heterogeneity of feature space.
The former refers to HSIs collected by the same optical sensor under different angles, times, locations, etc., causing heterogeneity in the feature distribution between the same land covers in different scenes; this is manifested by the same number of spectral bands but spectral curves that may differ within the same class. The latter refers to restrictions imposed by the parameters of the optical sensors, which lead to feature-space heterogeneity between source- and target-domain HSIs; here, not only do the spectral bands differ in number, but the spectral curves of the identical class in different scenes may also differ significantly.
To address cross-scene classification from the feature-distribution-heterogeneity perspective, some works explore the similarity between the source and target domains, thus solving the spectral-offset problem. Deng et al. [18] proposed a feature embedding model based on deep metric learning, which applies the features learned from the source scene to the target scene with an unsupervised domain adaptation technique. A maximum mean discrepancy (MMD)-based graph optimal transmission (GOT) method was proposed to align the distribution discrepancy of the source and target domains [19]. An unsupervised domain adaptation method was developed for cross-scene HSI classification by utilizing an integrated framework with spectral-spatial feature dense compaction [20]. Unsupervised domain adaptation for feature learning does not demand labeled data in the target scene, but it requires a sufficiently small discrepancy between the source and target scenes. Although methods based on feature-distribution heterogeneity can decrease data migration between the two domains, they usually require that the target categories be the same as the source categories and cannot classify new unseen categories.
From the perspective of feature-space heterogeneity, Liu et al. [21] introduced spectral-shift mitigation to simultaneously minimize the amplitude shift between the source and target domains and the spectral variation within the target scene. Despite great similarities between the data of the source and target domains, the classes of the two scenes may differ, and new classes need to be considered. Recently, few-shot learning (FSL) [22], [23], [24] has been used to address this problem; its goal is to classify target-class data given just a small number of labeled samples from each class. Li et al. [25] proposed a deep cross-domain few-shot learning (DCFSL) method for cross-scene classification of HSIs with little labeled data. DCFSL overcomes domain shift by learning a domain-adaptive embedded feature space through a 3-D-CNN-based deep residual network, with two mapping layers for the source and target scenes that ensure the inputs to the embedded feature extractor share equal dimensions. In addition, DCFSL performs domain-distribution alignment with a domain discriminator. Zhang et al. [26] developed a dual-graph cross-domain few-shot learning (DG-CFSL) method to mitigate the impact of domain shift. DG-CFSL designs an intradomain distribution extraction block (IDE-block) to carry out domain alignment using nonlocal spatial information, which has powerful correspondence properties.
The foregoing FSL approaches achieve increased classification accuracy with limited labels; they commonly extract features using a convolutional neural network (CNN) and have obtained significant results for cross-scene HSI classification. However, it is difficult for a CNN to capture the sequence attributes of spectral features due to the limitations of its network backbone. In addition, the receptive field of a CNN is limited, which can easily cause information loss in the down-sampling layers, and enlarging the receptive field requires expanding the convolution kernel, which causes a dramatic increase in parameters. A transformer network [27], [28] can be utilized to overcome these issues because it can capture the sequence attributes of spectral features. Meanwhile, the vision transformer (ViT) [29] has been proposed to apply a transformer to image classification. Chen et al. [30] developed a multistage vision transformer model to form pyramid feature extraction. Wu et al. [31] introduced a spectrally enhanced and densely connected transformer model to capture local contextual and semantic features. Feng et al. [32] developed a novel spectral transformer with dynamic spatial sampling and Gaussian positional embedding to take full advantage of the flexible nature of spatial sampling, to emphasize the importance of the central pixel for HSI cube classification, and to improve adaptability. Peng et al. [33] proposed a spatial-spectral transformer with cross-attention, which is composed of a dual-branch structure with spatial and spectral sequences. However, the transformer tends to overlook some local information that may be important for HSI classification. To enhance information utilization and extract more discriminative features, we combine the CNN and transformer modules and propose a convolutional transformer-based few-shot learning (CTFSL) structure for cross-domain HSI classification.
Specifically, two FSL procedures are first executed simultaneously for the source and target domains. After the bands of the two domains are mapped to the same dimensions through the distribution aligner, a feature extractor based on a convolutional transformer (CT) network is utilized to learn spectral-spatial features, which can both expand interclass distances and reduce intraclass distances. Furthermore, a domain discriminator is employed to tackle the domain-separability problem; it can classify not only the target-domain classes that are the same as source-domain classes but also new unseen classes.
The major contributions presented in this article are summarized as follows.
1) A CTFSL framework is proposed in which a novel FSL method is developed to handle scarcely represented classes, and an FSL loss is defined to avoid overfitting to underrepresented classes.
2) The CT network is designed by composing a convolutional neural network and a vision transformer, which achieves more effective feature embedding and extracts both local detail and global information from HSI patches.
3) An adversarial loss is introduced using an FCN-based domain discriminator to match the predictions between the two domains and optimize the proposed network model for the cross-domain task.
4) It is observed that CTFSL achieves better classification results than other cross-domain FSL methods in practical applications.

The rest of this article is organized as follows. Section II briefly describes some relevant concepts. Section III explains the full details of the proposed CTFSL for cross-scene HSI classification. Section IV shows experimental results to demonstrate the superior performance of CTFSL. Finally, Section V concludes this article.

II. RELATED WORK
This section introduces several relevant concepts to better explain the proposed CTFSL.

A. Domain Adaptation
In cross-scene HSI classification, domain adaptation aims to transfer knowledge from the source domain to the target domain by mapping the data features of the two domains into the same feature space [34], [35]. Domain adaptation can resolve the distribution discrepancy between the source and target domains by learning domain-invariant features. It may take two forms: unsupervised domain adaptation [36], [20], [37] and supervised domain adaptation [38], [39], [40]. In domain adaptation, the source domain has rich learning information. Unsupervised domain adaptation refers to a target domain without labeled samples, while supervised domain adaptation means that the target domain has a few labeled samples. Our method leans toward supervised domain adaptation and proposes cross-scene few-shot domain adaptation.

B. Cross-Scene Few-Shot Learning
FSL is one type of meta-learning [41], [42] that processes images given only a small number of labeled samples [43]; cross-scene FSL aims to construct a consistent scene for the source and target domains through meta-learning [44], [45], [46]. In cross-scene HSI classification, FSL is usually defined as a $K$-way $N$-shot task [47] (i.e., $N$ labeled samples of each of $K$ unique classes), where $N$ is very small, e.g., 1 or 5 [48]. First, two HSI datasets are given: the source dataset $X_s \in \mathbb{R}^{S \times D}$ and the target dataset $X_t \in \mathbb{R}^{T \times D}$, where $X_t$ contains two parts, $D_f$ with labeled few-shot data and $D_t$ with unlabeled test data, i.e., $X_t = D_f \cup D_t$. The numbers of categories in the source and target domains are denoted $C_s$ and $C_t$, respectively. Generally, to guarantee diversity in the training samples, we set $C_s > C_t$, which is beneficial for meta-learning [49], [50].
In our method, we take the source data $X_s$ and the target labeled few-shot data $D_f$ as the training set for feature extraction, and the target unlabeled data $D_t$ as the test set for model evaluation. The FSL model operates with a task-based learning tactic in both the source and target domains, where each task is one single iteration of training. During every iteration, take the FSL on the source dataset $X_s$ as an instance. A support set is first formed by randomly selecting $C$ classes from $X_s$ with $K$ samples per class. A query set then consists of $N$ samples per class randomly selected from the identical $C$ classes, distinct from the elements of the support set. It is noteworthy that the sample labels of the query set are treated as unknown. In experiments, we usually set $K$ significantly smaller than $N$, which simulates practical few-shot classification scenarios. In summary, a $C$-way $K$-shot $N$-query FSL task is formed for the source dataset. The target FSL is similar.
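The episodic sampling described above can be sketched in a few lines. The following is a minimal NumPy illustration under our own naming (`sample_episode` is not from the paper's code); it draws $C$ classes, $K$ support shots, and $N$ disjoint query samples per class.

```python
import numpy as np

def sample_episode(labels, C, K, N, rng=None):
    """Sample one C-way K-shot N-query episode from a labeled pool.

    `labels` is a 1-D array of integer class labels; the function returns
    index arrays for the support and query sets. Names and structure are
    illustrative only.
    """
    rng = np.random.default_rng(rng)
    classes = rng.choice(np.unique(labels), size=C, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:K])         # K labeled shots per class
        query.extend(idx[K:K + N])      # N disjoint query samples per class
    return np.array(support), np.array(query)
```

Because the query indices are drawn from the remainder of the permutation, the support and query sets are guaranteed to be disjoint, as the task definition requires.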

C. Vision Transformer
Since the publication of the vision transformer (ViT) [29], it has been broadly used in various computer vision tasks, such as HSI classification [51], [52], [53], [30], due to its excellent performance. ViT is derived from the structure of the original transformer [54], [55], [56] and is easy to transplant into different tasks. The original transformer, a typical encoder-decoder model, was proposed for natural language processing. The transformer therefore consists of two parts: the encoding and decoding components. The encoding component is composed of multiple encoder layers, each of which is made up of two sublayers: self-attention and a feed-forward network [57]. Likewise, the decoding component consists of a stack of decoder layers, but each decoder layer inserts a third sublayer, encoder-decoder attention, in addition to the two sublayers of the encoder. The transformer is based entirely on self-attention mechanisms, which realize input parameter sharing via global contextual information.
Inspired by the tremendous achievements of the original transformer, ViT extends it to the field of image classification. The original transformer accepts only sequential inputs (i.e., 1-D embeddings). Therefore, the input image in ViT is first divided into a series of nonoverlapping fixed-size patches (i.e., 2-D patches), which are then projected into patch embeddings (i.e., the 2-D patches are flattened into a 1-D image sequence). Finally, the patch embeddings of the image are fed into the transformer to extract features. Self-attention can be calculated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of $K$. The attention weights obtained from the dot product of $Q$ and $K$ give the attention scores between each pair of vectors, which determine the level of attention given to the other positions when encoding the data at the current location. Dividing by $\sqrt{d_k}$ enhances gradient stability during training, and the softmax converts the scores into probabilities. Finally, each value vector is weighted by its probability, and the weighted vectors are summed to produce the final output vector.
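As a concrete illustration, the attention computation can be written in a few lines of NumPy. This is a generic sketch of scaled dot-product attention, not code from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row maximum before exp.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise attention scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # probability-weighted sum of values
```

Each output row is thus a convex combination of the value vectors, with the weights determined by the query-key similarities.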

III. METHOD
This section introduces the convolutional transformer-based few-shot learning (CTFSL) network for cross-scene HSI classification. Fig. 3 displays the structure of the proposed CTFSL, which contains four parts: few-shot learning (FSL), a distribution aligner, a feature extractor, and a domain discriminator. Specifically, FSL is executed on both the source and target categories concurrently. Then, the distribution aligner is used before the feature extractor to map the source and target domains to identical dimensions. Next, the feature extractor maps features from the two domains into a scene-consistent metric space. Finally, the domain discriminator predicts the domain to which a feature belongs and achieves distinguishability of the two domains' classes.

A. Few-Shot Learning
Given the source-domain data $X_s$ with $C_s$ classes and the target-domain data $X_t$ with $C_t$ classes, the proposed CTFSL network has two FSL tasks: the source FSL task $S_{\mathrm{fsl}}$ and the target FSL task $T_{\mathrm{fsl}}$. The two kinds of FSL are executed on the classes of the source and target domains simultaneously by episodes, enabling scene consistency between the source- and target-domain data and building a cross-scene classification model.

1) Source FSL:
In the source FSL task $S_{\mathrm{fsl}}$, $C$ classes are selected from the source classes $C_s$ to form an episode. In the source episode, the source data $X_s$ are divided into a support set $S_s = \{(x_i^s, y_i^s)\}_{i=1}^{C \times K}$ and a query set $Q_s = \{(x_j^s, y_j^s)\}_{j=1}^{C \times N}$. Specifically, $C$ categories are randomly selected from $X_s$, with $K$ samples from each category, forming the support set. The query set is formed by randomly selecting $N$ samples per class from the same $C$ classes, distinct from those in the support set. After that, the distribution aligner is first applied for dimensionality reduction of all samples in the support and query sets, after which the embedding features are obtained by the feature extractor. FSL is executed by comparing the similarity of the embedded features between the query and support sets per category. The class prototype for class $k$ in the support set $S_s$ is

$$p_k^s = \frac{1}{|S_s^k|} \sum_{(x_i^s, y_i^s) \in S_s^k} f_{\phi}(x_i^s)$$

where $S_s^k$ is the set of support samples belonging to class $k$, $|S_s^k|$ is the number of samples in $S_s^k$, $x_i^s$ denotes a support-set sample with label $y_i^s$, and $f_{\phi}$ indicates the feature extractor with parameters $\phi$. A query sample $x_j^s$ in $Q_s$ has its category distribution computed from the Bregman divergence (i.e., the Euclidean distance) with a softmax function

$$P(y = k \mid x_j^s) = \frac{\exp\big(-\mathrm{ED}(f_{\phi}(x_j^s), p_k^s)\big)}{\sum_{k'=1}^{C} \exp\big(-\mathrm{ED}(f_{\phi}(x_j^s), p_{k'}^s)\big)}$$

where $x_j^s$ represents a query-set sample with label $y_j^s$, $\mathrm{ED}(\cdot)$ denotes the Euclidean distance function, and $C$ denotes the number of distinct categories per episode. The source FSL loss of $x_j^s \in Q_s$ is calculated as the negative log-probability of its true category by the cross-entropy loss

$$L_{\mathrm{fsl}}^s = -\frac{1}{|Q_s|} \sum_{(x_j^s, y_j^s) \in Q_s} \log P(y = y_j^s \mid x_j^s).$$

2) Target FSL: Similar to the source FSL task, $C$ classes are selected from the target classes $C_t$ to form an episode in the target FSL. In the target episode, the target data $X_t$ are similarly divided into a support set $S_t$ and a query set $Q_t$. Notice that the support-set samples are selected from the labeled data $D_f$, which contains only a few samples.

Therefore, the class prototype for class $k$ in the support set $S_t$ is

$$p_k^t = \frac{1}{|S_t^k|} \sum_{(x_i^t, y_i^t) \in S_t^k} f_{\phi}(x_i^t)$$

and the class-predicted probability for a query sample $x_j^t$ in $Q_t$ is expressed as

$$P(y = k \mid x_j^t) = \frac{\exp\big(-\mathrm{ED}(f_{\phi}(x_j^t), p_k^t)\big)}{\sum_{k'=1}^{C} \exp\big(-\mathrm{ED}(f_{\phi}(x_j^t), p_{k'}^t)\big)}.$$

The target FSL loss of $x_j^t \in Q_t$ is given by

$$L_{\mathrm{fsl}}^t = -\frac{1}{|Q_t|} \sum_{(x_j^t, y_j^t) \in Q_t} \log P(y = y_j^t \mid x_j^t).$$
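The prototype, softmax-over-distances, and cross-entropy steps can be sketched in NumPy. This is an illustrative implementation under the usual prototypical-network reading of the equations, using the squared Euclidean distance; the embeddings $f_{\phi}(x)$ are assumed to be precomputed, and all function names are ours:

```python
import numpy as np

def prototypes(support_feats, support_labels, C):
    """Class prototype = mean embedding of the K support samples per class."""
    return np.stack([support_feats[support_labels == k].mean(axis=0)
                     for k in range(C)])

def query_log_probs(query_feats, protos):
    """Log-softmax over negative squared Euclidean distances to prototypes."""
    d = ((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def fsl_loss(query_feats, query_labels, protos):
    """Mean negative log-probability of the true class (cross-entropy)."""
    lp = query_log_probs(query_feats, protos)
    return -lp[np.arange(len(query_labels)), query_labels].mean()
```

A query sample is thus pulled toward its own class prototype and pushed away from the others, which is what expands interclass distances and shrinks intraclass distances in the learned metric space.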

B. Distribution Aligner
Heterogeneity of feature distribution between the source and target domains results in inconsistent spectral dimensions of the samples. Thus, a distribution aligner is employed to map the source domain (the Chikusei dataset with 128 bands) and the target domains (e.g., the Indian Pines dataset with 200 bands) to the same dimension $d$ (here, $d = 100$). The distribution aligner is implemented via a 2-D CNN, with 9 × 9 neighborhoods selected as the input spatial dimensions. Thus, assuming that $I \in \mathbb{R}^{9 \times 9 \times b}$ is the input HSI cube, where $b$ denotes the number of bands, the result obtained from the distribution aligner is

$$I_A = A(I)$$

where $I_A \in \mathbb{R}^{9 \times 9 \times 100}$ is the aligned cube and $A \in \mathbb{R}^{b \times 100}$ is the mapping function of the distribution aligner, with $b \times 100$ learnable parameters. There are $128 \times 100$ parameters for $X_s$ and $200 \times 100$ parameters for $X_t$.
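Viewed per pixel, the aligner is a linear band-wise projection. The sketch below is a simplified NumPy stand-in for the 2-D CNN layer (no bias or activation, which the actual layer may include):

```python
import numpy as np

def align_bands(patch, A):
    """Map a 9 x 9 x b patch to 9 x 9 x 100 with a learnable b x 100 matrix.

    `patch @ A` applies the same band projection at every spatial position,
    i.e., a 1 x 1 convolution over the 9 x 9 grid. Illustrative only.
    """
    return patch @ A
```

With $b = 128$ for the source and $b = 200$ for a target such as Indian Pines, two separate matrices `A` bring both domains to the shared 100-band representation expected by the feature extractor.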

C. Feature Extractor
The feature extractor extracts spatial-spectral embedding features and maps them to a scene-consistent metric space. It is based on a convolutional transformer (CT) network, which effectively combines a convolutional neural network (CNN) with a vision transformer (ViT) structure and can extract both local and global features, using the spatial and spectral information sufficiently. The feature extractor mainly consists of two subblocks. Fig. 3 shows the architecture of the feature extractor (see the feature-extractor module), and Fig. 4 shows the CT module of the feature extractor.
The input to the feature extractor is the output $I_A \in \mathbb{R}^{9 \times 9 \times 100}$ of the distribution aligner. In our method, the input patch $I_A$ is fed into the CT module, which consists of a CNN block and a ViT block. The CNN block extracts local features $f_c$ from $I_A$, and the ViT block is utilized to extract global features $f_v$. Then, the local and global features are combined to form the feature representation $f$ of the feature extractor.
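The two-branch flow of the CT module can be sketched as follows. `cnn_block` and `vit_block` are placeholders for the two branches, and concatenation is one plausible fusion rule; the paper's exact combination operator is not reproduced here:

```python
import numpy as np

def ct_features(patch, cnn_block, vit_block):
    """Fuse local CNN features f_c with global ViT features f_v.

    Both arguments are callables standing in for the two branches of the
    CT module; this is an illustrative sketch, not the paper's code.
    """
    f_c = cnn_block(patch)               # local spatial detail
    f_v = vit_block(patch)               # global context via self-attention
    return np.concatenate([f_c, f_v])    # joint local-global embedding
```

The fused vector then serves as the embedding $f_{\phi}(x)$ used by the prototype-based FSL losses and the domain discriminator.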

D. Domain Discriminator
To reduce domain shift, inspired by [40], a domain discriminator is employed with an adversarial loss to predict the domain to which a feature belongs. The domain discriminator is built on a fully convolutional network (FCN) that contains a convolutional layer with a 5 × 5 kernel, a convolutional layer with a 1 × 1 kernel, and a residual block, followed by a final convolutional layer with a 1 × 1 kernel. Except for the last layer, each convolutional layer is followed by batch normalization (BN) and a rectified linear unit (ReLU) activation function. The goal is to classify whether a feature comes from the source or the target domain. For the domain discriminator, we define an adversarial loss function $L_D$, which should be minimized:

$$L_D = -\frac{1}{n_s} \sum_{i=1}^{n_s} \log D\big(f_{\theta}(x_i^s)\big) - \frac{1}{n_t} \sum_{i=1}^{n_t} \log\Big(1 - D\big(f_{\theta}(x_i^t)\big)\Big)$$

where $D(\cdot)$ and $1 - D(\cdot)$ are the probabilities, predicted by the domain discriminator, of a sample belonging to the source and target domains, respectively; $n_s$ and $n_t$ are the numbers of source and target samples; $f_{\theta}$ denotes the features from the feature extractor with parameters $\theta$; and $x_i^s$ and $x_i^t$ are samples from the source and target domains (i.e., $x_i^s \in X_s$, $x_i^t \in X_t$), respectively.
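The adversarial objective amounts to a standard binary cross-entropy over the discriminator outputs. The following is a generic NumPy illustration, not the paper's implementation:

```python
import numpy as np

def domain_loss(d_src, d_tgt, eps=1e-12):
    """L_D = -mean(log D(f(x_s))) - mean(log(1 - D(f(x_t)))).

    `d_src` and `d_tgt` are the discriminator's predicted probabilities
    that source and target features come from the source domain; `eps`
    guards the logarithm. Illustrative sketch only.
    """
    return (-np.log(d_src + eps).mean()
            - np.log(1.0 - d_tgt + eps).mean())
```

A confident, correct discriminator drives the loss toward zero; training the feature extractor against this objective pushes the two domains' feature distributions together.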
Thus, the source domain's total loss function is

$$L_s = L_{\mathrm{fsl}}^s + L_D$$

and likewise the target domain's total loss function is

$$L_t = L_{\mathrm{fsl}}^t + L_D.$$

Finally, the nearest neighbor (NN) method is utilized to classify the unlabeled samples in the target domain during the testing phase, and the resulting classification maps are generated to evaluate the effectiveness of CTFSL.
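The testing-phase NN classification amounts to assigning each unlabeled target sample the label of its nearest labeled embedding. A minimal NumPy sketch (function and variable names are ours):

```python
import numpy as np

def nn_classify(test_feats, ref_feats, ref_labels):
    """1-NN in the learned embedding space: each test feature takes the
    class of its closest reference feature (squared Euclidean distance)."""
    d = ((test_feats[:, None, :] - ref_feats[None, :, :]) ** 2).sum(-1)
    return ref_labels[d.argmin(axis=1)]
```

Here the reference embeddings would come from the few labeled target samples (and/or the class prototypes), after both are passed through the trained feature extractor.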

IV. EXPERIMENTAL RESULTS
The experiments are performed with PyCharm on a 12th Gen Intel Core i9-12900KF processor equipped with an NVIDIA GeForce RTX 3090 Ti GPU and 64 GB of RAM; all code is executed with Python 3.7.

A. Experimental Data
The proposed CTFSL approach for cross-domain HSI classification is performed employing four public HSI datasets, namely, the Chikusei, Indian Pines, University of Pavia, and Salinas datasets.
1) Source Domain: The Chikusei dataset is used as the source domain. It was gathered over agricultural and urban areas in Chikusei, Ibaraki, Japan, by a Headwall Hyperspec-VNIR-C imaging sensor on July 29, 2014 [58]. It comprises 128 spectral bands spanning 363-1018 nm and 2517 × 2335 pixels with a spatial resolution of 2.5 m, and it contains 19 distinct land-cover categories. Fig. 5(a)-(c) presents the false-color image, the matching ground-truth map, and the matching color card of the Chikusei dataset. The classes of the Chikusei dataset and the corresponding sample numbers are shown in Table I.
2) Target Domain: The Indian Pines, University of Pavia, and Salinas datasets are used as target domains. The Indian Pines dataset was acquired over the agricultural Indian Pines test site in northwestern Indiana by the AVIRIS sensor in June 1992. It comprises 200 spectral bands spanning 400-2500 nm and 145 × 145 pixels with a spatial resolution of 20 m, and it contains 16 distinct land-cover categories. Fig. 6(a)-(c) presents the false-color image, the matching ground-truth map, and the matching color card of the Indian Pines dataset. The classes of the Indian Pines dataset and the corresponding sample numbers are shown in Table II.
The University of Pavia dataset was acquired over Pavia, northern Italy, using the ROSIS sensor in a flight campaign. It comprises 103 spectral bands spanning 430-860 nm and 610 × 340 pixels with a spatial resolution of 1.3 m, and it contains nine distinct land-cover categories. Fig. 7(a)-(c) presents the false-color image, the matching ground-truth map, and the matching color card of the University of Pavia dataset. The classes of the University of Pavia dataset and the corresponding sample numbers are shown in Table III.
The Salinas dataset was gathered over Salinas Valley, California, using the AVIRIS sensor. It comprises 204 spectral bands spanning 400-2500 nm and 512 × 217 pixels with a spatial resolution of 3.7 m, and it contains 16 distinct land-cover categories. Fig. 8 presents the false-color image, the matching ground-truth map, and the matching color card of the Salinas dataset. The classes of the Salinas dataset and the corresponding sample numbers are shown in Table IV.

B. Experimental Setup
The input patch size of the proposed CTFSL is chosen from the set {5 × 5, 7 × 7, 9 × 9, 11 × 11, 13 × 13, 15 × 15}. From Fig. 9, for all experiments on the different target domains, we observe that the classification accuracy increases with the input patch size but decreases beyond a certain point, following a roughly bell-shaped curve. Therefore, our method sets the input patch size to 9 × 9. The CTFSL method is trained with the adaptive moment estimation (Adam) optimizer. The number of training iterations is set to 10 000 and the learning rate to 1e-3. In the episodic training phase, each episode represents a C-way K-shot task. C indicates the number of categories and is set to the class number of the target domain (i.e., 9 for the University of Pavia dataset and 16 for the Indian Pines and Salinas datasets). K indicates the number of samples per class within the support set S and is always set to one for both source and target FSL. In addition, the number of samples per class within the query set Q is N_Q, which is set to 19 to evaluate the learned classifier. Furthermore, 200 labeled samples are randomly selected from each category of the source domain to acquire transferred knowledge. Finally, classification is based on a K-nearest neighbor (KNN) classifier, with the number of nearest neighbors set to 1.
To guarantee the fairness of the comparisons, five labeled samples per target-domain category were first chosen for training in all control experiments; random Gaussian noise was then added to augment the data. The remaining samples in the target domain are regarded as testing data. In addition, the cross-domain methods (DFSL+SVM, DFSL+NN, and DCFSL) learn transferable knowledge from 200 labeled samples randomly selected from each source-domain class.
To objectively assess the classification performance of the different approaches, we adopt three widely used quality indicators: the overall accuracy (OA), the average accuracy (AA), and the kappa coefficient. Because the training samples are chosen at random, every experiment was repeated ten times with arbitrarily chosen training samples to eliminate their influence, and the means and standard deviations of OA, AA, and kappa were computed over the ten repetitions.

C. Comparison of Different Methods
Comparing the proposed CTFSL method with three typical classification methods (KNN, SVM, and 3-D-CNN) and three FSL classification approaches (DFSL+SVM, DFSL+NN, and DCFSL) shows our method's advantages and efficiency. For the supervised methods (KNN, SVM, and 3-D-CNN), the classifier can be trained only on the few-shot data from the target domain. Source-domain samples cannot be used as a training set in these methods because they require the training and test categories to be the same. In particular, KNN calculates the Euclidean distances between a test sample and the training samples of the distinct categories, and assigns the test sample to the class with the smallest Euclidean distance; the number of nearest neighbors is set to 1. SVM learns a nonlinear support vector machine via the kernel method to map nonlinear data into a linearly separable space, but the standard SVM method ignores spatial information, focusing only on the spectral information in the HSI. The 3-D-CNN method enables effective extraction of deep spectral-spatial characteristics that contribute to accurate HSI classification.
In contrast, the FSL approaches (DFSL+SVM, DFSL+NN, and DCFSL) can utilize samples in the source domain to learn transferable knowledge, since the classes may differ between the source and target domains. Concretely, DFSL+SVM and DFSL+NN learn a metric space and extract spectral-spatial features via a deep residual 3-D CNN; this metric space is then used for few-shot classification with an SVM or NN classifier. The DCFSL model builds on DFSL+NN and constructs a unified structure to address the FSL and domain adaptation problems jointly. For the proposed CTFSL scheme, the aforementioned default parameters are applied in all experiments.
To confirm the effectiveness of the proposed CTFSL, experiments on the three target datasets are compared with the foregoing approaches. The first experiment was executed on the Indian Pines dataset. To compare the different methods' performance, five labeled samples per category were randomly selected from the Indian Pines dataset, and the classification experiment was repeated ten times to eliminate the influence of stochastic sampling. The classification performance of all methods was assessed using the mean and standard deviation of the OA, AA, and kappa coefficients; the optimal value for each class is bolded, and the values in parentheses are the standard deviations over the ten runs. Table V shows the classification accuracy of every category on Indian Pines under the different methods. The cross-domain FSL approaches (DFSL+SVM, DFSL+NN, DCFSL, and CTFSL) are clearly superior to the traditional classification approaches (KNN, SVM, and 3-D-CNN) in the case of limited labels. In particular, the proposed CTFSL's OA, AA, and kappa values are at least 4.13, 2.48, and 4.44 percentage points higher than those of the comparison methods, respectively, which indicates that the CTFSL method is generally feasible. To visually demonstrate CTFSL's effectiveness, Fig. 10 shows the corresponding classification maps of all the aforementioned methods. As shown in the figure, the map of the proposed CTFSL still contains some noise, but it has the smoothest spatial distribution and the best precision with the least mislabeling, consistent with the results in Table V. The second experiment was conducted on the University of Pavia dataset. Table VI displays the OA, AA, and kappa coefficient, together with the detailed classification accuracies of each class on the University of Pavia with the various classification approaches.
As Table VI shows, the KNN, SVM, and 3-D-CNN classification methods consider only the limited target-domain samples as training data, so their OA values are only 60.48%, 65.08%, and 69.87%. By contrast, the OAs of the cross-domain FSL-based classification approaches (DFSL+SVM, DFSL+NN, DCFSL, and CTFSL) are generally greater than 78%, because they can make full use of the source-domain information and the target few-shot labeled information. In addition, compared with the DCFSL approach, the OA of our proposed approach increases from 83.83% to 85.03%, which proves the method's effectiveness. For instance, the classification precision improves from 74.46% to 80.24% for Class 6 and from 56.62% to 90.75% for Class 8 in comparison with DCFSL. The AA and kappa of CTFSL are also the highest among all the compared classification methods. Fig. 11 shows the classification result maps under the different methods; in particular, the classification map of CTFSL clearly demonstrates its advantages over the other methods. The third experiment, carried out on the Salinas dataset, yielded analogous findings. Table VII shows the classification accuracy values of the compared approaches and the proposed CTFSL. For instance, compared with KNN, the classification precision improves from 75.72% to 98.08% for Class 3, from 48.43% to 83.26% for Class 8, and from 61.03% to 80.78% for Class 15. Fig. 12 visually represents the proposed CTFSL's effectiveness by showing the corresponding classification maps of all the aforementioned methods with their OAs. Evidently, the proposed CTFSL yields the classification map with the smoothest spatial distribution and the best precision with the least mislabeling in Fig. 12, consistent with the results in Table VII.
To characterize the computational complexity of the proposed CTFSL, Table VIII reports the computational efficiency (training and testing times) of the above methods on the different target domains. The three typical classification methods without cross-domain transfer (KNN, SVM, and 3-D-CNN) have shorter training times than the cross-domain FSL classification methods (DFSL+NN, DFSL+SVM, DCFSL, and CTFSL). The table shows that, although our method takes longer to train, it achieves the highest accuracy.

D. Parameter Analysis
To analyze the sensitivity of the CTFSL algorithm to the nearest neighbor size, comparison experiments under different nearest neighbor sizes are carried out on the three target-domain datasets. We set the nearest neighbor size to 1, 2, 3, 4, and 5, run ten repetitions of each experiment, and average the results. Table IX shows the classification accuracy on the three datasets under the different nearest neighbor sizes, with the best results in bold. As Table IX shows, nearest neighbor sizes of 1, 2, and 4 yield the optimal classification performance on the Indian Pines, University of Pavia, and Salinas datasets, respectively. In the comparison experiments, we set the number of nearest neighbors to 1, which is not optimal for University of Pavia and Salinas but still outperforms the other methods. Although 1 is the best setting for Indian Pines, Tables V and IX show that CTFSL still achieves better classification results than the other algorithms when the nearest neighbor size takes other values. This further demonstrates the robustness of our method.
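The nearest-neighbor evaluation step being tuned here can be sketched as a plain k-NN majority vote in the learned embedding space. The helper below is illustrative only; the function name and the Euclidean-distance choice are our assumptions, not the article's implementation:

```python
import numpy as np

def knn_predict(support_feats, support_labels, query_feats, k=1):
    """Classify each query embedding by majority vote among its k
    nearest support embeddings under Euclidean distance."""
    # Pairwise distances: (n_query, n_support).
    dists = np.linalg.norm(
        query_feats[:, None, :] - support_feats[None, :, :], axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :k]  # k closest support indices
    votes = support_labels[nearest]             # (n_query, k) label votes
    # Majority vote per query sample (ties broken by lowest class id).
    return np.array([np.bincount(v).argmax() for v in votes])
```

With k = 1 this reduces to assigning each query pixel the label of its single closest labeled embedding, which matches the default setting used in the comparison experiments.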
To investigate the effect of the labeled sample size on CTFSL's performance, 1, 2, 3, 4, and 5 labeled samples were randomly selected from each class of the target domains to build the few-shot data. The classification experiments with each number of labeled samples were repeated ten times, and the resulting accuracies of the previously mentioned methods on the Indian Pines, University of Pavia, and Salinas datasets are shown in Tables X-XII, with the best results in bold. To illustrate this visually, Fig. 13 plots the classification accuracy curves for the different numbers of labeled samples on the three target-domain datasets. As shown in Fig. 13, the OAs of all methods are closely tied to the number of labeled samples: the more labeled samples, the higher the classification accuracy, with five labeled samples per class giving the best performance. In particular, CTFSL outperforms the other methods for every labeled sample size, which demonstrates its stability.
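The few-shot splits used in this analysis can be built by drawing a fixed number of samples from each class at random. A minimal sketch, assuming the labels are given as a 1-D array of class ids (the function name is ours), is:

```python
import numpy as np

def sample_few_shot(labels, shots, rng=None):
    """Randomly draw `shots` labeled samples per class for training;
    all remaining samples form the test set."""
    rng = np.random.default_rng(rng)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:shots])   # first `shots` samples of class c
        test_idx.extend(idx[shots:])    # remainder held out for testing
    return np.array(train_idx), np.array(test_idx)
```

Sampling per class rather than globally guarantees that every class is represented by exactly the intended number of shots, even for the smallest classes.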

E. Analysis of Practical Applications
To verify the effectiveness and superiority of the CTFSL method in practical application scenarios, we conduct an experimental analysis of HSI data for a scene in the Dongting Lake Basin. The Dongting Lake Basin dataset was gathered by the hyperspectral observation satellite GaoFen (GF)-5 with its Advanced HyperSpectral Imager (AHSI) on December 8, 2019. It consists of 2008 × 2083 pixels with a spatial resolution of 30 m and 330 spectral bands in the wavelength range 400-2500 nm. GF-5 is the world's first hyperspectral satellite covering the full spectral range, enabling comprehensive observation of the land and atmosphere. After processing the GF-5 data from the Dongting Lake Basin, a scene of 452 × 380 pixels with 305 effective spectral bands was selected as the experimental dataset. The scene contains six land-cover classes and 16 584 ground-truth labels; Fig. 14(a)-(c) shows the false-color image of the scene, the corresponding ground-truth map, and the corresponding color code.
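A scene-extraction step of this kind, cropping a spatial window from the raw cube and retaining only the effective bands, might be sketched as follows; the function, its arguments, and the min-max normalization are illustrative assumptions, not the article's actual preprocessing pipeline:

```python
import numpy as np

def prepare_scene(cube, row0, col0, height, width, good_bands):
    """Crop a (height x width) spatial window from a hyperspectral cube
    of shape (rows, cols, bands) and keep only the usable bands,
    e.g. after dropping noisy or water-absorption bands."""
    scene = cube[row0:row0 + height, col0:col0 + width, good_bands]
    # Per-band min-max scaling to [0, 1], a common preprocessing choice.
    mins = scene.min(axis=(0, 1), keepdims=True)
    maxs = scene.max(axis=(0, 1), keepdims=True)
    return (scene - mins) / np.maximum(maxs - mins, 1e-12)
```

For the GF-5 case described above, such a step would reduce the full 2008 × 2083 × 330 cube to the 452 × 380 × 305 experimental scene.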
In the validation experiment, five labeled samples of each class are randomly selected for training in CTFSL and the comparison methods, and the rest are used as testing data. The number of nearest neighbors is set to 1. The results show that the CTFSL method achieves the highest classification accuracy among the compared methods, with an OA of 90.43%. Fig. 15 shows the classification result maps under the different methods, with the corresponding average OA over ten repeated experiments in parentheses. As shown in the figure, the CTFSL map still contains some noise, but it is nevertheless the most accurate and spatially smoothest classification map, with fewer mislabeled pixels than those of the other methods.

V. CONCLUSION
This article proposes a convolutional transformer-based few-shot learning (CTFSL) method for cross-domain hyperspectral image classification. The method comprises three main parts: 1) a distribution aligner based on few-shot learning to achieve dimensionality reduction; 2) a feature extractor based on a convolutional transformer network to obtain local-global features; and 3) a domain discriminator based on a fully convolutional network to reduce the domain shift. Experiments have been performed on three different real hyperspectral images, and the results show that the proposed CTFSL outperforms existing state-of-the-art FSL methods in cross-domain HSI classification, verifying its effectiveness. However, the good performance of the proposed CTFSL method comes at a relatively large computational cost. Future work should further improve its performance while reducing the computation time.