Vision Transformer With Contrastive Learning for Remote Sensing Image Scene Classification

Remote sensing images (RSIs) are characterized by complex spatial layouts and ground object structures. The vision transformer (ViT) can be a good choice for scene classification owing to its ability to capture long-range interactions between patches of input images. However, because it lacks some inductive biases inherent to CNNs, such as locality and translation equivariance, ViT cannot generalize well when trained on insufficient amounts of data. Compared with training ViT from scratch, transferring a large-scale pretrained one is more cost-efficient and performs better even when the target data are small scale. In addition, the cross-entropy (CE) loss is frequently utilized in scene classification yet has low robustness to noisy labels and generalizes poorly across different scenes. In this article, a ViT-based model in combination with supervised contrastive learning (CL) is proposed, named ViT-CL. For CL, the supervised contrastive (SupCon) loss, which is developed by extending the self-supervised contrastive approach to the fully supervised setting, can explore the label information of RSIs in embedding space and improve the robustness to common image corruption. In ViT-CL, a joint loss function that combines the CE loss and the SupCon loss is developed to prompt the model to learn more discriminative features. Also, a two-stage optimization framework is introduced to enhance the controllability of the optimization process of the ViT-CL model. Extensive experiments on the AID, NWPU-RESISC45, and UCM datasets verified the superior performance of ViT-CL, with the highest accuracies of 97.42%, 94.54%, and 99.76% among all competing methods, respectively.


I. INTRODUCTION
Thanks to the rapid development of Earth observation (EO) technology, a massive amount of remote sensing (RS) images with a high spatial resolution (HSR) are being generated every day. Interpreting these RS images, which contain rich land-cover/land-use information, has practical significance in many fields, such as object detection [1], land planning [2], and traffic management [3]. Among the many image interpretation tasks, scene classification has received increasing attention. Scene classification aims to allocate a semantic label to the input RS image, where the label is from a predefined label set that describes the content of the RS images [4], [5], [6]. Scene classification is done in feature space, so the descriptive ability of the features extracted by the model directly affects the classification performance. In the beginning, scene classification methods were mainly based on hand-crafted features, which can be divided into low-level and mid-level features. Low-level features [7], [8], [9] are usually constructed from visual attributes such as color [10], texture [11], and shape. Mid-level features are generated by encoding the low-level features through encoding methods such as bag-of-visual-words (BoVW) [12], vectors of locally aggregated descriptors (VLAD) [13], and improved Fisher kernel (IFK) [7]. These hand-crafted features depend heavily on the expertise of designers, and their capacity for information expression is limited.
With the rise of deep learning, data-driven feature extraction methods that do not rely on prior knowledge have emerged. Especially in supervised deep learning, models learn deep features by training on large labeled datasets so that they can fully exploit category information to extract high-level semantic features. Among them, convolutional neural networks (CNNs) have shown a powerful capability for feature learning in visual applications. Several classical CNNs have been proposed, such as AlexNet [14], VGGNet [15], GoogLeNet [16], ResNet [17], and U-Net [18]. Concerning scene classification, CNN-based methods can be divided into three branches depending on how they are used: employing a pretrained model as a feature extractor, fine-tuning a pretrained model, and training a new model from scratch.
In the first branch, pretrained CNNs are used as feature extractors, and the resulting features are then fused or combined to capture more visual information. The study in [19] uses different pretrained CNNs to extract visual features and fuses the resulting features. The results show that the fused features are more discriminative. In [20], a CNN model is used to extract multilayer feature maps, and these feature maps are stacked and then combined by calculating their covariance matrix. Finally, the resulting covariance matrices are used for classification. The aforementioned models demonstrate that CNNs have good generalization capability for scene classification.
In contrast with using pretrained CNNs as extractors, fine tuning pretrained CNNs on target datasets, which leads to an end-to-end model, is more straightforward and more effective. In addition, when training data are insufficient, fine tuning is preferable to training from scratch. Optimizations in the second branch focus on adjusting the networks [21], [22], [23] and the loss functions [24], [25].
Although fine tuning pretrained CNNs can achieve effective classification performance, pretrained CNNs still have some limitations. Due to the gap between natural image datasets and RSI datasets, features learned by a model trained on natural images are not perfectly suitable for RSI datasets. Moreover, modifying a pretrained model is not convenient. Many studies of the third branch therefore focused on either improving the CNN structure [26], [27], [28], [29] or constructing a hybrid model framework [30], [31] according to the characteristics of remote sensing datasets.
Overall, the CNN is a multilayer structure, where the convolutional layer plays a prominent role in extracting features from images. Thanks to the convolutional operations, the model can learn the local spatial information of the input images. By progressively expanding the receptive field of the convolutional kernels in each layer, it can acquire features with a more global view. Stacking multiple convolutional layers can boost the classification performance significantly. However, limited by the local receptive field, the CNN cannot capture long-range relationships well.
Due to the complex spatial distribution of ground objects and the bird's eye view of RS imaging equipment, it is very common for multiple ground objects to coexist in a single RS image [33], [34]. In the RS image shown in Fig. 1, which is of the "school" scene, there are multiple ground objects: a baseball field, a tennis court, a playground, roads, and so on. Furthermore, these objects are distributed across the whole image. The coexistence and dispersed distribution of multiple ground objects bring challenges to scene classification, so capturing global long-range interactions among these ground objects has vital practical significance. Besides the CNN, the transformer [35] is another deep learning structure that has taken off in the natural language processing (NLP) domain. Benefiting from the self-attention mechanism, the transformer can capture long-range interactions in input sequence data and learn a global representation. Encouraged by the success of the transformer in NLP, Dosovitskiy et al. [36] extended the standard transformer structure to visual applications and proposed the ViT model, which demonstrates enormous potential for image classification.
Although ViT has shown excellent feature learning ability, its performance on RS images has not yet reached saturation. It is possible to further improve its performance without increasing the parameter scale or integrating additional deep structures. In addition to the coexistence of multiple ground objects mentioned above, intraclass diversity (shown in Fig. 2) and interclass similarity (shown in Fig. 3) are two further nonnegligible challenges. In recent years, contrastive learning (CL) has received considerable attention due to its great potential for visual representation learning. Since 2019, research on CL has developed rapidly, resulting in many excellent methods, such as SimCLR [38], SimCLR V2 [39], MoCo [40], and MoCo V2 [41]. In this line of work, the supervised contrastive (SupCon) loss, a batch contrastive approach for the supervised setting, has the intrinsic ability to perform hard positive/negative mining [42]. So, it is possible to employ the SupCon loss to help the ViT model learn more discriminative features.
Given the appealing properties of ViT and CL, in this article, a novel two-stage end-to-end framework for scene classification is proposed, named ViT-CL. ViT-CL aims to combine the advantages of the transformer structure and the principle of contrastive learning to improve the performance of scene classification. First, considering that the scale of RS image datasets is hardly sufficient to train ViT models from scratch, transferring a large-scale pretrained ViT model to the target dataset, which can help ViT compensate for its lack of inductive biases, is preferred. Second, a joint loss that combines the SupCon loss and the CE loss is proposed to fine tune the pretrained ViT model. In this way, the two loss functions complement each other, forcing the model to learn more discriminative high-level semantic features and making the model more robust. Finally, considering that ViT is hard to optimize and sensitive to hyperparameters, we develop a two-stage optimization. In the first stage, only the CE loss is adopted to fine tune the pretrained ViT model on the target dataset. In the second stage, the proposed joint loss is utilized to fine tune the model produced in the first stage. After the two-stage fine tuning, the optimized model is obtained, but only the cross-entropy (classification) part of the model is retained for the following inference.

II. RELATED WORK
In the last two years, some studies have begun to explore how ViT performs on RS images. Bazi et al. [43] introduced the ViT model to RS image classification and improved the classification accuracy through data augmentation such as CutMix and Cutout. They also showed that the model performance could be maintained even if half of the layers were pruned to compress the network. Then, Bashmal et al. [44] applied the data-efficient image transformer (DeiT), a ViT-based model trained by knowledge distillation with fewer data, and showed that the performance of ViT was superior to CNN-based methods on the remote sensing datasets AID and NWPU-RESISC. In [45], SCViT is proposed to overcome the disadvantage that the original model can only capture global spatial features. By improving the structure of ViT, the model considers not only the detailed geometric information of high spatial resolution images but also the contribution of different channels of the class token.
Fig. 3. (a) Similarity between scene "freeway" and "runway." (b) Similarity between scene "railway station" and "industrial area." These images are from the NWPU-RESISC45 dataset [37].
In addition, as the advantages of the convolutional structure and the transformer structure complement each other, some studies have explored ways to combine these two network structures. Deng et al. [46] designed a joint loss function to build the joint framework CTNet. In this framework, the ViT model is used to capture semantic features, while the CNN model is used to extract local structure information. In [47], the advantages of the two models are integrated through knowledge distillation without increasing the computational complexity, with the ViT acting as a teacher to guide the student model ResNet18. Besides classification, that work also shows that the method generalizes well to different tasks.
The remainder of this article is organized as follows: Section III introduces ViT and the supervised contrastive loss, and then describes the proposed method ViT-CL in detail. Section IV contains both comparison and ablation experiments. The former compare our model with several classical CNN models and ViT-based methods on three well-known datasets, and the latter analyze how the optimized model works. Finally, Section V concludes this article.

III. PROPOSED METHOD
Let $\{(X_i, y_i)\}_{i=1}^{r}$ denote an RSI dataset of size $r$, where $X_i$ represents the $i$th image and $y_i$ is its corresponding category label. $X_i \in \mathbb{R}^{h \times w \times c}$, where $h$, $w$, and $c$ represent the height, the width, and the number of channels, respectively. $y_i \in \{1, 2, \ldots, m\}$, where $m$ is the predetermined number of categories.

A. Vision Transformer
The vision transformer (ViT) was proposed to apply the vanilla transformer to image tasks. The main goal is to generalize the transformer to visual applications without integrating any data-specific architecture. ViT only retains the encoder module of the standard transformer, and the complete end-to-end architecture is shown in Fig. 4.
First, to fit the standard transformer structure, the input image is subdivided into nonoverlapping 2-D patches of size $p \times p \times c$ before being passed to the transformer encoder. The patch size $p$ is usually set to 16 or 32; a smaller patch size leads to a longer sequence and vice versa. Then, the $n$ 2-D patches are flattened and passed to a linear layer to generate a patch sequence $P \in \mathbb{R}^{n \times (p^2 \cdot c)}$, where $n = hw/p^2$ is the length of $P$. In the linear layer, a learnable matrix $E \in \mathbb{R}^{(p^2 \cdot c) \times d}$ is utilized to embed these patches into a $d$-dimensional space. After that, as in most classification tasks with a transformer structure, the embedded patch sequence is concatenated with a learnable classification token $P_0$. Finally, the spatial arrangement of the patches is encoded as the position embedding $E_{pos}$, which helps the transformer to distinguish them, and added to the embedded patch sequence to obtain the embedding sequence $Z^{(0)}$. The aforementioned process is formulated as follows:

$$Z^{(0)} = [P_0;\ P_1 E;\ P_2 E;\ \ldots;\ P_n E] + E_{pos} \tag{1}$$

Next, the embedding sequence $Z^{(0)}$ is entered into the transformer encoder, which contains $L$ blocks. As shown in Fig. 4(b), there are two main subcomponents in each block: multihead self-attention (MSA) [see (2)] and a multilayer perceptron (MLP) [see (3)]. Before entering each of these two components, the input passes through a layer normalization (LN), which can stabilize the gradient of the loss with respect to the input during backpropagation. Both subcomponents employ residual skip connections, and the result serves as the input of the next subcomponent. The calculation process is as follows:

$$Z'^{(l)} = \mathrm{MSA}(\mathrm{LN}(Z^{(l-1)})) + Z^{(l-1)}, \quad l = 1, \ldots, L \tag{2}$$

$$Z^{(l)} = \mathrm{MLP}(\mathrm{LN}(Z'^{(l)})) + Z'^{(l)}, \quad l = 1, \ldots, L \tag{3}$$

Here, notice that the output of the $L$th layer, $Z^{(L)}$, is the final result of the encoder. For classification, the first token of $Z^{(L)}$, after an LN, can be regarded as the final feature representation $f$ of an input image. The calculation is as follows:

$$f = \mathrm{LN}(Z^{(L)}_0) \tag{4}$$

Then, $f$ is passed into an MLP head, which is composed of a fully connected (FC) layer and the softmax function, to predict the class label

$$\hat{y} = \mathrm{softmax}(\mathrm{FC}(f)) \tag{5}$$

The construction of MSA, the core of the transformer encoder, is shown in Fig. 4(c). Attention can be understood as the weight of interaction between tokens, and self-attention means that these tokens belong to one single sequence. For each token in the sequence $Z$, first, the attention scores between itself and all the tokens of $Z$ are calculated. Second, the sum over all token embeddings weighted by these attention scores is calculated to obtain a new embedding for the current token. Before calculating self-attention, the sequence $Z$ is mapped to three different sequences $Q \in \mathbb{R}^{(n+1) \times d_Q}$, $K \in \mathbb{R}^{(n+1) \times d_K}$, and $V \in \mathbb{R}^{(n+1) \times d_V}$ by multiplying by a learned matrix $M_{QKV}$, where $Q$, $K$, and $V$ represent query, key, and value, and $d_Q$, $d_K$, and $d_V$ are their dimensions. In theory, only $d_K = d_Q$ is required; for convenience, $d_K = d_Q = d_V$ is used. The formula is as follows:

$$[Q, K, V] = Z M_{QKV} \tag{6}$$

Then, it comes to the SA block, shown in Fig. 4(c). The dot product $Q K^{T}$ is calculated to measure the pairwise similarity between tokens in sequence $Z$. To alleviate the problem of vanishing gradients, the result is divided by $\sqrt{d_K}$. After a softmax operation, the attention weights are applied to $V$, and the final scaled dot-product attention is obtained. The entire procedure is as follows:

$$\mathrm{SA}(Z) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_K}}\right) V \tag{7}$$

Suppose that the number of heads is $h$. The MSA block computes the scaled dot-product attention $h$ times separately, using (7) with $h$ different sets of $Q$, $K$, and $V$. The resulting $h$ SA outputs are concatenated and then passed to a linear layer with parameter $W_0$ to ensure that the dimensions of the input and output of each MSA block stay the same. The formula is as follows:

$$\mathrm{MSA}(Z) = \mathrm{Concat}(\mathrm{SA}_1(Z), \ldots, \mathrm{SA}_h(Z))\, W_0 \tag{8}$$

The MLP block consists of two dense layers with a GELU activation in between and is not elaborated further here.
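To make the attention computation described above more concrete, the following minimal sketch implements single-head scaled dot-product self-attention in plain Python, together with the patch-count formula n = hw/p^2. It is an illustration under stated assumptions rather than the ViT implementation: the function names (`self_attention`, `num_patches`, `matmul`, `softmax`) and the separate projection matrices `w_q`, `w_k`, `w_v` (instead of one fused matrix) are choices made here for readability.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def self_attention(z, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence z.

    z has one row per token; w_q, w_k, w_v are illustrative projection
    matrices (assumed separate here rather than fused).
    Returns the new token embeddings and the attention weights.
    """
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # Pairwise similarities, scaled by sqrt(d_k) before the softmax.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights


def num_patches(h, w, p):
    """Sequence length n = h * w / p^2 for nonoverlapping p x p patches."""
    if h % p or w % p:
        raise ValueError("image size must be divisible by the patch size")
    return (h // p) * (w // p)


# Toy example: three 2-D tokens and identity projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
new_tokens, attn = self_attention(tokens, identity, identity, identity)
```

Every output token is a convex combination of all value vectors, which is why each token can draw on information from any other patch regardless of distance. For a 256 × 256 input with p = 16, `num_patches` gives n = 256 patch tokens, or 257 tokens once the classification token is prepended.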
From the calculation process of MSA, it can be seen that this mechanism can capture long-distance dependencies between tokens. The generated feature representation contains not only the information of all patches but also their higher order spatial interaction information. However, as mentioned previously, the transformer lacks some inductive biases, so it cannot generalize well without adequate data. The authors of [48] analyzed the effects of pretraining data scale, data augmentation, model size, and compute budget on the performance of the ViT model. They showed that for most practical purposes, compared with training a ViT model from scratch, fine tuning a large-scale pretrained ViT model on the target dataset is both more cost effective and produces better results. The authors also give some suggestions on how to choose a pretrained model.
1) The larger the pretraining dataset, the more generic the obtained model, and the larger the model, the longer the inference time.
2) The validation score obtained in the pretraining stage can serve as a direct reference index, so there is no need to transfer all available pretrained models to the target data and choose among them by comparing validation scores in the fine-tuning stage.
The pretrained ViT model mainly comes in three versions with different parameter scales: small ViT, base ViT, and large ViT. Each version is usually available with two different patch sizes, 16 and 32. See Table I for some vital parameters of the three model versions.
Following the aforementioned suggestions and the recommendations of the official ViT documents, the model B/16_21 k, which denotes base ViT with patch size 16, is chosen as the backbone network. More specifically, the model is pretrained on the large-scale dataset ImageNet-21 k (including 13 M images) [49], applying varying amounts of AugReg strategies [48].

B. Supervised Contrastive Learning
Most classification tasks employ the CE loss as the objective function. However, some studies have shown that this loss has drawbacks, such as a lack of robustness to noisy labels [49] and a tendency to produce poor margins [50], which can reduce the model's generalization ability and further affect the classification accuracy. As mentioned in the introduction, RS images are characterized by large intraclass diversity and high interclass similarity. That is to say, RS images of the same class may be very different (shown in Fig. 2), while RS images of different classes may be very similar (shown in Fig. 3). Poor margins, which mean that the CE loss does not explicitly encourage discriminative learning of features, are therefore detrimental to scene classification. In contrast, the SupCon loss can promote the learning of discriminative feature representations by pulling together the clusters of similar samples in feature space while pushing apart the clusters of dissimilar samples. It can be employed to compensate for the drawbacks of the CE loss. The SupCon loss is produced by extending self-supervised contrastive learning [38] to the fully supervised setting [42]. In the supervised setting, the selection criterion for positive samples changes from "whether it is from the same picture," as in the self-supervised setting, to "whether it belongs to the same class." Thus, the number of positive sample pairs in the contrastive loss is expanded. It has been shown that this change encourages the model to better depict the intraclass similarity. For each anchor, the SupCon loss first calculates the similarity scores between it and all the other positive samples, and then aggregates these scores. The calculation formula is as follows:

$$L_{SupCon} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)} \tag{9}$$

Here $i \in I \equiv \{1, \ldots, N\}$ is the index of the anchor, and $z_i$ is the embedding of sample $i$. $A(i) \equiv I \setminus \{i\}$ represents the overall sample set besides the sample $i$, and $P(i) \equiv \{p \in A(i) : \tilde{y}_p = \tilde{y}_i\}$ is the positive sample set, in which the samples have the same label as the anchor. The symbol "$\cdot$" represents the inner (dot) product and $\tau \in \mathbb{R}^{+}$ is the temperature parameter. It can be seen from (9) that the contrastive denominator contains the summation over negative samples, and this form improves the model's ability to discriminate between signal and noise (negative samples). Overall, the significance of the SupCon loss lies in narrowing the distance between samples from the same class in the feature space, while widening the distance between samples from different classes. However, for each anchor, only positive samples in the batch contribute to the numerator of (9), so the batch size should be larger than the number of classes to ensure that there are enough positive samples in the batch. Meanwhile, a larger batch size can also guarantee enough negatives to form a sharp contrast with positive pairs.
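The per-anchor computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `supcon_loss`, the L2 normalization of the embeddings, the choice to sum (rather than average) over anchors, and the decision to skip anchors that have no positive in the batch are all assumptions made here, following the general form of the SupCon loss in [42].

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (assumed preprocessing step)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def supcon_loss(embeddings, labels, tau=0.07):
    """Supervised contrastive loss over one batch, summed over anchors.

    embeddings: list of feature vectors, one per sample in the batch.
    labels:     list of class labels aligned with embeddings.
    tau:        temperature parameter.
    """
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    total = 0.0
    for i in range(n):
        others = [a for a in range(n) if a != i]                   # A(i)
        positives = [p for p in others if labels[p] == labels[i]]  # P(i)
        if not positives:
            continue  # assumption: anchors without positives are skipped
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / tau
                  for a in others}
        # Log of the contrastive denominator, computed stably.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability over the anchor's positives.
        total -= sum(logits[p] - log_denom for p in positives) / len(positives)
    return total


# Toy batch: two tight clusters in 2-D.
features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = supcon_loss(features, [0, 0, 1, 1])     # labels match the clusters
misaligned = supcon_loss(features, [0, 1, 0, 1])  # labels cut across clusters
```

In the toy batch, the loss is small when the labels agree with the geometric clusters and large when they cut across them, which mirrors the pull-together and push-apart behavior described in the text. The `continue` branch also shows why the batch needs enough samples per class: an anchor whose class appears only once in the batch contributes nothing.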

C. ViT-CL
In this article, a method named ViT-CL is proposed to combine the advantages of SupCon loss and ViT. ViT-CL utilizes ViT as a backbone network, and then, optimizes the backbone network by a two-stage optimizing framework with a joint loss. The framework of ViT-CL is shown in Fig. 5.
After encoding by the ViT, each image $X_i$ obtains its embedding feature $f_i$. In the proposed framework, the image features of one input batch are passed to a joint loss, which consists of the CE loss and the SupCon loss. In Fig. 5, each loss function is visualized as a task. The first task is the classification task, corresponding to the left one of the two tasks in Fig. 5. Here, the classifier structure of ViT is directly applied: first, feature $f_i$ is mapped to a new feature space for classification by the MLP head, and then, the CE loss between the predicted class probabilities and the label $y_i$ is calculated. The second task is supervised contrastive learning, corresponding to the right one of the two tasks in Fig. 5. Referring to the contrastive learning frameworks [38], [42], a projection network proj(·), which plays the same role as the MLP head in the classification task, is introduced. Some studies on contrastive learning have shown that the projection network is necessary and can help to improve the model's performance [38]. Here, proj(·) is instantiated as a two-layer MLP, whose hidden layer size is 2048 and output layer size is 128. The SupCon loss introduced in Section III-B is then calculated on the projected features proj($f_i$). It should be pointed out that in the actual minibatch optimization process, the positive sample set $P(i)$ of each anchor is limited to the batch that contains the anchor, so the batch size affects the performance of the model. Finally, the joint loss of one input batch combines the CE loss and the SupCon loss, where $\lambda$ acts as a tradeoff between these two losses and needs to be judiciously tuned to control the distinctiveness of the learned features. Given ViT's sensitivity to optimizer hyperparameters [51], tuning these hyperparameters, including $\lambda$, is time consuming. Instead, a simple but effective strategy is proposed to tackle this problem: a two-stage optimization method.
In the first stage, the pretrained ViT model selected in Section III-A is initially fine tuned only by the CE loss on the target RSI dataset. In the second stage, the fine-tuned model resulting from the first stage is fine tuned again by the joint loss.
Corresponding to the framework shown in Fig. 5, the model is fine tuned using only the classification part in the first stage. Then, in the second stage, both the classification part and the supervised contrastive learning part are used. After the two-stage joint fine tuning, only the classification part of the framework is retained to perform inference.
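The two-stage schedule can be summarized as a choice of objective per stage. The sketch below is only illustrative: the helper names `cross_entropy`, `joint_loss`, and `stage_loss` are hypothetical, and the convex combination (1 - λ)·CE + λ·SupCon is an assumed form of the joint loss, since the text only states that λ trades off the two terms. The default λ = 0.2 is the value reported in the experimental setup.

```python
import math


def cross_entropy(logits, label):
    """CE loss of one sample from raw class scores (log-sum-exp form)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]


def joint_loss(ce, supcon, lam=0.2):
    """Assumed joint loss: a convex combination weighted by lam.

    The exact combination used by ViT-CL is not reproduced here; this
    form is a stand-in that keeps lam as the tradeoff parameter.
    """
    return (1.0 - lam) * ce + lam * supcon


def stage_loss(stage, ce, supcon, lam=0.2):
    """Objective used in each fine-tuning stage.

    Stage 1: CE loss only (classification branch).
    Stage 2: joint loss (classification and contrastive branches).
    """
    if stage == 1:
        return ce
    if stage == 2:
        return joint_loss(ce, supcon, lam)
    raise ValueError("stage must be 1 or 2")


# Example with made-up loss values for one batch.
ce_value = cross_entropy([2.0, 0.5, -1.0], label=0)
first_stage = stage_loss(1, ce_value, supcon=3.0)
second_stage = stage_loss(2, ce_value, supcon=3.0)
```

Because the second stage starts from weights that already classify the target data well, the contrastive term refines an existing feature structure instead of competing with the CE term from the outset. At inference time only the classification branch is evaluated, so the projection network adds no cost.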

A. Experimental Setup
1) Datasets Description: In our experiments, three public remote-sensing datasets are utilized to evaluate the ViT-CL: Aerial Image Dataset (AID) [32], Northwestern Polytechnical University Dataset (NWPU-RESISC45) [37], and UC-Merced Land Use Dataset (UCM) [12]. Detailed information on the three datasets is displayed in Table II. Among them, the AID dataset and the NWPU-RESISC45 dataset are more challenging than the UCM dataset.
2) Hardware and Software Environment: All subsequent experiments are conducted on a personal computer, and the detailed computing environment is shown in Table III.
TABLE III PARAMETER STATISTICS FOR THE SMALL, BASE, AND LARGE VARIANTS OF THE VISION TRANSFORMER
3) Parameter Optimization Setup: For comparison purposes, the overall accuracy (OA), which indicates the percentage of correctly classified images in the total number of images, is employed to evaluate the performance of the different methods. When training or fine tuning models, 50% and 80% of the UCM dataset, 20% and 50% of the AID dataset, and 10% and 20% of the NWPU-RESISC45 dataset are randomly selected for training. Note that when using two-stage ViT-CL, the division of the training set (referring to the samples assigned to the set rather than the sample proportion) needs to be consistent in both stages. The pretrained ViT model used as the backbone of our ViT-CL is selected as described in Section III-A: a B/16 version pretrained on ImageNet-21 k with AugReg strategies. It can be downloaded from https://storage.googleapis.com/vit_models/augreg/.
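The two requirements above, a random split at a given training ratio that stays identical across both stages and the OA metric, can be sketched as follows. The function names and the use of a fixed random seed to reproduce the split are assumptions for illustration; the text does not say how the authors fixed the split, and this sketch samples uniformly over all images rather than per class.

```python
import random


def split_indices(n_samples, train_ratio, seed=0):
    """Randomly split sample indices into train and test sets.

    Using the same seed in both fine-tuning stages reproduces exactly the
    same training samples, not just the same proportion.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(n_samples * train_ratio))
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def overall_accuracy(y_true, y_pred):
    """OA: percentage of correctly classified images among all images."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


# Stage 1 and stage 2 obtain the identical split by sharing the seed.
stage1_train, stage1_test = split_indices(100, 0.2, seed=42)
stage2_train, stage2_test = split_indices(100, 0.2, seed=42)
oa = overall_accuracy([0, 1, 2, 2], [0, 1, 2, 1])
```

If the second stage drew a fresh split, some of its training images could be test images of the first stage, so the evaluation of the final model would no longer be on unseen data.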
In the optimization stage, adaptive moment estimation (Adam) is used to update the parameters of all methods, and the learning rate (LR) is set to 0.0001. Also, StepLR is used to control the LR, with step_size set to 20 and gamma set to 0.9; that is, the LR is multiplied by a factor of 0.9 every 20 epochs. All the methods are fine tuned for 100 epochs, and batch_size is set to 128, limited by the memory of the GPU. Besides, the input images are resized to 256 × 256 pixels. In addition, with respect to the parameters in the joint loss, λ is set to 0.2 based on many experiments, and the temperature τ is set to 0.07 as recommended in most contrastive learning papers [40], [41], [52].
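The step schedule described above has a simple closed form, shown in the sketch below. The function name `step_lr` is hypothetical; the rule itself (multiply the LR by gamma once every step_size epochs) is the one stated in the text, with epochs counted from zero as an assumption.

```python
def step_lr(epoch, base_lr=1e-4, step_size=20, gamma=0.9):
    """Learning rate at a given zero-based epoch under a step schedule."""
    return base_lr * gamma ** (epoch // step_size)


# LR at the first epoch of each 20-epoch segment over 100 epochs.
schedule = [step_lr(e) for e in range(0, 100, 20)]
```

Over 100 epochs the LR is decayed four times, so the final segment runs at 0.9^4 ≈ 0.656 of the initial rate. This is a gentle decay, consistent with fine tuning a pretrained model rather than training from scratch.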
Furthermore, the confusion matrices (CMs) of the ViT-CL model's predictions on the two more challenging datasets, AID and NWPU-RESISC45, are calculated to demonstrate the improvement of the ViT-CL model. The two CMs are shown in Fig. 6; (a) is for AID with a training ratio of 50% and (b) is for NWPU-RESISC45 with a training ratio of 20%. It can be seen from the CMs that the proposed method performs well on the two datasets.
In the AID dataset, which has 30 categories, 29 classes reach an accuracy greater than 90%, and 25 of them exceed 95%. Even the lowest class accuracy reaches 88%.
Moreover, from the images in Tables V and VI, it can be seen that images belonging to the same scene can appear very different, and ViT-CL has the ability to capture this diversity. As shown in Tables VII and VIII, images of different scenes may look very similar or contain the same objects, and ViT-CL is also able to distinguish between these scenes. The aforementioned results show that ViT-CL can cope well with both intraclass diversity and interclass similarity.

C. Ablation Study and Analysis
This article also conducted experiments on the variants of the model to illustrate the effectiveness of the two-stage joint optimization, including the following.
1) Fine-tune B/16_21 k: Fine tune the pretrained ViT model once, utilizing only the CE loss for classification.
2) One-stage ViT-CL: Fine tune the pretrained ViT model once utilizing the joint loss and retain the classification part for classification.
3) ViT-CL: Our proposed two-stage joint fine-tuned model.
TABLE V PART PREDICT RESULTS OF DATASET AID, CATEGORIES ARE CHOSEN FOR THEIR BIG INTRACLASS DIVERSITY
TABLE VI PART PREDICT RESULTS OF DATASET NWPU-RESISC45, CATEGORIES ARE CHOSEN FOR THEIR BIG INTRACLASS DIVERSITY
TABLE VII PART PREDICT RESULTS OF DATASET AID, CATEGORIES ARE CHOSEN FOR THEIR HIGH INTERCLASS SIMILARITY
TABLE VIII SOME PREDICT RESULT OF DATASET NWPU-RESISC45, CATEGORIES ARE CHOSEN FOR THEIR HIGH INTERCLASS SIMILARITY
TABLE IX CLASSIFICATION ACCURACIES OF THREE VIT BASED METHOD ON THE AID, NWPU-RESISC45, AND UCM DATASETS (%)
Table IX shows detailed comparisons for the three ViT-based methods. First, compared with Fine-tune B/16_21 k, one-stage ViT-CL additionally introduces the supervised contrastive loss. Their OAs illustrate that merely adding the supervised contrastive loss does not help and may even disturb the backpropagation process, preventing better results. Second, the difference between the last two models, one-stage ViT-CL and the proposed ViT-CL, is when the fine tuning by the joint loss is executed. It can be seen that the two-stage optimization, which introduces the supervised contrastive loss after the pretrained model has achieved a good result on the target dataset by initial fine tuning, makes the joint loss effective. Finally, as for Fine-tune B/16_21 k and ViT-CL, the main difference is that the latter is optimized by the two-stage joint framework, while the former is optimized only once by the CE loss. Their results show that our proposed framework can improve the OAs by more than 1% on the two more challenging datasets AID and NWPU-RESISC45. Moreover, the lower the proportion of the training set, the more significant the improvement.
Furthermore, this article computes statistics of the per-category accuracies (i.e., the diagonal elements of the CM) obtained by Fine-tune B/16_21 k and ViT-CL on the different datasets to show how much the two-stage joint fine-tuning framework improves the expressive ability of the feature representations compared with the CE loss alone. The maximum, minimum, average, and variance of the per-category accuracies are calculated. The results are shown in Table X. It can be seen from Table X that the average classification accuracies of ViT-CL on the three datasets are all higher than those of Fine-tune B/16_21 k, and so are the minimum accuracies, except the one on the NWPU-RESISC45 dataset with a training ratio of 20%. Together with the lower variances of the classification accuracies of ViT-CL, these results indicate that the introduction of the contrastive loss makes samples from the same scene more tightly clustered, while reducing the confusion among different scenes.
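The statistics described above can be derived directly from a confusion matrix, as in the following sketch. Two conventions are assumed here because the text does not state them: rows of the CM index the true class and columns the predicted class, and the variance is the population variance. The function names are hypothetical.

```python
from statistics import mean, pvariance


def per_class_accuracy(cm):
    """Per-class accuracy (%) from a confusion matrix of raw counts.

    Assumes cm[i][j] counts images of true class i predicted as class j,
    so the accuracy of class i is the diagonal entry over the row sum.
    """
    return [100.0 * cm[i][i] / sum(cm[i]) for i in range(len(cm))]


def summarize(accuracies):
    """Maximum, minimum, average, and (population) variance of accuracies."""
    return {
        "max": max(accuracies),
        "min": min(accuracies),
        "mean": mean(accuracies),
        "variance": pvariance(accuracies),
    }


# Toy two-class confusion matrix with made-up counts.
toy_cm = [[9, 1],
          [2, 8]]
toy_stats = summarize(per_class_accuracy(toy_cm))
```

A higher mean with a lower variance means the gains are spread across categories rather than concentrated in a few easy ones, which is the pattern the text attributes to ViT-CL.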
More intuitively, the t-distributed stochastic neighbor embedding (t-SNE) algorithm [54] can reduce the dimension of the feature representations generated by different models so that feature projections can be visualized in a 2-D space. The 2-D visualizations of the feature representations extracted by Fine-tune B/16_21 k and ViT-CL on the two datasets AID (training ratio 50%) and NWPU-RESISC45 (training ratio 20%) are shown in Fig. 7, where (a) is for Fine-tune B/16_21 k and (b) is for ViT-CL. From Fig. 7, it can be found that the feature structure is clear regardless of whether the two-stage joint optimization is adopted, which demonstrates the effectiveness of the backbone ViT model. Furthermore, compared with the feature clusters extracted by Fine-tune B/16_21 k, the feature clusters extracted by ViT-CL are more compact, and their boundaries are clearer. This confirms the usefulness of our framework.

V. CONCLUSION
In this work, a two-stage end-to-end framework named ViT-CL is proposed. The framework combines the ViT model with supervised contrastive learning and exploits the advantages of both to further improve the accuracy of scene classification. The backbone ViT of this framework can capture long-range dependencies among patches via the self-attention mechanism. The proposed joint loss function, composed of the cross-entropy loss and the supervised contrastive loss, helps the model learn more robust and discriminative semantic features. Besides, to avoid time-consuming parameter tuning, a two-stage fine tuning is employed to ensure that the joint loss function can show its best performance. ViT-CL has been evaluated on three public remote-sensing image datasets, and the experimental results demonstrate its effectiveness in improving the overall accuracy of scene classification compared with some classical CNN-based methods and improved ViT-based models. Moreover, the ablation experiments show how the two-stage joint fine-tuning framework improves the performance of scene classification, and it is concluded that both "two-stage" and "joint" are necessary. In the future, we will employ unsupervised contrastive learning or data augmentation strategies to build a scene classification framework with lower time consumption and better performance.