Semisupervised Semantic Segmentation With Certainty-Aware Consistency Training for Remote Sensing Imagery

Semisupervised learning is a powerful way to reduce the annotation cost of remote sensing semantic segmentation tasks. Recent research indicates that consistency training is one of the most effective strategies in semisupervised learning. The core of consistency training is keeping model outputs consistent under various perturbations. However, current consistency training-based semisupervised semantic segmentation frameworks lack an analysis of model uncertainty, which increases the generation of semantic ambiguity on remote sensing images. Therefore, we propose the certainty-aware consistency training (CACT) strategy to mitigate the influence of semantic ambiguity caused by model uncertainty. The CACT strategy consists of two novel parts: certainty-aware consistency correction (CACC) and a class-balanced adaptive threshold (CBAT) strategy. CACC starts by generating a high-quality prediction target, then models the importance of the consistent output target and corrects the output predictions according to a certainty map, increasing the focus on reliable predictions. The CBAT strategy uses a dynamic class-balanced adaptive threshold to filter out unreliable predictions, further reducing the impact of semantic ambiguity. Finally, extensive experimental results on the DLRSD, WHDLD, and Potsdam datasets demonstrate that our framework performs excellently in semisupervised remote sensing semantic segmentation scenarios.

and other economic construction fields [1], [2], [3], [4]. With the rapid development of deep neural networks (DNNs), semantic segmentation performance on remote sensing images has improved dramatically by leveraging high-quality labeled data in supervised learning scenarios [5]. However, obtaining labeled samples is often strenuous, time-consuming, and costly. For some complex remote sensing scenes, expert knowledge is needed to assist in labeling, which limits the training of an excellent fully supervised DNN [6], [7]. Considering the high annotation cost, it is necessary to explore powerful learning algorithms that can handle limited annotated data for semantic segmentation in remote sensing images.
Compared with limited labeled data, unlabeled data are ordinarily abundant and can be effortlessly accessed. Naturally, we would like to exploit unlabeled data to improve DNN models trained with few labeled samples. For this reason, semisupervised learning [8], [9] has become a hot topic, as it can address the scarcity of well-labeled samples by incorporating unlabeled data. The core of semisupervised learning is to define an unsupervised penalty function for unlabeled data. To actualize this, Rasmus et al. [10] proposed a consistency loss for unlabeled examples: they passed the same sample through the network with and without noise, then imposed a consistency loss on the two predictions. In this case, the network plays a dual role, student and teacher. As the teacher, it generates targets to guide the student's learning; as the student, it learns as usual.
As the model generates targets by itself through the teacher role in consistency training-based methods, the generated targets may well be incorrect. Therefore, a series of works has focused on generating high-quality targets. These works can be roughly grouped into two lines. One is to carefully design perturbations instead of additive or multiplicative noise, ensuring that decision boundaries lie in low-density regions, e.g., UDA [11]. The other is to select a better teacher model instead of merely replicating the student model, e.g., Mean Teacher [12]. Representatives of the former line in semisupervised semantic segmentation are [13] and [14], which add perturbations to the image or the features and then impose consistency between the multiple predictions for the same sample. A representative of the latter line is CPS [15], which constructs two networks in parallel and applies cross-pseudo supervision between the two segmentation networks.

Fig. 1. Illustration of the potential problem. Due to the small interclass variance and high intraclass variance in remote sensing images, model uncertainty drives the teacher model to generate more jittery outputs, increasing the generation of semantic ambiguity.

Even though the abovementioned methods are helpful for semisupervised semantic segmentation, problems remain when they are applied to remote sensing images. Different from natural images, remote sensing images capture more extensive scene information due to the top-view imaging method, which shrinks the variance between different landcover categories and enlarges the gap within the same landcover category. Besides, there is uncertainty during the model's training process [16]: the model's output for a sample is jittery due to model uncertainty.
The slight interclass variance and large intraclass difference in remote sensing images make jittery outputs more likely to generate semantic ambiguity, as shown in Fig. 1, impairing the performance of semisupervised semantic segmentation frameworks. Considering this, we propose a certainty-aware consistency training (CACT) strategy for semisupervised remote sensing semantic segmentation (see Fig. 2).
In detail, the CACT strategy consists of two parts: certainty-aware consistency correction (CACC) and the class-balanced adaptive threshold (CBAT). CACC first generates high-quality targets and then models a certainty map for the target of each data point on the teacher model. According to the certainty map, we reweight the consistency cost of each target, increasing the focus on reliably generated targets. The CBAT strategy filters out the most unreliable targets by setting a class-balanced adaptive threshold.
To summarize, the contributions of this article are as follows.
1) A CACT-based semisupervised semantic segmentation framework is proposed, which improves segmentation accuracy when labeled samples are scarce.
2) We propose a CACC strategy to mitigate the generation of semantic ambiguity in remote sensing images, which integrates multiple outputs and reweights the consistency loss according to a certainty map, increasing the focus on reliable targets.
3) We propose a CBAT strategy, which adaptively adjusts thresholds for different classes, further mitigating the influence of semantic ambiguity.

A. Remote Sensing Semantic Segmentation
Remote sensing semantic segmentation is one of the primary and vital tasks in the remote sensing community. Traditional segmentation methods consist of two main steps: first, the features of each pixel are extracted using hand-crafted feature descriptors [17], [18], [19]; then, pixel-level classification is performed using classifiers such as SVM. These traditional methods rely heavily on the careful design of hand-crafted features, which are not robust for complex remote sensing scenarios. Recent advances in solving this problem mainly rely on DNNs, which extract features directly from data. Long et al. [20] first proposed fully convolutional networks for semantic segmentation in an end-to-end training manner, significantly improving segmentation precision. Subsequently, a series of DNN models for semantic segmentation emerged in the remote sensing field.
The main improvements of DNN-based semantic segmentation models come in two aspects. The first is the extraction of richer contextual information for recognizing complex remote sensing scenes. To capture fuller context, Nogueira et al. [21] used atrous (dilated) convolution [22], [23], [24] to achieve dynamic multiscale contextual information acquisition without increasing network parameters. Ding et al. [25] proposed a two-stage multiscale framework for learning contextual information. Recently, some attention-based methods [26], [27], [28], [29] have also shown powerful contextual information extraction ability for semantic segmentation, significantly improving performance. The second aspect is more precise boundary localization. To achieve this, the works in [30], [31], and [32] used additional data to extract the boundary information of objects, thereby improving the accuracy of object boundary localization and segmentation. Beyond these two main aspects, some works address specific problems in remote sensing semantic segmentation, such as foreground-background imbalance [33].
However, even though DNN-based remote sensing semantic segmentation has made significant progress, it is still constrained by the need for extensive well-labeled data. When training samples are insufficient, the performance of DNN models degrades rapidly. In addition, for pixel-level semantic segmentation tasks, sample annotation is very expensive; for some complex scenes, expert experience may be essential to assist in labeling.

B. Semisupervised Semantic Segmentation
Semantic segmentation needs manual pixel-level annotation, which is time-consuming and expensive. Exploiting available unlabeled data helps promote the learning of segmentation models and is one way to address the lack of labeled samples. Recently, many researchers have attempted to adapt classical SSL methods to DNNs (DSSL). DSSL models can be roughly grouped into the following categories: consistency training-based models [12], [34], pseudolabel (self-training)-based models [35], [36], [37], [38], and GAN-based models [39], [40]. In this work, we focus on consistency training-based methods, which abide by the cluster and smoothness assumptions for semisupervised semantic segmentation.
Several works have applied consistency training to semantic segmentation and shown its huge potential. Consistency training enforces the consistency of predictions or intermediate features under various perturbations. For example, French et al. [13] proposed CutSeg, which uses CutMix [41] to augment input images and then forces the predictions of the augmented images to be consistent. The CCT model [14] adds perturbations at the feature level to force the output predictions of multiple decoders for the same object to be consistent, which conforms more closely to the smoothness assumption. The GCT model [42] further uses two segmentation networks with the same structure but different initializations and then strengthens the consistency between the predictions of the perturbed networks. CPS [15] also uses two parallel segmentation networks, applying cross-pseudo supervision that enforces consistency between them. Some works [40], [43], [44] also combine consistency training with pseudolabels to exploit unlabeled data. The consistency constraint between the predictions of augmented images pushes the decision function into low-density regions, promoting the model's recognition ability.
Although the abovementioned works have confirmed the effectiveness of consistency training, some problems remain: existing consistency training-based methods lack an analysis of model uncertainty. Model uncertainty increases the generation of semantic ambiguity in remote sensing images, leading to a decline in model performance.

A. Problem Definition and Notation
Before presenting an overview of the proposed semisupervised semantic segmentation framework, we first introduce the notation. Let $X = \{X_l, X_u\} \subset \mathbb{R}^{3 \times H \times W}$ denote the entire dataset, consisting of a small labeled subset $X_l = \{x_l\}_{l=1}^{N_l}$ with labels $Y_l = \{y_l\}_{l=1}^{N_l} \subset \mathbb{R}^{C \times H \times W}$ and a large unlabeled subset $X_u = \{x_u\}_{u=1}^{N_u}$. Here, $C$ is the number of classes, $H$ and $W$ are the height and width of the input image, and $N_l$ and $N_u$ denote the numbers of labeled and unlabeled samples, respectively. The semantic segmentation task aims to learn a projection function $F$ that maps an input image $x$ to a semantic map $y$. Formally, semisupervised semantic segmentation solves the following optimization problem:

$$\min_{\theta} \sum_{x_l \in X_l} L_s\left(F(x_l; \theta), y_l\right) + \omega \sum_{x_u \in X_u} L_u\left(x_u; \theta\right)$$

where $L_s$ denotes the per-sample supervised loss, e.g., the cross-entropy function; $L_u$ denotes the per-sample unsupervised loss, e.g., the consistency loss; $\theta$ represents the model parameters to be learned; and $\omega$ is the weight used to balance the supervised and unsupervised losses.
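The combined objective above can be sketched in a few lines. This is a minimal illustration with toy array shapes, not the paper's implementation; the L2 consistency term stands in for an arbitrary unsupervised loss:

```python
import numpy as np

def cross_entropy(probs, labels):
    # probs: (N, C) softmax probabilities; labels: (N,) integer class ids
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def semisup_loss(probs_l, labels, student_u, teacher_u, omega=1.0):
    """Total loss L = L_s + omega * L_u, with an L2 consistency
    penalty between student and teacher outputs as a stand-in L_u."""
    l_s = cross_entropy(probs_l, labels)
    l_u = np.mean((student_u - teacher_u) ** 2)
    return l_s + omega * l_u
```

When the student and teacher agree perfectly on the unlabeled batch, the total reduces to the supervised cross-entropy alone.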

B. Teacher-Student Model
Consistency training-based methods usually construct two roles, teacher and student, either explicitly or implicitly. As shown in Fig. 3, we construct the teacher-student model explicitly. The model consists of two parallel, independent segmentation networks: a teacher network $F_t$ and a student network $F_s$, which share the same structure but have different initializations. During training, the parameters $\theta_t$ of the teacher network $F_t$ at training step $n$ are an exponential moving average of the weights $\theta_s$ of the student network $F_s$:

$$\theta_t^{(n)} = \alpha\, \theta_t^{(n-1)} + (1 - \alpha)\, \theta_s^{(n)}$$

where $\alpha \in [0, 1]$ is the smoothing coefficient. When $\alpha = 0$, the teacher model shares the same weights as the student model; when $\alpha = 1$, the teacher model is decoupled from the student model. Following the mean teacher, we set $\alpha = 0.99$ in our work.

As mentioned above, there are two kinds of losses for semisupervised semantic segmentation: the supervised loss and the unsupervised loss. For supervised learning, we use the cross-entropy loss as the penalty:

$$L_s = -\frac{1}{HW} \sum_{h,w} \sum_{c=1}^{C} y_l^{(h,w,c)} \log F_s(x_l)^{(h,w,c)}$$

where $x_l$ is the input image and $y_l$ is its corresponding label.

For consistency training to be a net win, the samples should be perturbed as much as possible. Classical augmentation methods, such as translation, rotation, and scaling, are known to provide very limited variation. Other augmentation methods, such as Mixup [45], perturb input samples strongly by mixing different classes, but they are not suitable for semisupervised semantic segmentation because they make boundary pixels cross the decision boundary. Regarding classical perturbation methods for consistency regularization, French et al. [13] identified CutMix [41] and CutOut [46] as promising candidates for semantic segmentation, since they provide large perturbations of samples while avoiding crossing the decision boundary within a class.
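The EMA update of the teacher weights can be sketched parameter-wise. The dict-of-floats representation is a simplification of a real network's parameter tensors:

```python
def ema_update(theta_t, theta_s, alpha=0.99):
    """One EMA step: theta_t(n) = alpha * theta_t(n-1) + (1 - alpha) * theta_s(n),
    applied independently to each named parameter."""
    return {k: alpha * theta_t[k] + (1.0 - alpha) * theta_s[k] for k in theta_t}
```

With alpha = 0.99, the teacher moves only 1% of the way toward the student each step, smoothing out the student's training noise.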
Therefore, we adopt CutMix in the teacher-student model, as shown in Fig. 3. For unlabeled images $x_a$ and $x_b$, the mask $M$, whose value is 1 inside the white rectangle and 0 elsewhere, is used to mix them. The mixing function can be defined as

$$x_m = M \odot x_a + (1 - M) \odot x_b$$

where the operator $\odot$ denotes the elementwise product. Furthermore, the unsupervised loss can be rewritten as

$$L_u = \frac{1}{|X_u|} \sum_{x_u \in X_u} \left\| F_s(x_m) - \left( M \odot F_t(x_a) + (1 - M) \odot F_t(x_b) \right) \right\|_2$$

where $|X_u|$ is the number of unlabeled samples, $\|\cdot\|_2$ denotes the $l_2$ norm, and $F_s(\cdot)$ and $F_t(\cdot)$ are the outputs of the student model and teacher model after the softmax function, respectively.
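The mask construction and mixing step can be sketched as below. The rectangle coordinates are fixed here for illustration; real CutMix samples them randomly per batch:

```python
import numpy as np

def rect_mask(h, w, top, left, rh, rw):
    """Binary mask M: 1 inside the rectangle, 0 elsewhere."""
    m = np.zeros((h, w))
    m[top:top + rh, left:left + rw] = 1.0
    return m

def cutmix(x_a, x_b, mask):
    """x_m = M * x_a + (1 - M) * x_b, elementwise."""
    return mask * x_a + (1.0 - mask) * x_b
```

The same mixing is applied to the teacher's two prediction maps to build the consistency target for the student's prediction on the mixed image.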

C. Certainty-Aware Consistency Correction
Consistency training-based semisupervised semantic segmentation forces output predictions to remain consistent under different perturbations. As mentioned above, the semisupervised model usually serves a dual role: teacher and student. Targets generated by the teacher model are used to guide the learning of the student model. However, there is uncertainty during the training process of the teacher model, which leads it to generate unstable targets. Furthermore, the differences between ground objects in remote sensing images are small, which increases the generation of semantic ambiguity. To reduce the influence of model uncertainty and semantic ambiguity, we propose the certainty-aware consistency training strategy, which consists of CACC and CBAT.
The CACC starts with the generation of high-quality targets (predictions). To obtain more reliable predictions, for a given unlabeled image $x_u$, we first obtain $k$ different prediction probability maps $p = F_t(x) \in \mathbb{R}^{C \times H \times W}$ by performing $k$ stochastic forward passes through the perturbed teacher model, as shown in Fig. 4. As described in [47], the diversity of predictions is significantly larger for the teacher-student model than for purely supervised learning. In remote sensing images, this diversity, combined with the slight interclass difference and sizeable intraclass variance, increases the probability of generating semantic ambiguity. Therefore, we average the $k$ prediction results and sharpen the average to obtain more stable predictions. The sharpened probability map $\hat{p}$ can be defined as

$$\hat{p}_c = \frac{\bar{p}_c^{\,1/T}}{\sum_{c'=1}^{C} \bar{p}_{c'}^{\,1/T}}$$

where $\hat{p}_c$ is the $c$th channel of $\hat{p}$, $F_t^j(\cdot)$ is the $j$th output prediction of the teacher model after the softmax operator, $\bar{p} = \frac{1}{k} \sum_{j=1}^{k} F_t^j(x)$ is the averaged probability map, and $T \in (0, 1]$ is the sharpening coefficient that controls the degree of sharpening. When $T \to 0$, the output approaches a Dirac distribution; when $T \to 1$, the output approaches the original predictions.

Fig. 4. Illustration of the teacher model's output predictions. We perform $k$ stochastic forward passes through the teacher model.

Averaging multiple predictions increases their reliability to some extent. To make the predicted features more distinct, we sharpen the averaged predictions. For example, if a model's output prediction probability is [0.38, 0.32, 0.15, 0.15], the top-2 classes have similar probabilities. Sharpening with coefficient $T = 0.5$ yields approximately [0.49, 0.35, 0.08, 0.08], enlarging the gap between the dominant class and the rest.
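The sharpening step can be sketched as follows; `sharpen` operates on an already averaged probability vector, and with T = 0.5 it squares each entry before renormalizing:

```python
import numpy as np

def sharpen(p_bar, T=0.5):
    """Temperature sharpening: p_hat_c = p_bar_c^(1/T) / sum_c' p_bar_c'^(1/T).
    Smaller T pushes the distribution toward a one-hot (Dirac) vector."""
    powered = p_bar ** (1.0 / T)
    return powered / powered.sum(axis=-1, keepdims=True)
```

Running it on the paper's example [0.38, 0.32, 0.15, 0.15] with T = 0.5 gives roughly [0.49, 0.35, 0.08, 0.08], still summing to 1.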
The mean represents the overall trend of a set of output distributions, which mitigates the generation of semantic ambiguity. Furthermore, we obtain a certainty map $M_c$ that characterizes the importance of each sample, and we correct the model's output prediction for each pixel with the certainty map, as shown in Fig. 5. In statistical learning, certainty can be represented by the predictive variance, entropy variance, predictive entropy, and so on. In this work, we adopt the predictive entropy to describe the certainty of the output predictions: the lower the predictive entropy, the more reliable the prediction. The certainty map $M_c$ can be defined as

$$M_c = H\!\left(-\sum_{c=1}^{C} \hat{p}_c \log \hat{p}_c\right)$$

where $H(\cdot)$ is a nonlinear function whose value decreases as its argument $u$ increases, satisfying the condition that certainty decreases as entropy increases. After obtaining the sharpened predictions and the certainty map, we can define the certainty-aware consistency loss

$$L_{cacc} = \frac{1}{|X_u|} \sum_{x_u \in X_u} M_c \odot \left\| F_s(x_u) - \hat{F}_t(x_u) \right\|_2$$

where $\hat{F}_t(\cdot)$ denotes the sharpened probability map. Object categories with significant intraclass and minor interclass variance in remote sensing images are more difficult for DNN models, resulting in the generation of semantic ambiguity. Different from previous methods, we model the certainty of each pixel's prediction and give each pixel an additional weight according to the certainty map: the more reliable the prediction, the greater the weight, and conversely, the smaller the weight. Thus, we increase the focus on reliably predicted pixels and reduce the impact of semantic ambiguity.
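A sketch of the entropy-based certainty map follows. The choice H(u) = exp(-u) is our assumption for illustration; the excerpt only requires H to decrease monotonically with entropy:

```python
import numpy as np

def certainty_map(p_hat, eps=1e-12):
    """Per-pixel certainty from predictive entropy.
    p_hat: (C, H, W) sharpened class probabilities.
    H(u) = exp(-u) is one monotonically decreasing choice (an assumption;
    the paper's exact H is not given in this excerpt)."""
    entropy = -np.sum(p_hat * np.log(p_hat + eps), axis=0)  # (H, W)
    return np.exp(-entropy)
```

A uniform two-class pixel (entropy ln 2) gets certainty 0.5, while a confident pixel gets a weight near 1, so reliable predictions dominate the weighted consistency loss.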

D. Class-Balanced Adaptive Threshold
The CACC imposes different constraints on different samples to increase the focus on reliable samples, which reduces the transmission of false semantic information to a certain extent. To further mitigate the impact of semantic ambiguity caused by model uncertainty, and inspired by threshold methods in semisupervised learning, we propose a CBAT strategy, which filters out unreliable targets by setting a threshold. Unlike the fixed thresholds in [11] and [48], the CBAT balances the diversity between different categories and adjusts itself adaptively. Generally speaking, a fixed threshold ignores the variance across classes, whereas different categories have correspondingly different learning difficulties. For instance, homogeneous objects, e.g., fields, are usually better identified than anisotropic objects, e.g., artificial objects. The origin of this phenomenon is that homogeneous objects have tiny intraclass variance, while anisotropic objects have large intraclass differences. Significant intraclass differences lead to larger fluctuations in prediction results and are more prone to generating semantic ambiguity.
Therefore, we propose the CBAT instead of a fixed threshold. We use class-level entropy to characterize the learning difficulty of different classes: a class with low (high) entropy is considered easy (difficult) to learn, as shown in Fig. 6. For a complicated category, we expect more samples to be utilized, so we lower its threshold appropriately to improve the learning of hard classes. The average class-level entropy on the unlabeled data can be computed as

$$I_c = \frac{1}{N_c} \sum_{x_u \in X_u} \sum_{h,w} \mathbb{1}\!\left(\hat{y}^{(h,w)} = c\right) I^{(h,w)}_{x_u}$$

where $I_c$ is the entropy of the $c$th class and $N_c$ is the number of pixels predicted as the $c$th class in the whole unlabeled dataset. $\mathbb{1}(\cdot)$ is the indicator function: when the input condition is true, its value is 1; otherwise, it is 0. $\hat{y}$ is the pseudolabel of pixel $(h, w)$ in the unlabeled image $x_u$, $p^{(h,w,c)}_{x_u}$ is the softmax probability of pixel $(h, w)$, and $p_{x_u} = F_t(x_u)$ is the softmax probability map of the input unlabeled image. $I^{(h,w)}_{x_u}$ refers to the pixelwise entropy over all $C$ classes:

$$I^{(h,w)}_{x_u} = -\sum_{c=1}^{C} p^{(h,w,c)}_{x_u} \log p^{(h,w,c)}_{x_u}.$$

The higher the predicted class-level entropy, the more difficult the class is to learn. We lower the threshold of hard-to-learn classes so that they obtain more supporting samples. Consequently, the CBAT is defined as a function of the initialized fixed threshold $\tau_0$ and the normalized class-level entropy $I_{nc} = I_c / \max_c I_c$, assigning lower thresholds to classes with higher normalized entropy. The certainty-aware consistency constraint is then applied only to pixels whose confidence $\hat{p}^{\,c}_{x_u}$ of belonging to the $c$th class exceeds the class threshold, where $F_s(x_u)$ and $F_t(x_u)$ are as shown in (10) and $M_c$ is obtained through (8). Benefiting from the CBAT strategy, the learning of each category can be better promoted. Lastly, the total loss $L$ can be written as

$$L = L_s + \omega L_u.$$
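Since the exact threshold formula is not recoverable from this excerpt, the linear discount below is a hypothetical stand-in that preserves the stated behavior (higher normalized class entropy, lower threshold); the 0.5 discount factor is our illustrative choice, not the paper's:

```python
import numpy as np

def class_thresholds(class_entropy, tau0=0.9):
    """Hypothetical CBAT sketch.
    I_nc = I_c / max_c I_c (normalized class-level entropy);
    tau_c = tau0 * (1 - 0.5 * I_nc) is an assumed linear form that
    lowers the threshold for harder (higher-entropy) classes."""
    i_nc = class_entropy / class_entropy.max()
    return tau0 * (1.0 - 0.5 * i_nc)
```

Pixels whose teacher confidence for their pseudolabel class falls below the class's threshold would then be excluded from the consistency loss.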

IV. EXPERIMENTS
In this section, we implement sufficient experiments to evaluate the performance of the proposed method, including the ablation studies and comparisons with the state-of-the-art model. First, we will briefly introduce the remote sensing image datasets for semisupervised semantic segmentation, which are used in experiments. Then, the evaluation metrics and implementation details of the proposed method and other compared methods are explained. After that, the quantitative analysis results of our method on the related datasets are introduced in detail. Following this, we provide extensive visual analysis to validate the proposed model's effectiveness. Finally, we conduct time complexity and convergence analysis, showing the proposed method's superior performance.
A. Datasets
1) DLRSD: The DLRSD [49] dataset is a dense labeling dataset that can be used for pixel-based tasks such as semantic segmentation. It contains a total of 2100 images of size 256 × 256 pixels, which originate from the UC Merced archive [50]. The images, with a pixel resolution of 0.3 m, were manually extracted from large images in the USGS National Map Urban Area Imagery collection for various urban areas around the country. Among these images, we take 1686 for training and 414 for evaluating our approach.
2) WHDLD: The WHDLD [51] dataset is the second dense labeling dataset used in our experiments. Its images are cropped from a large HR remote sensing image of the Wuhan urban area. The pixels are divided into six categories: building, road, pavement, vegetation, bare soil, and water. WHDLD contains 4940 RGB images of size 256 × 256 pixels with a pixel resolution of 2 m. Among them, 3720 images are used for training and the remaining 1220 for validation.
3) Potsdam: The Potsdam dataset [52] consists of 38 tile images of size 6000 × 6000 pixels, with a spatial resolution of 0.05 m. The dataset is divided into six landcover categories: clutter/background, car, tree, low vegetation, buildings, and impervious surfaces (e.g., roads). Among them, 24 tiles are used for training and the remaining 14 for testing. The images are cropped into nonoverlapping patches with a window size of 512, yielding 2904 cropped images for training and 1694 for testing.
For the semisupervised semantic segmentation task, we regard a certain proportion of the data as labeled samples and the remaining data as unlabeled samples. The division ratios are 1/12, 1/8, 1/4, and 1/2, respectively. The number of labeled and unlabeled images/patches is shown in Table I.

TABLE I: Number of labeled images, unlabeled images, and validation images under different label ratios for the DLRSD, WHDLD, and Potsdam datasets.

B. Evaluation Metrics
To measure performance on each dataset, we use the mean F1 score (mF1) and mean intersection over union (mIoU) as evaluation metrics. The IoU is defined as

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

where TP, FP, TN, and FN represent true positives, false positives, true negatives, and false negatives, respectively. The F1 score can be defined as

$$F1 = \frac{(1 + \beta^2)\, P \times R}{\beta^2 P + R}$$

where $P$ is precision, $R$ is recall, and $\beta$ is the equivalence factor between precision and recall. The precision and recall are defined as

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}.$$

The mIoU and mF1 average these quantities over all classes. Note that we adopt mIoU as the major metric.
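The per-class computation and averaging can be sketched for flat label arrays (β = 1 is assumed for F1 here):

```python
import numpy as np

def iou_f1(pred, gt, num_classes, eps=1e-12):
    """Per-class IoU = TP/(TP+FP+FN) and F1 = 2PR/(P+R),
    averaged over classes to give mIoU and mF1."""
    ious, f1s = [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        ious.append(tp / (tp + fp + fn + eps))
        p = tp / (tp + fp + eps)
        r = tp / (tp + fn + eps)
        f1s.append(2 * p * r / (p + r + eps))
    return np.mean(ious), np.mean(f1s)
```

For example, predictions [0, 0, 1, 1] against ground truth [0, 1, 1, 1] give an mIoU of about 0.583 and an mF1 of about 0.733.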

C. Implementation Details
We use ResNet-50 [53] pretrained on ImageNet [54] as the backbone of the segmentation model. The segmentation model adopts deeplabv2 [22] or deeplabv3+ [24], as described in detail in Sections IV-D and IV-E2. We train the model with a standard stochastic gradient descent optimizer with momentum 0.9 and weight decay 10^-3. The learning rate follows the poly schedule, starting at 10^-2 for the DLRSD and WHDLD datasets and 5 × 10^-3 for the Potsdam dataset. The smoothing coefficient α and temperature coefficient T are set to 0.99 and 1/2, respectively. All comparative experiments are trained with a batch size of 16, including 8 labeled and 8 unlabeled images. We train the model for 120 epochs on the DLRSD dataset, 60 epochs on the WHDLD dataset, and 80 epochs on the Potsdam dataset. During testing, the output predictions of the student model are used for evaluation. All compared methods adhere to the same settings and run on the PyTorch 1.7.1 framework with one NVIDIA A100 GPU.
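The poly schedule mentioned above is commonly lr = base_lr · (1 − iter/max_iter)^power; the exponent 0.9 is the usual default and an assumption here, since the paper does not state it:

```python
def poly_lr(base_lr, it, max_it, power=0.9):
    """Poly learning-rate decay: starts at base_lr and decays
    smoothly to zero at max_it."""
    return base_lr * (1.0 - it / max_it) ** power
```

For the DLRSD setting, poly_lr(1e-2, it, max_it) would decay from 0.01 at the first iteration to 0 at the last.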

D. Comparing With Semisupervised Works
We compare the proposed method with several prevalent semisupervised semantic segmentation methods in computer vision, including mean teacher (MT) [12], GCT [42], CutSeg [13], CCT [14] and CPS [15], which are described in Section II-B. For fair comparisons, all the compared methods used the deeplabv3+ with the pretrained ResNet-50 as the segmentation network.
1) Comparison on DLRSD Dataset: Table II, Table V, and Fig. 7(a) show the segmentation results of all comparison methods under different label ratios. As shown in Table II, our model outperforms all other comparison methods. In detail, our method obtains an mIoU 8.59 points higher at the 1/12 ratio, 9.44 points higher at the 1/8 ratio, 6.74 points higher at the 1/4 ratio, and 2.01 points higher at the 1/2 ratio compared with SupOnly. Compared to the other semisupervised semantic segmentation algorithms, our method improves mIoU by at least 2.01, 2.11, 2.68, and 0.81 points at the 1/12, 1/8, 1/4, and 1/2 ratios, respectively. Fig. 7(a) shows a more intuitive result: we achieve comparable performance with the 1/4 label ratio to other methods with the 1/2 label ratio. Furthermore, we observe that the fewer the labeled samples, the more pronounced the performance improvement of our method. This shows that the proposed CACT strategy can reduce the generation of semantic ambiguity and improve the performance of semisupervised semantic segmentation.
Besides, we further analyze the reasons for our model's success by analyzing the IoU of each class. As shown in Table V, the discrimination accuracy of manufactured object categories in our model is significantly improved. For example, the IoU increases by 8.07-15.90 points for the class "Plane," 2.05-12.79 points for "Buildings," 0.95-7.09 points for "Cars," and 0.91-20.51 points for "Tanks." These objects are usually either small or made of different materials in remote sensing images, which enlarges the intraclass variance and shrinks the interclass difference; their outputs are therefore more unstable, increasing the generation of semantic ambiguity. Our model identifies these objects better, indicating that it can reduce the impact of semantic ambiguity. This is also partly due to the proposed CBAT, which fully exploits the learning difficulty of different categories.
2) Comparison on WHDLD Dataset: To further verify the effectiveness of the proposed method, we also evaluated it on the WHDLD and Potsdam datasets. Table III and Fig. 7(b) show the comparison between our proposed method and other semisupervised semantic segmentation methods on the WHDLD dataset. Our method performs better than all comparison methods. In detail, it improves mIoU by 4.24, 3.45, 2.19, and 2.01 points compared to SupOnly at the 1/12, 1/8, 1/4, and 1/2 label ratios, respectively. Compared to GCT, our model also improves mIoU by 1.40 points (1/12 label ratio), 1.02 points (1/8), 0.95 points (1/4), and 1.48 points (1/2). As shown in Fig. 7(b), we achieve superior performance using the 1/4 label ratio compared to the results of other methods using the 1/2 ratio.
3) Comparison on Potsdam Dataset: Table IV and Fig. 7(c) show the numerical segmentation results of other semisupervised semantic segmentation methods and our proposed method on the Potsdam dataset. Our method also achieves competitive performance. At the 1/12 label ratio, mIoU increases by 2.10-6.02 points and mF1 by 1.56-4.20 points. For other label ratios, the performance of our model also improves to varying degrees. Fig. 7(c) illustrates a more intuitive result: our model improves significantly when samples are fewer. This success is owing to the CACT strategy.
The superiority of our model can be summarized in two aspects. First, our model analyzes the certainty of each sample's predictions, which mitigates the generation of semantic ambiguity. Second, the proposed CBAT filters out unreliable samples while taking the diversity of different categories into account, which aids the learning of complex categories and further mitigates the influence of semantic ambiguity.

E. Ablation Study
In this section, we conduct ablation experiments to explore the proposed model and verify the effectiveness of our method.

1) Ablation Study for the Proposed Components:
In order to verify the effectiveness of the proposed components in our framework, we conduct ablation experiments on the DLRSD dataset. We adopt the teacher-student model without the CACC and CBAT as the baseline, and the segmentation model adopts deeplabv3+ with pretrained ResNet-50. As shown in Table VI, the baseline achieves 62.80% mIoU (1/12 label ratio). Adding CACC or CBAT to the baseline yields 66.73% and 66.48% mIoU, improvements of 3.93 and 3.68 points, respectively. The increase in segmentation accuracy shows that our CACT can reduce the influence of semantic ambiguity and bootstrap the model to produce more stable predictions.
Last but not least, combining the two components yields the best mIoU of 67.45%, an improvement of 4.65 points. This shows the independence and complementarity of each component in our method.
2) Ablation Study for Different Segmentation Networks: Table VII shows the experimental results with different segmentation networks. The comparison experiments are conducted on all three datasets with different label ratios. When we use deeplabv2 as the segmentation network, the mIoU improves by 8.51 points on the DLRSD dataset, 4.32 points on the WHDLD dataset, and 5.40 and 10.45 points on the Potsdam dataset; when we use deeplabv3+ as the segmentation network, the mIoU increases by 8.59 points on the DLRSD dataset, 4.34 points on the WHDLD dataset, and 6.02 and 11.26 points on the Potsdam dataset, all under the 1/12 label ratio. Besides, our method improves dramatically over supervised learning under the other label ratios as well, whether using deeplabv2 or deeplabv3+ as the segmentation network. This shows that our semisupervised semantic segmentation method can be easily implemented by replacing the segmentation model.
3) Analysis of the Hyperparameter ω: We investigate the influence of different values of ω, which balances the supervised loss and the unsupervised loss in (16). The experiments are conducted on the DLRSD dataset with a 1/8 label ratio. As shown in Fig. 8(a), the model's accuracy first increases and then decreases as ω grows. If ω is small, the model focuses on fitting the labeled data and does not fully exploit the unlabeled sample information; if it is too large, the model pays too much attention to unlabeled samples, which brings more noise interference. We obtain the best performance when ω is around 15. For convenience, we set ω = 15 for all the experiments in our work.
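In code, the role of ω in (16) amounts to a simple weighted sum (a sketch; the argument names are placeholders for the paper's supervised and unsupervised loss terms):

```python
def total_loss(loss_sup, loss_unsup, omega=15.0):
    """Combined objective: supervised loss on labeled data plus an
    omega-weighted unsupervised (consistency) loss, as in (16).
    A small omega under-uses unlabeled data; a large omega amplifies
    the noise carried by pseudo-labels."""
    return loss_sup + omega * loss_unsup
```

With ω = 15, an unsupervised loss of 0.1 contributes 1.5 to the total, on the same order as a typical supervised term.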

4) Analysis of the Sharpening Coefficient T: We also analyze the effect of different sharpening coefficients T in (6) on model performance. The experiments are conducted on the DLRSD dataset, with the same setup as for the hyperparameter ω. As shown in Fig. 8(b), we obtain the best performance when T is around 0.5. When T → 0, the output probability distribution of a pixel sample approaches a Dirac distribution, which discards the information of other categories and results in overfitting. When T → 1, the model's output does not change and sharpening plays no role. Consequently, we make a tradeoff and choose T = 0.5.

5) Analysis of the Hyperparameter τ0 for the CBAT: We also analyze the impact of the hyperparameter τ0 in (13), the initial threshold that controls the credibility of the generated targets. The parameter analysis is conducted on the DLRSD dataset, with τ0 ∈ [0.6, 0.9]. Table VIII shows the concrete experimental results for the CBAT under different values of this hyperparameter. As the initial threshold τ0 increases, mIoU first increases and then decreases. When τ0 is set too low, more useless pixel samples are selected, leading to overfitting; when τ0 is set too high, most samples are filtered out, making the class-level entropy inaccurate. The model achieves the best accuracy when τ0 is approximately 0.7.
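The two operations analyzed above can be sketched as follows (pure Python, per pixel; the exact per-class update rule of CBAT is not reproduced here, so the entropy-based threshold adjustment below is our assumption, not the paper's equation (13)):

```python
def sharpen(probs, T=0.5):
    """Temperature sharpening in the spirit of (6): raise each class
    probability to the power 1/T and renormalize. T -> 0 approaches a
    one-hot (Dirac) distribution; T = 1 leaves the distribution unchanged."""
    powered = [p ** (1.0 / T) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def cbat_threshold(tau0, class_entropy, max_entropy):
    """Hypothetical class-balanced adaptive threshold: classes with higher
    prediction entropy (harder classes) receive a lower threshold, so more
    of their pixels survive filtering."""
    return tau0 * (1.0 - 0.5 * class_entropy / max_entropy)


def keep_pixel(probs, threshold):
    """Retain a pixel's pseudo-label only if its top confidence passes
    the (per-class) threshold."""
    return max(probs) >= threshold
```

Starting all class thresholds at τ0 = 0.7 and lowering them for high-entropy classes reflects the trade-off the ablation observes: too low a threshold admits noisy pixels, too high a threshold starves the class-level statistics.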

F. Visualization Analysis
1) Visual Analysis for Each Component:
The effect of the CACC and CBAT can be visualized in Fig. 9. For each image, we visualize the heatmaps of the baseline, CACC, CBAT, and the integration of CACC and CBAT, that is, Fig. 9(b)-(e). In addition to the heatmaps, we also present the output predictions of each component, as shown in Fig. 9(f)-(i). Note that the heatmap's color represents the uncertainty of the prediction result. Red and yellow indicate more unreliable predictions, meaning those regions are more prone to semantic ambiguity, while blue indicates more reliable predictions. It can be seen that the baseline (MT), which uses regular consistency training, produces more unreliable predictions in Fig. 9(b) and (f). The reason is that the high intraclass variance and slight interclass difference in remote sensing images make them susceptible to semantic ambiguity. Compared with the baseline, our proposed CACC and CBAT strategies can significantly suppress the generation of semantic ambiguity while producing more stable predictions. The success of our model in mitigating the influence of semantic ambiguity comes from two aspects. On the one hand, CACC models the reliability of the predictions and focuses on reliable predictions according to the certainty map, so that we obtain a more reliable output, as shown in Fig. 9(c) and (g). On the other hand, CBAT directly filters out false predictions through a dynamic class-balanced adaptive threshold, improving the model's performance, as shown in Fig. 9(d) and (h).

2) Qualitative Results: Figs. 10-12 show the visualizations of the state-of-the-art methods and our method. It can be seen that our model achieves better segmentation results. In more detail, there are slight interclass differences among the types of objects in Fig. 10.
This characteristic makes the model more prone to generating semantic ambiguity, resulting in degraded performance and the transfer of false knowledge. CPS, CCT, and other methods do not perform discriminant analysis on the predictions, resulting in poor segmentation performance. Instead, we model the certainty of each sample, which effectively reduces semantic divergence and achieves better segmentation results. In Fig. 11, the other methods cannot segment thin lines well, because fine image structures, such as boundaries, are more prone to semantic confusion in the output space. Our model handles this problem better, which proves the effectiveness of CACC and CBAT. In Fig. 12, the other methods fail to distinguish the background, while our model identifies it successfully. This success is owing to our model's more stable predictions, which reduce false knowledge transfer. Generally speaking, our model effectively reduces the semantic divergence caused by vagueness by modeling and rectifying the ambiguity of the output.

G. Time Complexity Analysis
To further demonstrate the performance of our method, we conduct a time complexity analysis. We perform experiments on the DLRSD dataset with a 1/8 label ratio and one NVIDIA A100 GPU. All the models use deeplabv3+ based on ResNet-50 as the segmentation network, and the number of training iterations is the same for all experiments; in other words, the training epochs are identical. As shown in Table IX, our model's training time is similar to that of GCT, yet its mIoU is higher by 3.20 points. Compared with other methods, such as CPS and CCT, the training time of our method is increased. This is because we perform multiple forward passes through the network to obtain more reliable predictions and the corresponding certainty maps, which are used to correct the prediction results during training. Therefore, although the training time is slightly increased, the multiple forward passes make our results more reliable, as evidenced by our model's best mIoU result.
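The multiple-forward-pass step described above can be sketched as follows (a simplified stand-in: `noisy_forward` is a hypothetical placeholder for one stochastic pass, e.g. with dropout enabled, and is not the paper's actual network):

```python
import random


def noisy_forward(base_probs, noise=0.05, rng=random):
    """Hypothetical stochastic pass: the base class probabilities plus
    small bounded noise, standing in for a perturbed network forward."""
    return [max(0.0, min(1.0, p + rng.uniform(-noise, noise))) for p in base_probs]


def mean_and_certainty(base_probs, n_passes=8, rng=None):
    """Average several stochastic predictions for one pixel and derive a
    certainty score (1 minus the mean per-class variance across passes)."""
    rng = rng or random.Random(0)
    passes = [noisy_forward(base_probs, rng=rng) for _ in range(n_passes)]
    n_classes = len(base_probs)
    mean = [sum(p[c] for p in passes) / n_passes for c in range(n_classes)]
    var = sum(
        sum((p[c] - mean[c]) ** 2 for p in passes) / n_passes
        for c in range(n_classes)
    ) / n_classes
    return mean, 1.0 - var
```

Each extra pass adds one full forward cost per training step, which matches the trade-off reported in Table IX: higher training time, unchanged inference time (only the student runs at test time).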
In addition, in actual model deployment, we care more about the inference time of the model. As shown in Table IX, all models use the student model for inference, so the inference time does not change much, at about 105 fps. Consequently, the cost of a slightly increased training time for semisupervised semantic segmentation models is acceptable.

V. CONCLUSION
In this work, we propose a CACT strategy to reduce the semantic ambiguity caused by model uncertainty in the current semisupervised semantic segmentation framework for remote sensing images. The success of our model in mitigating this ambiguity comes from two aspects. On the one hand, the strategy generates high-quality targets and enhances reliable predictions. On the other hand, it removes unreliable predictions through the CBAT, further reducing the impact of semantic ambiguity on segmentation results. Last but not least, extensive experiments have shown the effectiveness of our CACT-based semisupervised semantic segmentation framework for remote sensing images.
Even though the proposed method shows outstanding performance, it may still fail when there is a significant distribution gap between unlabeled and labeled data, because the method adheres to the assumption that all data follow the same distribution. In the future, we will explore more robust semisupervised segmentation methods for data from different distributions.