Cloud Detection Method for Remote Sensing Images Based on Improved DeeplabV3+

Cloud cover is an unavoidable phenomenon in remote sensing images: the presence of clouds obscures ground information and, to a large extent, degrades image quality. Detecting clouds in remote sensing images is therefore the foundation of, and the key to, further exploiting remote sensing image information. To address the misjudgments, omissions, and long training times of current deep-learning-based cloud detection methods, this article proposes an improved cloud detection method for remote sensing images based on the DeeplabV3+ network, which aims to detect cloud-covered regions more accurately, quickly, and efficiently. The training time is reduced by improving the Xception backbone network to cut the number of parameters; an improved CBAM module is added after the Atrous Spatial Pyramid Pooling (ASPP) module to enlarge the receptive field and better capture contextual information; and a Global Attention Upsample (GAU) module replaces the traditional bilinear-interpolation up-sampling in the decoder, yielding higher-quality up-sampling, more spatially aware feature fusion, and improved segmentation accuracy. Experiments were conducted on datasets from China's ZY-03 satellite, comparing the proposed method with the traditional DeeplabV3+, MobileNet, and U-Net, and generalization was verified on the public Sentinel-2 dataset. Compared with the traditional DeeplabV3+ network, accuracy improves by 3.91%, precision by 3.12%, recall by 2.78%, and Mean Intersection over Union (MIoU) by 5.75%. Compared with MobileNet, accuracy improves by 1.79%, precision by 1.47%, recall by 1.42%, and MIoU by 4.15%. Compared with U-Net, accuracy improves by 1.28%, precision by 1.36%, recall by 1.32%, and MIoU by 3.36%. The method presented in this article thus demonstrates superior cloud detection performance compared with several other networks.


I. INTRODUCTION
With the continuous advancement of remote sensing technology, remote sensing images are used in more and more areas, including scientific research and social services. At present, Chinese satellites, represented by GF-1, GF-2, and ZY-03, are near the international top level in terms of various indicators. Among the various types of remote sensing images, optical remote sensing image data acquired by optical remote sensing satellites are used in fields such as target detection [1], semantic segmentation [2], scene recognition [3], and change detection [4]. The related technologies are also widely applied in map navigation, weather and ocean monitoring, disaster prevention and monitoring, and military reconnaissance. The quantity and quality of China's domestic satellite images are also growing, and they play a crucial role in areas such as agricultural engineering [5], environmental protection [6], land resources [7], urban planning, and national defense construction [8]. However, not all images can be used in these ways, largely because of cloud cover. Clouds are collections of water droplets or ice crystals suspended in the air, a tangible product of the Earth's vast water cycle, and play a crucial role in the global climate [9]. The International Satellite Cloud Climatology Project has shown that average cloud cover over the Earth exceeds 66%. The loss of optical remote sensing image information caused by cloud cover has a great impact on downstream remote sensing image processing and recognition. Therefore, cloud detection technology for remote sensing images is the key to determining the degree of missing information in an image and to its further utilization.

(The associate editor coordinating the review of this manuscript and approving it for publication was Stefania Bonafoni.)
Cloud detection methods for optical satellite remote sensing images can be roughly categorized into three groups according to their underlying principles: spectral thresholding methods, statistical-learning-based methods, and deep-learning-based methods.
Most early cloud detection methods were based on spectral information [10], [11], [12]. Spectral thresholding methods use the reflectance of several optical bands of satellite remote sensing images for threshold discrimination and can be divided into single-temporal and multi-temporal methods. Single-temporal methods obtain cloud detection results by thresholding the spectral reflectance of a single remote sensing image. Luo et al. [13] proposed a threshold decision matrix to recognize clouds in MODIS satellite images. Zhu et al. [14] proposed the F-mask method, a two-step approach that combines spectral thresholding and probabilistic thresholding for cloud detection in Landsat satellite images, and later combined cirrus bands with a variety of auxiliary data to improve detection accuracy [15], [16], [17]. In addition, a dynamic threshold detection method has been proposed for visible-infrared imaging radiometer data: based on atmospheric radiation theory and the 6S radiative transfer model, it simulates the relationship between the reflectance observed in each channel and the surface reflectance, sets dynamic thresholds from the visible to the near-infrared, and combines them with a fixed threshold for the thermal infrared channel, further improving cloud detection accuracy over the conventional dynamic threshold method. Spectral thresholding, however, depends heavily on cloud properties, and the complexity of surface scenes in satellite remote sensing images makes it difficult to accurately separate clouds from non-clouds; as a result, spectral thresholding methods struggle to distinguish clouds from bright surface features and cloud shadows from dark targets.
In past decades, statistical learning methods have been used extensively in remote sensing image processing; representative methods include Support Vector Machines (SVM) [18], Artificial Neural Networks [19], Random Forests [20], Maximum Likelihood [21], Markov Random Fields (MRF) [22], K-Nearest Neighbors [23], and Decision Trees [24]. Among them, SVM has become a popular classifier for remote sensing cloud detection tasks owing to its excellent classification performance [25], and it has been applied to cloud detection with sensor data such as MODIS, NOAA/AVHRR [26], Landsat-8 [27], ZY-3/GF-1 [28], and QuickBird [29]. However, manually designed features (texture information, morphological features, etc.) fed to SVM classifiers still struggle with cloud detection in complex environments. This has prompted research efforts to seek more expressive and accurate feature representations to improve cloud detection performance in remote sensing images.
In recent years, with the rapid growth and continuous maturation of deep learning in image detection and recognition, researchers at home and abroad have begun to apply deep learning methods to remote sensing image cloud detection, and their performance has gradually surpassed that of traditional cloud detection algorithms [30], [31], [32]. Deep neural networks have more intermediate layers than shallow artificial neural networks and thus greater capacity to extract complex nonlinear features from images. Mohajerani et al. [33] used fully convolutional neural networks (FCNs) for cloud recognition in Landsat-8 images. Liang et al. [34] achieved cloud detection in ZY-03 satellite images by combining conditional random fields with an FCN, improving detection accuracy over the plain FCN. Hu et al. [35] introduced a high-frequency feature extraction module and multiscale convolution on a U-Net backbone, together with a spatial-prior self-attention mechanism, achieving a high Mean Intersection over Union. Yao et al. [36] proposed a lightweight cloud detection network that incorporates a channel attention mechanism into DeeplabV3+ to improve training efficiency; it performs well on the Landsat-8 Biome and Sentinel-2 datasets with high robustness and detection accuracy. Liu et al. [37] proposed a pixel-level cloud detection algorithm based on convolutional networks and deep forests: SLIC and SEEDS are combined to segment remote sensing images into superpixels, and structured forests are then used to compute the edge probability of each pixel, on the basis of which the superpixels are segmented more accurately. Chen et al. [38] proposed a cloud detection network (SSCA-net) that requires no manually produced labels, saving time and effort while still achieving good detection performance.
In summary, many experts and scholars at home and abroad have carried out in-depth analyses of cloud detection and proposed many cloud detection algorithms. However, traditional methods are affected by the complexity of cloud features in optical remote sensing images and by water, snow, and other surfaces with similar texture characteristics, so many algorithms suffer from omissions and misjudgments of clouds. Deep learning algorithms, in addition, suffer from long training cycles. This paper therefore proposes a cloud detection model based on an improved DeeplabV3+: the Xception network framework is improved to reduce the number of parameters and hence the training time; an improved CBAM module is added after the Atrous Spatial Pyramid Pooling (ASPP) module to enlarge the receptive field and better capture contextual information; and a GAU module replaces the traditional bilinear-interpolation up-sampling in the decoder, improving up-sampling quality, making the feature fusion more spatially aware, and raising segmentation accuracy. These enhancements reduce the number of parameters, shortening the training time, and provide better spatial perception, improving detection accuracy.

A. ZY-03 SATELLITE DATA
The original images used in this paper are from the 01 and 02 satellite image series of the ZY-03 mission. ZY-03 is a satellite developed by China and designed for civil high-resolution stereo mapping [39]. Its main feature is that it is equipped with four optical cameras of different resolutions: a nadir-view panchromatic camera, forward- and backward-view panchromatic cameras, and a nadir-view multispectral camera. The nadir and forward/backward panchromatic images are grayscale, while the nadir multispectral images are RGB color images. The specific performance parameters of the satellite cameras are given in Table 1. The original image size was 1107 × 1122 pixels, but, limited by the processing power of the available computer hardware, the images were cropped to 513 × 513 pixels for analysis and training. To reduce the interference of climate and location inherent in high-altitude satellite observation, the dataset selected in this paper includes images of the same region at different times (Figure 1), images of different regions at different times (Figure 2), and images under different climatic conditions (Figure 3). After cropping and data augmentation, 16619 images were selected, and a certain number of negative samples were added, giving a ratio of positive to negative samples of 5:1. The dataset was then divided into training, test, and validation sets in the proportion 7:2:1.
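As a minimal sketch of the preparation step described above, the 7:2:1 split might be implemented as follows. This is a hypothetical helper (the paper does not give its code), and `samples` stands for the list of cropped 513 × 513 image paths:

```python
import random

def split_dataset(samples, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle and split a list of image paths into training, test, and
    validation sets with the 7:2:1 ratio used for the ZY-03 dataset.
    Hypothetical helper; seed fixed for reproducibility."""
    rng = random.Random(seed)
    samples = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(samples)
    n = len(samples)
    n_train = int(n * ratios[0])
    n_test = int(n * ratios[1])
    train = samples[:n_train]
    test = samples[n_train:n_train + n_test]
    val = samples[n_train + n_test:]   # remainder goes to validation
    return train, test, val
```

Shuffling before splitting matters here because the cropped tiles of one scene are highly correlated; a random split reduces the chance that near-duplicate tiles land in both training and test sets.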

B. IMPROVEMENTS TO THE DEEPLABV3+
The traditional DeeplabV3+ has the following deficiencies in cloud detection: holes, misjudgments, and omissions of fine targets; a huge number of network parameters; overly long training time and slow network convergence; and poorer segmentation of small targets. In view of these shortcomings, this paper makes the following improvements. First, the Xception backbone network is improved. Second, the improved CBAM attention module is added after the ASPP module, and the dilation rates of the ASPP module are experimentally tuned to the best-performing set of [1], [3], [9], and [12]. Finally, in the decoder, the original up-sampling fusion module is replaced with the GAU module. The improved DeeplabV3+ is shown in Figure 4.
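For concreteness, an ASPP block with the dilation rates [1], [3], [9], and [12] named above might look like the following PyTorch sketch. The channel widths and the global-pooling branch follow the standard DeepLabV3+ design and are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling with dilation rates (1, 3, 9, 12).
    Sketch only: channel widths are placeholders."""
    def __init__(self, in_ch=256, out_ch=256, rates=(1, 3, 9, 12)):
        super().__init__()
        # one 3x3 atrous-convolution branch per dilation rate
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r, bias=False)
            for r in rates
        ])
        # image-level (global average pooling) branch, as in DeepLabV3+
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        )
        # 1x1 projection after concatenating all branches
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        pooled = self.image_pool(x)
        feats.append(nn.functional.interpolate(
            pooled, size=(h, w), mode="bilinear", align_corners=False))
        return self.project(torch.cat(feats, dim=1))
```

Because each 3 × 3 branch uses padding equal to its dilation rate, every branch preserves the spatial size, so the outputs can be concatenated directly.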

C. IMPROVEMENT OF THE XCEPTION BACKBONE NETWORK
The Xception model has a large number of parameters and a complex computational process, which results in heavy computation: in both training and inference, Xception consumes relatively large computational resources and time. To reduce computation and training time, three 2 × 2 average-pooling layers are added to the Exit flow. Average pooling down-samples the feature map and converts local features into global ones; by gradually reducing the feature map to a lower resolution, the model can capture more contextual information with a larger receptive field. This is important for cloud detection because clouds can be relatively large and dispersed throughout the image, so it is not necessary to preserve all details at the original feature resolution. Adding average pooling reduces the resolution of the feature map, thereby cutting subsequent computation and increasing speed and efficiency. Moreover, cloud boundaries are usually fuzzy; average pooling slightly blurs and smooths the boundaries of the feature map, making the segmentation results more robust and continuous. The improved Xception is shown in Figure 5.
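The modification above can be sketched as follows: three 2 × 2 average-pooling layers interleaved into the tail of the Exit flow, each halving the spatial resolution. The surrounding convolution layers and channel widths are illustrative placeholders, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Minimal sketch of the modified Exit flow tail: each AvgPool2d halves
# the feature-map resolution, cutting the computation of later layers.
# Channel width 64 is a placeholder.
exit_flow_tail = nn.Sequential(
    nn.AvgPool2d(kernel_size=2, stride=2),   # 1st added pooling layer
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.AvgPool2d(kernel_size=2, stride=2),   # 2nd added pooling layer
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.AvgPool2d(kernel_size=2, stride=2),   # 3rd added pooling layer
)
```

A 32 × 32 input map comes out at 4 × 4 (halved three times), so each subsequent convolution operates on 1/64 as many spatial positions.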

D. IMPROVEMENT OF CONVOLUTIONAL BLOCK ATTENTION MODULE(CBAM)
The ASPP module can usually capture semantic information at different spatial scales, but it is weak at capturing channel information. To enhance the model, an improved CBAM module is introduced after the ASPP module. Summing the initial feature map with the result of the channel attention adjustment, as shown by the yellow line in Figure 6, preserves the low-level features of the original feature map while incorporating the high-level features adjusted by the attention mechanism. This operation helps retain the details and low-level semantic information of the original image while introducing more discriminative high-level semantic information, which improves the feature representation; the short-circuit connection also allows the gradient to propagate directly to earlier layers, alleviating the vanishing-gradient problem and promoting effective gradient propagation and optimization. The channel attention module first performs global average pooling and global maximum pooling on the input feature map F to compute global statistics; the pooled results are then fed through two fully connected layers, FC1 and FC2, and the output is normalized by a Sigmoid function. The improvement adds the channel attention result back to the original feature map [40]:

F′ = F + Sigmoid(FC2(ReLU(FC1(AvgPool(F) + MaxPool(F))))) ⊗ F    (1)

In this equation, W and H are the width and height of F, over which AvgPool and MaxPool compute the global average and maximum, respectively; FC1 and FC2 are fully connected layers; Sigmoid and ReLU are nonlinear activation functions; ⊗ denotes channel-wise multiplication; and F′ is the final output of the channel attention operation.
The spatial attention module concentrates on the spatial regions that have the greatest impact on the final result. First, the mean and maximum of F′ are computed over all channels and fused; after a convolution operation, the result is normalized by a Sigmoid function:

F″ = Sigmoid(Conv([Mean(F′, dim), Max(F′, dim)])) ⊗ F′    (2)

In the equation, dim denotes the channel dimension and F″ denotes the final output of the improved CBAM.
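The two attention steps described above, including the residual addition of the channel attention result (the yellow line in Figure 6), can be sketched in PyTorch as follows. The reduction ratio and the 7 × 7 spatial kernel are common CBAM defaults assumed here, not values stated in the text:

```python
import torch
import torch.nn as nn

class ImprovedCBAM(nn.Module):
    """Sketch of the modified CBAM: channel attention whose output is
    added back to the input feature map (residual connection), followed
    by spatial attention. Layer sizes are illustrative."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)   # FC1
        self.fc2 = nn.Linear(channels // reduction, channels)   # FC2
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):
        b, c, h, w = f.shape
        avg = f.mean(dim=(2, 3))               # global average pooling
        mx = f.amax(dim=(2, 3))                # global max pooling
        att = torch.sigmoid(self.fc2(torch.relu(self.fc1(avg + mx))))
        # F' = F + channel-attention-weighted F  (the residual add)
        f_prime = f + att.view(b, c, 1, 1) * f
        # spatial attention over the channel-wise mean and max maps
        mean_map = f_prime.mean(dim=1, keepdim=True)
        max_map = f_prime.amax(dim=1, keepdim=True)
        s = torch.sigmoid(self.spatial_conv(torch.cat([mean_map, max_map], dim=1)))
        return s * f_prime                     # F'': final output
```

The module is shape-preserving, so it can be dropped in directly after the ASPP projection without changing the decoder's interface.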

E. GLOBAL ATTENTION UPSAMPLE(GAU)MODULE
In cloud detection tasks, global context information is important for accurate cloud boundary detection and cloud region classification. Traditional up-sampling methods (e.g., bilinear interpolation) may not fully exploit global context and may lose detailed information in the feature map. The GAU module, by contrast, better captures long-range contextual dependencies by introducing a global attention mechanism. It enhances the feature-map representation in the following ways:

Boundary accuracy improvement: the GAU module helps the decoder capture broader contextual information; when detecting cloud boundaries in particular, global attention makes the model more attentive to boundary details and contextual relationships, improving boundary accuracy.
Enhanced feature comprehension: by introducing global context information, the GAU module provides more semantic information, enhancing the decoder's understanding of features in different cloud regions. This helps improve the classification accuracy and discrimination of clouds.
Up-sampling effect improvement: the GAU module applies an attention mechanism during up-sampling, which better retains the detail in the feature map and avoids information loss. This helps recover high-resolution feature maps and improves the detail rendering and boundary clarity of cloud detection.
The GAU module first adjusts the weight of each channel of the low-level features: the low-level features are passed through a 3 × 3 convolution to enhance intra-class feature consistency. The high-level features are globally pooled, and the global information is compressed and extracted through a ReLU activation and a 1 × 1 convolutional layer to obtain a channel weight vector for the low-level features. The low-level features weighted by this vector are finally fused with the up-sampled high-level features to produce the output. The structure is shown in Figure 7.
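A minimal PyTorch sketch of this flow, following the Global Attention Upsample design from the Pyramid Attention Network literature, is given below. The sigmoid gate on the pooled high-level features and the 1 × 1 projection of the high-level branch are assumptions filling gaps in the text, and channel widths are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAU(nn.Module):
    """Global Attention Upsample sketch: pooled high-level features
    produce channel weights for the 3x3-convolved low-level features,
    which are then fused with the up-sampled high-level features."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.low_conv = nn.Conv2d(low_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.high_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),            # global pooling of high-level features
            nn.Conv2d(high_ch, out_ch, kernel_size=1),
            nn.ReLU(inplace=True),              # ReLU + 1x1 conv extract global info
        )
        self.high_conv = nn.Conv2d(high_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, low, high):
        low = self.low_conv(low)                        # 3x3 conv on low-level features
        weight = torch.sigmoid(self.high_gate(high))    # channel weight vector (assumed gate)
        high_up = F.interpolate(self.high_conv(high), size=low.shape[2:],
                                mode="bilinear", align_corners=False)
        return low * weight + high_up                   # weighted fusion
```

The weighted low-level branch contributes spatial detail while the up-sampled high-level branch contributes semantics, which matches the role the text assigns to GAU in the decoder.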

F. CROSS-ENTROPY LOSS FUNCTION
This cloud detection experiment involves only two categories, background and cloud, so the binary cross-entropy loss is chosen as the model's loss function. Cross-entropy loss is a common and widely used loss function that performs well in classification tasks: it is simple to compute, requires no additional parameter tuning, and is easy to implement and use, with good results in the binary case. By contrasting predictions with labels, the cross-entropy loss effectively trains the model to learn feature representations that distinguish background from cloud. It also has good properties for gradient-descent optimization: it provides a relatively stable gradient direction during optimization, avoiding vanishing or exploding gradients and allowing the model to converge faster to a better solution. In cloud detection tasks, the background and cloud samples are usually imbalanced, i.e., background pixels greatly outnumber cloud pixels; the cross-entropy loss can handle this imbalance by adjusting category weights or adopting class-balancing techniques to ensure effective training and classification of the cloud category. To some extent the cross-entropy loss implicitly targets the goal of cloud-region segmentation: by minimizing the gap between predictions and true labels, the model better learns the distinguishing features and boundaries of cloud and background.
The cross-entropy loss function, denoted CE-loss, is formulated as follows:

CE-loss = -(1/N) Σ_i Σ_c y_ic log(p_ic)

where N denotes the number of samples in a mini-batch, C denotes the number of categories (summed over by c), y_ic is 1 if sample i belongs to category c and 0 otherwise, and p_ic is the predicted probability that sample i belongs to category c.
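For the two-class case used here (cloud vs. background), the formula reduces to the familiar binary cross-entropy. A small stdlib-only sketch, with an epsilon guard added as a standard numerical-stability assumption:

```python
import math

def binary_cross_entropy(probs, labels):
    """Binary cross-entropy averaged over N samples:
    CE = -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
    probs  : predicted cloud probabilities in (0, 1)
    labels : ground-truth labels, 1 = cloud, 0 = background
    """
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total += y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return -total / len(probs)
```

A completely uncertain prediction (p = 0.5) costs ln 2 ≈ 0.693 per sample, while confident correct predictions drive the loss toward zero, which is the gradient signal the training curve in Figure 9 reflects.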

A. DATA AUGMENTATION
Since deep learning relies on a substantial quantity of labeled samples as training data, both the number of labels and their accuracy in outlining the target features have a significant influence on the segmentation accuracy of the trained model, so ample labels are required for the neural network to learn the features of the targets well. Because producing deep learning labels is time-consuming and labor-intensive, this paper applies data augmentation to the original labels, under the premise of preserving label type and quality, to reduce the cost of manual labeling and boost the generalizability of the model; the effect of augmentation is shown in Figure 8. Among the augmentation methods used are brightness enhancement, which adjusts the overall brightness of an image by increasing its exposure to make it brighter, and rotation, which simulates the rotation or tilt of the actual shooting scene, increasing the robustness of the model.
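The two augmentations named above can be illustrated with a stdlib-only sketch operating on a nested list of gray values (a stand-in for an image array; a real pipeline would use an image library instead):

```python
def adjust_brightness(image, factor):
    """Brightness enhancement: scale every pixel value by `factor`
    and clamp the result to the valid range [0, 255]."""
    return [[min(255, max(0, round(px * factor))) for px in row] for row in image]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise, simulating a different
    acquisition orientation of the same scene."""
    return [list(row) for row in zip(*image[::-1])]
```

Because cloud masks are binary, the same geometric transform (rotation) must be applied to the label image in lockstep with the input, while photometric transforms (brightness) are applied to the input only.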

B. EXPERIMENTAL SETTINGS 1) EXPERIMENTAL ENVIRONMENT
The experimental environment setup for this experiment is shown in Table 2.

2) EXPERIMENTAL PARAMETERS a: BATCH-SIZE
The batch size is the number of samples fed into the model at each step of training; it is mainly constrained by the memory of the experimental equipment and the total number of iterations. The larger the batch size, the more memory a single training step occupies; as iterations accumulate, the data produced by the computation is held in memory, which can easily cause memory overflow and training failure. When the number of input samples per step is too large, optimization converges slowly and inefficiently; when it is too small, computing resources such as the GPU are wasted. Choosing an appropriate batch size therefore effectively improves training efficiency. The learning rate is an important hyperparameter in deep learning that controls the step size of each parameter update. To find the most appropriate learning rate, a learning-rate scheduling strategy or learning-rate decay can be used; common strategies include a fixed learning rate, decaying learning rates (e.g., exponential decay and cosine annealing), and adaptive learning rates (e.g., Adam).
When training a neural network, the gradient is computed by backpropagation and the model parameters are updated by gradient descent or its variants; the learning rate determines the magnitude of each update. When the learning rate is too high, the loss value easily explodes and training cannot converge; when it is too low, convergence is slow and the model is prone to overfitting. The hyperparameters of this experiment are set as in Table 3.
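Of the scheduling strategies mentioned above, cosine annealing has a simple closed form; a stdlib sketch of the schedule (the specific lr_max/lr_min values below are illustrative, not the paper's settings):

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-annealing schedule: the learning rate decays smoothly
    from lr_max at step 0 to lr_min at total_steps, following
    lr(t) = lr_min + 0.5*(lr_max - lr_min)*(1 + cos(pi * t / T))."""
    cos = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos)
```

The schedule starts with large steps for fast early progress and ends with small steps that help the loss settle, matching the convergence behavior discussed in the loss analysis.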

3) EVALUATION INDEX
To quantitatively assess the model's cloud detection capability, this paper evaluates the cloud recognition results using four metrics commonly used in semantic segmentation: Accuracy, Precision, Recall, and mean Intersection over Union (mIoU).
(1) Accuracy: indicates the probability that all samples are correctly categorized, calculated as shown in (8):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (8)

(2) Precision: indicates the proportion of samples predicted to be positive that are truly positive, calculated as shown in (9):

Precision = TP / (TP + FP)    (9)

(3) Recall: indicates the proportion of positive samples that are correctly predicted, calculated as shown in (10):

Recall = TP / (TP + FN)    (10)

TP: the prediction is positive and the actual value is also positive; the model correctly predicted a positive sample as positive.
FN: the prediction is negative but the actual value is positive; the model incorrectly predicted a positive sample as negative.
FP: the prediction is positive but the actual value is negative; the model incorrectly predicted a negative sample as positive.
TN: the prediction is negative and the actual value is also negative; the model correctly predicted a negative sample as negative.
These indicators are commonly used to estimate the performance and accuracy of binary classification models. From them, evaluation metrics such as Accuracy, Precision, Recall, and mean Intersection over Union (mIoU) can be calculated, helping us better understand the performance of the model and form a preliminary judgment of the training results.
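The four metrics can be computed directly from the confusion-matrix counts defined above. A stdlib sketch for the two-class (cloud vs. background) case; the mIoU here averages the per-class IoUs, which is the standard definition assumed for this paper's mIoU:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and mean IoU from the binary
    confusion matrix (cloud = positive, background = negative)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou_cloud = tp / (tp + fp + fn)      # IoU of the cloud class
    iou_bg = tn / (tn + fp + fn)         # IoU of the background class
    miou = (iou_cloud + iou_bg) / 2      # mean over the two classes
    return accuracy, precision, recall, miou
```

Note that mIoU penalizes both false positives and false negatives in each class, which is why the tables below can show a model with balanced precision and recall yet the lowest mIoU.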

C. EXPERIMENTAL ANALYSIS 1) LOSS ANALYSIS
The loss generated during the training process is analyzed first. As shown in Figure 9, the loss decreases rapidly during the first 20 iterations and fluctuates around the 50th round. Fluctuations of the loss within a certain range are normal, especially in the early stages when the model has just started to learn: the parameters have not yet been fully adjusted, so the loss may change considerably. As the model gradually converges, the loss usually stabilizes. Around the 100th iteration, the loss shows a new downward trend, which means the model is further optimizing itself and searching for a better solution. As the number of iterations increases, this downward trend gradually diminishes; the loss smooths out from around the 100th round and converges completely by the 200th.

2) DIFFERENT TYPES OF CLOUD ANALYSIS
To further validate the cloud detection ability of the improved DeeplabV3+, comparative experiments with U-Net, the traditional DeeplabV3+, and a model using the MobileNet backbone network are conducted in this paper, analyzing different types of clouds as shown in Figure 10. For thick clouds, only the model with the MobileNet backbone shows omissions; the remaining three models achieve almost perfect results on patchy thick clouds. For point-like clouds, the four methods perform essentially the same and can basically recognize them successfully. For thin clouds, however, as shown by the red circle in the third row, the conventional DeeplabV3+ and the MobileNet model show missed detections. This is mainly because, when the cloud layer is thin, some of the radiated information penetrates the cloud and the optical satellite sensors capture a small amount of surface feature information, which confuses thin clouds with the surface and increases the difficulty of detection. U-Net, thanks to its characteristic ''U''-shaped network structure, recognizes thin clouds better; the improved DeeplabV3+ of this paper, although it still misses some thin clouds, shows an obvious reduction in missed detections compared with the other models.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

3) WATER BACKGROUND CLOUD ANALYSIS
The second set of images is remote sensing imagery with a water body as background. Water scenes are common in remote sensing imagery and are characterized by wide coverage and diverse types of water bodies. Most water bodies are distinctly different from clouds, but the presence of impurities such as sediment and floating matter, as well as variation in chlorophyll content, makes the texture and spectral features of water bodies unstable. The characteristics of a water body are therefore sometimes similar to those of clouds, which can interfere with cloud detection, as illustrated in Figure 11. For water bodies, U-Net and this paper's method miss less than the other two methods, and this paper's method achieves a better recognition rate than U-Net. With the sea as background, this paper's method detects clouds notably better, while the other three methods show more missed detections.

4) MOUNTAIN BACKGROUND CLOUD ANALYSIS
The third group of images is set against variable mountainous terrain. The spectral information of mountainous regions is quite complex because of the extreme unevenness of the terrain and the non-uniform distribution of vegetation. In areas of high elevation and weak vegetation cover, reflectivity is relatively high, which can lead to misclassification. As shown in Figure 12, the DeeplabV3+ MobileNet method shows heavy missed detections on some bare rocks and in highlighted areas, while the improved method of this paper performs better overall than the other three, although a little leakage remains.

5) SNOW AND ICE BACKGROUND CLOUD ANALYSIS
Snow and ice have high reflectivity, especially in the visible and near-infrared regions, and resemble clouds (especially cumulus), making them difficult to distinguish. Surface snow and ice cover varies seasonally and geographically, adding to the complexity of separating snow and ice from clouds. In spatial texture and structure, clouds and snow or ice are also similar in remote sensing images, which further complicates cloud identification. As shown in Figure 13, the DeeplabV3+ MobileNet method suffers the most serious misjudgments and omissions on snow and ice surfaces; the traditional DeeplabV3+ and U-Net methods also show a large number of misjudgments and omissions, only slightly better than the MobileNet model. This paper's method makes almost no misjudgments on snow and ice, with only a small amount of omission: because cloud, snow, and ice are extremely similar in spatial texture and structure, their influence can only be minimized, and the results show that this paper's method does so comparatively effectively.

6) GENERALIZATION CAPABILITY VERIFICATION ANALYSIS
To verify the generalization ability of the improved model, we selected the public dataset SEN12MS-CR-TS, which covers different time phases, and used its Sentinel-2 satellite data for verification; the results are shown in Figure 14. From the visual results, the cloud detection of the improved model is superior to that of the traditional DeeplabV3+ method, and the recognition accuracy is also very close to the original image labels. The model thus shows good performance on the public dataset as well.

7) DIFFERENT AIR CONDITIONS ANALYSIS
For the purpose of verifying that the method in this paper still performs well under different air conditions, three images with large differences in climatic conditions were selected from April, July, and December to verify the stability of the model. As shown in Figure 15, all models still perform poorly on the first image over snow, ice, and other highlighted areas; the second and third images clearly show that this paper's method produces better results than the other traditional models, with rounder cloud edges and better detail. Therefore, the improved model maintains excellent performance under different air conditions and climates.

8) DIFFERENT ANGLE ANALYSIS
The appearance of remote sensing images is also affected by the angle of sunlight. The solar altitude angle and incidence angle vary for images taken at different times of the year, which affects image brightness and shadows. To show that the improved model also performs well for regions imaged at different angles, three images of the same region taken on similar dates but at different shooting angles are selected. As shown in Figure 16, the first image shows that this paper's model is better at recognizing edge details; the second image contains fewer clouds, so all models produce acceptable results; and the third image shows that this paper's model is more accurate at recognizing thin and broken clouds. It can be concluded that the improved model also performs relatively well under different shooting angles and different illumination.

9) COMPARATIVE EXPERIMENTAL ACCURACY ANALYSIS
The four network models have been evaluated above through visual results; we now analyze the improved network model and the other models through accuracy evaluation indexes. The accuracy evaluation results of each method are given in Table 4. From the table, it can be seen that this paper's method outperforms the other methods in accuracy, precision, recall, and Mean Intersection over Union (MIoU). The encoder and decoder of the traditional DeeplabV3+ can learn both high-level and low-level feature information, and the encoder's spatial pyramid module can learn multi-scale feature information, so this method is more balanced across accuracy, precision, and recall than the other methods, but its MIoU is the lowest of the four models. Although Mobilenet serves as the backbone network with a pyramid pooling module and can learn multi-scale and global information, it does not learn low-level features sufficiently, so its accuracy and precision are higher than those of some other methods, but its recall is the lowest. The Unet network has a ''U''-shaped structure divided into an encoder and a decoder, which takes both high-level and low-level semantic features into account and learns features more completely; however, Unet has no multi-scale module and cannot account for feature information at different scales, so compared with the other models its accuracy and recall are higher but its precision is lower. The method in this paper extracts the features of each channel with a lightweight backbone network, then uses the multi-scale feature pyramid module in tandem with the improved CBAM module to extract global contextual information at different scales, and finally fuses the features with global attention upsampling, so it achieves high accuracy and precision.

10) ABLATION EXPERIMENT ANALYSIS
For the purpose of verifying the necessity of the lightweight treatment of the Xception feature extraction network, the improvement of the CBAM module, and the replacement of the original up-sampling fusion with global attention up-sampling fusion, seven sets of scenarios are designed for ablation experiments.
(1) Method 1: Based on DeeplabV3+, lighten the Xception backbone network while leaving the rest unchanged.
(2) Method 2: Based on DeeplabV3+, add the improved CBAM module while keeping the other parts of the network unchanged.
(3) Method 3: Based on DeeplabV3+, only replace the fusion method with the global attention up-sampling module while leaving the rest unchanged.
(4) Method 4: Add the improved CBAM module to Method 1.
(5) Method 5: Add the global attention up-sampling module to Method 1.
(6) Method 6: Add the global attention up-sampling module to Method 2.
(7) Method 7: Combine all three improvements; this is the method proposed in this paper.
To analyze the model effect, the ablation experiments use the four standard evaluation indicators commonly used for semantic segmentation models, namely accuracy, precision, recall, and MIoU, as described above, plus training time as an additional temporal indicator.
As can be seen from Table 5, compared with the traditional DeeplabV3+ model, Method 1 slightly improves accuracy while the remaining three indicators decline very slightly (within 0.5%), yet it saves nearly 50% of the training time; trading a small amount of performance for a large reduction in training time is worthwhile. Compared with the traditional deeplabV3+, Method 2 increases accuracy by 2.31%, precision by 1.18%, recall by 0.69%, and MIoU by 1.52%; the training time increases only slightly, because the improved CBAM module is very small and barely adds to the overall computation. Compared with the traditional deeplabV3+, Method 3 improves accuracy by 2.01%, precision by 0.57%, recall by 0.98%, and MIoU by 2.48%, and increases training time by 8%. Method 4 adds the improved CBAM module to Method 1; the table shows that all four metrics improve slightly and the increase in training time is minimal. Method 5 replaces the bilinear interpolation module with the GAU module on the basis of Method 1; several indexes decrease slightly compared with Method 4, and the training time also increases. Method 6 replaces the bilinear interpolation module with the GAU module on the basis of Method 2; this combination results in a decrease in metrics and an increase in training time. Method 7, obtained through these ablation experiments, is the improved method of this paper: compared with the traditional deeplabV3+, accuracy improves by 3.91%, precision by 3.12%, recall by 2.78%, and MIoU by 5.75%, while training time is reduced by about 45%. The ablation experiments and the analyses in the previous subsections show that the improved network model is significantly better in both model accuracy and training speed.

IV. DISCUSSION
At present, remote sensing technology is widely used in many fields. From the initial visible light to today's multispectral and hyperspectral sensors, spectral technology has developed continuously. However, clouds in the atmosphere degrade the quality of a vast number of remote sensing images, which not only wastes resources but also greatly affects the subsequent analysis and research of these images.
A commonly used model for cloud detection in conventional optical remote sensing images is DeepLabV3+, which is based on semantic segmentation and is mainly used for object category detection at the pixel level [41].
However, this model has some problems with cloud detection in remote sensing images. Clouds in remote sensing images usually have fuzzy boundaries, which makes it difficult for the traditional DeepLabV3+ to accurately capture cloud boundary information, affecting detection accuracy. Firstly, the traditional DeepLabV3+ requires a large amount of labeled data to adapt to different types of remote sensing images, which may be difficult to obtain in practical applications. Secondly, clouds in remote sensing images usually have irregular shapes and fuzzy boundaries, which further increases the difficulty of accurately segmenting them. Traditional architectures such as UNet may also perform poorly with this complexity, as the UNet structure may be limited in tasks that require a larger range of contextual information for decision making [42].
To solve the above issues, we propose an improved DeepLabV3+ neural network that aims to detect cloud-covered regions in remote sensing images more accurately, quickly, and efficiently. The network introduces multi-scale feature fusion, and by improving the backbone network it can better perceive cloud information, accurately capture cloud boundaries, and reduce training time. To improve cloud detection accuracy, an improved attention module is introduced to enlarge the receptive field and adapt to irregularly shaped clouds. In addition, a global attention upsampling module is introduced to improve the generalization performance of the model in different situations. With these improvements, our neural network detects cloud-covered areas more accurately, quickly, and efficiently in the cloud detection task, thus improving remote sensing image quality and the accuracy of subsequent analysis and research.
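The global attention upsampling fusion mentioned above can be illustrated with a simplified NumPy sketch. This is a toy stand-in, not the paper's actual module: the original GAU also applies 1x1/3x3 convolutions and batch normalization to both branches, which are omitted here for brevity.

```python
import numpy as np

def upsample_nearest(x, s=2):
    """Nearest-neighbour upsampling of a (C, H, W) map by factor s."""
    return x.repeat(s, axis=1).repeat(s, axis=2)

def gau(low, high):
    """Sketch of global attention upsample fusion.

    The high-level map is global-average-pooled into a channel vector
    that gates the low-level map; the upsampled high-level map is then
    added to the gated result (convolutions and BN omitted).
    """
    gate = high.mean(axis=(1, 2))            # (C,) global context vector
    low_gated = low * gate[:, None, None]    # channel-wise weighting
    return low_gated + upsample_nearest(high)

low = np.random.rand(48, 32, 32)    # low-level features (finer resolution)
high = np.random.rand(48, 16, 16)   # high-level features (coarser)
fused = gau(low, high)
print(fused.shape)  # (48, 32, 32)
```

In the real module the channel gate comes from a convolution over the pooled high-level features; a plain global average stands in for it here.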

V. CONCLUSION
This study takes the ZY-03 image as the research sample and uses Sentinel-2 data to verify the generalization ability of the model. Focusing on how to improve cloud detection performance, the main conclusions are as follows: (1) The Xception backbone network is first lightened by introducing a three-layer average pooling layer for dimensionality reduction, which lowers the spatial dimensions of the feature map and thus reduces the computation of subsequent layers. This reduces model complexity and computational resource requirements while maintaining reasonable performance. By shrinking the feature map, the pooling operations cover a larger receptive field and can better capture global semantic information in the image, which helps improve the model's understanding of the overall structure and scene. Average pooling also enhances the representation of important features and reduces the impact of redundant and noisy information, which improves the accuracy and robustness of the model in cloud detection tasks.
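The dimensionality-reduction effect of average pooling described in conclusion (1) can be sketched as follows (illustrative NumPy code, not the paper's actual Xception implementation; the channel and map sizes are made up):

```python
import numpy as np

def avg_pool2d(x, k=2):
    """Non-overlapping k x k average pooling over a (C, H, W) feature map."""
    c, h, w = x.shape
    h2, w2 = h // k, w // k
    # Group pixels into k x k blocks and average within each block.
    return x[:, :h2 * k, :w2 * k].reshape(c, h2, k, w2, k).mean(axis=(2, 4))

feat = np.random.rand(256, 64, 64)   # hypothetical map: 256 channels, 64 x 64
pooled = avg_pool2d(feat)            # -> (256, 32, 32)

# A convolution applied after pooling touches 4x fewer spatial positions,
# so its multiply-accumulate count drops by roughly 75%.
print(feat.shape, pooled.shape)
```

Because the pooling windows tile the map exactly here, the overall mean of the features is preserved while the spatial resolution is halved in each dimension.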
(2) An improved CBAM module is connected in series after the ASPP module. By adding the original feature map to the channel-attention-adjusted result, the low-level features in the original feature map are preserved and combined with the attention-adjusted high-level features. This operation retains details and low-level semantic information from the original image while introducing more discriminative high-level semantic information, improving feature representation. Adding the improved CBAM module after the ASPP module in DeepLabv3+ enhances the extraction of spatial features, improves the perception of local details, and improves the robustness of the model.
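The residual addition described in conclusion (2) can be sketched as follows. This is a deliberately simplified stand-in for CBAM: the real module uses a two-layer bottleneck MLP for channel attention and a 7x7 convolution for spatial attention, which are replaced here by a single weight matrix and plain mean/max maps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def improved_cbam(x, w_mlp):
    """Simplified CBAM pass on a (C, H, W) map with a residual connection."""
    # --- channel attention: shared weights over avg- and max-pooled descriptors
    avg_desc = x.mean(axis=(1, 2))                        # (C,)
    max_desc = x.max(axis=(1, 2))                         # (C,)
    ca = sigmoid(w_mlp @ avg_desc + w_mlp @ max_desc)     # (C,)
    x_ca = x * ca[:, None, None]
    # --- spatial attention: stand-in for the 7x7 conv over mean/max maps
    sa = sigmoid(x_ca.mean(axis=0) + x_ca.max(axis=0))    # (H, W)
    x_sa = x_ca * sa[None, :, :]
    # --- residual addition keeps the original low-level features
    return x + x_sa

x = np.random.rand(8, 16, 16)        # hypothetical feature map
w = np.random.rand(8, 8) * 0.1       # hypothetical shared MLP weights
y = improved_cbam(x, w)
print(y.shape)  # (8, 16, 16)
```

The final `x + x_sa` is the improvement discussed above: the attention-weighted features are fused with, rather than replacing, the original map.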
The improved DeepLabv3+ neural network designed in this paper still produces a small number of omissions and misjudgments in images with snow and ice backgrounds; further research could reduce these errors and enhance the accuracy of cloud detection over snow and ice areas, for example by introducing more snow and ice features such as texture and color information. In addition, recognizing cloud shadow regions is also important; different network architectures, the introduction of shadow information, or other advanced image segmentation methods could be explored to improve the effectiveness of cloud shadow detection.

FIGURE 2. The different times in different regions.

where y(n)_c denotes the real label of sample n for category c after one-hot encoding, and p(n)_c denotes the predicted probability of sample n for category c.
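The loss described by these definitions is the standard categorical cross-entropy; a minimal pure-Python version (illustrative only; the sample values below are made up):

```python
import math

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean categorical cross-entropy.

    y_true: per-sample one-hot label lists; p_pred: per-sample predicted
    class probabilities. Computes -(1/N) * sum_n sum_c y(n)_c * log p(n)_c.
    """
    n = len(y_true)
    total = 0.0
    for labels, probs in zip(y_true, p_pred):
        total -= sum(y * math.log(max(p, eps)) for y, p in zip(labels, probs))
    return total / n

# Two samples, two classes (cloud / non-cloud).
y = [[1, 0], [0, 1]]
p = [[0.9, 0.1], [0.2, 0.8]]
loss = cross_entropy(y, p)   # -(ln 0.9 + ln 0.8) / 2
print(loss)
```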
b: NUMBER OF TRAINING ROUNDS (EPOCHS)
In the process of training the model, one iteration means that the model trains on one batch of data (one batch size), while the number of training rounds indicates how many times each sample is processed by the network; one epoch means that all samples have passed through the network once. The relationship between the number of iterations, the number of training rounds, and the sample set size (data size) is calculated as shown in equation (7): iterations = (data size / batch size) x epochs.
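This relationship can be checked with a short snippet (the dataset size, batch size, and epoch count below are hypothetical, not the paper's actual experiment settings):

```python
import math

def iterations_per_epoch(data_size, batch_size):
    """One iteration consumes one batch; one epoch consumes every sample once."""
    return math.ceil(data_size / batch_size)

def total_iterations(data_size, batch_size, epochs):
    return iterations_per_epoch(data_size, batch_size) * epochs

# e.g. 10,000 training tiles, batch size 16, 100 epochs
print(iterations_per_epoch(10000, 16))   # 625
print(total_iterations(10000, 16, 100))  # 62500
```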

Mean Intersection over Union: the average, over all classes, of the intersection of the predicted and true regions divided by their union. It is calculated as shown in (11): MIoU = (1/N) * sum over classes c of TP_c / (TP_c + FP_c + FN_c) (11)
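The MIoU computation follows directly from this definition; a short sketch (the per-class counts below are illustrative, not the paper's measured values):

```python
def mean_iou(per_class_counts):
    """MIoU from per-class (TP, FP, FN) counts:
    average over classes of TP / (TP + FP + FN)."""
    ious = [tp / (tp + fp + fn) for tp, fp, fn in per_class_counts]
    return sum(ious) / len(ious)

# Two classes: cloud and background.
counts = [(80, 10, 10),   # cloud:      IoU = 80/100 = 0.80
          (90, 5, 5)]     # background: IoU = 90/100 = 0.90
print(mean_iou(counts))   # 0.85
```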

FIGURE 10. Different types of cloud analysis diagram.

FIGURE 13. Snow and ice background cloud analysis diagram.

FIGURE 15. Different air conditions analysis diagram.
(3) The entire decoder fusion part of the DeepLabv3+ network is replaced with the Global Attention Up-sampling (GAU) module, which introduces a global context-aware mechanism into the decoder. It captures the global semantic context of the input features by attentively weighting the global features, which helps the model understand the entire image scene and enables the cloud detection model to better perceive and utilize the global information in the image.
For further analysis and utilization of remote sensing images, cloud detection is essential. In this study, ZY-03 satellite data is used for the experiments and Sentinel-2 data is used to verify generalization ability. Comparison experiments are carried out with four different network models on images with different backgrounds (locations), different angles, and different air conditions. Through quantitative and qualitative comparison, the improved network in this paper shows better cloud detection performance than the other traditional networks.

TABLE 2. Experimental environment and tool description.

TABLE 4. Comparison of experimental results.