CNN Cloud Detection Algorithm Based on Channel and Spatial Attention and Probabilistic Upsampling for Remote Sensing Image

In the field of remote sensing imagery, transmitting image information efficiently with limited bandwidth has always been a research hotspot. Compared with other ground objects, cloud pixels in remote sensing images are invalid information, so removing clouds before transmission to avoid wasting bandwidth on useless information is meaningful research. Owing to the existence of thin clouds and the complexity of the underlying surface, most cloud detection algorithms struggle to separate clouds from ground objects effectively. A deep learning (DL) cloud detection algorithm based on attention mechanisms and probabilistic upsampling is proposed in this article. To enhance the information of key areas, a channel attention module highlights crucial information and weakens useless information in the channel dimension of the encoder, while a spatial attention module strengthens the information fusion between points in the spatial dimension. To reduce the information loss caused by downsampling, a probabilistic upsampling block (PUB) is proposed to restore the image. Finally, experiments are performed on Gaofen-1 WFV data, and the results indicate that the proposed algorithm achieves better detection results than other cloud detection algorithms in different scenarios.


I. INTRODUCTION
WITH the development of remote sensing technology, satellite images are increasingly used in various research fields [1]-[4]. An increasing amount of remote sensing data is being used in agriculture, environmental protection, urban development, military applications, the monitoring of land changes, hydrology, and so on [5]-[8]. However, nearly 70% of the earth's surface is covered by clouds [9], including thick clouds and thin clouds. The underlying surface beneath thick clouds cannot be known from the images; undoubtedly, the areas covered by clouds in remote sensing images are invalid information [10]. Although thin clouds do not completely cover the ground features, the ground features under thin clouds still cannot be fully known when they are mixed with the underlying surface; therefore, thin clouds are invalid information just as thick clouds are. Because the transmission bandwidth of remote sensing satellites is limited, once these cloud regions are detected and identified as invalid information, region-of-interest compression can greatly improve the transmission efficiency between satellite and ground. Therefore, accurate and efficient detection of cloud regions is a hot spot in remote sensing image preprocessing. Traditionally, cloud detection methods combine multi-band thresholds, texture analysis, pattern recognition, and so on. Li et al. [11] presented the MFC algorithm, which combines band thresholds with texture analysis: first, threshold segmentation based on spectral features is performed; then, a preliminary cloud mask is generated through guided filtering for mask refinement; finally, geometric and texture features are combined to improve the cloud detection results. Li et al. [12] adjusted the segmentation threshold by analyzing the physical properties of clouds, making it more suitable for mitigating the effects of clouds.
The textural feature difference between clouds and the underlying surface is strengthened before texture identification, and morphological operations are eventually used to further refine the coarse cloud regions and extract the thin clouds. Chen et al. [13] addressed the problem of complicated cloud and land types: first, the image to be detected is enhanced, and then the texture features of the image are analyzed in multi-scale space to distinguish cloud from ground. Liu et al. [14] proposed a thin cloud removal method based on a cloud physical model, which uses a correction method and an adaptive brightness factor to decrease the effect of transmission and obtain the final image. The results show that the method removes thin clouds more effectively, improves the contrast of the image, and retains more details. These traditional methods usually require a lot of time for adjusting parameters and tuning thresholds, and such multistage processes usually have poor detection accuracy.
With the development of new technologies, artificial neural networks (ANNs) have achieved impressive progress. An ANN is basically a mathematical model that simulates how the human brain processes information. It is composed of a large number of processing units and can independently process multiple sets of information. The advantage of an ANN is that it can deal with many nonlinear problems and find an effective, optimal solution of the model under some constraints. In the 1990s, Key and Barry [15] first applied such a network to cloud detection in remote sensing images, and more and more researchers then began to use it in the field of cloud detection. Shi [16], based on the five-channel data of NOAA-AVHRR, used a simple neural network model to classify images into cumulonimbus, cumulus, cirrus, medium cloud, low cloud, land, water, and unknown pixels. In 2006, Hinton put forward the concept of deep learning (DL). In a DL network, single-layer neurons are first constructed layer by layer so that a single layer is trained at a time; after all the layers are trained, the optimization process begins. After multi-layer nonlinear feature extraction, high-level features with strong expression ability can be derived, and deep learning of data features can be realized without human participation. With the improvement of DL systems and the contributions of many researchers, the deep convolutional neural network has achieved substantial success in the field of computer vision in recent years [17]-[21]. End-to-end networks do not require any human intervention and, through their powerful feature expression ability, have become the prime research method in many fields of image processing [24]. Zhang et al. [25] integrated wavelet features into a DL network and achieved speckle removal for SAR images.
The Yolo series [26] uses an end-to-end DL algorithm to accomplish the task of high-speed target detection.
In the exploration of DL methods for cloud detection, new methods have been proposed that significantly improve detection performance. Comparison with traditional cloud detection methods [28] shows that the features learned by convolutional neural networks are better than traditional handcrafted features. Owing to the complexity and diversity of the underlying surface, thin clouds are usually more difficult to identify than thick clouds; the multi-scale feature convolutional neural network proposed by Shao et al. [29] can detect thin clouds and thick clouds at the same time, with impressive detection results. Li et al. [10] proposed a multi-scale convolutional feature fusion method based on DL in which dense connection groups are added to the symmetric encoder-decoder module to capture local and global information; the algorithm exhibits good detection performance in bright regions. Segal-Rozenhaimer et al. [30] proposed a convolutional neural network that can be adaptively applied to a variety of datasets, and its robustness is verified by experiments. Aiming at complicated underlying surfaces and the variety of cloud types, Liu et al. [14] proposed an innovative model named the fuzzy autoencode model (FAEM), which combines a coding network with fuzzy functions to achieve high-precision cloud detection of remote sensing images in complex environments.
In this article, we propose a cloud detection algorithm based on a convolutional neural network. Starting from the particularity of the cloud detection task, we mainly focus on the relationships between the spectral bands of the multispectral image and between the points of the spatial dimension. At the same time, we consider the texture complexity of cloud images and the large number of thin clouds.
II. BACKGROUND OF U-NET

U-Net [32] is an image segmentation algorithm based on the FCN architecture [33], which has been widely used in various image segmentation fields, such as medical image segmentation, industrial inspection, and satellite image segmentation. The U-Net network consists of two parts: an encoder and a decoder. The encoder, similar to VGG-Net [34], performs feature extraction: through convolution and downsampling, the input image size is compressed gradually while the number of channels increases. In the decoder, the image is recovered by alternating convolution layers and upsampling layers, and after each upsampling the result is fused with the feature extraction layer of corresponding size from the encoder. At the final layer, a 1 × 1 convolution is used to map each feature vector to the desired number of classes.
As an efficient and lightweight convolutional neural network, U-Net has attracted the attention of many scholars [35]-[37]. U-Net was first applied to medical image segmentation and was later extended into the field of remote sensing image segmentation. For example, Jeppesen et al. [38] made improvements based on U-Net and proposed the RS-Net remote sensing cloud detection algorithm. Guo et al. [39] added a channel attention module to U-Net and realized an efficient cloud/non-cloud segmentation algorithm.
In this article, our proposed algorithm is based on the encoder-decoder DL image segmentation network U-Net. We add a channel attention module between the encoder and decoder of each layer, and a spatial attention module in the last layer of the network. To mitigate the information loss caused by downsampling, a probabilistic upsampling block (PUB) is added. To verify the effectiveness of the algorithm, experiments are carried out on Gaofen-1 data.

III. METHOD
In this article, our proposed algorithm consists of three parts: a channel attention module, a spatial attention module, and a probabilistic upsampling module. To adaptively adjust the feature response value of each channel, the channel attention module fuses the information of the encoder and decoder and models the dependency relationships between the channels; the obtained information is added to the decoder for image restoration. By modeling the relationships between different positions, the spatial attention module adjusts the feature response values in the spatial dimension. The probabilistic upsampling module fuses the downsampling information from the encoder into the decoder, which mitigates the image edge problems caused by the coarseness of downsampling, as demonstrated in Fig. 1.

A. Background of Attention Mechanism
The visual attention mechanism is a special brain signal processing mechanism of human vision. By quickly scanning the global image, human vision identifies the target area that needs to be focused on, that is, the focus of attention, and then invests more attention resources in this area to obtain more detailed information about the target of interest while suppressing other useless information. This is a means of rapidly screening high-value information from a large volume of information using limited attention resources, and it is a survival mechanism formed in the long-term evolution of human beings. The human visual attention mechanism significantly improves the efficiency and accuracy of visual information processing. Inspired by this process, the visual attention mechanism has been introduced into DL and is widely used in natural language processing, speech recognition, and image processing. In 2014, the Google team applied an attention-based RNN model to image classification [40] and achieved impressive performance. In 2017, researchers introduced the attention mechanism into CNN networks [41], and attention mechanisms based on CNNs have since been widely used. CNN attention modules can be categorized into two types: channel attention modules and spatial attention modules. Channel attention emphasizes the correlation among channel dimensions, focusing the network's attention on useful channel information and suppressing useless channel information. Hu et al. [42] proposed a new architectural unit termed the squeeze-and-excitation (SE) block, in which the feature response value of each channel is adjusted adaptively by modeling the relationships between the channels. Compared to SE-Net, the CBAM proposed by Woo et al. [43] not only adds spatial attention but also introduces the parallel structure of maximum pooling and average pooling in channel attention, and its effectiveness is verified by experiments. Spatial attention focuses on the region of interest in the spatial domain. In recent years, Wang et al. [44] proposed the non-local spatial attention structure, which is widely used in various tasks. This structure improves the expression ability of the network by capturing long-distance dependence and expanding the receptive field to the whole image.

B. Channel Attention Module
In this article, the proposed channel attention module includes two parts: a multi-scale sampling module and a spatial compression module, as shown in Fig. 2.
In order to increase the expression ability of the module, we perform a multi-scale parallel convolution operation on the input before it enters the spatial compression module. Szegedy et al. [45] indicate that width is a key factor in improving the performance of a model: parallel structures of filters with different dimensions can obtain features with different receptive field sizes, and fusing these features can improve network performance. In the research on multi-scale modules, the prominent focus is on the size and number of filters. The final structure is a four-branch parallel structure, using a 2-D convolution with a kernel size of 3 × 3 and 3-D dilated convolutions with dilation rates of {2, 5, 7}. Dilated convolution [46]-[48] is widely used because it can improve network performance while keeping the number of parameters unchanged: it increases the receptive field without adding parameters and enables the network to establish long-distance information associations. In the channel attention module, we use 3-D dilated convolution so that long-distance information can be acquired in the channel dimension.
The fusion of this information can enhance the expression ability of the network. Experiments also verify this concept of using 3-D dilated convolution to get more expressive features.
The spatial compression module transforms the information in the original image into another space and retains the key information: the key parts of the image are enhanced, and the useless information of other parts is suppressed. The information obtained by the multi-scale sampling module is taken as the input of the spatial compression module. Through the parallel structure of global maximum pooling and global average pooling, spatial information aggregation is carried out, generating context information of size 1 × 1 × C. Each pooled vector is passed through an MLP to generate a channel attention map, and these two maps are added to obtain the final channel attention map.
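To make the data flow concrete, the spatial-compression step can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the MLP weights `mlp_w1`/`mlp_w2`, the reduction ratio, and the feature sizes are all assumed for demonstration.

```python
import numpy as np

def channel_attention(features, mlp_w1, mlp_w2):
    """Sketch of the spatial-compression step: parallel global max/avg
    pooling, a shared two-layer MLP applied to each pooled vector,
    summation, and a sigmoid to get per-channel weights."""
    # features: (H, W, C)
    max_pool = features.max(axis=(0, 1))    # (C,) global max pooling
    avg_pool = features.mean(axis=(0, 1))   # (C,) global average pooling

    def mlp(x):
        hidden = np.maximum(x @ mlp_w1, 0)  # ReLU bottleneck
        return hidden @ mlp_w2

    weights = 1.0 / (1.0 + np.exp(-(mlp(max_pool) + mlp(avg_pool))))  # sigmoid
    return features * weights               # reweight each channel

# toy example: 4x4 feature map with 8 channels, reduction ratio 2
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 4, 8))
w1 = rng.standard_normal((8, 4)) * 0.1
w2 = rng.standard_normal((4, 8)) * 0.1
out = channel_attention(feats, w1, w2)
assert out.shape == feats.shape
```

Multiplying the input by the resulting per-channel sigmoid weights is exactly the "enhance key channels, suppress useless ones" behavior described above.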
Moreover, we explore the differences and relationships between the encoder and decoder in the U-Net network. The saliency maps of low-level features contain many details, while the saliency maps of high-level features are only rough results in which some basic areas may be weakened. Zhao and Wu [49] concluded through a series of experiments that deep features usually contain global context information, focus more on salient areas, and ignore some edge information. This is disastrous for cloud detection tasks because edge regions are often thin clouds. Shallow features contain more spatial information and therefore also contain the thin cloud regions. The proposed attention module takes the shallow features of the encoder, generates the final weight vector through the multi-scale sampling module and spatial compression module, and then multiplies the weight vector with the decoder features to obtain the final output. By adding this result element-wise to the original decoder features, the features of thin cloud regions are enhanced to a certain extent, while the features of thick cloud areas are not weakened.

C. Spatial Attention Module
The visual attention model of the human brain is simulated in the spatial attention model of DL, which in recent years has been widely used by researchers, especially in the field of DL [50], [51]. In order to extract the key information, the spatial attention module applies a spatial transformation to the spatial-domain information in the picture, generates a spatial score mask, and finally multiplies it element-wise with the original image to obtain the desired result. The mask is usually fitted from the original image by channel compression and simple convolutions. Through analysis, it has been found that for image segmentation or classification tasks, the computer is not as sensitive as human vision in identifying the category of a certain area; if it focuses only on a local area, the computer cannot complete the task of segmentation and classification. Usually, the convolution operation is limited by the size of the convolution kernel and can only fit local information. For cloud feature extraction, long-distance information is often equally crucial because, in the final analysis, cloud detection is a classification problem between cloud and non-cloud regions, and most cloud pixels resemble one another. The effective use of the similarity between cloud pixels is the focus of this research work.
In order to make effective use of long-distance information, one method is to enlarge the convolution kernel as much as possible, even to the size of the whole image, or to expand the receptive field by stacking dilated convolutions in the network, which enlarges the receptive field and acquires a wider information distribution. However, such continuous stacking substantially increases the amount of computation, and the deepening of the network makes it difficult to converge, which increases the difficulty of optimization. In 2011, Buades et al. [52] proposed a spatial filtering method, non-local means denoising. In this method, the pixels in the image are not considered to exist in isolation: there must be some association between the pixels at one point and the pixels elsewhere, which can be regarded as gray-level correlation and geometric structure similarity. Meanwhile, it was also found that similar pixels are not limited to a certain local area, as with the long edges, structures, and textures in an image. Natural images contain abundant redundant information, so image blocks that describe the structural features of the image are used to seek similar blocks across the whole image. The basic idea of non-local means denoising is that the gray value of the current pixel is obtained as a weighted average of all pixels in the image with a similar structure.
Inspired by non-local means denoising, Wang et al. [44] introduced the idea into DL: maximum non-local information sharing is achieved by expanding the receptive field to the size of the whole image. As an end-to-end module, the non-local neural network module can be added to any CNN and substantially improves the overall performance of the network. However, this module has a fatal disadvantage: the matrix multiplications lead to a huge computational and parameter cost and hence high hardware requirements. Cao et al. [53] found that in the original non-local structure, the important areas are basically the same for every query location; that is, the attention maps of different locations are almost identical. Hence, by adding the features of these important areas to each location, the accuracy of the network does not decline, but the amount of computation is reduced significantly. Based on this, we modify the non-local module to obtain the context information with convolutions instead of a large number of matrix multiplications. In this way, the number of parameters is prominently reduced, and, considering that the features learned by the non-local network are location independent, the information of the whole image is aggregated into one point. In the final structure, a 1 × 1 convolution is used to fit the information.
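The simplified non-local (global-context) computation described above can be sketched as follows. This is a NumPy toy under stated assumptions: the 1 × 1 scoring convolution is reduced to a per-channel weight vector `wk`, and all shapes are arbitrary.

```python
import numpy as np

def global_context_attention(x, wk):
    """Sketch of the simplified non-local idea: a 1x1 convolution (here
    the weight vector `wk`) scores every spatial position, a softmax over
    positions pools the whole image into one context vector, and that
    vector is added to every point (residual)."""
    h, w, c = x.shape
    flat = x.reshape(h * w, c)                  # (HW, C)
    scores = flat @ wk                          # (HW,) attention logits
    scores = np.exp(scores - scores.max())
    attn = scores / scores.sum()                # softmax over positions
    context = attn @ flat                       # (C,) global context vector
    return x + context                          # broadcast add to all points

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 6, 4))
wk = rng.standard_normal(4)
y = global_context_attention(x, wk)
assert y.shape == x.shape
```

Because the softmax-pooled context vector is the same for every position, it is computed once and broadcast, which is precisely why this variant avoids the position-by-position matrix multiplications of the original non-local block.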
Based on past experience, representing features at multiple scales is very significant for convolutional neural networks in visual tasks. While exploring stronger expression ability, we noticed that better feature extraction can be obtained by grouping the features and fusing the results layer by layer. Hence, we propose a multi-level convolution fusion block (MFC-block) (Fig. 4). A grouped convolution module is added to the spatial attention module to represent multi-scale features in a more fine-grained way and to increase the receptive field of each network layer. First, the input features pass through a 1 × 1 convolution; then the features are divided into several groups along the channel dimension, and each group is fused with the features of the previous group and fitted by a 3 × 3 convolution. The results of all groups are concatenated as the output; at the same time, referring to the Res-Net structure, the input and output features are added to obtain the final output.
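The grouped, layer-by-layer fusion in the MFC-block can be sketched as below. This only illustrates the data flow: the initial 1 × 1 convolution is omitted, and the learned 3 × 3 convolution is replaced by a cheap neighbor-averaging stand-in, so the numbers carry no meaning beyond the shapes.

```python
import numpy as np

def mfc_block(x, groups=4):
    """Sketch of the multi-level convolution fusion idea: split channels
    into groups, fuse each group with the previous group's output before
    filtering, concatenate, and add the input back (residual)."""
    parts = np.split(x, groups, axis=-1)   # split along channel dimension
    outs, prev = [], 0.0
    for p in parts:
        fused = p + prev                   # fuse with previous group's result
        # stand-in for a 3x3 conv: average each pixel with a neighbor
        smoothed = 0.5 * (fused + np.roll(fused, 1, axis=1))
        outs.append(smoothed)
        prev = smoothed
    return x + np.concatenate(outs, axis=-1)   # Res-Net-style residual add

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 8, 16))
y = mfc_block(x)
assert y.shape == x.shape
```

Each later group sees the accumulated output of the earlier groups, which is how the block widens the effective receptive field at a fine channel granularity.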
The spatial attention module proposed here is demonstrated in Fig. 3. The module combines the multi-level convolution fusion module with the improved non-local neural network module. On the one hand, it boosts the relevance of each local region through grouping convolution, and on the other hand, it cross-fuses the long-distance information through the non-local idea. The module is designed as a residual structure and added to the last layer feature of the decoder.

D. Probabilistic Upsampling Block
In the image segmentation task, boundary details play a significant role; if the spatial information of some key positions is ignored, segmentation becomes inefficient. Hence, Seg-Net, proposed by Kendall and Cipolla [54], stores the max-pooling indices of the encoder feature maps before downsampling; that is, the position of the maximum feature value in each pooling window is passed from each encoder to the corresponding decoder. This structure alleviates the loss of boundary information to a certain extent, and its effectiveness is demonstrated by experiments. In the cloud detection task, once spatial location information is lost, the cloud or non-cloud texture becomes incoherent, which significantly affects detection performance. To solve this problem, a PUB module is designed to mitigate the information loss of common upsampling.
Compared to Seg-Net, this method records the position information of the input image before downsampling and obtains the index of the maximum value. At the same time, the other three positions in each pooling window are set to their ratio to the maximum value; that is, the maximum position is set to 1, and the other positions are set to the ratio of their values to the maximum. This information is stored in a weight map. When the decoder performs upsampling, it first upsamples and then multiplies the result by the weight map from the encoder. This structure not only mitigates the missing-boundary issue but also maintains spatial continuity to a certain extent. The experimental results show that this structure achieves both better detection performance and faster convergence; the PUB is shown in Fig. 5.
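The weight-map bookkeeping of the PUB can be illustrated on a single-channel toy with 2 × 2 pooling (a sketch of the idea as described, not the authors' code). Note that when every pooling window has a nonzero maximum, upsampling the pooled map and multiplying by the stored weights reproduces the encoder map exactly, which is the spatial-continuity property the text appeals to.

```python
import numpy as np

def pub_weight_map(x):
    """For each 2x2 pooling window, store every pixel's ratio to the
    window maximum (the max position gets weight 1) alongside the
    pooled value itself."""
    h, w = x.shape
    weights = np.empty_like(x, dtype=float)
    pooled = np.empty((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            win = x[i:i+2, j:j+2]
            m = win.max()
            pooled[i // 2, j // 2] = m
            weights[i:i+2, j:j+2] = win / m if m != 0 else 0.0
    return pooled, weights

def pub_upsample(decoder_feat, weights):
    """Nearest-neighbour upsample, then multiply by the stored weight map."""
    up = np.kron(decoder_feat, np.ones((2, 2)))
    return up * weights

x = np.array([[1., 2., 3., 0.],
              [4., 2., 1., 3.],
              [0., 1., 2., 2.],
              [2., 1., 4., 0.]])
pooled, w = pub_weight_map(x)
restored = pub_upsample(pooled, w)
assert restored.shape == x.shape
```

In this toy every window has a nonzero maximum, so `restored` equals `x` exactly; in the real network the decoder features differ from the pooled encoder values, and the weight map instead re-imposes the encoder's within-window structure.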

E. Loss Function
For a binary classification task, binary cross-entropy [55] can be selected as the loss function. When the output of the network is activated through the sigmoid function, the output is a probability value between 0 and 1; if the value exceeds 0.5, the pixel is classified as a positive sample, and if it is less than 0.5, as a negative sample. Cloud detection is a binary image segmentation task: the positive samples are cloud pixels and the negative samples are non-cloud pixels, so binary cross-entropy can be used as the loss function, and experiments show that good detection results can be achieved with it. However, analysis reveals that the distribution of cloud and non-cloud pixels in a scene is not balanced. Thick cloud and non-cloud areas are easy for the network to distinguish: when the predicted value of an area is close to 1, it is surely a cloud area; conversely, when the predicted value is close to 0, it is certainly a non-cloud area.
However, at the boundary between cloud and non-cloud or in thin cloud areas, the predicted value is usually close to 0.5, which makes it difficult for the network to decide to which class a point belongs. This makes the cloud detection task strenuous. Whether these regions are detected correctly is often the key to judging the performance of an algorithm, and the binary cross-entropy loss spends most of its iterations on a large number of easy samples, so it converges slowly and cannot reach the optimum. In view of this, the focus of this work is to improve the binary cross-entropy loss so that it is more suitable for cloud detection. The binary cross-entropy loss can be computed as

L_BCE = -[t·log(y) + (1 - t)·log(1 - y)]

where y is the output of the activation function, between 0 and 1, and t is the ground-truth label. We notice that the focal loss [56] adds constraint terms to the binary cross-entropy loss to achieve better detection performance: a balance factor to balance positive and negative samples, and a modulation factor (1 - y)^γ to make the network pay more attention to difficult, misclassified samples. Compared with thick clouds and other ground objects, thin clouds are easily misclassified, so adding such constraints to the loss function makes the network pay more attention to thin cloud areas. In the cloud detection task, because the numbers of cloud and non-cloud pixels in a scene are uncertain, the balance factor is unnecessary for us. To avoid a lot of meaningless optimization on simple samples, only the modulation constraints are added to the binary cross-entropy function; that is, we use the focal loss without the balance factor.
The formula is as follows:

L = -(1 - y)^γ · t·log(y) - y^γ · (1 - t)·log(1 - y).

For positive samples (cloud pixels), when they are misclassified, y is close to 0, so the modulation factor (1 - y)^γ is close to 1 and the loss is not greatly affected; when they are correctly classified, y is close to 1, so the modulation factor is close to 0. The same holds for the negative samples. The loss is reduced for both correctly and wrongly classified samples, but the reduction for correctly classified samples is greater than that for misclassified ones. Hence, more attention is transferred to the misclassified samples, and the proportion of well-classified samples in the loss is reduced. For the cloud detection task, for example, a cloud region whose appearance is very similar to the underlying surface is likely to be misassigned to the negative samples; more attention is paid to such areas, which makes the presented network more robust.
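As a sketch, the loss described above (focal loss with the balance factor dropped) can be written directly from the formula; γ = 2 is an assumed value, and the clipping constant is only for numerical safety.

```python
import numpy as np

def focal_loss(y_pred, y_true, gamma=2.0, eps=1e-7):
    """Focal loss without the class-balance factor: (1 - y)^gamma scales
    the positive term and y^gamma the negative term, so well-classified
    (easy) pixels contribute little to the loss."""
    y = np.clip(y_pred, eps, 1.0 - eps)        # avoid log(0)
    pos = -((1.0 - y) ** gamma) * y_true * np.log(y)
    neg = -(y ** gamma) * (1.0 - y_true) * np.log(1.0 - y)
    return np.mean(pos + neg)

# a confident correct prediction contributes far less than a wrong one
easy = focal_loss(np.array([0.95]), np.array([1.0]))
hard = focal_loss(np.array([0.30]), np.array([1.0]))
assert easy < hard
```

With γ = 0 the expression reduces to plain binary cross-entropy, which is a convenient sanity check when implementing it in a training framework.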

IV. EXPERIMENTS

A. Dataset
In 2013, China launched its first high-resolution earth observation satellite, GF-1, which is equipped with two panchromatic/multispectral cameras (2 m panchromatic, 8 m multispectral) and four 16 m wide-field-of-view multispectral cameras. The Gaofen-1 satellite achieved breakthroughs in optical remote sensing technology combining high spatial resolution, multispectral imaging, and high temporal resolution, as well as in multi-payload image mosaic and fusion and in high-precision, high-stability attitude control. The wide-field-of-view camera provides four bands of visible and near-infrared light, namely R, G, B, and NIR, with a spatial resolution of 16 m and an observation swath of 800 km. Owing to its high precision and wide observation range, it has been applied in many fields, such as environmental disaster reduction, ocean monitoring, security, and remote sensing. Conducting research on cloud detection algorithms with Gaofen data is challenging: there are only four bands of information and no bands such as thermal infrared, which are supreme in the cloud detection task, so using the limited spectral information to better separate cloud from the underlying surface is a challenging and meaningful research task.
The dataset used in the experiments is the open-access GF-1 WFV imagery [10]. This dataset comprises 108 scenes collected from various locations around the world, with different cloud cover distributions, and all scenes have corresponding cloud masks. A variety of geomorphic environments is covered, including urban areas, barren areas, snow-covered areas, heavily vegetated areas, and oceans or lakes. The resolution of the imagery is 16 m, covering the visible and near-infrared bands, and each scene is approximately 17 000 × 16 000 × 4 in size. Among the 108 scenes, 86 were selected as training data and the rest as test data. As shown in Fig. 6, to remove the black area around each scene, all images are rotated and cropped to 11 264 × 11 264 × 4, and each scene is then cut into 512 × 512 patches. In this way, 52 272 images of size 512 × 512 are obtained, including 41 624 for training and 10 648 for testing. Finally, the pixel values of these images are divided by 1023 to normalize them between 0 and 1.
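The cropping, tiling, and normalization steps can be sketched as follows; the scene size here is a toy value, and the cropping logic is an assumption about how a scene is cut into non-overlapping patches rather than the authors' exact pipeline.

```python
import numpy as np

def tile_scene(scene, tile=512, max_dn=1023.0):
    """Crop a scene to a multiple of the tile size, cut it into
    tile x tile patches, and normalize the 10-bit digital numbers
    to [0, 1]."""
    h, w, c = scene.shape
    h, w = (h // tile) * tile, (w // tile) * tile
    scene = scene[:h, :w, :] / max_dn      # divide by 1023 as in the text
    tiles = [scene[i:i+tile, j:j+tile]
             for i in range(0, h, tile)
             for j in range(0, w, tile)]
    return np.stack(tiles)                 # (N, tile, tile, C)

# toy 4-band scene; a real scene would be ~11264 x 11264 x 4
toy = np.random.default_rng(3).integers(0, 1024, (1100, 1100, 4)).astype(float)
patches = tile_scene(toy)
assert patches.shape == (4, 512, 512, 4)
```

An 11 264 × 11 264 scene divides evenly into 22 × 22 = 484 such patches, consistent with 108 scenes yielding 52 272 patches in total.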
As shown in Fig. 7, it can be seen that the ground features are quite different, and the colors of various ground objects are very different, especially the snow and water images, which are very similar to clouds.

B. Experimental Environment
In this article, all the experiments are programmed and implemented with the Keras framework on Ubuntu 18.04 and trained on an NVIDIA RTX 2080 Ti GPU. The training batch size is set to 4 and the maximum number of training epochs to 50. The Adam optimizer [57] is used, with the learning rate set to 10^-6.

V. RESULTS AND ANALYSIS

A. Evaluation Metrics
The overall accuracy (OA) and false alarm rate (FAR) [30], [38], [60] are chosen as the main experimental verification indicators. In addition, we also select Precision, Recall, Kappa, and F1-score as auxiliary indicators. In the field of DL, OA refers to the ratio of the number of samples correctly classified by the classifier to the total number of samples in a given test dataset. FAR is the ratio of the negative samples misclassified by the classifier to the total number of negative samples. Recall represents the ratio of the number of correctly detected cloud pixels to the actual number of cloud pixels in the ground truth. Kappa is a coefficient used to evaluate consistency in image segmentation; the higher the value of Kappa, the better the model performance. The F1-score is a measure for classification problems, defined as the harmonic mean of Precision and Recall; generally, the higher the F1-score, the better the quality of the model [58].
The above-mentioned metrics are defined as follows:

OA = (TP + TN) / (P + N)
FAR = FP / (FP + TN + ε)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
Kappa = (OA − Pe) / (1 − Pe), where Pe = [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)] / (P + N)²

Here, TP denotes the true positives, i.e., the number of cloud pixels correctly identified as cloud pixels in the generated mask; TN denotes the true negatives, i.e., the number of non-cloud pixels correctly identified as non-cloud pixels; FP denotes the false positives, i.e., the number of non-cloud pixels wrongly identified as cloud pixels; and FN denotes the false negatives, i.e., the number of cloud pixels wrongly identified as non-cloud pixels. P denotes the number of cloud pixels in the ground truth and N the number of non-cloud pixels, so that P = TP + FN and N = TN + FP. When calculating FAR, if the ground truth contains no non-cloud pixels, TN and FP are both 0 and the denominator of FAR becomes 0; an infinitesimal ε = e^−10 is therefore added to avoid division by zero.
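The metric definitions above can be computed directly from a pair of binary masks; the following sketch uses standard confusion-matrix forms, with the article's ε guarding the FAR denominator (the function name is illustrative):

```python
import numpy as np

EPS = np.exp(-10)  # the article's epsilon guarding a zero denominator

def cloud_metrics(pred, truth):
    """Confusion-matrix metrics for binary masks where 1 = cloud
    and 0 = non-cloud."""
    pred, truth = np.asarray(pred).ravel(), np.asarray(truth).ravel()
    tp = int(np.sum((pred == 1) & (truth == 1)))
    tn = int(np.sum((pred == 0) & (truth == 0)))
    fp = int(np.sum((pred == 1) & (truth == 0)))
    fn = int(np.sum((pred == 0) & (truth == 1)))
    total = tp + tn + fp + fn                      # = P + N
    oa = (tp + tn) / total
    far = fp / (fp + tn + EPS)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, EPS)
    # Pe is the chance agreement used by Cohen's kappa.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (oa - pe) / (1 - pe + EPS)
    return dict(OA=oa, FAR=far, Precision=precision,
                Recall=recall, F1=f1, Kappa=kappa)
```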

B. Evaluation of PUB
The purpose of the PUB is to preserve the continuity of spatial information while restoring the image at the decoder and to mitigate the boundary loss caused by upsampling. With the spatial attention and channel attention modules removed, the resulting network is denoted PU-Net. Since PU-Net differs from U-Net only in the added PUB, its performance is compared directly with that of U-Net. During training, it is observed that PU-Net not only achieves better detection results than U-Net but also converges faster. The test results are listed in Table I: PU-Net achieves higher OA, and its FAR decreases by 0.86%.
Fig. 8 shows a test image of the water class. The results of the U-Net algorithm exhibit obvious discontinuities in texture, whereas the results of PU-Net do not; combining the subjective and objective results, the proposed PUB is effective. As shown in Fig. 9, the convergence speed of PU-Net in the first few epochs is significantly higher than that of U-Net, and its final OA is also higher. We can therefore conclude that the PUB effectively improves both convergence speed and network performance.

C. Evaluation of CA Module
To verify the CA module, several experiments are carried out. First, to examine the multi-scale sampling structure of the CA module, comparative experiments with 2-D and 3-D dilated convolutions are performed, with the CA module added to the deepest layer of the U-Net network. Extensive experiments show that one plain convolution in parallel with three dilated convolution groups with dilation rates {2, 5, 7} gives the best detection results; both 2-D and 3-D convolutions are therefore tested for the three-way dilated convolution. As listed in Table II, the accuracy and FAR of the structure with 3-D dilated convolution improve upon those of the 2-D dilated convolution. A 3-D dilated convolution can fit information not only in the spatial dimensions but also along the channel dimension, so it is expected to outperform its 2-D counterpart, and the detection results confirm this conjecture.
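The multi-scale sampling structure described above can be sketched in Keras as follows. The parallel plain branch and the three dilated branches with rates {2, 5, 7} come from the text; the kernel size, the fusion by concatenation, the sigmoid squeeze into attention weights, and the final multiplication are assumptions, not the paper's exact design:

```python
from tensorflow import keras
from tensorflow.keras import layers

def multiscale_3d_block(x, filters=1):
    """Sketch of the CA module's multi-scale sampling: one plain
    convolution in parallel with three 3-D dilated convolutions
    (rates 2, 5, 7). Treating the channel axis as a third spatial
    dimension lets the dilated kernels also stride across channels."""
    # (H, W, C) feature map -> (H, W, C, 1) volume for Conv3D.
    x3d = layers.Reshape((*x.shape[1:], 1))(x)
    branches = [layers.Conv3D(filters, 3, padding="same",
                              activation="relu")(x3d)]
    for rate in (2, 5, 7):
        branches.append(layers.Conv3D(filters, 3, padding="same",
                                      dilation_rate=rate,
                                      activation="relu")(x3d))
    fused = layers.Concatenate(axis=-1)(branches)
    # Squeeze the branches into per-position attention weights.
    weights = layers.Conv3D(1, 1, activation="sigmoid")(fused)
    weights = layers.Reshape(tuple(x.shape[1:]))(weights)
    return layers.Multiply()([x, weights])
```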
To further verify the effectiveness of the CA module, experiments are conducted by varying the number of CA modules. Since U-Net has five feature layers in total, the CA module is added to different numbers of its deepest layers; the configuration with the CA module in the deepest four layers achieves the best results and is adopted in the final network.

D. Evaluation of SA Module
For the SA module, since the CA module already occupies the deepest four layers of the U-Net network, the SA module is added only to the last layer; this avoids the mutual interference that mixing the CA and SA modules in the final network could cause. Another reason is that the proposed SA performs information fusion in the spatial dimension, and the last layer has the largest spatial size of all the layers, which lets the SA module play its maximal role. U-Net with the SA module added is then compared with the original U-Net to verify the effectiveness of the SA module; the results are listed in Table III. As shown in Fig. 11, the network with SA achieves better detection performance in snow-covered areas.
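The spatial information fusion that the SA module performs, where each point is strengthened by information from every other point, can be illustrated with the usual non-local/self-attention formulation. This is a generic sketch in NumPy, not the paper's exact SA design:

```python
import numpy as np

def spatial_attention(feat):
    """Illustrative spatial fusion: re-express each spatial position
    as an affinity-weighted mixture of all positions (non-local /
    self-attention style). `feat` has shape (H, W, C)."""
    h, w, c = feat.shape
    x = feat.reshape(h * w, c)                   # (N, C) with N = H * W
    logits = x @ x.T / np.sqrt(c)                # pairwise affinities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over positions
    return (attn @ x).reshape(h, w, c)
```

Applying this at the last (largest) layer is what makes it expensive: the affinity matrix is (H·W) × (H·W), which is also why only one layer receives the SA module.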

E. Evaluation of Proposed Method
The final network is based on U-Net with the CA module, the SA module, and the PUB added. The CA module strengthens the fusion of channel-dimension information, the SA module strengthens the fusion of spatial-dimension information, and the PUB restores spatial detail at the decoder by supervising the location information recorded during downsampling. We also compare the focal loss with the binary cross-entropy loss. Focal loss pays more attention to the samples that are difficult to classify, and the detection results in Table IV show that the focal loss achieves better detection performance. As shown in Fig. 12, the ground truth reveals that most of the test image is covered by clouds, and many thin-cloud areas are visible in the corresponding NIR-R-G image. Compared with the binary cross-entropy loss, the focal loss leads to fewer missed detections.
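The down-weighting of easy samples that focal loss performs can be sketched as follows; the γ and α values are the common defaults from Lin et al., not values reported in this article:

```python
import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss. The (1 - p_t)^gamma factor shrinks the
    contribution of well-classified pixels so that training focuses
    on hard ones such as thin-cloud regions."""
    y_pred = np.clip(y_pred, eps, 1 - eps)           # avoid log(0)
    p_t = np.where(y_true == 1, y_pred, 1 - y_pred)  # prob. of true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

With γ = 0 and α = 0.5 this reduces (up to a constant factor) to binary cross-entropy, which is the baseline it is compared against in Table IV.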
To further verify the performance of the algorithm, it is compared with several existing image segmentation networks, including DeepLabV3+ [19], RS-Net [38], and NGAD [59]. To ensure the fairness and accuracy of the experiment, the code used is that provided by the respective authors, and the same data are used for training and testing. Both the subjective images and the objective data indicate that the proposed algorithm has clear advantages over the other algorithms on the cloud detection task.
As shown in Fig. 13, the first and third rows are barren and urban images, respectively. In the upper left corner of the image there are ground objects that resemble clouds. DeepLabV3+ and RS-Net identify these areas as clouds, whereas the proposed algorithm correctly detects them as non-cloud areas. Our algorithm therefore has better detection ability in locally high-reflectance areas.
The second and fourth rows are snow and vegetation images; see the red-framed regions. The JPEG image shows that these areas contain many thin clouds, and all three comparison algorithms, DeepLabV3+, RS-Net, and NGAD, miss thin clouds there. Because the proposed algorithm pays attention to these difficult areas, it achieves a better detection effect, with texture closer to the ground truth.
The fifth row is a water image; see the red-framed area. Since water under strong illumination is difficult to distinguish from cloud, all algorithms show reduced detection performance in these areas; nevertheless, the proposed method performs better than the other three overall, correctly classifying the cloud-like areas as non-cloud.

In terms of OA, the proposed algorithm improves clearly upon RS-Net and DeepLabV3+. Its performance is similar to that of NGAD: it achieves higher OA than NGAD, but also higher FAR. Apart from FAR, all other indicators improve upon NGAD. In particular the F1-score, which reflects the overall quality of a network, is the best among all methods, confirming the effectiveness of our work. Moreover, NGAD uses complex Gabor features, which greatly increase the network computation, whereas our network is comparatively light and easier to deploy.

Table VI compares the parameters of the different algorithms. RS-Net has the fewest parameters, but its simple model limits its detection performance; DeepLabV3+ has the most parameters yet still modest detection performance; and NGAD performs similarly to our algorithm but has 36% more parameters than our model. We can conclude that our algorithm achieves the best detection performance while keeping the model size well controlled, and is therefore competitive with other cloud detection algorithms.
On analysis, our FAR is slightly higher than that of NGAD because we pay more attention to thin-cloud areas: although OA and Precision improve, at the edges of thin-cloud regions the probability of identifying non-cloud pixels as cloud also increases. How to better identify cloud edges is something we need to improve in future work.
In summary, from both subjective and objective points of view, the proposed algorithm exhibits excellent cloud detection performance, which confirms its effectiveness. At the same time, it is lightweight, which reduces its limitations in application and makes it easier to deploy.

VI. CONCLUSION
With the development of DL theory, more and more researchers use DL-based convolutional neural networks for cloud detection and related research. Nevertheless, when a convolutional neural network extracts effective cloud information, a large amount of redundant information is fed into the network at the same time, leading to subsequent misclassification. Cloud detection, as a special segmentation task, is very sensitive to texture distribution: once an area is classified incorrectly, the texture of the final result image is disrupted. Another common problem is that the binary cross-entropy loss function cannot account for the regions that are harder to classify, resulting in low detection performance. To address these issues, the proposed attention modules automatically adjust region weights to retain more useful information and suppress useless information; the PUB is proposed to solve the problem that the convolutional neural network neglects boundary texture information, thereby optimizing the upsampling module; and the binary cross-entropy loss is replaced with the focal loss to pay more attention to critical regions such as thin clouds. Experiments prove the effectiveness of the algorithm, with an OA of 97.45% and a FAR of 2.65%; compared with other algorithms, the proposed algorithm achieves better detection results.
Hui Wang is currently pursuing the master's degree with Xidian University, Xi'an, China.
Her research interests include remote sensing, computer vision, and image compression algorithm.
Jun Wu is currently pursuing the master's degree with Xidian University, Xi'an, China.
His research interests include computer vision, remote sensing, and machine learning.