Cloud Detection Method Using CNN Based on Cascaded Feature Attention and Channel Attention

Cloud detection is of great significance for the subsequent analysis and application of remote-sensing images, and it is a critical part of remote-sensing image preprocessing. In this article, we propose a cloud detection method using convolutional neural networks based on cascaded feature attention and channel attention (CFCA-Net). The CFCA-Net uses a cascaded feature attention module (CFAM) to enhance the attention of the network toward important color and texture features. The CFAM cascades the color feature attention module and the texture feature attention module in the encoder. The CFCA-Net also uses channel attention to highlight the important information in the channel dimension. The attention module is based on multi-scale features and uses dilated convolutions with different dilation rates to obtain information from multiple receptive fields. Moreover, a loss function combining quadtree segmentation and binary cross-entropy (BCE) is introduced to make the network focus on the edges of cloud areas. We validated our CFCA-Net on the Gaofen-1 wide field-of-view (WFV) imagery dataset. The experimental results show that the CFCA-Net performs well under different scenarios, and its overall accuracy reaches 97.55%. Moreover, subjective cloud detection results also demonstrate the effectiveness of our algorithm.

Over the years, cloud detection methods have been studied extensively. Traditional cloud detection methods rely on the physical characteristics of clouds and set thresholds based on them. These methods mainly study the reflectivity of clouds in different bands and the relationships between bands (such as the ratio of reflectance between two bands). Using the difference between the physical characteristics of cloud and non-cloud areas, a good detection effect can be achieved by setting thresholds for specific physical characteristics. In 1993, Rossow and Garder [2] set thresholds in the near-infrared and visible light bands and proposed the International Satellite Cloud Climatology Project (ISCCP) cloud detection algorithm. Targeting Landsat-7 remote-sensing data, Irish et al. [3] proposed the automatic cloud cover assessment (ACCA) algorithm. This method uses the multi-spectral and thermal infrared band reflection characteristics of Landsat-7 remote-sensing data to obtain cloud masks and non-cloud masks. An improved version of this method has also been applied to Gaofen-1 satellite imagery [4]. These methods use only a part of the band information of the remote-sensing data. The F-mask considers almost all the band information, conducts several physical tests, builds a probability model to calculate the cloud probability of each pixel, and can dynamically calculate a suitable threshold [5]-[7]. Chen et al. [8] used F-mask to integrate spectral information and contextual semantic information to improve the detection accuracy on Landsat images. The multi-feature combined (MFC) algorithm uses the relationship between the reflectivity and waveband of the GF-1 remote-sensing image and uses the aggregate and texture features to refine the detection results and generate the final cloud mask [9].
Some remote-sensing images contain less band information, such as the Gaofen-1 satellite image, which has only four bands. For such images, the color and texture features are generally extracted to process the image. An and Shi [10] designed a cloud detection algorithm based on the least square method. This algorithm utilizes the color features, local statistical features, texture features, and structural features of the image. Liu et al. [11] applied a graphic model combined with color features for cloud segmentation. Li et al. [12] used support vector machines (SVMs) [13] to distinguish features, including brightness features, texture features, and average gray-level co-occurrence matrix (GLCM) [14], [15]. Shi et al. [16] used scale-invariant feature transform (SIFT) [17] and RGB features as the key features to evaluate whether a super-pixel [18] is a cloud. These methods extract the brightness, texture, and other variable features of image pixels to obtain the cloud masks. However, these methods are not robust to images with unusual underlying surfaces (such as ice and snow).
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In recent years, neural network methods have been used widely in the field of image processing and have achieved good results in object detection, classification, and segmentation. Cloud detection in remote-sensing images is a semantic segmentation task. Deep learning methods for cloud detection avoid manually designed features and can discover more potential features. Key and Barry [19] took the lead in applying neural networks to cloud detection in remote-sensing images. Bankert [20] and Jianhua [21] used artificial neural networks and probabilistic neural networks, respectively, for Advanced Very High Resolution Radiometer (AVHRR) cloud detection. These two models detect both thin and thick clouds well and are stable in complex scenes. In deep learning methods, multi-scale features are widely used. Xie et al. [22] performed super-pixel segmentation on the remote-sensing image to be detected, used a convolutional neural network to extract multi-scale features from the super-pixels, and divided the pixels into cloud pixels and non-cloud pixels. Ji et al. [23] used cascaded convolutional neural networks to integrate cloud detection and cloud removal frameworks and used multi-scale aggregation to detect clouds and shadows. Luotamo et al. [24] used multi-scale information and cascaded two CNN models to deal with undersampled and full-resolution images. Jeppesen et al. [25] suggested a cloud detection deep learning model for remote-sensing images based on the convolutional neural network model. Segal-Rozenhaimer et al. [26] proposed a domain-adaptive method based on CNN. This method can better adapt to different satellite platforms in the prediction step without the need to train on each platform separately, which improves the robustness of predictions across multiple remote-sensing platforms. Deep learning methods can also handle situations such as missing information and missing cloud labels.
SAGAN used a semi-supervised algorithm to achieve cloud detection, requiring only a small number of image-level tags [27]. For thumbnails with missing resolution and spectral information, CDnet used a feature pyramid module (FPM) and a boundary refinement (BR) block to effectively extract cloud masks [28]. CDnetV2 further improved the detection results for images in which clouds and snow coexist [29]. The main advantage of deep learning is the diversity of feature learning and the ability to learn deep features. A deep convolutional neural network can extract various features such as spatial features and spectral features.
However, most methods pay more attention to regional accuracy and less to boundary quality, which leads to blurred boundaries in the detection results [30]. At cloud boundaries and in thin cloud areas, cloud information and underlying surface information are mixed. Due to the complexity and diversity of the underlying surface, it is very difficult to detect the boundaries and thin cloud areas accurately. In this situation, it is unrealistic to rely only on increasing the width and depth of the network.
We have done a lot of research on cloud detection. We first considered using the multiple features of ground objects.
We found that the texture difference between clouds and ground objects is very obvious, which is very effective for improving the accuracy of cloud detection. A multi-scale image decomposition based on the domain transform filter was used to extract the texture features of ground objects [31]. Then, we combined the color and texture features of remote-sensing images to design cloud detection methods [32]. Compared to the traditional algorithms, the deep learning method has significantly improved the detection performance. We noticed the development of deep learning and used convolutional neural networks for cloud detection. We designed a Gabor transform layer in the encoder-decoder network to extract texture features [33]. This network was also combined with an attention module and achieved a good cloud detection effect. In the ADUI-Net [34], we proposed the Up-Down block and used the wavelet transform, which significantly improved thin cloud detection. However, the Up-Down block requires a large number of parameters and computations. In [35], we studied a lightweight network and achieved strong performance with fewer parameters.
Through previous research, we found that color and texture features are very effective in cloud detection. The effective extraction and utilization of these features can often improve the performance of cloud detection.
Building on this, we propose a network for cloud detection, named CFCA-Net, which contains a cascaded feature attention module and a channel attention module. The CFCA-Net is built on the encoder-decoder structure. It achieves good detection results on a public dataset, including on thin clouds and cloud boundaries, and it has a lighter network structure than the addition input, up and down block implant network (ADUI-Net). Our contributions include the following three parts.
1) We designed a cascaded feature attention module (CFAM) to enhance the useful spatial information of the multi-scale feature maps and suppress invalid information. This module extracts color features and texture features, giving better results on remote-sensing images with fewer bands. We used the dark channel prior to assist color feature extraction and the nonsubsampled contourlet transform (NSCT) to assist texture feature extraction.
2) We designed a channel attention module on the decoder to screen the feature channels. Our channel attention module uses dilated convolutions with different dilation rates to obtain information from multiple receptive fields.
3) We designed a loss function based on quadtree segmentation. This loss function pays attention to the parts of the detection results that have large edge changes and are difficult to distinguish. The similar points in an entire large area are replaced with a single value, which reduces the proportion of simple samples in the loss function compared to iterating the loss function over all points.

A. Encoder-Decoder Structure
In the field of semantic segmentation, the encoder-decoder structure is widely used and has achieved great results. The fully convolutional network (FCN) [36] introduced an end-to-end fully convolutional neural network structure for semantic segmentation. U-Net [37] introduced skip connections and achieved good results. SegNet [38] passed the pooling layer results from the encoder to the decoder, which introduced more encoding information.
The DeepLab series proposed atrous spatial pyramid pooling (ASPP), which combines information at different scales [39]-[42]. DeepLabV3+ introduces a decoding module based on DeepLabV3, which further integrates the low-level features with the high-level features and improves the accuracy of the segmentation boundary.

B. Dilated Convolution
There are usually two ways to expand the receptive field: one is to increase the size of the convolution kernel, and the other is to use a pooling operation. The pooling layer is an important structure in deep learning that can further extract abstract features and expand the receptive field. However, a large convolution kernel increases the amount of computation, and the pooling operation inevitably reduces the resolution and causes the loss of detailed information. Dilated convolution was proposed in [43] to avoid the decrease in resolution, together with a "context module" that aggregates multi-scale information. Dilated convolution is realized by inserting spaces between the elements of the convolution kernel. This way of enlarging the receptive field works well when multiple dilated convolutions are connected [44]. The DeepLab series uses dilated convolutions; DeepLabv2 and DeepLabv3 study the effectiveness of dilated convolutions connected in parallel and in series for extracting multi-scale information [39]-[42].
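To make the "inserting spaces" idea concrete, the following minimal sketch computes the span covered by a dilated kernel and applies a 1-D dilated convolution. It is an illustration only: the function names are hypothetical, the 1-D setting is a simplification of the 2-D case, and the rates 1, 3, and 5 are taken from the channel attention module described later in this article.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D dilated convolution (cross-correlation form, as used in CNNs)."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are placed `dilation` samples apart; no extra weights are added.
        acc = sum(w * signal[start + j * dilation] for j, w in enumerate(kernel))
        out.append(acc)
    return out


if __name__ == "__main__":
    for rate in (1, 3, 5):
        print("rate", rate, "-> span", effective_kernel_size(3, rate))
    print(dilated_conv1d(list(range(10)), [1, 1, 1], dilation=3))
```

A three-tap kernel keeps three weights at every rate, while its span grows from 3 to 7 to 11 samples, which is why the receptive field can be enlarged without pooling or a larger kernel.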

C. Attention Mechanism
The attention mechanism originated from the study of human visual cognition. Scientists discovered that when humans perform visual tasks such as reading and observation, they pay more attention to the detailed information of the target area and suppress other useless data. The attention mechanism in deep learning is similar. The basic idea is to make the model focus on the important features and ignore those that are not important. The results of attention are generally expressed as probability maps or probability feature vectors. Squeeze-and-excitation networks (SE-Nets), proposed by Hu et al. [45], use the SE module to learn weights for the feature maps of different channels. Woo et al. [46] proposed the convolutional block attention module (CBAM), which combines spatial and channel attention. The attention mechanism has proven effective in object detection [47], [48], image segmentation [49]-[51], super-resolution [52], [53], and other fields, where it improves the effectiveness of the model.

D. Nonsubsampled Contourlet Transform
Da Cunha et al. [54] proposed the NSCT. The NSCT not only has the multi-resolution and time-frequency localization characteristics of the wavelet transform, but also has multi-directionality and anisotropy, which allow it to represent the texture, edge direction, and other information in the image well. The NSCT is a transform based on the nonsubsampled pyramid (NSP) and the nonsubsampled directional filter bank (NSDFB). First, the NSP performs a pyramid decomposition of the input image into two parts, high-pass and low-pass. Then, the NSDFB decomposes the high-frequency sub-band into multiple directional sub-bands, and the low-frequency part continues to be decomposed in the same way. The NSP uses a translation-invariant filter structure to implement the filtering.
Using NSCT to extract the texture information of the image for segmentation is conducive to improving the performance of image segmentation [55], [56].

III. METHOD
A. Overview
We use the encoder-decoder structure as the framework of the cloud detection network model and introduce the attention mechanism. The backbone of the CFCA-Net is similar to the existing encoder-decoder network models and is composed of two parts: an encoder and a decoder. The encoder encodes the entire input image, gradually expands the number of feature map channels, and obtains features at different scales through the pooling structure. Each step of the encoder comprises two consecutive convolutional layers and a maximum pooling layer of size 2 × 2. Each convolutional layer uses a convolution with a kernel size of 3 × 3 followed by a rectified linear unit (ReLU). The maximum pooling down-samples the feature map; in each down-sampling step, the number of feature channels is doubled. In contrast to the encoder, the decoder must restore the feature map to the size of the input image. Hence, each step of the decoder includes up-sampling and convolution with a kernel size of 3 × 3. To make up for the loss of information in the sampling process, the feature map of the corresponding scale at the encoder is connected to the decoder through a skip connection, so that the feature information is shared with the decoder. Finally, we used a 1 × 1 convolution and the Sigmoid activation function to get the final prediction result. Table I shows the structure of the basic encoder-decoder network of CFCA-Net. In the table, (×2) means that there are two layers with the same structure.
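The shape bookkeeping implied by this description can be sketched as follows. The input size (256 × 256), the base channel count (64), the number of stages (4), and size-preserving convolutions are all assumptions made for illustration; the actual values are those listed in Table I, which is not reproduced here.

```python
def encoder_shapes(height, width, base_channels, num_stages):
    """Feature-map shape (channels, height, width) after the two 3x3 convs of each stage.

    Assumes the convolutions preserve spatial size, each 2x2 max pooling halves
    height and width, and the channel count doubles at every down-sampling step.
    """
    shapes = []
    channels = base_channels
    for _ in range(num_stages):
        shapes.append((channels, height, width))
        height, width = height // 2, width // 2  # 2x2 max pooling
        channels *= 2                            # channels double per step
    return shapes


if __name__ == "__main__":
    stages = encoder_shapes(256, 256, base_channels=64, num_stages=4)
    for shape in stages:
        print("encoder:", shape)
    # The decoder mirrors the encoder: each up-sampling step restores the
    # resolution of the matching encoder stage, whose map arrives via the skip connection.
    for shape in reversed(stages[:-1]):
        print("decoder:", shape)
```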
The ground objects in remote-sensing images are complicated, and too many invalid features affect the performance of the cloud detection model. Introducing the attention module enables the network to learn the effective features of the cloud region, reduces the learning of invalid features such as ground objects, and improves both the effectiveness of feature extraction and the accuracy of the cloud detection model. We used the CFAM in the encoder to emphasize the color, texture, and other related features of the cloud area while ignoring the invalid features of the non-cloud area. As shown in Fig. 1, we used the cascaded attention module at each scale of the encoder to form continuous multi-scale cascaded feature attention. In this way, the information loss caused by down-sampling can be effectively compensated, and the feature map of the next level can be guided to pay more attention to the color and texture features so as to preserve the features of the cloud area.
At the decoder, after multiple convolution and pooling operations, a multi-channel feature map containing complex information has been generated. The feature map of each channel is a component extracted from the original image that contains different feature information. Some channels contain more features that can highlight the cloud area, while some do not. The channel attention mechanism is a good mechanism for screening feature maps. Channel attention usually mines correlations from the data itself [45], [46], [57]. However, according to the characteristics of the cloud detection encoder-decoder network in this article, we designed a channel attention module that uses the feature map of the encoder to guide the feature map of the decoder. As shown in Fig. 1, because of the symmetrical structure of the encoder-decoder network, multiple up-sampling operations must be performed at the decoder. In the process of up-sampling, the number of feature map channels decreases gradually. Hence, we used the channel attention module before the up-sampling process of the decoder to retain the channels that can highlight the features of the cloud area in the feature map.
The overall performance of the model is affected by both the network structure and the design of loss function [58], [59]. A proper loss function can make the model converge faster during the training process, and the obtained model also has a more reliable prediction performance. Therefore, choosing a suitable loss function is also extremely important for the development of the model. In the semantic segmentation of the ordinary images, the cross-entropy loss function and Adam optimization algorithm are used to train the model to achieve better results [60], [61].
In remote-sensing images, cloud areas and non-cloud areas often occupy large portions that are easy to identify. However, the boundary between the two consists mostly of thin clouds, which are extremely difficult to detect. Therefore, we want the cloud detection network to be more accurate at the edges between cloud and non-cloud areas. Quadtree image segmentation is used widely in image processing applications to locate regions of interest [62]. The cloud mask of the cloud detection network is a result of pixel-level binarization. This article designs a selective guided loss function based on quadtree segmentation. Through quadtree segmentation of the cloud mask, we determine which parts are more heterogeneous than others and let the loss function focus on the edges that are difficult to distinguish. In the network training process, we adopt a combination of the cross-entropy loss function and the quadtree loss, so that the loss function can focus on the hard-to-distinguish edge regions while also considering the overall prediction results.

B. Cascaded Feature Attention
Traditional cloud detection methods extract several texture and color features to improve the performance of cloud detection. In contrast, deep-learning-based cloud detection methods generate functions that map the input data to predicted cloud masks through statistical analysis of the training set. They do not rely on prior knowledge but learn relevant features autonomously through the network. This process relies on a large amount of training data and enormous computing power. If we can guide the network training with prior knowledge, we can make the network's feature learning ability stronger. We used traditional methods to extract the color and detail texture features of the cloud layer and generated attention weights that are added to the cloud detection network. This helps the network pay more attention to these features and enhances its feature learning ability.
In this section, we explain the cascaded attention module in detail. As shown in Fig. 2, the cascaded attention module contains two sub-modules, the color feature map attention module and the detail texture feature map attention module.

1) Color Feature Attention:
The color feature is one of the most significant visual features of an image and is strongly correlated with the scene displayed. In addition, the color feature is only weakly affected by the size, direction, and viewing angle of the image itself; hence, it is more robust. He et al. [63] found that, in most outdoor haze-free images, there are always some pixels (called dark channel pixels) that have a very low value in at least one of the three color channels. It can be seen from Fig. 3 that the cloud area of the dark channel image is still bright, while the non-cloud area is very dark, similar to the ground truth. This is because clouds generally have a higher reflectivity in the visible light bands. Therefore, we extracted the dark channel features of the image and constructed the color feature map attention module: after the dark channel feature is extracted from the original image, the attention feature map is extracted through the color feature map attention module. In the dark channel extraction, $f_{\mathrm{dark}}(x, y)$ denotes the dark channel image and $f(x, y, c)$ denotes the original remote-sensing image, which has three visible bands.
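As a rough illustration of the dark channel idea, the sketch below takes the per-pixel minimum over the three visible bands. This is an assumption about the form of the extraction: the dark channel prior of He et al. additionally takes a minimum over a local patch, and whether the formula used here includes such a window is not recoverable from the text. The function name and the sample pixel values are hypothetical.

```python
def dark_channel(image):
    """Per-pixel minimum over the color channels.

    image: H x W nested list whose entries are (r, g, b) tuples.
    Returns an H x W nested list. The local-patch minimum used in the
    original dark channel prior is intentionally omitted (assumption).
    """
    return [[min(pixel) for pixel in row] for row in image]


if __name__ == "__main__":
    cloud = (0.90, 0.95, 0.92)       # bright in every visible band
    vegetation = (0.10, 0.40, 0.05)  # at least one band is very dark
    print(dark_channel([[cloud, vegetation]]))
```

A cloud pixel stays bright after the minimum because all three bands are high, whereas a ground pixel usually has at least one low band and therefore turns dark, which matches the behavior described for Fig. 3.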
The structure of the color feature map attention module is given in Fig. 4 and works as follows. First, the dark channel feature map $f_{C}$ is compressed in the channel dimension: average pooling and maximum pooling are performed along the channel dimension to extract the average and maximum values across channels. Next, three dilated convolutions with different dilation rates are connected in parallel to further extract color features. By using dilated convolution, the receptive field can be enlarged without reducing the image resolution or increasing the amount of computation. The features from the different receptive fields are concatenated in the channel dimension and, after a convolution and Sigmoid activation, fused into the attention weight of the color feature map, $f_{C-Att}$. We used this attention weight to guide the encoder. Multiplying the encoder feature map $f_{E}$ pixel by pixel with $f_{C-Att}$, we obtained a feature map with assigned weights. So that the module retains the original encoder information, the weighted feature map is passed through a convolution, added to the original encoder feature map, and convolved again to obtain the output of the attention module, $f_{C-OUT}$.
2) Texture Feature Attention:
The texture feature describes the surface properties of the scene corresponding to the image. It is expressed by the gray-scale spatial distribution of pixels and their surrounding neighborhoods, and it reflects the slowly or periodically changing structure and arrangement of the surface of an object. High-resolution remote-sensing satellite images show rich and detailed information. As most of the particles that make up the cloud layer are similar and have uniform radiation characteristics, the cloud area in a remote-sensing image is generally smooth, with small gray value changes, strong continuity, and similar texture characteristics. In contrast, the texture details of ground features are more obvious because of their complex distribution. In remote-sensing images, texture information is therefore an important feature for identifying cloud and non-cloud areas, and the effective extraction of texture features helps to distinguish between them in cloud detection.
The NSCT helps to maintain the edge information and contour structure of the image. In the cloud detection of remote-sensing images, the NSCT can extract the edge contour of the cloud area. An attention mechanism that uses the texture features extracted by the NSCT can help the network distinguish cloud areas from non-cloud areas. At the same time, it can enhance the detection accuracy at the edges of cloud areas and improve the detection performance. In this article, we use the NSCT to perform a two-level decomposition and set the sub-band decomposition coefficients of the two levels to 2 and 4, respectively, as shown in Fig. 5.
It can be seen from Fig. 6 that the cloud area is very smooth, while the texture characteristics of the non-cloud area are obvious; the texture characteristics of cloud and non-cloud areas are very different. After the detail texture feature is extracted by the NSCT, the attention feature map is extracted through the detail texture feature map attention module, whose structure is shown in Fig. 7. Like the attention of the color feature map, the attention of the detail texture feature map is extracted in the spatial dimension, and we implement a structure similar to that of the color feature map attention module. The attention weight of the detail texture is extracted in the same way as in the color feature map attention module, using dilated convolutions with different dilation rates and the Sigmoid activation function. We use the attention weight obtained from the detail texture feature map to guide the encoder. We then multiply $f_{T-Att}$ pixel by pixel with the feature map $f_{C-OUT}$ at the encoder to obtain the feature map with attention weights. Because the cloud area is relatively smooth with similar texture features, whereas the texture features of the non-cloud area are richer, the texture attention focuses more on the texture features of the non-cloud area. To make the network pay more attention to the characteristics of the cloud region, the weighted feature map is passed through a convolution and then subtracted from the feature map of the original encoding end to obtain detail texture difference information. The attention module output $f_{T-OUT}$ is obtained after a final convolution.
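The data flow of the two cascaded branches can be sketched as below. This is a deliberately simplified illustration, not the module itself: every convolution is replaced by the identity, the attention maps are supplied by hand rather than learned, a single channel is used, and the texture branch is assumed to subtract from its own input $f_{C-OUT}$. All names and numbers are hypothetical.

```python
def gate(feature, attention):
    """Pixel-by-pixel product of a feature map and an attention map (H x W lists)."""
    return [[f * a for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(feature, attention)]


def combine(base, other, sign):
    """Element-wise base + sign * other."""
    return [[b + sign * o for b, o in zip(b_row, o_row)]
            for b_row, o_row in zip(base, other)]


def cascaded_feature_attention(f_e, att_color, att_texture):
    # Color branch: the weighted map is ADDED back, reinforcing bright (cloud) pixels.
    f_c_out = combine(f_e, gate(f_e, att_color), +1)
    # Texture branch: the weighted map is SUBTRACTED, because the texture
    # attention responds mostly to textured non-cloud pixels.
    f_t_out = combine(f_c_out, gate(f_c_out, att_texture), -1)
    return f_c_out, f_t_out


if __name__ == "__main__":
    f_e = [[1.0, 1.0]]          # one cloud pixel, one ground pixel
    att_color = [[0.9, 0.1]]    # cloud is bright in the dark channel
    att_texture = [[0.1, 0.8]]  # ground is strongly textured
    print(cascaded_feature_attention(f_e, att_color, att_texture))
```

With these made-up weights, the cloud pixel ends at roughly 1.71 and the ground pixel at roughly 0.22, showing how adding in the color branch and subtracting in the texture branch both push the response toward the cloud region.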

C. Channel Attention Module
In the encoder, the original image is subjected to operations such as convolution and pooling to generate a multi-channel feature map containing a variety of complex information. Each channel is a component extracted from the original image and contains a variety of feature information. Some channels contain more information that highlights the characteristics of the cloud area. This information helps the network segment the cloud area from the image and is key to completing the segmentation task. However, at the decoding end, the feature maps of these channels are regarded as equally important, which causes a certain degree of interference from useless information. We employ the channel attention mechanism to filter out irrelevant feature channels. Clouds of different sizes require different receptive fields. Large cloud areas require larger receptive fields to obtain richer semantic information, while small cloud areas should use smaller receptive fields. To deal with cloud areas of different sizes, we use parallel dilated convolutions to obtain different receptive fields and capture multi-scale information.
The structure of the channel attention module is presented in Fig. 8 and works as follows. First, the feature maps of the encoding end are passed through dilated convolutions with dilation rates of {1, 3, 5}, respectively, to obtain feature maps with different receptive fields. The corresponding elements of these feature maps are added together for feature fusion. Then, global average pooling and global maximum pooling are performed on the fused feature maps to obtain global information on each channel. The vectors generated by global maximum pooling and global average pooling contain the extracted high-level features, and using both pooling methods lets the model obtain relatively rich information. The two vectors are each transformed by a fully connected layer for feature extraction, added together, and normalized by the Sigmoid function to obtain the channel attention weight. We use the attention weights generated from the encoder feature map, which contains the shallow features, to guide the feature map of the decoder. The weight extracted from the encoder is multiplied with the feature map of the decoding end to obtain the reconstructed feature map.
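The pooling, fully connected, and re-weighting steps described above can be sketched as follows. Several simplifications are assumed: the parallel dilated convolutions and their fusion are skipped (the encoder feature is treated as already fused), one fully connected layer is shared by both pooled vectors, the weights are fixed by hand instead of learned, and the encoder and decoder maps have the same number of channels. All names are hypothetical.

```python
import math


def global_pool(feature):
    """feature: list of C channels, each an H x W nested list.
    Returns the global-average and global-max vectors (length C each)."""
    averages, maxima = [], []
    for channel in feature:
        values = [v for row in channel for v in row]
        averages.append(sum(values) / len(values))
        maxima.append(max(values))
    return averages, maxima


def dense(vector, weights, bias):
    """Fully connected layer: weights is a C_out x C_in nested list."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention_weights(encoder_feature, weights, bias):
    avg_vec, max_vec = global_pool(encoder_feature)
    # Shared layer applied to both pooled vectors (assumption), then add + Sigmoid.
    summed = [a + m for a, m in zip(dense(avg_vec, weights, bias),
                                    dense(max_vec, weights, bias))]
    return [sigmoid(s) for s in summed]


def apply_channel_attention(decoder_feature, channel_weights):
    """Scale each decoder channel by the weight derived from the encoder."""
    return [[[v * w for v in row] for row in channel]
            for channel, w in zip(decoder_feature, channel_weights)]


if __name__ == "__main__":
    encoder = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
    decoder = [[[2.0, 2.0], [2.0, 2.0]], [[2.0, 2.0], [2.0, 2.0]]]
    w = channel_attention_weights(encoder, [[2.0, 0.0], [0.0, 2.0]], [-1.0, -1.0])
    print(w)
    print(apply_channel_attention(decoder, w))
```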

D. Quadtree-Binary (QTB) Loss Function
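Binary cross-entropy (BCE) is usually used as the loss function in binary classification tasks. In its standard form, the BCE is

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $N$ is the number of pixels, $y_i \in \{0, 1\}$ is the ground-truth label of pixel $i$, and $p_i$ is the predicted probability that pixel $i$ is a positive sample. The output of the network is normalized to the range 0-1 by the Sigmoid function. A pixel is regarded as a positive sample if its probability value exceeds 0.5. In cloud detection, cloud pixels are regarded as positive samples and non-cloud pixels as negative samples.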
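For reference, the standard BCE and the 0.5 decision threshold can be written as the short sketch below. The function names and the clipping constant `eps` are illustrative choices, not part of the method described here.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean BCE over flat lists of predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def to_mask(probs, threshold=0.5):
    """A pixel is labeled cloud (1) only if its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]


if __name__ == "__main__":
    print(binary_cross_entropy([0.9, 0.1], [1, 0]))  # confident and correct: small loss
    print(binary_cross_entropy([0.5, 0.5], [1, 0]))  # undecided: loss equals log(2)
    print(to_mask([0.9, 0.5, 0.2]))
```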
In the cloud detection task, large cloud or large non-cloud areas are simple samples and are easy to detect. We want the network to focus more on the edges between cloud and non-cloud areas because these areas are difficult to detect and often have a greater impact on detection performance. The predicted value of cloud pixels in these areas is about 0.5, which is the main challenge of the cloud detection task. When simple samples dominate, training with the BCE function alone may not converge to the optimum.
Based on this, we design the quadtree loss. We introduce the quadtree structure into the loss function and refine the segmentation sub-regions on the real cloud mask. After the quadtree segmentation of the whole image is completed, pixels with the same value are grouped into the same sub-region. The probability values of the prediction image are then divided into sub-regions according to the quadtree segmentation result of the cloud mask. In the quadtree loss, the term $L_{QT}^{k}$ is the BCE computed on the $k$th sub-region obtained by the quadtree segmentation of the cloud mask, which represents the local detection accuracy between the prediction result and the ground truth. $L_{QT}$ is the quadtree loss, and $M$ is the number of sub-regions produced by the quadtree, that is, the size of the set obtained by the quadtree. After the BCE of each region is computed, the average over all regions is taken:

$$L_{QT} = \frac{1}{M}\sum_{k=1}^{M} L_{QT}^{k}$$

Fig. 9 is a simple example of calculating the quadtree loss. We first perform the quadtree segmentation on the mask to obtain segmented blocks, then calculate the cross-entropy for each segmented block, and finally calculate the average over all blocks. There are 16 pixels in the original image, all of which enter the BCE calculation. With the quadtree loss, the final result only needs to average ten values.
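The procedure can be sketched in a few lines. Two assumptions are made and should be read as such: the loss of a block is taken to be the mean per-pixel BCE inside that block, and the mask is square with a power-of-two side. The 4 × 4 mask in the usage example is hypothetical; it was chosen only because it also splits 16 pixels into ten blocks, like the example described for Fig. 9.

```python
import math


def quadtree_leaves(mask, top=0, left=0, size=None):
    """Recursively split a square binary mask until every block is homogeneous.
    Returns a list of (top, left, size) blocks."""
    if size is None:
        size = len(mask)
    values = {mask[r][c]
              for r in range(top, top + size)
              for c in range(left, left + size)}
    if len(values) == 1 or size == 1:
        return [(top, left, size)]
    half = size // 2
    leaves = []
    for d_row in (0, half):
        for d_col in (0, half):
            leaves += quadtree_leaves(mask, top + d_row, left + d_col, half)
    return leaves


def pixel_bce(p, y, eps=1e-7):
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def quadtree_loss(pred, mask):
    """Average over blocks of the (assumed) mean per-pixel BCE inside each block."""
    block_losses = []
    for top, left, size in quadtree_leaves(mask):
        losses = [pixel_bce(pred[r][c], mask[r][c])
                  for r in range(top, top + size)
                  for c in range(left, left + size)]
        block_losses.append(sum(losses) / len(losses))
    return sum(block_losses) / len(block_losses)


if __name__ == "__main__":
    mask = [[1, 1, 0, 0],
            [1, 1, 1, 0],
            [0, 0, 0, 0],
            [0, 1, 0, 0]]
    print(len(quadtree_leaves(mask)), "blocks")
```

Because each homogeneous block contributes one term regardless of its area, the many small blocks along cloud edges carry more of the total than they would in a plain per-pixel average.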
The advantage of the quadtree loss is that it focuses the loss function on the parts of the image with large edge changes that are difficult to distinguish. Compared with computing the loss over all pixels, the proportion of simple samples in the loss function is reduced. Using the quadtree segmentation to guide the loss function selectively can enhance the network performance by drawing the attention of the network to the samples that are difficult to detect. Fig. 10 shows quadtree segmentation results of the ground truth. The large cloud and non-cloud areas are divided into large blocks, while the edge area is densely covered with many small blocks. Therefore, when calculating the quadtree loss, the proportion of these edge-region samples in the loss function increases.
We use BCE and the quadtree loss together so that the network converges better and the edge detection improves at the same time. The final loss function is named the quadtree-binary (QTB) loss:
$$L_{QTB} = \gamma_1 L_{BCE} + \gamma_2 L_{QT}$$
where $\gamma_1$ and $\gamma_2$ represent the weights of $L_{BCE}$ and $L_{QT}$, respectively, and can be adjusted for different data. The final loss function automatically adjusts the influence of samples with different degrees of difficulty. At the same time, treating each homogeneous region as a whole is equivalent to adjusting the proportion of this type of sample, which mitigates the inter-class competition caused by the uneven proportion among samples.
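The procedure described above (quadtree segmentation of the mask, BCE per block, average over blocks, then a weighted sum with plain BCE) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the mask is assumed square with a power-of-two side, a block is split until it is homogeneous, BCE is averaged inside each block, and the clipping constant `EPS` is an assumed value. The default weights 0.9 and 0.1 are the ones reported in the implementation details.

```python
import numpy as np

EPS = 1e-7  # assumed clipping constant that keeps log() finite


def quadtree_blocks(mask, r0=0, c0=0, size=None):
    """Return (row, col, size) blocks on which a square binary mask is constant."""
    if size is None:
        size = mask.shape[0]
    block = mask[r0:r0 + size, c0:c0 + size]
    # Stop splitting once the block is homogeneous or is a single pixel.
    if size == 1 or block.min() == block.max():
        return [(r0, c0, size)]
    half = size // 2
    blocks = []
    for dr in (0, half):
        for dc in (0, half):
            blocks += quadtree_blocks(mask, r0 + dr, c0 + dc, half)
    return blocks


def bce(y, p):
    """Mean binary cross-entropy over all elements of y and p."""
    p = np.clip(p, EPS, 1.0 - EPS)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))


def quadtree_loss(mask, prob):
    """Average of the per-block BCE values (assumed within-block mean)."""
    blocks = quadtree_blocks(mask)
    per_block = [bce(mask[r:r + s, c:c + s], prob[r:r + s, c:c + s])
                 for r, c, s in blocks]
    return float(np.mean(per_block))


def qtb_loss(mask, prob, gamma1=0.9, gamma2=0.1):
    """Weighted sum of plain BCE and the quadtree loss."""
    return gamma1 * bce(mask, prob) + gamma2 * quadtree_loss(mask, prob)


# A 4 x 4 mask whose quadtree has two 2 x 2 blocks and eight 1 x 1 blocks.
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0],
                 [1, 0, 0, 0],
                 [0, 0, 0, 0]])
# Confident, correct prediction everywhere except one edge pixel.
prob = np.where(mask == 1, 0.9, 0.1)
prob[0, 2] = 0.1
print(len(quadtree_blocks(mask)), bce(mask, prob), quadtree_loss(mask, prob))
```

In this toy case the 16 pixels collapse to ten blocks, and the single misclassified edge pixel carries one tenth of the quadtree loss instead of one sixteenth of the plain BCE, so the quadtree loss is the larger of the two.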

A. Dataset
The dataset used in the experiment is obtained from the Gaofen-1 satellite. The satellite was launched in 2013 and is equipped with two panchromatic cameras and four multispectral cameras. It can achieve an imaging width of more than 800 km with a resolution of 16 m. Cloud detection on Gaofen-1 data is challenging because the wide field-of-view (WFV) cameras carried by Gaofen-1 provide only three visible light bands and one near-infrared band. Achieving good segmentation results from such limited spectral information is a challenging and meaningful research problem.
We utilize the GF-1 wide field-of-view (WFV) dataset provided by Li et al. [9]. This dataset includes 108 images collected from all over the world and covers different geomorphic environments, including urban, barren, snow, vegetation, and water. The resolution of the images in the dataset is 16 m, and each image has four bands: R, G, B, and NIR. Table II shows the bands and resolution of Gaofen-1. The dimensions of each image are approximately 17 000 × 16 000 × 4. We selected 86 scenes as the training set for the experiment and the remaining 22 as the test set.
The black area around each scene does not contain any remote-sensing information and is not helpful for feature extraction. To remove it, all images are rotated and cropped to 11 264 × 11 264. Due to the limitation of hardware resources, each scene is then cut into patches of 512 × 512 × 4. In the end, 41 624 patches were used for training, and 10 648 patches were used for testing.
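The patch counts follow directly from the scene size: 11 264 / 512 = 22, so each scene yields 22 × 22 = 484 patches, giving 86 × 484 = 41 624 training patches and 22 × 484 = 10 648 test patches. The tiling step can be sketched in NumPy as below; the function name `tile_scene` is hypothetical, and the sketch assumes non-overlapping patches, which is consistent with the reported counts. A small array stands in for a real scene to keep the example light.

```python
import numpy as np


def tile_scene(scene, tile=512):
    """Split an (H, W, C) array into non-overlapping tile x tile patches."""
    h, w, c = scene.shape
    if h % tile or w % tile:
        raise ValueError("scene size must be a multiple of the tile size")
    rows, cols = h // tile, w // tile
    # Separate the block index from the within-block index on each axis,
    # then bring the two block indices to the front.
    patches = scene.reshape(rows, tile, cols, tile, c).swapaxes(1, 2)
    return patches.reshape(rows * cols, tile, tile, c)


# Arithmetic check against the numbers reported in the text.
tiles_per_scene = (11264 // 512) ** 2
print(tiles_per_scene, 86 * tiles_per_scene, 22 * tiles_per_scene)

# Small stand-in for a four-band scene: 8 x 8 pixels cut into 4 x 4 patches.
demo = np.arange(8 * 8 * 4).reshape(8, 8, 4)
print(tile_scene(demo, tile=4).shape)
```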

B. Evaluation Metrics
To evaluate the algorithm objectively, we use OA [64], Precision [65], Recall [66], F1-Score [25], Kappa [67], and FAR [64], [68] to evaluate the results. These metrics are calculated from the following quantities. TP denotes the number of cloud pixels correctly identified as cloud pixels, and TN denotes the number of non-cloud pixels correctly identified as non-cloud pixels. FP and FN represent the incorrect detection results: FP denotes the number of false positives (non-cloud pixels detected as cloud), and FN denotes the number of false negatives (cloud pixels detected as non-cloud). P and N denote the number of cloud pixels and non-cloud pixels, respectively. To avoid a zero denominator, we add a very small number $\epsilon = e^{-10}$ to any term whose denominator may be 0.
These metrics are calculated on each full scene of 11 264 × 11 264, and the final result is obtained by averaging over all scenes in the test set.
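The evaluation described above can be sketched as follows, using the common confusion-matrix definitions of these metrics. Several points are assumptions rather than the authors' stated formulas: the function names are hypothetical, FAR is taken as FP / (FP + TN), Kappa is computed as Cohen's kappa, the small constant is read as exp(-10), and it is added to every denominator that could be zero.

```python
import numpy as np

EPS = np.exp(-10.0)  # the text gives e^-10; reading it as exp(-10) is an assumption


def cloud_metrics(pred, truth):
    """Compute per-scene metrics from binary prediction and ground-truth masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = float(np.sum(pred & truth))    # cloud detected as cloud
    tn = float(np.sum(~pred & ~truth))  # non-cloud detected as non-cloud
    fp = float(np.sum(pred & ~truth))   # non-cloud detected as cloud
    fn = float(np.sum(~pred & truth))   # cloud detected as non-cloud
    total = tp + tn + fp + fn
    oa = (tp + tn) / total
    precision = tp / (tp + fp + EPS)
    recall = tp / (tp + fn + EPS)
    f1 = 2 * precision * recall / (precision + recall + EPS)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (oa - pe) / (1 - pe + EPS)
    far = fp / (fp + tn + EPS)  # assumed definition: FP / (FP + TN)
    return {"OA": oa, "Precision": precision, "Recall": recall,
            "F1": f1, "Kappa": kappa, "FAR": far}


def average_over_scenes(per_scene):
    """Average each metric over a list of per-scene metric dictionaries."""
    return {k: float(np.mean([m[k] for m in per_scene])) for k in per_scene[0]}


scene_a = cloud_metrics([1, 1, 0, 0], [1, 0, 0, 1])
scene_b = cloud_metrics([1, 0, 0, 0], [1, 0, 0, 0])
print(average_over_scenes([scene_a, scene_b]))
```

Averaging per-scene values, rather than pooling all pixels into one confusion matrix, gives every scene equal weight regardless of its cloud fraction.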

C. Implementation Details
In this study, all the experiments were programmed and implemented on Ubuntu 16.04. The implementation of the models is based on Python 3.6 and uses the Keras 2.2.4 and TensorFlow 1.12 deep learning frameworks. The models are trained and evaluated on an NVIDIA GeForce RTX 2080 Ti. The network uses the Adam optimization algorithm with the learning rate set to 0.00001 in the training stage. For the BGR-NIR images, the batch size is set to 2, and the number of iterations is 30. The pixel values were normalized to the range between 0 and 1. In the QTB loss function, $\gamma_1 = 0.9$ and $\gamma_2 = 0.1$; these values were obtained through experimental testing.

A. Evaluation of CFAM
We performed a series of experiments to verify the effectiveness of the CFAM. We used different combinations of the color feature attention module and the texture feature attention module for training and testing. To verify the performance of the CFAM, we cascaded the color feature attention module and the texture feature attention module to form the final cascaded feature attention network (CFAN). To verify the effectiveness of color feature extraction, we added color feature attention to the encoder, and the attention weight was then added to the coding network through concatenation along the channel dimension. This network is called the color feature network (CFN). Then, we used texture feature attention in the coding network to design the texture feature network (TFN). The structure of TFN is similar to that of CFN; the difference is that the texture feature attention module replaces the color feature attention module. We trained and tested the three networks, CFN, TFN, and CFAN, and compared the results with the basic encoder-decoder network U-Net [37]. The performance of these networks is given in Table III.
As shown in Table III, the color feature attention module and the texture feature attention module improve the performance of the network significantly. Specifically, the texture feature attention module has improved the OA and Precision. The OA has reached 97.22%, and Precision has increased by 4.24% compared to U-Net. In addition, F1-Score, Kappa, and FAR have also been improved. In other words, the main function of the texture feature attention module is to improve the detection accuracy of cloud pixels.
As for the color feature attention module, the results show that the Recall has been improved significantly. This means that CFN misses fewer cloud pixels, that is, it correctly identifies a larger share of the true cloud pixels. This is due to the color feature extraction based on the dark channel prior.
In the CFAN, we used both the color feature attention module and the texture feature attention module. The CFAN generally shows better performance than the CFN and TFN. Its OA has increased to 97.35%, its F1-Score has improved to 91.53%, and its Kappa has reached 87.31%. This shows that cascading the color feature attention module and the texture feature attention module is more effective than using a single module. A large number of experiments have also shown that the proposed CFAN is highly stable.
Therefore, it can be considered that the CFAM has shown excellent performance in the cloud detection of remote-sensing images by extracting color and texture features.

B. Evaluation of CA
The channel attention module is used to distinguish useful channel information from useless channel information. We used channel attention to assist in the interpretation of information at the decoder, so we added the channel attention module before the up-sampling of the decoder. The network was marked as CA-Net. We used BCE as the loss function of the network and compared the performance of CA-Net with that of U-Net.
Moreover, we compared the network performance with different numbers of CA modules. At the decoding end, there are five feature layers in total, which go through four stages of up-sampling and channel reduction. We set the number of CA modules to {1, 2, 3, 4}. The results are listed in Table IV. They show that the model performs better as the number of channel attention modules increases. When CA modules are used before every up-sampling stage, the OA of the network reaches 97.31%, and the F1-Score reaches 90.17%. At the same time, FAR remains at a low level. This indicates that the proposed CA module can assist the decoder in interpreting the information.

C. Evaluation of CFCA-Net
The CFCA-Net is based on the encoder-decoder structure. It constructs the Dark&NSCT subnet on the encoder, uses the multi-scale CFAM, and adds the channel attention module at the decoding end. The Dark&NSCT subnet extracts the color and texture features and feeds them through the CFAM into the encoder so that the network pays more attention to color and texture features. The channel attention module strengthens the fusion of the channel dimension information at the decoder. The QTB loss function and the BCE loss are also compared.
1) Analysis of CFCA-Net: Fig. 11 shows the detection results of this method and the comparison algorithms in different land-cover scenarios. We selected five representative scenes: urban, water, barren, snow, and vegetation. From the area marked by the red box in the figure, it can be seen that our algorithm shows better thin cloud detection performance. In the scene in the first column, there is a thin cloud on the left side of the image that resembles a mountain range. Our algorithm detected this area.
In the water scene in the second column, there is a thin cloud at the top of the picture that is visually indistinguishable from the underlying water. The comparison algorithms show different degrees of missed detection and ADUI-Net over-detects, whereas our algorithm has a better detection performance.
In the snow scene presented in the fourth column, there are a large number of thin cloud areas, and only ADUI-Net and the method in this article have detected those thin cloud areas. The thin clouds do not completely cover the ground objects in the background, so they are easily confused with the underlying surface, and the detection is extremely difficult. This reflects the strong thin-cloud and edge detection performance of our algorithm.
In the third scene and the last scene in particular, a thin cloud can be seen in the area marked by the blue frame in the RGB image. Due to the contrast of colors on the underlying surface, it is difficult to distinguish the boundary of the cloud, even with the naked eye. Our algorithm detected the thin cloud area correctly. Compared with ADUI-Net, the detection performance for thin clouds is further improved owing to the use of NSCT and the quadtree-binary loss in this method. The metrics of these networks are listed in Table V. After using QTB loss, the OA increases by 0.1% and the Precision increases by 0.61% compared with BCE. This shows that QTB loss can effectively improve the accuracy of cloud detection models.
[Figure caption: in (c) and (d), cyan pixels are cloud pixels correctly detected as cloud (TP), black pixels are non-cloud pixels correctly detected as non-cloud (TN), yellow pixels are non-cloud pixels incorrectly detected as cloud (FP), and purple pixels are cloud pixels incorrectly detected as non-cloud (FN).]
2) Analysis of QTB Loss: Fig. 12 shows the detection results obtained with QTB loss and BCE loss. As shown in Fig. 12(1c), BCE loss produces many false detections in the edge area. Fig. 12(1a) shows that the edge of the cloud is blurry. In the lower left corner in particular, the boundary between the cloud edge and the underlying surface is not clear. After training with the QTB loss, Fig. 12(1d) shows that the false detections at the edge have decreased. Most of the edges are detected well, although there is still a small range of false detections at some edges. This means that QTB loss can improve the cloud detection effect in the edge area. In the case of thin cloud, as shown in Fig. 12(2a), the underlying surface information is mixed with the cloud and is difficult to distinguish. The model trained with BCE loss failed to detect this cloud area, as shown in Fig. 12(2c), whereas the model trained with QTB loss detected this thin cloud. As shown in Fig. 12(2d), although there are still false detections at the edge of the thin cloud, the main part of the cloud is correctly detected. Fig. 13 shows the convergence during training. The convergence speed of QTB loss in the first few epochs is significantly higher than that of BCE loss, and the final OA is also higher than that of BCE loss. Therefore, we can conclude that QTB loss improves the detection of thin clouds and edges and speeds up convergence.
We also compared the cloud detection results in various scenarios, as listed in Table VI. From these results, we can see that the proposed algorithm performs better in most scenarios, while its performance in some scenarios is slightly worse than that of ADUI-Net. However, the overall performance on the entire test set is better than that of ADUI-Net.
At the same time, because ADUI-Net uses Up-Down blocks with numerous parameters, the network has a huge number of parameters and calculations. To evaluate the complexity of the networks, we counted the parameters and floating-point calculations of each model, as shown in Table VII and Fig. 14. RS-Net and SegNet have few parameters, but because the models are simple, their detection performance is not outstanding. DeepLab v3+ uses ASPP, and therefore its parameter count is relatively large. However, its detection performance is not good, which shows that merely increasing the complexity of the network cannot improve the performance of cloud detection. ADUI-Net uses the Up-Down block and the wavelet transform, designed around the characteristics of clouds, to extract cloud texture features, and it shows high detection performance. Because the Up block and the Down block perform a large number of convolution operations on the feature map to extract features, the number of network parameters increases rapidly. Our CFCA-Net also uses texture features, but our cascaded attention module uses fewer convolutions to extract color and texture features effectively. Therefore, while our method achieves higher detection performance, its parameter count and complexity remain at a relatively low level. Our algorithm achieves the same or even better performance than ADUI-Net, while its parameter count is only 30% of the latter. Therefore, we can conclude that our method is advanced compared with other cloud detection methods.

VI. CONCLUSION
With the development of deep learning theory, convolutional neural networks have been used in remote-sensing image cloud detection research and have achieved great results. Especially for remote-sensing images with few spectral bands, cloud detection methods based on deep learning can extract more useful information from the limited bands than traditional methods. However, feature extraction using a convolutional neural network carries redundant information. This information does not help in the detection of cloud regions and leads to false detections that affect the performance of the network. In view of the large difference in color and texture features between the cloud region and the underlying surface, this article proposes a cascaded feature attention module to extract the color and texture features of the cloud region. This article also designs a channel attention module to remove the redundant information and retain useful information. Moreover, this article optimizes the loss function to improve the performance of edge detection.
For GF-1 WFV images, the multi-scale cascaded feature attention module and the multi-scale channel attention module proposed in this article significantly improve the detection accuracy of thin cloud. To further evaluate the effectiveness of the proposed algorithm, we compared it with SegNet, DeepLabV3+, RS-Net, and ADUI-Net, and our algorithm shows better performance. On the Gaofen-1 WFV dataset, the overall accuracy of our method reached 97.55%. Subjective cloud detection results also proved the effectiveness of our algorithm.