A Novel Spectral Indices-Driven Spectral-Spatial-Context Attention Network for Automatic Cloud Detection

Cloud detection is a fundamental step in optical satellite image applications. Existing deep learning methods can provide accurate cloud detection results; however, their performance relies on a large number of labeled samples, whose collection is time-consuming and costly. In addition, cloud detection is challenging in high-brightness scenes because clouds and high-brightness objects have similar spectral features. In this study, we propose a cloud-index-driven spectral-spatial-context attention network (SSCA-net) for cloud detection that requires no manual collection of labeled samples and improves detection accuracy in high-brightness scenes. Label samples are automatically generated from a cloud index by dual thresholding and are then expanded to improve the completeness of the cloud mask labels. We designed SSCA-net with a spectral-spatial-context aware module and a spectral-spatial-context information aggregation module, aiming to improve the accuracy of cloud detection in high-brightness scenes. The results show that the proposed SSCA-net achieved good performance, with an average overall accuracy of 97.69% and an average kappa coefficient of 92.71% on the Sentinel-2 and Landsat-8 datasets. This article provides fresh insight into how advanced deep attention networks and cloud indexes can be integrated to obtain highly accurate cloud detection in high-brightness scenes.


I. INTRODUCTION
With the development of remote sensing imaging, optical remote sensing imagery has been widely used for monitoring land surface change [1], mapping land use and land cover [2], and estimating biophysical parameters [3]. However, clouds cover about 60% of the Earth's surface, especially in humid tropical and subtropical regions [4]. The imaging process of remote sensing is frequently affected by clouds, which reduces the availability of optical images for applications [5]. Therefore, accurate and robust cloud detection algorithms are essential to improve the availability of optical remote sensing images, and cloud detection is a necessary preprocessing step for optical satellite image applications.
So far, a large number of methods have been developed to detect clouds in optical satellite images [6], [7], [8]. These methods can be roughly divided into two main categories: spectral threshold-based methods and machine learning-based methods. Spectral threshold methods are simple and usually apply spectral criteria to detect clouds; they have been widely used since the appearance of tasseled-cap-transformation cloud detection [9]. Zhu and Woodcock [10] proposed the function of mask (Fmask) method based on cloud physical properties to detect clouds and cloud shadows in Landsat images. Li et al. [11] designed a multifeature combined algorithm based on spectral thresholds to recognize cloud and clear-sky pixels in Gaofen-1 images. Zhai et al. [12] developed a unified cloud detection method based on a cloud index (CI) to generate cloud masks for Landsat-8 images. These spectral threshold methods have proven unsuitable for cloud detection in high-brightness scenes (such as rock and snow/ice) because clouds and high-brightness objects have similar spectral features. In addition, the performance of spectral threshold methods depends on the choice of threshold.
In recent years, machine learning-based methods, especially deep learning methods, have provided effective techniques for cloud detection. Latry et al. [13] used a support vector machine to detect clouds in MODIS images. Chen et al. [14] exploited multiple convolutional neural networks (CNNs) to produce cloud masks for high-resolution remote sensing imagery. Wang et al. [15] used a novel CNN to recognize clouds and snow in Gaofen-1 multispectral images. Chen et al. [16] developed a cloud detection method based on a 3D-CNN to generate cloud masks for ZY-3 imagery. Xia et al. [17] designed a multidimensional deep residual network (M-ResNet) based on CNNs to detect clouds in multispectral satellite imagery of high-brightness scenes. Machine learning-based methods, especially deep learning methods, can achieve good accuracy in detecting cloud areas. Nevertheless, they rely on a large number of cloud label samples, whose acquisition is time-consuming and costly, which limits their application in cloud detection tasks. In addition, the above-mentioned methods cannot capture the diverse appearances of clouds in highly mixed scenes.
The main disadvantages of the above-mentioned methods are summarized as follows.
1) The performance of spectral threshold methods depends on the choice of threshold, but these thresholds are based on human experience and professional knowledge, and it is difficult to detect clouds in high-brightness scenes.
2) Machine learning-based methods, especially deep learning methods, require a large number of cloud label samples, but large sets of cloud labels are not always available.
3) Machine learning-based methods cannot capture the diverse appearances of clouds in complex scenes (such as cloud-snow coexistence scenes and highly mixed scenes).

To address the above-mentioned problems, we propose a CI-driven spectral-spatial-context attention network (SSCA-net) for cloud detection in high-brightness scenes that requires no manual collection of labels. The main contributions of this article are summarized as follows.
1) We established SSCA-net with a spectral-spatial-context aware module and a spectral-spatial-context information aggregation module, aiming to improve the accuracy of cloud detection in high-brightness scenes.
2) We assess the accuracy of the CI-driven SSCA-net on Landsat-8 and Sentinel-2 images from different regions of the globe and demonstrate that the proposed CI-driven SSCA-net achieves good cloud detection performance, dealing well with cloud heterogeneity and confusion with high-brightness objects.

The rest of this article is organized as follows. Section II reviews deep learning-based cloud detection work. Section III introduces the details of the proposed CI-driven SSCA-net. Section IV analyzes the experimental results. Finally, Section V concludes this article.

II. RELATED WORK
In this section, we review cloud detection work related to the proposed SSCA-net.

A. Deep Learning-Based Cloud Detection Methods
Nowadays, with the development of deep learning models such as CNNs, generative adversarial networks (GANs), and recurrent neural networks, deep learning has been successfully used in many remote sensing image processing tasks, such as cloud detection [16], land cover mapping [19], and semantic segmentation [20]. Deep learning-based algorithms have shown better performance than traditional algorithms for cloud detection in high-brightness land cover scenes [17], [18]. Their advantage over traditional algorithms lies in their ability to automatically extract cloud semantic features [7], [18].
Deep CNNs are widely used in cloud detection tasks. To explore their performance, Xie et al. [6] combined a deep CNN model with the simple linear iterative clustering (SLIC) method to extract clouds from remote sensing images and achieved good cloud detection results. Mohajerani and Saeedi [21] used a fully convolutional network to detect clouds in Landsat-8 images at the object level. Luotamo et al. [22] designed an object segmentation-based method with two cascaded CNNs for cloud detection in Sentinel-2 multispectral images and achieved accurate cloud detection results. Liu et al. [23] applied a superpixel-based cloud detection method with a CNN to the cloud detection task. However, the performance of these algorithms relies on the accuracy of SLIC segmentation, which severely limits their application scenarios in cloud detection work.
To overcome the above-mentioned limitations, many works regard cloud detection as an end-to-end semantic segmentation task and design dedicated deep CNNs [24], [25], [26], [27]. Guo et al. [24] designed a cloud detection neural network with an encoder-decoder structure to detect clouds in ZY-3 satellite images. Shao et al. [25] proposed an encoder-decoder multiscale-feature convolutional neural network for cloud detection in Landsat-8 satellite imagery. Li et al. [26] designed a multiscale convolutional feature fusion architecture to detect clouds in remote sensing images from different sensors and achieved an overall accuracy (OA) of 96.83%. Chai et al. [27] used a deep CNN to extract multilevel spatial and spectral features of clouds and provided even more accurate cloud detection results. Deep CNNs can achieve accurate cloud detection results. However, these approaches often require labeled cloud samples for training, which are very time-consuming and costly to collect.
With the development of GAN technology, many unsupervised domain adaptation (UDA) methods based on GANs, which require no labeled samples, have been successfully used for cloud detection [28], [29], [30], [31]. Guo et al. [29] used UDA to reduce the labeling cost and achieved accurate cloud detection in ZY-3 multispectral images. Li et al. [30] designed a hybrid weakly supervised cloud detection method based on the synergistic combination of a GAN and a cloud distortion model to detect clouds in Landsat-8 images, achieving an OA of 90.20%. Li et al. [28] designed a weakly supervised deep learning framework for detecting clouds and achieved an OA of 96.66% on Gaofen-1 images. However, UDA methods are difficult to apply to cloud detection in multisource remote sensing images because multisource satellite sensors have large domain discrepancies in spectral and resolution aspects. In addition, deep learning achieves lower accuracy for cloud detection in high-brightness land cover scenes because the above-mentioned methods cannot simultaneously capture the spectral-spatial-context of clouds.

B. Attention Mechanism
The attention mechanism is widely used in computer vision, especially for remote sensing image segmentation [31], [32], [33], [34]. Roy et al. [31] designed an attention-based adaptive spectral-spatial kernel to capture discriminative spectral-spatial features for hyperspectral image classification. Lu et al. [32] used an attention-based neural encoder-decoder framework for image captioning. Fang et al. [33] proposed attention-in-attention networks to extract a discriminative pedestrian representation for person retrieval. Peng et al. [34] designed a dense-attention convolutional neural network for change detection in optical remote sensing images. Nevertheless, these methods focus only on the appearance of clouds and ignore the context information in optical remote sensing images. In addition, the above-mentioned attention networks cannot simultaneously capture the spectral-spatial context of clouds from input images. Therefore, we propose SSCA-net to capture discriminative spectral-spatial-context features for cloud detection.

III. METHODOLOGY

A. SSCA-Net as Cloud Detection Model
Traditional attention networks used in remote sensing image segmentation studies focus on spectral attention, spatial attention, or both [35]. Spectral-spatial-context information is a crucial discriminative feature for cloud detection in optical remote sensing images [36]. However, traditional attention networks cannot simultaneously capture the spectral-spatial-context of clouds. Therefore, we propose SSCA-net as a cloud detection model to extract clouds from the input images. The architecture of SSCA-net is shown in Fig. 1.
SSCA-net takes a remote sensing image as input and outputs a cloud detection result end to end. The proposed SSCA-net consists of two parts: 1) a spectral-spatial-context aware module; and 2) a spectral-spatial-context information aggregation module. The spectral-spatial-context aware module extracts spectral-spatial-context information of clouds at different scales. The spectral-spatial-context information aggregation module fuses the features from the aware module, further improving the accuracy of cloud detection. The spectral-spatial-context aware module consists of three branch attention networks: 1) a spectral attention network; 2) a spatial attention network; and 3) a context attention network. The spectral attention network extracts the spectral features of clouds from the input images, the spatial attention network captures their spatial features, and the context attention network extracts their context features. It should be noted that the cloud detection model is an encoder-decoder framework, which can capture discriminative spectral-spatial-context features for cloud detection.
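As one concrete illustration of a branch of this kind, a squeeze-and-excitation-style channel attention can serve as a minimal sketch of spectral attention. This is a hypothetical sketch, not the paper's exact branch design: the layer shapes, weight names, and the ReLU/sigmoid choices are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spectral_attention(feat, w1, w2):
    """Squeeze-excite-style channel (spectral) attention sketch.

    feat: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r) are
    hypothetical bottleneck weights (reduction ratio r is an assumption).
    """
    squeeze = feat.mean(axis=(1, 2))                       # global average pool -> (C,)
    excite = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))   # per-band weights in (0, 1)
    return feat * excite[:, None, None]                    # reweight each spectral band
```

With zero weights, every band receives the neutral weight sigmoid(0) = 0.5, which makes the reweighting behavior easy to check.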

B. Cloud Index Driven Cloud Mask Generation
Ground-truth cloud labels are necessary for supervised CNN cloud detection [37]. Therefore, we show how to generate accurate cloud labels from the CI, which serve as the truth labels for training SSCA-net. The label generation process consists of two steps. A dual threshold is first used to generate an initial cloud mask with precise cloud and noncloud labels. Some cloud pixels are left unlabeled (ignored) in the initial cloud mask and cannot be used for training SSCA-net. Hence, the initial cloud mask is then iteratively expanded, assigning cloud labels to the ignored cloud pixels and producing a more complete cloud mask. The workflow of the proposed CI-driven SSCA-net algorithm is shown in Fig. 2.

1) Initial Cloud Mask Generation: Clouds usually appear white compared with other land cover materials in optical remote sensing images [38]. In other words, the reflectance of cloud pixels is relatively high in the visible, near-infrared, and short-wave infrared (SWIR) bands [10]. This means that the reflectance of cloud-contaminated regions is higher than that of other land covers in the above-mentioned band ranges. Based on this cloud spectral characteristic, the normalized CI (NCI) is defined to generate initial cloud mask results. Following the cloud spectral features in [12], for multispectral remote sensing images, the NCI can be expressed as

NCI = (B_B - B_SWIR) / (B_B + B_SWIR)

where B_B and B_SWIR denote the blue band and the SWIR band, respectively.
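Assuming the two-band normalized-difference form of the index (the exact formula is reconstructed here from the stated band definitions), the NCI can be computed per pixel as follows; the function and variable names are illustrative:

```python
import numpy as np

def normalized_cloud_index(blue, swir):
    """Normalized cloud index from blue and SWIR reflectance.

    Sketch of the normalized-difference form NCI = (B_B - B_SWIR) / (B_B + B_SWIR),
    applied elementwise to reflectance arrays.
    """
    blue = np.asarray(blue, dtype=np.float64)
    swir = np.asarray(swir, dtype=np.float64)
    denom = blue + swir
    # Guard against division by zero over pixels with no signal in either band.
    return np.where(denom > 0, (blue - swir) / np.maximum(denom, 1e-12), 0.0)
```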
The NCI can generate accurate cloud masks provided a good threshold is adopted. When a higher threshold is set, the majority of detected cloud areas are accurate but incomplete; when a lower threshold is set, the cloud areas in the mask are more complete but less accurate. Therefore, a dual-threshold method composed of a high threshold (HT) and a low threshold (LT) is used in this study to generate the initial cloud mask. The dual threshold aims to avoid the influence of ambiguous pixels (such as high-brightness object pixels and cloud pixels) that are difficult to separate by the CI. For the initial cloud mask, we set two safe thresholds to ensure good accuracy for both cloud and noncloud labels. The initial cloud mask M is generated from the NCI as follows: for each pixel i,

M(i) = cloud, if NCI(i) >= HT; noncloud, if NCI(i) <= LT; ignored, otherwise.

Based on an empirical threshold analysis, we set HT and LT to 0.6 and 0.5 for Sentinel-2 images and to 0.4 and 0.3 for Landsat-8 images, respectively. The ignored pixels are treated as negative noise and excluded when computing losses for training SSCA-net.
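The dual-threshold rule above can be sketched as a small numpy routine. The label encoding (1 = cloud, 0 = noncloud, -1 = ignored) is an illustrative choice, not one specified by the paper:

```python
import numpy as np

CLOUD, NONCLOUD, IGNORED = 1, 0, -1

def initial_cloud_mask(nci, ht, lt):
    """Dual-threshold labeling of an NCI array.

    Pixels with NCI >= ht are labeled cloud, pixels with NCI <= lt are
    labeled noncloud, and pixels in between are left ignored so that they
    contribute nothing to the training loss.
    """
    nci = np.asarray(nci)
    mask = np.full(nci.shape, IGNORED, dtype=np.int8)
    mask[nci >= ht] = CLOUD
    mask[nci <= lt] = NONCLOUD
    return mask
```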
2) Postprocessing the Initial Mask: The initial cloud mask is used for training the cloud detection model, which can output cloud detection results with higher accuracy than the initial cloud mask itself because of the powerful self-learning ability of SSCA-net. However, the ignored pixels, which carry no labels, cannot be used for training SSCA-net. Hence, iterative expansion is applied as a postprocessing step to further enhance the completeness of the initial cloud mask.
In the iterative mask-expansion step, we set an expansion threshold of (HT + LT)/2 to generate an average cloud mask from the CI. The average cloud mask is used to update the initial cloud mask in combination with the cloud mask output by SSCA-net (trained with the initial cloud mask). The cloud mask output by SSCA-net serves as the set of expansion seeds. For each connected cloud component in the average cloud mask, if any pixel of the component is marked as cloud in the seeds, the whole component is taken as cloud and used to update the corresponding pixels in the initial cloud mask. It should be noted that only ignored pixels can be updated to cloud pixels; noncloud pixels are left unchanged. A cloud mask with more complete clouds prompts SSCA-net to extract more cloud information and output more clouds, which become more complete in the next iteration. During the update process, the ratio of cloud to noncloud pixel counts is calculated for each expanded cloud mask, and the iteration stops when this ratio changes by less than a set threshold.
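One expansion pass of the seed-based rule above can be sketched as follows. The 4-connectivity and the 1/0/-1 label encoding are illustrative assumptions; the paper does not specify either:

```python
import numpy as np
from collections import deque

def expand_cloud_mask(initial, avg_mask, seeds):
    """One seed-based expansion pass over the average cloud mask.

    For each 4-connected cloud component in avg_mask, if any pixel of the
    component is cloud (1) in the SSCA-net seed output, the whole component
    is promoted to cloud in the initial mask. Only ignored pixels (-1) are
    updated; noncloud labels (0) are kept unchanged.
    """
    h, w = avg_mask.shape
    visited = np.zeros((h, w), dtype=bool)
    out = initial.copy()
    for i in range(h):
        for j in range(w):
            if avg_mask[i, j] != 1 or visited[i, j]:
                continue
            # Collect one connected cloud component by breadth-first search.
            comp, hit_seed = [], False
            queue = deque([(i, j)])
            visited[i, j] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                hit_seed = hit_seed or seeds[y, x] == 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and not visited[ny, nx] \
                            and avg_mask[ny, nx] == 1:
                        visited[ny, nx] = True
                        queue.append((ny, nx))
            if hit_seed:
                for y, x in comp:
                    if out[y, x] == -1:  # promote ignored pixels only
                        out[y, x] = 1
    return out
```

In practice this pass would be repeated, retraining SSCA-net on the expanded mask each iteration until the cloud/noncloud ratio stabilizes.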

C. Final Training the SSCA-Net
The final expanded cloud mask is taken as the final cloud label for training SSCA-net after the mask expansion. At this point, the final expanded cloud masks have high accuracy and completeness in terms of cloud labels. To obtain reliable cloud detection results, we use the mean squared error as the cloud detection loss function, expressed as

L = (1/N) * sum_{j=1}^{N} (X_j - Y_j)^2

where X_j is the predicted cloud detection result, Y_j is the truth cloud label, and N is the number of cloud labels. Note that this loss is used only for the final training of SSCA-net and is not adopted in the iterative mask-expansion step, because the mean squared error can introduce some commission errors there. In general, the proposed method consists of two steps: 1) the final expanded cloud mask is generated as ground truth, as introduced in the CI-driven cloud mask generation step; and 2) the final SSCA-net is trained with the final expanded cloud mask and the mean squared error loss.
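Since ignored pixels must be excluded from the loss, the mean squared error above is effectively a masked average over labeled pixels. A minimal numpy sketch (the -1 ignore value follows the illustrative encoding used earlier, not a value fixed by the paper):

```python
import numpy as np

def masked_mse(pred, labels, ignore_value=-1):
    """Mean squared error over labeled pixels only.

    Pixels carrying the ignore value (negative noise from the dual-threshold
    step) contribute nothing to the loss; N counts only labeled pixels.
    """
    pred = np.asarray(pred, dtype=np.float64)
    labels = np.asarray(labels)
    valid = labels != ignore_value
    if not valid.any():
        return 0.0
    diff = pred[valid] - labels[valid]
    return float(np.mean(diff ** 2))
```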

IV. EXPERIMENTS AND RESULTS

A. Study Data and Accuracy Assessment
In this study, we use two typical optical remote sensing datasets, Sentinel-2 and Landsat-8, to evaluate the performance of the proposed SSCA-net. The proposed SSCA-net was tested on Sentinel-2 and Landsat-8 data using the blue band and the SWIR band. All Sentinel-2 and Landsat-8 experiment images are clipped into image patches, as shown in Fig. 3.
2) Landsat-8 Dataset: The Landsat-8 dataset is built from Landsat-8 L1T products and contains 67 training scenes and 13 validation scenes. We choose the reflectance of four Landsat-8 bands (bands 2, 3, 4, and 6) as SSCA-net input images. All Landsat-8 data can be freely downloaded from the United States Geological Survey.
Truth cloud masks are obtained by manual labeling on the ENVI 5.3 software platform. The OA, kappa coefficient (Kappa), and intersection-over-union (IoU) metrics reflect the classification accuracy of cloud pixels. However, these three metrics cannot reflect the boundary accuracy of clouds. The edge OA (EOA) indicates the boundary accuracy of clouds [39]. Therefore, OA, Kappa, IoU, and EOA are adopted to evaluate cloud mask performance on the study data.
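The first three metrics follow standard definitions on the binary confusion matrix and can be computed as follows (a sketch using the 1 = cloud, 0 = noncloud encoding; EOA additionally requires the edge pixels from [39] and is omitted here):

```python
import numpy as np

def cloud_metrics(pred, truth):
    """OA, kappa coefficient, and IoU for binary cloud masks."""
    pred = np.asarray(pred).ravel()
    truth = np.asarray(truth).ravel()
    tp = np.sum((pred == 1) & (truth == 1))  # cloud detected as cloud
    tn = np.sum((pred == 0) & (truth == 0))  # noncloud detected as noncloud
    fp = np.sum((pred == 1) & (truth == 0))  # commission error
    fn = np.sum((pred == 0) & (truth == 1))  # omission error
    n = tp + tn + fp + fn
    oa = (tp + tn) / n
    # Expected agreement by chance, for the kappa coefficient.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (oa - pe) / (1 - pe) if pe < 1 else 1.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    return oa, kappa, iou
```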

B. Performance Evaluation of SSCA-Net on Sentinel-2
To validate the cloud detection performance of SSCA-net on Sentinel-2 images, we compared the proposed SSCA-net with Fmask [10], CNN [22], and M-ResNet [17]. In addition, four high-brightness scene images are used to evaluate how well SSCA-net suppresses high-brightness noise.
Based on visual evaluation, a performance comparison on the different high-brightness scenes is depicted in Fig. 4. The cloud detection results show that Fmask, CNN, and M-ResNet mix some high-brightness objects (such as rock and white buildings) with clouds. The reason is that these methods have difficulty capturing discriminative spectral-spatial-context features from the input images. The fifth row of Fig. 4 shows that the proposed SSCA-net accurately detects clouds while requiring no manual collection of label samples. Table I shows the accuracy assessment of the different methods on high-brightness surface scenes. From Table I, SSCA-net significantly outperforms the traditional Fmask at discriminating clouds from high-brightness objects (such as rock and white buildings).

C. Performance Evaluation of SSCA-Net on Landsat-8
To verify the effect of different ground covers on cloud detection, two entire Landsat-8 images with different land cover types are adopted. The two images contain complex ground cover conditions (such as white buildings, water, vegetation, and bare land), as shown in the first row of Fig. 5. For a comparative analysis, we compared the proposed SSCA-net with Fmask, CNN, and M-ResNet. Fig. 5 shows the cloud detection results of the different methods on the Landsat-8 data. The Fmask result fails to solve the overestimation phenomenon (some white buildings are misdetected as clouds) on the entire Landsat-8 images. CNN and M-ResNet also confuse some white buildings with clouds; the white buildings are misclassified as clouds in the complex urban environment because these methods cannot capture discriminative cloud features from Landsat-8 images. The fifth row of Fig. 5 shows that the proposed SSCA-net provides satisfactory cloud detection results and solves the overestimation problem for complex urban scenes. This result can be attributed to the spectral-spatial-context aware module of SSCA-net, which fully captures discriminative cloud features.

D. Effect of Threshold on the Performance of SSCA-Net
As a CI-driven network, it is important to know how the cloud mask threshold affects SSCA-net performance. Therefore, we compare dual-threshold-based training of SSCA-net with single-threshold (HT or LT) training. It is worth noting that the ignored pixels are not used in the loss computation when training SSCA-net with the initial cloud mask from the dual threshold. Fig. 6 shows the cloud detection results of SSCA-net trained with different thresholds. From Fig. 6(c), the HT-based SSCA-net fails to detect thin clouds, leading to underestimation. The reason is that thin cloud labels cannot be generated in the cloud mask generation step when a high single threshold is set. Although the LT-based SSCA-net can provide a complete cloud detection result, it suffers from overestimation (some white-building noise is misclassified as cloud), because cloud labels generated with a low single threshold may include high-brightness noise in the initial cloud mask generation step. In contrast, the dual-threshold-based SSCA-net provides more accurate cloud detection results. Table III provides a quantitative assessment of the different threshold trainings.
From Table III, the dual-threshold-based SSCA-net achieves higher values on all four metrics than the single-threshold (HT or LT) trainings. The reason is that the cloud labels from the initial cloud mask generation step are more accurate under the dual threshold: a single threshold may include background noise in either the cloud or noncloud labels and disturb the SSCA-net training, whereas the dual threshold excludes high-brightness noise pixels from the cloud mask, preventing disadvantageous information from being fed to SSCA-net.

TABLE III: Accuracy assessment of high-threshold-based, low-threshold-based, and dual-threshold-based SSCA-net.

TABLE IV: Accuracy assessment of the proposed SSCA-net components.

E. Effect of SSCA-Net Components on Cloud Detection
In the spectral-spatial-context aware module, since cloud discriminative features from three attention branches are used, it is necessary to analyze how much each branch contributes to the cloud detection task. To verify the validity of the SSCA-net components, we compared the full SSCA-net with SSCA-net without spatial attention, SSCA-net without spectral attention, and SSCA-net without context attention. Fig. 7 shows the cloud detection results of the SSCA-net components. From Fig. 7(c)-(e), SSCA-net without context attention seriously confuses white buildings with clouds, because it cannot extract the context features. The results show that the context attention branch is more important than the spatial and spectral attention branches for cloud detection. In contrast, Fig. 7(b) shows that the full SSCA-net generates more accurate cloud detection results. Table IV provides a quantitative assessment of the SSCA-net components. The proposed SSCA-net yields an OA of 97.49%, a Kappa of 92.81%, an IoU of 0.9541, and an EOA of 95.01%. Furthermore, the OA of the proposed SSCA-net is better than that of SSCA-net without spatial attention, without spectral attention, and without context attention by 5.37%, 8.04%, and 11.4%, respectively. Based on the visual and quantitative evaluation, we further find that in most feature groups, context features are more important than spatial and spectral features for cloud detection.
The experimental results show that the proposed method has two advantages: 1) it requires no manual collection of label samples; and 2) SSCA-net can capture the spectral-spatial-context of clouds, which yields accurate cloud detection results in high-brightness scenes.

V. CONCLUSION
In this article, we propose a CI-driven SSCA-net for cloud detection that requires no manual collection of label samples and integrates the advantages of deep attention networks and the CI. As a deep learning model, SSCA-net requires a large number of cloud label samples for training, and such labels are costly to acquire; we solve this issue by automatically generating cloud label samples with the normalized CI. We established SSCA-net with the spectral-spatial-context aware module and the spectral-spatial-context information aggregation module, aiming to capture the spectral-spatial-context of clouds and further improve the accuracy of cloud detection in high-brightness scenes. The experimental results show that SSCA-net can simultaneously eliminate high-brightness noise and provide more accurate cloud detection results on the Sentinel-2 and Landsat-8 datasets.
The proposed CI-driven SSCA-net provides reliable cloud detection results in high-brightness scenes. However, it is not applicable to optical satellite images that lack a SWIR band. In future work, more features, such as cloud geometric features, will be used in cloud mask generation to overcome the disadvantages of the normalized CI.