AOSVSSNet: Attention-Guided Optical Satellite Video Smoke Segmentation Network

Smoke is more observable than open fires. Optical satellite video offers a wide monitoring range, fast response, and low cost in large-scale surface smoke monitoring tasks. It can be used in wide-area forest wildfire monitoring, battlefield dynamic monitoring, and disaster relief decision-making. Smoke segmentation methods based on traditional handcrafted features are easily limited by the scene and data. This article introduces deep learning to optical satellite video smoke target segmentation. However, because real smoke images are scarce and smoke edges are blurred, there are currently few labeled datasets for smoke segmentation in high-resolution optical satellite imagery, and they cannot provide sufficient training data for deep learning models. Smoke images from the satellite perspective are also characterized by multiscale features and ground object background interference. To solve these problems, we construct a synthetic smoke dataset of high-resolution optical satellite imagery based on the optical imaging process of smoke targets, which saves the cost of manual labeling. In addition, we design an attention-guided optical satellite video smoke segmentation network model, which can effectively suppress false alarms from the ground object background and extract the multiscale features of smoke. Because synthetic data faces a transferability problem in real-world applications, the physical constraints of the smoke imaging process are introduced into the loss function to improve the generalization of the model to real smoke data. The comprehensive evaluation results show that the method outperforms representative semantic segmentation networks.


I. INTRODUCTION
SMOKE is more observable than open fires. Compared with traditional inductive detectors that need to be close to the fire source for physical and chemical composition analysis, smoke sensing technology based on video image processing can respond faster to fire alarms, and the noncontact method effectively avoids sensor loss [1]. Compared with existing coarse smoke localization based on image classification and target detection, smoke segmentation effectively integrates location information and attribute information and obtains accurate pixel-by-pixel information, which helps rescuers identify the source of the fire and reduces the possibility of fire alarm delays. In addition, it can dynamically monitor the trend of smoke morphology, indirectly reflect the current environmental conditions of the fire scene, and provide data support for predicting the spread trend and speed of the fire, which has significant research value and practical significance.
In recent years, with the continuous innovation of sensor technology and the improvement of the quality of spatial data acquisition, optical video satellites have emerged that are capable of high-frame-rate imaging (frame rate ≥ 24 FPS) of the same area with dynamic observation capabilities. This has made research on smoke segmentation methods for optical satellite video data more prominent. Compared with ground surveillance video, optical satellite video has the advantages of a wider monitoring range, faster response speed, and better economy. It shows great potential in monitoring fire smoke in vast surface spaces such as forests, volcanoes, and large oil tank farms.
However, smoke segmentation is a highly challenging computer vision task because smoke has more substantial intraclass variability than other segmentation targets. It is affected by lighting and shooting perspective and shows different colors, poses, and shapes at different times and under different physical and chemical conditions [2], as shown in Fig. 1(a). Moreover, compared with ground surveillance video, the scene from the satellite perspective has complex and similar ground object backgrounds and multiscale smoke targets, as shown in Fig. 1(b). Therefore, although the traditional artificially designed feature expression method can achieve accurate extraction of smoke to a certain extent, most of these design schemes are relatively complex, and the selection and combination of features lack unified principles and specifications. It is more susceptible to scene and data constraints in smoke target extraction [3]. As an excellent data-driven modeling tool, deep learning can automatically learn the excellent and essential features that conform to the distribution of the current task image dataset and significantly reduce the labor cost of feature modeling, which has attracted the attention of researchers, such as CNN [4], [5], [6], [7], GCN [8], and Transformer [9]. The performance of deep learning models largely depends on large-scale, high-quality labeled training datasets. The slow development of smoke segmentation datasets for high-resolution optical satellite images restricts related research progress. 
The main reasons include the following two points: 1) Since fire and smoke are accidental emergencies in daily life, there are relatively few fire and smoke scenes on the ground that can be captured by satellites, resulting in a small scale of real datasets for research; 2) Because the smoke target has the characteristics of irregular shape and blurred edge, it is extremely difficult and inaccurate to manually label the boundary of the smoke target pixel by pixel.
To solve the lack of reliable training datasets and the difficulty of labeling in the optical satellite video smoke segmentation task, this article constructs a set of high-resolution optical satellite image smoke target synthesis datasets based on the optical imaging principle of smoke targets, which significantly saves the cost of manual labeling. Compared with typical smoke images, the smoke targets in satellite video have stronger visual saliency than other ground objects. Because the multiscale segmentation results of optical satellite video smoke are easily disturbed by background objects, this article proposes an attention-guided optical satellite video smoke segmentation network model (AOSVSSNet) to improve the segmentation accuracy. To improve the transferability of a model trained on the synthetic dataset to the real test dataset, this article introduces the physical constraints of the smoke imaging process into the loss function of the segmentation network, which yields good generalization.
In summary, this article has the following three main contributions.
1) This article constructs a set of high-resolution optical satellite image smoke target synthesis datasets. As far as we know, this is the first high-resolution optical satellite smoke dataset, which effectively addresses the lack of training samples for smoke segmentation and the difficulty of labeling.
2) This article proposes a convolutional neural network model for smoke segmentation, which to our knowledge is the first deep learning model for smoke segmentation in optical satellite video. It can achieve end-to-end training and prediction, and effectively suppress background interference while extracting smoke pixels.
3) In this article, the physical constraints of the smoke imaging process are introduced into the loss function of the segmentation network, which improves the generalization of the model trained on the synthetic dataset to the real test dataset.
The rest of this article is organized as follows. Section II presents related work on semantic segmentation of smoke images. Section III details the optical satellite imagery smoke training data synthesis method and the smoke segmentation model for optical satellite video. Section IV presents and analyzes the experimental results of the proposed method. Finally, Section V concludes this article.

II. RELATED LITERATURE
Currently, there is no public report on the research of smoke segmentation for optical satellite video. Therefore, the existing algorithm can be adapted by referring to the research on smoke segmentation based on natural images. The mainstream smoke segmentation methods can be divided into traditional manual features and deep learning methods.

A. Traditional Smoke Segmentation Methods
Smoke targets have rich image information, including static features such as color, texture, and shape, and dynamic characteristics such as diffusion, displacement, and flickering [2]. Therefore, traditional smoke segmentation methods focus on using various smoke image information to form a feature representation method with sufficient recognition. Among them, the focus of research is color, texture, frequency, and motion features.
In terms of color features, the characteristics of smoke in the red-green-blue (RGB) color model are mainly manifested in that the gray values of the R, G, and B channels are relatively similar, roughly distributed in the range of 80-220 [10], [11]. In the hue-saturation-value (HSV) and hue-saturation-intensity (HSI) color models, the salience of smoke is mainly concentrated in the saturation component [12], [13], [14]. In terms of texture features, the gray level co-occurrence matrix [15], local binary pattern (LBP) [16], and Pyramid LBP [17], [18] are the more commonly used methods. Additionally, dynamic textures have the potential to characterize temporal invariance and have also been applied to describe smoke [19]. In terms of frequency features, a single frequency feature can achieve a good recognition effect [20], [21], as different frequency information in the frequency domain corresponds to the image information in the spatial domain. The high-, medium-, and low-frequency information reflect the image's edge details, structure, and main components, respectively. Among them, wavelet transform is the most commonly used frequency feature extraction method [22], [23], [24]. By fusing the target features in the spatial domain and the frequency domain, and using ensemble classifier learning, the translation- and rotation-invariant features of the target can be expressed, thereby improving the detection accuracy [25], [26], [27], [28], [29]. In terms of motion features, the drift, diffusion, and other motion characteristics of smoke are the focus of research [30], [31]. The feature extraction mainly adopts statistical features, including optical flow estimation, area and centroid change statistics of suspected smoke areas, and movement direction change statistics [11], [22], [32].
Smoke segmentation can also be seen as the process of background and moving foreground segmentation, extracted by methods such as anomaly detection [33], linear unmixing [34], object tracking [35], or modal translation [36].
In general, although these traditional methods can achieve accurate smoke extraction to a certain extent, they usually require manual feature design and classifier selection. This requires designers to have solid empirical knowledge in specific fields, such as the extraction and combination of features and the setting of hyperparameters, resulting in high cost. In addition, the transferability of handcrafted features is poor: the testing effect is generally good only on the current task dataset, and it is difficult to adapt to smoke targets with different data quality and scene changes, resulting in unstable or poor segmentation accuracy.

B. Deep-Learning-Based Smoke Segmentation Methods
The emerging deep learning algorithm avoids the complex feature design process to the greatest extent. By designing a reasonable neural network structure, people can enable the model to automatically and efficiently learn excellent features adapted to the current task with less manual intervention, and bring significant improvements to visual smoke monitoring tasks of various granularities [37], [38], [39], [40]. The performance of smoke semantic segmentation network models largely relies on large-scale pixel-by-pixel labeled data. At present, the open-source smoke datasets of natural images mainly include the laboratory dataset of Bilkent University, Turkey [41], the laboratory dataset of Keimyung University, South Korea [42], the BoWFire flame and smoke image dataset of Chino et al. [43], the dataset of the State Key Laboratory of Fire Science, University of Science and Technology of China [44], and the datasets of Feiniu Yuan's laboratory at Jiangxi University of Finance and Economics [45]. Among these, only the last two datasets have pixel-by-pixel annotations of smoke. The rest of the labeled datasets are used for classification or detection, with the scene mainly based on the ground perspective. Remote sensing images have a wide observational perspective and rich and diverse data sources, including optical, SAR, hyperspectral, and video. Data obtained from different platforms can provide diverse and complementary information [46], [47]. The smoke datasets for optical satellite images mostly come from low-resolution multispectral images such as MODIS [48], Himawari-8 [49], Landsat-8 [50], and GOES-16 [51]. The lack of large-scale, open-source, high-resolution labeled datasets for segmentation restricts the development of smoke segmentation network models for high-resolution optical satellite imagery.
In addition, compared to image classification and object detection tasks, fine-grained semantic segmentation tasks rely more on contextual feature information to obtain higher segmentation accuracy. At present, the main ideas of semantic segmentation networks include fully convolutional neural networks (such as FCN [52]), encoder-decoder structures (such as U-Net [53], SegNet [54], PSPNet [55]), and dilated convolutional networks (such as DeepLab series algorithms [56], [57], [58], [59]). Existing smoke segmentation methods are also mainly based on it.
Regarding how the algorithm utilizes the input data stream, video smoke segmentation can be divided into single-frame image smoke segmentation that only uses static appearance features and video smoke segmentation methods that fuse dynamic spatiotemporal features.
In terms of single-frame image smoke segmentation research, Xu et al. [60] proposed an end-to-end framework for smoke saliency detection, which consists of a region proposal network and an autoencoder structure to achieve smoke frame-level recognition and pixel-level fine segmentation. Yuan et al. [45] proposed an end-to-end segmentation network that fuses dual-branch features for blurred, semitransparent, and nonrigid boundaries of smoke targets, which outputs a soft segmentation probability map with 0-1 continuous values and gains pixel-by-pixel density estimation. Yuan et al. [61] believed that the full fusion of information between the high and low layers of the codec could improve the segmentation accuracy of fuzzy objects such as smoke and clouds and proposed a deep neural network with a wave structure using a synthetic smoke dataset for training to achieve smoke density estimation. Yuan et al. [62] proposed a classification-assisted gated regression semantic segmentation network for the problem of interclass similarity of smoke and small smoke segmentation, which can learn long-distance feature relationships and contextual information and improve the accuracy of smoke segmentation. It is not difficult to see from the abovementioned methods that the natural image smoke segmentation network basically innovates and transforms around the goal of how to enhance the contextual features. These strategies include dual-branch feature fusion, high-level and low-level feature fusion, and visual attention mechanisms to improve the accuracy of smoke segmentation, which are worthy of reference and study.
Currently, there are relatively few deep learning smoke segmentation methods for the overall processing of video form. Li et al. [63] applied a 3-D fully convolutional neural network to the video wildfire smoke segmentation task for the first time and reduced the false detection rate of smoke segmentation by fusing the information between high and low layers and expanding the receptive field. The unsupervised video target segmentation network that has emerged in recent years has also attracted attention. It mainly realizes the classification of the target in the initial frame and the tracking in the subsequent frame from the pixel level according to some salient features of the target to be segmented, such as motion features, which shows potential in video smoke object segmentation that is difficult to manually annotate. Two-stream networks fusing motion and appearance features and recurrent neural networks are two important ideas to achieve unsupervised video object segmentation. The representative methods include MP-Net [64], LVO [65], FSEG [66], PDB [67], CosNet [68], and AGNN [69]. Although these methods perform better in the segmentation of rigid objects with translational motion, the motion pattern of smoke generally presents a diffuse motion from the source point to the surrounding; i.e., the edge pixels move while the interior pixels remain stationary. Therefore, the influence of the video frame sequence is mainly on the edge of the smoke. Although the model that introduces motion information will further refine the edge of the smoke or enhance the feature expression of little smoke, it also introduces more motion noise. The repeated texture inside the smoke makes the description of the motion optical flow feature unreliable, resulting in poor smoke segmentation results or missed detections.
Satellite video processing methods can be divided into multiframe processing methods that use temporal information and frame-by-frame processing methods. Considering that video annotation is expensive, and to extract the main area of the smoke target as completely as possible, this article processes the deframed video frame by frame. It takes UNet++, an improved version of the classic semantic segmentation network UNet, as the basic framework and fully integrates high-level and low-level features through a dense skip connection structure. A convolutional attention module guides the model to pay more attention to the smoke target and suppress irrelevant background objects, achieving accurate segmentation of the smoke area in optical satellite video.

A. Smoke Segmentation Synthetic Dataset Construction
Currently, optical satellite image smoke datasets mainly focus on low-resolution scenes and coarse-grained detection and recognition tasks, lacking large-scale high-resolution open-source segmentation datasets. To solve the problem of the scarcity of training data for the deep learning model of smoke segmentation, we use the existing open-source datasets for natural image smoke segmentation, based on the optical smoke imaging principle, to construct a rich and diverse optical satellite image smoke target segmentation synthetic dataset, and we validate the generalization performance of models trained on the synthetic dataset through real data.
sources or reflected light, the light is continuously weakened during the propagation process and finally imaged in the camera under observation [70]. Fig. 2 shows the optical imaging process of smoke targets [61].
The optical imaging process of a smoke target from a 3-D space to a 2-D plane means that each pixel value i(x) can be simplified, in mathematical description, as a weighted sum of pure background pixel values and pure smoke pixel values:
$$i(x) = \left(1 - \alpha(x)\right) b(x) + \alpha(x)\, s(x) \qquad (1)$$
In (1), b(x) represents the background color, s(x) represents the smoke color, and α(x) represents the transparency coefficient or alpha channel of the smoke. Since this equation is essentially a linear color synthesis equation in mathematical form [71], [72], this article regards α(x) as the optical density of smoke, which helps us to synthesize smoke images by quantitative methods later and to incorporate physical constraints into the model to improve segmentation accuracy.
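The linear synthesis model described above can be illustrated with a minimal Python sketch. It assumes that the transparency α has already been normalized to [0, 1] (for example, by dividing an 8-bit alpha map by 255) and that images are small nested lists of RGB tuples. The function names `composite_pixel` and `composite_image` are hypothetical and are not taken from the original work.

```python
def composite_pixel(background, smoke, alpha):
    """Blend one RGB pixel as i = (1 - alpha) * b + alpha * s.

    Assumption: alpha is normalized to [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple((1.0 - alpha) * b + alpha * s for b, s in zip(background, smoke))


def composite_image(background, smoke, alpha_map):
    """Apply the per-pixel blend to images stored as nested lists."""
    return [
        [composite_pixel(b, s, a) for b, s, a in zip(b_row, s_row, a_row)]
        for b_row, s_row, a_row in zip(background, smoke, alpha_map)
    ]


if __name__ == "__main__":
    bg = [[(40, 90, 30), (40, 90, 30)]]          # greenish ground pixels
    sm = [[(200, 200, 200), (200, 200, 200)]]    # light grey smoke pixels
    alpha = [[0.0, 0.5]]                         # no smoke, half-dense smoke
    print(composite_image(bg, sm, alpha))
    # [[(40.0, 90.0, 30.0), (120.0, 145.0, 115.0)]]
```

With α = 0 the background is returned unchanged, and with α = 0.5 each channel is the midpoint of the background and smoke values, which is the behavior the weighted-sum description implies.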
2) Synthesis Method of the Smoke Target Image: The smoke optical density α(x) is a value ranging from 0 to 255, and it is neither possible nor accurate to calibrate the transparency of each pixel manually. Existing studies have used computer graphics methods to simulate and visualize smoke based on the principle of fluid dynamics. The most representative one is a set of open-source smoke datasets constructed by the team of Prof. F. Yuan from Jiangxi University of Finance and Economics using the open-source 3-D modeling software Blender [45]. By setting physical parameters such as wind, motion, and gravity, the research team generated a large amount of synthetic smoke data, including background, smoke, and transparency maps. The synthetic smoke has different shapes, densities, lighting, and backgrounds, and closely resembles real smoke. This significantly saves the cost of manually collecting real smoke images and provides a sufficient database for deep learning model training.
The smoke targets in ground camera images and remote sensing images have similar diffusion motion patterns, but the scales of the smoke targets in remote sensing images vary more widely, and the ground object backgrounds are complex. Therefore, on the basis of the abovementioned open-source
In (2), γ_1, γ_2, and γ_3 are random numbers in the range [0, 1].
3) Next, apply the data augmentation operations of horizontal and vertical flipping to the composite smoke image I(x), which can reduce the overfitting of the model to a particular feature and improve its robustness and generalization.
4) Finally, set the threshold T_h for the transparency map α(x) corresponding to the smoke synthesis image I(x), and generate its corresponding binary mask image according to the following equation as the ground-truth map of the semantic segmentation task. T_h is set to 128 in this article. Fig. 3 shows the synthetic dataset
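The flipping and thresholding steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transparency map is an 8-bit nested list, and a pixel is labeled smoke when its transparency is greater than or equal to the threshold (the exact boundary convention is an assumption here). The function names are hypothetical.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in image]


def flip_vertical(image):
    """Reverse the order of the rows (top to bottom)."""
    return [list(row) for row in reversed(image)]


def binary_mask(alpha_map, threshold=128):
    """Label a pixel 1 (smoke) when alpha >= threshold, else 0.

    Assumption: the comparison is inclusive; threshold 128 follows the text.
    """
    return [[1 if a >= threshold else 0 for a in row] for row in alpha_map]


if __name__ == "__main__":
    alpha_map = [[0, 100, 128],
                 [200, 255, 127]]
    mask = binary_mask(alpha_map)
    print(mask)                    # [[0, 0, 1], [1, 1, 0]]
    print(flip_horizontal(mask))   # [[1, 0, 0], [0, 1, 1]]
    print(flip_vertical(mask))     # [[1, 1, 0], [0, 0, 1]]
```

In practice the same flip would be applied to the composite image and to its mask so that the pair stays aligned.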

B. AOSVSSNet Network Structure
Many existing semantic segmentation networks are based on U-Net. U-Net consists of an encoder with four downsampling steps, a decoder with four upsampling steps, and a long-skip connection structure, which realizes the splicing of high-level semantic and low-level geometric features and improves segmentation accuracy. However, some questions can still be explored in the design of the U-Net network, including the degree of influence of the number of sampling steps on feature extraction and the actual performance of long-skip connections in bridging the semantic gap. In response to these problems, Zhou et al. [73] extended the U-Net network and proposed UNet++, an encoder-decoder structure composed of nested dense short-skip connection layers obtained by stacking U-Net networks of different levels, which helps to reduce the semantic gap between the encoder feature maps and the decoder feature maps. It has a strong ability to capture image feature details, adapts to high-resolution remote sensing images characterized by rich details, multiscale features, and complex ground object structures, and has better segmentation performance.
Therefore, an attention-guided optical satellite video smoke segmentation network was designed with the pruned version of UNet++ as the basic structure and with the following three modifications.
1) CBAM was introduced between the original encoder layers to adaptively select and enhance features, so that the network could focus more on the smoke target content and global location information, suppress other irrelevant ground objects and noise information, and improve the accuracy of smoke segmentation.
2) The lightweight network MobileNetV2 was selected as the convolution unit of the network to reduce the number of parameters required for training.
3) According to the smoke optical imaging process, a composite loss function with multiple constraints was introduced into the model, which could achieve fine segmentation of smoke targets based on the optical concentration estimation results and improve the generalization performance of the model tested on real data.
Correspondingly, according to the loss function, the number of channels at the input and output of the network was adjusted. Details are shown in Fig. 4.
1) UNet++ Network Structure: The network structure of UNet++ is shown in Fig. 5, which mainly includes five parts: input interface, encoder, decoder, skip connection, and deep supervision.
The encoder part consists of five downsampling layers X^{0,0}, X^{1,0}, X^{2,0}, X^{3,0}, and X^{4,0}. Each downsampling layer is implemented by a VGG block and a pooling layer, and each VGG block consists of two cascaded convolutional layers with a kernel size of 3 × 3 pixels and a sliding stride of 1 pixel. The number of VGG block convolution kernels in each layer is 64, 128, 256, 512, and 512, respectively. The downsampling layer can be implemented with other convolutional neural network structures according to actual needs, and the number of convolution kernels can also be adjusted as needed.
The decoder part mainly includes four branches. These branches upsample the feature maps extracted by X^{1,0}, X^{2,0}, X^{3,0}, and X^{4,0}, fuse them with the shallow features of the same layer, and iteratively process from top to bottom to obtain the output maps of the four branches. Similar to the encoder, the specific implementation of each layer unit of the decoder can also be designed as needed. The calculation result of each unit of the codec part can be expressed by the following equation:
$$X^{i,j} = \begin{cases} H\left(P\left(X^{i-1,j}\right)\right), & j = 0 \\ H\left(\left[\left[X^{i,k}\right]_{k=0}^{j-1},\; U\left(X^{i+1,j-1}\right)\right]\right), & j > 0 \end{cases} \qquad (4)$$
In (4), i indexes the downsampling level along the encoder, j indexes the convolution unit along the skip connection, H(·) represents the convolution computation, P(·) represents the max-pooling computation with a size of 2 × 2 for downsampling, U(·) represents the deconvolution computation for upsampling, and [·] represents feature connections in the channel dimension.
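To make the nested connectivity of the codec units more concrete, the following Python sketch enumerates which earlier units feed each unit X^{i,j}. It follows the standard UNet++ wiring of Zhou et al. [73] as an assumption (pooled input from the level above for encoder units, all same-level predecessors plus one upsampled unit from the level below for decoder units). The function names and the tuple labels "input", "pool", "skip", and "up" are hypothetical.

```python
def node_inputs(i, j):
    """Return the inputs of unit X^{i,j} as (operation, source_index) pairs.

    Assumption: standard UNet++ wiring. Encoder units (j == 0) take the
    pooled output of the level above; decoder units (j > 0) concatenate all
    same-level predecessors with the upsampled unit from the level below.
    """
    if j == 0:
        return [("input", None)] if i == 0 else [("pool", (i - 1, 0))]
    skips = [("skip", (i, k)) for k in range(j)]
    return skips + [("up", (i + 1, j - 1))]


def all_nodes(depth=5):
    """List every unit (i, j) of a UNet++ with `depth` encoder levels."""
    return [(i, j) for j in range(depth) for i in range(depth - j)]


if __name__ == "__main__":
    for node in all_nodes(5):
        print(node, "<-", node_inputs(*node))
```

For five encoder levels this yields 15 units, and the last unit of the top row, X^{0,4}, receives four same-level skip inputs plus the upsampled X^{1,3}.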
The blue solid line path is the deep supervision layer, which can combine the output results of each branch of the decoder to obtain the final segmentation result. Each of the four decoder branches, combined with the corresponding encoder levels, can be regarded as a subnetwork of a different depth. Four separate model versions can therefore be formed through pruning: UNet++L1, UNet++L2, UNet++L3, and UNet++L4, as shown in Fig. 6. Compared with training four subnetworks separately and then selecting a model, UNet++ adopts the strategy of training the overall model and then pruning, which is easier to operate and less time-consuming. When a subnetwork reaches the target prediction accuracy, this approach yields the model with the smallest memory footprint or computation, which reflects the flexibility and efficiency of the model.
2) Convolutional Attention Module CBAM: While providing sufficient, discriminative, multiscale deep features for image classification or regression tasks, convolutional neural networks also introduce more redundant and noisy information, increasing computational cost and affecting segmentation performance. Feature optimization can select the most useful features for the segmentation task from the original feature set. Inspired by human vision research, the convolutional attention mechanism, an excellent deep feature selection method, can learn the weight distribution of output feature maps, highlight the target content and location information, and ignore other irrelevant information. Currently, the convolutional attention mechanism is mainly divided into three categories: spatial attention mechanism, channel attention mechanism, and hybrid attention mechanism. Among them, the hybrid attention mechanism considers spatial and channel similarity. The main methods include CBAM [74], DANet [75].
Optical satellite video is a high-resolution remote sensing time series image. The complex background of ground objects is the primary interference information for the task of smoke segmentation. At the same time, the smoke mainly moves upward and is less constrained by the structure of ground objects. Therefore, a lightweight, efficient, and plug-and-play CBAM module was integrated into the UNet++ model to adjust the feature weights in the spatial and channel directions, improve the semantic expression ability of the network for the smoke target, and realize end-to-end training.
The working principle of CBAM is shown in Fig. 7. Assuming that the size of the input feature map F is H × W and the number of channels is C, CBAM first applies the channel attention module to the feature map F to obtain a 1-D channel attention weight distribution A_C (of size 1 × 1 × C) and then calculates the dot product of the feature map F and A_C to obtain the channel-oriented salient feature map F_C, as expressed by (5). Then, the spatial attention module is applied to F_C to obtain a 2-D spatial attention weight distribution A_S (of size H × W × 1). Finally, the dot product of the feature map F_C and A_S is calculated to obtain the spatially salient feature map F_M, as expressed by (6):
$$F_C = A_C \otimes F \qquad (5)$$
$$F_M = A_S \otimes F_C \qquad (6)$$
where ⊗ represents the dot product (element-wise multiplication) operation.
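The two-stage reweighting described above (channel weights first, then spatial weights) can be sketched in plain Python. This is a deliberately simplified illustration, not the CBAM module of [74]: the learned shared MLP in the channel branch and the learned convolution in the spatial branch are replaced here by identity mappings, so only the pooling, sigmoid gating, and element-wise multiplication are shown. Feature maps are nested lists indexed as F[c][h][w], and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(F):
    """A_C: one weight per channel from global average- and max-pooling.

    Simplification: the learned shared MLP of CBAM is omitted.
    """
    weights = []
    for channel in F:
        values = [v for row in channel for v in row]
        weights.append(sigmoid(sum(values) / len(values) + max(values)))
    return weights


def spatial_attention(F):
    """A_S: one weight per location from channel-wise mean and max.

    Simplification: the learned convolution of CBAM is omitted.
    """
    C, H, W = len(F), len(F[0]), len(F[0][0])
    return [
        [
            sigmoid(sum(F[c][h][w] for c in range(C)) / C
                    + max(F[c][h][w] for c in range(C)))
            for w in range(W)
        ]
        for h in range(H)
    ]


def cbam(F):
    """Apply channel reweighting, then spatial reweighting."""
    A_C = channel_attention(F)
    F_C = [[[A_C[c] * v for v in row] for row in F[c]] for c in range(len(F))]
    A_S = spatial_attention(F_C)
    return [
        [[A_S[h][w] * F_C[c][h][w] for w in range(len(A_S[0]))]
         for h in range(len(A_S))]
        for c in range(len(F_C))
    ]


if __name__ == "__main__":
    F = [[[1.0, 2.0], [3.0, 4.0]],   # an active channel
         [[0.0, 0.0], [0.0, 0.0]]]   # an empty channel
    print(cbam(F))
```

The output keeps the C × H × W shape of the input, and because both weight maps lie in (0, 1), every response is scaled down in proportion to how weakly its channel and location are activated.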

3) Lightweight Convolutional Neural Network MobileNetV2:
Compared with UNet, UNet++ has a stronger multiscale semantic feature expression ability. However, it also has more convolution computing units, which will reduce the processing speed of optical satellite video data. It is challenging to meet the future onboard processing need with limited computing and memory resources. In the context of the needs of embedded mobile devices and real-time processing, lightweight convolutional neural networks have emerged, and MobileNet series algorithms are an excellent representative of them.
MobileNetV2 [76] is a lightweight network proposed by Google in 2018. It inherits the depthwise separable convolution adopted by MobileNetV1 [77] and adds a new structure called the bottleneck residual module, mainly composed of two substructures: the inverted residual and the linear bottleneck. At an accuracy comparable to that of ordinary convolution layers, it significantly reduces the number of model parameters and computing resource consumption and has better comprehensive performance. The layer structure of the original MobileNetV2 model is shown in Table I.
a) Depthwise Separable Convolution: The most prominent feature of depthwise separable convolution is that it can significantly reduce the number of parameters required for convolution calculation with little effect on the performance of the model. The depthwise separable convolution splits the traditional convolution calculation into two stages: depthwise convolution and pointwise convolution. The computation ratio of depthwise separable convolution to traditional standard convolution can be expressed as
$$\frac{K \cdot K \cdot M \cdot H \cdot W + M \cdot N \cdot H \cdot W}{K \cdot K \cdot M \cdot N \cdot H \cdot W} = \frac{1}{N} + \frac{1}{K^{2}} \qquad (7)$$
In (7), K represents the size of the convolution kernel, H, W, and M represent the height, width, and number of channels of the input feature map, respectively, and N represents the number of channels of the output feature map. In practice, the number of channels N of the output feature map is often large. Even when a convolution kernel with a size of 3 × 3 is used to calculate an output feature map of only 8 channels, the calculation amount of the depthwise separable convolution is reduced by about 76% (nearly 80%) compared with traditional standard convolution.
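The cost comparison for the 3 × 3, 8-channel case can be checked with a short Python sketch. It counts multiply-accumulate operations under the usual assumptions of stride 1 and an output feature map of the same spatial size as the input. The input channel count M = 32 and the 64 × 64 feature map size are arbitrary illustrative values, and the function names are hypothetical.

```python
from fractions import Fraction


def standard_conv_cost(K, M, N, H, W):
    """Multiply-accumulates of a standard K x K convolution."""
    return K * K * M * N * H * W


def depthwise_separable_cost(K, M, N, H, W):
    """Depthwise K x K stage plus pointwise 1 x 1 stage."""
    depthwise = K * K * M * H * W
    pointwise = M * N * H * W
    return depthwise + pointwise


def cost_ratio(K, M, N, H, W):
    """Exact ratio of separable cost to standard cost."""
    return Fraction(depthwise_separable_cost(K, M, N, H, W),
                    standard_conv_cost(K, M, N, H, W))


if __name__ == "__main__":
    ratio = cost_ratio(K=3, M=32, N=8, H=64, W=64)
    print(ratio)                                   # 17/72
    print(f"saving: {float(1 - ratio):.1%}")       # saving: 76.4%
```

The ratio does not depend on M, H, or W, which cancel out, and it equals 1/N + 1/K² = 1/8 + 1/9 = 17/72 for this configuration.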
b) Inverted Residual Structure: The emergence of the residual structure has largely overcome the training difficulty caused by the depth of the neural network and brought a significant improvement to the performance of the model. Therefore, MobileNetV2 also draws on this design to form a new structure called the inverted residual structure, as shown in Fig. 8. Considering that the channel compression of the input feature map will reduce the accuracy of the model, its calculation process is designed in three steps: feature dimension expansion, feature extraction, and feature dimensionality reduction. In addition, to prevent network performance degradation, the structure also replaces the ReLU6 function used for feature mapping from high-dimensional to low-dimensional with a linear function, forming the final linear bottleneck structure. In conjunction with the depthwise separable convolution mentioned previously, it can effectively reduce the computational cost of the inverted residual structure in the high-dimensional feature extraction process and balance model performance and efficiency. Therefore, this article uses MobileNetV2 as the basic unit of convolution calculation of the network model to improve the efficiency of the algorithm.

4) Loss Function:
In this article, the input of the segmentation network was a three-channel RGB synthesized smoke image, and the output was a seven-channel feature map. The first, second, and third channels of the output were used to predict the three-channel RGB pure background pixel values. The fourth, fifth, and sixth channels of the output were used to predict the three-channel RGB pure smoke pixel values, and the seventh channel of the output was used to predict the values of the RGB synthesized smoke images. Correspondingly, the segmentation network adopted a composite loss function [61] containing four error terms with physical constraints, which was defined as the following equation:

$$L = w_{\alpha} L_{\alpha} + w_{s} L_{s} + w_{b} L_{b} + w_{c} L_{c} \tag{8}$$

Among them, $L_{\alpha}$, $L_{s}$, $L_{b}$, and $L_{c}$ represented the mean square errors of the four predicted values, namely the smoke density, the RGB pure smoke pixel value, the RGB pure background pixel value, and the RGB synthesized smoke image pixel value, as shown in (9)-(12). $w_{\alpha}$, $w_{s}$, $w_{b}$, and $w_{c}$ represented the weight coefficients of the four error terms in the final error, all of which were taken as 0.25 in this article.

$$L_{\alpha} = \mathrm{MSE}(\alpha, \alpha_{gt}) \tag{9}$$

$$L_{s} = \mathrm{MSE}(s, s_{gt}) \tag{10}$$

$$L_{b} = \mathrm{MSE}(b, b_{gt}) \tag{11}$$

$$L_{c} = \mathrm{MSE}(c, i) \tag{12}$$

In (9)-(11), $\alpha$, $s$, and $b$ represented the predicted values of the smoke density, RGB pure smoke pixels, and RGB pure background pixels, respectively; $\alpha_{gt}$, $s_{gt}$, and $b_{gt}$ represented the corresponding ground truth. In (12), $i$ and $c$ denoted the ground truth and predicted values of the RGB synthesized smoke images, respectively. These four error terms constrain the predicted value of each component in a mixed pixel of the smoke image, thereby improving the accuracy of smoke concentration prediction.
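The weighted sum of four mean square error terms described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the dictionary keys `alpha`, `s`, `b`, and `c` are hypothetical names, each entry is assumed to be a flattened list of pixel values, and the ground truth stored under `c` is assumed to be the input synthesized image i.

```python
from typing import Dict, Optional, Sequence

TERMS = ("alpha", "s", "b", "c")  # density, pure smoke, pure background, composite


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean square error between two equally long, non-empty sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def composite_loss(
    pred: Dict[str, Sequence[float]],
    truth: Dict[str, Sequence[float]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of the four MSE terms; all weights default to 0.25."""
    if weights is None:
        weights = {name: 0.25 for name in TERMS}
    return sum(weights[name] * mse(pred[name], truth[name]) for name in TERMS)


if __name__ == "__main__":
    truth = {"alpha": [1.0, 0.0], "s": [0.8, 0.8], "b": [0.2, 0.2], "c": [0.8, 0.2]}
    pred = dict(truth, alpha=[0.5, 0.5])  # only the density prediction is wrong
    print(composite_loss(pred, truth))
```

In a PyTorch training loop the same structure would typically be expressed with tensor operations over the seven output channels; the list-based version here only shows how the four terms and their weights combine.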

A. Experimental Setups
a) Synthetic Dataset and Real Dataset:
The experimental data included two parts: a synthetic dataset and a real dataset. The synthetic dataset was used for training and testing the network model, and the real dataset was used to test the network model trained on the synthetic data to verify the transferability of the smoke synthetic dataset.
The synthetic dataset was produced according to the method described in Section III. For the experiments, a smoke synthesis dataset containing 10 000 synthetic images and labels with a size of 256×256 was constructed. Its background types included airports, highways, forests, built-up areas, and water bodies from the perspective of satellite remote sensing. The synthetic smoke targets had different colors and shapes. In total, 80% of the samples were randomly selected as the training set, and the remaining 20% were equally divided into the validation set and the test set. The samples were organized according to the file structure of the Pascal VOC public dataset.
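The 80%/10%/10% random split described above can be sketched as follows. The function name, the fixed random seed, and the use of integer sample identifiers are assumptions made for illustration; the original split procedure is not specified beyond the proportions.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    sample_ids: Sequence[int], train_frac: float = 0.8, seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Randomly split samples into training, validation, and test subsets.

    The non-training remainder is divided equally between validation and test.
    The seed is an assumption that makes the split reproducible.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(round(len(ids) * train_frac))
    n_val = (len(ids) - n_train) // 2
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_dataset(range(10000))
    print(len(train), len(val), len(test))
```

For 10 000 samples this yields 8000 training, 1000 validation, and 1000 test samples.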
The real dataset consisted of a set of real optical satellite videos (200 frames in total) and the corresponding manually labeled data, and was used to test the effectiveness of the synthetic data and the proposed method, as shown in Table II and Fig. 9.
b) Evaluation Metric: Moving object segmentation algorithms can be evaluated in terms of both accuracy and efficiency.
In terms of accuracy, the most commonly used evaluation index for image segmentation is intersection over union (IoU), which represents the ratio of the area of the intersection between the prediction and the label area to the total area covered by the two, indicating the accuracy of the algorithm prediction. PT represents the set of pixel locations within the prediction area, and GT represents the set of pixel locations within the true label area. The IoU can be represented by the following equation:

$$\mathrm{IoU} = \frac{|PT \cap GT|}{|PT \cup GT|} \tag{13}$$

In the case of multiclassification, the abovementioned equation can be extended to mean IoU (mIoU):

$$\mathrm{mIoU} = \frac{1}{c} \sum_{i=1}^{c} \mathrm{IoU}_{i} \tag{14}$$

In (14), c is the number of categories, and $\mathrm{IoU}_{i}$ is the IoU of the ith category. The task of smoke segmentation is a binary classification problem, so this article took the average of the IoU of smoke pixels and background pixels as the processing accuracy of a single-frame image, and calculated the average of the mIoU over multiple frames to obtain the processing accuracy of our method.
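The accuracy metric described above can be sketched for binary masks as follows. Masks are assumed to be flattened lists with 1 for smoke and 0 for background; the function names are illustrative, and returning 1.0 when a class is absent from both prediction and label is a convention assumed here, not one stated in the text.

```python
from typing import List, Sequence


def iou(pred: Sequence[int], truth: Sequence[int], label: int) -> float:
    """Intersection over union of one class between two flattened masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    union = sum(1 for p, t in zip(pred, truth) if p == label or t == label)
    # Assumption: a class absent from both masks counts as a perfect match.
    return inter / union if union else 1.0


def frame_miou(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Mean of the smoke (1) and background (0) IoU for a single frame."""
    return (iou(pred, truth, 1) + iou(pred, truth, 0)) / 2


def video_miou(preds: List[Sequence[int]], truths: List[Sequence[int]]) -> float:
    """Average of the per-frame mIoU over all frames of a video."""
    scores = [frame_miou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred = [1, 1, 0, 0]
    truth = [1, 0, 0, 0]
    print(iou(pred, truth, 1), iou(pred, truth, 0), frame_miou(pred, truth))
```

In the small example, the smoke IoU is 1/2 and the background IoU is 2/3, so the single-frame mIoU is their mean, 7/12.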
In terms of efficiency, the number of predicted frames per second (FPS) was adopted to evaluate the segmentation speed of our method.
c) Environment: The experiments were run on the Ubuntu 20.04 operating system with a PyTorch environment. The Visual Studio Code editor and an NVIDIA GeForce RTX 3090 graphics card were used for algorithm implementation, training, and prediction. We adopted end-to-end training and the Adam parameter optimization algorithm. The hyperparameters were set as follows: the initial learning rate was 1e-4, the learning rate decay coefficient was 0.98, the batch size was 8, the number of training epochs was 50, the momentum was 0.9, and the weight decay was 1e-4.
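The hyperparameters listed above can be collected in a small configuration, sketched below. The dictionary keys and the function name are hypothetical, and applying the decay coefficient of 0.98 once per epoch (exponential decay) is an assumption, since the text does not state the decay schedule.

```python
from typing import Dict, Union

# Values taken from the text; the key names are illustrative.
TRAIN_CONFIG: Dict[str, Union[str, int, float]] = {
    "optimizer": "Adam",
    "initial_lr": 1e-4,
    "lr_decay": 0.98,
    "batch_size": 8,
    "epochs": 50,
    "momentum": 0.9,
    "weight_decay": 1e-4,
}


def lr_at_epoch(epoch: int, config: Dict[str, Union[str, int, float]] = TRAIN_CONFIG) -> float:
    """Learning rate at a given epoch, assuming the decay is applied once per epoch."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return float(config["initial_lr"]) * float(config["lr_decay"]) ** epoch


if __name__ == "__main__":
    for epoch in (0, 10, 49):
        print(epoch, lr_at_epoch(epoch))
```

Under this assumed schedule, the learning rate decreases smoothly from 1e-4 over the 50 training epochs.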

B. Ablation Experiments
To verify the effectiveness of each module of the proposed method, this section took UNet++ as the basic framework and designed a series of ablation experiments, as shown in Table III. The first row is the UNet++ network. The second row embeds the attention mechanism module CBAM into the UNet++ encoder. The third row replaces all VGGNet blocks used for feature extraction in UNet++ with MobileNetV2. The fourth row replaces the binary cross-entropy loss function in UNet++ with the composite loss function described in Section III. The fifth row is AOSVSSNet, and the sixth row is AOSVSSNet with pruning.
As shown in Fig. 10, Model 2 added the convolutional attention module CBAM to UNet++, which could incorporate global context information during training, eliminate the interference of irrelevant ground object background information, and enhance the distinguishability of smoke areas. Compared with Model 1, namely UNet++, its segmentation accuracy was improved by 0.56%. Because CBAM introduced only a small number of parameters, the segmentation efficiency of Model 2 was only slightly lower than that of Model 1, so Model 2 balanced accuracy and efficiency.
Model 3 used the lightweight module MobileNetV2 for feature extraction. The segmentation accuracy of Model 3 on the test data was low, and overfitting occurred. This might be because the expansion coefficient of the inverse residual module was set too large: when the coefficient was set to 6, more features were used for model fitting, and the segmentation accuracy and efficiency were 61.50% and 104 FPS, respectively; when it was set to 2, they were 67.54% and 245 FPS, respectively. In Model 2, CBAM could alleviate overfitting to the training data by adjusting the spatial and channel feature weights. To fully express the characteristics of smoke images, the expansion coefficient of the inverse residual module in our method was still set to 6, and the pruning operation was used to improve segmentation efficiency.
Model 4 redesigned the UNet++ loss function for pixel-by-pixel smoke density estimation. This yielded a more refined segmentation edge but also introduced noise caused by similar ground object backgrounds, which decreased the overall segmentation accuracy to 61.50%. Computing multiple loss terms also reduced the segmentation efficiency to 164 FPS.
Model 5 introduced the convolutional attention module, lightweight module, and composite loss function into the basic framework of UNet++, which realized fine segmentation of smoke edges and reduced false alarms caused by incorrect
The test performance of each model in the ablation experiment on the real smoke dataset is shown in Table V. It can be seen that the segmentation accuracies of Models 4, 6, and 5, namely UNet++ with the composite loss function, AOSVSSNet with pruning, and AOSVSSNet without pruning, were 73.56%, 72.84%, and 68.81%, respectively. Models 2, 4, 1, and 6, namely UNet++ with CBAM, UNet++ with the composite loss function, UNet++, and AOSVSSNet with pruning, had higher segmentation efficiency: 16 FPS, 16 FPS, 15 FPS, and 12 FPS, respectively.
Among them, Model 4 achieved the highest segmentation accuracy on the real smoke dataset, indicating that the composite loss function based on the physical process constraints of the optical imaging of the smoke target could effectively improve the generalization of the UNet++ network model, and that the synthetic smoke dataset was effective for training. The segmentation edge of Model 4 was relatively stable between video frames, indicating that the composite loss function based on concentration estimation helped to enhance the smoothness of video object segmentation. As shown in Fig. 11, Model 2 with the CBAM module could effectively focus on the smoke information and eliminate the interference of similar ground objects. Compared with Model 1, its segmentation accuracy was significantly improved by 24.32%, indicating that the enhancement of spatial dependencies between pixels played an important role in aggregating homogeneous smoke pixels and enhancing the distinguishability of similar ground objects. Model 3 had an overfitting problem: its training accuracy was high, but its test accuracy on the synthetic and real smoke datasets was lower, at 59.42%, and it missed more smoke pixels. The segmentation accuracy of Model 6 was 4.03% higher than that of Model 5, and it could segment the pixels with lower concentration at one end of the smoke diffusion. This might be because the pruning operation not only reduced the computational complexity of the model but also alleviated the overfitting of the model to the training dataset.

C. Comparative Experiments
This section selected FCN, UNet, and DeepLabV3+ as representatives of the three main structures of classic semantic segmentation networks to verify the effectiveness of AOSVSSNet with pruning.
In the comparison experiment, the test performance of each model on the synthetic smoke dataset is shown in Table VI. It can be seen that Models 4 and 3, namely AOSVSSNet with pruning and DeepLabV3+, had higher segmentation accuracy, 70.51% and 69.33%, respectively. Models 1 and 4, namely FCN and AOSVSSNet with pruning, had higher segmentation efficiency, 233 FPS and 227 FPS, respectively.
As shown in Fig. 12, in addition to the smoke target, the segmentation results of Model 1, FCN, contained many background objects, and its segmentation accuracy was 54.13%. Although FCN combined the segmentation results of the coarse layer and the fine layer, so that the prediction of local pixels followed the global ground object distribution structure to a certain extent, it still did not fully consider the relationship between pixels and lacked local information consistency. This led to a large number of missegmented patches of similar ground objects. Model 2, UNet, fused the high- and low-level features output by the same layer of the encoder and decoder and upsampled them layer by layer, which narrowed the semantic gap and greatly improved the segmentation accuracy to 68.79%, which was 14.66% higher than that of FCN. Model 3, DeepLabV3+, could fuse information of various scales without reducing the feature resolution through the atrous convolution pyramid, equivalent to incorporating finer global context information, and its segmentation accuracy was 69.33%. The accuracy of AOSVSSNet with pruning was 70.51%, and its segmentation efficiency was 36.75% higher than that of DeepLabV3+, balancing accuracy and efficiency.
The synthetic smoke dataset shown in Fig. 13 contains small, medium, and large smoke targets. The segmentation results showed that our method could adapt to smoke targets of different sizes, and that the synthetic dataset could simulate different spatial scales from the perspective of satellite remote sensing as well as actual smoke scenes at different periods. In addition, as shown in Fig. 14, our method and Model 3, namely DeepLabV3+, produced fewer false alarms in smoke-free images. This was because the enhancement of global context features suppressed the false alarms of similar ground objects, thereby improving the smoke segmentation accuracy.
The test performance of each model in the comparison experiment on the real smoke dataset is shown in Table VII and Fig. 15. It can be seen that Models 4 and 3, namely AOSVSSNet with pruning and DeepLabV3+, had higher segmentation accuracy, 72.84% and 72.14%, respectively, indicating good generalization. Models 2 and 1, namely UNet and FCN, had higher segmentation efficiency but lower accuracy. Combined with the ablation experiments, the reason was that Models 4 and 3 integrated more spatial context information during training, each in a different way. DeepLabV3+ utilized the atrous convolution pyramid structure to fuse multilevel fine feature information. Our method combined the convolutional attention module and the composite loss function to enhance the expression of spatial dependencies between pixels, making the features of the smoke target and the background more distinguishable and improving the segmentation accuracy.

V. CONCLUSION
In this article, a deep learning method is introduced into optical satellite video smoke object segmentation. An attention-guided optical satellite video smoke segmentation network model is proposed to address the multiscale segmentation of satellite video smoke targets and the background interference of complex and similar ground objects. Based on UNet++, the lightweight attention module CBAM enhances the smoke target features, effectively suppresses false alarms from the ground object background, and achieves high segmentation accuracy on the synthetic dataset. A synthetic dataset is constructed based on the optical imaging process of smoke targets to address the difficulty that blurred smoke edges pose for the manual labeling of deep learning samples and for model segmentation, which saves manual labeling costs. In addition, the physical constraints of the smoke imaging process are introduced into the loss function, which improves the generalization of the model to real smoke data.
Future work will mainly focus on optimizing the context feature extraction method, improving the network's ability to fuse global and local features, further reducing missegmented and missed smoke pixels, and testing in different real smoke scenes. In addition, there are differences in the imaging process between satellite video and natural images. Remote sensing physical mechanisms such as atmospheric radiative transfer models can be considered as constraints and integrated into the model to enhance the interpretability and generalization of the smoke segmentation network.