Background Subtraction Based on GAN and Domain Adaptation for VHR Optical Remote Sensing Videos

The application of deep learning techniques to background subtraction for very high resolution (VHR) optical remote sensing videos holds the potential to facilitate multiple intelligent remote sensing processing tasks. However, existing background subtraction methods for VHR optical remote sensing videos still face technical challenges. First, conventional CNNs and other networks are limited by performance constraints. Second, existing background subtraction methods are mostly trained on natural videos due to the lack of VHR optical remote sensing video datasets. Third, VHR optical remote sensing videos have very large scene sizes. In this article, we design a novel deep learning network that fully utilizes GAN and domain adaptation and is able to measure and minimize the discrepancy between the feature distributions of natural videos and VHR optical remote sensing videos, so that background subtraction performance for VHR optical remote sensing videos is improved significantly. Extensive experiments on the CDnet 2014 dataset and a VHR optical remote sensing video dataset demonstrate that our proposed method achieves an average F-measure of 0.8533, which reveals excellent background subtraction performance.


I. INTRODUCTION
Recently, benefiting from the rapidly increasing availability of very high resolution (VHR) optical remote sensing videos, such videos have been widely utilized in various fields. Background subtraction in VHR optical remote sensing videos plays a vital role in military and civil applications, such as military reconnaissance [1], [2], maritime surveillance [3], [4], urban planning and construction [5], and traffic monitoring [6], [7]. It is regarded as a crucial component and key link of intelligent remote sensing processing.
Since considerable attention has been focused on background subtraction in recent years, promising methods have been proposed. Garcia-Garcia et al. [8] presented an exhaustive review on background subtraction. Background subtraction methods based on robust principal component analysis (RPCA) models [9], fuzzy models [10], semantic models [11] and deep learning models [12] are considered promising solutions. Among these methods, deep learning-based methods show great potential in practice owing to their strong generalization capability.
Although deep learning-based background subtraction methods tap into the advantages of deep features and achieve very good performance, there are some technical challenges for VHR optical remote sensing videos. The first technical challenge is that background subtraction for VHR optical remote sensing videos requires features to be extracted efficiently by networks with strong generalization capability. However, conventional convolutional neural networks (CNNs) and other networks are limited by performance constraints, which do not satisfy this demand.
The acquisition and annotation of VHR optical remote sensing videos are quite expensive and complicated. The second technical challenge is therefore that existing background subtraction methods are almost exclusively trained on natural videos. However, natural videos and VHR optical remote sensing videos have similar but different distributions, so when these deep learning-based methods are applied to VHR optical remote sensing videos, their performance decreases sharply.
The third technical challenge is that VHR optical remote sensing videos have very large scene sizes. Usually, the height and width of VHR optical remote sensing videos are hundreds of times larger than those of normal videos. This leads to background subtraction for VHR optical remote sensing videos being more time consuming. In addition, the whole VHR optical remote sensing video frame is too large to directly use with deep neural networks, such as CNNs.
To deal with the technical challenges mentioned above, we propose a deep learning framework that fully utilizes generative adversarial networks (GAN) and domain adaptation and is capable of reliable, efficient, real-time background subtraction for VHR optical remote sensing videos. Specifically, we design a novel and robust background subtraction approach anchored on conditional generative adversarial networks (CGAN) [13] and domain-adversarial neural networks (DANN) [14]. By integrating CGAN with DANN, we measure and minimize the discrepancy between the feature distributions of natural videos and VHR optical remote sensing videos so that the background subtraction performance for VHR optical remote sensing videos improves significantly.
The contributions of our article are as follows.
• We construct and train CGAN to learn a nonlinear mapping between the input video frame and generated background using natural videos. By leveraging CGAN, the proposed deep neural network has stronger generalization and more powerful feature extraction capability than conventional CNN and can achieve excellent background subtraction performance.
• DANN learns the distinction between the source domain (natural videos) and the target domain (VHR optical remote sensing videos). This means that the background subtraction method learned from the labeled natural videos can be performed on completely unlabeled VHR optical remote sensing videos. Benefiting from this, we greatly improve the background subtraction performance for VHR optical remote sensing videos compared with other advanced methods.
• We divide all of the VHR optical remote sensing video frames into regular blocks of the same size. Then, they are sent to the deep neural network in parallel and computed on the GPU. Finally, the outputs of the deep neural network are reverted to the original size, which is the background subtraction result. These steps make considerable contributions to address the problem of large scene size and significantly speed up processing.
• We choose the widely used CDnet 2014 dataset [15] as our natural videos and build a novel VHR optical remote sensing video dataset with no annotations. Extensive experiments are conducted on both datasets, demonstrating that our proposed approach achieves an average F-measure of 0.8533.
The rest of our article is organized as follows. Section II reviews related work on background subtraction, domain adaptation and GAN. Section III describes the methodology. Section IV presents the experiments. Section V concludes this article.

II. RELATED WORK
In this section, several popular background subtraction techniques based on deep neural networks are reviewed, along with related work on domain adaptation and GAN.

A. BACKGROUND SUBTRACTION BASED ON DEEP NEURAL NETWORKS
Deep neural networks have been identified as one of the best methods for learning and representing features. In recent years, researchers have begun to explore the perspective of deep neural networks and applied them to background subtraction, achieving outstanding performances.
The restricted Boltzmann machine (RBM) has been extensively applied to unsupervised learning and image modeling, and was among the first deep models used for background subtraction. Guo et al. [16] first generated a moving object detection result through RBM-based background subtraction in 2013. They treated the background subtraction problem as an image recovery and foreground residual estimation task. Xu et al. [17] combined the temporal nature of background subtraction with RBM, which can achieve stable background subtraction results and adapt to changes quickly.
For static backgrounds, several previous methods have already achieved excellent performance. However, performance on dynamic backgrounds still leaves considerable room for improvement. To address this challenge, many methods have been proposed based on deep auto-encoder networks. Xu et al. [18] developed a novel background subtraction method anchored on deep auto-encoder networks. They first utilized a deep auto-encoder to extract dynamic background images from videos. Then, they leveraged another auto-encoder to learn the representation of the dynamic background. Lim et al. [19] designed a CNN with an encoder-decoder structure in which the encoder outputs a high-dimensional feature vector and the decoder converts it into the segmentation result.
Since CNN has achieved great success in image classification tasks, its application in the computer vision community has become more and more extensive. Therefore, many novel methods for background subtraction have been proposed based on CNN. Braham and Droogenbroeck [20] proposed a novel background subtraction method anchored on spatial features learned by a CNN. Cinelli [21] proposed a novel approach based on residual neural networks, which is able to detect moving objects by pixelwise foreground segmentation. Wang et al. [22] proposed a semiautomatic end-to-end network anchored on a multiresolution cascaded CNN. Babaee et al. [23] performed background subtraction from video sequences using CNN, outputting the final background subtraction results after spatial-median filtering. Zhao et al. [24] proposed a two-stage CNN that simultaneously outputs the background and the foreground.

B. DOMAIN ADAPTATION
Domain adaptation holds the potential to transfer learned knowledge from the source domain to a different but relevant target domain. In recent years, various methods of domain adaptation have been proposed using unsupervised learning. Huang et al. [25] presented a non-parametric method that directly produces resampling weights without distribution estimation, which matches distributions in high-dimensional feature space. Gong et al. [26] developed a novel approach that automatically bridges different distributions and can be optimized discriminatively without any groundtruth from the target domain. Pan et al. [27] proposed transfer component analysis (TCA), a novel representation method designed specifically for domain adaptation. It projects input data onto the learned transfer components, which can alleviate the discrepancy between two similar but different distributions.
While the above are unsupervised domain adaptation methods, multiple methods perform supervised domain adaptation by exploiting labeled data from the target domain. Gopalan et al. [28] were motivated by incremental learning and designed intermediate representations of data between the source domain and the target domain, which makes it possible to learn a discriminative target-domain classifier from projections of labeled source domain data. Baktashmotlagh et al. [29] proposed a novel type of domain-invariant projection that extracts invariant information across the source domain and the target domain. Specifically, it learns a projection of the data into a low-dimensional space where the distance between the source domain and the target domain distributions is minimized.
Existing domain adaptation approaches mainly focus on learning invariant feature representations across the source domain and the target domain. They seldom leverage low-level intrinsic structures. Gong et al. [30] developed a geodesic flow kernel, a novel type of kernel-based method, which aggregates infinite subspaces that are discriminative both in geometrical and statistical attributes from the source domain to the target domain. Chopra et al. [

C. GAN
Kocaoglu et al. [32] developed a generative model for a given causal graph by tapping into the advantages of the adversarial training process. They designed CausalGAN and CausalBEGAN, two novel types of conditional GAN. Experimental results demonstrate that the designed approach holds the potential to exploit the underlying image distributions, even though the datasets are inevitably noisy. Feizi et al. [33] designed a novel GAN architecture to recover the maximum-likelihood solution and showed fast generalization capacity.
Hong et al. [34] developed a neoteric approach to aggregate GAN into fully convolutional networks (FCN), which can mitigate the gap between the source domain and the target domain. It utilizes a conditional generator and transforms features of synthetic images to real images. Then, a discriminator is utilized to distinguish them. In every training batch, the generator and the discriminator of CGAN compete against each other. By doing so, the generator and the discriminator of CGAN improve their performance.

III. METHODOLOGY
The proposed background subtraction framework for VHR optical remote sensing videos consists of four parts.
(1) background image generation; (2) data preparation; (3) background subtraction network construction and training; (4) domain adaptation. Figure 1 depicts the framework of the background subtraction method for VHR optical remote sensing videos.

A. BACKGROUND GENERATION
For background subtraction using the proposed framework, we generate the background model from the original videos. A clear background model is vital for the proposed framework to fully learn the discrepancies between the current input frame and the background model, which benefits the final background subtraction results.
We leverage the background generation method based on a foreground mask extracted by the SuBSENSE algorithm [35]. Figure 2 illustrates the generated background images extracted under highway and bungalow scenes. As shown in Figure 2, extracted background images are very similar to the real background. Although there are many moving targets in the scene, the generated background images are not blurred.
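The SuBSENSE-based generation in [35] is beyond a short snippet, but the underlying idea of building a background model from foreground-masked frames can be sketched as a simple running average (an illustrative stand-in, not the actual algorithm; `update_background` and its parameters are hypothetical names):

```python
# Simplified sketch of foreground-masked background modeling: pixels
# flagged as foreground are excluded from the running average, so the
# moving targets in the scene do not blur the background model.

def update_background(background, frame, fg_mask, alpha=0.05):
    """Blend the current frame into the background model, skipping
    pixels marked as foreground (fg_mask[i][j] == 1)."""
    h, w = len(frame), len(frame[0])
    for i in range(h):
        for j in range(w):
            if fg_mask[i][j] == 0:  # background pixel: update the model
                background[i][j] = ((1 - alpha) * background[i][j]
                                    + alpha * frame[i][j])
    return background
```

This is why the generated backgrounds in Figure 2 remain sharp despite many moving targets: foreground pixels simply never enter the model.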

B. DATA PREPARATION
The RGB [36] and HSV [37] color spaces are two commonly utilized color spaces. Compared with RGB space, HSV space is more capable of representing visual brightness and darkness changes. In background subtraction, illumination variation, which has a negative effect on performance, inevitably occurs in video scenes. For the purpose of alleviating this effect, each video frame is converted from RGB to HSV:

V = max(R, G, B)
S = (V − min(R, G, B)) / V, with S = 0 when V = 0
H = 60 · (G − B) / (V − min(R, G, B)) if V = R; H = 120 + 60 · (B − R) / (V − min(R, G, B)) if V = G; H = 240 + 60 · (R − G) / (V − min(R, G, B)) if V = B

where H, S, and V represent the three values of the HSV color space, respectively, and R, G, and B represent the three values of the RGB color space.
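The conversion above can be applied per pixel with the Python standard library's `colorsys` module (shown here as a sketch; note that `colorsys` returns H scaled to [0, 1), so we rescale it to degrees):

```python
import colorsys

def rgb_frame_to_hsv(frame):
    """Convert a frame of (R, G, B) tuples in [0, 255] to (H, S, V),
    with H in degrees [0, 360) and S, V in [0, 1]."""
    out = []
    for row in frame:
        out_row = []
        for (r, g, b) in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            out_row.append((h * 360.0, s, v))
        out.append(out_row)
    return out
```

For example, a pure red pixel (255, 0, 0) maps to H = 0, S = 1, V = 1, while a gray pixel has S = 0, making brightness changes visible in V alone.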
In addition, to improve the generalization ability of the proposed framework and allow it to adapt to different scenes under various conditions, it is necessary to learn from videos of different sizes in different scenes. In this article, a bilinear interpolation algorithm [38] is utilized to resize the original video frames in the CDnet 2014 dataset to the same size of 321 × 321 × 6.

C. BACKGROUND SUBTRACTION NETWORK CONSTRUCTION AND TRAINING
1) BACKGROUND SUBTRACTION NETWORK BASED ON CGAN
CGAN [13] guides the generation process by adding additional information, which is an extension of conventional GAN. A conditional variable y with constraints is added to the generator and the discriminator to guide and restrain the generation process. The constraints can be a variety of information, such as category labels, partial data for image repair, and data from different modalities. When category labels are used as constraints, the unsupervised GAN becomes supervised. This improvement has proven to be very effective, so CGAN has been widely used.
For the purpose of overcoming the shortcomings of conventional CNN, such as poor generalization and feature extraction capabilities, Long et al. proposed FCN [39] and successfully applied it to image semantic segmentation. FCN removes all fully connected layers of the conventional CNN. Transposed convolution layers in FCN restore the size of the feature maps downsampled by the pooling layers, which is designed to produce a correspondingly sized output for an input of any size. FCN not only preserves the overall structural information of the input image but also realizes pixel-by-pixel semantic segmentation.
The proposed background subtraction network based on CGAN is made up of a generator and discriminator. The generator is utilized to obtain the background subtraction result. It is modified from FCN. The discriminator is constructed by conventional CNN. It is utilized to discriminate the probability of the input image coming from the dataset or generator. Figure 3 shows the framework of the background subtraction network based on CGAN, which is called CGAN-BS. In Figure 3, the input of the generator consists of z and y, where z is composed of the original video frame and generated background and y is the ground truth.
The backbone of the FCN in the generator is based on ResNet-50 [40], which is shown in Table 2. Figure 4 is the architecture of the generator. Table 2 gives the specific network parameters. In the FCN, the residual network is used to extract deep features hidden behind the original input image.
As illustrated in Figure 5, in the discriminator, the size of the input video frame is reduced through several convolutional layers with a stride of 2, followed by a fully connected layer. The output of the discriminator indicates the probability that the input video frame comes from the generator or the dataset.

A mixed loss function is utilized to define the loss of CGAN-BS. The losses of the generator and the discriminator are as follows:

G_loss = (1/N) Σ_n l_bce(s(x_n), y_n)
D_loss = (1/N) Σ_n [l_bce(a(x_n, y_n), 1) + l_bce(a(x_n, s(x_n)), 0)]

where l_bce is the cross-entropy loss, s(·) is the generated result from the generator, x is the training sample, y is the groundtruth, a is the output of the discriminator, a(x_n, y_n) is the output of the discriminator when its input is the groundtruth, a(x_n, s(x_n)) is the output of the discriminator when its input is the output of the generator, and n indexes the n-th sample.
Assuming the predicted value is ẑ and the actual value is z, the cross entropy is:

l_bce(ẑ, z) = −[z · log(ẑ) + (1 − z) · log(1 − ẑ)]

During the training phase of the proposed model, we alternately fix the generator while updating the parameters of the discriminator, and vice versa. For the discriminator, it is expected that it can correctly distinguish whether the input video frame comes from the dataset or the generator; as a result, we minimize D_loss. For the generator, on the one hand, we expect the generated result to be as close as possible to the groundtruth, that is, we minimize G_loss. On the other hand, we expect the generated result to "trick" the discriminator, that is, we minimize l_bce(a(x_n, s(x_n)), 1). Hence, we introduce the penalty factor λ (0 < λ < 1) and modify the loss function of the generator to:

G_mix = G_loss + λ · l_bce(a(x_n, s(x_n)), 1)

When training the generator, its parameters are updated according to G_mix; when training the discriminator, its parameters are updated according to D_loss.
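The loss terms above can be sketched as scalar helpers (a minimal illustration of the bookkeeping, not the network itself; `d_loss` and `g_mix` are hypothetical names, and in practice the losses are averaged over all pixels and samples):

```python
import math

def l_bce(pred, target, eps=1e-7):
    """Binary cross-entropy for a single prediction in (0, 1)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

def d_loss(a_real, a_fake):
    """Discriminator loss: push its output on groundtruth pairs toward 1
    and its output on generated pairs toward 0."""
    return l_bce(a_real, 1.0) + l_bce(a_fake, 0.0)

def g_mix(g_loss, a_fake, lam=0.5):
    """Mixed generator loss: the reconstruction term plus the adversarial
    term that pushes the discriminator's output on generated pairs toward
    1, weighted by the penalty factor lambda (0 < lambda < 1)."""
    return g_loss + lam * l_bce(a_fake, 1.0)
```

Note how `d_loss` and the adversarial term in `g_mix` pull the discriminator output on generated pairs in opposite directions, which is exactly the competition the alternating updates exploit.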

2) BACKGROUND SUBTRACTION NETWORK FOR VHR OPTICAL REMOTE SENSING VIDEOS USING DANN
After we exploit CDnet 2014 to train the CGAN-based background subtraction network, the trained generator, which can be utilized to produce background subtraction results, is the desired outcome. We modify the trained generator with DANN; the resulting network is called DANN-BS.
DANN is a novel feature representation method designed for domain adaptation that concentrates on incorporating domain adaptation and deep feature learning at the same time. It is a generic approach that allows almost every existing feedforward deep neural network, including our proposed  generator, to implement knowledge transferability. The only additional component is the gradient reversal layer, which is utilized to reverse the gradient during the backpropagation computation process. DANN is made up of a label predictor, domain classifier and feature extractor.
In this article, we consider the background subtraction task as a classification task that aims to classify every pixel in the video frame as background or foreground. We define X as the input space and Y = {0, 1, . . . , L − 1} as the set of L possible labels. There are two different distributions over X × Y, named the source domain D_S and the target domain D_T. For unsupervised domain adaptation, the labeled source sample S is drawn from D_S, while the unlabeled target sample T is drawn from D_T^X, where D_T^X is the marginal distribution of D_T over X:

S = {(x_i, y_i)}, i = 1, . . . , n, drawn i.i.d. from D_S
T = {x_i}, i = n + 1, . . . , N, drawn i.i.d. from D_T^X

where N = n + n' is the total number of samples. The goal of domain adaptation is to build a classifier η : X → Y with a low target risk, i.e., a low probability that η(x) ≠ y for (x, y) drawn from D_T. In DANN-BS, it is assumed that there are two data distributions: the source domain S(x, y) and the target domain T(x, y). The goal of DANN-BS is to obtain better performance on the target domain using unsupervised learning. The training samples x_1, x_2, · · · , x_N come from both domains, and we define the domain label of the i-th training sample as d_i = 0 if x_i is drawn from the source domain and d_i = 1 if x_i is drawn from the target domain. In the course of training, the inputs of DANN-BS come from labeled source domain datasets and unlabeled target domain datasets, together with the domain classification labels for both domains. Namely, we have the background subtraction groundtruth for the source domain but not for the target domain.
Since the VHR optical remote sensing video dataset is not labeled, there is no groundtruth. As a result, the input of the generator is changed from 321 × 321 × 7 to 321 × 321 × 6. Figure 6 shows the framework of the background subtraction network using DANN. As visualized in Figure 6, the green dotted box is the feature extractor, the blue dotted box is the label predictor and the red dotted box is the domain classifier.
The connecting line between the feature extractor and the domain classifier is the gradient reversal layer. During forward propagation, the gradient reversal layer acts as an identity mapping. During backpropagation, the gradient passing through it is multiplied by a negative constant. The mathematical expression is as follows:

R(x) = x,   dR/dx = −λ_p · I

where I is the identity matrix and λ_p is the adaptation factor defined below. The input of the feature extractor is a batch of 321×321×6 tensors. The input batch has two parts: the first part comes from natural videos, and the second part comes from VHR optical remote sensing videos. The whole batch is fed into the feature extractor, and we obtain the corresponding features.
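The gradient reversal behavior above can be sketched numerically (an illustrative toy class, not the actual autograd implementation; in a deep learning framework this would be a custom autograd function):

```python
class GradientReversal:
    """Minimal sketch of the gradient reversal layer: identity on the
    forward pass, gradient multiplied by -lambda on the backward pass."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, x):
        return x  # identity: features pass through unchanged

    def backward(self, grad):
        # Reverse (and scale) the gradient flowing back to the feature
        # extractor, so it is pushed toward domain-invariant features.
        return [-self.lam * g for g in grad]
```

Because only the backward pass is altered, the feature extractor receives a gradient that *increases* the domain classifier's loss, which is what drives the adversarial domain alignment.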
The output of the label predictor is the background subtraction result. It is worth mentioning that in the course of training, the input of the label predictor is the first part of the batch, which means that the second part, coming from VHR optical remote sensing videos, is discarded. The reason is that the VHR optical remote sensing video dataset has no groundtruth, so backpropagation cannot be conducted for the second part.
In the domain classifier, there is a fully connected layer and two convolution layers. The input of the domain classifier is a complete batch including natural videos and VHR optical remote sensing videos. The outputs of the domain classifier determine whether the input batch comes from the source domain (natural videos) or the target domain (VHR optical remote sensing videos).
In DANN-BS, the feature extractor aims to learn a hidden representation for both source frames and target frames such that the domain classifier cannot reliably distinguish source domain inputs from target domain inputs, while the label predictor remains able to classify the source inputs accurately.
In the course of training, the learning rate is adjusted using the following formula:

µ_p = µ_0 / (1 + α · p)^β

where p gradually increases from 0 to 1, µ_0 = 0.01, α = 9 and β = 0.69. This schedule facilitates convergence and alleviates deviation in the target domain. In addition, the momentum is set to 0.9. The adaptation factor λ_p is set to 0 during the initialization phase and then increases to 1, as in the formula below:

λ_p = 2 / (1 + exp(−γ · p)) − 1

where γ is set to 10. This makes the domain classifier immune to noise during the initialization phase of training. λ_p is utilized to update the feature extractor.
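The two schedules above can be sketched directly (helper names are illustrative; the constants are the ones stated in the text):

```python
import math

def learning_rate(p, mu0=0.01, alpha=9.0, beta=0.69):
    """Annealed learning rate mu_p = mu0 / (1 + alpha * p) ** beta,
    with training progress p running from 0 to 1."""
    return mu0 / (1.0 + alpha * p) ** beta

def adaptation_factor(p, gamma=10.0):
    """Domain-classifier weight lambda_p = 2 / (1 + exp(-gamma * p)) - 1,
    which rises from 0 toward 1 as training progresses."""
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0
```

At p = 0 the learning rate is exactly µ_0 and λ_p is exactly 0 (the domain classifier has no influence yet); as p → 1 the learning rate decays while λ_p saturates near 1.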

IV. EXPERIMENTAL RESULTS AND ANALYSIS
We conducted multiple experiments on the CDnet 2014 dataset and VHR optical remote sensing video dataset, which we made, to evaluate the performance of our proposed background subtraction method.

A. DATASET
CDnet 2014 is one of the benchmark datasets for background subtraction, which includes complex video scenes that are challenging in the field of background subtraction, such as camera jitter, dynamic backgrounds, shadows, and obscure videos.
Our dataset is based on the VHR optical remote sensing videos provided by the Jilin No. 1 satellite. The videos are in AVI format, and the resolution of the videos is better than 1 m. The ground area covered by each video is 11 km × 4.5 km. All videos have undergone geometric correction, radiometric correction and image stabilization. The targets in our VHR optical remote sensing videos are diverse. The average length of each video is 30 seconds. Figure 7 shows one of the VHR optical remote sensing video frames in Beijing.
A local binary pattern (LBP) [41] texture histogram is utilized to demonstrate that the two datasets have different distributions. LBP is a popular local texture descriptor in computer vision that has also been widely applied in background subtraction [42]-[44]. Hence, LBP texture histograms are of great value in observing the distributions of the two datasets. As shown in Figure 8, (a) is the LBP texture histogram of CDnet 2014 and (b) is the LBP texture histogram of the VHR optical remote sensing video dataset. Both reflect the overall local texture features of video frames in the two datasets. In (a), only the bin at 255 contains a large number of pixels. However, in (b), bin 25 and bin 225 also contain many pixels in addition to bin 255.
We therefore conclude that the CDnet 2014 and VHR optical remote sensing video datasets share similar but different distributions.
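The per-pixel LBP code underlying the histograms above can be sketched as follows (a minimal 8-neighbor variant; real implementations vary in neighbor ordering and radius):

```python
def lbp_code(img, i, j):
    """8-neighbor LBP code of pixel (i, j): each neighbor with
    intensity >= the center contributes one bit to an 8-bit code."""
    center = img[i][j]
    # Neighbors visited clockwise starting from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (di, dj) in enumerate(offsets):
        if img[i + di][j + dj] >= center:
            code |= 1 << bit
    return code
```

A histogram of these codes over all pixels of a frame gives exactly the kind of texture signature compared in Figure 8: flat regions produce code 255 (all neighbors equal the center), which explains why that bin dominates.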
Since the frame size of the optical remote sensing videos is 12,000 × 5,000 pixels and the videos include considerable interference, such as background clutter and moving cloud occlusion, we preprocessed the videos first. The whole video frame was cut into many blocks, each of the same size of 321 × 321. Then, they were sent to the background subtraction network in parallel and computed on the GPU. Finally, all the outputs of the background subtraction network were reverted to the original size, yielding the background subtraction result. These steps made considerable contributions to addressing the problem of large scene size and significantly sped up processing.
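The tile-and-merge preprocessing above can be sketched as follows (function names are illustrative; note that 12,000 × 5,000 is not an exact multiple of 321, so in this sketch edge tiles are simply smaller — padding the frame is a common alternative):

```python
def split_into_blocks(frame, block=321):
    """Cut a frame (2-D list) into non-overlapping block x block tiles,
    returning each tile with its top-left coordinates."""
    h, w = len(frame), len(frame[0])
    tiles = []
    for top in range(0, h, block):
        for left in range(0, w, block):
            tile = [row[left:left + block] for row in frame[top:top + block]]
            tiles.append((top, left, tile))
    return tiles

def merge_blocks(tiles, h, w):
    """Revert per-tile outputs back to a single h x w result."""
    out = [[0] * w for _ in range(h)]
    for top, left, tile in tiles:
        for di, row in enumerate(tile):
            for dj, v in enumerate(row):
                out[top + di][left + dj] = v
    return out
```

Because the tiles are independent, they can be batched and processed in parallel on the GPU, and `merge_blocks` reassembles the per-tile masks into the full-frame result.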

B. EVALUATION METRICS
We note that the built-in assumption of existing proposals is that background subtraction can be regarded as a pixel-level classification issue. Its essence is the two-class classification of image pixels. Therefore, evaluation metrics based on pixel-level classification were used to evaluate the proposed method in this article and other advanced methods.
In this article, we treat background pixels as positive while foreground pixels as negative. Pixel-level classification results can be divided into four categories: background pixels are judged as background pixels, foreground pixels are judged as background pixels, foreground pixels are judged as foreground pixels and background pixels are judged as foreground pixels, represented by true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
In terms of the above statistics, seven objective evaluation indicators, namely Recall (Re), Specificity (Sp), Precision (Pr), F-Measure (FM), False Positive Rate (FPR), False Negative Rate (FNR) and Percentage of Wrong Classifications (PWC) [45], were used to evaluate background subtraction in this article.
In background subtraction, Re denotes the ratio of background pixels that are correctly classified as background. The larger the value of Re, the better the performance. Pr represents the ratio of correctly classified pixels among the pixels judged as background. The larger the value of Pr, the better the performance. Sp mainly characterizes the ability to classify foreground pixels. The larger the value of Sp, the better the performance. FPR, FNR and PWC all indicate the ratio of misclassifications; the smaller the value, the better the performance. In practice, Re and Pr are usually used together to measure background subtraction, and both are expected to be as high as possible. However, they are contradictory in some cases: it is possible for Re to be high while Pr is low. Therefore, the comprehensive indicator FM is generally used to evaluate background subtraction.
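The seven indicators follow the usual pixel-level definitions; a small helper makes the formulas explicit (with background as the positive class, as in this article):

```python
def metrics(tp, fp, tn, fn):
    """Pixel-level evaluation metrics computed from the four counts."""
    re = tp / (tp + fn)                      # Recall
    sp = tn / (tn + fp)                      # Specificity
    pr = tp / (tp + fp)                      # Precision
    fm = 2 * pr * re / (pr + re)             # F-Measure
    fpr = fp / (fp + tn)                     # False Positive Rate
    fnr = fn / (tp + fn)                     # False Negative Rate
    pwc = 100.0 * (fn + fp) / (tp + fp + tn + fn)  # % Wrong Classifications
    return {"Re": re, "Sp": sp, "Pr": pr, "FM": fm,
            "FPR": fpr, "FNR": fnr, "PWC": pwc}
```

Note that FM is the harmonic mean of Pr and Re, so it is high only when both are high, which is why it is the preferred summary indicator.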

C. EXPERIMENTAL ENVIRONMENT
Our hardware experimental environment: The CPU was an Intel Xeon (R) E5-2640 v4 with 20 cores running at 2.4 GHz. The memory size was 64 GB. The GPU was an NVIDIA GeForce GTX 1080 with 8 GB of memory.

D. FEASIBILITY EXPERIMENT AND ANALYSIS
To verify the feasibility of the proposed DANN-BS, the training time and inference time were analyzed.

1) TRAINING TIME OF THE PROPOSED METHOD
When batch size was set to 30, the training time was tested under the CPU and GPU. With the CPU, the training time was more than one month, and the memory consumption was more than 20 GB. With GPU acceleration technology, the training time was reduced to 18 hours, which was within the acceptable range.

2) INFERENCE TIME OF THE PROPOSED METHOD
The trained model was utilized for background subtraction. Experiments show that the single image inference time is 50 ms, or 20 frames per second. As a result, the proposed method was capable of being processed in real-time.

E. EXPERIMENTS ON BACKGROUND SUBTRACTION NETWORK CGAN-BS
To further illustrate the superiority of the proposed network over existing methods, several representative background subtraction algorithms were selected for comparison. The algorithms based on unsupervised learning include the Gaussian mixture model (GMM) [47], the pixel-based adaptive segmenter (PBAS) [48] and the SuBSENSE algorithm [35]. The algorithm based on supervised learning was DeepBS, proposed by Babaee et al. in 2017 [23].

1) EXPERIMENTS ON CDnet 2014
We leverage the CDnet 2014 dataset to train the proposed background subtraction network CGAN-BS. Figure 9 shows the background subtraction results of the 1,500th frame, the 1,840th frame, the 1,900th frame, and the 2,360th frame in the 'fall' video scene by the proposed CGAN-BS and other advanced methods. The 'fall' video scene belongs to the dynamic background category of the CDnet 2014 dataset. Due to the shaking of leaves in the 'fall' video scene, the traditional unsupervised learning methods incorrectly classified the swaying leaves as foreground. In addition, there was considerable noise around the foreground. The DeepBS method based on a deep neural network accurately classified the swaying leaves as background. However, there were too many holes in the foreground, as shown in the classification results. In the 1,900th frame of the 'fall' video, other advanced methods incorrectly classified the suspected regions as foreground. The foreground produced by the proposed CGAN-BS method was consistent, and the details of the foreground along with its structural information were complete. More importantly, the suspected regions were not classified as foreground by the proposed CGAN-BS. Consequently, the proposed CGAN-BS surpassed the other methods. Table 3 shows the evaluation indicators of CGAN-BS compared with other advanced methods on the CDnet 2014 dataset. By leveraging CGAN, the proposed background subtraction network CGAN-BS has stronger generalization and more powerful feature extraction capability than conventional CNN and other advanced methods, which yields excellent background subtraction performance.

2) EXPERIMENTS ON THE VHR OPTICAL REMOTE SENSING VIDEO DATASET
Taking a VHR optical remote sensing image block of Beijing as an example, we carried out experiments on it and obtained the background subtraction results. Figure 10 is a diagram showing the background subtraction results of CGAN-BS and other advanced methods in the VHR optical remote sensing video.
From Figure 9, Figure 10 and Table 4, we can see that the proposed background subtraction network CGAN-BS shows good performance on natural videos but poor performance on the VHR optical remote sensing video dataset. This shows that the two datasets share similar but different distributions.
In Table 4, the experimental results of CGAN-BS on the untrained VHR optical remote sensing video dataset were not as good as those on the trained natural videos. What is worse, the FNR and PWC were high. This shows that the performance of CGAN-BS on untrained video scenes is flawed.
Experiments demonstrate that our proposed CGAN-BS obtains good performance in video scenes that have been involved in training, but it does not work well in untrained video scenes. Therefore, the robustness of the proposed CGAN-BS needs to be further improved.

F. EXPERIMENTS ON BACKGROUND SUBTRACTION NETWORK DANN-BS

1) EXPERIMENTAL RESULTS
We used the unlabeled VHR optical remote sensing video dataset that we made to test the proposed DANN-BS. It is compared with the CGAN-BS, PBAS, SuBSENSE, DeepBS and GMM, as shown in Table 5.
In Table 5, our proposed DANN-BS achieves higher FM, Pr and Re than the traditional SuBSENSE, PBAS and GMM in unlabeled video scenes. Compared with the proposed CGAN-BS, the FM of DANN-BS increases by 7.3%, the Pr increases by 7.8%, and the Re increases by 3.2%. Therefore, we conclude that the proposed DANN-BS can effectively improve the generalization ability of background subtraction for unlabeled VHR optical remote sensing videos and, owing to domain adaptation, can also be applied to other unlabeled complex video scenes. Figure 11 shows the background subtraction results of CGAN-BS and DANN-BS on the VHR optical remote sensing videos. From Figure 11, we can see that the proposed background subtraction network DANN-BS obviously improves the performance.

2) EXPERIMENTS WITH DIFFERENT COLOR SPACES
On the VHR optical remote sensing video dataset, when the input video frames use the RGB color space, the average FM of 0.8533 is higher than the 0.8509 obtained when the input video frames use the HSV color space. Nonetheless, the difference between them is small. Consequently, the selection of the color space has little effect on the background subtraction results of the proposed DANN-BS method.

V. CONCLUSION
In this article, we designed a novel background subtraction network based on CGAN and DANN. The proposed method utilizes generated background and video frame pairs and obtains the background subtraction result for VHR optical remote sensing videos by using domain adaptation. Our experiments on the CDnet 2014 dataset and our own dataset demonstrated that the proposed method achieves an average FM of 0.8533, which reveals excellent background subtraction performance for VHR optical remote sensing videos.
It is noteworthy that the quality and quantity of background subtraction datasets for VHR optical remote sensing videos in academia are limited. In the future, we will devote our effort to collecting more background subtraction datasets to support in-depth research.