Incorporating Deep Background Prior Into Model-Based Method for Unsupervised Moving Vehicle Detection in Satellite Videos

Background reconstruction is a key step of moving object detection in satellite videos. Most existing model-based methods exploit a low-rank prior to recover the background, which achieves good performance but degrades under complex and dynamic scenes. In this article, we introduce a deep background prior into model-based methods for moving vehicle detection in satellite videos. Our deep background prior is obtained by a background reconstruction network, which learns to reconstruct the background from consecutive frames. By incorporating our deep background prior into model-based methods, a closed-form solution can be obtained via the alternating direction method of multipliers (ADMM), and detection results can then be acquired through iterative optimization. More importantly, our background reconstruction network can be trained in an unsupervised way by introducing a specifically designed loss, thus relieving the dependence on large-scale labeled datasets. Extensive experimental results demonstrate the efficiency and effectiveness of the proposed method.


I. INTRODUCTION
WITH the development of remote sensing technology in recent years, video surveillance from satellites has become an effective way for many applications, such as urban monitoring [1], resource exploration [2], and traffic condition monitoring [3], [4]. For these applications, moving object detection (MOD) plays a fundamental role to locate objects of interest. However, MOD, especially moving vehicle detection in satellite videos, is extremely challenging due to the following aspects.
1) Small Object Sizes: Due to the low spatial resolution (e.g., the ground sampling distance (GSD) of Jilin-1 is around 1 m), moving vehicles captured in satellite videos are small, usually smaller than 5 × 5 pixels, leading to a lack of appearance and texture information.
2) Low Contrast to the Complex and Dynamic Background: In complex and dynamic scenes, moving vehicles sometimes have low contrast to the background and are difficult to distinguish from background clutter.
3) Insufficient Labeled Data: Although cameras on satellites can provide continuous observation of the Earth and obtain a large number of satellite videos, collecting and manually annotating large datasets for MOD requires considerable time, effort, and cost, thus hindering research progress.
Owing to their flexibility, model-based methods [3], [5], [6], [7], [8], [9], [10] have been widely investigated for MOD in satellite videos, in which background subtraction plays an important role in segmenting a target from its adjacent pixels. However, existing methods [6], [7], [8] generally adopt handcrafted priors (e.g., low-rank prior [6], [7], [8]) with regularization terms and hand-tuned parameters to model the background. When dealing with complex scenes, these methods cannot accurately reconstruct the background, thus limiting the detection performance. Moreover, due to the introduction of various regularization terms, most of these methods are computationally expensive as they need iterative optimization.
Recently, deep neural networks have demonstrated their effectiveness in serving as implicit image priors and have achieved remarkable performance in various fields, such as image deblurring [11], single image super-resolution (SISR) [12], and color image demosaicing [13]. Inspired by these methods, in this article, we utilize deep networks to extract implicit background information (which is called deep background prior in the following text) to accurately reconstruct the background for moving vehicle detection. Specifically, we design a U-shape network to obtain the deep background prior, which is incorporated into model-based methods. The whole framework can be solved by the alternating direction method of multipliers (ADMM) [14] to get the closed-form solution. Based on the closed-form solution, the detection results can be obtained via iterative optimization.
It is worth noting that our background prior can be learned in an unsupervised manner by training the U-shape network with a specifically designed loss function, thus eliminating the reliance on large-scale annotated datasets.
The main contributions of this article are summarized as follows.
1) We incorporate deep background prior into the model-based method for moving vehicle detection in satellite videos, which combines the advantages of both model- and learning-based methods. To the best of our knowledge, we are the first to propose such a framework for MOD in satellite videos.
2) We design a background reconstruction network to recover background from multiple frames in an unsupervised way. A loss objective is designed to guide the network to reconstruct the background without groundtruth labels.
3) With the help of the incorporated deep background prior, our method achieves superior detection performance with significant acceleration compared to state-of-the-art model-based methods.
The rest of this article is organized as follows. Some related works are briefly reviewed in Section II. The notations and preliminaries are described in Section III. The proposed framework is illustrated in Section IV. Section V presents the experimental setup and results in detail, and Section VI concludes this article.

II. RELATED WORK
MOD in satellite videos is an emerging field owing to the availability of satellite videos provided by launched satellites, such as Jilin-1 and Skybox. In this section, we briefly review the major works on model- and learning-based methods for MOD in satellite videos. In addition, we introduce the highly related works on model-based methods with deep image priors and unsupervised MOD.

A. Model-Based Methods for MOD in Satellite Videos
Frame differencing-based methods [15], [16], [17] detect moving objects by computing the differences between adjacent frames and then perform segmentation to get the detection results. A variety of two- and three-frame differencing methods have been proposed. However, their detection performance is degraded by sudden changes in dynamic backgrounds.
Typical background subtraction-based methods [3], [18] first estimate the background by different filters (e.g., mean or median filters) and then get the detection results by subtracting the estimated background from each frame. The relatively simple estimation of background makes these methods suffer performance degradation under complex scenes, resulting in a high false alarm rate.
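As a concrete illustration of the filter-based background estimation described above, the following minimal sketch estimates the background with a temporal median filter and thresholds the residual. The function names, the flat-list frame layout, and the threshold value are illustrative assumptions and are not taken from [3] or [18].

```python
from statistics import median


def temporal_median_background(frames):
    """Per-pixel median over time; frames is a list of equally sized flat pixel lists."""
    return [median(pixel_values) for pixel_values in zip(*frames)]


def subtract_background(frame, background, threshold=0.2):
    """Binary mask: 1 where |frame - background| exceeds the (assumed) threshold."""
    return [1 if abs(p - b) > threshold else 0 for p, b in zip(frame, background)]


# Toy clip: constant background of 0.5 with a bright one-pixel "vehicle"
# that moves one pixel to the right in every frame.
frames = [[0.5] * 5 for _ in range(5)]
for t in range(5):
    frames[t][t] = 0.9

background = temporal_median_background(frames)
mask = subtract_background(frames[2], background)
```

Because each pixel is covered by the moving target in only a minority of frames, the median recovers the background at that pixel. A slow or stopped vehicle violates this assumption, which is one reason such simple estimators struggle in complex scenes.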
Another line of background subtraction-based methods exploits robust principal component analysis (RPCA) techniques to detect moving objects. These RPCA-based methods [6], [7], [19], [20], [21] assume that the image from satellite videos is a summation of background, target, and noise, and employ different regularization on each component. The detection results can be obtained by acquiring closed-form solutions and applying iterative optimization [6], [7], [21]. However, these RPCA-based methods usually employ sophisticated and handcrafted regularization terms to tackle complex scenes, which increases the computational complexity and slows down the iterative process. Moreover, when dealing with complex scenes, these methods cannot ensure the quality of the recovered background, thus degrading the detection performance.

B. Learning-Based Methods for MOD in Satellite Videos
Before the deep learning era, feature extractors and descriptors were widely used for many tasks, such as object detection [22] and image matching [23]. With powerful feature modeling capacities, deep learning has been successfully applied in object detection for natural images [24], [25], [26], [27], [28], [29], [30] and achieved promising performance. For example, Bayraktar et al. [30] proposed a framework consisting of basic image preprocessing techniques, geometrical operations, and deep neural networks to improve the performance of ornamental plant detection and counting from onboard UAV cameras. However, these object detection methods mainly rely on appearance information to detect objects. When dealing with moving objects in satellite videos with limited appearance and texture information, these methods suffer significant performance degradation [31], [32]. For MOD in satellite videos, the spatiotemporal information is of great importance. Therefore, existing learning-based methods usually design suitable network architectures for the extraction of spatiotemporal information.
LaLonde et al. [31] proposed a two-stage network named ClusterNet to extract spatiotemporal information from consecutive airborne images to detect moving objects. Xiao et al. [32] proposed a two-stream detection network called DSFNet to incorporate the static context information and the dynamic motion cues for MOD in satellite videos. Although the learning-based methods have achieved promising performance, their performance relies heavily on large-scale labeled data. However, due to the extremely small object size and complex backgrounds, annotating moving objects in satellite videos is labor-intensive and time-consuming. In this article, we explore an unsupervised method for MOD in satellite videos to fully use a large amount of unlabeled data to relieve the labeling burden, which is more practical in real scenes.

C. Model-Based Methods With Deep Image Priors
Unlike traditional model-based methods that require explicit and handcrafted image priors, model-based methods with deep priors [33], [34], [35] can incorporate implicit deep priors from deep CNN networks for image restoration [36]. Tirer and Giryes [37] utilized the plug-and-play framework with IRCNN [38] denoiser to tackle SISR. Li and Wu [39] introduced IRCNN denoiser into a model-based method to solve depth image inpainting. Zhang et al. [13] modulated the deep denoiser prior into traditional model-based methods to solve various image restoration problems.
Inspired by these works, we introduce an implicit deep background prior into model-based methods to generate more accurate backgrounds, which further improves the effectiveness of MOD in satellite videos. Different from the aforementioned supervised image restoration methods, we develop a background reconstruction network to obtain deep background prior in an unsupervised manner.

D. Unsupervised Moving Object Detection
Unsupervised MOD aims to perform detection without any handcrafted annotation. Recently, many unsupervised MOD methods have been proposed for natural images [40], [41], [42], [43]. Specifically, Sultana et al. [40] proposed a GAN-based moving object detector to estimate the background and employed differencing and segmentation to generate detection results. Yun et al. [42] proposed an unsupervised MOD method for pan-tilt-zoom (PTZ) cameras. They designed two background models for large and small changes, and incorporated the results from both models to get the moving objects. Bao et al. [43] modified the SlotAttention [44] framework to detect moving objects and utilized pseudo-ground truth generated by a motion segmentation method as supervision. However, the aforementioned methods are designed for general objects in natural images, where the object contains abundant appearance and texture information. They tend to suffer significant performance degradation on MOD in satellite videos due to the small sizes and low contrast to background clutters.
To alleviate the annotation burden of moving vehicles in satellite videos, Zhang et al. [45] proposed a weakly supervised method to detect moving objects in satellite videos. They first generated pixelwise pseudolabels from the traditional RPCA-based method E-LSD [6] and then utilized the pseudolabels to train an encoder-decoder network to segment moving objects. Due to the inaccuracy of the generated pseudolabels, the method in [45] performed worse than E-LSD [6].
In the field of MOD in satellite videos, unsupervised learning has not been discussed yet. In this article, we propose the first unsupervised learning method for moving vehicle detection in satellite videos.

A. Formulation of MOD in Satellite Videos
Generally, the problem of MOD in satellite videos can be formulated as follows:
$$f_D = f_B + f_T + f_N \tag{1}$$
where $f_D$, $f_B$, $f_T$, and $f_N$ represent the original image, the background image, the target image, and the noise image, respectively. Compared with matrix-based methods [6], [7], the low-rank tensor decomposition method [8] can obtain good detection performance due to the preservation of the spatiotemporal structure. Therefore, this article uses the low-rank tensor decomposition method as the basic framework. Consequently, the model in (1) can be rewritten into the tensor form as follows:
$$\mathcal{D} = \mathcal{B} + \mathcal{T} + \mathcal{N} \tag{2}$$
where $\mathcal{D}, \mathcal{B}, \mathcal{T}, \mathcal{N} \in \mathbb{R}^{n_L \times H \times W}$ represent the original patch tensor, the background patch tensor, the target patch tensor, and the noise patch tensor, respectively. The detection results (i.e., target image) can be obtained by fetching out the slices of $\mathcal{T}$.

B. Low-Rank and Sparse Component Decomposition Model
The background regions are generally assumed to change slowly over a period of time, and there are a lot of overlapped regions among different frames. Therefore, the background patch tensor $\mathcal{B}$ conforms to the low-rank property [6] with suitable video length, which can be described as
$$\operatorname{rank}(\mathcal{B}) \le r \tag{3}$$
where $r$ is a constant and $\operatorname{rank}(\cdot)$ represents the rank of a tensor. The target patch tensor $\mathcal{T}$ conforms to the sparsity prior, which can be depicted as
$$\|\mathcal{T}\|_0 \le d \tag{4}$$
where $d$ is an integer that is related to the target characteristic and satisfies $d \ll W \times H$. The noise $\mathcal{N}$ is usually modeled as additive white Gaussian noise, and it satisfies the following:
$$\|\mathcal{N}\|_F \le \sigma \tag{5}$$
where $\|\cdot\|_F$ represents the Frobenius norm of a tensor and $\sigma > 0$ denotes the Gaussian noise level. Generally, the low-rank tensor-based framework for MOD in satellite videos can be obtained by replacing $\|\mathcal{T}\|_0$ with $\|\mathcal{T}\|_1$ [46]. Therefore, the low-rank and sparse component decomposition model can be formulated as
$$\min_{\mathcal{B},\mathcal{T},\mathcal{N}} \|\mathcal{B}\|_* + \lambda\|\mathcal{T}\|_1 + \beta\|\mathcal{N}\|_F^2 \quad \text{s.t.} \quad \mathcal{D} = \mathcal{B} + \mathcal{T} + \mathcal{N} \tag{6}$$
where $\lambda$ and $\beta$ denote the weights for the target and noise components, respectively. $\|\cdot\|_*$ represents the nuclear norm, which is a convex surrogate of $\operatorname{rank}(\mathcal{B})$.

IV. PROPOSED FRAMEWORK
Previous model-based methods [6], [7], [8] usually employ an explicit image prior (e.g., the low-rank prior) as a regularization term (e.g., the nuclear norm) to recover the background. Despite achieving promising performance, these methods cannot handle complex scenes well because the quality of the reconstructed background is limited. To address this issue, we introduce the implicit deep background prior into the model-based method, which can be obtained by a deep background reconstruction network. In this section, we introduce the proposed framework with deep background prior, which is shown in Fig. 1(a). In the following, we first present a model-based method with deep background prior in Section IV-A. Then, the solving process of the proposed framework is illustrated in Section IV-B. Finally, the unsupervised background reconstruction network is introduced in Section IV-C.

A. Model-Based Method With Deep Background Prior
The core idea of our proposed framework is to incorporate deep background prior into the model-based method. Therefore, we introduce the deep background prior into (6) and remove the handcrafted nuclear norm on the background. The formulated model is given in (7), where $f_\theta(\mathcal{D})$ denotes the deep background prior recovered from the input image by the background reconstruction network $f_\theta(\cdot)$, and $\lambda$ and $\beta$ denote the positive regularization parameters. Note that one can replace $f_\theta(\cdot)$ with any designed background reconstruction network. Therefore, our proposed framework not only retains the flexibility of the model-based method but also leverages the powerful modeling ability of deep neural networks.
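One way to write the model in (7) that is consistent with the description above is sketched here. It is a reconstruction under assumptions rather than a verbatim equation: it assumes a squared Frobenius norm on the noise term and an explicit equality constraint tying the background to the network output, the latter suggested by the two Lagrangian multipliers introduced in Section IV-B.

```latex
\min_{\mathcal{B},\mathcal{T},\mathcal{N}} \;
  \lambda \|\mathcal{T}\|_1 + \beta \|\mathcal{N}\|_F^2
\quad \text{s.t.} \quad
  \mathcal{D} = \mathcal{B} + \mathcal{T} + \mathcal{N}, \qquad
  \mathcal{B} = f_\theta(\mathcal{D})
```

Under this reading, the nuclear norm no longer appears in the objective, so no singular value decomposition is needed when the background is updated.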

B. Solving the Proposed Method
The problem in (7) can be rewritten by the inexact augmented Lagrangian multiplier (IALM) [47] approach as the augmented Lagrangian function in (8), where $y_1$ and $y_2$ represent Lagrangian multipliers and $\mu$ is a positive penalty scalar. Since it is hard to optimize all these variables concurrently, we approximately solve this optimization problem by alternately solving for one variable with the others being fixed. Thus, we apply the ADMM [14] approach to decompose (8) into three optimization subproblems about $\mathcal{B}$, $\mathcal{T}$, and $\mathcal{N}$, and then alternately solve these variables. The details are given as follows.
1) Updating $\mathcal{B}$ with other variables being fixed: this gives subproblem (9), whose solution is given in (10).
2) Updating $\mathcal{T}$ with other variables being fixed: this gives subproblem (11). Equation (11) can be solved by performing the elementwise shrinkage operation [48], as in (12), where Th(·) denotes the elementwise shrinkage operator and $\mu_k$ is the positive penalty scalar for the $k$th iteration.
3) Updating $\mathcal{N}^{k+1}$ with other variables being fixed: this gives subproblem (13), whose solution is given in (14).
4) Updating multipliers $y_1$, $y_2$ with other variables being fixed.
5) Updating the positive penalty scalar $\mu_{k+1}$.
Finally, the proposed method is summarized in Algorithm 1.

Algorithm 1 Proposed Algorithm
Input: image sequence, parameters $\lambda$, $\beta$, $\mu > 0$
Initialize: Transform the image sequence with length $n_L$ into the original tensor $\mathcal{D}$, $\mathcal{B}^0$
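The five update steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the published algorithm: it assumes the model min λ‖T‖₁ + β‖N‖²_F subject to D = B + T + N and B = f_θ(D), derives the closed-form updates from the corresponding augmented Lagrangian, and replaces the network output by a given array `P`. The function names, the flat-list data layout, and all default parameter values (including the growth factor `rho` and the cap `mu_max`) are hypothetical.

```python
def soft_threshold(x, tau):
    """Elementwise shrinkage: sign(x) * max(|x| - tau, 0)."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def admm_detect(D, P, lam=0.05, beta=1.0, mu=1.0, rho=1.2, mu_max=1e6, iters=60):
    """Alternating updates for the assumed model
        min lam*|T|_1 + beta*|N|_F^2   s.t.  D = B + T + N,  B = P,
    where P stands in for the network output f_theta(D).
    D and P are flat lists of pixel values; all defaults are illustrative."""
    n = len(D)
    B, T, N = [0.0] * n, [0.0] * n, [0.0] * n
    y1, y2 = [0.0] * n, [0.0] * n  # multipliers for the two constraints
    for _ in range(iters):
        # B-step: quadratic subproblem with a closed-form minimizer.
        B = [(D[i] - T[i] - N[i] + P[i]) / 2 + (y1[i] + y2[i]) / (2 * mu)
             for i in range(n)]
        # T-step: L1-regularized least squares, solved by shrinkage.
        T = [soft_threshold(D[i] - B[i] - N[i] + y1[i] / mu, lam / mu)
             for i in range(n)]
        # N-step: ridge-type subproblem with a closed-form minimizer.
        N = [(mu * (D[i] - B[i] - T[i]) + y1[i]) / (2 * beta + mu)
             for i in range(n)]
        # Multiplier ascent on both constraint residuals.
        y1 = [y1[i] + mu * (D[i] - B[i] - T[i] - N[i]) for i in range(n)]
        y2 = [y2[i] + mu * (P[i] - B[i]) for i in range(n)]
        # Grow the penalty scalar, capped at mu_max.
        mu = min(rho * mu, mu_max)
    return B, T, N


# Toy example: flat background of 0.5 with one bright target pixel.
D = [0.5, 0.5, 0.9, 0.5]
P = [0.5, 0.5, 0.5, 0.5]  # pretend the network recovered the background exactly
B, T, N = admm_detect(D, P)
```

In this toy run, the target component is nonzero only at the bright pixel, while the background stays at the value supplied by `P`. Every step is an elementwise operation, which is what makes this kind of update easy to run on a GPU once the nuclear norm is removed.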

C. Unsupervised Background Reconstruction Network
We build a background reconstruction network to recover the background, which serves as the implicit deep background prior. Due to the lack of ground-truth background and the difficulty in acquiring such labels in real scenes, we propose to train the background reconstruction network in an unsupervised manner. In the following parts, we introduce the architecture of the proposed background reconstruction network in Section IV-C1, the merge block in Section IV-C2, and the specifically designed loss in Section IV-C3.
1) Network Architecture: We design a U-shape network [49] to recover the background from consecutive frames in satellite videos, which consists of an encoder for feature extraction, a decoder for feature reconstruction, and skip connections for feature propagation. The encoder and the decoder are composed of several convolution blocks (each convolution block consists of two Conv-BN-ReLU layers) and downsampling or upsampling operations. A merge block is added to the skip connection to aggregate the temporal information, which can propagate the spatiotemporal information from the encoder side to the decoder side. The network architecture is shown in Fig. 1(b). Specifically, a video clip $V^t$ with $n$ frames $I^{t+\tau}$ ($\tau \in [-r, r]$, $r = \lfloor n/2 \rfloor$) is first fed into a 2-D convolutional layer to generate the initial feature map $F_0^t \in \mathbb{R}^{bn \times c_0 \times H \times W}$, where $b$ denotes the batch size. After that, the initial feature map is processed by the encoder to generate multilevel feature maps, resulting in $F_i^t \in \mathbb{R}^{bn \times c_i \times (H/2^{i-1}) \times (W/2^{i-1})}$ for the $i$th convolution block. Next, the generated multilevel feature maps are processed by merge blocks, which fuse the spatial and temporal information, resulting in $G_i^t \in \mathbb{R}^{b \times c_i \times (H/2^{i-1}) \times (W/2^{i-1})}$. Then, the fused multilevel feature maps are sent to the decoder, which recovers the resolution of the feature map. Finally, the resulting feature map from the decoder is processed by a 2-D convolution to get the reconstructed background $\hat{B}^t$.
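The tensor shapes in this pipeline can be checked with a small bookkeeping sketch. It assumes that the encoder keeps the n frames folded into the batch axis at every level (as for the initial feature map) and that each level halves the spatial resolution. The channel widths below are hypothetical, since the text does not list them; the batch size, frame count, and crop size are those reported in Section V-A.

```python
def clip_offsets(n):
    """Temporal offsets tau in [-r, r] with r = floor(n / 2)."""
    r = n // 2
    return list(range(-r, r + 1))


def feature_shapes(b, n, channels, height, width):
    """Return a list of (encoder_shape, merged_shape) for levels i = 1..len(channels)."""
    shapes = []
    for i, c in enumerate(channels, start=1):
        scale = 2 ** (i - 1)
        h, w = height // scale, width // scale
        encoder = (b * n, c, h, w)  # frames folded into the batch axis
        merged = (b, c, h, w)       # temporal axis reduced by the merge block
        shapes.append((encoder, merged))
    return shapes


# Batch size 10, seven frames, 256 x 256 crops; channel widths are assumed.
shapes = feature_shapes(b=10, n=7, channels=[32, 64, 128, 256, 512],
                        height=256, width=256)
offsets = clip_offsets(7)
```

With these settings, the deepest of the five levels works on 16 × 16 feature maps, and the merge block reduces the leading dimension from 70 to 10 before the features reach the decoder.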
2) Merge Block: To aggregate the spatiotemporal information of the feature maps generated from the video clip, we build a merge block into each skip connection, which propagates spatiotemporal information from the encoder to the decoder. Since the background region overlaps among adjacent frames, merging the spatiotemporal information from multiple frames and reducing the temporal dimension can help to obtain the deep background prior. The merge block consists of a 3-D convolution block and a temporal average-pooling operation. To reduce the computational cost of the 3-D convolution block, we decompose the 3-D convolution with a kernel size of k × k × k into a spatial convolution with a kernel size of 1 × k × k and a temporal convolution with a kernel size of k × 1 × 1. Each convolution in the decomposed 3-D convolution block is followed by a batch normalization and a ReLU. The decomposed 3-D convolution is followed by the temporal average-pooling operation, which reduces the temporal dimension and fuses background information from multiple frames. Through the merge block, multiframe background information can be extracted and fused for background reconstruction.
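The saving from decomposing the 3-D convolution can be made concrete by counting weights. The sketch below is a back-of-the-envelope illustration under assumptions: bias and batch-normalization parameters are ignored, and the intermediate channel count of the decomposed block is assumed to equal the output channel count. The function names are hypothetical.

```python
def conv3d_params(c_in, c_out, k):
    """Weights of a full k x k x k 3-D convolution (bias ignored)."""
    return c_in * c_out * k ** 3


def decomposed_conv3d_params(c_in, c_out, k):
    """Weights of a 1 x k x k spatial conv followed by a k x 1 x 1 temporal conv."""
    spatial = c_in * c_out * k * k  # 1 x k x k
    temporal = c_out * c_out * k    # k x 1 x 1 (assumes c_out intermediate channels)
    return spatial + temporal


def temporal_average_pool(features):
    """Average a list of per-frame flat feature lists over the temporal axis."""
    n = len(features)
    return [sum(values) / n for values in zip(*features)]


full = conv3d_params(64, 64, 3)
split = decomposed_conv3d_params(64, 64, 3)
pooled = temporal_average_pool([[1.0, 2.0], [3.0, 4.0]])
```

For 64 input and output channels with k = 3, the full kernel has 110592 weights and the decomposed pair has 49152, which is less than half. The extra BN and ReLU between the two convolutions also adds a nonlinearity that a single 3-D convolution does not have.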
3) Objective Loss Function: A straightforward way to train the background reconstruction network is to use clean backgrounds as supervision. However, in practice, it is difficult to generate backgrounds as supervision from natural images. Therefore, in this article, we design a loss function to guide the network to reconstruct the background in an unsupervised manner.
Since the image can be intuitively separated into background and target regions, we can use different strategies to deal with these regions when computing the loss. For the background region, it is better to make the reconstructed results approximate the original images. In contrast, for the target areas, it is better to make the reconstructed results approximate the adjacent background area instead of the original target pixels. Based on these motivations, we separate the reconstructed background into two disjoint subsets (i.e., target region and background region) and employ different supervisions to compute the loss of these two regions. For the background region, we use the original input image as supervision. For the target region, to alleviate the influence of target pixels, we utilize the temporal median filtered image as supervision since targets are moving, and temporal median filtering can filter out most target pixels to reduce their influence in the target region. Therefore, the loss objective consists of two parts, the background region-related loss $L_{back}$ and the target region-related loss $L_{tar}$. In their definitions, ⊙ represents elementwise multiplication, and $M^t$ represents the generated binary mask of the target areas, with 1 denoting the target region and 0 indicating the background region. $I_m$ represents the temporal median filtered image of the input video clip. To obtain the target region mask, we first feed the reconstructed background and the input images to the iterative optimization to generate detection results and then apply segmentation to get the target mask. The background region-related loss $L_{back}$ and the target region-related loss $L_{tar}$ work jointly to guide the network to reconstruct the background, and the total loss objective combines the two. Since the designed loss objective is label-independent, the proposed method can alleviate the dependence on large-scale labeled data.
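The two-region supervision can be illustrated with a short sketch. The norm and the weighting of the two terms are assumptions made only for illustration: the sketch uses a mean absolute error for both regions and sums the two terms with equal weight. The function names and the flat-list image layout are hypothetical.

```python
def masked_l1(pred, target, mask):
    """Mean over all pixels of mask * |pred - target|."""
    total = sum(m * abs(p - t) for p, t, m in zip(pred, target, mask))
    return total / len(pred)


def background_loss(pred_bg, frame, median_frame, target_mask):
    """Return (total, l_back, l_tar) for one reconstructed background.

    l_back compares the prediction with the input frame outside the target mask;
    l_tar compares it with the temporal median image inside the target mask."""
    inv_mask = [1 - m for m in target_mask]
    l_back = masked_l1(pred_bg, frame, inv_mask)
    l_tar = masked_l1(pred_bg, median_frame, target_mask)
    return l_back + l_tar, l_back, l_tar


frame = [0.5, 0.9, 0.5, 0.5]          # input frame with one bright target pixel
median_frame = [0.5, 0.5, 0.5, 0.5]   # temporal median of the clip
target_mask = [0, 1, 0, 0]            # 1 marks the target region

good = background_loss(median_frame, frame, median_frame, target_mask)
bad = background_loss(frame, frame, median_frame, target_mask)
```

A prediction that copies the input frame is penalized only inside the target mask, so the network is pushed to replace target pixels with background while leaving the rest of the image unchanged.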

V. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we conduct extensive experiments to evaluate the detection performance of the proposed framework on the dataset collected from Jilin-1 satellite [10].

A. Dataset Description and Experimental Details
The detection performance of the proposed method is evaluated on satellite videos from the Jilin-1 satellite. The GSD of the dataset is around 1 m, and the frame rate is 10 frames per second. The moving vehicles in the dataset are labeled by bounding boxes as the ground truth. The videos in the dataset contain complex and dynamic backgrounds, which are challenging for MOD.
For the background reconstruction network, we used seven consecutive frames with a frame interval of 3 as input to the network. The batch size was set to 10 with a random crop image patch size of 256 × 256. We trained our network using the Adam optimizer [50] for 100 epochs with a learning rate of $1 \times 10^{-4}$. All the models were implemented with PyTorch on one NVIDIA RTX 3090 Ti GPU.

B. Evaluation Criteria
To make a fair comparison with other methods, we follow [7], [10], and [31] to use precision, recall, and F1 score as the evaluation metrics, which are defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP, FN, and FP represent the number of true positives (correct detections), false negatives (missed targets), and false positives (false alarms), respectively. Specifically, the precision metric measures the fraction of detections that are TPs, and the recall metric indicates the fraction of positives that are correctly identified. The F1 measure is a combination of precision and recall, and is a more reliable and comprehensive evaluation metric. It is worth noting that, although IoU is widely used for the performance evaluation of generic object detection [24], [25], [26], [27], [28], [29], it is not suitable for the evaluation of extremely small objects in satellite videos. Due to the small size of moving targets in satellite videos, tiny shifts of the predicted bounding box will cause a large fluctuation in the IoU value. Therefore, in this article, we follow [31] to consider a predicted bounding box as a TP if the distance between the center of this bounding box and the ground-truth one is smaller than a predefined threshold. In this article, we set the distance threshold to 5 pixels, which represents around 5 m considering the GSD of the Jilin-1 satellite.
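The center-distance criterion and the three metrics can be sketched as follows. The matching strategy is an assumption: the text only fixes the distance threshold, so this sketch uses a simple greedy one-to-one assignment in which each prediction claims its nearest unmatched ground-truth center. The function names are hypothetical.

```python
from math import hypot


def match_detections(pred_centers, gt_centers, max_dist=5.0):
    """Greedy one-to-one matching; a prediction is a TP if an unmatched
    ground-truth center lies strictly closer than max_dist pixels."""
    unmatched_gt = list(gt_centers)
    tp = 0
    for px, py in pred_centers:
        best, best_dist = None, max_dist
        for gx, gy in unmatched_gt:
            dist = hypot(px - gx, py - gy)
            if dist < best_dist:
                best, best_dist = (gx, gy), dist
        if best is not None:
            unmatched_gt.remove(best)
            tp += 1
    fp = len(pred_centers) - tp
    fn = len(unmatched_gt)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


preds = [(10, 10), (50, 50), (100, 100)]
gts = [(12, 11), (52, 50), (200, 200)]
tp, fp, fn = match_detections(preds, gts)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

In this example, two predictions fall within 5 pixels of a ground-truth center, one is a false alarm, and one ground-truth vehicle is missed, so precision, recall, and F1 all equal 2/3.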
1) Quantitative Results: The quantitative results are shown in Table I. It can be observed that, compared with the model-based methods, our framework achieves higher average recall, precision, and F1 score, outperforming the second best model-based method MMB [10] by 1.4 in terms of F1 score. That is because our method introduces deep background prior into model-based iterative optimization, which can recover the background more accurately, thus achieving superior performance. Compared with the deep learning-based method SAHI [29], our method achieves superior performance. That is because SAHI [29] is designed for generic small object detection in a single image and suffers significant performance degradation when applied to extremely small moving objects in satellite videos. Note that our method achieves the best average precision, outperforming the second best method DSFNet [32] by 1.3 in terms of precision rate, which means that our method can improve the detection performance with reduced false alarms due to the accurately reconstructed background. Moreover, although our framework performs inferior to DSFNet [32] (80.9 versus 85.5 in terms of F1 score), our method can detect moving objects in an unsupervised way, which relieves the dependence on large-scale datasets with a labor-intensive and time-consuming annotation process.
2) Time Efficiency Analyses: To compare the efficiency of different methods, we record the average time cost (s) of different methods on an input image with a size of 1024 × 1024. The results are listed in Table I. It can be observed that, compared with LRSD-based methods (i.e., GoDec [19], DECOLOR [5], E-LSD [6], and B-MCMD [7]), our method is faster and achieves a higher F1 score. Compared with the LRSD-based method GoDec [19], our method can achieve nearly 10× acceleration. That is because our method substitutes the low-rank regularization term with deep background prior, which removes the computational burden of the low-rank regularization term. Moreover, due to the removal of the nuclear norm regularization term on the background, our method can exploit CUDA acceleration techniques, which further speeds up the detection process. In addition, compared with the fastest deep learning-based method SAHI [29] and the second fastest frame differencing-based method D&T [9], our method runs relatively slowly but achieves superior detection performance, which demonstrates the effectiveness of our method.
3) Qualitative Results: Qualitative results of different methods are shown in Figs. 2 and 3. It can be observed that, compared to the complex backgrounds, moving vehicles occupy only a few pixels, and there are many distractors in the surroundings. Compared with the state-of-the-art model-based methods, our method can produce more reliable detection results with fewer false alarms (as can be seen from the numbers of the TP, FP, and FN), which demonstrates the superiority of our method in tackling challenging scenes. It can also be observed that the existing model-based methods exhibit many false alarms on stationary background objects (e.g., residential area of video 7 in Fig. 3), while our method produces fewer false alarms on these objects. We attribute this to the accurately reconstructed background produced by our deep background reconstruction network.

D. Ablation Study
In this section, we conduct different ablation studies to investigate the design of our proposed framework.

1) Effectiveness of Background Reconstruction Network: To validate the effectiveness of our background reconstruction network, we replace the background reconstruction network with other reconstruction methods, including the spatial mean filter, the spatial median filter, the temporal mean filter, and the temporal median filter. The quantitative results are shown in Table II. It can be observed that our proposed method achieves the best F1 score and outperforms the second best background reconstruction method by 1.9 in terms of F1 score. The backgrounds reconstructed by different methods are shown in Fig. 4. It can be observed that our method can restore a cleaner background, which can be used to obtain better detection results. In contrast, other background reconstruction methods have target residuals in the target region and, thus, have inferior detection performance.
To accurately evaluate the background reconstruction capability of different methods, we add synthetic moving targets on the clean backgrounds and then apply different methods for background reconstruction. The reconstructed background is compared with the ground-truth clean background. Following [51], we use the PSNR calculated between the reconstructed background and the ground-truth one as the quantitative metric for reconstruction performance evaluation. We compare our proposed method with three RPCA-based methods, including DECOLOR [5], E-LSD [6], and B-MCMD [7]. The quantitative results are shown in Table III. It can be observed that our method achieves the best PSNR and F1 score, which demonstrates the effectiveness of our deep background prior. The qualitative results are shown in Fig. 5. It can be observed that our method can reconstruct a more accurate background (smaller errors between the generated background and the ground-truth one) and, thus, achieves better detection performance.
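For reference, PSNR between a reconstructed background and the clean ground truth can be computed as in the sketch below. The peak value of 1.0 assumes intensities normalized to [0, 1], which is an assumption about the data range rather than a detail given in the text, and the function name is hypothetical.

```python
from math import inf, log10


def psnr(reference, reconstructed, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return inf  # identical images
    return 10 * log10(peak ** 2 / mse)


clean = [0.0, 0.0, 0.0, 0.0]
noisy = [0.1, 0.1, 0.1, 0.1]
value = psnr(clean, noisy)
```

A uniform error of 0.1 on a unit-range image gives a mean squared error of 0.01 and therefore a PSNR of about 20 dB; target residuals left in the background lower the PSNR in the same way.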
2) Effectiveness of Merge Block: As a component of our background reconstruction network, the merge block can integrate the spatial and temporal information, and propagate the fused spatiotemporal information from the encoder to the decoder. Here, we investigate the merge block by introducing two variants, i.e., Block2D and Block3D. Block2D merges spatial and temporal information by first concatenating multiframe features along channel dimension and then performing a 2-D convolution with a kernel size of 3 × 3 (with BN and ReLU layers). Block3D integrates the spatiotemporal information explicitly by a 3-D convolution with BN, ReLU, and a temporal average-pooling layer.
The detection performance of different variants is shown in Table IV. It can be observed that our method achieves the best F1 score and outperforms Block2D by 2.9 in terms of F1 score. That is because Block2D utilizes 2-D convolution and, thus, cannot fully extract and fuse the spatial and temporal information. Moreover, compared with Block3D, our method reduces the processing time of a single image by 33% (0.48 s versus 0.72 s) and achieves a better F1 score. That is because the decomposed 3-D convolution in the merge block can not only reduce the computational cost but also introduce extra nonlinear operations to enhance the modeling ability of the network. In conclusion, our designed merge block can achieve higher accuracy and efficiency.
3) Effectiveness of the Iterative Optimization: To validate the effectiveness of the iterative optimization, we replace it with a direct frame differencing operation between the input image and the reconstructed background and segment detection results from the residual images. The quantitative results are shown in Table V. It can be observed that the iterative optimization achieves the best average F1 score and outperforms the frame differencing method by 2.9 in terms of F1 score. That is because the iterative optimization process further refines the detection results. Moreover, we investigate the convergence of the iterative optimization. Here, we study numerical convergence instead of analytical convergence since our method is a combination of deep learning and model-based approaches. Following [52], we use a criterion of the form (…) ≤ ζ to measure the convergence. Taking video 1 as an example, the convergence curve is shown in Fig. 6. It can be observed that the proposed method converges after about 40 iterations and then remains stable.
Fig. 5. Experimental results on the synthetic sequence. The first row illustrates the background reconstruction results obtained by different methods, and the zoomed-in area is utilized for a better illustration of details. The second row draws the detection results, and the light green, yellow, and red rectangles indicate the ground truth, correct detections, and false alarms, respectively. The third row shows the differencing heatmaps between the generated background and the ground-truth one, and lower errors indicate better reconstructed background quality.
4) Impact of Network Depth: We investigate the impact of network depth on detection performance. We set the number of convolution blocks in the encoder and decoder to 3, 4, 5, and 6. The results are shown in Table VI.
It can be observed that, when the network depth increases from 3 to 5, the detection performance improves, but at the cost of a higher computational burden and a longer processing time. When the network depth increases from 5 to 6, the average F1 score drops slightly. That is because a deeper network tends to overfit the limited training data, which degrades the performance. Therefore, we choose a five-layer U-shaped network as our reconstruction network.
5) Impact of Frame Number: Our background reconstruction network reconstructs the background from n consecutive frames. We evaluate the background reconstruction network with different frame numbers, i.e., n = 3, 5, 7, 9. The results are shown in Table VII. It can be observed that, when n increases from 3 to 7, the detection performance improves. That is because additional frames provide more information about the background, which is beneficial to background reconstruction. It is also notable that the detection performance saturates when the frame number increases from 7 to 9 (the average F1 score remains unchanged). That is because the spatial and temporal information provided by seven frames is already sufficient for background reconstruction, so adding more frames brings no performance improvement but adds extra computational burden. Therefore, we utilize seven frames as input to the proposed network.
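As a small illustration of feeding n consecutive frames to the network, the sketch below selects the frame indices for a given frame. The text states only that n consecutive frames are used; centering the window on the current frame and shifting it at the sequence boundaries are assumptions of this sketch, and `frame_window` is a hypothetical helper name.

```python
def frame_window(num_frames, center, n=7):
    """Return indices of n consecutive frames around `center`.

    The window is centered where possible and shifted (not padded) near the
    start or end of the sequence so that it always stays in range.
    """
    if n > num_frames:
        raise ValueError("window is longer than the sequence")
    if not 0 <= center < num_frames:
        raise ValueError("center index out of range")
    half = n // 2
    start = min(max(center - half, 0), num_frames - n)
    return list(range(start, start + n))


middle = frame_window(100, 50)      # window centered on frame 50
first = frame_window(100, 0)        # window shifted to start at frame 0
last = frame_window(100, 99, n=3)   # window shifted to end at frame 99
```

Shifting rather than padding keeps every input a real frame, at the cost of the current frame no longer being in the middle of the window near the sequence ends.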

E. Analyses of Loss Objective
To reconstruct the background in an unsupervised manner, we design a loss objective that adopts different strategies for different image regions. To verify the effectiveness of the proposed loss objective, we train our background reconstruction network under $L_{\text{back}}$, $L_{\text{tar}}$, and the combination of both losses, respectively. The quantitative results are shown in Table VIII. It can be observed that, with only $L_{\text{back}}$, the trained model suffers only a minor performance degradation (80.1 versus 80.9 in terms of F1 score). That is because, with the target regions ignored, the network cannot learn to reconstruct a fine-grained background. It can also be observed that, with only $L_{\text{tar}}$, the F1 score drops by nearly half compared with our proposed method. That is because the limited background information is insufficient to reconstruct the background. Thanks to the discriminative treatment of target and background areas, our method learns to reconstruct a fine-grained background and, thus, achieves higher performance.
To further investigate the effectiveness of our proposed loss objective, we visualize the reconstructed background and generated masks during training in Fig. 7. It can be observed that, with the increase in training epochs, the generated masks can cover more target regions, and the quality of the reconstructed background can be improved gradually. Since the quality of the reconstructed background is gradually improved, the detection performance increases with epochs and reaches saturation at around 100 epochs, as shown in Fig. 8.

F. Parameter Sensitivity Analyses
In this section, we conduct experiments to investigate the impact of two important parameters λ and β in the iterative optimization on the MOD performance.
1) Impact of λ: For conciseness, while keeping $\beta$ fixed to 100, we use various values of $\lambda_0$ to control the value of $\lambda$, where $\lambda = \lambda_0 / \sqrt{\max(H, W) \times n_L}$. The results are shown in Fig. 9(a). It can be observed that, when $\lambda_0$ increases from $10^{-6}$ to $10^{2}$, the detection performance remains unchanged. However, when $\lambda_0$ becomes too large, the sparsity of the target is overemphasized, leading to overshrinkage of the target and a dramatic drop in detection performance. Theoretically, when $\lambda_0$ approaches 0, the sparsity term is ignored, which should damage the detection performance. Nevertheless, it can be observed that our proposed method still achieves good performance when $\lambda_0$ approaches 0. We attribute this to the introduction of the deep background prior, which prevents the performance from dropping to 0 when $\lambda_0$ is too small.
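The scaling of λ and the overshrinkage effect can be illustrated numerically. The sketch below evaluates the formula above for hypothetical values of H, W, and n_L, and applies scalar soft-thresholding, the standard proximal operator of an ℓ1 sparsity term. Using λ directly as the threshold (ignoring any ADMM penalty scaling) is a simplifying assumption, and the exact update used in the proposed method is not reproduced here.

```python
import math


def sparsity_weight(lambda0, height, width, n_l):
    """lambda = lambda0 / sqrt(max(H, W) * n_L), as given in the text."""
    return lambda0 / math.sqrt(max(height, width) * n_l)


def soft_threshold(x, tau):
    """Shrink x toward zero by tau (proximal operator of tau * |x|)."""
    return math.copysign(max(abs(x) - tau, 0.0), x)


# Hypothetical sizes: 400 x 400 frames and n_L = 4, so sqrt(400 * 4) = 40.
H, W, N_L = 400, 400, 4
residual = 30.0  # hypothetical target response in the residual image

small = soft_threshold(residual, sparsity_weight(1.0, H, W, N_L))   # barely shrunk
large = soft_threshold(residual, sparsity_weight(1e4, H, W, N_L))   # shrunk to zero
```

With a small λ₀, the target response is almost untouched, whereas a very large λ₀ pushes the threshold above the response and removes the target entirely, which matches the overshrinkage behavior described above.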
2) Impact of β: While keeping $\lambda_0$ fixed to 1, we conduct experiments to verify the influence of $\beta$. The results are shown in Fig. 9(b). It can be observed that, when $\beta$ exceeds 10, the detection performance stabilizes. That is because, when $\beta$ is sufficiently large, the noise term $N$ tends to zero and has a negligible influence on the detection performance. Theoretically, when $\beta$ becomes too small, the noise term $N$ is penalized less, so more of the residual is absorbed into $N$, leading to significant performance degradation. However, in our method, when $\beta$ becomes very small, the detection performance remains at a certain level. We attribute this to the introduction of the recovered background, which tends to prevent $N$ from absorbing too much of the residual.

VI. CONCLUSION
In this article, we have introduced a deep background prior into the model-based method for MOD in satellite videos. The deep background prior is obtained by a background reconstruction network, which is trained in an unsupervised manner with the help of a specifically designed loss. By combining the learned deep background prior with the model-based iterative optimization, the proposed framework benefits from the best of both worlds. Extensive experiments have demonstrated the effectiveness and efficiency of the proposed framework.
It is worth noting that there remains room for further improvement. On the one hand, our deep background prior can be generated by any background reconstruction network, and the quality of the reconstructed background has a great impact on the detection performance. One possible direction is to design a more powerful background reconstruction network. On the other hand, the background reconstruction and the iterative optimization are two separate steps, and the parameters of the iterative optimization need to be tuned manually. One can explore how to make the parameters in the iterative optimization learnable and how to combine the deep background prior and the iterative optimization into an end-to-end network.
Ting Liu received the B.E. degree in electrical engineering and automation from the Hunan Institute of Engineering, Xiangtan, China, in 2017, and the M.E. degree in control engineering from Xiangtan University (XTU), Xiangtan, in 2020. She is currently pursuing the Ph.D. degree with the College of Electronic Science, National University of Defense Technology (NUDT), Changsha, China.
Her research interests focus on signal processing, target detection, and image processing.
She was a Senior Visiting Scholar with the University of Southampton, Southampton, U.K., in 2016. She is currently a Professor with the College of Electronic Science and Technology, NUDT. She has authored or coauthored over 100 journal and conference publications. Her research interests include signal processing and image processing.
Zhijie Chen received the M.S. and Ph.D. degrees from the Nanjing University of Science and Technology, Nanjing, China, in 1989 and 2006, respectively.
He is currently a Professor with the National Airspace Technology Key Laboratory. His research interest is air traffic control.
Dr. Chen is a member of the Chinese Academy of Engineering.