A Lightweight CNN Model for Refining Moving Vehicle Detection From Satellite Videos

With the recent development of earth observation technology, satellites have gained the ability to capture city-scale videos, enabling potential applications in intelligent traffic management. Because of the broad field-of-view, moving vehicles in satellite videos are usually composed of only tens of pixels, making it difficult to differentiate true objects from noise and other distractors. In addition, the edges of tall building tops are often mistakenly detected as moving vehicles because of motion parallax. This article proposes a terse framework that can effectively suppress false targets, achieving both high precision and high recall. The study involves three parts: 1) an adaptive filtering method is proposed to reduce noise, making the detection algorithm more reliable; 2) several background subtraction models are tested, and the best one is chosen to produce preliminary detection results with high recall but low precision; 3) a lightweight convolutional neural network (LCNN) is designed, trained on a small collection of samples, and then used to eliminate false targets. The experiments and evaluations demonstrate that our method can largely improve precision at the expense of only a slight reduction of recall.


I. INTRODUCTION
In recent years, commercial satellite technology has achieved significant progress in capturing high-resolution videos [1]. SkyBox Imaging (USA) launched SkySat-1 (in 2013) and SkySat-2 (in 2014), the first two small satellites that could capture still images with sub-meter resolution as well as high-definition (HD) videos. In 2015, Chang Guang Satellite Technology Co., Ltd. (China), also known as CGSTL, launched two video satellites, Jilin1-1 and Jilin1-2, which could acquire color videos. As of 2019, CGSTL had launched eight video satellites, which are now part of the Jilin-1 satellite constellation. Besides the above-mentioned commercial satellites, several other video satellites have also been launched into orbit, including the Lapan-TubSat series (developed by the National Institute of Aeronautics and Space, Indonesia, and the Technical University of Berlin) and TianTuo-2 (designed by the National University of Defense Technology, China).
Unlike conventional satellites, video satellites can stare at a specific area for a short period, enabling moving object monitoring and tracking. For example, the SkySat satellites can provide both mapping products and full HD video sequences of up to 90 seconds in length. Therefore, video satellites are playing an increasingly important role in modern city management. Compared to ground-based surveillance systems, satellite videos capture large-scale scenes, enabling many potential applications such as city-scale traffic surveillance, vehicle tracking, and traffic density monitoring [2]-[4].
Our study is closely related to video surveillance research, in which both motion detection and tracking have attracted considerable attention. With developments in hardware and software, detection and tracking algorithms keep improving, but they still struggle to provide accurate and reliable results in complicated scenarios [5]. Moving object detection is one of the most active research topics in computer vision, but most of the research focuses on ground-based camera videos. Even algorithms that perform well in surveillance systems are likely to fail on satellite videos, which is attributed to the following unique characteristics of satellite videos.

1) Small target size: Because of the broad field-of-view of video satellites, moving objects such as urban vehicles often consist of only a few to tens of pixels, which are hard to identify by conventional methods that perform well in ground surveillance systems. As shown in Figure 1 (b), most moving vehicles appear as spots without any texture; even larger vehicles, such as buses, do not exhibit more features. As a result, commonly used cues, such as shape and texture, are unavailable in satellite videos, making object recognition more difficult.

2) Low contrast to the background: Some moving vehicles, such as the one in the yellow circle O1 in Figure 1 (b), exhibit brightness similar to the background. They are difficult to detect by existing algorithms that rely on the brightness difference between targets and environment.

3) Cluttered background: The satellites move along their orbits while imaging, resulting in slowly moving backgrounds that introduce many false targets. In particular, point-shaped distractors on tall building tops are often mistakenly identified as targets. One fundamental difficulty is to differentiate true moving vehicles from false targets that exhibit the same sizes and similar behaviors as true vehicles.

For these reasons, existing methods that perform well in surveillance systems fail to detect such small objects in satellite videos. Some works have been done to adapt existing methods to small-object detection. Because video satellites joined the Earth Observing System less than ten years ago, only a few relevant publications are available. Teutsch and Krüger [6] developed a track-before-detect (TBD) approach combined with an AdaBoost classifier to detect vehicles from UAV videos. Kopsiaftis [2] proposed an adaptive background subtraction method to detect moving vehicles and estimate traffic density from SkySat satellite videos. Yang et al. [7] presented a saliency-based background model, combined with a dynamic scene motion heat map, to intensify the moving objects. Xu [1] proposed a global scene motion compensation and local dynamic updating method to reduce the high error rates caused by global scene motion and local pseudo-motion. Zhang et al. [8] utilized the known satellite attitude motion as well as the unknown object motion to extract space objects from TianTuo-2 satellite videos. Ao et al. [3] proposed a local exponential probability distribution noise model to differentiate potential targets from noise patterns. To track moving vehicles from satellite videos, Du et al. [9] proposed a multi-frame optical flow tracker.
Among the existing applications to satellite videos, some focus on improving background models [2], [7], some utilize motion information to improve detection results [1], [8], [9], some adopt classifiers to identify false objects [6], and others establish new noise models to suppress noise [3]. From the literature mentioned above, we can identify two main shortcomings of current works.

1) Although background subtraction models have been discussed extensively for ground surveillance systems, they exhibit different performances when applied to satellite videos. Some conclusions and evaluations drawn from traditional surveillance systems are no longer valid for satellite videos, and no assessment of these models has been reported for detecting small moving objects from satellite videos. Therefore, these methods should be re-evaluated for this particular task.

2) Deep learning (DL), as a prevailing method, has been widely used for object detection in natural pictures and high-resolution remotely sensed images, but it has not been discussed for detecting tiny, dim objects in satellite videos. Network design, training, and performance often vary with the data characteristics, and the corresponding investigation is lacking.

Our motivation stems from the gaps between the existing research and practical applications. Detecting moving objects involves two main problems: object localization and false target removal. The challenge is to eliminate false targets while keeping true ones. To this end, we propose a detection framework based on background subtraction modeling and a convolutional neural network (CNN). Our approach is based on the following two considerations.
1) Although the satellites keep moving along their orbits, the cameras aim at a fixed region during capture. Unlike UAV videos with fast-moving backgrounds, satellite videos have almost static backgrounds except for the apparent displacement of tall building tops. Thus, we can take advantage of background subtraction models, which have succeeded in ground surveillance systems, to obtain candidate object locations.

2) Deep learning has achieved great success in many applications, such as object detection and recognition. Therefore, we can design a CNN model to identify false targets among the preliminary results and thereby achieve high precision.

Unlike the latest deep learning networks, such as Faster R-CNN [10] and YOLO [11], our proposed network is not an end-to-end approach for vehicle detection but a combination of traditional background models and deep learning techniques. The above-mentioned end-to-end networks embed object localization in one framework by calculating region proposals, but the resulting locations are not accurate. This does not matter when detecting large objects in high-resolution images, because a displacement of a few pixels hardly affects the accuracy; for tiny moving vehicles, however, a displacement of several pixels may cause a complete detection failure. Besides, calculating region proposals is time-consuming, and the most serious difficulty is that the targets are so small that these networks often miss them entirely. Therefore, we decided to extract accurate object locations using background subtraction models and then identify false targets using a lightweight CNN model. Correspondingly, this article makes three contributions.
1) A local adaptive image filtering approach is proposed to suppress noise, making the preliminary detection results more reliable.

2) A comprehensive quantitative evaluation of several prevailing background subtraction models is conducted, providing guidance for selecting the best model for detecting small moving objects from satellite videos.

3) A lightweight CNN model is designed that improves precision without an apparent reduction of recall; details about the data preparation and training are also provided.
The proposed approach has several advantages. 1) Simplicity and ease of use: our framework is mainly a cascade of two well-developed techniques; the background subtraction model (BGM) is well established, and our proposed CNN can be trained on a small set of samples in minutes without a GPU. 2) Both high precision and high recall: the BGM ensures preliminary detection with a high detection ratio, and the CNN suppresses false objects, improving the precision of the final results.

The rest of this article is organized as follows. Section II covers existing works related to our study. Section III first gives an overview of the framework and then details the background subtraction model and the proposed CNN network. Section IV presents the experimental results and evaluations, and Section V concludes this research.

II. RELATED WORKS
This study relates mainly to two categories of existing works: motion detection (or moving object detection) and deep learning (DL).

A. MOTION DETECTION
Motion detection aims to detect moving objects from video sequences. Current motion detection methods fall mainly into three categories [5], [12]:

1) FRAME DIFFERENCE
The frame difference method, the most straightforward approach to motion detection, identifies moving objects by differencing two consecutive frames. It works well for real-time monitoring because of its low computational cost, but it often fails to obtain complete outlines of moving objects. It is also sensitive to noise and illumination changes, producing many false alarms. To compensate for these shortcomings, the method is often combined with other approaches.

2) OPTICAL FLOW
Optical flow [13]-[15] is defined as the apparent motion of individual pixels on the image plane. The optical flow method can obtain complete motion information without any prior knowledge of the scene, and it is applicable under both static and moving background conditions. Optical flow comes in two forms: (1) dense optical flow, which obtains motion information for all pixels in an image but requires heavy computation; and (2) sparse optical flow, which only estimates the motion of selected feature points (e.g., corners of an object) and runs much faster. Similar to the frame difference method, the computation of optical flow is susceptible to noise and illumination changes.

3) BACKGROUND SUBTRACTION MODEL
The background subtraction model (BGM) is a widely used approach for detecting moving objects in videos. This approach detects moving objects from the difference between the current frame and a reference frame, often called the background model or background image. To achieve this goal, the background model should represent the scene without any moving objects, and it should be updated regularly to adapt to varying luminance conditions. Several background subtraction models have been proposed and widely used in ground surveillance systems. Friedman and Russell [16] proposed a well-known background model, the Gaussian Mixture Model (GMM), which models each background pixel by mixing K (from 3 to 5) Gaussian distributions. In the following years, various works improved and extended the GMM model [17]-[21]. The GMM-based model has been widely used for foreground detection and is recognized as one of the best models for video surveillance systems [22]. However, GMM is sensitive to illumination changes and camera vibration. A more recently proposed background model, ViBe (Visual Background Extractor) [23], [24], is considered an improvement over GMM: it shortens modeling time and performs well under changing background conditions.
The three categories of methods are commonly used for moving object detection and counting in Intelligent Transport Systems (ITSs). However, they suffer from some inevitable problems when applied to particular scenarios such as satellite videos. In general, we need to select the most suitable method for the specific task or combine different methods to obtain better performance. One of our goals is to provide some findings by conducting a comprehensive evaluation of these commonly used models.

B. DEEP LEARNING
Deep learning (DL) has exhibited powerful capabilities in many research areas [25]-[27]. Some DL-based algorithms have also been proposed for moving object detection and tracking [28]. Since our study focuses on vehicle detection, we briefly review deep learning in this context. The existing studies can be grouped into several categories according to how the deep learning technique is applied.
The first category uses existing CNN networks to detect vehicles from videos. For example, Ouyang and Wang [29] applied the YOLOv3 framework to detect vehicle targets, obtaining higher accuracy than traditional vehicle detection methods. To obtain traffic flow information, Dai et al. [30] proposed a video-based vehicle counting framework that applies YOLO and RCNN to detect moving vehicles. These approaches take advantage of existing networks that have proven useful and powerful.
The second category redesigns networks to adapt to the data and specific scenarios. Some works address scale problems in vehicle detection. For example, a unified deep neural network [31], named the multi-scale CNN, was proposed to detect objects of different scales; Hu et al. [32] proposed a scale-insensitive convolutional neural network to overcome the scale sensitivity of CNN models in vehicle detection; and Wei et al. [33] offered three enhancements to a multi-scale CNN for visual detection in advanced driving assistance systems. For some particular tasks or scenarios, designing completely new networks is also required. For example, Gao et al. [34] proposed EOVNet (EO image-based vehicle detection network) to bridge the gap between deep learning research on object detection and vehicle detection in EO images, and Liu et al. [35] proposed a backward feature enhancement network to extract vehicles from complex traffic scenes.
The third category often applies CNN networks along with other methods or with auxiliary data. For example, He and Li [36] proposed a CNN-based algorithm combined with radar. To utilize orientation information in vehicle detection, Li et al. [37] proposed a rotatable region-based residual network (R3-Net), which demonstrated high precision and robustness. A convolutional neural network [38] was proposed to screen vehicles from candidates. Chen et al. [39] proposed a memory enhanced global-local aggregation network, combined with ResNet and Faster-RCNN, to detect objects from videos.
It is noted that the above-mentioned studies focus on ground-based surveillance systems and unmanned aerial vehicle videos. DL has exhibited its power in these applications, but its effectiveness in satellite videos has not been investigated. This study, therefore, tries to design a lightweight CNN network for detecting small vehicles from satellite videos and evaluates its performance.

III. METHODOLOGY

A. FRAMEWORK OVERVIEW
This article proposes a terse framework, which consists of three stages, as shown in Figure 2. The first stage is to pre-process the input video using a proposed adaptive filter, making the target detection more reliable. In the second stage, several background subtraction models (BGMs) are tested on the same video, and the best one is chosen to produce the preliminary detection results with high recall but low precision. In the last stage, a lightweight convolutional neural network (CNN) is trained on a small set of annotated samples and then applied to the preliminary detection to eliminate false targets, outputting the final results with high precision.
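To make the data flow concrete, the following minimal Python sketch chains the three stages, assuming the OpenCV KNN subtractor described in Section III-C, the adaptive_smooth function sketched in Section III-B, and a trained LCNN model as sketched in Section III-D; the thresholds and helper logic are illustrative, not the authors' exact code.

```python
import cv2
import numpy as np

knn = cv2.createBackgroundSubtractorKNN(detectShadows=False)

def detect_vehicles(frame, lcnn, min_area=3, patch_size=32):
    """Illustrative three-stage pipeline for one video frame."""
    smoothed = adaptive_smooth(frame)               # stage 1: adaptive filtering
    mask = knn.apply(smoothed.astype(np.uint8))     # stage 2: KNN background model
    n, _, stats, centers = cv2.connectedComponentsWithStats(mask)
    vehicles = []
    for k in range(1, n):                           # label 0 is the background
        if stats[k, cv2.CC_STAT_AREA] < min_area:   # drop isolated noise (< 3 px)
            continue
        cx, cy = centers[k].astype(int)
        half = patch_size // 2
        patch = frame[cy - half:cy + half, cx - half:cx + half]
        if patch.shape[:2] != (patch_size, patch_size):
            continue                                # skip candidates at borders
        prob = lcnn.predict(patch[np.newaxis] / 255.0)
        if prob[0, 1] >= 0.5:                       # stage 3: keep true vehicles
            vehicles.append((cx, cy))
    return vehicles
```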

B. ADAPTIVE SMOOTHING FILTERING
The moving vehicles in satellite videos are so tiny that detection algorithms cannot obtain reliable results because of the many distractors that resemble the targets. Image smoothing is an essential low-level computer vision technique for reducing noise; however, because of their tiny size, the targets are apt to deteriorate after smoothing. Among the commonly used smoothing filters, the mean filter and the median filter have only one parameter (the window size) to control the smoothing strength, and even the smallest window size of three will over-blur the targets, reducing the detection ratio. The Gaussian filter has, besides the window size, another parameter (sigma) to control the strength. Therefore, we chose the Gaussian filter as the basic filter because of its higher controllability.
Despite the Gaussian filter's controllability, we observed that global smoothing did not work well for our task: the targets still got over-blurred. We also noticed that noise often appeared at the edges of tall buildings. To this end, we propose an adaptive local smoothing method that lessens target blur while suppressing noise at strong edges.
Let f(i, j) represent the original value at pixel (i, j), m(i, j) the Gaussian-smoothed value, and g(i, j) the absolute gradient value of the gradient map G, which can be calculated using the Roberts or Sobel operator. Then, the normalized gradient value ng(i, j) is calculated by Equation 1:

ng(i, j) = g(i, j) / max(G)    (1)

where max(G) represents the global maximum value of the gradient map G. Note that ng(i, j) varies within (0, 1.0). After this step, a mean filter is applied to the normalized gradient map, producing the mean-filtered value at each position (i, j), denoted mng(i, j):

mng(i, j) = (1/|W|) Σ_{(u,v) ∈ W(i,j)} ng(u, v)    (2)

where W(i, j) is the mean-filter window centered at (i, j). Finally, the adaptively filtered output image is described by Equation 3:

sf(i, j) = (1 − mng(i, j)) · f(i, j) + mng(i, j) · m(i, j)    (3)

Figure 3 demonstrates an enlarged part of a video (a), its gradient image (b), and the mean-filtered gradient map (c). We can observe that the vehicles in (a) and the object edges are highlighted with bright color in (c). Equation 3 indicates that each output pixel sf(i, j) is a weighted combination of the original value f(i, j) and its Gaussian-filtered value m(i, j), with the weight given by mng(i, j). The Gaussian-blurred values receive larger weights at edges and smaller weights in flat areas. That is to say, the pixels in the bright areas of Figure 3 (c) are smoothed more strongly, while those in the dark areas nearly retain their original values; thus the image is adaptively filtered.
Through this approach, we can suppress noise while retaining as many original pixel values as possible. Figure 4 shows two examples of frames before and after the adaptive filtering. It is hard to notice the difference with the naked eye, but it can be easily observed when the original and filtered images are subtracted, as shown in (d), in which bright colors indicate larger differences and dark colors smaller ones.
The two examples indicate that the moving vehicles and the building edges are smoothed strongly while the rest of the image retains its original values. The kernel size (ks) of the Gaussian filter influences the output: a larger ks decreases the sensitivity, causing more missed true targets. According to our comprehensive tests, ks = 5 is the best option.
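A minimal sketch of the adaptive filter described by Equations 1 to 3, assuming OpenCV; the function name and the default sigma value are illustrative.

```python
import cv2
import numpy as np

def adaptive_smooth(frame, ks=5, sigma=1.0):
    """Adaptive Gaussian smoothing: strong blur at edges, little in flat areas."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    m = cv2.GaussianBlur(gray, (ks, ks), sigma)   # Gaussian-smoothed m(i, j)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)        # gradient map G via Sobel
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    g = np.abs(gx) + np.abs(gy)
    ng = g / (g.max() + 1e-6)                     # Equation 1: normalize to (0, 1)
    mng = cv2.blur(ng, (ks, ks))                  # Equation 2: mean-filtered map
    return (1.0 - mng) * gray + mng * m           # Equation 3: weighted blend
```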

C. KNN BACKGROUND SUBTRACTION MODEL
The GMM and ViBe models are widely used in surveillance systems, but they do not perform well for detecting small moving objects. Instead, we applied the K-nearest-neighbors (KNN) modeling method [40], [41] to build the background model, which achieves sensitive detection with relatively low false detection rates.

1) BASIC BACKGROUND MODEL
Unlike the GMM method, the KNN method adopts a non-parametric approach to model the background, which can adapt to scene changes and obtain sensitive detection. Let x_1, x_2, ..., x_N be consecutive samples of a pixel value. The probability density that this pixel takes the value x_t at time t is estimated by the kernel estimator K:

Pr(x_t) = (1/N) Σ_{i=1}^{N} K(x_t − x_i)    (4)

K is usually defined as a Normal function N(0, δ), where δ is the kernel bandwidth. Based on this probability estimate, the pixel x_t is classified as foreground if Pr(x_t) < TH_p, where TH_p is a predefined threshold that controls the percentage of false positives. As described in [40], this model can be considered a generalization of the Gaussian mixture model in which each sample is represented by its own Gaussian distribution. The model concentrates more on recent observations, enabling sensitive detection, and it avoids parameter estimation, making it easier to use (refer to [40] for more information).
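As an illustration of the kernel estimator in Equation 4, the sketch below classifies one grayscale pixel value against its sample history; the bandwidth delta and the threshold th_p are illustrative values, not taken from the paper.

```python
import numpy as np

def is_foreground(x_t, history, delta=10.0, th_p=1e-4):
    """KDE with a Normal kernel N(0, delta), as in Equation 4 (a sketch)."""
    diffs = np.asarray(history, dtype=float) - float(x_t)
    k = np.exp(-0.5 * (diffs / delta) ** 2) / (delta * np.sqrt(2.0 * np.pi))
    pr = k.mean()                # Pr(x_t) = (1/N) * sum_i K(x_t - x_i)
    return pr < th_p             # foreground when the density is too low
```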
The KNN model is implemented in two steps: 1) decision of background, in which every pixel is classified as background or foreground by comparing it against all of its historical pixels; and 2) history update, in which all the historical pixels are updated at different rates.

2) DECISION OF BACKGROUND
For each pixel, its historical values from previous frames are stored in three history lists: short, medium, and long history, as shown in Figure 5.
Each history contains N samples (N is predefined by the user, e.g., N = 7), and each sample is stored in a structure, as shown in Figure 5. R_i, G_i, and B_i are the red, green, and blue channel values of the i-th sample pixel x_i, and Flag_i is an indicator whose value is 1 when x_i is a potential background pixel and 0 when it is a foreground pixel. Potential background means that the pixel has been classified as background or will be classified as background with high probability. When a new pixel x_t arrives from the next frame, the algorithm calculates the Euclidean distance Dis(t, i) in RGB color space between x_t and each sample x_i (i = 1, ..., 3N) in the three histories:

Dis(t, i) = sqrt((R_t − R_i)^2 + (G_t − G_i)^2 + (B_t − B_i)^2)    (5)
As the calculation iterates from the 1st to the 3N-th sample, a counter Count_D records the number of distances less than a threshold Th_D. At the same time, another counter Count_F records the number of samples whose Flag value equals 1 among those with Dis(t, i) < Th_D. Then, Count_D and Count_F are compared to a threshold TC_K to classify the pixel x_t as background or foreground. The process can be described by the pseudo-code in Figure 6. From the code, one can see that a new pixel x_t is immediately classified as background (value = 1) when there are more than TC_K potential background samples close enough to it (distance less than Th_D); otherwise it is classified as foreground (value = 0). In addition, when Count_D is larger than TC_K, x_t is marked as potential background, and its Flag value is set to 1. For each frame, all the pixels are processed according to this procedure, producing a binary mask image in which 1 represents the background and 0 the moving targets.

Figure 5 (a) shows that the three histories contain the same number of samples, but they cover different time ranges. For example, the short history covers frames from 1103 to 1112, the medium history from 1073 to 1112, and the long history from 1043 to 1112, as shown in Figure 5 (b). This means that the short history contains only the most recent samples, while the medium and long histories reach back to earlier frames; that is, the intervals between samples are larger in the long history and smaller in the medium and short histories. For this reason, the three histories should be updated at different rates. Following the OpenCV implementation, the three time ranges can be computed by Equation 6:

T_short = Round(log(0.7) / log(1 − α))
T_mid = Round(log(0.4) / log(1 − α)) − T_short    (6)
T_long = Round(log(0.1) / log(1 − α)) − T_short − T_mid

where Round(.) represents the rounding function, and α (0 < α < 1.0) is the learning rate, set to 0.02 in our study. Once the time range is determined, the update rate R_j is calculated by R_j = Round(T_j / N) + 1, where T_j represents T_short, T_mid, or T_long, and N is the number of samples in each history (e.g., N = 7 in Figure 5). This means the short history is updated the most frequently and the long history the least.

Figure 7 illustrates the details of updating the three histories. The samples in each history are maintained in a queue, in which the left is the head and the right is the rear. When a new pixel comes, it is first classified as potential background (Flag = 1) or foreground (Flag = 0) and then appended to the rear of the short history. Because the short history keeps only N historical samples, the left-most sample is removed from the queue; thus the samples remain chronologically sorted, and the earliest one is discarded after updating the short history. When updating the medium history, the algorithm randomly selects one sample from the short history and appends it to the rear of the medium history; at the same time, the head sample is removed from the medium history. Similarly, the long history is updated using a randomly chosen sample from the medium history. As the algorithm iterates from the first frame to the last, the three histories get updated at different rates.
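The decision step of Figure 6 can be sketched as follows for one pixel; the threshold values Th_D and TC_K are illustrative.

```python
import numpy as np

def classify_pixel(x_t, samples, flags, th_d=20.0, tc_k=3):
    """KNN decision sketch: samples is a (3N, 3) array of historical RGB
    values, flags a (3N,) array with 1 = potential background sample."""
    count_d = 0                  # samples closer than th_d (Equation 5)
    count_f = 0                  # close samples flagged as potential background
    for x_i, flag_i in zip(samples, flags):
        if np.linalg.norm(np.asarray(x_t, float) - x_i) < th_d:
            count_d += 1
            if flag_i == 1:
                count_f += 1
    is_background = count_f >= tc_k          # enough close background samples
    new_flag = 1 if count_d >= tc_k else 0   # mark as potential background
    return is_background, new_flag
```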

D. PROPOSED LCNN MODEL
Recently, convolutional neural networks (CNNs) have been widely employed in object detection and classification. Researchers have proposed and trained a number of successful CNN networks, such as AlexNet [42], VGGNet [43], GoogLeNet [44], ResNet [45], and DenseNet [46], [47]. These typical networks have been widely used in character recognition, image classification, and object detection, but none of them is suitable for detecting tiny moving vehicles in satellite videos, for the following two reasons.
1) The above-mentioned CNN models are designed for natural pictures or high-resolution images, in which the targets are relatively large and have distinguishable features; there, very deep networks help extract features with rich semantic information. The targets in satellite videos, however, are so tiny that multiple convolutional layers are unnecessary.

2) Very deep CNN networks incur high computational costs, resulting in low efficiency. Video processing requires fast computation, and hence lightweight CNN networks are desired.

For these reasons, instead of adopting an existing network, we designed a lightweight CNN (LCNN) for detecting tiny vehicles in satellite videos. Our network is inspired by the LeNet5 network, described next.

1) LeNet5 MODEL
The LeNet5 network, proposed by Yann LeCun et al. in the 1990s, is a neural network architecture for handwritten and machine-printed character recognition [48]. The LeNet5 architecture, as shown in Figure 8, consists of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two fully-connected layers, and finally a softmax classifier.

2) LCNN ARCHITECTURE
There are no general rules for designing a CNN; the design depends on the user's experience and the intrinsic complexity of the dataset, and the best way to find a suitable network is trial and error. After extensive tests, we arrived at an even more lightweight CNN, which consists of one convolutional layer followed by a max-pooling layer and a fully connected layer, as shown in Figure 9.

a: CONVOLUTIONAL LAYER
The convolutional layer is the core building block of a CNN, used to extract features from input images. Two parameters, the number of filter channels and the filter size, need to be determined: too few filter channels cannot capture sufficient features, while too many increase the computational complexity. The proposed network adopts eight filter channels, which performed best for our dataset. To choose the filter size, we tested 3, 5, and 7; the tests indicated that 5 is the best. After the convolutional layer, the Rectified Linear Unit (ReLU) activation is adopted instead of the sigmoid activation used in LeNet5. Another difference from LeNet5 is that padding is used in our convolutional layer so that the feature maps keep the same size (32 × 32) as the input.

b: MAX POOLING
Instead of the average pooling used in LeNet5, a max-pooling is applied after the convolutional layer. With the pool size 2 × 2 and the stride of 2, the eight feature maps are resized to 8@16 × 16 after pooling.

c: FULLY CONNECTED LAYER
After the max-pooling, the feature maps are flattened to a vector of size 2048, and then followed by a fully-connected layer with 100 neurons.

d: OUTPUT LAYER
The output layer has two neurons that correspond to two classes in our task: moving vehicle and others. Table 1 displays all the layers adopted in the proposed CNN and their trainable parameters. Compared to LeNet5, our LCNN is shallower and wider.
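Based on the description above and Table 1, the LCNN can be sketched in Keras as follows; the input channel count and the activation of the 100-neuron layer are our assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lcnn(input_shape=(32, 32, 3)):
    """LCNN sketch: one 8-channel 5x5 convolution, 2x2 max pooling,
    a 100-neuron dense layer, and a 2-way softmax output."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(8, 5, padding="same", activation="relu"),  # 8@32x32
        layers.MaxPooling2D(pool_size=2, strides=2),             # 8@16x16
        layers.Flatten(),                                        # 2048 values
        layers.Dense(100, activation="relu"),
        layers.Dense(2, activation="softmax"),                   # vehicle/other
    ])
```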

3) LOSS FUNCTION AND OPTIMIZATION
Besides the structure itself, a CNN model also involves several other concerns, including the definition of the loss function and the choice of optimizer.

a: LOSS FUNCTION
In deep learning, loss functions measure how well a neural network performs a given task. Concretely, moving vehicle detection can be regarded as a binary classification problem. To soften the output, the softmax activation is adopted to normalize the output probabilities: the network predicts each input as a moving vehicle with probability p_i or as another object with probability (1 − p_i). In this research, the total loss is defined as the binary cross-entropy:

L = −(1/n) Σ_{i=1}^{n} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]    (7)

where n is the total number of samples, i indexes the samples, and y_i (with value 1 or 0) is the true label of the i-th sample.

b: OPTIMIZER
The goal of training is to minimize the loss so that the model works at its best, and the choice of optimization algorithm can mean the difference between good and bad results. The optimizer for our model is Adaptive Moment Estimation (Adam) [49], one of the optimizers that performs best on average. Adam is an extension of stochastic gradient descent (SGD) that has been widely adopted for deep learning; it uses momentum and adaptive learning rates to converge quickly. For simplicity, we used the Keras package's default parameter values, which give satisfactory results in most cases: learning rate α = 0.001, exponential decay rates for the 1st and 2nd moment estimates β1 = 0.9 and β2 = 0.999, and ε = 1e−7, a small constant for numerical stability.
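Putting the loss and optimizer together, a Keras compilation sketch might read as follows; for two one-hot-encoded classes, categorical cross-entropy coincides with the binary cross-entropy of Equation 7.

```python
from tensorflow import keras

model = build_lcnn()  # LCNN sketch from Section III-D
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                    beta_2=0.999, epsilon=1e-7),  # Keras defaults
    loss="categorical_crossentropy",  # = Equation 7 for two one-hot classes
    metrics=["accuracy"],
)
```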

IV. EXPERIMENTS AND EVALUATIONS

A. DATA PREPARATION
The experimental video sequence was acquired by the SkySat-1 satellite over Las Vegas, USA, on March 25, 2014, and it is publicly available. The video, as shown in Figure 10, contains 1800 frames with a frame size of 1280 × 720. The resolution is about 1.5 m, and the frame rate is 30 frames per second (FPS). There are hundreds of moving vehicles in each frame, and each vehicle consists of 6 to 20 pixels.

1) DATA ANNOTATION
Training a CNN requires a large number of positive and negative samples. The positive samples can be manually labeled in chosen frames of the video. As mentioned before, the moving vehicles are very tiny, and some are too dim to notice. The labeling work is arduous and tedious, but it must be done with great care to guarantee assessment accuracy. To make the labeling easier, we wrote a small program that continually loops seven consecutive frames centered at the chosen time, generating a very short video clip; in this way, all moving objects become more noticeable and easier to identify. Even so, labeling whole frames remains impractical because, in addition to the training samples, we have to label more samples as ground truth for evaluation. Unlike the training samples, which can be labeled with bounding boxes (Figure 11 (a) and (b)), the ground-truth samples must be labeled pixel-wise, as shown in Figure 12. This job is extremely time-consuming because the targets are so tiny that a displacement of several pixels may cause an inaccurate evaluation. To ease the labeling work, we only selected samples from two sub-regions, as shown in Figure 10. Figure 11 shows some labeled samples on two frames from the sub-region Data1, in which the red rectangles in the right images indicate the corresponding moving vehicles in the original frames (left).

2) TRAINING SAMPLES GENERATION
Unlike the positive samples, negative samples cannot be manually labeled; instead, they are automatically clipped from the image. We can generate these samples using two sampling strategies: fixed size or variable size. Experiments showed that the variable-size strategy is better than the fixed-size one. The variable-size strategy adopts a random sampling method, as shown in Figure 13. According to the manually labeled rectangles, the minimum and maximum sizes of the sampling window are determined, and windows of random sizes in between are clipped from the image. If a sampled window covers over 70% of any red rectangle, it is saved as a positive sample; otherwise it is saved as a negative sample. For example, the green rectangle in Figure 13 is negative, and the yellow one is positive. In this way, a large number of negative samples and some additional positive samples can be automatically generated for training, as shown in Figure 11 (c). For our task, only small amounts of samples are needed: in our experiment, about 300 positive and 2000 negative samples were generated from two labeled images.
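A sketch of the variable-size random sampling is given below; only the 70% coverage rule comes from the text, while the helper names and the use of random window positions are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def coverage(win, box):
    """Fraction of a labeled box (x, y, w, h) covered by a window (x, y, s)."""
    x, y, s = win
    x1, y1 = max(x, box[0]), max(y, box[1])
    x2, y2 = min(x + s, box[0] + box[2]), min(y + s, box[1] + box[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / (box[2] * box[3])

def random_samples(image, boxes, n, smin, smax):
    """Clip n windows of random size in [smin, smax] at random positions;
    windows covering over 70% of any labeled box become positive samples."""
    h, w = image.shape[:2]
    pos, neg = [], []
    for _ in range(n):
        s = int(rng.integers(smin, smax + 1))
        x = int(rng.integers(0, w - s + 1))
        y = int(rng.integers(0, h - s + 1))
        patch = image[y:y + s, x:x + s]
        if any(coverage((x, y, s), b) > 0.7 for b in boxes):
            pos.append(patch)
        else:
            neg.append(patch)
    return pos, neg
```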

3) NETWORK TRAINING
To train a good network, we also need to deal with two problems: data augmentation and parameter selection.

a: DATA AUGMENTATION

In our experiments, training samples are limited. Data augmentation can increase the size of the training set as well as regularize the network, improving the performance of the model [50]. Several image transformations, including rotation, zooming, shearing, horizontal flipping, and vertical flipping, were employed to expand the samples.
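In Keras, the five named transformations can be configured as follows; the numeric ranges are our assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=20,      # rotating
    zoom_range=0.2,         # zooming
    shear_range=0.2,        # shearing
    horizontal_flip=True,   # horizontal flipping
    vertical_flip=True,     # vertical flipping
)
```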

b: TRAINING PARAMETERS
Training is an iterative process that repeats until certain criteria are fulfilled or a specified number of iterations (called epochs in deep learning) is reached. In each iteration, the samples are divided into batches, rather than processed as a whole, to update the model parameters; the number of samples in one batch is called the batch size. In our experiment, the batch size was set to 128 and the number of epochs to 25.
The training was conducted using the deep learning framework Keras [51] and Python. With the above samples and parameters, training finished within about 150 seconds on a laptop with the following configuration: Intel Core i5-4210M CPU, 2.6 GHz; 8 GB RAM; Windows 10. Compared to heavy CNN models, ours is so lightweight that training can be finished almost instantly without a GPU.
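With the augmenter and the compiled model from the earlier sketches, training with the stated batch size and epoch count reduces to a single call; x_train and y_train stand for the prepared 32 × 32 patches and their one-hot labels.

```python
# Train on augmented batches of 128 samples for 25 epochs (a sketch)
model.fit(augmenter.flow(x_train, y_train, batch_size=128), epochs=25)
```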

B. EVALUATION METRICS
The validation of object detection is often conducted by comparing the detection results against ground-truth images. Our evaluation was performed in an object-based manner, the most common measurement [52]. The detection result (a binary mask image) is overlaid on the labeled ground-truth image, and the intersection over union (IoU) of each reference and detected vehicle is calculated. A reference vehicle is counted as a true positive (TP) if the IoU is equal to or greater than 50%, the recommended threshold for most cases, and as a false negative (FN) otherwise. A detected vehicle is regarded as a false positive (FP) when it does not intersect any reference vehicle. In short, TP represents moving vehicles identified correctly, FP represents detected objects that are not true moving vehicles (false alarms), and FN represents vehicles that were not identified.
Once the TP, FP, and FN numbers are counted, the precision, recall, and F1 score can be computed as follows:

Precision = TP / (TP + FP)    (8)

Recall = TP / (TP + FN)    (9)

F1 = 2 × Precision × Recall / (Precision + Recall)    (10)

Precision is the ratio of correctly predicted positive observations to all predicted positives, and recall (also called sensitivity) is the ratio of correctly predicted positives to all true positives. The higher the precision and recall, the better the detection results.
In practice, higher recall usually comes with lower precision, and vice versa; it is not easy to obtain both high precision and high recall. The F1 score, the harmonic mean of precision and recall, is therefore often used as an overall evaluation metric.
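For reference, the object-based matching and the three metrics can be sketched as follows; boxes are assumed to be axis-aligned (x1, y1, x2, y2) tuples.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def scores(tp, fp, fn):
    """Precision, recall, and F1 as in Equations 8 to 10."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```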

C. RESULTS AND COMPARISONS
The experiments are divided into three parts. The first part focuses on comparing several commonly used background models, guiding the selection of the best background subtraction model for our task. The second part presents the results of the LCNN model and its validation, thus verifying the effectiveness of the proposed LCNN model. The third part compares the proposed adaptive filtering against other methods and discusses its effect on the detection results.

1) COMPARISON OF BACKGROUND SUBTRACTION MODELS
Apart from the KNN background subtraction model described in the previous section, there are other background models for motion detection. These models have mainly been applied and discussed in surveillance monitoring [53]-[55] but have not been investigated for detecting small moving objects from satellite videos. To compare their performance, we selected several other typical background subtraction models, including adaptive selective background learning (ASBL) [56], the improved Gaussian mixture model (GMM2) [57], visual background extraction (ViBE) [23], and the two-points model (TPs). The frame difference method (FD), a fast and typical method for motion detection, was also included in the experiment. The implementations of these background subtraction models are publicly available from OpenCV and the bgslibrary [58].
The comparison was conducted on a cropped region of the SkySat-1 video sequence. The cropped video, named Data1, has 1800 frames and contains about 90 tiny moving vehicles per frame. We selected ten frames (frames 61, 81, 141, 161, 200, 220, 260, 280, 350, and 400) and manually labeled the actual moving vehicles with great care; these labeled images were used as ground truth. The binary mask images generated by the chosen background models were compared against the ground-truth images, and the evaluation followed the metrics described in Section IV-B. All the detection results contained a lot of isolated noise, which is useless for object detection, so before evaluating we removed isolated objects smaller than 3 pixels. For a fair comparison, we selected optimized parameters for each background subtraction model, as shown in Table 2. Figure 14 shows the detection results on the 280th frame of Data1, and Table 3 shows the quantitative evaluation results.
In Figure 14, the true positives (correct detections), false positives (incorrect detections), and false negatives (missed moving vehicles) are highlighted in yellow, red, and cyan, respectively. It can be observed that ASBL (a), ViBE (e), and TPs (f) contain large-area false objects on the building tops, whereas FD (b), GMM2 (c), and KNN (d) do not.

In Table 3, the bold numbers indicate the highest score in each column, and the underlined numbers the lowest. From the scores, one can see that GMM2 was sensitive to noise and thus produced many small false alarms, and that ViBE had the weakest ability to detect small, dim moving objects, leading to low recall. KNN outperformed the other methods with respect to precision, recall, and F1 score. The comparison indicates that KNN is the best choice for our task, and hence the later experiments were conducted using only the KNN background model.

2) EVALUATION OF THE LCNN MODEL
The previous experiments suggest that KNN is the best background subtraction model for our task. Although KNN outperforms the others, its precision (60.3%) is still unacceptable for practical applications, even though its recall is as high as 96.4%. This means that if we can remove the false objects, both high precision and high recall can be achieved. The detected objects are fed to our trained LCNN model to judge whether they are true moving vehicles: for each object, the LCNN outputs a probability P, and the object is accepted as a true target if P >= 0.5 and rejected otherwise. For simplicity, only the KNN background subtraction model was used to generate the preliminary detections, since it had proven the best option for our task. As a comparison, the original LeNet5 model was also tested, trained with the same datasets and parameters. Figure 15 shows the detection results on the 61st full frame of the video. By visual inspection, KNN generated many false alarms on building tops and at building edges, as shown by the green rectangles in Figure 15 (a). Our method, KNN+LCNN, eliminated the majority of false objects while keeping almost all correct targets. In contrast, the LeNet5 network removed almost all false alarms but also removed many true targets, resulting in a low detection ratio.
Further quantitative evaluation was conducted on two sets of validation data, which were cropped from the rectangles shown in Figure 10.
The first dataset, Data1, is from the same region as the preceding experiment. The trained LCNN and LeNet5 models were applied to the preliminary results, removing false objects and producing final results denoted ''KNN+LCNN'' and ''KNN+LeNet5'', respectively. Table 4 displays the precision, recall, and F1 score for each selected frame. It can be observed that KNN achieved precision between 50.3% and 69.1%, that KNN+LCNN increased the precision to at least 90.2% and up to 98.6%, and that KNN+LeNet5 achieved precision between 93.1% and 100%. Increasing precision usually reduces recall: as expected, KNN obtained recall between 93.1% and 100%, whereas KNN+LCNN gained slightly reduced recall ranging from 89.2% to 95.1%. Among the three, KNN+LeNet5 had the poorest recall, between 30.7% and 46.1%. The average recall of KNN+LCNN is 92.6% (last row of Table 4), only about 3.8 percentage points lower than that of KNN, but the average precision of KNN+LCNN increases from 60.3% to 93.3%, a gain of 33 percentage points. Such a great improvement in precision is well worth the slight reduction of recall. The overall scores also indicate that KNN+LeNet5 achieves very high precision (96.9%) but low recall (39.4%).
From the F1 columns of Table 4, one can see that KNN achieved F1 scores ranging from 65.8% to 80.4%, with an average of 74.2%, whereas KNN+LCNN achieved F1 scores between 90.8% and 96.5%, with an average of 93.0%, an increase of about 18.8 percentage points. KNN+LeNet5 produced the lowest F1 scores, between 46.2% and 63.1%, with an average of 56.1%.
To compare the results more intuitively, we selected three of the chosen frames and highlighted the correct (TP), wrong (FP), and missed (FN) targets in different colors, as shown in Figures 16 and 17. In group (a), many false targets appear on the building tops, especially at the edges: because of the parallax caused by satellite motion, the tall building tops behave like slowly moving objects, and some point-shaped distractors are apt to be recognized as moving vehicles. Background subtraction models rely on intensity changes to detect moving objects, so they cannot tell true targets from false ones. Since the LCNN model was trained on vehicle and non-vehicle samples, it can discriminate true objects from false ones. This can be observed in Figure 16 (b), in which the LCNN model suppressed most of the wrong targets on the edges of building tops, leaving only four false targets in frame 61, six in frame 81, and one in frame 280. KNN+LeNet5 removed almost all the false objects, but some true objects were removed as well, causing relatively low recall. The comparison indicates that the LCNN model can effectively identify wrong detections while keeping a high detection ratio.

The second test dataset, Data2, was cropped from another region in Figure 10. The quantitative evaluation was performed on ten chosen frames, as shown in Table 5.
Notice that KNN+LeNet5 again exhibited high precision (94.3%) but largely reduced recall (by 46.4 percentage points). It is worth mentioning that KNN+LCNN largely improved precision and F1 score, by 49.3 and 34.2 percentage points respectively, at the cost of a slight loss of recall (3.8 percentage points). Considering the trade-off between precision and recall, KNN+LCNN is the best option. Figure 18 shows three of the chosen test frames with highlighted results. Similarly, false targets always appeared at the edges and corners of tall building tops, as shown in Figure 18 (a). Besides, global illumination changes between frames caused more false alarms, as shown for frame 900 in Figure 18 (a), which accounts for the lowest precision among the ten frames. The results in Figure 18 (b) were improved by the LCNN model, again demonstrating its effectiveness in removing false targets. Comparing the false-negative markers in Figure 18 (a) and (b), one can notice that the LCNN model also mistakenly removed two true objects from frame 260, one from frame 400, and three from frame 900, accounting for the slight reduction of recall.
Combining the results on Data1 and Data2, the average precision, recall, and F1 score are presented in Figure 19. It can be observed that LCNN increased the precision from 49.5% to 90.6% (by 41.1 percentage points) and the F1 score from 64.4% to 91.0% (by 26.6 percentage points), with the recall reduced only from 95.1% to 91.3% (by 3.8 percentage points). LeNet5 achieved the highest precision, 95.6%, but with the poorest average recall, 43.4%. The experimental results show that the proposed LCNN model can largely improve precision at the expense of only a slight reduction of recall. It is worth mentioning that this comparison does not mean that LCNN is better than LeNet5 in general, but that LCNN is more suitable for detecting tiny objects in our task.

3) ANALYSIS OF THE FILTERING
For simplicity, filtering was not applied in the preceding experiments. This subsection presents the filtering results and discusses the effect of different filters. We do not consider sophisticated filters, such as the bilateral filter [59] and the total-variation filter [60], which are unsuitable for video processing because of their high computational costs. The proposed adaptive filter and three conventional filters (the mean, median, and Gaussian filters) are compared.
In this experiment, each filter was first applied to the input video, and then the KNN background model and the LCNN model were used to produce the detection results. For comparison, we generated two groups of results: Group 1 was detected by the KNN background model only, and Group 2 was first detected by KNN and then refined by the LCNN model. In addition, detection was also conducted on the unfiltered video. The results were evaluated on the ten frames chosen in the preceding experiment; Table 6 shows the quantitative evaluation.
Group 1 shows that the Median filter achieved the lowest precision and the Mean filter the highest. All filters reduced recall, but the Adaptive filter retained the highest recall among them. The Mean filter gained the highest F1 score, indicating the best trade-off between precision and recall; however, the F1 scores of the Gaussian and Adaptive filters are very close to that of the Mean filter, so the three have nearly the same trade-off performance. Nevertheless, the Adaptive filter should be considered the best choice because of its highest recall among the three: high recall ensures that more potential targets are detected, and false targets can be further eliminated by the LCNN model, whereas missed targets cannot be retrieved by any later step. Therefore, higher recall is preferable when the F1 scores are at the same level.
Group 2 shows that the Mean filter achieved the highest precision and the None option (no filtering) the lowest; in terms of recall, the None option achieved the highest value and the Mean filter the lowest. Although the Adaptive filter achieved neither the highest precision nor the highest recall, it gained the highest F1 score, indicating the best trade-off in the final results. That is to say, the Adaptive filter fulfills our purpose: suppressing noise while retaining as many true objects as possible. The evaluation verifies the effectiveness of the Adaptive filter for improving the final detection results.

Table 6 also indicates that precision and recall are mutually exclusive: it is not easy to obtain both simultaneously, and the best option depends on the purpose. The data should not be filtered when high recall is required; the Mean filter is the best choice when high precision is required; and the Adaptive filter is the option for the best trade-off between precision and recall.

V. CONCLUSION
Satellite videos can monitor moving objects over city-scale regions, enabling potential applications in smart traffic management and urban planning. However, it is challenging to correctly extract small moving vehicles from these videos because of the many distractors on building tops. To increase the accuracy of moving vehicle detection, this article proposed a terse framework that first applies a background subtraction model to an adaptively filtered video to obtain candidates at high recall and then applies a lightweight CNN model to suppress false targets. This article also tested and evaluated several commonly used background subtraction models, showing that the KNN model is the best choice for detecting small moving objects from satellite videos. To remove false targets, we designed a lightweight CNN model that can be trained quickly with only small amounts of samples. The experimental results show that the proposed CNN model can largely improve detection precision at the expense of only a slight reduction of recall. In addition, this article proposed an adaptive filter that can suppress noise while retaining a high detection ratio.
On the other hand, the proposed method is limited in processing speed. When only KNN was applied, the algorithm processed seven frames per second (FPS); the rate dropped to 2.9 FPS when the LCNN was added (tested on the laptop mentioned above).
There is still room for improvement. The proposed method does not use the coherent information of vehicles across frames, causing some detection gaps: a vehicle detected in one frame may be lost in the next one or several frames and appear again in later frames. Object tracking is suggested as a post-processing step to fill these gaps and further improve the accuracy.