Single Object Tracking in Satellite Videos: A Correlation Filter-Based Dual-Flow Tracker

Satellite video (SV) can acquire rich spatiotemporal information on the Earth's surface. Single object tracking (SOT) in SVs enables the continuous acquisition of the position and range of a specific object, expanding the field of remote-sensing applications. In SVs, objects are small, carry limited features, and are vulnerable to tracking drift. In this article, a correlation filter-based dual-flow (DF) tracker is proposed to explore how the hybridization of spatial-spectral feature fusion and a motion model can boost tracking. To represent small objects, the DF adaptively fuses complementary features using a state-aware indicator in the feature flow. In the motion flow, the indicator perceives the confidence of the feature flow. A dual-mode prediction model is then constructed to simulate the object's motion pattern, coordinating linear and nonlinear motion patterns to implement SOT in SVs. Ablation experiments demonstrate that each component of the DF contributes to tracking. Experimental comparisons on 14 real SVs captured by the Jilin-1 satellite constellation show that DF achieves optimal performance, with an area under the curve of 0.912 in the precision plot, 0.700 in the success plot, and a speed of 155.2 frames per second. This work is expected to encourage the development of remote-sensing ground surveillance.


I. INTRODUCTION
Satellite video (SV) has become a valuable source of surface observation data, providing a wealth of static and dynamic information on specific areas [1]. In 2013, the SkySat-1 video satellite was launched by Skybox Imaging, marking a milestone in the expansion of remote-sensing observation from imagery to video. SkySat-1 can capture panchromatic video with a ground sample distance (GSD) of 1.1 m at a frame rate of 30 frames per second. The emergence of this advanced data drives the development of the remote-sensing field in the visual community. Single object tracking (SOT) in SVs, serving as one of the most fundamental tasks, has prosperous application prospects in dynamic traffic surveillance and analysis [2], ocean monitoring [3], environmental monitoring [4], stereo mapping [5], and super-resolution [6]. SOT in SVs determines the position and range of an object in subsequent frames when its initial state is available only in the first frame. In contrast with SOT in natural videos, it encounters several difficulties, as follows. 1) Limited features: SVs usually contain three bands (red/green/blue), so the spectral features of the object are limited. Moreover, due to the low resolution, the object occupies few pixels and has few spatial features such as structure, which can hinder accurate identification and positioning. 2) Abnormal states: SVs are filmed by fast-moving satellite platforms, so small objects accompanied by a nonstationary background are susceptible to abnormal states (e.g., occlusion, rotation, background clutter, overtaking, and motion blur), which may cause tracking drift. To overcome these issues, researchers have conducted studies on SOT in SVs, which can be classified into detection-based [7]-[10] and discriminative methods.
Detection-based methods usually use interframe motion information to detect and track the moving object. Discriminative methods include deep learning-based [11]-[14] and correlation filter (CF)-based [1], [2], [15]-[19] approaches. Deep learning-based algorithms extract convolutional features of the object to determine its position, which increases the computational burden and slows down the tracking speed. CF-based algorithms start by training a filter with a predefined response on all training samples. The correlation operation is converted to element-wise multiplication by the fast Fourier transform (FFT) followed by the inverse FFT, resulting in a reduction in storage and computation of several orders of magnitude [20]. The pretrained filter is then used to locate the object, and the filter is updated in subsequent frames. The different methods for SOT in SVs will be elaborated upon in the related work (see Section II-B). CF, one of the best discriminative approaches, has been successfully applied to SVs [1], [2], [15]-[19]. It uses cyclic shifts to construct training samples and converts the correlation operation into element-wise multiplication by the FFT, thereby improving accuracy and speed. Despite achieving competitive performance, a single hand-crafted feature, such as the histogram of oriented gradients (HOG) [21], may be limited in representing objects in SVs, whereas the local spectrum inside an object region facilitates tracking [22]. Meanwhile, CFs [20], [22]-[24] update the template without evaluating tracking confidence, which contaminates the template. Tracking drift is an inherent drawback of CFs, resulting in the sample drifting away from the object. Several methods [25]-[27] have been used to overcome tracking drift, but at the cost of high time consumption.
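The FFT identity that makes CF trackers fast can be checked numerically. The following minimal sketch (NumPy, illustrative signal size) verifies that circular cross-correlation equals an elementwise (Hadamard) product in the Fourier domain:

```python
import numpy as np

N = 32
rng = np.random.default_rng(0)
f = rng.standard_normal(N)
h = rng.standard_normal(N)

# Direct circular cross-correlation: g[k] = sum_n f[n] * h[(n + k) mod N].
direct = np.array([np.sum(f * np.roll(h, -k)) for k in range(N)])

# Fourier-domain version: conj(F(f)) ⊙ F(h), then inverse FFT.
fourier = np.real(np.fft.ifft(np.conj(np.fft.fft(f)) * np.fft.fft(h)))

assert np.allclose(direct, fourier)
```

The O(N log N) Fourier route replaces the O(N^2) sliding evaluation, which is the "several orders of magnitude" saving cited above.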
These methods ignore the motion model, which can be a simple and efficient remedy. To address the problems of limited feature representation and tracking drift, we propose a CF-based dual-flow (DF) tracker. The proposed approach makes the following contributions.
1) A CF-based DF tracker that combines spatial-spectral features and an adaptive motion model is proposed for SOT in SVs. Complementary features representing the texture and spectrum of objects are fused to enhance the representation in the feature flow. In the motion flow, a dual-mode prediction model is constructed, synthesizing the linear and nonlinear motion patterns to prevent tracking drift.
2) A state-aware indicator (SAI) is defined to perceive the confidence of tracking. It achieves the adaptive selection of feature weights in the feature flow while sensing the abnormal states in the motion flow.
3) Ablation experiments are conducted to verify the necessity and performance of the above components for tracking in SVs. Extensive comparisons with 13 representative trackers prove the superiority of the proposed method.
The rest of this article is organized as follows. Related work on SOT is presented in Section II. Section III presents the general tracking framework of Staple [23]. The proposed approach is detailed in Section IV. Section V describes the experiments conducted on SVs. Finally, Section VI concludes this article.

II. RELATED WORK

A. Single Object Tracking
SOT is an open and fascinating field with a wide range of applications, such as surveillance [28], self-driving [29], sports competitions [30], and atmospheric motion [31]. However, many factors constrain the effectiveness of SOT, such as occlusion and deformation, demanding more robust and accurate trackers. Currently, SOT methods can be divided into generative and discriminative methods. Generative methods construct a model to represent the object and find the most similar region within the search region. Typical methods include mean shift [32], particle filters [33], and sparse representation [34]. Finding efficient features to represent the object is a challenge that has a significant impact on tracking accuracy and speed. Moreover, generative methods only consider the characteristics of the object itself, which makes it easy for the sample to drift away from the object. Discriminative methods have become the mainstream research direction. Both object and background regions are used to train the classifier, which makes such trackers more discriminative. Discriminative methods include two frameworks: deep learning based and CF based. CNN-SVM [35], one of the earliest deep learning-based algorithms, combines a convolutional neural network (CNN) with a support vector machine (SVM) [36] for SOT. MDNet [37] uses large amounts of data to pretrain the CNN offline and then fine-tunes it online to adapt to changes in objects during SOT. These methods have difficulty running in real time due to their deep structures and online fine-tuning. To solve these problems, Bertinetto et al. [38] proposed SiamFC, which uses a fully convolutional Siamese network architecture trained end-to-end for SOT. The CNNs are trained offline to solve the similarity learning problem, avoiding online fine-tuning. In this way, SiamFC achieves a good balance of accuracy and speed, earning the attention of many researchers.
Subsequently, many trackers have been proposed, such as SiamRPN++ [39] and SiamMask [40]. Although these methods [39]-[44] have achieved good performance on natural videos, it remains unknown whether they would work well on SVs. CF-based trackers have been a highlight since MOSSE [45] was first proposed. CSK [46] introduces a circulant matrix and the kernel trick based on MOSSE. KCF [20] then extends CSK to multichannel features and introduces multiple kernel functions. However, the scale variation of the object was an unresolved issue until the release of DSST [24] and SAMF [47], which adopt a scale filter to address scale change. To obtain better performance, convolutional features are also used in CFs, such as in C-COT [48] and ECO [49], but the speed is inferior. In general, tracking results achieved with a single feature are not satisfactory. Thus, Staple [23] combines the HOG and GCS features for tracking, and GFS-DCF [50] fuses convolutional features, HOG, and CN. These trackers fuse multiple features for SOT and gain improvements in performance. However, tracking drift remains a problem, and current solutions [25]-[27] mostly come at the expense of running speed.

B. SOT in SVs
Some methods have been developed for SOT in SVs, including detection-based and discriminative ones. Among detection-based methods, Du et al. [9] propose a multiframe optical flow tracker that combines the motion feature (optical flow), integral image, and multiframe difference for SOT. However, the performance of detection-based methods is far from satisfactory on SVs due to the demands on detection accuracy. Discriminative methods are divided into deep learning based and CF based. For the former, the Siamese network is used for SOT in [12], [13], and [51], but the parameters and structures of the networks may need to be adjusted for different SVs. In addition, the deep subnetwork obtains low-resolution representations, which may not be suitable for tracking small objects in SVs [12]. For the latter, faster and more robust CFs are used. Du et al. [2] combine the KCF [20] and frame difference, embedding a fusion strategy in the tracking framework for SOT in SVs. Shao et al. [16], [17] incorporate the motion feature (optical flow) into the KCF framework, achieving superior results. In [19],
the KCF framework is also used to track rotating objects. Xuan et al. [18] propose a CF embedded with a motion estimation algorithm. However, the tracker in [18] is based on the assumption that the motion pattern of the object is linear, so the object may be lost when it follows a curved path. In addition, using the HOG feature alone in [18] does not guarantee the robustness of the tracker. Thus, some trackers [2], [11], [17] fuse multiple features, but the spectral feature is ignored even though spatial features in SVs are equally faint.

III. TRACKING FRAMEWORK
The proposed CF-based DF tracker is modeled on the translation structure of Staple [23], whose overall formulation is elaborated in the following. The desired window p_t locates the object's position in image x_t of frame t and is obtained from the set S_t of candidate windows by maximizing the score

p_t = argmax_{p ∈ S_t} f(T(x_t, p); θ_{t−1})    (1)

where T denotes the image transformation and θ denotes the model parameters to be solved. Based on parameters θ, the function f(T(x, p); θ) assigns a score to window p on image x. θ should minimize the loss L(θ; X_t), which is determined by the previous images x_i (i = 1, 2, 3, . . . , t) and the object's positions p_i (i = 1, 2, 3, . . . , t) collected in X_t:

θ_t = argmin_{θ ∈ ϑ} {L(θ; X_t) + λR(θ)}    (2)

where ϑ denotes the space of parameters and R(θ) denotes the regularization term with relative weight λ that limits the complexity of the model. The final score function is a linear fusion of the HOG and GCS scores

f(x) = γ_hog f_hog(x) + γ_gcs f_gcs(x)    (3)

where γ_hog and γ_gcs are the weights of the HOG and GCS feature scores, respectively. The overall parameters are θ = (β, δ, γ_hog, γ_gcs), in which β and δ can be obtained via the training and detection parts [23]. The fusion result f(x) is calculated by (3), and the new position of the object is estimated by finding the maximum of f(x). Finally, the parameters β and δ need to be updated to adapt to changes in the object.
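As an illustration of the linear score fusion in (3), the following sketch fuses two toy response maps and reads off the argmax as the new position; the weights and maps are hypothetical, not the trained values:

```python
import numpy as np

# Staple-style fusion (Eq. (3)): weighted sum of the HOG (template) and
# GCS (histogram) response maps; the new position is the fused argmax.
def fuse_and_locate(f_hog, f_gcs, gamma_hog=0.7, gamma_gcs=0.3):
    f = gamma_hog * f_hog + gamma_gcs * f_gcs
    row, col = np.unravel_index(np.argmax(f), f.shape)
    return (row, col), f

f_hog = np.zeros((5, 5)); f_hog[2, 3] = 1.0   # HOG peak at (2, 3)
f_gcs = np.zeros((5, 5)); f_gcs[2, 3] = 0.5   # GCS agrees, weaker peak
pos, fused = fuse_and_locate(f_hog, f_gcs)
assert pos == (2, 3)
```

In the full tracker the weights are not fixed constants; Section IV replaces them with SAI-driven adaptive weights.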

IV. PROPOSED APPROACH
In this section, we first introduce the overview of the DF tracker. We then combine complementary features for tracking. In addition, the SAI and the adaptive fusion mechanism of the feature flow are described. Finally, the dual-mode prediction model of the motion flow is detailed.

A. Overview

Fig. 1 shows the overall framework of the proposed DF tracker, comprising a feature flow and a motion flow. In the feature flow, complementary features of the object, including the HOG, GCS, and CN, are exploited to represent it. An adaptive fusion mechanism based on the SAI is then used to obtain the fusion results of the feature flow. For further refinement, the results are transferred to the motion flow. If the SAI value of the fusion response map is greater than the threshold κ, the object's position is determined by the fusion results. Otherwise, a dual-mode prediction model is activated to predict the position by analyzing the previous motion pattern.
B. Complementary Features

1) HOG: The HOG captures spatial texture and contours and has inherent illumination invariance, which makes it suitable for SOT in SVs. However, it is sensitive to deformation because it relies on the spatial layout of the object, so it cannot achieve robust tracking on its own.

2) Global Color Statistics:
The GCS feature is a global spectral probability model trained from the foreground and background regions in the first frame. It is inherently invariant to permutation. However, the response map of the GCS feature is flat-peaked, which means it can only serve as an auxiliary cue for SOT.

3) Color Names: The CN is an 11-dimensional spectral label excavated from the spectral features of the target. It can compensate for the information limitation of the HOG and GCS with a detailed spectrum. In the training part, the CN is an H-channel image F_x : D → R^H obtained from image x and defined on a finite grid D ⊂ Z². The per-image loss is

ℓ(x, p; α) = || Σ_{n=1}^{H} α_n ⋆ F_n(T(x, p)) − y ||²    (4)

where α_n is channel n of the multichannel image α and ⋆ is circular correlation. The label function y is a desirable Gaussian function that decays from 1 at the center of the object to 0 for the shifted samples at the edge. For efficiency, α is computed in the Fourier domain, which transforms the circular correlation into a Hadamard product; α̂_n denotes the discrete Fourier transform of α_n, * denotes the complex conjugate, and ⊙ denotes the pointwise product. According to the approximate formulation in [24], (4) is minimized channelwise by choosing the numerator

d̂_n = (ŷ)* ⊙ F̂_n(T(x, p)),  n = 1, . . . , H.    (5)
In the detection part, the response score f_cn can be obtained from

f_cn = F⁻¹{ Σ_{n=1}^{H} (d̂_n)* ⊙ F̂_n(T(x, p)) / (r̂ + λ) }

where the denominator r̂ = Σ_{n=1}^{H} (F̂_n(T(x, p)))* ⊙ F̂_n(T(x, p)) accumulates the spectral energy of all channels. To adapt to changes in the object, α̂ needs to be updated with η_cn, the learning rate of the CN feature. The quantities r̂_t' and d̂_t' at frame t are recomputed at the new position, and the parameters r̂ and d̂ are updated as

r̂_t = (1 − η_cn) r̂_{t−1} + η_cn r̂_t',  d̂_t = (1 − η_cn) d̂_{t−1} + η_cn d̂_t'.

After obtaining the HOG, GCS, and CN feature scores, and avoiding complex functions, we directly use a linear score function

f_fin = γ_hog f_hog + γ_gcs f_gcs + γ_cn f_cn    (11)

where f_fin is the fusion result and γ_hog, γ_gcs, and γ_cn are the weights of the HOG, GCS, and CN feature scores, respectively. The object's new position is then estimated by finding the maximum of f_fin. Thus, the overall model parameters are θ = (β, δ, α, γ_hog, γ_gcs, γ_cn), in which β, δ, and α can be obtained from the training and detection parts, whereas γ_hog, γ_gcs, and γ_cn are determined by the adaptive fusion mechanism described in the next part. θ is updated over time to adapt to changes in the object.
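A hedged sketch of the CN channel filter described above, with a per-channel numerator d̂, shared denominator r̂, Fourier-domain detection, and learning-rate update in the style of the DSST formulation [24]. The exact normalization and variable roles are assumptions reconstructed from the text, not the authors' code:

```python
import numpy as np

def train(features, y_hat, lam=1e-3):
    # features: (H, h, w) multichannel patch; y_hat: FFT of Gaussian label.
    F_hat = np.fft.fft2(features, axes=(-2, -1))
    d_hat = np.conj(y_hat) * F_hat                        # per-channel numerator
    r_hat = np.sum(np.conj(F_hat) * F_hat, axis=0) + lam  # shared denominator
    return d_hat, r_hat

def detect(d_hat, r_hat, z):
    # Response map: inverse FFT of the normalized Hadamard products.
    Z_hat = np.fft.fft2(z, axes=(-2, -1))
    return np.real(np.fft.ifft2(np.sum(np.conj(d_hat) * Z_hat, axis=0) / r_hat))

def update(d_hat, r_hat, d_new, r_new, eta_cn=0.005):
    # Linear interpolation with the CN learning rate (0.005 in the paper).
    return ((1 - eta_cn) * d_hat + eta_cn * d_new,
            (1 - eta_cn) * r_hat + eta_cn * r_new)

# Training on a patch and detecting on the same patch should peak where
# the (ifftshift-ed) Gaussian label peaks, i.e., at (0, 0).
h = w = 16
yy, xx = np.mgrid[0:h, 0:w]
y = np.exp(-((yy - h // 2) ** 2 + (xx - w // 2) ** 2) / 8.0)
y_hat = np.fft.fft2(np.fft.ifftshift(y))
x = np.random.default_rng(1).standard_normal((3, h, w))
d_hat, r_hat = train(x, y_hat)
resp = detect(d_hat, r_hat, x)
assert np.unravel_index(np.argmax(resp), resp.shape) == (0, 0)
```

The sanity check exploits the closed form: detecting on the training patch reproduces (approximately) the label itself, so the response peaks at the label's peak.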

C. SAI and Adaptive Fusion Mechanism
In tracking, the ideal response map tends to a sharp Gaussian distribution, which is vulnerable to interference (e.g., occlusion, background clutter, and motion blur). To describe the concentration of the distribution, the SAI is proposed to sense the abnormal states of objects and achieve the adaptive selection of feature weights. It is computed from the feature response map s of width w and height h together with its average score s̄, and measures how sharply the response concentrates around its peak. If the SAI value of a feature response map is greater than the threshold, the feature is dominant and its result is reliable. We then use a mechanism to adaptively fuse the complementary features of Section IV-B: the weights γ_hog and γ_cn in (11) are set according to SAI_hog and SAI_cn, the SAI values of the HOG and CN feature response maps, respectively, whereas the GCS weight is fixed at fix_gcs = 0.2, a value derived from extensive experiments. Based on the adaptive fusion mechanism, the DF can make full use of the dominant feature to track small objects in SVs. The tracking confidence is then assessed for abnormal states based on the SAI. If SAI > κ, the confidence of the feature flow results is high, and the object's position is determined at the maximum of the fusion results. Otherwise, the confidence is low and the state of the object is abnormal, in which case the position is obtained from the dual-mode prediction model.
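The adaptive-weighting idea can be sketched as follows. The paper's exact SAI formula is not reproduced here; `sharpness` is a stand-in peak-concentration measure, and splitting the remaining weight 1 − fix_gcs in proportion to the per-feature indicators is an assumption about the mechanism, not the published equation:

```python
import numpy as np

def sharpness(s):
    # Stand-in concentration measure: peak height above the mean,
    # normalized by the spread of the response map.
    return (s.max() - s.mean()) / (s.std() + 1e-12)

def adaptive_weights(resp_hog, resp_cn, fix_gcs=0.2):
    # GCS weight is fixed (0.2 in the paper); the rest is split between
    # HOG and CN according to how concentrated their responses are.
    i_hog, i_cn = sharpness(resp_hog), sharpness(resp_cn)
    rest = 1.0 - fix_gcs
    g_hog = rest * i_hog / (i_hog + i_cn)
    return g_hog, fix_gcs, rest - g_hog  # (gamma_hog, gamma_gcs, gamma_cn)

# A narrow response peak (confident feature) vs. a flat-peaked one.
yy, xx = np.mgrid[0:9, 0:9]
r2 = (yy - 4) ** 2 + (xx - 4) ** 2
sharp = np.exp(-r2 / 0.5)    # concentrated response
flat = np.exp(-r2 / 20.0)    # flat-peaked response
g_hog, g_gcs, g_cn = adaptive_weights(sharp, flat)
assert g_hog > g_cn                         # dominant feature gets more weight
assert abs(g_hog + g_gcs + g_cn - 1.0) < 1e-9
```

The same indicator, thresholded against κ, doubles as the confidence gate that hands control to the motion flow.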

D. Dual-Mode Position Prediction Model
Objects in SVs are vulnerable to abnormal states such as occlusion, rotation, background clutter, overtaking, and motion blur. Although the proposed adaptive fusion mechanism can mitigate their impact, abnormal states inevitably degrade tracking and cause tracking drift. Thus, we propose a dual-mode prediction model to obtain the object's position in the motion flow. Specifically, after obtaining the object's trajectory from the historical results, the curvature of the previous trajectory is used to determine the prediction pattern. If the curvature is small, a Kalman filter [52] is used to obtain the object's position via its linear prediction pattern. Otherwise, the object's trajectory tends to follow a quadratic nonlinear pattern, so its position is predicted by nonlinear regression.
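The mode switch can be sketched as follows, using the standard Menger (three-point) curvature on the recent trajectory; the threshold value is an illustrative assumption, not the paper's setting:

```python
import numpy as np

def curvature(p0, p1, p2):
    # Menger curvature: 4 * triangle area / product of side lengths.
    a = np.linalg.norm(p1 - p0)
    b = np.linalg.norm(p2 - p1)
    c = np.linalg.norm(p2 - p0)
    area = 0.5 * abs((p1[0] - p0[0]) * (p2[1] - p0[1])
                     - (p1[1] - p0[1]) * (p2[0] - p0[0]))
    return 4.0 * area / (a * b * c + 1e-12)

def choose_mode(track, thresh=0.05):
    # Small curvature -> linear branch (Kalman filter);
    # large curvature -> quadratic nonlinear regression.
    k = curvature(*[np.asarray(p, float) for p in track[-3:]])
    return "kalman" if k < thresh else "regression"

straight = [(0, 0), (2, 0.01), (4, 0.0)]   # nearly collinear positions
curved = [(0, 0), (2, 2), (4, 0)]          # sharp turn
assert choose_mode(straight) == "kalman"
assert choose_mode(curved) == "regression"
```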

1) Kalman Filter for Predicting Linear Trajectory:
The Kalman filter [52] is a method for estimating the position and velocity of an object from noisy observations. Let S_k = [x_k, v_{x,k}, y_k, v_{y,k}]^T denote the state vector, where x_k and y_k are the horizontal and vertical positions of the object at frame k, respectively, and v_{x,k} and v_{y,k} are the corresponding velocities. The estimation process can be divided into two parts: the time update and the state update.
In the time update part, the state equation and error transfer equation of the prediction process can be written as

Ŝ_k⁻ = M S_{k−1} + D u_{k−1}
E_k⁻ = M E_{k−1} M^T + Q

where Ŝ_k⁻ is the a priori estimate of the state vector at frame k, S_{k−1} is the posterior estimate of the state vector at frame k − 1, D is the control matrix, u_{k−1} is Gaussian noise with covariance matrix Q at frame k − 1, and E_k⁻ is the a priori estimate of the error covariance matrix at frame k in the prediction step. With a frame interval of 1, the state transition matrix M can be written as

    [ 1  1  0  0 ]
M = [ 0  1  0  0 ]
    [ 0  0  1  1 ]
    [ 0  0  0  1 ]

The observation equation is

Z_k = H S_k + V_k

where Z_k is the observation vector at frame k, S_k is the object's actual state at frame k, and V_k denotes Gaussian noise with covariance matrix R. H is the 2 × 4 observation matrix

H = [ 1  0  0  0 ]
    [ 0  0  1  0 ]

In the state update part, the three main equations can be written as follows:

K_k = E_k⁻ H^T (H E_k⁻ H^T + R)⁻¹
Ŝ_k = Ŝ_k⁻ + K_k (Z_k − H Ŝ_k⁻)
E_k = (I − K_k H) E_k⁻

where K_k denotes the Kalman gain matrix at frame k, Ŝ_k is the posterior state estimate corrected by the observation vector Z_k at frame k, and I denotes the identity matrix.
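The constant-velocity filter above can be sketched compactly; the noise covariances Q and R and the synthetic observations are illustrative assumptions:

```python
import numpy as np

dt = 1.0  # one frame between updates
M = np.array([[1, dt, 0, 0],
              [0, 1,  0, 0],
              [0, 0,  1, dt],
              [0, 0,  0, 1]], float)   # state transition
H = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0]], float)    # 2x4 observation matrix
Q = np.eye(4) * 1e-4                   # process-noise covariance (assumed)
R = np.eye(2) * 1e-2                   # observation-noise covariance (assumed)

def kf_step(S, E, z):
    # Time update (prediction).
    S_prior = M @ S
    E_prior = M @ E @ M.T + Q
    # State update (correction) with observation z.
    K = E_prior @ H.T @ np.linalg.inv(H @ E_prior @ H.T + R)
    S_post = S_prior + K @ (z - H @ S_prior)
    E_post = (np.eye(4) - K @ H) @ E_prior
    return S_post, E_post

# Track an object moving 2 px/frame in x and 1 px/frame in y for 10 frames.
S, E = np.array([0.0, 2.0, 0.0, 1.0]), np.eye(4)
for k in range(1, 11):
    S, E = kf_step(S, E, np.array([2.0 * k, 1.0 * k]))
assert abs(S[0] - 20.0) < 0.5 and abs(S[2] - 10.0) < 0.5
```

When the feature-flow confidence drops, the prediction step alone (no correction) supplies the position, which is what lets the tracker coast through occlusions on straight roads.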

2) Nonlinear Regression for Predicting Nonlinear Trajectory:
The Kalman filter is derived for linear systems and is prone to failure on nonlinear trajectories. Frequently, objects in SVs move smoothly along curved roads. We therefore use quadratic nonlinear regression to simulate trajectories with a nonlinear pattern and predict the object's position. Let (z_i, x_i), i = 1, 2, 3, . . . , k denote the object's position x_i in the x-axis direction from frames z_1 to z_k. The quadratic function of the trajectory can be expressed as

x = b_0 + b_1 z + b_2 z²    (24)

where b_0, b_1, and b_2 are obtained by minimizing the sum of squared residuals

Q(b_0, b_1, b_2) = Σ_{i=1}^{k} (x_i − b_0 − b_1 z_i − b_2 z_i²)².    (25)

Through simplification, the normal equations of (25) can be written as

[ k      Σz_i    Σz_i²  ] [b_0]   [ Σx_i      ]
[ Σz_i   Σz_i²   Σz_i³  ] [b_1] = [ Σz_i x_i  ]    (26)
[ Σz_i²  Σz_i³   Σz_i⁴  ] [b_2]   [ Σz_i² x_i ]

where the sums run over i = 1, . . . , k. By solving for the coefficient vector [b_0, b_1, b_2]^T, the trajectory equation (24) can be obtained to simulate the motion pattern of the object. Similarly, the function in the y-axis direction can be solved, and the object's position obtained.

V. EXPERIMENTS

A. Experimental Settings

1) Dataset: Fig. 2 presents the experimental dataset, and Table I presents its details. Each SV is assigned a dominant abnormal state based on the characteristics of the scenario, and a short description of all states is given in Table II.

2) Evaluation Methodology: The precision plot and success plot are applied to measure tracking performance [53], [54]. The center location error (CLE) is the average Euclidean distance between the centers of the ground truth and the estimated bounding box. The precision plot shows the percentage of frames for which the CLE is smaller than predefined thresholds T_p. Considering the low resolution of SVs and the small size of objects, we use thresholds T_p ∈ [1, 20] to measure positioning performance. In the success plot, the overlap is used for evaluation. Given the ground truth R_G and estimated bounding box R_T, the overlap can be calculated by

O = |R_T ∩ R_G| / |R_T ∪ R_G|

where ∩ and ∪ denote the intersection and union operators, respectively, and | · | is the number of pixels in the region [53], [54].
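The quadratic trajectory fit of the dual-mode model, (24)-(26), can be sketched as follows; solving the 3 × 3 normal equations is equivalent to an ordinary least-squares polynomial fit, and the one-frame extrapolation is illustrative:

```python
import numpy as np

def predict_next(z, x):
    # Fit x = b0 + b1*z + b2*z^2 to the last k frames by solving the
    # normal equations A^T A b = A^T x, where A = [1, z, z^2] (Vandermonde).
    z, x = np.asarray(z, float), np.asarray(x, float)
    A = np.vander(z, 3, increasing=True)
    b = np.linalg.solve(A.T @ A, A.T @ x)
    z_next = z[-1] + 1  # extrapolate one frame ahead
    return b[0] + b[1] * z_next + b[2] * z_next ** 2

# Positions sampled from x(z) = 1 + 2z + 0.5z^2 are recovered exactly,
# so the frame-6 prediction matches the true quadratic.
frames = np.arange(1, 6)
xs = 1 + 2 * frames + 0.5 * frames ** 2
assert abs(predict_next(frames, xs) - 31.0) < 1e-6
```

The same fit is run independently on the y coordinates, giving the full predicted position during abnormal states on curved roads.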
The success plot shows the percentage of frames in which the overlap surpasses thresholds in the range T_s ∈ [0, 1], and measures the tracker's performance in both positioning and estimating the size of the object. In this article, all trackers are ranked by the area under the curve (AUC) of the precision plot and success plot. Compared with the precision plot, the success plot is more representative [9]. Thus, we mainly rank trackers based on the AUC of the success plot and use frames per second (FPS) to evaluate tracking speed.
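The two measures can be sketched directly; boxes are (x, y, w, h) tuples and the sample data are illustrative:

```python
import numpy as np

def cle(gt, est):
    # Center location error: Euclidean distance between box centers.
    cg = (gt[0] + gt[2] / 2, gt[1] + gt[3] / 2)
    ce = (est[0] + est[2] / 2, est[1] + est[3] / 2)
    return np.hypot(cg[0] - ce[0], cg[1] - ce[1])

def overlap(gt, est):
    # |R_G ∩ R_T| / |R_G ∪ R_T| for axis-aligned boxes.
    x1, y1 = max(gt[0], est[0]), max(gt[1], est[1])
    x2 = min(gt[0] + gt[2], est[0] + est[2])
    y2 = min(gt[1] + gt[3], est[1] + est[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = gt[2] * gt[3] + est[2] * est[3] - inter
    return inter / union

gt, est = (10, 10, 8, 8), (12, 10, 8, 8)   # estimate shifted 2 px in x
assert abs(cle(gt, est) - 2.0) < 1e-9
assert abs(overlap(gt, est) - 0.6) < 1e-9
```

Sweeping a threshold over the per-frame CLE values yields the precision plot; sweeping T_s over the per-frame overlaps yields the success plot.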

3) Implementation Details: The weight λ is set to 1 × 10⁻³, and the fixed search area is 60² pixels. Considering that the changes of objects are stable, the learning rates η_hog, η_gcs, and η_cn are set to 0.01, 0.005, and 0.005, respectively. The effects of the SAI threshold κ on the tracking results are presented in Table III. The optimal result is obtained with κ = −0.6, and a sample is shown in Fig. 3. The other parameters are set to the same values as those in Staple [23], and all trackers are executed on a workstation with a 3.20 GHz Intel(R) Xeon(R) Gold 6134 CPU (32 cores) and an NVIDIA GeForce RTX 2080 Ti GPU.

B. Ablation Study
To validate the proposed DF, five variants are evaluated, including two addition experiments (DF_CF and DF_CFAF) and three removal experiments (DF_NAF, DF_NRE, and DF_NKF). Fig. 4 shows the precision and success plots, and Table IV summarizes the components and experimental results of these trackers. DF_CF is the baseline tracker, retaining only the translation structure of Staple. DF_CFAF adds the adaptive fusion of CN to DF_CF, and DF_NAF removes the adaptive fusion of CN from DF. DF_NRE and DF_NKF remove the nonlinear regression and the Kalman filter of the dual-mode prediction model from DF, respectively.
1) For Feature Flow: In Table IV, comparing the baseline DF_CF with DF_CFAF shows that the AUC of the precision plot is improved from 0.675 to 0.731 (a 5.6% improvement) and that of the success plot from 0.477 to 0.555 (a 7.8% improvement) by the feature flow. Comparing DF and DF_NAF, we find 14.6% and 10.5% reductions in the AUC of the precision and success plots after removing the adaptive fusion part from DF. Without the feature flow, DF_NAF cannot adaptively fuse the complementary features of the object, making it difficult to represent small objects, which leads to tracking failure. Fig. 5 shows tracking examples of DF_NAF and DF, where DF can discriminate the object from the background and avoid tracking drift.
2) For Motion Flow: Comparing DF and DF_CFAF shows that the AUC of the precision plot is reduced from 0.912 to 0.731 and that of the success plot from 0.700 to 0.555 without the motion flow. This is due to the inability to perceive the abnormal states of the object and predict its position; therefore, DF_CFAF encounters tracking drift. Compared with DF_CF, DF_NAF yields a gain of 9.1% in the precision plot and 11.8% in the success plot. Furthermore, to evaluate the effects of the motion flow, DF_NRE and DF_NKF are added for validation. As presented in Table IV, the AUC of DF is superior to those of DF_NRE and DF_NKF, and DF achieves the best performance. This is because the dual-mode prediction model coordinates the linear and nonlinear motion patterns, allowing it to handle abnormal motions such as lane changes and turns. As shown in Fig. 6(a), a vehicle moves on a straight road while another with similar features passes by quickly. DF_NRE locates the vehicle, whereas DF_NKF fails, because the Kalman filter predicts a linear trajectory more precisely than the nonlinear regression does. In Fig. 6(b), a vehicle is completely occluded by bridges while traveling at high speed on a curved highway. In this case, DF_NKF locates the vehicle, whereas DF_NRE loses it under the occlusion, which is attributed to the nonlinear regression retained in DF_NKF. Overall, the proposed DF can determine the prediction mode based on the motion pattern, so it achieves superior results.

C. Comparison With State-of-the-Art Methods
We compared the proposed method with 13 trackers, namely, KCF [20], SAMF [47], Staple [23], C-COT [48], fDSST [55], ECO [49], SiamRPN [42], SiamRPN++ [39], ASRCF [56], GFS-DCF [50], CFME [18], SiamFC++ [57], and TransT [58]. These methods include CF-based and deep learning-based trackers. The CF-based CFME is an open-source tracker designed for SOT in SVs. Few trackers are tailored for SVs; their codes are mostly not public, some key variables are omitted, and they were tested on unpublished datasets under different benchmarks. Therefore, we selected CFME for comparison. Table V summarizes the characteristics of the trackers and the experimental results, sorted by the AUC of the success plot, and Fig. 7 presents the average precision and success plots. With AUCs of 0.912 and 0.700 in the precision and success plots, the proposed method achieves remarkable performance, whereas KCF performs worst. CFME produces competitive performance, ranking first among the compared trackers, because a motion average and a Kalman filter are embedded in KCF to mitigate tracking drift. The proposed DF tracker surpasses CFME by 10.2% and 9.8% in the precision and success plots, respectively. Compared with ECO, the champion of VOT2017, the proposed method provides gains of 19.9% in the precision plot and 17.3% in the success plot due to the exploitation of potential spatial-spectral features. Compared with ASRCF and GFS-DCF, the proposed approach reaches 23.1% and 20.6% boosts in the success plot due to the consideration of the motion model. This suggests that the motion information contained in adjacent frames facilitates tracking in SVs. In contrast with SiamRPN++, the proposed method achieves a solid improvement in accuracy. Compared with Staple, DF increases the precision and success plots by more than 24%, and compared with the SiamFC++ and TransT trackers, the proposed method exceeds them by 26% and 38.5% in the precision plot and by 25.7% and 35% in the success plot, respectively.
Overall, the experimental results verify that the proposed DF tracker tracks objects well, which is attributed to both the adaptive fusion mechanism incorporated in the feature flow and the dual-mode prediction model embedded in the motion flow. Moreover, DF is capable of running at over 155 FPS on the CPU. Compared with trackers operating on the CPU or GPU, DF achieves real-time speed in tracking objects in SVs.
The experimental results demonstrate the state-of-the-art accuracy and superior speed of the proposed tracker. Fig. 8 shows the precision and success plots for each state to evaluate the strengths and weaknesses of the trackers. For clarity, Fig. 9 shows radar plots for the top seven trackers. For the precision plots, DF ranks highest in three (occlusion, overtaking, and motion blur) out of five states and first in overall AUC. For the success plots, DF ranks among the top two trackers in four out of five states. The reason DF achieves inferior results on the rotation datasets is that slight background jitter affects the position of the object, weakening the performance of the dual-mode prediction model. The proposed method achieves fourth place on the background clutter data because the object is relatively similar to the background, which limits the extraction of prominent features. DF achieves a significant improvement under the occlusion state, which is attributed to the SAI and the dual-mode prediction algorithm: the SAI perceives the occlusion and nonocclusion states, and the signal is then transmitted to the dual-mode prediction algorithm, which synthesizes the linear and nonlinear motion patterns to handle occlusion of objects, yielding significant performance. Overall, DF is capable of coping with the abnormal states of objects through the hybridization of spatial-spectral feature fusion and the motion model.
For visual comparison, tracking examples of the top four trackers are shown in Fig. 10. In Fig. 10(c), a vehicle is occluded twice while moving along a curved highway. DF is capable of sensing the abnormal state and predicting the object's position, whereas C-COT and ECO both lose the object. Although CFME can predict the object's position, it loses the object due to its limited consideration of nonlinear motion patterns. In the overtaking case in Fig. 10(e), a vehicle travels along a narrow street, appearing similar to the buildings and the vehicles parked on the sides of the road. Only CFME and DF capture the object in all frames, and DF tracks more accurately. In the other cases shown in Fig. 10, the proposed DF tracks objects with higher accuracy.

VI. CONCLUSION
SOT has great potential in remote-sensing surveillance. In this article, we explored SV SOT from the perspective of spatial-spectral feature fusion and motion modeling and proposed a CF-based DF tracker to address the problems of limited feature representation and tracking drift. In the feature flow, an adaptive mechanism is employed to fuse complementary features. The results are then refined in the motion flow, where a dual-mode prediction model simulates the motion patterns to search for the object's position, making the tracker robust to abnormal states. Extensive experiments on 14 SVs prove the outstanding performance in tracking objects in SVs. Future work should focus on handling the rotation of objects.