An Anti-Drift Background-Aware Correlation Filter for Visual Tracking in Complex Scenes

Model drift is an unavoidable problem in online visual tracking: it amplifies tracking errors over time and eventually causes tracking failure. The background-aware correlation filter (BACF) tracker draws its training samples from real negative background patches instead of shifted foreground patches, which mitigates drift to some extent. However, in complex scenes involving, for example, occlusion or deformation, inappropriate updates of BACF may still lead to model drift. We propose an anti-drift background-aware correlation filter that introduces a temporal consistency constraint into BACF. That is, we combine the excellent ability of BACF to distinguish foreground from background with the ability of the temporal consistency constraint to stabilize model changes, thereby improving the anti-drift performance of BACF. To improve computational efficiency, we offer a fast algorithm based on the Alternating Direction Method of Multipliers (ADMM) in the frequency domain. In addition, we design a simple yet effective adaptive feature channel selection method, which further improves the success rate and precision of the tracker in complex scenes. Our proposed tracker with hand-crafted features achieves gains of 2.2%, 3.3%, and 5.3% in AUC success rate on the OTB-2013, OTB-2015, and Temple Color-128 datasets, respectively. Our proposed tracker with deep features achieves an AUC success rate of 69.1% on the OTB-2013 dataset. Moreover, our proposed tracker with hand-crafted features achieves a gain of 3.94% in EAO on the VOT-2018 dataset. Overall, the proposed tracker performs favorably against several representative state-of-the-art methods in terms of precision and success rate.


I. INTRODUCTION
Object tracking is an extensively studied problem in computer vision and has numerous advanced applications, such as precision guidance, human-computer interaction, and automatic driving [5]-[7]. Its primary purpose is to continuously estimate the motion parameters of a target by analyzing a sequence of images, given only the target's bounding box in the initial frame. Object tracking is not a simple task in practice, because it inevitably faces interference from complex scenes, such as non-rigid deformation, scale change, motion blur, illumination variation, and occlusion. Although object tracking has made much progress in recent years, it remains a very challenging problem.
The associate editor coordinating the review of this manuscript and approving it for publication was Huazhu Fu.
Arguably, most existing tracking methods can be classified into two categories: generative methods and discriminative methods [24]-[35]. Traditional generative tracking methods are trained on object appearance without considering background information. By contrast, discriminative tracking methods consider both, and usually treat tracking as a binary classification task that distinguishes the object from its surrounding background. Discriminative methods have received more attention than generative methods because they use background information in training and can achieve more accurate target position estimation. Recently, trackers based on discriminative correlation filters (DCF) [8]-[20] have received widespread attention due to their excellent accuracy and high speed, and have shown state-of-the-art performance on many standard benchmarks [1]-[4]. DCF trackers belong to the discriminative tracking methods. They obtain the target samples by cyclic shifts in a search window area instead of dense sampling. Taking advantage of the property that a circulant matrix can be diagonalized by the discrete Fourier transform, DCF trackers transform the complicated spatial correlation operation into efficient element-wise operations in the Fourier domain and achieve extremely high tracking speed.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Although DCF trackers have been widely used because of their excellent accuracy and high speed, model drift remains an unavoidable problem for them. In Fig. 2, we initialize the tracker in the first frame and let it track the target until the end of the sequence. When the average overlap is small, model drift has occurred, and when the average overlap is zero, we consider tracking to have failed. As shown in Fig. 2, model drift amplifies erroneous results over time and eventually leads to tracking failure. The background-aware correlation filter (BACF) [20] tracker obtains its training samples from negative background patches instead of shifted foreground patches and can mitigate drift to some extent. The BACF tracker performs outstandingly among state-of-the-art trackers because of its superior accuracy and real-time performance. However, in complex scenes involving, for example, occlusion or deformation, inappropriate updates of BACF may lead to model drift.
In this paper, we focus on the problem of model drift in the BACF tracker. Motivated by the use of temporal consistency for visual tracking in [36], we introduce a temporal consistency constraint into the background-aware correlation filter. The temporal consistency constraint suppresses the filter changes that arise when target appearance variations are absorbed by an unstable temporal update of the filter. Under the constraint, the filter model is updated around its historical value. We thus combine the excellent ability of BACF to distinguish foreground from background with the ability of the temporal consistency constraint to stabilize model changes, so that model drift can be mitigated effectively. The resulting optimization problem can be solved efficiently via the Alternating Direction Method of Multipliers [38] and is carried out entirely in the frequency domain for speed. In practical applications, different feature channels have different degrees of discrimination with respect to the target. Some feature channels reduce performance because they do not help to distinguish the target from the background. Therefore, we design a simple yet effective adaptive feature channel selection method, which mitigates model drift as well. We use two evaluation metrics (success rate and precision) to quantify the anti-drift capability of the tracker. Extensive experiments on the OTB-2013 [1], OTB-2015 [2], Temple Color-128 [3] and VOT2018 [4] datasets demonstrate that the proposed tracker clearly outperforms the baseline BACF in terms of success rate and precision. The advantages of our proposed tracker are especially pronounced in some complex scenes. Fig. 1 shows a qualitative comparison of our proposed tracker under occlusion, deformation, and background clutter. Moreover, the proposed tracker performs favorably against several representative state-of-the-art methods in terms of precision and success rate.
The main contributions of our work are summarized as follows:
(1) We introduce a temporal consistency constraint into BACF and propose a new tracker named BACF-TC. To improve computational efficiency, we also offer a fast algorithm based on the Alternating Direction Method of Multipliers in the frequency domain. Extensive experiments show that this method mitigates model drift effectively.
(2) We design a simple yet effective adaptive feature channel selection method, which obtains the response map for each channel by correlating the initial frame information with the filter template. The value of the Peak to Sidelobe Ratio (PSR) of each response map indirectly reflects the ability of that feature channel to distinguish the target from the background. The experiments demonstrate that this method mitigates model drift as well.

II. RELATED WORK

A. BASIC TRACKING METHODS
Generally, there are two typical approaches to object tracking: generative tracking methods and discriminative tracking methods. Since the proposal of the Lucas-Kanade algorithm [24]-[26], generative methods have become popular in many computer vision tasks. The fundamental principle of generative tracking methods is to search a neighborhood for the region most similar to the target object, without considering background information. According to the type of target representation, generative methods can be divided into subspace representation and sparse representation methods. Generative methods based on subspace representation [27], [28] cannot effectively resist the interference caused by occlusion. Generative methods based on sparse representation [29]-[32] perform excellently under occlusion, but most of them cannot meet real-time requirements.
Discriminative tracking methods regard tracking as a binary classification problem. They distinguish the object from its surrounding background using both object appearance and background information. Discriminative tracking methods generally perform better than generative tracking methods, because they use background information in training and can achieve more accurate target position estimation. Discriminative tracking methods are based on a discriminatively trained classifier, such as SVM [33], Adaboost [34], or semi-supervised boosting [35]. More recently, methods based on discriminative correlation filters (DCF) have become the latest development direction of discriminative tracking because of their high speed and excellent performance.

B. TRACKING METHODS BASED ON DISCRIMINATIVE CORRELATION FILTERS
Since the proposal of the MOSSE algorithm [8], which was the first to adopt adaptive correlation filters for tracking, DCF-based tracking methods have become widely used because of their high accuracy and computational efficiency. Many improved DCF-based algorithms have been proposed. DSST [9] and SAMF [10] address target scale change and achieve accurate scale-adaptive visual tracking. STAPLE [11] combines two image patch representations to learn a model that is inherently robust to both color changes and deformations. Danelljan et al. [12] introduce CNN convolutional features into DCF-based tracking frameworks. In [13], Ke Nai et al. learn multiple correlation filters to capture different appearance patterns of the target object during tracking.
Several works use spatial or temporal constraints to mitigate model drift and improve the overall performance of the tracker. The SRDCF tracker [14] introduces a spatial regularization component into the learning to penalize correlation filter coefficients depending on their spatial location. Subsequently, C-COT [15] and ECO [16] introduce a novel formulation for training continuous convolution filters and employ an implicit interpolation model to pose the learning problem in the continuous spatial domain. In [17], Kenan Dai et al. introduce an adaptive spatial regularization that can be learned effectively with respect to the specific object being tracked and results in more reliable filter coefficients during the tracking process. These methods consider only boundary effects and do not consider the smoothing of model updates. In [18], Feng Li et al. propose the STRCF tracker, which incorporates both temporal and spatial regularization constraints and achieves superior performance over SRDCF in terms of accuracy and speed. Xu et al. [19] design a low-dimensional discriminative manifold space by exploiting temporal consistency. These methods mitigate model drift effectively by considering the smoothing of model updates, but do not use information from real negative samples. BACF [20] is capable of learning and updating the filter from real negative examples densely extracted from the background, instead of from shifted foreground patches. Therefore, the BACF tracker improves accuracy effectively compared to prior CF-based trackers. However, in some complex scenes involving, for example, occlusion or deformation, the BACF tracker may suffer from model drift. In this paper, we propose an anti-drift background-aware correlation filter that introduces a temporal consistency constraint into BACF.

III. PROPOSED APPROACH
In this section, we first briefly introduce the background-aware correlation filter. Next, we describe how to introduce the global temporal consistency constraint into the background-aware correlation filter and present the proposed algorithm in detail.

A. BASELINE METHOD
Essentially, in the correlation filter (CF) framework, tracking is cast as a binary classification task that distinguishes the object from its surrounding background. In the training phase of the CF framework, we need to learn a multi-channel correlation filter h from the sample x of the target appearance. Generally, we obtain the target sample x by cyclic shifts in a search window area. Because a circulant matrix can be diagonalized by the discrete Fourier transform, the CF tracker can process each patch sample rapidly.
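The role of the circulant structure can be checked numerically. The following is a minimal single-channel sketch, not the tracker's implementation: it builds the matrix of all cyclic shifts explicitly and compares the resulting responses with the element-wise Fourier-domain computation. The function names and the shift convention (row j holds x shifted by j positions) are assumptions made for illustration.

```python
import numpy as np

def circulant_from_shifts(x):
    """Stack all cyclic shifts of x as rows; row j is x shifted right by j."""
    return np.stack([np.roll(x, j) for j in range(len(x))])

def correlate_dense(x, h):
    """Response of filter h to every cyclic shift of x, via an explicit D x D matrix."""
    return circulant_from_shifts(x) @ h

def correlate_fft(x, h):
    """Same responses for real x, computed element-wise in the Fourier domain."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(h)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(64)   # one feature channel of a vectorized patch
    h = rng.standard_normal(64)   # a filter of the same length
    print(np.allclose(correlate_dense(x, h), correlate_fft(x, h)))
```

The dense version costs O(D^2) per response map, while the FFT version costs O(D log D), which is the source of the speed advantage described above.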
The CF in the spatial domain is formulated by minimizing the objective in Eq. (1), where $x_l$ and $h_l$ refer to the $l$th channel of the vectorized image and filter, respectively. $[\Delta\tau_1, \dots, \Delta\tau_D]$ represents the set of all circular shifts for a signal of length $D$. $y$ is the desired correlation response, and $y(j)$ refers to its $j$th element. $\lambda$ is a regularization parameter, and $K$ represents the number of feature channels. $\star$ refers to the spatial correlation operator. Eq. (1) is a linear least squares problem, so a closed-form solution can be obtained in the Fourier domain using Parseval's formula. Here, the hat symbol denotes the Discrete Fourier Transform of the corresponding variable, such as $\hat{x} = \mathrm{DFT}(x)$; $*$ denotes the complex conjugate, and $\odot$ refers to the element-wise product.
Trackers based on the generic CF cope well with variations in lighting, scale, and pose and with non-rigid deformations, and can operate at high speed. However, they are prone to over-fitting because their training samples are generated through the $[\Delta\tau_j]$ operator, and they cannot separate the target from real non-target patches very well. In order to use the surrounding background as negative samples at the training stage, Galoogahi et al. [20] propose the background-aware correlation filter (BACF), which learns directly from background patches. The BACF is formulated by minimizing an objective in which $P$ is a $D \times T$ binary matrix that crops the middle $D$ elements of the signal $x_k$, with $T \gg D$, where $T$ denotes the length of $x$. Furthermore, $x_k \in \mathbb{R}^{T}$, $y \in \mathbb{R}^{T}$, and $h \in \mathbb{R}^{D}$.
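For reference, the two objectives described in this subsection can be sketched in the notation of the BACF paper [20]. These forms are assembled from the symbol definitions in the text and from the usual statement of the BACF formulation; they are an assumption rather than a verbatim copy of the original equations, and in particular the summation limits and the use of $k$ as the channel index may differ from the original.

```latex
% Generic multi-channel CF objective (assumed form, shifted-sample notation)
E(h) = \frac{1}{2} \sum_{j=1}^{D} \Big\| y(j) - \sum_{k=1}^{K} h_k^{\top} x_k[\Delta\tau_j] \Big\|_2^2
       + \frac{\lambda}{2} \sum_{k=1}^{K} \| h_k \|_2^2

% BACF objective (assumed form): the cropping matrix P selects the middle D
% elements of each shifted T-dimensional sample, with T >> D
E(h) = \frac{1}{2} \sum_{j=1}^{T} \Big\| y(j) - \sum_{k=1}^{K} h_k^{\top} P \, x_k[\Delta\tau_j] \Big\|_2^2
       + \frac{\lambda}{2} \sum_{k=1}^{K} \| h_k \|_2^2
```

Under this reading, the only structural difference between the two is the matrix $P$: shifting the large $T$-dimensional sample and then cropping yields patches that contain real background content rather than wrapped-around copies of the target.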

B. ANTI-DRIFT BACF VIA INTRODUCING THE TEMPORAL CONSISTENCY CONSTRAINT
We propose to learn the anti-drift BACF with a temporal consistency constraint (BACF-TC) by minimizing an objective function (Eq. 4) in which $P$ is a $D \times T$ binary matrix that crops the middle $D$ elements of the signal $x_l$, with $T \gg D$, where $T$ is the length of $x$. $h$ represents the parameters of the filter at the current moment. The model parameter $g_{model}$ is estimated from previous frames and updated as $g_{model} = (1 - \varepsilon) g_{model} + \varepsilon g$, where $\varepsilon$ denotes the learning rate. $\lambda_1$ and $\lambda_2$ are the regularization parameters. The operator $\top$ denotes the conjugate transpose of a complex vector or matrix. $K$ is the number of feature channels. The last term in the objective is the global temporal consistency constraint. By introducing this constraint, our proposed tracker takes the target's previous appearance into account and imposes smooth variation between consecutive frames.
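The running-average update of the model parameter stated above can be sketched as follows. This is only an illustration of the update rule $g_{model} = (1 - \varepsilon) g_{model} + \varepsilon g$; the function name and the array shapes are assumptions, and the filter itself is replaced by random data.

```python
import numpy as np

def update_model(g_model, g, eps):
    """Blend the historical model with the newest filter: (1 - eps) * old + eps * new."""
    return (1.0 - eps) * g_model + eps * g

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g_model = rng.standard_normal((31, 50 * 50))   # hypothetical K x T filter model
    for _ in range(3):                             # three simulated frames
        g_new = rng.standard_normal(g_model.shape) # stands in for the filter learned on the frame
        g_model = update_model(g_model, g_new, eps=0.87)
    print(g_model.shape)
```

With `eps = 1.0` the model is simply the filter from the previous frame, so the constraint only ties consecutive frames together; smaller values retain a share of the older history.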
Similar to general CF trackers [9], our proposed approach transforms its objective into the frequency domain for high computational efficiency. Hence, Eq. 4 can be expressed in the frequency domain. Here, we define the concatenated matrix $\hat{X} = [\mathrm{diag}(\hat{x}_1)^{\top}, \cdots, \mathrm{diag}(\hat{x}_K)^{\top}]$, $h = [h_1^{\top}, \cdots, h_K^{\top}]^{\top}$ and $\hat{g} = [\hat{g}_1^{\top}, \cdots, \hat{g}_K^{\top}]^{\top}$. $I_K$ refers to the $K \times K$ identity matrix, and $\otimes$ denotes the Kronecker product. The hat symbol denotes the Discrete Fourier Transform of the corresponding variable, such that $\hat{y} = \sqrt{T} F y$, where $F$ is the DFT matrix for a $T$-dimensional vectorized signal.
Here, our objective is to minimize $\mathcal{L}(\hat{g}, h, \zeta)$, and the model in Eq. 6 is convex. Using the method of Lagrange multipliers, the augmented Lagrangian in Eq. 7 is minimized jointly with respect to the two primal variables.
Common methods for optimizing unconstrained multivariable functions include stochastic gradient descent (SGD) and Newton's method. However, these iterative methods are time-consuming. The alternating direction method of multipliers (ADMM) [38] is an algorithm that solves convex optimization problems by breaking them into smaller pieces, each of which is then easier to handle.
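To illustrate the ADMM pattern of alternating closed-form updates, the sketch below applies it to a small ridge-regression problem split into two variables with the constraint x = z. This toy problem is an assumption chosen because its exact solution is known; it is not the BACF-TC objective, and the names `lam`, `mu`, `beta` and `mu_max` are illustrative. The penalty is increased as mu = min(mu_max, beta * mu) on every iteration.

```python
import numpy as np

def admm_ridge(A, b, lam, mu=1.0, beta=1.5, mu_max=10.0, iters=300):
    """Minimize 0.5*||A x - b||^2 + 0.5*lam*||z||^2 subject to x = z with ADMM."""
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    zeta = np.zeros(n)                      # Lagrange multiplier for x = z
    AtA, Atb = A.T @ A, A.T @ b
    for _ in range(iters):
        # x-subproblem: quadratic in x, solved in closed form
        x = np.linalg.solve(AtA + mu * np.eye(n), Atb - zeta + mu * z)
        # z-subproblem: also closed form (element-wise)
        z = (zeta + mu * x) / (lam + mu)
        # multiplier update followed by the penalty schedule
        zeta = zeta + mu * (x - z)
        mu = min(mu_max, beta * mu)
    return x, z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 5))
    b = rng.standard_normal(20)
    x, z = admm_ridge(A, b, lam=1.0)
    exact = np.linalg.solve(A.T @ A + np.eye(5), A.T @ b)
    print(np.max(np.abs(x - exact)))
```

Each subproblem here has a closed-form solution, which is the property that makes the same splitting attractive for the filter-learning problem.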
Here, we adopt the ADMM algorithm to solve the following subproblems alternately. We solve each subproblem separately as follows.
Subproblem $h$: To solve for $h$, we set its complex gradient $\nabla_h \mathcal{L}$ to 0 and obtain the optimal solution in closed form, where $g$ and $\zeta$ are defined as $g = 1$ …
Subproblem $\hat{g}$: Here, we can express $\hat{g}$ as $T$ independent objectives $\hat{g}(t)$, with $\hat{g}(t) = [\mathrm{conj}(\hat{g}_1(t)), \cdots, \mathrm{conj}(\hat{g}_K(t))]^{\top}$ for $t = 1, \dots, T$. We acquire the optimal solution by solving Eq. 12 and use the Sherman-Morrison formula [39] to accelerate the computation; the resulting optimal solution is given in Eq. 13.
Subproblem $\zeta$: Given $\hat{h}$, $\hat{g}$ and $\mu$, $\zeta$ can be updated, where $\hat{h} = \sqrt{T}(PF \otimes I_K)h$ and $\mu = \min(\mu_{max}, \beta\mu)$. $\mu_{max}$ denotes the maximum value of $\mu$, and $\beta$ is the scale factor.
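The speed-up from the Sherman-Morrison formula can be sketched on the generic system that arises when a rank-one term is added to a scaled identity, $(cI + xx^{H})g = r$ with $x, r \in \mathbb{C}^{K}$. The constants and right-hand side used in the paper's own solution are not reproduced here; `c`, `x` and `r` are placeholders for illustration.

```python
import numpy as np

def solve_rank_one_plus_identity(x, r, c):
    """Solve (c*I + x x^H) g = r in O(K) using the Sherman-Morrison formula."""
    xhr = np.vdot(x, r)           # x^H r (np.vdot conjugates its first argument)
    xhx = np.vdot(x, x).real      # x^H x, a non-negative real number
    return (r - x * (xhr / (c + xhx))) / c

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = 31
    x = rng.standard_normal(K) + 1j * rng.standard_normal(K)
    r = rng.standard_normal(K) + 1j * rng.standard_normal(K)
    dense = np.linalg.solve(2.5 * np.eye(K) + np.outer(x, np.conj(x)), r)
    print(np.allclose(solve_rank_one_plus_identity(x, r, 2.5), dense))
```

Because one such K-dimensional system is solved for each of the T frequency positions, replacing a dense O(K^3) solve with an O(K) update is what keeps the per-position cost linear in the number of channels.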
Complexity: The subproblem $h$ has a closed-form solution. $O(T\log(T))$ is the cost of computing the IFFT of a signal of length $T$. Ultimately, the complexity of solving it is $O(KT\log(T))$, where $K$ represents the number of channels. The subproblem $\hat{g}$ also has a closed-form solution. $\hat{g}$ can be expressed as $T$ independent objectives for $\hat{g}(t)$; each $\hat{g}(t)$ has a closed-form solution in Eq. 13 with complexity $O(K)$. Hence, the complexity of solving the subproblem $\hat{g}$ is $O(KT)$, and the total complexity of our optimization framework is $O(KT\log(T)I)$, where $I$ represents the maximum number of iterations. A single tracking iteration is summarized in Algorithm 1.

Algorithm 1 Our BACF-TC Approach: Iteration at Time Step

C. ADAPTIVE FEATURE CHANNEL SELECTION
In this section, we design a simple yet effective adaptive feature channel selection method.
General tracking algorithms obtain the final response map by summing the responses of all channels. However, in our experiments, we find that different feature channels have different degrees of discrimination with respect to the target. Some feature channels reduce performance because they do not help to distinguish the target from the background.
In this paper, we obtain the response map for each channel by correlating the initial frame information with the filter template. We adopt the Peak to Sidelobe Ratio (PSR) [8] to evaluate the degree of discrimination of each feature channel. The PSR is defined as $\mathrm{PSR} = (g_{max} - \mu_{s1}) / \sigma_{s1}$, where $g_{max}$ is the peak value, and $\mu_{s1}$ and $\sigma_{s1}$ denote the mean and standard deviation of the sidelobe, respectively. Fig. 3 shows the process of adaptive feature selection. When the PSR of a feature channel drops to around the threshold, we discard that channel.
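A minimal sketch of PSR-based channel selection is given below. It assumes that the sidelobe is everything outside a small square window around the peak (an 11 x 11 window, following common practice for the PSR of [8]) and that a channel is kept when its PSR exceeds the threshold; the window size, the function names and the threshold value used in the demo are assumptions, not the settings of the paper.

```python
import numpy as np

def psr(response, exclude=5):
    """Peak-to-sidelobe ratio: (peak - sidelobe mean) / sidelobe standard deviation."""
    r0, c0 = np.unravel_index(np.argmax(response), response.shape)
    mask = np.ones(response.shape, dtype=bool)
    # Exclude a (2*exclude+1)^2 window around the peak; the rest is the sidelobe.
    mask[max(r0 - exclude, 0):r0 + exclude + 1,
         max(c0 - exclude, 0):c0 + exclude + 1] = False
    sidelobe = response[mask]
    return (response[r0, c0] - sidelobe.mean()) / (sidelobe.std() + 1e-12)

def select_channels(responses, alpha):
    """Indices of the channels whose response map has a PSR above alpha."""
    return [k for k, r in enumerate(responses) if psr(r) > alpha]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    yy, xx = np.mgrid[0:41, 0:41]
    peaked = np.exp(-((yy - 20) ** 2 + (xx - 20) ** 2) / 8.0)
    peaked = peaked + 0.01 * rng.standard_normal((41, 41))   # discriminative channel
    flat = 0.01 * rng.standard_normal((41, 41))              # uninformative channel
    print(select_channels([peaked, flat], alpha=10.0))
```

A sharply peaked response map yields a large PSR, while a map with no distinct peak yields a small one, which is why the PSR can serve as a proxy for how well a channel separates the target from the background.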

A. EXPERIMENTAL SETUP
Our proposed approach is implemented in MATLAB 2017b on a computer with an Intel i5-8400 CPU at 2.80GHz×6 and Nvidia GTX 1060 GPU.
In this paper, we use two evaluation metrics: precision and success rate. Precision is based on the center location error, which is defined as the average Euclidean distance between the center locations of the tracked targets and the manually labelled ground truths. The success rate is the ratio of successful frames as the overlap threshold $S$ varies from 0 to 1. The overlap score is defined as $S = |r_t \cap r_a| / |r_t \cup r_a|$, where $r_t$ and $r_a$ represent the tracked bounding box and the ground truth bounding box, $\cap$ and $\cup$ denote the intersection and union of two regions, respectively, and $|\cdot|$ denotes the number of pixels in the region. We use the area under the curve (AUC) of each success plot or precision plot and the success rate or precision at the conventional threshold of 0.5 (IOU > 0.5) to rank the tracking algorithms. In this paper, we use the one-pass evaluation (OPE) criterion to evaluate the performance of trackers.
Table 1 shows an analysis of our contributions. Our primary contribution is introducing the temporal consistency constraint into BACF. The secondary contribution is an adaptive feature channel selection method, which may further improve overall performance. As Table 1 shows, introducing the global temporal consistency constraint yields significant gains of 2.1% in success rate and 4.4% in precision over the baseline. On top of BACF, adopting the adaptive feature channel selection method adds 0.6% in success rate and 1.0% in precision. On the whole, our tracker improves the performance of the BACF tracker by 2.4% in success rate and 4.8% in precision. Both of our contributions improve performance effectively, especially the former. Furthermore, we analyze the effect of the temporal consistency constraint parameter $\lambda_2$, the threshold of adaptive feature channel selection, and the learning rate $\varepsilon$ on the OTB-2013 dataset.
Fig. 4 shows the AUC success rate for different values of the temporal consistency constraint parameter $\lambda_2$. The AUC success rate is significantly influenced by the choice of $\lambda_2$, and our tracker performs best when $\lambda_2 = 16$. As shown in Fig. 5, the threshold of adaptive feature channel selection $\alpha$ cannot be set too high or too low. A threshold that is too high filters out useful feature channels, while a threshold that is too low fails to remove the interfering feature channels. Here, our tracker achieves the best performance when $\alpha = 1$. Fig. 6 shows the effect of the learning rate $\varepsilon$ on the OTB-2013 dataset. When $\varepsilon = 1$, the temporal consistency constraint becomes $\|g_{l,t} - g_{l,t-1}\|_2^2$, which has the same form as the temporal regularization constraint in STRCF [18]. However, our tracker achieves the best performance at $\varepsilon = 0.87$. We believe the reason is that a more robust model is obtained by considering both inter-frame continuity and historical model information.
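The overlap score, center location error and success curve defined in the experimental setup can be sketched as follows. The box format (x, y, width, height), the 21 evenly spaced thresholds and the approximation of the AUC by the mean of the success curve are assumptions made for illustration; benchmark toolkits may differ in these details.

```python
import numpy as np

def overlap(rt, ra):
    """Overlap score S = |rt ∩ ra| / |rt ∪ ra| for boxes given as (x, y, w, h)."""
    x1, y1 = max(rt[0], ra[0]), max(rt[1], ra[1])
    x2 = min(rt[0] + rt[2], ra[0] + ra[2])
    y2 = min(rt[1] + rt[3], ra[1] + ra[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = rt[2] * rt[3] + ra[2] * ra[3] - inter
    return inter / union if union > 0 else 0.0

def center_error(rt, ra):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    dx = (rt[0] + rt[2] / 2.0) - (ra[0] + ra[2] / 2.0)
    dy = (rt[1] + rt[3] / 2.0) - (ra[1] + ra[3] / 2.0)
    return float(np.hypot(dx, dy))

def success_curve(overlaps, thresholds=np.linspace(0.0, 1.0, 21)):
    """Fraction of frames whose overlap exceeds each threshold."""
    overlaps = np.asarray(overlaps, dtype=float)
    return np.array([(overlaps > t).mean() for t in thresholds])

def success_auc(overlaps):
    """Area under the success curve, approximated by its mean over the thresholds."""
    return float(success_curve(overlaps).mean())

if __name__ == "__main__":
    tracked = [(0, 0, 10, 10), (5, 0, 10, 10), (30, 30, 10, 10)]
    truth = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
    scores = [overlap(t, a) for t, a in zip(tracked, truth)]
    print(scores, success_auc(scores))
```

Once a tracker drifts off the target, the overlap of all subsequent frames falls toward zero, so both the success curve and its AUC respond directly to model drift.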

b: COMPARISON WITH HAND-CRAFTED FEATURES BASED TRACKERS
Here, we make an extensive comparison with representative state-of-the-art trackers, including SRDCF [14], ECOhc [15], STAPLE [11], DSST [9], KCF [21], CSR-DCF [22], BACF [20], SAMF [10], STAPLE_CA [23], and STRCF [18]. SRDCF focuses on mitigating boundary effects by introducing spatial regularization. ECOhc adopts efficient convolution operators and exploits available negative data by including all shifted versions of a training sample. The STAPLE tracker employs color histograms to improve the reliability of the final response. DSST and SAMF are designed to adapt to target scale changes during tracking. CSR-DCF introduces the concepts of channel and spatial reliability to DCF tracking and provides a novel learning algorithm for their efficient and seamless integration into the filter update and the tracking process. STAPLE_CA focuses on introducing background information in training. KCF employs the circulant structure to solve a ridge regression problem in the frequency domain and has a high tracking speed. STRCF focuses on joint spatial-temporal filter learning in a lower-dimensional discriminative manifold.
In Fig. 7, our tracker achieves a success rate (at AUC) of 67.7% and a precision (at AUC) of 82.0%. It is evident that our tracker significantly outperforms the others in this comparison.

c: COMPARISON WITH DEEP FEATURES BASED TRACKERS
We analyze the performance of our proposed tracker with deep features. First, similar to ECO [15], our tracker uses the VGG-M [40] network trained on the ILSVRC [41] dataset to extract deep features. In Table 2, we compare the performance of our tracker when it adopts different convolutional layers. The middle convolutional layer (conv3) performs best compared with the low convolutional layers (conv1, conv2) and the high convolutional layers (conv4, conv5). Ultimately, our BACF-TC_deep tracker adopts the output of the conv3 layer of the VGG-M network. In the process of scale search, the number of scales and scale step are set to 4.5 and 1.02, respectively. Other parameters remain the same. Here, we make an extensive comparison with representative state-of-the-art trackers using CNN features (i.e., ECO [15], SRDCFdeep [14], STRCFdeep [18], CFCF [42], CFnet [43], SiamFC [44]). Fig. 8 shows that our improved tracker still performs excellently compared with trackers based on deep features. Our tracker achieves a success rate (at AUC) of 69.1% and a precision (at AUC) of 83.5%. Our tracker significantly outperforms most of the trackers in this comparison.
In the following, we present a qualitative analysis of several challenging situations.
a) Occlusion. In Fig. 9(a), the box video sequence contains a full occlusion over frames #451 to #479. Only our proposed tracker successfully tracks the box, while the other competing trackers suffer from model drift that leads to failure in this experiment. It is evident that our proposed method outperforms its counterparts in this case.
b) Motion blur. Rapid movement of the camera or the target may cause motion blur in the video sequence, which can lead to tracking failure. In Fig. 9(b), BACF, CSR-DCF and STAPLE_CA fail to track the object under motion blur, whereas BACF-TC achieves accurate tracking.
c) Out-of-plane rotation. Movement of the camera or the target often causes out-of-plane rotation in the target appearance. Fig. 9(c) shows the tracking performance of these trackers under out-of-plane rotation. BACF, SRDCF and BACF-TC track the walking man precisely, while the other trackers lose their target.
d) Background clutter. When the background region contains objects similar to the target, the tracking results are likely to be affected. In Fig. 9(d), the basketball video sequence has background clutter over frames #485 to #725. People in the background region wearing the same clothes as the tracking target pose an enormous challenge for tracking. BACF, STAPLE_CA and Staple suffer from model drift, so that similar objects are mistakenly treated as the tracking target. By contrast, BACF-TC accurately estimates the position and scale of the labelled basketball player.

2) TEMPLE COLOR-128 DATASET
We perform experiments on the Temple Color-128 dataset, which consists of 128 color video sequences. The parameter settings are the same as on the OTB-2013 and OTB-2015 datasets. In Fig. 12, our method achieves a prominent improvement over BACF, with gains of 5.3% in success rate (at AUC) and 7.9% in precision. The precision of our method is the highest among all compared trackers.

3) VOT-2018 DATASET
The VOT-2018 dataset contains 60 challenging sequences annotated with five visual attributes: occlusion, illumination change, motion change, size change, and camera motion. In Table 4, we use the expected average overlap (EAO) to analyze the tracking performance. Table 4 shows that our proposed tracker performs significantly better than the baseline BACF in terms of the EAO metric. Our proposed tracker (BACF-TC) improves the performance of the BACF tracker by 3.94% in EAO. Furthermore, our tracker with deep features (BACF-TC_deep) achieves a conspicuous improvement over BACF, with a gain of 11.59% in EAO.

4) EVALUATIONS ON TRACKING SPEED
For practical tracking applications, tracking speed is a critical factor in determining whether a tracker can be widely used. For a full comparison, Table 5 and Table 6 list the average speed of our proposed tracker and of the state-of-the-art trackers on the OTB-2015 dataset. The average tracking speed of BACF-TC reaches 23.7 FPS on CPU, and the average tracking speed of BACF-TC_deep reaches 18.4 FPS on GPU. Taking both tracking speed and performance into account, our proposed tracker offers a good trade-off for practical applications.

V. CONCLUSION
In this paper, we propose an improved BACF with a temporal consistency constraint (BACF-TC) and offer a fast algorithm based on ADMM in the frequency domain. In addition, we propose a simple yet effective adaptive feature channel selection method. Our approach mitigates model drift effectively and achieves an overall performance improvement. Extensive experiments demonstrate that the proposed tracker outperforms the baseline BACF in terms of success rate and precision. Moreover, the proposed tracker performs favorably against many representative state-of-the-art methods in terms of precision and success rate.