Robust Visual Object Tracking With Multiple Features and Reliable Re-Detection Scheme

In recent years, correlation filter based trackers have seen widespread success because of their high efficiency and robustness. However, a tracker based on a single feature cannot deal with complex scenes involving serious occlusion, motion blur and illumination variation. In this paper, we develop a novel tracking method combining the color feature, the Hog feature and the motion feature. The motion feature is estimated between adjacent frames by large displacement optical flow. Besides, in order to cope with the boundary effect existing in traditional correlation filter based trackers, an adaptive cosine window is introduced in our method, which can highlight the target region, suppress the background region and enlarge the search region. Meanwhile, a novel judgement scheme combining the Hog correlation response and the color response is adopted to evaluate the reliability of the tracking result. Finally, inverse sparse representation is used to locate coarse positions of the target in case of tracking failures. Extensive experiments on five well-known tracking benchmarks, including OTB100, TColor-128, UAVDT, UAV123 and VOT2016, demonstrate that our proposed method outperforms other state-of-the-art methods in terms of robustness and accuracy.


I. INTRODUCTION
Visual object tracking, one of the classical and fundamental research topics in computer vision, has long been widely used in traffic monitoring, medical image processing, automatic driving and video surveillance. Although great breakthroughs have been made in the past decade [1]-[13], designing a general and robust tracker remains a challenging task due to many unpredictable factors, including illumination change, scale variation, serious occlusion and motion blur.
Recently, trackers based on correlation filter (CF) have been proposed and obtained promising performance on many challenging benchmarks. The core of CF trackers is to train a discriminative classifier to separate the target from its surrounding background [14]-[16]. By exploiting the Fast Fourier Transform (FFT) on the circularly shifted training samples, the target in a new frame can be located at a very low computational complexity. Although CF trackers have achieved promising performance at an extremely high speed, there still exist some factors which severely hamper the tracking performance. First, there are undesired boundary effects in CF trackers due to the periodic assumption. Discriminative correlation classifiers are trained with circularly shifted versions of the target, and only the detection scores near the center of the search region are accurate. Therefore, only a restricted search area is used to train the correlation filter, which makes CF trackers easily drift to the background in the presence of heavy occlusion and motion blur. Second, historical information of the target in video frames is not considered in most CF trackers. The position of the target rarely changes greatly between consecutive frames, which can be used to improve tracking accuracy. Third, there is no re-detection scheme in traditional CF trackers in case of tracking failures. When the target is heavily occluded or out of view, most CF trackers are not able to locate the target again.
The associate editor coordinating the review of this manuscript and approving it for publication was Guitao Cao.
In this paper, we address the above-mentioned problems from several aspects. First, in order to solve the boundary effect, we introduce an adaptive cosine window to enlarge the search region, which is composed of a traditional cosine window and an adaptive target likelihood map. The target likelihood map is computed for each frame and estimates the probability of each pixel belonging to the target or the background. Thus, this scheme can highlight the target and suppress the background. Second, we use large displacement optical flow to estimate the motion feature in adjacent frames. Then we combine the Hog correlation response, the response from the color feature and the motion feature to obtain a robust response of the target. Finally, we use a re-detection module to judge the reliability of the tracking result. If the tracking result is regarded as unreliable, we introduce reverse sparse representation to refine the candidates and get coarse locations of the target. The flowchart of our proposed method is shown in FIGURE 1.
The main contributions of this paper are summarized as follows:
- We combine the traditional fixed cosine window and the target likelihood map of each frame to form an adaptive cosine window, which can effectively cope with the boundary effect.
- We introduce large displacement optical flow to predict the motion feature of adjacent frames, which can enhance the robustness of the final response of the target.
- We adopt a novel judgement scheme to estimate the reliability of tracking results. If the tracking result is considered unreliable, we use a reverse sparse representation scheme to locate coarse positions of the target.
- We conduct extensive experiments on OTB100 [17], TColor-128 [18], UAVDT [19], UAV123 [20] and VOT2016 [21] to demonstrate the superiority of our method.
The remainder of this paper is organized as follows. In Section II, we give a brief description of related work on visual tracking. Section III describes our proposed method in detail. In Section IV, we report the experimental results and discussions. Finally, we draw conclusions in Section V.

II. RELATED WORK
A. TRACKERS BASED ON CORRELATION FILTER
CF trackers have obtained promising tracking results owing to dense sampling and efficient computation in the Fourier domain. Bolme et al. [22] first applied the correlation filter to visual tracking using the minimum output sum of squared error (MOSSE). MOSSE attracted a huge amount of interest with a tracking speed of more than 600 fps. Henriques et al. [23] exploited the circulant structure of training samples in the kernel space (CSK). Later, Henriques et al. [24] further extended the CSK tracker from a single-channel feature to multiple features, named the kernelized correlation filter (KCF). In order to deal with scale variation, Danelljan et al. [25] proposed a discriminative scale space tracker (DSST) to accurately estimate the scale of the target. Li and Zhu [26] developed a scale adaptive tracker with multiple features (SAMF) including Hog and color-naming. To alleviate the boundary effect, Danelljan et al. [27] presented a spatially regularized discriminative correlation filter (SRDCF) tracker, which can train the correlation filter on a huge set of negative training samples. Galoogahi et al. [28] developed background-aware correlation filters which are trained with real background patches instead of shifted patches. Li et al. [29] proposed spatial-temporal regularized correlation filters (STRCF) which can be solved via the alternating direction method of multipliers. Inspired by the successful application of features from convolutional neural networks (CNN) in image classification [30], image segmentation [31] and image denoising [32], [33], Ma et al. [34] utilized hierarchical convolutional features instead of handcrafted features in the correlation filter framework to improve tracking performance. Qi et al. [35] developed a novel adaptive weighted method to hedge weak CNN based trackers into a stronger one. Danelljan et al. [36] presented a unified learning framework for down-weighting the corrupted training samples and up-weighting the accurate ones. Furthermore, Danelljan et al. [37] developed a generic formulation to learn discriminative convolution operators for visual tracking in the continuous spatial domain. Efficient convolution operators (ECO) were developed by Danelljan et al. [38] to reduce the computational complexity.

B. TRACKERS BASED ON SPARSE REPRESENTATION
Sparse representation is widely used in the field of computer vision, for example in face recognition, image classification and object detection. Motivated by its successful application in face recognition, Mei and Ling [39] first introduced sparse representation into visual tracking, in the so-called $\ell_1$ tracker. Each candidate in the $\ell_1$ tracker is linearly represented by both target templates and trivial templates. The candidate with the least reconstruction error is chosen as the tracking result of the current frame. However, the sparse coefficients corresponding to each candidate need to be computed by $\ell_1$ minimization, which makes the $\ell_1$ tracker computationally expensive. In order to accelerate the $\ell_1$ tracker, Bao et al. [40] proposed an improved $\ell_1$ tracker using an efficient gradient descent optimization approach. Xiao et al. [41] found that tracking methods with $\ell_2$-regularized least squares are able to achieve almost the same performance as methods with $\ell_1$-regularized least squares, at a much lower computational complexity. To improve the performance of trackers based on sparse representation, Zhong et al. [42] built a sparse collaborative tracking model which takes account of both holistic templates and local representations. An efficient and robust tracking method was developed by Jia et al. [43], which considers both structural and partial information of the target object. Zhuang et al. [44] developed a novel reverse sparse representation formulation, which allows multiple templates to be reconstructed simultaneously by the whole candidate set. Wang et al. [45] proposed a novel inverse sparse tracker which uses a locally weighted distance metric to replace the traditional Euclidean distance metric. Sun et al. [46] viewed visual tracking as a two-stage optimization problem which takes account of both temporal discontinuity and continuity information of the target appearance.

A. RESPONSE FROM CORRELATION FILTER
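Each training sample $x_i$ is derived from the circularly shifted version of image patch $x$ along the horizontal and vertical directions, and $y_i$ denotes its regression label. The classifier $w$ is derived from the solution of the following minimization function:

$$\min_{w} \sum_{i} \left( \langle w, \phi(x_i) \rangle - y_i \right)^2 + \lambda \| w \|^2$$

where $\phi(\cdot)$ means the mapping from the linear feature space to the nonlinear one and $\lambda > 0$ is a regularization parameter. Using a kernel $\kappa(x, x') = \langle \phi(x), \phi(x') \rangle$, the classifier $w$ can be expressed as

$$w = \sum_{i} \alpha_i \phi(x_i)$$

Here $\alpha$ is the vector of dual coefficients of $w$ and can be learned by

$$\hat{\alpha}^{*} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda}$$

where $\hat{\alpha}$ means the fast Fourier Transform (FFT) of $\alpha$, $\hat{\alpha}^{*}$ denotes the complex conjugate of $\hat{\alpha}$, and $\hat{y}$ and $\hat{k}^{xx}$ are the FFTs of $y$ and $k^{xx}$. $k^{xx'}$ stands for a Gaussian kernel with bandwidth $\sigma$ and is defined as

$$k^{xx'} = \exp\left( -\frac{1}{\sigma^2} \left( \|x\|^2 + \|x'\|^2 - 2 \mathcal{F}^{-1}\left( \hat{x}^{*} \odot \hat{x}' \right) \right) \right)$$

where $\mathcal{F}^{-1}$ means the inverse FFT and $\odot$ denotes the element-wise product. Given a new image patch $z$ in the next frame, the correlation response map $f_h$ is calculated by

$$f_h = \mathcal{F}^{-1}\left( \hat{k}^{xz} \odot \hat{\alpha} \right)$$

The tracking result in the new frame can be located by searching for the maximum value of the correlation response map $f_h$ computed with the Hog feature. In order to improve tracking performance, the model is updated online by the following formulation,

$$\hat{\alpha}_t = (1 - \eta_1)\,\hat{\alpha}_{t-1} + \eta_1 \hat{\alpha}, \qquad \hat{x}_t = (1 - \eta_1)\,\hat{x}_{t-1} + \eta_1 \hat{x}$$

where $\eta_1$ means the learning rate and the subscript $t$ is the index of the current tracking frame.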
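As a rough illustration of the kernelized correlation filter steps described above (training in the Fourier domain, detection on a new patch and the linear model update), the following is a minimal single-channel sketch. It assumes NumPy, a Gaussian kernel and raw pixel values instead of Hog features; the function names, the normalization by the patch size and the default values of `sigma`, `lam` and `eta` are illustrative assumptions, not the settings used in this paper.

```python
import numpy as np

def gaussian_correlation(x, z, sigma):
    """Gaussian kernel between x and every cyclic shift of z (single channel)."""
    cross = np.real(np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)))
    # Dividing by the number of pixels is a common normalization (assumption).
    dist2 = (np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross) / x.size
    return np.exp(-np.maximum(dist2, 0.0) / sigma ** 2)

def train(x, y, sigma=0.5, lam=1e-4):
    """Dual coefficients in the Fourier domain: y_hat / (k_hat_xx + lambda)."""
    k = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alpha_hat, x_model, z, sigma=0.5):
    """Correlation response map for a new patch z."""
    k = gaussian_correlation(x_model, z, sigma)
    return np.real(np.fft.ifft2(np.fft.fft2(k) * alpha_hat))

def update(old, new, eta=0.02):
    """Linear interpolation update with learning rate eta."""
    return (1.0 - eta) * old + eta * new

def gaussian_labels(shape, width=2.0):
    """Regression labels peaked at index (0, 0), wrapped cyclically."""
    rows = np.arange(shape[0])
    cols = np.arange(shape[1])
    rows = np.minimum(rows, shape[0] - rows)
    cols = np.minimum(cols, shape[1] - cols)
    return np.exp(-(rows[:, None] ** 2 + cols[None, :] ** 2) / (2.0 * width ** 2))

# Toy usage: the response peak follows a cyclic shift of the training patch.
rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))
alpha_hat = train(patch, gaussian_labels(patch.shape))
shifted = np.roll(patch, shift=(3, 5), axis=(0, 1))
response = detect(alpha_hat, patch, shifted)
peak = np.unravel_index(np.argmax(response), response.shape)
```

In this toy setting the location of the response maximum gives the displacement of the target relative to the training patch, which is how the new target position is read off the response map.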

B. RESPONSE FROM MOTION ESTIMATION
Optical flow, which estimates the motion of objects between consecutive frames, is widely applied in computer vision tasks such as image understanding, image analysis and image registration. In this section, we use the large displacement optical flow estimation scheme [47] to estimate the motion information of the target in consecutive frames to promote the tracking performance. Let $I_t$ and $I_{t+1}$ be the t-th and the (t+1)-th frame, $s := (s_x, s_y)^T$ be a location in the rectangular image domain $\Omega$, and $u := (u, v)^T$ be the optical flow sought between the t-th frame and the (t+1)-th frame. Then, the optical flow field can be calculated by minimizing the following energy functional:

$$E(u) = E_{color}(u) + \gamma E_{gradient}(u) + \zeta E_{smooth}(u) + \beta E_{match}(u, u_1) + E_{desc}(u_1)$$

where $\gamma$, $\zeta$ and $\beta$ are tuning parameters. $E_{color}(u)$ encodes color constancy, $E_{gradient}(u)$ encodes gradient constancy, $E_{smooth}(u)$ encodes a robust smoothness constraint, and $E_{match}(u, u_1)$ and $E_{desc}(u_1)$ bias the displacement field towards the correspondences $u_1$ obtained by descriptor matching.
$E_{color}(u)$ assumes that the color value of a pixel is not changed over time by the displacement. The formulation of $E_{color}(u)$ is expressed by

$$E_{color}(u) = \int_{\Omega} \Psi\left( \left| I_{t+1}(s + u(s)) - I_t(s) \right|^2 \right) ds$$

where $\Psi(s^2) = \sqrt{s^2 + \epsilon^2}$, $\epsilon = 0.001$. $E_{gradient}(u)$ allows slight changes in the color value and determines the displacement vector by a criterion that is invariant under color value changes. The expression of $E_{gradient}(u)$ is given by

$$E_{gradient}(u) = \int_{\Omega} \Psi\left( \left| \nabla I_{t+1}(s + u(s)) - \nabla I_t(s) \right|^2 \right) ds \qquad (10)$$

where $\nabla = (\partial_{s_x}, \partial_{s_y})^T$ stands for the spatial gradient. $E_{smooth}(u)$ takes account of the interaction between neighbouring pixels and enforces smoothness of the flow field. The formulation is described as

$$E_{smooth}(u) = \int_{\Omega} \Psi\left( \left| \nabla u(s) \right|^2 + \left| \nabla v(s) \right|^2 \right) ds$$

To guide the flow towards the matched correspondences, the term $E_{match}(u, u_1)$ integrates descriptor matching into the variational formulation, which is described by

$$E_{match}(u, u_1) = \int_{\Omega} \delta(s)\, \rho(s)\, \Psi\left( \left| u(s) - u_1(s) \right|^2 \right) ds \qquad (12)$$

where $\delta(s)$ is an indicator function that equals 1 if a descriptor match is available at location $s$ and 0 otherwise, and $\rho(s)$ describes the confidence of the match. The matching task itself is formulated by $E_{desc}(u_1)$,

$$E_{desc}(u_1) = \int_{\Omega} \delta(s) \left| f_{t+1}(s + u_1(s)) - f_t(s) \right|^2 ds$$

where $f_t(s)$ and $f_{t+1}(s)$ mean the fields of feature vectors in frame $t$ and frame $t+1$, respectively.
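To make the role of the robust penalty and the color constancy term more concrete, here is a small sketch that evaluates a discretized color constancy energy for one candidate flow. It is only an illustration under strong simplifying assumptions: NumPy, a constant integer flow for the whole image and periodic image borders. The full variational minimization of [47] is not reproduced, and the function names are hypothetical.

```python
import numpy as np

EPS = 0.001  # epsilon of the robust penalty, as stated in the text

def psi(s2):
    """Robust penalty Psi(s^2) = sqrt(s^2 + eps^2)."""
    return np.sqrt(s2 + EPS ** 2)

def color_energy(frame_t, frame_t1, u, v):
    """Discrete color constancy energy for a constant integer flow (u, v).

    u is the horizontal (column) displacement and v the vertical (row)
    displacement; borders are treated as periodic (assumption).
    """
    # warped[y, x] = frame_t1[y + v, x + u]
    warped = np.roll(frame_t1, shift=(-v, -u), axis=(0, 1))
    return float(np.sum(psi((warped - frame_t) ** 2)))

# Toy usage: the true displacement gives the lowest energy.
rng = np.random.default_rng(0)
frame_a = rng.random((8, 8))
frame_b = np.roll(frame_a, shift=(2, 3), axis=(0, 1))  # moved 2 rows down, 3 columns right
best = min(((color_energy(frame_a, frame_b, u, v), (u, v))
            for u in range(-4, 5) for v in range(-4, 5)))[1]
```

The exhaustive search over integer displacements only serves to show that the energy is minimal at the true motion; a real optical flow solver estimates a dense, sub-pixel field.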

C. RESPONSE FROM COLOR HISTOGRAM
To effectively cope with shape deformation, a color histogram model is introduced in this section. Including the correct position as a positive example, the color histogram score can be learnt from a huge set of rectangular image patches $x$ extracted from each frame. The histogram weight vector $\beta$ is then obtained by solving a ridge regression problem over $\mathcal{W}$, where $\psi_x(\tau)$ denotes the feature pixels of image patch $x$ in the finite region $\mathcal{H}$, $\mathcal{W}$ represents a set of pairs $(x, y)$ and $y$ denotes the label of image patch $x$. The histogram score can be regarded as an average vote, defined in terms of the object region $O$ and the background region $B$. By introducing the one-hot assumption, the objective can be decomposed into independent terms per feature dimension, in which $N^j(O)$ denotes the number of pixels belonging to the region $O$ for which feature $j$ is nonzero. Then, the solution of the above ridge regression problem is

$$\beta^j = \frac{p^j(O)}{p^j(O) + p^j(B) + \lambda} \qquad (17)$$

where $p^j(A) = N^j(A) / |A|$ is the proportion of pixels in region $A$ for which feature $j$ is nonzero. After getting the histogram weight vector, the response from the color histogram of an image patch $x$ can be computed efficiently with an integral image. For a new image, the color histograms over the target area $O$ and the background area $B$ are recomputed and linearly updated as follows,

$$p_t(O) = (1 - \eta_2)\, p_{t-1}(O) + \eta_2\, p'_t(O), \qquad p_t(B) = (1 - \eta_2)\, p_{t-1}(B) + \eta_2\, p'_t(B)$$

where $p_t(O)$ and $p_t(B)$ are the vectors of $p^j_t(O)$ and $p^j_t(B)$ for $j = 1, 2, \ldots, M$, respectively, $p'_t$ denotes the histogram computed from the current frame, and $\eta_2$ is the update parameter of the color histogram model.
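The per-bin solution and the resulting pixel-wise likelihood map can be sketched as follows. This is a minimal illustration assuming NumPy, pixels that are already quantized into integer color-bin indices, and a background region taken as the complement of the object mask inside the patch; the closed form used for the weights is the one from the Staple tracker [53], and the function names and default values of `lam` and `eta` are assumptions.

```python
import numpy as np

def bin_frequencies(bins, mask, n_bins):
    """Proportion of pixels selected by mask that fall into each color bin."""
    counts = np.bincount(bins[mask], minlength=n_bins)
    return counts / max(int(mask.sum()), 1)

def histogram_weights(p_obj, p_bg, lam=1e-3):
    """Per-bin ridge regression solution p(O) / (p(O) + p(B) + lambda)."""
    return p_obj / (p_obj + p_bg + lam)

def update_frequencies(p_prev, p_new, eta=0.04):
    """Linear update of a histogram with learning rate eta."""
    return (1.0 - eta) * p_prev + eta * p_new

def likelihood_map(bins, beta):
    """Look up the weight of every pixel's bin to get a target likelihood map."""
    return beta[bins]

# Toy usage: a 6x6 patch whose central 2x2 object has its own color bin.
bins = np.zeros((6, 6), dtype=int)
obj_mask = np.zeros((6, 6), dtype=bool)
obj_mask[2:4, 2:4] = True
bins[obj_mask] = 1
beta = histogram_weights(bin_frequencies(bins, obj_mask, 2),
                         bin_frequencies(bins, ~obj_mask, 2))
target_map = likelihood_map(bins, beta)
```

Because the weights depend only on the color bin of a pixel, the likelihood map is a simple table lookup, which is what makes this model insensitive to shape deformation.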

D. ADAPTIVE COSINE WINDOW
Recently, CF trackers have been reported to achieve efficient and excellent tracking performance. However, due to the underlying boundary effect produced by the periodic assumption, the detection scores of CF trackers are only accurate around the center of the target. Thus, the boundary effect easily leads to a restricted search area and hampers the tracking performance. In order to alleviate the boundary effect, a fixed cosine window $C$ is introduced in traditional CF trackers. Though the fixed cosine window $C$ can suppress some contamination from the background region, it also shrinks the search area when finding the true position of the target.
In this section, we propose an adaptive cosine window to overcome the boundary effect, which can highlight the target region and suppress the background region better than the traditional cosine window. The adaptive cosine window $W_{adap}$ is formed from the fixed cosine window $C$ and the target likelihood map $W$, where $W$ is computed by Eq. (17) in Section C. The target likelihood map is able to effectively distinguish the target region from the background region.

E. TARGET LOCATION
In this paper, we use the combination of the correlation response of the Hog feature $f_h$, the response from the color histogram $f_c$ and the motion map $f_m$ to locate the target, which can enhance the robustness of our method. The final response is a weighted linear combination of $f_h$, $f_c$ and $f_m$,

$$f^{(t)} = \zeta_1 f_h^{(t)} + \zeta_2 f_c^{(t)} + \zeta_3 f_m^{(t)}$$

where the superscript $t$ is the frame index, and $\zeta_1$, $\zeta_2$ and $\zeta_3$ are weighting parameters. The position of the target in the t-th frame is found by searching for the maximum value of the final response $f^{(t)}$.
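The weighted linear combination and the peak search described above can be sketched in a few lines. The sketch assumes NumPy and three response maps that have already been resampled to the same size; the function name and the weight values in the usage example are placeholders, not the values listed in Table 1.

```python
import numpy as np

def fuse_responses(f_h, f_c, f_m, weights):
    """Weighted sum of the Hog, color and motion responses and its peak."""
    z1, z2, z3 = weights
    fused = z1 * f_h + z2 * f_c + z3 * f_m
    row, col = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, (int(row), int(col))

# Toy usage: Hog and motion agree on (1, 2), color votes for (3, 3).
f_h = np.zeros((4, 4)); f_h[1, 2] = 1.0
f_c = np.zeros((4, 4)); f_c[3, 3] = 0.5
f_m = np.zeros((4, 4)); f_m[1, 2] = 0.2
fused, position = fuse_responses(f_h, f_c, f_m, weights=(0.5, 0.3, 0.2))
```

Here two cues that agree outweigh the single dissenting cue, which is the intended benefit of fusing complementary features.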

F. RE-DETECTION MODULE
In this section, we first check the reliability of tracking results using correlation response of Hog feature and response of color histogram. Then, if the current tracking result is considered to be unreliable, we will launch the re-detection module to refine the target location.
For the response map of the Hog feature, we define $S_h^i = \frac{\max(f_h^i) - \mu_h^i}{\sigma_h^i}$ to be the i-th peak-to-sidelobe ratio (PSR), where $f_h^i$ denotes the i-th correlation response of the Hog feature, and $\mu_h^i$ and $\sigma_h^i$ are the mean and standard deviation of $f_h^i$, respectively. We also define the PSR ensemble pool $C_h$. For the color histogram response, we define $S_c^i$ to be the color score of the i-th frame, together with the color score ensemble pool $C_c$. If $S_h^i$ or $S_c^i$ fails the reliability test against $M_h$ or $M_c$, respectively, we consider the tracking result of the i-th frame unreliable and launch the re-detection module. At the same time, $S_h^i$ or $S_c^i$ will not be put in the ensemble pool $C_h$ or $C_c$, respectively. $o_h$ and $o_c$ are constant parameters for $M_h$ and $M_c$, respectively.
If $S_h^i$ or $S_c^i$ is discarded, we first coarsely locate the target based on the multitask reverse sparse representation scheme. Then we use the CF tracker with multiple features described above to refine the tracking results.
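A minimal sketch of the PSR computation is given below, assuming NumPy and the form (peak minus mean, divided by standard deviation) suggested by the symbol definitions above. The pooling rule in `check_and_pool` is purely hypothetical: the exact reliability test is not recoverable from the text, so the sketch assumes that a score is accepted when it is at least a constant factor `o` times the mean of the scores already in the pool.

```python
import numpy as np

def psr(response):
    """Peak-to-sidelobe ratio: (max - mean) / std of a response map."""
    return float((response.max() - response.mean()) / (response.std() + 1e-12))

def check_and_pool(score, pool, o):
    """Hypothetical reliability rule (assumption, not the paper's exact test).

    The score is accepted if it is at least o times the mean of the pooled
    scores; only accepted scores are added to the pool.
    """
    reliable = (not pool) or score >= o * (sum(pool) / len(pool))
    if reliable:
        pool.append(score)
    return reliable

# Toy usage: a single sharp peak on a 9x9 map.
response = np.zeros((9, 9))
response[4, 4] = 1.0
sharp_psr = psr(response)

pool = [10.0, 12.0]
rejected = check_and_pool(3.0, pool, o=0.5)   # below 0.5 * 11.0, not pooled
accepted = check_and_pool(9.0, pool, o=0.5)   # above the threshold, pooled
```

A sharp, isolated peak yields a high PSR, while a flat or multi-modal response (typical under occlusion) yields a low one, which is why the PSR is a natural reliability cue.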
In order to obtain reliable candidates, a discriminative reverse sparse representation based method is adopted to determine the rough scope efficiently. We construct the positive template set $T_{pos} = [t_1, t_2, \ldots, t_p]$ around the object within a small circular area and the negative template set $T_{neg} = [t_{p+1}, t_{p+2}, \ldots, t_{p+n}]$ far away from the object within an annular region, where $p$ and $n$ are the numbers of positive and negative templates, respectively. If the current tracking result is considered to be unreliable, each template is represented by the candidate set $Y$ with coefficients $c$, which can be computed by Eq. (21).
where $c_i$ denotes the nonnegative sparse coefficients of the i-th template and reflects the similarity between the candidates and the corresponding template. We construct the similarity map matrix $C = [c_1, \ldots, c_{p+n}]$, which can be computed as a whole. Then, Eq. (21) is reformulated as the following equation.
arg min (22)
where $\frac{\delta}{2} \sum_{ij} \| c_i - c_j \|^2 B_{ij}$ is the customized Laplacian regularization term, $\delta$ is the regularization parameter and $B$ is a binary matrix.
For the i-th candidate, by introducing the weighted discriminative sparse similarity map and the additive pooling scheme [44], the reliability score $R_i$ is calculated from $s_{i-pos}$ and $s_{i-neg}$, which represent the extent to which the i-th candidate is related to the positive and negative template sets, respectively. A candidate with a higher value of $R_i$ is more likely to be the target. In order to reduce the computation cost, we discard 90% of the candidates according to the reliability score. The remaining candidates are put into the correlation filter framework using multiple features with the adaptive cosine window to refine the tracking result. The position of the target is located by searching for the maximum value of the correlation response maps of all the remaining candidates.
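The pruning step, in which 90% of the candidates are discarded by reliability score before the correlation filter refinement, can be sketched as below. The reliability scores are taken as given inputs, since the formula for computing them is not reproduced here; the function name and the rounding of the number of kept candidates are assumptions.

```python
def keep_top_candidates(scores, keep_ratio=0.1):
    """Return the indices of the highest-scoring candidates, best first."""
    n_keep = max(1, int(round(len(scores) * keep_ratio)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n_keep]

# Toy usage: 20 candidates, two of which stand out.
scores = [0.1] * 20
scores[7] = 0.9
scores[13] = 0.8
kept = keep_top_candidates(scores)
```

Only the kept indices would then be passed to the more expensive multi-feature correlation filter, which is where the computational saving comes from.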
G. UPDATE SCHEME
Inspired by [48], we use the same adaptive update scheme to adapt to various changes. For the correlation filter based tracker, the update parameter $\eta_1$ is set by a rule governed by the constants $Q$, $\upsilon_1$ and $\upsilon_2$. For the color histogram model, the update parameter $\eta_2$ is set by a rule governed by the constant $P$. As for the positive template set in the coarse location process, if the tracking result is reliable, we update the positive template set according to a threshold $\theta$. We define the similarity vector $d = [d_1, d_2, \ldots, d_p]$, where $d_i$ describes the similarity between the i-th positive template and the current tracking result. For the i-th positive template ($i = 1, 2, \ldots, m$), if $\max d_i > \theta$, we replace it with the tracking result. Otherwise, we keep the i-th positive template unchanged. For the negative template set, we sample negative templates around the tracking result in the last frame.

1) IMPLEMENTATION DETAILS
Our method is implemented in Matlab 2017 and runs on a computer with an Intel(R) Xeon(R) 2.4 GHz CPU and 128 GB RAM. Table 1 gives some parameters used in our method, which are fixed for all the experiments.

2) EVALUATION METHODOLOGY
In order to assess different tracking methods fairly, two evaluation metrics, precision rate and success rate, are used in this paper. The precision rate is defined as the percentage of tracking frames whose center location error is less than 20 pixels. Here, the center location error is the Euclidean distance between the estimated target center and the ground truth center. The success plot is based on the overlap ratio between the predicted bounding box and the ground truth, and shows the percentage of frames whose overlap ratio exceeds a given threshold. The overlap ratio is defined as $S = \frac{\mathrm{Area}(R_t \cap R_g)}{\mathrm{Area}(R_t \cup R_g)}$. Here, $R_t$ means the predicted bounding box and $R_g$ denotes the ground truth. $\cup$ and $\cap$ are the union and intersection operators, respectively.
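The two metrics can be sketched as follows, assuming bounding boxes given as (x, y, width, height) tuples with (x, y) the top-left corner. The function names are illustrative, and the overlap threshold of 0.5 in `success_rate` is an assumed default: a success plot sweeps this threshold rather than fixing it.

```python
import math

def center_error(pred, gt):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    px, py = pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0
    gx, gy = gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0
    return math.hypot(px - gx, py - gy)

def overlap_ratio(pred, gt):
    """Intersection area divided by union area of two (x, y, w, h) boxes."""
    iw = min(pred[0] + pred[2], gt[0] + gt[2]) - max(pred[0], gt[0])
    ih = min(pred[1] + pred[3], gt[1] + gt[3]) - max(pred[1], gt[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

def precision_rate(preds, gts, threshold=20.0):
    """Fraction of frames whose center location error is below the threshold."""
    hits = sum(center_error(p, g) < threshold for p, g in zip(preds, gts))
    return hits / len(preds)

def success_rate(preds, gts, threshold=0.5):
    """Fraction of frames whose overlap ratio exceeds the threshold."""
    hits = sum(overlap_ratio(p, g) > threshold for p, g in zip(preds, gts))
    return hits / len(preds)

# Toy usage: two frames, one accurate prediction and one far-off prediction.
preds = [(0, 0, 10, 10), (100, 0, 10, 10)]
gts = [(5, 0, 10, 10), (0, 0, 10, 10)]
precision = precision_rate(preds, gts)
success = success_rate(preds, gts, threshold=0.3)
```

In the toy example the first prediction is 5 pixels off with an overlap ratio of one third, while the second misses the target entirely, so both rates evaluate to one half.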

1) QUANTITATIVE EVALUATION
We compare our method with 10 other state-of-the-art trackers including HCFT [49], LCTdeep [50], HDT [51], MCCT_H [52], RLT [48], Staple [53], LCT [54], SAMF [26], DSST [25] and ECO-HC [38]. FIGURE 2 shows the precision plots and success plots of one pass evaluation (OPE) on OTB100. Our method achieves the best tracking performance in both precision and success rates. Compared with the baseline RLT tracker, our tracker obtains a considerable improvement (4.1% in precision rate and 2.7% in success rate). HCFT, LCTdeep and HDT, which use deep features, can greatly improve tracking performance compared with trackers using handcrafted features. However, due to the time-consuming feature extraction process, these trackers can only run at about 1 fps on a CPU platform, which cannot meet the real-time requirement. Although the LCTdeep and LCT trackers have a re-detection module, they use a fixed threshold to assess the reliability of the tracking result, which cannot obtain promising results for all the challenging sequences. Our method, using traditional handcrafted features with the help of the novel re-detection module, achieves a precision score of 86.4% and a success score of 65.1%.
In order to further demonstrate the superiority of our proposed method, we report tracking results for eleven challenging attributes in terms of overlap rate in FIGURE 3. Our method achieves the best tracking performance in all eleven attributes except for low resolution (ranks second), fast motion (ranks second) and scale variation (ranks second). Besides, the tracking performance of our proposed method is greatly improved in occlusion (63.2% vs 54.8%) and out-of-plane rotation (61.9% vs 53.8%) when compared with the Staple tracker. This improvement is mainly due to the adaptive cosine window scheme, which can enlarge the search region to alleviate the boundary effect. Owing to the use of multiple features, the tracking performance of our method under low resolution is improved by 2.2% compared with the baseline RLT tracker. Besides, our method is able to outperform methods using deep features, such as HCFT, HDT and LCTdeep, in all the attributes.

2) QUALITATIVE EVALUATION
In order to explicitly demonstrate the comparison results between our method and the 10 other state-of-the-art trackers, we visually show the bounding boxes of the 11 trackers on several key frames of 10 representative video sequences in FIGURE 4. For both the Girl2 and Lemming sequences, the major challenge is serious occlusion. Since HCFT, HDT, SAMF, Staple, ECO-HC and DSST do not include a re-detection scheme, these methods are not able to cope with serious occlusion. Although LCTdeep and LCT have a failure recovery mechanism, they cannot deal with full occlusion and background clutter in the Lemming sequence. The targets in the Singer1 and Singer2 video sequences undergo drastic illumination variation, while scale variation is also present in the Singer1 sequence. It can be observed from the Singer2 sequence that LCTdeep, HDT, ECO-HC and SAMF are sensitive to illumination variation and drift in frame 100. Our proposed method has strong robustness to both illumination variation and scale variation. The Jumping and BlurOwl sequences are low in quality because of dramatic motion blur. Consequently, MCCT_H, SAMF, RLT, DSST, Staple and LCT have a strong tendency to lose the target. Our method with the adaptive cosine window can highlight the target region, suppress the background region and locate the target precisely during the whole tracking process. The targets in both the Skiing and Deer sequences undergo large displacement due to fast motion. SAMF, MCCT_H, RLT, LCT, ECO-HC and Staple learn a great deal of background features and drift to the surrounding background in the process of tracking. Our method considers the motion feature between adjacent frames and can cope with fast motion easily. The targets in the CarScale and Human6 sequences have significant scale variation during tracking, and are occasionally seriously occluded. The bounding boxes of HCFT and HDT keep the same size during tracking, as they do not have a scale estimation component. Our method uses the same scale estimation scheme as RLT and is able to predict the scale of the target precisely throughout tracking.

3) COMPARISON OF DIFFERENT METHODS IN SPEED ON OTB100
Tracking speed is very important for industrial applications of visual tracking algorithms. It is difficult to apply slow trackers in industrial products. This section compares the speed of different methods on OTB100. In order to compare the different visual tracking methods validly and fairly, all three methods with deep features (HCFT, HDT and LCTdeep) in Table 2 are run on a PC equipped with an Intel Xeon CPU E5-2640 v4, 128 GB RAM and a single NVIDIA TITAN Xp. The other eight methods with handcrafted features in Table 2 are run on the same PC platform without a GPU. From Table 2, it can be observed that our tracker on a CPU platform is able to get better tracking results than trackers with deep features on a GPU platform at almost the same speed.

C. RESULTS ON TColor-128 DATASET
To further verify the validity of our proposed tracker, we conduct a comprehensive experiment on the TColor-128 dataset. The TColor-128 dataset consists of 128 color sequences, which have the same 11 challenging attributes as the OTB100 dataset. We compare our method with 11 other state-of-the-art trackers including RLT [48], SAMF_AT [55], SRDCF [27], CoKCF [56], HDT [51], HCFT [49], SAMF [26], LCT [54], CFNET [57], DSST [25] and ECO-HC [38]. FIGURE 5 shows the comparison results with the 11 trackers in terms of precision plots and success plots of OPE. It can be seen that compared with trackers using deep features such as CoKCF, HDT, HCFT and CFNET, our method achieves 7.3%, 8.7%, 8.2% and 23.8% improvement respectively in precision plot and 9%, 9.1%, 9.7% and 18.2% improvement respectively in success plot. Our tracker outperforms the other 11 trackers and ranks first in both precision rate and success rate due to the complementary handcrafted features. Owing to the use of the motion feature between two consecutive frames, the performance of our tracker is improved over the baseline RLT tracker by a margin of 3% in terms of precision plot and 2% in terms of success plot. Tables 3 and 4 report the precision rate and success rate of the 12 trackers with 11 challenging attributes on the TColor-128 dataset. The best, second best and third best tracking results are represented in red, blue and green, respectively. It can be observed that our method obtains the best tracking performance in terms of precision plot except for motion blur (ranks second) and low resolution (ranks second). Besides, our method gets the first place in almost all the 11 challenging attributes in terms of success rate except for low resolution (ranks second) and out of view (ranks second). Thus, we can conclude that our method is effective for all the 11 attributes compared with the 11 trackers mentioned in this section.

D. RESULTS ON UAVDT DATASET
The UAVDT dataset contains 50 challenging video sequences captured from UAVs, which are fully annotated with bounding boxes and focus on complex scenarios. These sequences cover 9 challenging attributes including background clutter (BC), camera motion (CM), object motion (OM), small object (SO), illumination variation (IV), object blur (OB), scale variation (SV), long-term tracking (LT) and large occlusion (LO). FIGURE 6 illustrates the precision plots and success plots of OPE of our proposed method against 14 other state-of-the-art tracking methods including SRDCF [27], PTAV [58], MCPF [59], C-COT [37], FCNT [60], STCT [61], CREST [62], RLT [48], CN [63], HCFT [49], HDT [51], KCF [24], SINT [64] and ECO-HC [38]. Our method obtains the first place in both precision rate and success rate, even better than trackers using deep features, such as PTAV, MCPF, C-COT, FCNT, STCT, CREST, HCFT, HDT and SINT. Although PTAV has a verification module and can correct the tracker when needed, it only obtains the fourth place in precision rate and the ninth place in success rate. Compared with the baseline RLT tracker, our proposed method obtains an improvement of 3.7% in precision rate and 2.2% in success rate, which shows that the adaptive cosine window and motion feature are effective. To further validate the effectiveness of our method, tracking results on the UAVDT dataset with 9 different challenging attributes are reported in FIGURE 7. We can see that our method achieves the first place in all the attributes except for small object (ranks second), scale variation (ranks third), long-term tracking (ranks fifth) and large occlusion (ranks fourth). Although our method is second to C-COT in small object, the difference is quite small, at only 0.1%. The adaptive cosine window can reduce the boundary effect by enlarging the search region. As a result, our method is able to perform well in object blur and object motion.

TABLE 3. The precision rates of 12 trackers with 11 challenging attributes on TColor-128 dataset. The best, second best and third best tracking results are represented in red, blue and green, respectively.

TABLE 4. The success rates of 12 trackers with 11 challenging attributes on TColor-128 dataset. The best, second best and third best tracking results are represented in red, blue and green, respectively.

E. RESULTS ON UAV123 DATASET
… and similar object (SO). Owing to these challenging attributes in the UAV123 dataset, the tracking performance of various trackers decreases drastically compared with the OTB100 dataset in terms of the same evaluation metrics. FIGURE 8 shows the precision plots and success plots of OPE of our proposed method against 16 other state-of-the-art methods including RLT [48], SRDCF [27], CSDCF [65], Staple_CA [66], BACF [28], CoKCF [56], SRDCFdecon [36], KCC [67], SAMF_CA [66], Staple [53], SAMF [26], DSST [25], fDSST [68], DCF [23], ECO-HC [38] and KCF [24]. From FIGURE 8 we can see that ECO-HC obtains the first place in both precision rate and success rate. Since the re-detection module used in our tracker can solve the drifting problem, our method achieves the second best tracking performance with a precision score of 68.2% and a success score of 48.1%. The baseline RLT tracker follows our tracker and achieves the third place. Owing to the adaptive cosine window and motion feature, our method improves the tracking performance by 3% in precision rate and 4% in success rate compared with RLT. To further demonstrate the comparative results, success plots of OPE with 12 different attributes on UAV123 are reported in FIGURE 9. It can be observed that our method achieves the second-best performance, behind only ECO-HC, in almost all the attributes except for fast motion (ranks fourth), full occlusion (ranks third), background clutter (ranks second), illumination variation (ranks third) and similar object (ranks third). Compared with CoKCF using mutual deep features, our method, which only uses handcrafted features and the motion feature, obtains better tracking results in all the attributes. Because of the adaptive cosine window, our tracker is able to enlarge the search region and deal with the boundary effect better than BACF.

F. RESULTS ON VOT2016 DATASET
To further evaluate our method, a comparison with other state-of-the-art trackers that participated in VOT2016 is shown in FIGURE 10 and FIGURE 11. The VOT2016 dataset consists of 60 challenging videos. The tracking performance is assessed in terms of accuracy and robustness, which consider the average overlap ratio and the number of failures, respectively. VOT2016 also introduces the expected average overlap to rank trackers, which takes account of the raw values of per-frame accuracies and failures in a principled manner. FIGURE 10 gives the robustness-accuracy ranking plots under the baseline on VOT2016. Trackers closer to the top-right of the plot perform better. FIGURE 11 shows the expected average overlap graph on VOT2016. We can see that our proposed method achieves thirteenth place among the 50 trackers and performs better than most of the trackers that participated in VOT2016.
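The two raw VOT measures can be illustrated with a simplified sketch. It assumes a hypothetical list of per-frame overlaps in which a value of zero marks a failure frame, and it excludes a few frames after each failure from the accuracy average; the burn-in length is an assumption, and the actual VOT toolkit protocol (re-initialization, expected average overlap) is more involved than shown here.

```python
import numpy as np

def accuracy_and_failures(overlaps, burn_in=10):
    """Return (average overlap over valid frames, number of failures).

    overlaps: per-frame overlap ratios; 0 marks a failure frame (assumption).
    burn_in:  frames ignored after each failure (assumed value).
    """
    overlaps = np.asarray(overlaps, dtype=float)
    failed = overlaps <= 0.0
    valid = np.ones(len(overlaps), dtype=bool)
    for i in np.flatnonzero(failed):
        # Drop the failure frame and the burn-in frames that follow it.
        valid[i:i + 1 + burn_in] = False
    accuracy = float(overlaps[valid].mean()) if valid.any() else 0.0
    return accuracy, int(failed.sum())

# Toy sequence with a single failure at frame index 2.
acc, n_fail = accuracy_and_failures([0.8, 0.6, 0.0, 0.1, 0.2, 0.7], burn_in=2)
```

In this toy sequence only the frames with overlaps 0.8, 0.6 and 0.7 contribute to the accuracy, and one failure is counted towards robustness.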

G. ABLATION ANALYSIS
In this section, we conduct extensive experiments to illustrate the effectiveness of each proposed component, including the adaptive cosine window, the motion feature and the re-detection scheme.

1) EFFECT OF ADAPTIVE COSINE WINDOW
The adaptive cosine window plays an important role in our proposed method. It highlights the target, suppresses the background and copes with the boundary effect effectively. The first chart of FIGURE 12 gives the precision plots of the RLT method with the adaptive cosine window and with the fixed cosine window on OTB100. The second chart of FIGURE 12 shows the corresponding success plots. FIGURE 14 reflects the effectiveness of the adaptive cosine window with our proposed method on OTB100 in terms of precision plots and success plots. It is clear that the adaptive cosine window scheme improves the precision rate and success rate significantly. FIGURE 13 and FIGURE 15 report the tracking results of RLT and our method, respectively, on OTB100 in terms of 11 challenging attributes with the adaptive cosine window and with the fixed cosine window. It can be observed that both the RLT method and our method perform better with the adaptive cosine window than with the fixed cosine window on almost all of the eleven challenging attributes.
Our proposed adaptive cosine window combines the traditional cosine window and the target likelihood map of each frame with a fixed parameter $\eta_3$, which determines how much each part contributes to the final adaptive cosine window. FIGURE 16 shows the tracking performance in terms of precision and success rate of our proposed method on OTB100 with different values of $\eta_3$. We can see that when $\eta_3$ is set to 0.5, both the precision value and the success rate reach their maximum.
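As an illustration of how such a weighting could work, the sketch below blends a standard Hann (cosine) window with a normalized target likelihood map. The convex combination, the assignment of the weight to the cosine term, and the function name are assumptions made for illustration; the exact formulation is given in the method section of the paper and may differ.

```python
import numpy as np

def adaptive_cosine_window(likelihood, eta3=0.5):
    """Blend a Hann window with a target likelihood map (assumed convex combination).

    likelihood: 2-D array of per-pixel target likelihoods (e.g. from a color model).
    eta3:       weight of the traditional cosine window (assumed role of the parameter).
    """
    h, w = likelihood.shape
    hann = np.outer(np.hanning(h), np.hanning(w))  # traditional cosine window
    peak = likelihood.max()
    # Scale the likelihood map to [0, 1] so both terms share the same range.
    norm = likelihood / peak if peak > 0 else np.zeros_like(likelihood)
    return eta3 * hann + (1.0 - eta3) * norm

# Toy likelihood map: the target occupies the central 3x3 block of a 5x5 patch.
lik = np.zeros((5, 5))
lik[1:4, 1:4] = 1.0
window = adaptive_cosine_window(lik, eta3=0.5)
```

With a weight of 1 this sketch reduces to the fixed cosine window, while smaller weights let the likelihood map keep target pixels near the patch border from being suppressed.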

2) EFFECT OF MOTION FEATURE
The motion feature is important for our method and is estimated by large displacement optical flow. It captures the motion trajectory of the target and improves the tracking performance. The first picture of FIGURE 17 shows the precision plots of OPE of the RLT method with and without the motion feature on OTB100. The second picture of FIGURE 17 shows the corresponding success plots. FIGURE 19 shows the effectiveness of the motion feature on our method. It can be observed that RLT with the motion feature achieves a precision score of 82.7% and a success score of 63.0%, while RLT without the motion feature only obtains a precision score of 82.3% and a success score of 62.4%. Likewise, our method combining three features (the Hog feature, the color feature and the motion feature) performs better than the variant without the motion feature. FIGURE 18 and FIGURE 20 show the tracking results of the RLT method and our method, respectively, on OTB100 in terms of 11 challenging attributes with and without the motion feature. Our method with the motion feature performs better on almost all attributes except illumination variation (IV) and background clutter (BC).

3) EFFECT OF RE-DETECTION COMPONENT
In this section, our method uses the same scheme as RLT to judge the reliability of tracking results. If the tracking results are regarded as unreliable, the RLT method uses traditional sparse representation to refine the candidates. In contrast, our method uses reverse sparse representation to find coarse locations efficiently. FIGURE 21 and FIGURE 23 give the precision plots and success plots of OPE of the RLT method and our method, respectively, with traditional sparse representation and with reverse sparse representation on OTB100. The reverse sparse representation reconstructs the template sets with a few candidates and builds a discriminative sparse similarity map to refine the candidates. FIGURE 22 and FIGURE 24 give the tracking results of RLT and our method, respectively, on OTB100 in terms of 11 challenging attributes with traditional sparse representation and with reverse sparse representation. It can be seen that our reverse sparse representation improves the tracking performance significantly on all the attributes.
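The idea of reconstructing templates from candidates can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes a non-negative L1-regularized least-squares objective solved by projected ISTA, and it scores each candidate by its total coefficient on positive templates minus that on negative templates. The solver, the regularization weight and both function names are assumptions.

```python
import numpy as np

def reverse_sparse_codes(templates, candidates, lam=0.01, n_iter=200):
    """Code each template (column of `templates`) with a few candidates.

    Solves min_C 0.5*||T - Y C||_F^2 + lam*||C||_1 subject to C >= 0,
    where Y holds the candidates as columns, by proximal gradient steps.
    """
    Y, T = candidates, templates
    step = 1.0 / (np.linalg.norm(Y, 2) ** 2 + 1e-12)  # 1 / Lipschitz constant
    C = np.zeros((Y.shape[1], T.shape[1]))
    for _ in range(n_iter):
        grad = Y.T @ (Y @ C - T)
        # Gradient step, then soft-threshold and project onto C >= 0.
        C = np.maximum(C - step * (grad + lam), 0.0)
    return C

def candidate_scores(C, n_pos):
    """Similarity to positive templates minus similarity to negative templates."""
    return C[:, :n_pos].sum(axis=1) - C[:, n_pos:].sum(axis=1)

# Toy data: three candidates, one positive template and one negative template.
eye = np.eye(4)
candidates = eye[:, :3]            # candidates e1, e2, e3 as columns
templates = eye[:, [0, 1]]         # positive template e1, negative template e2
C = reverse_sparse_codes(templates, candidates)
scores = candidate_scores(C, n_pos=1)
best = int(np.argmax(scores))
```

In this toy example the first candidate matches the positive template and receives the highest score, while the second matches the negative template and is penalized. Because only a small number of templates has to be coded, rather than every candidate, this formulation is one plausible reason the coarse localization is described as efficient.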

H. FAILURE CASES
We give some tracking failure cases of our method in FIGURE 25. In the Jump and Diving videos, the targets undergo fast motion as well as serious deformation. Because our method does not consider in-plane rotation, it fails to locate the true position of the target. In the Person sequence, the target person passes through the pavilion and disappears for a long time. Although the re-detection module is activated due to full occlusion, our method cannot locate the true position of the small target person when the target reappears far away.

V. CONCLUSION
In this paper, we propose a robust visual object tracking method with motion estimation and a reliable re-detection scheme. First, the motion feature is estimated by large displacement optical flow between adjacent frames and is combined with the color response and the Hog correlation response to improve the tracking performance significantly. Second, an adaptive cosine window is adopted to deal with the boundary effect; it highlights the target and suppresses the background effectively. Third, the color response and the Hog correlation response are combined to judge the reliability of tracking results. Fourth, reverse sparse representation is adopted to refine the candidates in case of tracking failures. Finally, extensive experiments are conducted on five popular benchmarks to demonstrate the superiority of our proposed method.