Correlation Filters With Adaptive Multiple Contexts for Visual Tracking

The local contexts define the target and its surrounding background within a constrained region, and have proved useful for visual tracking, but how to adaptively employ them for building robust models remains challenging. By using spatial weight maps, the correlation filter (CF) methods with spatial regularization provide an alternative way to exploit the local contexts for appearance modeling. However, they generally utilize naive spatial weight map functions, and fail to flexibly regulate the effects of the target and background on model learning, thereby restricting the tracking performance. In this paper, we address these issues by presenting an adaptive multiple contexts correlation filter (AMCCF) framework. In particular, a novel sigmoid spatial weight map is first proposed to control the impacts of local contexts for learning more effective CF models. Based on this, different levels of local contexts (multiple contexts) are further modeled by incorporating the spatial weight maps with different parameters into multiple CF models. To adaptively utilize the local contexts in the tracking stage, a minimal weighted confidence margin loss function with a weight prior constraint is adopted for jointly estimating the target position and the adaptive fusion weights of the response maps from different CF models. To validate the proposed method, extensive experiments are conducted on four tracking benchmarks. The results show that our AMCCF can adaptively leverage the local contexts for robust tracking, and performs favorably against the state-of-the-art trackers.

Fig. 1. (a) The two rectangles (distinguishing colors lost in extraction) represent the target and local context regions. Note that while the target suffers from significant appearance changes, the surrounding background, i.e., the human hand, changes less during tracking. (b) Illustration of three STRCF models with different levels of local contexts when trained with different parameters of the sigmoid spatial weight map functions.
risk of incorporating too much background into the positive training samples, and the learned CF models may suffer from poor discriminative power with large coefficients on the background regions. Fortunately, spatial regularization [6], [7] has recently been introduced into CF methods for constrained model learning, and can serve as a remedy for this issue. By using spatial weight maps, spatial regularization can enforce large penalties on the CF coefficients outside the target during training; thus the CF methods can not only make use of local contexts for robust tracking, but also reduce the negative impact of excessive background on CF model learning. However, the existing spatial weight maps usually take naive function forms, and cannot flexibly regulate the impacts of both the target and the surrounding background regions on model learning, thereby restricting the tracking performance.
In this paper, we present a novel adaptive multiple contexts correlation filter (AMCCF) framework to exploit the potential of local contexts for visual tracking. To achieve this, we first introduce a sigmoid spatial weight map function to regularize the model learning of CF methods with spatial regularization. By using multiple controlling parameters, the proposed spatial weight map function can flexibly vary the weight change speed from the target center to the surrounding background regions, and thus regulate the impacts of local context regions on CF model learning. Since the spatial weight map is always pre-defined in the first frame, we further incorporate the sigmoid spatial weight maps with different parameters into the state-of-the-art STRCF method [8], and obtain multiple CF models with different levels of local context information. Fig. 1(b) shows three STRCF models with different levels of local contexts when trained with the sigmoid spatial weight maps. One can observe that the uppermost spatial weight map directly imposes large penalties outside the target bounding box, so the corresponding CF model only has non-zero coefficients on the target region and is better at modeling the target appearance. In contrast, the two lower spatial weight maps cover different levels of local contexts with small penalty values, hence the learned CF models can leverage different levels of local contexts for visual tracking. Benefiting from these STRCF models, we can jointly model different levels of local contexts, and thus achieve robust tracking.
In the tracking stage, after evaluating the candidate region with all STRCF models, we obtain multiple different response maps. We then seek the optimal fusion weights for all the response maps, so as to adaptively utilize the local contexts for target localization. Inspired by the UPDT method [9], which learns adaptive fusion weights for different features, we adopt the minimal weighted confidence margin (MWCM) loss function for jointly estimating the target position and the fusion weights of all the response maps. In addition, since the local contexts usually have similar appearances in two consecutive frames, we incorporate a weight prior regularization term into the loss function to ensure that the response maps have similar fusion weights in neighboring frames. By solving the loss function with an alternating minimization algorithm, we can not only assign adaptive fusion weights to the response maps, but also obtain the optimal target position.
To validate the effectiveness of the AMCCF method, we perform extensive experiments on four tracking benchmarks, i.e., OTB-2015 [10], Temple-Color [11], LaSOT [12] and VOT-2018 [13]. In comparison to the baseline STRCF method, our AMCCF can adaptively leverage the local contexts for visual tracking, yielding more robust appearance models and better tracking performance. Moreover, even with hand-crafted features, our AMCCF achieves an AUC score of 67.1% on OTB-2015, and performs favorably against the state-of-the-art tracking methods.
To sum up, the main contributions of this paper are:
• A novel AMCCF framework is proposed to adaptively exploit the potential of local contexts for visual tracking. To achieve this, a sigmoid spatial weight map function is first suggested to regularize the model learning of CF methods with spatial regularization. By flexibly varying the weight change speed from the target center to the background regions, it can effectively control the impacts of local contexts on model learning. Based on this, multiple CF models with different levels of local contexts are obtained by incorporating the spatial weight maps with different parameters into the STRCF models.
• In the tracking stage, an MWCM loss function is adopted for jointly estimating the target position and the adaptive fusion weights of different response maps. To ensure similar fusion weights of the response maps in neighboring frames, a weight prior regularization term is further incorporated into the MWCM loss function.
• Experiments on multiple tracking benchmarks indicate that the proposed AMCCF method can adaptively leverage the local context information for visual tracking, and performs favorably against the state-of-the-art trackers.
The remainder of this paper is organized as follows. Section II reviews the trackers closest to our approach. Section III first introduces the sigmoid spatial weight map function, and then presents the proposed AMCCF framework and its solution. Section IV reports the experimental results.
Finally, Section V ends this work with several concluding remarks.

II. RELATED WORK
In this section, we first give a brief overview of correlation filter methods, and then focus on the trackers exploiting the context information.

A. CORRELATION FILTER-BASED METHODS
Correlation filters aim to learn a convolution filter that is convolved with the training sample to generate a 2D Gaussian-shaped response. To achieve this, dense sampling is implemented by circularly shifting the sample in a sliding-window fashion, and the CFs are trained to discriminate between the sample and its circularly-shifted versions. Denote by (x_t, y) a sample pair in frame t, where the sample x_t consists of D feature channels of size W × H, and y is the Gaussian-shaped label map. Then the filter f is learned by minimizing the following objective:
$$\varepsilon(f)=\Big\|\sum_{d=1}^{D} f^{d} * x_{t}^{d}-y\Big\|^{2}+\lambda \sum_{d=1}^{D}\big\|f^{d}\big\|^{2}, \tag{1}$$
where * stands for the circular convolution, and λ is the tradeoff parameter. Benefiting from the circulant structure of the samples, the filter f can be learned very efficiently in the Fourier domain with the fast Fourier transform (FFT). The pioneering work on correlation filters in visual tracking starts with MOSSE [3]. Since then, great advances have been made from different perspectives. Galoogahi et al. [14] extend MOSSE to support multi-channel features. Following this work, more powerful features, such as color features [15], [16], CNN features [17]-[19] and optical flow [20], are further investigated. Meanwhile, the CF framework itself has also been greatly improved. For example, Henriques et al. [4] exploit non-linear CFs via the kernel trick. Danelljan et al. suggest continuous convolution [21] and a Gaussian mixture model [22] for accurate appearance modeling. Moreover, other techniques, such as spatial regularization [6], [7], [23], [24], scale estimation [25]-[27], long-term tracking [28] and context-aware tracking [5], [29], are also studied for improving the CF framework.
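As a concrete illustration, the closed-form Fourier-domain solution of Eqn. (1) can be sketched in a few lines of NumPy. The single-channel setting, the Gaussian label shape, and the regularization value below are illustrative assumptions, not the exact configuration of any cited tracker.

```python
import numpy as np

def gaussian_label(w, h, sigma=2.0):
    """2-D Gaussian-shaped label map y, peaked at the patch centre."""
    gx, gy = np.meshgrid(np.arange(w) - w // 2, np.arange(h) - h // 2,
                         indexing="ij")
    return np.exp(-(gx ** 2 + gy ** 2) / (2.0 * sigma ** 2))

def train_cf(x, y, lam=1e-2):
    """Closed-form CF training in the Fourier domain.  Each frequency
    bin decouples, giving F = conj(X) * Y / (conj(X) * X + lam)."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def detect(F, z):
    """Response map: circular convolution of the filter with patch z."""
    return np.real(np.fft.ifft2(F * np.fft.fft2(z)))
```

Applied to its own training patch, the learned filter reproduces a near-Gaussian response whose peak sits at the labeled target position.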

B. CONTEXT TRACKERS
To exploit the local context for visual tracking, early methods employ keypoint detection techniques to mine background regions with consistent motion correlation to the target. For example, Dinah et al. [2] detect both auxiliary objects and distractors near the target during tracking, and employ them for collaborative tracking. Similarly, Wen et al. [29] further construct a strong auxiliary object from multiple weak ones for more robust tracking. These methods, however, can only utilize sparse local image regions for context modeling, and often incur high computational costs. Recently, local context information has also been used in several CF-based methods for appearance modeling. In addition to employing larger image regions than the target, Mueller et al. [5] exploit the surrounding background regions for regularizing the CF model learning. Zhang et al. [30] model the spatial-temporal relations between the target and its surrounding background in a Bayesian framework, and achieve robust tracking results. Besides, spatial regularization techniques [6], [7], [23] are also used in multiple CF-based methods for exploiting the local context information. In comparison to these methods, our AMCCF can not only employ the sigmoid spatial weight map for more effective context modeling, but also adaptively leverage different levels of local context information for visual tracking.

III. THE PROPOSED AMCCF FRAMEWORK
In this section, we first discuss the existing issues of spatial weight maps on CF methods with spatial regularization, and then introduce our sigmoid spatial weight map function. Finally, we present our AMCCF framework and also provide its optimization algorithm.

A. THE CF METHODS WITH SPATIAL REGULARIZATION
Spatial regularization was first introduced into the CF framework to alleviate the boundary effect problem. By using a spatially variant weight map, the CF methods can penalize the model coefficients outside the target bounding box with large values, and thus reduce the negative impacts of samples with discontinuous boundaries on model learning. Since the non-zero coefficients of the CF models only exist in the target and its surrounding background regions, the spatial regularization term provides an alternative way to exploit the local context information for visual tracking. In the following, we take the CF model learned from a single image as an example, and give the general form of CF methods with spatial regularization:
$$\varepsilon(f)=\frac{1}{2}\Big\|\sum_{d=1}^{D} x^{d} * f^{d}-y\Big\|^{2}+\frac{1}{2} \sum_{d=1}^{D}\big\|w \odot f^{d}\big\|^{2}, \tag{2}$$
where ⊙ denotes the element-wise product, and w ∈ R^{M×N} is the spatially variant weight map. One can observe from Eqn. (2) that the filter f directly depends on the spatial weight map w, thus the choice of w inevitably has a significant impact on CF model learning.
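To make the general form concrete, the sketch below numerically evaluates the spatially regularized objective of Eqn. (2) for the single-channel case (D = 1); the function name and test values are illustrative assumptions.

```python
import numpy as np

def spatially_regularized_loss(f, x, y, w):
    """Numerically evaluate the spatially regularized CF objective of
    Eqn. (2) for a single-channel filter f, sample x, label map y and
    spatial weight map w (all 2-D arrays of the same size)."""
    # Circular convolution of the filter with the sample, via the FFT.
    response = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(f)))
    data_term = 0.5 * np.sum((response - y) ** 2)
    reg_term = 0.5 * np.sum((w * f) ** 2)  # w ⊙ f, element-wise product
    return data_term + reg_term
```

Raising the weight map values outside the target increases the cost of any non-zero filter coefficients there, which is exactly how the regularizer suppresses background coefficients.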
Depending on the forms of spatial weight map functions, the existing CF methods with spatial regularization can be roughly divided into two categories: (i) SRDCF [6] and STRCF [8] define the spatial weight map with a quadratic function, which smoothly increases the penalty values from the target center to the background regions. (ii) CFLB [7] and BACF [23] only impose small penalties within the target region, but directly force the CF coefficients outside the target bounding box to zero. This is equal to putting near-infinite penalties outside the target, and thus their spatial weight map can be seen as a step function. For clarity, here we formulate these spatial weight maps on the one-dimensional domain and analyze their pros and cons in detail. Denote by L and T the target and CF model sizes; then the spatial weight map with the step function is defined as
$$w_{\mathrm{step}}(x)=\begin{cases}\mu, & |x| \leq L/2,\\ \eta, & |x| > L/2,\end{cases} \tag{3}$$
where x ∈ [−T/2, T/2], and µ and η denote the minimal and maximal weight values, respectively. Similarly, the spatial weight map with the quadratic function can be formulated as
$$w_{\mathrm{quad}}(x)=\mu+\eta\left(\frac{x}{L}\right)^{2}. \tag{4}$$
Fig. 2 illustrates the spatial weight map functions of Eqns. (3) and (4) when setting µ = 0.01, η = 10, L = 3 and T = 8. One can observe that w_step has small values within the target region, i.e., |x| ≤ L/2, and takes much larger values when |x| > L/2. In contrast, w_quad gradually increases the penalty values via the smooth quadratic function, but even the boundary positions within the target still receive larger weights than the target center.
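The two baseline weight maps can be written down directly. The quadratic form below follows the common SRDCF-style choice µ + η(x/L)²; the exact form and constants used by the original trackers may differ, so treat the defaults as illustrative.

```python
import numpy as np

def w_step(x, L, mu=0.01, eta=10.0):
    """Step weight map: small penalty mu inside the target (|x| <= L/2),
    large penalty eta outside (the CFLB/BACF-style limit)."""
    return np.where(np.abs(x) <= L / 2, mu, eta)

def w_quad(x, L, mu=0.01, eta=10.0):
    """Quadratic weight map (SRDCF-style): the penalty grows smoothly
    with the squared distance from the target centre."""
    return mu + eta * (x / L) ** 2
```

Note that w_quad already penalizes positions near the target boundary (|x| close to L/2) noticeably more than the center, which is the drawback discussed above.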
These observations reveal that the existing spatial weight maps still have several drawbacks: (i) With the step function, non-zero CF coefficients only exist within the target region, so both CFLB and BACF cannot exploit the background information in the local context for target localization in the tracking stage. (ii) While the learned models in SRDCF and STRCF can have non-zero coefficients on background regions thanks to the quadratic function, even the boundary regions within the target bounding box still receive larger penalties, which may also degrade the CF model learning. Therefore, it is appealing to develop a novel spatial weight map function that can flexibly regulate the weight change speed over the target and background regions, and thus exploit the potential of local contexts more effectively for robust tracking.

B. THE SIGMOID SPATIAL WEIGHT MAP FUNCTION
In this section, we present a novel spatial weight map function to control the impacts of both target and surrounding background regions on CF model learning.
From the discussion in Section III-A, we argue that an appropriate spatial weight map function should satisfy the following properties: (i) Since each part of the target region may contribute equally to CF model learning, the spatial weight map should have similar values within the target bounding box. (ii) To leverage the local context for visual tracking, the spatial weight map needs to assign smaller penalty values to the surrounding background regions outside the target. In addition, since too many non-zero coefficients on the background regions of CF models may also degrade the tracking performance, the spatial weight map should suppress the background by increasing the weights quickly in the background regions near the target.
To this end, we propose a novel spatial weight map with a sigmoid function form, and vary the weight change speed of both target and background regions with multiple controlling parameters. In particular, the sigmoid spatial weight map function defined on the one-dimensional domain is given as
$$w_{\mathrm{sig}}(x)=\mu+\frac{\eta-\mu}{1+e^{-\alpha\left(|x|/L-\beta\right)}}, \tag{5}$$
where α and β control the weight change speed and the position of the inflection point of the sigmoid function, respectively. In Fig. 2, we also plot the curve of Eqn. (5). In fact, we can flexibly vary the weight values of w_sig with the parameters α and β in Eqn. (5). In Fig. 3, we further investigate the effect of both α and β values on the curves of Eqn. (5). From it we can make the following observations: on the one hand, when setting β = 0.5, increasing α not only keeps similar values within the target region of w_sig, but also increases the weight change speed in the background regions near the target. In particular, when α → +∞ and β = 0.5, w_sig degenerates into the step function of Eqn. (3). Finally, we extend Eqn. (5) to the two-dimensional domain, and obtain the formulation
$$w_{\mathrm{sig}}(x, y)=\mu+\frac{\eta-\mu}{1+e^{-\alpha\left(\max\left(\left|x/W\right|,\,\left|y/H\right|\right)-\beta\right)}}, \tag{6}$$
where W and H denote the target width and height. It is worth noting that Eqn. (6) shares the same properties with Eqn. (5), and thus can be used to regulate the penalty values of both target and background regions. The experimental results in Section IV-B also show that, when incorporating the sigmoid spatial weight map function into the STRCF model, we obtain better tracking performance than the counterparts using the spatial weight maps with step and quadratic functions, validating the effectiveness of our sigmoid spatial weight map function.
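A sketch of the sigmoid weight map in 1-D and 2-D. The normalization of the distance by the target size (so that β = 0.5 places the inflection point at the target boundary) is inferred from the step-function limit discussed in Section IV-B; the default µ, η values and the exponent clipping (purely to avoid numerical overflow for large α) are assumptions of this sketch.

```python
import numpy as np

def w_sig_1d(x, L, alpha, beta, mu=1e-3, eta=1e5):
    """1-D sigmoid spatial weight map: mu near the target centre,
    rising towards eta once |x|/L passes the inflection point beta."""
    z = np.clip(-alpha * (np.abs(x) / L - beta), -500, 500)  # avoid overflow
    return mu + (eta - mu) / (1.0 + np.exp(z))

def w_sig_2d(W, H, alpha, beta, mu=1e-3, eta=1e5, size=(40, 40)):
    """2-D map over a size[0] x size[1] model grid; the radial term
    max(|x/W|, |y/H|) measures normalised distance to the target centre."""
    xs = np.arange(size[0]) - size[0] // 2
    ys = np.arange(size[1]) - size[1] // 2
    gx, gy = np.meshgrid(xs / W, ys / H, indexing="ij")
    r = np.maximum(np.abs(gx), np.abs(gy))
    z = np.clip(-alpha * (r - beta), -500, 500)
    return mu + (eta - mu) / (1.0 + np.exp(z))
```

With a large α the map behaves like a step function at radius β, while moderate α values produce the smooth transition that admits small penalties on the nearby background.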

C. PROBLEM FORMULATION OF AMCCF FRAMEWORK
By using the sigmoid spatial weight maps, we can leverage the local context information more effectively for model learning. However, since the spatial weight map function is always pre-defined in the first frame, it cannot satisfy the needs of tracking in real scenes. When the background regions in the local context suffer from significant appearance changes, they may also degrade the CF models and the tracking performance. Therefore, a solution is needed for adaptively exploiting the local contexts in visual tracking.
Since the parameters in Eqn. (6) can flexibly control the impacts of both the target and the surrounding background regions on CF model learning, we develop a series of sigmoid spatial weight maps with different parameters, and further employ them to train multiple CF models with different levels of local contexts. Meanwhile, after obtaining multiple response maps with these CF models in the tracking stage, we jointly estimate the target position and the fusion weights of the different response maps with model optimization algorithms, resulting in the proposed adaptive multiple contexts correlation filter (AMCCF) framework. Fig. 4 illustrates the flowchart of the proposed AMCCF tracking framework. In the following, we describe how to learn multiple CF models with different levels of local contexts, and then present how to adaptively fuse the response maps and estimate the target position.
To obtain multiple CF models with different levels of local contexts, we generate N pairs of spatial weight maps with different (α, β) parameters in Eqn. (6), and employ them for constrained CF model learning. By integrating Eqn. (2) with a temporal regularization term ‖f − f_{t−1}‖², STRCF [8] provides robust appearance models in the case of large appearance variations, and can be formulated as
$$\varepsilon(f)=\frac{1}{2}\Big\|\sum_{d=1}^{D} x_{t}^{d} * f^{d}-y\Big\|^{2}+\frac{1}{2} \sum_{d=1}^{D}\big\|w \odot f^{d}\big\|^{2}+\frac{\theta}{2}\big\|f-f_{t-1}\big\|^{2}, \tag{7}$$
where θ is the temporal regularization parameter and f_{t−1} denotes the filter learned in frame t − 1. Considering that the STRCF method offers real-time speed and favorable performance among the CF methods with spatial regularization, we integrate it with the N pairs of spatial weight maps, and thus obtain multiple STRCF models with different levels of local contexts. As shown in Fig. 1(b), we can model different levels of local contexts with these STRCF models, and thus achieve more robust tracking.
After evaluating the candidate regions with the multiple STRCF models in the tracking stage, we obtain multiple response maps with different levels of local context information. Since different levels of local contexts may not contribute equally during tracking, it is not appropriate to assign the same fusion weight to all the response maps. For example, the background regions in the local context should be suppressed when they suffer from significant appearance variations, and should be relied on more when the target is obscured by other objects or disappears from the camera view. Therefore, the response maps should have variable fusion weights for adaptively leveraging the local context information in visual tracking. Fortunately, in UPDT [9], Bhat et al. develop a minimal weighted confidence margin (MWCM) measure to evaluate the tracking confidence at a certain position, and further present an MWCM loss function for jointly estimating the target location and the fusion weights of different response maps. Denote by y(t) the value of response map y at position t. The MWCM measure at position t* is then formulated as
$$\xi_{t^{*}}\{y\}=\min_{t \neq t^{*}} \frac{y(t^{*})-y(t)}{\Delta(t^{*}-t)}, \quad \Delta(\tau)=1-e^{-\frac{K}{2}\|\tau\|^{2}}, \tag{8}$$
where the numerator stands for the response difference between positions t* and t, the denominator Δ(t* − t) depends on the distance between t and t*, and K controls the change speed of the denominator. We can observe from Eqn. (8) that the MWCM measure at position t* is not only determined by y(t*), but also depends on the response at every other position t in the response map, and on the distance between them. Please refer to [9] for a more comprehensive analysis of the MWCM measure.
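The MWCM measure is straightforward to transcribe into NumPy; the grid handling and the value of K below are illustrative assumptions of this sketch.

```python
import numpy as np

def mwcm(y, t_star, K=0.05):
    """Minimal weighted confidence margin of response map y at the
    candidate position t_star (a tuple of indices): the minimum over
    t != t_star of (y[t*] - y[t]) / (1 - exp(-K/2 * ||t* - t||^2))."""
    coords = np.stack(np.meshgrid(*[np.arange(s) for s in y.shape],
                                  indexing="ij"), axis=-1)
    d2 = np.sum((coords - np.asarray(t_star)) ** 2, axis=-1).astype(float)
    delta = np.where(d2 > 0, 1.0 - np.exp(-0.5 * K * d2), 1.0)
    # Exclude t_star itself (zero distance) from the minimum.
    margin = np.where(d2 > 0, (y[t_star] - y) / delta, np.inf)
    return float(margin.min())
```

A clean unimodal response map yields a large margin, while a second peak of comparable height anywhere in the map drags the margin down, signalling low confidence.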
Based on the MWCM measure, an MWCM loss function is further proposed to estimate the fusion weights of the response maps and the target position. Suppose that y_i and γ_i are the i-th response map and its fusion weight, respectively; then the MWCM loss function is given as
$$\min_{\gamma, t^{*}}\ -\xi_{t^{*}}\{y_{\gamma}\}+\frac{\lambda_{\gamma}}{2}\|\gamma\|^{2}, \tag{9}$$
where y_γ = Σ_{i=1}^{N} γ_i y_i is the final response map after adaptive weight fusion, and λ_γ is a regularization parameter. By minimizing Eqn. (9), we can not only assign adaptive fusion weights to the response maps, but also obtain the optimal target position.
Inspired by the favorable properties of Eqn. (9), we adopt it for computing the fusion weights of all the response maps, and thus can adaptively leverage the local context information for visual tracking. Considering that the local contexts usually have similar appearances in two consecutive frames, the fusion weights of the response maps should not change much between neighboring frames. Hence, we take the fusion weights in frame t − 1 as a prior, and incorporate them into Eqn. (9):
$$\min_{\gamma, t^{*}}\ -\xi_{t^{*}}\{y_{\gamma}\}+\frac{\lambda_{\gamma}}{2}\|\gamma\|^{2}+\frac{\nu}{2} \sum_{i=1}^{N}\left(\gamma_{i}-\gamma_{t-1, i}\right)^{2}, \tag{10}$$
where γ_{t−1,i} denotes the fusion weight of the i-th response map in frame t − 1, and ν balances the weight prior term. By using Eqn. (10), we put larger penalties on the weights with large variations, and thus keep the fusion weights similar in neighboring frames. Experiments in Section IV-B also show that Eqn. (10) achieves better performance than Eqn. (9), validating the effectiveness of adding the weight prior regularization term.

D. OPTIMIZATION
Since Eqn. (10) is a non-convex function, it cannot be solved directly in closed form. In this section, we minimize Eqn. (10) by alternately updating the variables γ_i and t*, as follows: Updating t*: Given the fusion weights γ = [γ_1, . . . , γ_N], we can easily observe that minimizing Eqn. (10) is equivalent to maximizing the term ξ_{t*}{y_γ}. Thus the position t* can be found by searching for the highest score in the fused response map.
Updating γ_i: Following UPDT [9], an auxiliary variable ξ with ξ = ξ_{t*}{y_γ} is introduced to relax Eqn. (10) as
$$\min_{\gamma, \xi}\ -\xi+\frac{\lambda_{\gamma}}{2}\|\gamma\|^{2}+\frac{\nu}{2} \sum_{i=1}^{N}\left(\gamma_{i}-\gamma_{t-1, i}\right)^{2}, \quad \text{s.t.}\ \ y_{\gamma}(t^{*})-y_{\gamma}(t) \geq \xi \, \Delta(t^{*}-t), \ \forall t \neq t^{*}. \tag{11}$$
It can be seen that Eqn. (11) is a quadratic programming problem, and thus can be solved efficiently via existing quadratic programming algorithms. In addition, since t ranges over every position in the response map, the quadratic programming algorithm may suffer from low speed as a result of too many inequality constraints in Eqn. (11). To address this, we sample the N_s positions with local maxima in each response map to ensure the fast convergence of Eqn. (11). By iteratively updating the variables γ_i and t*, we can jointly estimate the fusion weight of each response map and the target location. We empirically find that the loss of Eqn. (10) converges within 3 iterations on most sequences, and thus the number of iterations N_i is set to 3 throughout all the experiments.
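The alternating scheme can be sketched as follows. As a simplification, the γ-step below replaces the quadratic programming solver with projected subgradient ascent on the active margin, and constrains γ to the probability simplex for stability; both choices are assumptions of this sketch, not the paper's exact solver.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the simplex {g : g >= 0, sum(g) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def fuse_and_localize(responses, gamma_prev, K=0.05, nu=1.0,
                      n_outer=3, n_inner=50, lr=0.05):
    """Alternately update the target position t* and fusion weights gamma
    for a stack of N response maps (shape (N, W, H))."""
    gamma = gamma_prev.copy()
    for _ in range(n_outer):
        # t*-step: the peak of the fused response map.
        y = np.tensordot(gamma, responses, axes=1)
        t_star = np.unravel_index(np.argmax(y), y.shape)
        coords = np.stack(np.meshgrid(*[np.arange(s) for s in y.shape],
                                      indexing="ij"), axis=-1)
        d2 = np.sum((coords - np.asarray(t_star)) ** 2, axis=-1).astype(float)
        delta = np.where(d2 > 0, 1.0 - np.exp(-0.5 * K * d2), 1.0)
        # gamma-step: raise the minimal weighted confidence margin at t*.
        for _ in range(n_inner):
            y = np.tensordot(gamma, responses, axes=1)
            margin = np.where(d2 > 0, (y[t_star] - y) / delta, np.inf)
            t_min = np.unravel_index(np.argmin(margin), margin.shape)
            # Subgradient of the active margin w.r.t. each gamma_i,
            # pulled back towards the previous-frame weights by nu.
            grad = (responses[(slice(None),) + t_star]
                    - responses[(slice(None),) + t_min]) / delta[t_min]
            gamma = project_simplex(gamma + lr * (grad - nu * (gamma - gamma_prev)))
    return gamma, t_star
```

Given one clean unimodal response map and one ambiguous map with two equally tall peaks, the scheme shifts weight towards the confident map while still localizing the shared peak.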

IV. EXPERIMENTS
In this section, we first validate the effectiveness of each component of the AMCCF method with ablation experiments, and then perform extensive experiments on four tracking benchmarks, including the OTB-2015 [10], Temple-Color [11], LaSOT [12] and VOT-2018 [13] datasets.

A. IMPLEMENTATION DETAILS
To validate the effectiveness of the proposed AMCCF framework itself, we employ hand-crafted features, i.e., HOG and ColorName [15], for feature representation. Since the running time directly depends on the number of STRCF models N, we set N = 3 to balance the tradeoff between tracking performance and computational efficiency. The parameters (α, β) in Eqn. (6) are set to {(1000, 0.5), (1000, 0.54), (1000, 0.58)} for leveraging different levels of local contexts for visual tracking, and the parameters µ, η in Eqn. (6) are set to 10^−3 and 10^5, respectively. In addition, we keep the parameters of all STRCF models the same as the original STRCF method [8] for fair comparison.
In the tracking stage, we follow the best practice in [9] when setting the hyper-parameter K in Eqn. (8).

B. INTERNAL ANALYSIS OF THE PROPOSED AMCCF METHOD
In this section, we first investigate the effect of hyperparameters α and β in Eqn. (6) on tracking performance, and then compare the proposed sigmoid spatial weight map function with the other two commonly used functions. Finally, we perform an ablation experiment to analyze the impact of different fusion approaches for the response maps.

1) EFFECT OF THE HYPER-PARAMETERS α AND β
Since the hyper-parameters α and β control the weight change speed and the inflection point in Eqn. (6) respectively, the choice of their values may have significant impacts on CF model learning. To investigate their effects on tracking performance, we separately generate multiple sigmoid spatial weight maps with different parameters, and integrate them with the STRCF model for evaluation. Fig. 5 gives the AUC results of the STRCF method with different choices of the parameters α and β on the OTB-2015 dataset. It can be seen from Fig. 5(a) that when setting α = 2000, the choice of the β value indeed has a significant impact on the performance. In particular, when progressively increasing the β value from 0.5 to 0.58, the STRCF method generally obtains better results on OTB-2015, and achieves the best AUC score of 65.8% with β = 0.58. To explain this, we first analyze Eqn. (6) in detail. It is evident that when α takes large values, i.e., α → +∞, Eqn. (6) becomes equivalent to the step function
$$w(x, y)=\begin{cases}\mu, & \max\left(\left|x/W\right|,\,\left|y/H\right|\right)<\beta,\\ \eta, & \text{otherwise}.\end{cases} \tag{12}$$
Since max(|x/W|, |y/H|) ≤ 0.5 corresponds to the target region, the parameter β is able to fine-tune the region with the small penalty value µ within the sigmoid spatial weight map. To be specific, when setting β = 0.5, Eqn. (12) takes small values inside the target region, and imposes large penalties outside the target bounding box. On the other hand, when 0.5 < β ≤ 0.58, Eqn. (12) takes small values even on the surrounding background outside the target region; thus the learned CF model has non-zero coefficients on the corresponding background regions, and can leverage the local context information for improving the tracking performance. In addition, when setting β > 0.58, too many background regions have small penalty values in the sigmoid spatial weight map, so the learned CF models contain excessive non-zero coefficients on the background, resulting in degraded performance.
Next, we fix the β value to 0.58, and investigate the effects of different α values on the performance. α determines the weight change rate in Eqn. (6): the larger α is, the faster the penalty values increase in the background regions near the target. From Fig. 5(b) we can observe that the best performance is achieved at α = 1000 with an AUC score of 66%. Based on these results, we assign the parameters (α, β) in the AMCCF framework with {(4000, 0.5), (1000, 0.54), (1000, 0.58)}, which can learn multiple STRCF models with different levels of local contexts.
Finally, we set the α and β values in Eqn. (6) to 1000 and 0.58, and compare the resulting sigmoid spatial weight map with the quadratic and step counterparts by integrating each of them with the STRCF model. The corresponding three methods are named STRCF_sig, STRCF_step and STRCF_quad, respectively. For fair comparison, we keep the parameters of the quadratic function the same as the original STRCF, and the parameters µ, η of the step function consistent with the choices in Eqn. (6).

TABLE 1. The AUC results (%) of the STRCF method with different types of spatial weight map functions on the OTB-2015 and Temple-Color datasets. Note that STRCF_quad, STRCF_step and STRCF_sig represent the STRCF methods with spatial weight maps of quadratic, step and sigmoid functions, respectively.

TABLE 2. The AUC results (%) of multiple AMCCF variants with different fusion methods of the response maps on OTB-2015. Note that Baseline, MCCF, AMCCF (w/o prior) and AMCCF represent the original STRCF, the variant replacing Eqn. (10) with averaging the response maps, the variant removing the weight prior regularization term in Eqn. (10), and the full AMCCF method, respectively.

Table 1 gives the AUC results of the three methods on both the OTB-2015 and Temple-Color datasets. One can observe that the STRCF methods with quadratic and step functions share similar performance on the two datasets, achieving average AUC scores of 60.2% and 59.9%, respectively. In contrast, the STRCF method with our sigmoid function obtains an average AUC score of 61.6%, outperforming the variants with quadratic and step functions by 1.4% and 1.7%. This indicates that by flexibly tuning the CF coefficients in both target and background regions, the proposed sigmoid weight map function is superior to the counterparts using quadratic and step functions.

2) THE COMPARISON ON DIFFERENT FUSION METHODS OF THE RESPONSE MAPS
To investigate the impacts of different fusion methods of the response maps, we implement four variants of the AMCCF method in total: the original STRCF method (termed as Baseline), the variant by replacing Eqn. (10) with computing the average response maps (MCCF), the variant by removing the weight prior regularization term in Eqn. (10) (AMCCF  (w/o prior)) and the AMCCF method. Table 2 shows the AUC scores of all the variants on the OTB-2015 dataset.
We can observe that by using multiple STRCF models with different levels of local contexts, MCCF outperforms the Baseline method by 1.2% in AUC score. In addition, AMCCF (w/o prior) achieves a higher AUC score of 66.7%, which can be explained by the fact that the MWCM loss function jointly estimates the target location and the fusion weight of each response map, thereby adaptively leveraging the local context information for visual tracking. Finally, the AMCCF method further outperforms AMCCF (w/o prior) by 0.4% in AUC score, validating that the proposed weight prior regularization term in Eqn. (10) is helpful for assigning adaptive fusion weights to the response maps.

C. EXPERIMENTS ON OTB-2015 DATASET
In this section, we first compare our AMCCF method with the state-of-the-art trackers on OTB-2015, and then analyze the performance of all competing methods on 11 challenging video attributes. Finally, we further provide the qualitative comparison on several video sequences.

1) DATASET AND EVALUATION METRICS
The OTB-2015 dataset consists of 100 sequences annotated with 11 challenging video attributes, including scale variation (SV), illumination variation (IV), in-plane-rotation (IPR), out-of-plane rotation (OPR), occlusion (OCC), fast motion (FM), motion blur (MB), background clutter (BC), deformation (DEF), out of view (OV) and low resolution (LR). To evaluate the proposed AMCCF method, we follow the protocols in [10] and adopt the overlap precision (OP) metric for computing the ratio of frames with bounding box overlaps exceeding 0.5 in a sequence. In addition, we also plot the overlap success curve against different overlap thresholds for comprehensive evaluation.
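The OP metric is simple to compute from per-frame bounding boxes; the (x, y, w, h) box convention below is an assumption of this sketch.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def overlap_precision(pred, gt, thresh=0.5):
    """Fraction of frames whose predicted box overlaps the ground truth
    by more than `thresh` (the OP metric used on OTB-2015)."""
    overlaps = [iou(p, g) for p, g in zip(pred, gt)]
    return float(np.mean([o > thresh for o in overlaps]))
```

Sweeping `thresh` from 0 to 1 and plotting OP against it yields the overlap success curve whose area under the curve (AUC) is reported throughout this section.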

2) COMPARISON WITH THE STATE-OF-THE-ART METHODS
To validate the effectiveness of our AMCCF, we compare it with 19 state-of-the-art trackers on OTB-2015, including (i) trackers with hand-crafted features: STRCF [8], ECOhc [22], BACF [23], SAMF CA [42], MEEM [43], TRACA [32], SRDCFDecon [33], PTAV [34] and SRDCF [6]; and (ii) trackers with CNN features: ECO [22], CFNet [37], VITAL [41], DeepSRDCF [38], SiameseFC [36], HDT [18], HCF [17], FCNT [35], SiamRPN [39] and MDNet [40]. Note that SAMF CA also utilizes the local context information for appearance modeling, and the MEEM, PTAV and TRACA methods track the target with an ensemble of multiple tracking models. For a fair comparison, we employ the publicly available codes or results provided by the authors to reproduce the results of all competing methods. Table 3 reports the mean OP results of all competing methods on the OTB-2015 dataset. We can observe that the proposed AMCCF method ranks first among the trackers with hand-crafted features, achieving a mean OP score of 83.3%. Compared with the baseline STRCF method, AMCCF brings a mean OP gain of 3.3% on OTB-2015, which can be attributed to its ability to leverage adaptive local context information for visual tracking. Meanwhile, our AMCCF tracker is also superior to the SAMF CA, PTAV and TRACA methods, surpassing them by mean OP gains of 13.9%, 8.6% and 5.2%, respectively. In terms of tracking speed, the best three results belong to TRACA (101.3 FPS), ECOhc (42 FPS) and PTAV (27 FPS), and the baseline STRCF also runs at 24.3 FPS. In contrast, our AMCCF method runs at 8.6 FPS on OTB-2015. However, the higher speed of these trackers comes at the expense of lower accuracy compared with our AMCCF. In addition, even with hand-crafted features, AMCCF still outperforms multiple tracking methods with CNN features, such as SiameseFC, DeepSRDCF and CFNet, and is on par with the state-of-the-art ECO tracker.
Fig. 6 also shows the overlap success plots of the competing methods, ranked by AUC score. Not surprisingly, our AMCCF method outperforms ECOhc and STRCF with AUC gains of 3.7% and 2.0%, respectively. Compared with the trackers using CNN features, AMCCF still performs favorably, ranking fourth among all competing methods and validating the effectiveness of the AMCCF method.

3) VIDEO ATTRIBUTE-BASED COMPARISON
In this section, we investigate the performance of our AMCCF method on all 11 video attributes of the OTB-2015 dataset. Fig. 7 shows the AUC scores of all competing methods with hand-crafted features. One can observe that AMCCF outperforms the baseline STRCF and the other competing methods on most of the attributes. Below we mainly analyze the results on the four attributes most relevant to our AMCCF method.

4) OCCLUSION (OCC)
Obviously, when the target is occluded by background regions, the target appearance information alone is not enough to detect the object. By modeling the local context information with image regions larger than the target, the STRCF method can alleviate the problems induced by occlusion, and achieves an AUC score of 61.5%. Nevertheless, the proposed AMCCF method still outperforms the baseline STRCF by an AUC gain of 2.9%. This can be attributed to the fact that AMCCF adaptively leverages the local context regions for visual tracking, thus avoiding the risk of incorporating too much background into the CF models.

5) OUT OF VIEW (OV)
Similar to the occlusion attribute, when the target disappears from the camera view, the local context is also needed for accurate target localization. We can observe from Fig. 7 that AMCCF ranks first among all the competing methods, and outperforms the ECOhc and STRCF methods by 5.8% and 0.9%, respectively. This indicates that AMCCF is able to handle the problems induced by the out-of-view attribute.

6) IN-PLANE/OUT-OF-PLANE ROTATION (IPR/OPR)
While the rotation attributes do not cause the target to disappear from the camera view, they still result in large appearance variations of the target. From the results in Fig. 7, one can see that our AMCCF method outperforms the baseline STRCF on these two attributes, achieving AUC scores of 62.8% and 65%, respectively. We attribute these improvements to the fact that AMCCF can adaptively leverage the local context information for robust tracking.

7) QUALITATIVE COMPARISON
Finally, we provide a qualitative comparison on several challenging video sequences, including Basketball, Coke, DragonBaby, Girl2 and KiteSurf. It can be seen that the proposed AMCCF method tracks the target well and performs better than the other methods in the cases of occlusion (Coke, Girl2 and KiteSurf), out-of-plane rotation (Basketball) and out of view (DragonBaby). This further validates the effectiveness of our AMCCF method in leveraging the local context information for visual tracking.

D. EXPERIMENTS ON TEMPLE-COLOR AND LaSOT DATASETS
Here we further evaluate our AMCCF method on the Temple-Color [11] and LaSOT [12] datasets. The Temple-Color dataset extends OTB-2015 and contains 129 color video sequences in total. LaSOT is a recently released large-scale tracking dataset collected from YouTube. It is more challenging than the other datasets since the average length of a video in LaSOT is over 2,500 frames, and the target often suffers from various challenging factors, such as fast motion, deformation and out of view. We perform comparative experiments on the testing set of LaSOT, which contains 280 video sequences from 70 categories. As on the OTB-2015 dataset, the overlap success plots are used as the evaluation metric on both Temple-Color and LaSOT. Fig. 9 presents the AUC scores of the competing methods on both datasets. We can observe from Fig. 9(a) that the ECO method with CNN features achieves the best performance on Temple-Color. Our AMCCF ranks second among all the methods, and outperforms the baseline STRCF by an AUC gain of 1.8%. Similarly, as shown in Fig. 9(b), AMCCF achieves an AUC score of 32.1% on the large-scale LaSOT dataset. In comparison with ECOhc and STRCF, it surpasses them by AUC gains of 1.7% and 1.3%, respectively, validating the effectiveness of our AMCCF method on a large-scale tracking dataset.

E. EXPERIMENTS ON VOT-2018 DATASET
Finally, we evaluate the proposed AMCCF method on the VOT-2018 dataset [13]. VOT-2018 consists of 60 challenging real-life videos annotated with rotated bounding boxes. Following the practice in [13], we evaluate the competing trackers in terms of accuracy, robustness, and expected average overlap (EAO). Accuracy measures the average overlap ratio between the tracking results and the ground truth bounding boxes in a sequence; robustness computes the average number of tracking failures on the dataset; and EAO is a combined metric of both accuracy and robustness. We quantitatively evaluate our AMCCF method by comparing it with 10 representative tracking methods on VOT-2018. Table 4 reports the results of all the methods in terms of the three metrics. We can observe that AMCCF achieves an EAO score of 0.238, outperforming ECOhc and STRCF with hand-crafted features by EAO gains of 2.6% and 6.4%, respectively, again validating the effectiveness of the proposed AMCCF method.
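The accuracy and robustness measures described above can be sketched as follows. This is a simplification of the VOT protocol: we treat a zero-overlap frame as a failure and ignore the toolkit's burn-in frames after re-initialization; EAO, which combines both measures over expected run lengths, is omitted. The function name is ours, not from the VOT toolkit.

```python
import numpy as np

def vot_accuracy_robustness(overlaps_per_sequence):
    """Compute VOT-style accuracy and robustness from per-frame overlaps.

    overlaps_per_sequence: list of per-video 1-D arrays holding the
    overlap with the ground truth in each frame. A frame with zero
    overlap counts as a tracking failure (after which the tracker would
    be re-initialized); accuracy averages the overlap over the
    remaining, successfully tracked frames.
    """
    all_valid, failures = [], []
    for overlaps in overlaps_per_sequence:
        overlaps = np.asarray(overlaps, dtype=float)
        failed = overlaps == 0.0
        failures.append(int(failed.sum()))       # failures in this video
        all_valid.extend(overlaps[~failed])      # overlaps on tracked frames
    accuracy = float(np.mean(all_valid)) if all_valid else 0.0
    robustness = float(np.mean(failures))        # mean failures per sequence
    return accuracy, robustness
```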

V. CONCLUSION
In this paper, we present an adaptive multiple contexts correlation filter (AMCCF) framework for exploiting the local context information during tracking. To achieve this, we first introduce a novel sigmoid spatial weight map function into the CF trackers with spatial regularization. It can flexibly regulate the impacts of both the target and the surrounding background on model learning, thereby resulting in more discriminative CF models. Based on this, we model different levels of local contexts by incorporating spatial weight maps with different parameters into multiple STRCF models. To adaptively leverage the local contexts in the tracking stage, the MWCM loss function with a weight prior constraint is adopted for jointly predicting the target position and the adaptive fusion weights of all response maps. Experiments on four challenging datasets show that our AMCCF can adaptively employ the local contexts for robust tracking, and performs on par with the state-of-the-art trackers.