Visual Tracking With Online Assessment and Improved Sampling Strategy

The kernelized correlation filter (KCF) is one of the most successful trackers in computer vision today. However, its performance may be significantly degraded in a wide range of challenging conditions such as occlusion and out-of-view. For many applications, particularly safety-critical applications (e.g., autonomous driving), it is of profound importance to have consistent and reliable performance under all operating conditions. This paper addresses this issue for KCF-based trackers by introducing two novel modules, namely online assessment of the response map, and a strategy of combining cyclically shifted sampling with random sampling in deep feature space. A method of online assessment of the response map is proposed to evaluate the tracking performance by constructing a 2-D Gaussian estimation model. Then a strategy of combining cyclically shifted sampling with random sampling in deep feature space is presented to improve the tracking performance when it is assessed to be unreliable based on the response map. The online assessment module can therefore be regarded as the trigger for the second module. Experiments verify that the tracking performance is significantly improved, particularly in challenging conditions, as demonstrated by both quantitative and qualitative comparisons of the proposed tracking algorithm with state-of-the-art tracking algorithms on the OTB-2013 and OTB-2015 datasets.


I. INTRODUCTION
Visual tracking has been studied for several decades; however, it is still an active research topic in the field of computer vision and pattern recognition [1], [2]. Although visual tracking has found application in a wide range of areas, such as intelligent transportation systems (ITS) [3], vision-based navigation [4], surveillance [5] and motion recognition [6], it remains challenging in the presence of spatiotemporal variation of targets such as occlusion, illumination variation, and out-of-view. This causes concern in outdoor applications, particularly safety-critical applications such as driverless cars. Reliable visual tracking is essential to perception and decision making for ensuring a timely and appropriate response to the environment and events under all possible weather and traffic conditions. Therefore, improving the robustness of tracking algorithms in the face of these challenges has become an urgent problem in their engineering applications. We believe a promising approach to enhancing the tracking robustness of existing algorithms is to first construct an evaluation mechanism for monitoring the tracking performance online, and then to improve the tracking performance when it is not reliable by modifying the tracking algorithm as appropriate. This motivates the research reported in this paper.
The associate editor coordinating the review of this manuscript and approving it for publication was Guitao Cao.
Generally, visual tracking algorithms are categorized as either discriminative or generative. In the past few years, because generative approaches do not make effective use of the surrounding information that can distinguish a target from its background, discriminative approaches have gradually become the mainstream in the field of visual tracking. The discriminative approaches treat the tracking problem as a detection task and learn information about the target from each detection online. Consequently, the discriminative approaches are also referred to as tracking-by-detection. Discriminative approaches can be further divided into two categories: feature-to-classifier trackers and deep learning-based trackers.
Feature-to-classifier trackers aim to establish a classifier that distinguishes a target object from its background. To adapt to changes of target appearance in dynamic scenes, these trackers must meet two requirements: firstly, the feature representing the difference between a target region and the background should be robust and discriminative with respect to variations in both the extrinsic and intrinsic environments; secondly, the classifier for detection must be updateable online. Since online updating can be formulated as a process of online learning, the uncertainty of the labels corresponding to the new training samples obtained from the current tracking results may lead to drifting problems. Therefore, for feature-to-classifier trackers, in order to prevent incorrect tracking results from contaminating the classifier, it is crucial to construct an assessment method that evaluates the reliability of the tracking result online.
Deep learning-based trackers also contain two subcategories [7]. The first subcategory is the deep feature-based trackers, which merely use a pre-trained deep network to extract features. In this subcategory, the parameters of the network are not adjusted during tracking. For example, the CNT tracker propagates an image forward through a convolutional neural network (CNN) to extract weak features, and then uses these features to construct a classifier that distinguishes the target from the background [8]. In the framework of spatially regularized discriminative correlation filters (SRDCF), deep features extracted from the first layer of the VGG network were used to enhance the performance of the SRDCF tracker [9]. It is worth noting that the SRDCF framework can achieve better tracking performance than the traditional discriminative correlation filter (DCF) framework because it mitigates the negative boundary effect caused by the inherent periodic assumption of the standard DCF. This conclusion is also supported by [10], which investigated and compared the tracking performance of deep features within both the traditional DCF framework and the SRDCF framework. However, as the SRDCF framework introduces a spatial regularization component to improve tracking performance, the real-time performance of the algorithm is greatly reduced. Compared with handcrafted features, deep features extracted by a pre-trained deep network can represent objects more comprehensively and have a stronger ability to classify different objects [7]. The other subcategory consists of trackers that construct a dedicated network framework to extract features and evaluate candidate tracking regions [11]-[14]. For example, the SiameFC tracker takes the target template and the current search region as inputs, and exploits a deep network to generate a response map for the tracked object with a convolution operation [11]. As an excellent representative of this subcategory, SiameFC runs at more than 80 FPS on a GPU platform [11], which indicates that it has outstanding real-time performance.
As a high-speed feature-to-classifier tracker, the kernelized correlation filter (KCF) employs high-dimensional features, e.g., the histogram of oriented gradients (HOG), and Gaussian kernel regression to compute a response map, and tracks a target in accordance with the location of the peak of the response map [15]-[17]. In general, the advantages of the KCF come from four aspects. Firstly, it uses cyclically shifted sampling to obtain a sufficiently large number of training samples. Secondly, since the convolution operation in the spatial domain is converted into element-wise multiplication in the frequency domain by Fourier transforms, the real-time performance of the KCF is improved greatly. Thirdly, the KCF accommodates multi-channel features, which allows its high-dimensional features for distinguishing the target from the background to be extended further by simply summing over the channels in the frequency domain. Lastly, the regression model based on the kernel method can improve the classification performance. Considering that the KCF has advantages in tracking performance compared with the standard DCF, and in computational efficiency compared with the SRDCF, we choose the KCF as the basic tracking framework of the proposed algorithm. However, despite its excellent tracking performance in normal conditions, the KCF cannot yield reliable performance when it is confronted with challenges such as occlusion and fast motion [18], [19]. In our opinion, this is due to the two reasons discussed below:
(1) Since the majority of correlation filters use a fixed learning rate to update the regression model in each frame, the errors from subsequent frames accumulate continuously. For example, when occlusion occurs, the target disappears for several consecutive frames, and the tracking results that contain no target, used as new training samples, directly contaminate the regression model and lead to tracking failures.
Therefore, it is vitally important to evaluate the tracking performance online and then update the regression model according to the evaluation output.
(2) As a dense sampling scheme, the cyclically shifted sampling of the KCF limits the scope of the target search and leads to tracking failure when the distance between the target positions in two consecutive frames, caused by fast motion, exceeds the search scope. However, merely expanding the boundary of cyclically shifted sampling may amplify the boundary effect and lead to an inaccurate representation of the image content [9]. Thus, it would be sensible to incorporate a new sampling scheme that increases the search scope whilst keeping the boundary of cyclically shifted sampling unchanged.
VOLUME 8, 2020
Motivated by the above observations, this paper first proposes a method of online assessment of the response map in the framework of the KCF, and then proposes a strategy that combines cyclically shifted sampling with random sampling in deep feature space. The main contributions of this paper are as follows.
(1) This paper proposes a method in which the response map is used to evaluate online the reliability of the tracking result in each frame. Specifically, this method first designs two indices describing the shape of the response map, and then constructs a 2-D Gaussian estimation model from these two indices for the reliability assessment of the tracking result.
(2) It proposes a scheme to enhance the cyclically shifted sampling of the KCF by adding random sampling which broadens the search scope of candidate regions when the reliability of the tracking results of the basic KCF is insufficient.
(3) To take full advantage of deep features in performance and of handcrafted features in efficiency, this paper further incorporates a deep feature-based regression model into the proposed hybrid sampling scheme, and then proposes a strategy of combining cyclically shifted sampling with random sampling in deep feature space. Moreover, according to the result of online assessment, handcrafted and deep feature-based regression models are used interchangeably and updated using different learning rates in the proposed algorithm.
We compare the proposed algorithm with state-of-the-art trackers on the large benchmark datasets OTB-2013 [20] and OTB-2015 [21]. Both quantitative and qualitative experimental results demonstrate that the proposed algorithm performs favorably against state-of-the-art tracking algorithms.
The remainder of this paper is organized as follows. Section 2 discusses the related work, and Section 3 presents details of the proposed tracking algorithm. Section 4 presents and analyzes our experimental results and offers related comparisons with other state-of-the-art tracking algorithms. Finally, Section 5 presents our conclusions.

II. RELATED WORK
There are many reviews about visual tracking. This section only discusses some of the most relevant work motivating our tracker, including sampling mode for tracking tasks, and feature representation of targets in tracking tasks.

A. SAMPLING SCHEME FOR TRACKING
The tracking problem can be described as deciding how to track an object with little a-priori knowledge. For feature-to-classifier trackers, sampling is an indispensable tool for online learning and detection. A sampling scheme is used to collect sufficient training samples in the target's neighborhood, where typically each sample characterizes a sub-window of the same size as the target region. Generally, the sampling schemes used in tracking algorithms are divided into two types: random and dense sampling.
As a representative of random sampling, particle sampling is based on Monte Carlo methodology. Since both the computational burden and the tracking accuracy increase with the number of particles, real-time performance is always a major challenge for particle filter-based trackers [22], [23]. In tracking tasks, dense sampling collects all the sub-windows with a certain step size in the target's neighborhood. Generally speaking, this scheme leads to a lot of redundancy because most of the samples overlap heavily. Fortunately, Henriques et al. associated this redundant structure with the circulant matrix [24]. The properties of the circulant matrix and the circulant structure of the samples allow the use of fast Fourier transforms to quickly incorporate information from all sub-windows and to obtain a regression model for detection. Therefore, trackers using this sampling exhibit excellent computational efficiency [25]. However, since the neighborhood area of dense sampling is limited, it is difficult to identify a target that is far away from its previous position, for example, due to fast movement or reappearance after occlusion [26], [27]. Therefore, this paper proposes a scheme of combining cyclically shifted sampling with random sampling to strike a better balance between computational burden and tracking performance, particularly in challenging conditions.

B. FEATURE REPRESENTATION FOR TRACKING
Over the past few years, diverse methods of feature representation have been proposed for tracking tasks [28]. Generally, the features used in tracking tasks can be divided into three levels: primary, intermediate (handcrafted) and advanced. Primary features include edges, contours and color information, which are ubiquitous and widely used in tracking tasks [29]-[31]. Although many primary features, such as the color histogram, are frequently robust against noise, they may not perform well when illumination varies. Compared with primary features, intermediate or handcrafted features, such as HOG [32], local Haar-like features [33] and the scale invariant feature transform (SIFT) [34], are more discriminative in distinguishing a target from its background. In general, advanced features fall into two categories. The first category is sparse features that are further extracted from handcrafted features by sparse coding, such as sparse coding spatial pyramid matching (ScSPM) [35]. The second category is referred to as deep features, which are mainly generated from the outputs of different layers of a pre-trained CNN and have shown strong advantages, e.g., good generalization and transfer ability [7], [10], [36]. However, the computational complexity of deep features is much higher than that of handcrafted features. Therefore, the proposed algorithm aims to make use of the advantage of deep features in performance and of handcrafted features in efficiency.

C. ONLINE EVALUATION OF TRACKING RESULTS
The idea of online evaluation of tracking performance originated from the Tracking-Learning-Detection (TLD) tracker [37], where the tracking performance is evaluated to decide how online learning or detection should proceed. Motivated by this work, the parallel tracking and verifying (PTAV) tracker uses a Siamese network to verify the tracking result calculated by the DSST tracker and improve the tracking performance [38]. Although the idea of online assessment in this paper is similar to that in the PTAV tracker, there are several fundamental differences between them. Firstly, the basic tracking framework is different. Our work is built on the KCF, whereas PTAV selects the DSST as the basic tracker. Generally, since the kernel method is used in the KCF framework to estimate the regression model, the performance of the KCF tracker is better than that of the translation filter of the DSST. Secondly, our approach is able to adjust the tracker in every frame through online assessment, whereas PTAV only operates on sampled frames. This is because PTAV uses a Siamese network with a substantial computational burden. To keep the running time acceptable, the verification is run only on sampled frames and cannot adjust the tracker in every frame. By contrast, the proposed algorithm designs a method of online assessment of the response map which can evaluate and verify the tracking performance in every frame. Thirdly, the mechanism for increasing the search scope when the performance of the tracker is not reliable is different. PTAV improves tracking performance by decreasing the frame sampling interval and increasing the size of the local region searched for the target. In our opinion, in the DSST framework used in PTAV, expanding the local region excessively may reinforce the negative impact of the boundary effect. Instead, our algorithm broadens the search scope by combining cyclically shifted sampling with random sampling, which avoids enlarging these local regions.
Lastly, the operation of the tracking part and the verifier/assessment is different. The tracking part and the verifier work in parallel on two separate threads in the PTAV while online evaluation is used as a trigger to switch different features and sampling schemes in a serial manner in our proposed algorithm.

III. METHODOLOGY
The proposed tracking algorithm first uses cyclically shifted sampling and the handcrafted HOG feature to compute the response map of each frame in the basic KCF framework, and then evaluates the reliability of the tracking result of each frame by online assessment of the response map. If the result is assessed to be unreliable, the proposed algorithm employs the scheme of combining cyclically shifted sampling with random sampling in deep feature space to improve the tracking performance in this frame. The key to switching between the two strategies is the online assessment of the response map. Therefore, the online assessment module and the improved sampling strategy can be regarded as a whole and embedded into an existing tracking framework.

A. FRAMEWORK OF KCF
Considering the KCF is the essential framework of the proposed tracker, we firstly introduce this framework briefly in this section. Generally, the KCF framework contains three modules: regression model training, target detection and regression model updating.

1) REGRESSION MODEL TRAINING
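Consider a feature map $Y_t \in \mathbb{R}^{m \times n \times C}$ representing the target region and its padding, and a Gaussian-shaped label matrix $r \in \mathbb{R}^{m \times n}$, where $C$ is the dimension of the feature and $m \times n$ is the size of the feature map. For the first frame, based on the given target region and its padding, the parameters of the regression model, $k^{YY}_1$ and $\hat{\alpha}_1$, are computed by [15]

$$k^{YY}_t = \exp\left(-\frac{1}{\sigma^2}\left(2\|Y_t\|^2 - 2\,\mathcal{F}^{-1}\left(\sum_{c=1}^{C} \hat{Y}_{t,c}^{*} \odot \hat{Y}_{t,c}\right)\right)\right), \qquad \hat{\alpha}_t = \frac{\hat{r}}{\hat{k}^{YY}_t + \lambda} \tag{1}$$

where $\lambda$ is a regularization parameter, $\sigma$ is the Gaussian kernel parameter, and $k^{YY}_t \in \mathbb{R}^{m \times n}$ is the kernel correlation between $Y_t$ and itself. A hat denotes the 2-D discrete Fourier transform, $\mathcal{F}^{-1}$ its inverse, $^{*}$ complex conjugation, $\odot$ element-wise multiplication, and the division is element-wise. $k^{YY}_t$ and $\hat{\alpha}_t$ are the outputs of the training module.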
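As an illustration of the training step, the following NumPy sketch assumes the standard Gaussian-kernel formulation of the KCF in [15]. The function names (`gaussian_labels`, `gaussian_kernel_correlation`, `train`) and the label width are hypothetical, and the normalization by the number of feature elements used in some public KCF code is omitted for brevity.

```python
import numpy as np


def gaussian_labels(m, n, width):
    """Gaussian-shaped label matrix r whose peak sits at index (0, 0)."""
    rows = np.minimum(np.arange(m), m - np.arange(m))  # cyclic distance to row 0
    cols = np.minimum(np.arange(n), n - np.arange(n))  # cyclic distance to column 0
    dist2 = rows[:, None] ** 2 + cols[None, :] ** 2
    return np.exp(-0.5 * dist2 / width ** 2)


def gaussian_kernel_correlation(a, b, sigma):
    """Gaussian kernel between a and every cyclic shift of b; inputs are (m, n, C)."""
    a_hat = np.fft.fft2(a, axes=(0, 1))
    b_hat = np.fft.fft2(b, axes=(0, 1))
    # Cross-correlation for all shifts, summed over channels in the Fourier domain.
    cross = np.real(np.fft.ifft2(np.sum(np.conj(a_hat) * b_hat, axis=2)))
    dist = np.sum(a ** 2) + np.sum(b ** 2) - 2.0 * cross
    # Clip tiny negative values caused by round-off before exponentiating.
    return np.exp(-np.maximum(dist, 0.0) / sigma ** 2)


def train(y_feat, labels, sigma, lam):
    """Return (k_yy, alpha_hat) for feature map y_feat and label matrix labels."""
    k_yy = gaussian_kernel_correlation(y_feat, y_feat, sigma)
    alpha_hat = np.fft.fft2(labels) / (np.fft.fft2(k_yy) + lam)
    return k_yy, alpha_hat
```

Because every cyclic shift is handled by one pair of FFTs, the cost is dominated by the transforms rather than by an explicit loop over sub-windows.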

2) TARGET DETECTION
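Depending on the target location in the previous frame, the KCF generates the candidate patches in the current frame by cyclically shifted sampling. Given the feature map of the test image patch $Z_t \in \mathbb{R}^{m \times n \times C}$ determined by the target location in the previous frame $t-1$, the response map is computed by [15]

$$R_t = \mathcal{F}^{-1}\left(\hat{k}^{YZ}_t \odot \hat{\alpha}_{t-1}\right) \tag{2}$$

where $R_t$ is the response map of the current frame, $k^{YZ}_t$ is the kernel correlation between the learned feature map $Y_{t-1}$ and $Z_t$, and each element of $R_t$ denotes the possibility of the target being located at the corresponding position. The position of the tracked target is determined by the location of the maximal value of $R_t \in \mathbb{R}^{m \times n}$ as

$$[x_t, y_t] = \arg\max_{(x, y)} R_t(x, y) \tag{3}$$

where $[x_t, y_t]$ is the position of the detected target.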
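The detection step can be sketched in the same way, again assuming the standard KCF formulation of [15]. The function names are hypothetical, and the returned peak is a (row, column) index of the response map, i.e., a cyclic displacement relative to the centre of the search patch rather than image coordinates.

```python
import numpy as np


def gaussian_kernel_correlation(a, b, sigma):
    """Gaussian kernel between a and every cyclic shift of b; inputs are (m, n, C)."""
    a_hat = np.fft.fft2(a, axes=(0, 1))
    b_hat = np.fft.fft2(b, axes=(0, 1))
    cross = np.real(np.fft.ifft2(np.sum(np.conj(a_hat) * b_hat, axis=2)))
    dist = np.sum(a ** 2) + np.sum(b ** 2) - 2.0 * cross
    return np.exp(-np.maximum(dist, 0.0) / sigma ** 2)


def detect(alpha_hat, y_model, z_feat, sigma):
    """Return the response map and the index of its peak."""
    k_yz = gaussian_kernel_correlation(y_model, z_feat, sigma)
    # Element-wise product in the Fourier domain evaluates all shifts at once.
    response = np.real(np.fft.ifft2(np.fft.fft2(k_yz) * alpha_hat))
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return response, (int(row), int(col))
```

If the test patch is a cyclically shifted copy of the training patch, the peak of the response map moves by the same shift, which is the property the tracker relies on.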

3) UPDATE
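According to the tracking result of each frame, a new feature map of the target $Y_t$ is produced. In order to learn the latest target appearance, the KCF uses the following scheme to update the existing regression model. $\hat{\alpha}_t$ is first updated in the frequency domain:

$$\hat{\alpha}_t = (1-\delta)\,\hat{\alpha}_{t-1} + \delta\,\hat{\alpha}'_t \tag{4}$$

followed by updating $Y_t$ as

$$Y_t = (1-\delta)\,Y_{t-1} + \delta\,Y'_t \tag{5}$$

where $\delta$ is the learning rate, a fixed value in the KCF, and the primed quantities are those computed from frame $t$ alone.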
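The update is a linear interpolation between the stored model and the quantities estimated from the current frame. A minimal sketch follows, assuming the interpolation form used by the standard KCF; the helper names are hypothetical, and the arguments may be scalars or NumPy arrays.

```python
def interpolate(previous, current, delta):
    """Blend the stored model with the new estimate using learning rate delta."""
    return (1.0 - delta) * previous + delta * current


def update_model(alpha_hat_prev, y_prev, alpha_hat_new, y_new, delta):
    """Update both the filter coefficients and the feature template."""
    return (interpolate(alpha_hat_prev, alpha_hat_new, delta),
            interpolate(y_prev, y_new, delta))


def original_weight(delta, frames):
    """Share of the original model that survives after `frames` updates."""
    return (1.0 - delta) ** frames
```

With a fixed learning rate, the share of the model that comes from frames before an occlusion decays as $(1-\delta)^n$ over $n$ occluded frames, which is why updating during occlusion contaminates the model.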

B. ONLINE ASSESSMENT OF RESPONSE MAP
This section presents a method to assess online the response map $R_t$ calculated by Eq. (2). The assessment result directly determines whether to employ the strategy of combining cyclically shifted sampling with random sampling in deep feature space. According to the principle of cyclically shifted sampling, a desirable response map has only one sharp peak and remains smooth in all other regions, because there is only one sample in which the target is located at the center. Therefore, the shape of a response map can reveal the reliability of the tracking result. As shown in Figure 1(a) and (b), the response maps of the 50th and 90th frames of the sequence Jogging are regular: each has only one sharp peak while the other regions remain smooth, and the tracking results are reliable. When the target is close to the telegraph pole in the 60th and 80th frames, the peaks of the response maps become smaller and the other regions start to fluctuate due to partial occlusion and background clutter. As the target disappears in the 74th and 78th frames, two peaks appear, the corresponding values decrease further, and the surrounding region fluctuates severely. Considering that the peak and the degree of fluctuation can indicate the reliability of a response map, we design two indices to assess them:
(1) Maximum of the response map $R_{\max}$: $R^t_{\max} = \max(R_t)$. The height of the peak $R_{\max}$ indicates the reliability of the tracking result.
(2) Area ratio of independence regions $AR$, which is defined as follows:

$$AR_t = \frac{1}{mn}\sum_{x=1}^{m}\sum_{y=1}^{n} \mathbb{1}\left[R_t(x, y) > \tau\right] \tag{6}$$

where $\mathbb{1}[\cdot]$ is the indicator function and $\tau$ is the segmentation threshold estimated by the Otsu algorithm [39]. As is well known, the Otsu algorithm segments an image well when the difference between the foreground and the background is pronounced. Furthermore, since a desirable response map is sharp around the peak and smooth in all other regions, the foreground obtained by the Otsu segmentation algorithm accounts for a small proportion of the area of the entire response map. Therefore, the lower the $AR$, the more reliable the tracking result.
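The two indices can be computed as sketched below. This is an illustration only: it assumes that AR is the fraction of response-map elements above the Otsu threshold, and the histogram-based Otsu routine with 256 bins is a generic implementation, not necessarily the one used in the experiments. The function names are hypothetical.

```python
import numpy as np


def otsu_threshold(values, bins=256):
    """Otsu threshold of a 1-D array, chosen by maximising between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weights = hist / hist.sum()
    w0 = np.cumsum(weights)              # probability of the lower class
    w1 = 1.0 - w0                        # probability of the upper class
    mu_cum = np.cumsum(weights * centers)
    mu_total = mu_cum[-1]
    denom = w0 * w1
    valid = denom > 1e-12                # skip splits that leave one class empty
    between = np.zeros_like(denom)
    between[valid] = (mu_total * w0[valid] - mu_cum[valid]) ** 2 / denom[valid]
    # Upper edge of the best bin: values above it form the foreground.
    return edges[1:][np.argmax(between)]


def response_indices(response):
    """Return (R_max, AR) for a 2-D response map."""
    r_max = float(response.max())
    tau = otsu_threshold(response.ravel())
    area_ratio = float(np.mean(response > tau))
    return r_max, area_ratio
```

A map with a single narrow peak yields a high `r_max` and a small `area_ratio`, whereas a map with several low, broad peaks yields the opposite.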
Figure 1(c) and (d) show the changes in the values of $R_{\max}$ and $AR$, respectively, during the tracking process of Jogging. We can clearly observe that the values of $R_{\max}$ and $AR$ change significantly between the 65th and 79th frames due to the disappearance of the target caused by occlusion. Figure 1(e) further shows the distribution of the two indices in 2-D space, where blue circles mark the indices corresponding to frames 65 to 79, in which the tracking performance is poor, and '*' marks the frames in which the target is tracked successfully.
Considering that there is a certain correlation between the two indices, we propose a method to evaluate the response map online by constructing a 2-D Gaussian estimation model (the black ellipses in Figure 1(e)). Suppose that the tracking results of the first $S$ frames of each tracking sequence are correct. From the observation vectors containing the two indices, $I_t = [R^t_{\max}, AR_t]^T$, $t = 2, \ldots, S$, a 2-D Gaussian distribution model can be calculated by maximum likelihood estimation (MLE). Its mean vector $u$ and covariance matrix $\Sigma$ are expressed as follows:

$$u = \frac{1}{N}\sum_{t} I_t, \qquad \Sigma = \frac{1}{N}\sum_{t} (I_t - u)(I_t - u)^T \tag{7}$$

where $N$ is the number of observation vectors $I_t$. For the initialization of the 2-D Gaussian estimation model, $N = S - 1$. When $t \geq S + 1$, we can compute the reliability of the response map $R_t$ according to

$$p(I_t; u_{t-1}, \Sigma_{t-1}) = \frac{1}{2\pi\,|\Sigma_{t-1}|^{1/2}} \exp\left(-\frac{1}{2}(I_t - u_{t-1})^T \Sigma_{t-1}^{-1} (I_t - u_{t-1})\right) \tag{8}$$

If $p(I_t; u_{t-1}, \Sigma_{t-1}) > \varepsilon$, where $\varepsilon$ is a threshold, the tracking result of frame $t$ is reliable, and the vector $I_t$, representing a reliable sample, is used to update the 2-D Gaussian distribution model. The online assessment method of the response map is summarized in Algorithm 1.
Online assessment of the response map using the 2-D Gaussian model is one of the main contributions of this paper. It monitors the tracking performance in real time and quantifies the reliability and confidence of the current tracker. It serves not only as a trigger for selecting one of the two tracking strategies presented in this paper, but also as a prerequisite for switching between the detector and the tracker in an image understanding system. Furthermore, it can find a much wider range of applications. For example, it could be used as a condition monitoring method for visual tracking, as an indicator of the level of uncertainty or confidence of the visual sensors in the context of multi-sensor data fusion (e.g., which sensor outcome should be trusted more in a given driving condition), or fed into decision making (e.g., reducing vehicle speed or changing the driving strategy). We will explore this further in our future work.
Output: the tracking result of frame t is reliable, and the 2-D Gaussian model (u_t, Σ_t) and I;
else
7. Output: the tracking result of frame t is not reliable, and the 2-D Gaussian model (u_t, Σ_t) and I;
end if
end if
Return output
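The assessment described in this subsection can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the model is refitted from the full list of reliable observations after each accepted frame, which is one simple way to realize the update, and the function names are hypothetical. Note that the quantity compared with ε is a probability density, so its scale depends on the spread of the indices.

```python
import math


def fit_gaussian_2d(samples):
    """MLE mean and covariance of 2-D observations [(r_max, ar), ...]."""
    n = len(samples)
    mean_x = sum(s[0] for s in samples) / n
    mean_y = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mean_x) ** 2 for s in samples) / n
    syy = sum((s[1] - mean_y) ** 2 for s in samples) / n
    sxy = sum((s[0] - mean_x) * (s[1] - mean_y) for s in samples) / n
    return (mean_x, mean_y), ((sxx, sxy), (sxy, syy))


def gaussian_pdf_2d(x, mean, cov):
    """Density of a 2-D Gaussian, using the closed-form 2x2 inverse."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    maha = (cov[1][1] * dx * dx - 2.0 * cov[0][1] * dx * dy
            + cov[0][0] * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))


def assess(observation, reliable_samples, eps):
    """Return (is_reliable, updated list of reliable observations)."""
    mean, cov = fit_gaussian_2d(reliable_samples)
    is_reliable = gaussian_pdf_2d(observation, mean, cov) > eps
    if is_reliable:
        # Only reliable observations are allowed to refine the model.
        return True, reliable_samples + [observation]
    return False, reliable_samples
```

Rejecting unreliable observations keeps frames with occlusion from widening the model, so the ellipse continues to describe successful tracking only.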

C. SCHEME OF COMBINING CYCLICALLY SHIFTED WITH RANDOM SAMPLING
Although cyclically shifted sampling can guarantee the real-time performance of the tracker, the search scope of this sampling mode is limited. In most KCF-based trackers, the search scope in the current frame is determined by the location of the tracking box in the previous frame. When occlusion occurs, the target may not be in the search area, as shown by the red dotted line in Figure 2. This may cause KCF-based trackers using cyclically shifted sampling to fail to track the target. Hence, in order to broaden the search scope of the candidate region for tracking, this paper proposes a scheme of combining cyclically shifted sampling with random sampling, which is used to track the target when the reliability of the tracking results using cyclically shifted sampling is insufficient. This combination scheme contains two modules: sampling and detection.

1) SAMPLING
If $[x_{t-1}, y_{t-1}]$ is the location of the target in frame $t-1$, then

$$[x^i_t, y^i_t]^T = [x_{t-1}, y_{t-1}]^T + \mathcal{N}(0, \Sigma_s) \tag{9}$$

where $(x^i_t, y^i_t)$ is the location of the $i$-th random sample in frame $t$, each location $i = 1, 2, \cdots, \eta$ represents a tracking candidate region with the corresponding feature map $Y^i_t \in \mathbb{R}^{m \times n \times C}$, $\eta$ is the number of random samples, and $\mathcal{N}(0, \Sigma_s)$ is Gaussian white noise with standard deviation $\Sigma_s$.
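A minimal sketch of the random sampling step is given below. It assumes that the noise is isotropic, i.e., the same standard deviation is applied independently to both coordinates; the function name and the optional `rng` argument are hypothetical.

```python
import random


def sample_candidates(prev_xy, std, count, rng=None):
    """Draw `count` candidate centres around the previous target location."""
    rng = rng or random.Random()
    x0, y0 = prev_xy
    # Each candidate is the previous location perturbed by zero-mean Gaussian noise.
    return [(x0 + rng.gauss(0.0, std), y0 + rng.gauss(0.0, std))
            for _ in range(count)]
```

For example, `sample_candidates((120.0, 80.0), std=40.0, count=50)` returns 50 candidate centres, each of which would then be cropped, cyclically shifted and scored by the regression model.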

2) DETECTION
As shown in Figure 2, where '*' denotes the center points of the candidate regions obtained by random sampling, random sampling expands the search scope and helps the target to be re-tracked after occlusion occurs. The proposed algorithm only enables random sampling if the tracking result of cyclically shifted sampling is unreliable, which indicates that the target may have temporarily disappeared from the image frame. Consequently, to avoid corrupting the regression model, we do not update it during the process of random sampling.

D. STRATEGY OF COMBINING CYCLICALLY SHIFTED WITH RANDOM SAMPLING IN DEEP FEATURE SPACE
Feature representation plays a significant role in all tracking algorithms. Handcrafted features, e.g., HOG, have been widely used in many KCF-based trackers and achieve good performance. In recent years, with the development of deep learning, it has been shown that the deep features extracted from a pre-trained CNN model exhibit better performance than handcrafted features in the same tracking framework. Thus, following the conclusion of Ref. [10], the proposed algorithm employs the activation of the fifth convolutional layer of a pre-trained VGG-2048 network as the deep features, which replace the handcrafted features when the tracking results based on the latter are not reliable. In order to further improve the tracking performance of the proposed algorithm, this paper uses the deep features in the scheme of combining cyclically shifted sampling with random sampling, and thus forms a strategy of combining cyclically shifted sampling with random sampling in deep feature space. When the evaluation shows that the result obtained by Algorithm 1 is unreliable, this strategy (Algorithm 2) is used to improve the tracking performance.
A new tracking algorithm is proposed by integrating the online assessment of the response map and the strategy of combining cyclically shifted sampling with random sampling in deep feature space into the KCF framework.
In the first S frames of a test video sequence, the proposed algorithm trains two regression models based on handcrafted and deep features, respectively, and initializes a 2-D Gaussian estimation model for response map assessment. In the subsequent frames, if the evaluation result of the response map obtained with the handcrafted feature-based regression model is reliable, this regression model is updated using the fixed learning rate. Otherwise, this model is not updated, and the strategy of combining cyclically shifted sampling with random sampling in deep feature space is employed to track the target. The deep feature-based regression model is updated using a fixed learning rate every k frames if the tracking result of this frame is reliable. It follows that using either of them alone cannot effectively improve the tracking performance of the existing framework. The proposed tracking algorithm is summarized in Algorithm 3.
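As a rough illustration of the switching rule in the paragraph above (not a substitute for Algorithm 3), the per-frame decision for frames after the first S can be written as a small function. The function name, the dictionary keys and the reading that "reliable" refers to the assessment of the handcrafted-feature response map are assumptions made for this sketch.

```python
def plan_frame(t, S, k, hog_reliable):
    """Decide which tracker handles frame t (t > S) and which models to update."""
    if hog_reliable:
        return {
            "tracker": "hog_cyclic",               # basic KCF result is kept
            "update_hog": True,                    # fixed learning rate update
            "update_deep": (t - S) % k == 0,       # every k frames when reliable
        }
    return {
        "tracker": "deep_hybrid",                  # cyclic + random sampling, deep features
        "update_hog": False,                       # avoid contaminating the model
        "update_deep": False,
    }
```

For example, with S = 10 and k = 5, a reliable frame 20 updates both models, a reliable frame 21 updates only the handcrafted-feature model, and any unreliable frame updates neither.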

IV. EXPERIMENTAL RESULTS
In this section, we conduct experiments to evaluate the proposed tracking algorithm on two challenging public benchmark datasets, namely the OTB-2013 Visual Tracker Benchmark with 50 image sequences [20] and its updated version OTB-2015 with 100 image sequences [21].
Algorithm 3 The Proposed Tracking Algorithm
Input: Test sequence; bounding box (x_1, y_1); S, representing the first S frames; k, the update interval of the deep feature-based regression model; L, the total number of frames of the test sequence.
Initialize the regression models.
Input the first frame image, t = 1; according to the bounding box (x_1, y_1), use the HOG descriptor and a pre-trained VGG-2048 network to calculate the handcrafted feature Y_1 and the deep feature Y^D_1, respectively, and initialize two regression models.
…
According to the frame image t and (x_{t−1}, y_{t−1}), calculate the handcrafted feature Y_t;
3. …
Update the 2-D Gaussian model (u_t, Σ_t) for response map evaluation using Algorithm 1.
…
Output: Use Algorithm 2 to obtain the tracking result (x_t, y_t) of frame t.
End if
End if
End for
Return output
* mod((t − S)/k) = 0 denotes that the result of dividing t − S by k is an integer, and means that the regression model is updated using a fixed learning rate every k frames, provided that the tracking result of this frame is reliable.

The OTB datasets involve 11 attributes: occlusion (OCC) occurring in 48 test sequences, fast motion (FM) in 43 sequences, illumination variation (IV) in 38 sequences, motion blur (MB) in 31 sequences, deformation (DEF) in 45 sequences, out-of-plane rotation (OPR) in 63 sequences, scale variation (SV) in 65 sequences, background clutter (BC) in 30 sequences, out-of-view (OV) in 14 sequences, in-plane rotation (IPR) in 51 sequences, and low resolution (LR) in 10 sequences. One-pass evaluation (OPE), which runs the tracker throughout a test sequence with initialization from the ground-truth position in the first frame, is used to objectively evaluate the performance of trackers by two indicators: the precision plot and the success plot. The precision plot is defined as the percentage of frames whose average Euclidean distance between the center positions of the tracked bounding box and the ground truth is less than the given threshold [20], [21].
The success plot denotes the percentage of successful frames whose overlap rate between the tracked bounding box and the ground-truth bounding box is larger than the given threshold [20], [21]. Evaluated trackers are ranked by the area under the curve (AUC) of each success plot.
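The two indicators can be computed from per-frame bounding boxes as sketched below. This is a minimal illustration, assuming boxes in (x, y, width, height) format and approximating the AUC by the mean success rate over evenly spaced overlap thresholds in [0, 1]; the function names and the number of thresholds are hypothetical choices.

```python
import math


def center_error(a, b):
    """Euclidean distance between the centres of two (x, y, w, h) boxes."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def overlap(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision(pred, gt, threshold):
    """Fraction of frames whose centre error is below the threshold (pixels)."""
    return sum(center_error(p, g) < threshold for p, g in zip(pred, gt)) / len(gt)


def success_rate(pred, gt, threshold):
    """Fraction of frames whose overlap exceeds the threshold."""
    return sum(overlap(p, g) > threshold for p, g in zip(pred, gt)) / len(gt)


def success_auc(pred, gt, steps=21):
    """Approximate area under the success plot by averaging over thresholds."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    return sum(success_rate(pred, gt, th) for th in thresholds) / steps
```

Sweeping the threshold argument of `precision` and `success_rate` over a range of values yields the precision plot and the success plot, respectively.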
The remainder of this section consists of three parts. The first part describes the experimental setup. The second part analyzes the contribution of the two proposed modules by comparing the proposed algorithm with trackers that lack the online assessment of the response map and the scheme of combining cyclically shifted sampling with random sampling. In the last part, we compare our tracker with state-of-the-art trackers.

A. EXPERIMENTAL SETUP
We run our proposed tracker in MATLAB 2016a on a PC with an Intel i7-7700 CPU (2.8 GHz) and 8 GB of memory. All experiments are carried out using the following parameters. For the KCF framework, we follow the KCF defaults: σ of the Gaussian kernel is set to 0.5, the regularization parameter λ = 0.001 and the learning rate δ = 0.01. The VGG-2048 network for deep feature extraction can be downloaded from the MatConvNet toolkit (http://www.vlfeat.org/matconvnet/pretrained/). The image patch for deep feature extraction is resized to 224 × 224 × 3 by bilinear interpolation. Moreover, our proposed algorithm directly uses the scale detection module of the DSST to handle scale variation [38].
In Algorithm 1, the threshold ε is set to 0.01. This threshold has a significant influence on how frequently Algorithm 2 is invoked during tracking and is discussed in the ablation study of this section.
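How a single threshold can control the frequency of the fallback can be illustrated with a small sketch. This is not Algorithm 1: the unnormalized Gaussian score, the rule that a frame is unreliable when the score falls below ε, and all function names are assumptions made for illustration. The 2-D feature stands for a descriptor of the response map such as its peak and area ratio.

```python
import numpy as np

def reliability_score(feature, mean, cov):
    """Assumed score in (0, 1]: exp(-0.5 * squared Mahalanobis distance)."""
    diff = np.asarray(feature, float) - np.asarray(mean, float)
    d2 = diff @ np.linalg.solve(np.asarray(cov, float), diff)
    return float(np.exp(-0.5 * d2))

def is_unreliable(feature, mean, cov, eps=0.01):
    """Assumed trigger: the frame is flagged when its score drops below eps."""
    return reliability_score(feature, mean, cov) < eps

def count_activations(features, mean, cov, eps):
    """Number of frames in which the fallback strategy would be triggered."""
    return sum(is_unreliable(f, mean, cov, eps) for f in features)

# Hypothetical model and per-frame features, for illustration only.
model_mean = [0.0, 0.0]
model_cov = [[1.0, 0.0], [0.0, 1.0]]
frame_features = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
activations_small_eps = count_activations(frame_features, model_mean, model_cov, 0.01)
activations_large_eps = count_activations(frame_features, model_mean, model_cov, 0.2)
```

Under these assumptions a larger ε flags more frames as unreliable, which matches the qualitative behavior described for ε: the fallback runs more often and the tracker slows down.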
For Algorithm 2, the value of η, the number of random samples, directly affects the computational cost of random sampling. To ensure that random sampling can be fully exploited, we set η to 50. The search-range parameter of Eq. (9) directly determines the search range of random sampling. For a fixed η, the search area may not be covered effectively if the search range is too large. On the contrary, random sampling degenerates into an exhaustive search if the search range is too small. Thus, after analyzing the target displacement between adjacent frames of the test data set, we set the search-range parameter of Eq. (9) equal to the width of the ground-truth bounding box in the first frame of each test sequence. The combined effect of these two parameters on tracking performance is discussed further in the ablation study.
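The interplay between the number of samples and the search range can be sketched as follows. Eq. (9) is not reproduced here, so the uniform sampling over a square window, the function names, and the `search_range` argument are assumptions made for illustration only.

```python
import numpy as np

def sample_candidates(center, search_range, eta=50, rng=None):
    """Draw eta candidate centers uniformly within +/- search_range of center.

    Uniform sampling over a square window is an assumption of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    offsets = rng.uniform(-search_range, search_range, size=(eta, 2))
    return np.asarray(center, float) + offsets

def sampling_density(search_range, eta=50):
    """Samples per unit area of the square search window."""
    return eta / (2.0 * search_range) ** 2

# Hypothetical values: previous target center and a first-frame box width of 40 pixels.
candidates = sample_candidates((100.0, 80.0), search_range=40.0, eta=50,
                               rng=np.random.default_rng(0))
```

With η fixed, doubling the search range divides the density returned by `sampling_density` by four, which is why a very large search range may leave parts of the search area uncovered.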
In Algorithm 3, two parameters, S and k, which govern the training and update of the deep feature-based regression model, need to be preset. Considering that the traditional KCF can normally track the target successfully in the first 20 frames of all test sequences used in our experiment, we set S to 20, so that the first 20 frames are used for training. The update interval k of the deep feature-based regression model is also set to 20. In the ablation study, we discuss the effects of these two parameters on tracking performance.
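The update schedule implied by S and k can be sketched as below. The function name and the choice to skip updates during the first S frames are assumptions of this sketch; the condition itself follows the rule that (t − S) must be an integer multiple of k and that the tracking result of the current frame must be reliable.

```python
def should_update_deep_model(t, S=20, k=20, reliable=True):
    """Return True when the deep feature-based regression model is updated at frame t.

    Assumes no update happens during the first S frames, which are used for training.
    """
    if t <= S:
        return False
    return reliable and (t - S) % k == 0

# Frames at which an update would occur in a 100-frame sequence with reliable results.
update_frames = [t for t in range(1, 101) if should_update_deep_model(t)]
```

With S = 20 and k = 20, updates are attempted at frames 40, 60, 80 and 100, and an attempt is skipped whenever the frame is assessed as unreliable.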

B. FUNCTION OF TWO MODULES
In this section, we analyze the function and impact of the two proposed modules, namely the online assessment of the response map and the strategy of combining cyclically shifted sampling with random sampling in deep feature space, through comparative experiments between the proposed tracker and three other trackers on the OTB-2013 dataset. Using either module alone cannot effectively improve the tracking performance of the existing framework, because Algorithm 1 is the trigger for Algorithm 2 and one module must be followed by the other. As described above, without these two modules the proposed algorithm degenerates to a KCF tracker. Therefore, including the KCF in the comparative experiment of this section tests the function of the two modules. To evaluate the influence of online assessment and random sampling without deep features, we also investigate the tracking performance obtained with the handcrafted feature instead of deep features in Algorithm 2; this variant is referred to as 'without deep feature' in Figure 3. Figure 3 shows the precision plots and success plots of the comparative experiments on the OTB-2013 dataset. From Figure 3, it can be clearly observed that our tracker, which integrates both modules, has a significant advantage in the precision plots over the KCF and 'without deep feature' trackers. The results in Figure 3 confirm that integrating the two modules improves the performance of a tracker. Moreover, the use of deep features in the strategy of combining cyclically shifted sampling with random sampling plays an important role in improving the tracking performance.
Considering that the deep feature used in this paper comes from the DeepSRDCF tracker, we select DeepSRDCF as one of the trackers for comparison in order to evaluate the effectiveness of the deep feature. As mentioned above, the SRDCF framework used by the DeepSRDCF tracker is superior to the standard DCF and KCF because it introduces a spatial regularization component to mitigate the boundary effect [10]. Nevertheless, Figure 3 shows that the proposed tracker achieves tracking performance similar to that of DeepSRDCF by integrating the online assessment and the improved sampling strategy into the KCF framework. As mentioned above, the main contribution of this paper is to introduce two modules into an existing tracking framework, and this is not restricted to the KCF framework discussed in this paper. It is therefore expected that the two modules can also be introduced into the SRDCF framework to further improve its performance. Furthermore, the real-time performance of the SRDCF framework is much worse. Specifically, as shown in Table 2, the computational speed of the DeepSRDCF tracker on our experimental platform is only about 2 frames per second (FPS), far less than the 12 FPS of our tracker.

C. ABLATION STUDIES
The threshold ε in Algorithm 1 is the most important tuning parameter and directly determines the evaluation result of the response map. When ε is too small, most of the tracking results are assessed to be reliable, so the benefit of the proposed approach cannot be fully realized and the tracking performance is not improved significantly. On the contrary, if ε is too large, the real-time performance of the algorithm is significantly reduced, since the strategy of combining cyclically shifted sampling with random sampling in deep feature space is employed quite frequently. Taking Jogging as an example, Figure 4 shows how the number of activations of this strategy changes with ε. Furthermore, we compare five different values of ε on the OTB-2013 dataset; the results are shown in Table 1. Tracking speed in FPS is used to evaluate the real-time performance of the tracker. Table 1 shows that the FPS decreases rapidly as ε increases. When ε exceeds 0.01, the gains in precision and success rates become much smaller. Therefore, to balance real-time performance and tracking performance, we choose ε = 0.01 in this paper.

In Algorithm 2, η and the search-range parameter directly determine the search range of random sampling and the computational speed of Algorithm 2. To analyze their effect on tracking performance, we compare the precision rates, success rates and FPS values obtained with different values of η and of the search-range parameter on the OTB-2013 dataset; the results are shown in Table 2, where ε = 0.01. From Table 2, we can observe two trends: (1) When η is held constant, the precision rate and success rate first increase and then decrease as the search area expands. (2) When the search range is held constant, increasing the number of random samples improves the tracking performance but reduces the real-time performance.

The second trend is easy to understand. An increase in the number of samples inevitably leads to both a higher computational burden and a higher search density within a given area. The former causes the decline in real-time performance, and the latter explains the improvement in tracking performance. For the first trend, we believe that the initial improvement in tracking performance arises because expanding the search area yields more candidate regions. Conversely, once the search area has been enlarged beyond a certain extent, the number of samples does not increase in proportion to the search area, so the sampling density decreases, some candidate regions containing the target are missed, and more background interference is introduced. As a result, the tracking performance is degraded.

In Algorithm 3, the tracking results of the first S frames of each test sequence, which are obtained by the traditional KCF, are used to train a deep feature-based regression model. Since the traditional KCF can successfully track the targets in the first 20 frames of most test sequences, Table 3 shows that the tracking performance differs little between S = 10 and S = 20. When S = 30, however, some false results (between the 20th and 30th frames) from the traditional KCF tracker may contaminate the deep feature-based regression model, resulting in degraded tracking performance. The tracking performance indexes for three different update intervals of the deep feature-based regression model are shown in Table 4. Table 4 indicates that changes in the value of k have little effect on tracking performance.
We further use the image sequences annotated with the 11 attributes to comprehensively evaluate the performance of the trackers. Figures 6 and 7 show the precision and success plots, respectively, of the proposed tracker and the other 8 trackers on OTB-2015. Although no tracker performs best on all 11 attributes, the proposed tracker performs well on most of them. Specifically, the proposed algorithm achieves the best performance on 9 attributes in terms of the precision rate, including illumination variation (82.1%), out-of-plane rotation (80.9%), scale variation (77.5%), occlusion (77.2%), deformation (79%), motion blur (78.7%), out of view (72.5%), background clutter (86.1%) and low resolution (87.6%). In terms of the success rate, the proposed algorithm significantly outperforms the compared trackers on 6 attributes, including out-of-plane rotation (57.6%), occlusion (56.6%), deformation (57.5%), in-plane rotation (56%), out of view (54.6%) and background clutter (60.9%). It can also be seen that the proposed algorithm is significantly more robust to these challenges than the other 8 tracking algorithms.
Moreover, Table 5 shows the speeds of nine algorithms in FPS, obtained as average values when running OTB-2015 on our computational platform. Because the SiamFC needs to run on a GPU, we do not test its computational speed in this comparison of real-time performance. Among the nine trackers, the proposed tracker ranks third. Although the proposed method is slower than the Staple and DSST, Figure 5 shows that its tracking performance is significantly better than that of these two trackers. Table 5 also demonstrates that, although the proposed algorithm uses deep features, it is still much faster than the SRDCF, which does not use deep features, and the CNT, which is one of the CNN-based trackers.

2) QUALITATIVE EVALUATION
This section provides a qualitative analysis of the proposed algorithm, the tracker without the two proposed modules and the other eight algorithms, with the tracking results shown in Figure 8. In Girl2, when full occlusion occurs, only our algorithm tracks the target successfully at Frame 144. In Human3, when partial occlusions occur as the target crosses a pole and passes other pedestrians, only our algorithm and MEEM accomplish the tracking task at Frame 144. This clearly shows that the proposed algorithm performs well in re-tracking the target when it reappears after being occluded. In MotorRolling, although all the other nine algorithms fail to capture the target, the proposed algorithm captures the rotated target successfully. In Biker, the head of the biker moves quickly from left to right. In Jumping, the player bounces up and down at a high rate. For these two sequences, only our algorithm, CNT and SiamFC_3S capture the fast-moving target reliably. In Human6, the target moves out of view in Frames 380 and 548. Our algorithm re-tracks the target when it re-enters the field of view in Frames 385 and 554, respectively. Motion blur occurs when the target region is blurred due to the motion of the target or the camera. In BlurOwl, our algorithm, SRDCF and MEEM perform better than the other algorithms in coping with this challenging condition.

V. CONCLUSION
This paper aims to improve the robustness and reliability of visual tracking in challenging operating conditions. Within the promising KCF framework, two new functional modules have been proposed and developed to further enhance its tracking performance. An online assessment method has been proposed to evaluate tracking performance and reliability based on the response map. To this end, a new criterion was developed by constructing a 2-D Gaussian estimation model based on the peak and the area ratio of independence regions in the response map defined in the paper. For the case in which the tracking performance is assessed to be unreliable, a strategy of combining cyclically shifted sampling with random sampling was proposed to improve the tracking performance. These two modules are then integrated into the current KCF tracker to constitute a new tracking algorithm. Within this framework, deep features have also been exploited to further enhance tracking performance and reliability. We extensively tested our algorithm on two well-documented benchmark datasets with very encouraging results. Detailed qualitative and quantitative analysis and comparisons with eight existing competitive tracking algorithms clearly demonstrate the attractive tracking performance of the proposed algorithm in terms of accuracy and reliability, without a significant increase in the computational burden, in coping with a wide range of challenging conditions, including illumination variation, out-of-plane rotation, scale variation, occlusion, deformation, motion blur, out of view, background clutter and low resolution. A tuning parameter is introduced to trade off the reliability and accuracy of the tracking against its real-time performance.
The proposed online performance and reliability assessment method could find a wide range of applications, such as real-time monitoring of tracking performance, and characterization of the confidence or uncertainty level of the visual tracking information for the purpose of data fusion with other sensing sources or as an input to follow-on decision making. It is expected to have significant implications for the wider application of visual tracking, particularly in safety-critical situations such as autonomous driving. This will be explored in our future work.

ZHOU-YU ZHANG received the B.E. and M.E. degrees from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2014 and 2017, respectively, where he is currently pursuing the Ph.D. degree in guidance, navigation and control. His current research interests include sense and avoid, object detection, and object tracking.