A Scale-Adaptive Particle Filter Tracking Algorithm Based on Offline Trained Multi-Domain Deep Network

In this paper, an MDSPF method is proposed to learn a robust observation model for representing targets by training a CNN on a number of video sequences. The CNN architecture is composed of three shared convolutional units, two shared fully connected (Fc) units and a multi-domain Fc unit, and it is trained offline with a multi-domain learning strategy. After training, the shared convolutional units are retained as the observation model of our tracking framework. The features from the shared convolutional units adapt well to the challenges in tracking tasks. A scale-adaptive particle filter is also proposed in our framework to improve the robustness of the particle filter method. Unlike most existing particle filter trackers, it can efficiently shepherd each particle towards a more precise location and scale through similarity evaluation. Extensive experiments are conducted on the Object Tracking Benchmark (OTB), UAV123 and LaSOT datasets to verify the effectiveness of the proposed method.


I. INTRODUCTION
Visual tracking is the problem of estimating the location, shape, motion trajectory, and size of a target in the subsequent frames of a sequence when only its initial state in the first frame is given [1]-[4]. Visual tracking is a fundamental issue in computer vision and video processing, and it has been used in a wide range of applications, such as intelligent video surveillance, autonomous driving, robot navigation, and human-computer interaction [5]-[7]. During the past decades, a number of visual tracking algorithms have been proposed for different tasks [8], [9].
Existing trackers can be divided into two main branches: discriminative methods and generative methods. Generative methods learn a statistical model to describe the target appearance, and targets are located through generative processes [10]. The incremental learning tracker (IVT) [11] and the L1 tracker (L1T) [12] are two popular generative trackers: the former applies principal component analysis (PCA) to represent a target, while the latter represents a target by a sparse combination of over-complete basis vectors. Discriminative methods regard tracking as a binary classification problem that distinguishes the target from the background, and they have become popular for their good performance [13]. Many classical methods have been proposed, including structured output tracking with kernels (Struck) [14], kernelized correlation filters (KCF) [15], and tracking-learning-detection (TLD) [16]. Recently, several works have extended these methods.
The associate editor coordinating the review of this manuscript and approving it for publication was Guitao Cao.
Generative trackers usually produce better results in less complex scenarios, while discriminative trackers are more robust to complexity because background information is taken into consideration. In real applications, a robust observation model is essential for a discriminative tracker to distinguish the target from a complex background, and most existing trackers build the observation model on hand-crafted features. Hand-crafted features have achieved good performance in many tracking tasks; however, they adapt poorly to practical scenarios with various challenge factors. For instance, color features are robust to the deformation of targets but may fail when the illumination varies. Although feature fusion and multi-information integration methods are adopted to improve the robustness in some works [17], it is essential to explore the deep semantic features of targets and construct a robust observation model to improve the performance of target identification.
In recent years, convolutional neural networks (CNNs) have achieved great success in the computer vision field [18], [19]. Owing to their strong ability to describe the appearance of targets, CNN features have proved effective for visual tracking in various scenarios [9]. A large number of training samples is necessary to train a CNN, but training samples are usually insufficient in tracking tasks because only the bounding box in the first frame is given. Most CNN-based tracking methods either train a CNN offline on large-scale image datasets or directly use a pre-trained CNN with a fine-tuning process. It is difficult for those CNN-based trackers to learn a unified representation for video sequences with various challenges, including deformation, fast motion, and illumination variation.
In addition, most discriminative trackers initialize the discriminative model from the initial frame and update the classifier online to increase the robustness to the deformation of targets, but the discriminative model may fail to track targets in the subsequent frames once drifting occurs in one frame. One reason for this phenomenon is that the sampling strategy does not support the tracking method. Most sampling strategies generate candidate samples based on the current frame and ignore the location information of prior frames. When a target drifts in the current frame, the samples generated through those strategies become unreliable and the models are updated with wrong samples.
Most existing CNN models generate candidate samples based on the current frame, and the unreliable motion search strategy may cause a failure in the online tracking process. Particle filters can recursively predict the states of targets based on prior information. Although this improves the reliability of candidate samples, the hand-crafted appearance models of particle filters lack background information and a deep semantic understanding of targets. It is therefore natural to combine a CNN model with a particle filter, and a scale-adaptive particle filter (PF) tracking algorithm is proposed by exploring the features of an offline trained multi-domain network (MDSPF). Our work is carried out from the following two aspects.
(1) A multi-domain CNN architecture is trained offline on several video sequences to learn a robust observation model of targets. During offline training, videos of different scenarios are regarded as separate domains with their own specific characteristics. After the training phase, the common characteristics have been learned by the multi-domain training strategy, and the feature maps of the shared convolutional layers are retained as an observation model to represent the appearance of targets. As a result, general features including deep semantic information of targets and backgrounds are obtained, and these features make the particle filter adaptive to different tracking tasks with various challenge factors.
(2) The tracking process is regarded as a probabilistic problem that matches a target by propagating the posterior density. In general, dense sampling is adopted in particle filter methods to predict all possible states of targets. To improve the reliability and efficiency of the particles generated in a new frame, a novel scale-adaptive particle filter is designed. The reference samples and particles are generated based on the initial frame. To help the samples handle the scale variation of targets, candidate samples with various scales are generated around each particle through a Gaussian distribution. The particles are then shepherded and weighted by evaluating their similarities with the reference samples. Compared with hand-crafted observation models, the weights obtained by our observation model are more reliable. Finally, the state of the target is calculated from the weighted particles, and the particles are updated through sequential importance sampling.
The main contributions are listed as follows:
• A multi-domain CNN architecture is adopted to learn offline a robust observation model that can adapt to tracking tasks with different challenge factors. The features of the trained shared convolutional layers achieve good performance in discriminating the target from the background.
• A scale-adaptive particle filter is designed to cope with online tracking tasks. The proposed framework predicts the most probable location based on Bayesian sequential importance sampling of particles, and the particles are shepherded through similarity evaluation to adapt to scale variation.
• Extensive experiments are conducted on the OTB, UAV123 and LaSOT datasets to demonstrate the robustness of the proposed MDSPF algorithm, and the performance of our method is compared with that of some state-of-the-art trackers.
This paper is organized as follows. In Section II, the related works are introduced. The MDSPF algorithm is presented in Section III. Section IV illustrates the details of the proposed online tracking framework. Section V shows the experimental results to demonstrate the performance of our proposed MDSPF method.

II. RELATED WORKS
In this section, we introduce particle filter (PF) based trackers, correlation filter (CF) based trackers and deep trackers.
VOLUME 8, 2020

A. PF-BASED TRACKERS
PF-based trackers generally evaluate the state of a target through a set of particles. Dense sampling is generally adopted to obtain sufficient particles because it can cover all possible states, but it is time consuming because every possible state needs to be estimated. To solve this problem, many methods have been proposed to optimize the sampling of particles. In [20], a subspace representation was adopted to improve the efficiency of the particle filter. A factored sampling algorithm was proposed to generate better particles by combining the previous configuration with additional information [21]. Ross et al. [11] presented an incremental observation model to learn the changes in target appearance. Liu et al. [22] proposed a Bayesian inference framework and a structural constraint mask to improve the robustness. Zhang et al. [23] combined the traditional particle filter with a correlation filter algorithm and put forward a multi-task correlation particle filter (MCPF), which uses the correlation filter response to guide the sampled particles towards the modes of the sample distribution.
Different from the PF-based trackers mentioned above, our observation model is built on a multi-domain deep convolutional network, which is robust in representing the appearance of targets, and a scale-adaptive particle sampling strategy is proposed to handle scale variation.

B. CF-BASED TRACKERS
Owing to their computational efficiency and robustness, CF-based methods have drawn much attention recently. Bolme et al. [24] applied correlation filters to visual tracking, and the proposed method trained the filter on grayscale features by minimizing the output sum of squared error (MOSSE). Henriques et al. [25] proposed a circulant structure with kernels (CSK), which dealt with the problem of insufficient samples by exploiting the circulant structure of shifted image patches. To further improve the performance of CF-based trackers, many works have been presented on feature representation.
Based on the CSK method, the KCF tracker extended the grayscale feature to the HOG feature and gave access to multichannel features [15]. Danelljan et al. [26] extended the features of the RGB color space to 11 channels to improve the robustness of correlation trackers. Bertinetto et al. [27] combined the HOG feature with a color histogram, and a feature fusion method was proposed to improve the accuracy. However, those trackers do not adapt well to the scale variation of targets.
In order to handle the scale variation in tracking tasks, the scale adaptive with multiple features tracker (SAMF) [28] and the discriminative scale space tracker (DSST) [29] were proposed based on KCF. In the SAMF method, seven scales were defined directly to estimate the scale variation of targets. In the DSST method, a translation filter and a scale filter were trained independently to obtain the location and scale.
Different from the CF-based trackers mentioned above, our method adopts deep CNN features instead of hand-crafted features to represent the appearance of targets, and the particle filter framework is used for online tracking.

C. DEEP TRACKERS
CNNs have shown good performance in representing targets in visual tracking tasks. Wang and Yeung [30] and Wang et al. [31] proposed a deep learning tracker (DLT) and extended the stacked denoising autoencoder to a CNN with spatial pyramid pooling. Hong et al. [32] combined a pre-trained CNN with a support vector machine (CNN-SVM), which added an extra online SVM layer on top of the hidden layer to learn the appearance of the target object by calculating a discriminative saliency map.
Furthermore, some works have been conducted to exploit the power of CNNs in visual tracking. He et al. [33] introduced a twofold Siamese network that fuses a semantic model with an appearance model for tracking a target. Chen et al. [34] combined a multi-attention model with long short-term memory (LSTM) to improve the robustness of the deep tracker. Danelljan et al. [35] proposed continuous convolution operators for tracking (CCOT) to fuse traditional features and deep features from different spatial pyramids. In order to reduce redundancy, efficient convolution operators (ECO) were proposed based on CCOT [36]. Different from the above deep trackers, the sequence training based convolutional network (STCT) connected multiple weak classifiers after feature extraction and combined the weak classifiers into a strong classifier to improve the tracking accuracy [37]. Nam and Han [38] put forward a multi-domain network (MDNet) to learn generic features for tracking tasks through a multi-domain learning strategy. Zhong et al. [39] proposed a hierarchical tracking method that divides the search process into a coarse level and a fine level; the tracker achieved coarse-to-fine verification by combining reinforcement learning with a CF-based model. To handle the occlusion problem with motion reasoning for multi-person tracking, Zhou et al. [40] designed a novel deep alignment network and a robust coarse-to-fine scheme, which showed good robustness for multi-person tracking.

III. PROPOSED ALGORITHM
To learn a robust observation model, the tracking task is addressed with a scale-adaptive particle filter in our proposed multi-domain scale-adaptive particle filter (MDSPF) algorithm. The proposed tracking framework is illustrated in Figure 1. First, a multi-domain network is trained offline on video datasets. In the online tracking process, the reference samples and particles are initialized in the first frame, and the features of the samples are extracted through the pre-trained shared CNN layers. After that, the state of the target in a new frame is estimated by the scale-adaptive particle filter and bounding box regression. Each particle is shepherded towards a probable scale through a scale estimation process and is updated sequentially. The network architecture of our proposed method is presented in subsection III-A, and subsection III-B introduces the offline training strategy of the CNN. The similarity estimation and scale-adaptive particle filter algorithms are illustrated in subsections III-C and III-D.

A. NETWORK ARCHITECTURE
The offline training framework and the CNN architecture are shown in Figure 2. In this section, the architecture of the shared convolutional layers is introduced. The convolutional layers are the same as the corresponding parts of the VGG-M network [41], and the dimension of the feature maps varies with the input size. As shown in Figure 2, the original input is an image of fixed size, and there are three convolutional units: Conv1, Conv2 and Conv3. Both Conv1 and Conv2 consist of one convolutional layer, one ReLU layer, one batch normalization layer and one max pooling layer, while Conv3 has only one convolutional layer and one ReLU layer. The features are extracted from the last layer of Conv3, and they are concatenated into a vector that serves as the feature in the online tracking process.
The shared convolutional layers are relatively small compared with the architectures adopted in typical object detection or recognition tasks. In general, a visual tracking task is regarded as a binary classification problem, so the required model complexity is much lower than in general detection or recognition tasks. Spatial information becomes diluted as the network deepens, and the features have a negative influence on locating targets when the CNN goes too deep [38].

B. TRAINING STRATEGY
Since a CNN pre-trained on object recognition datasets cannot effectively represent the discriminative information between background and target, the offline training technique of MDNet is adopted to learn common properties that are desirable across different scenes, including illumination variation, motion blur, and deformation. The offline training procedure is also shown in Figure 2, and the training is conducted with the Stochastic Gradient Descent (SGD) method on sequences from different scenes. During offline training, the shared CNN layers are connected to two shared fully connected units (Fc4-Fc5), each of which consists of a fully connected layer, a ReLU layer and a dropout layer. The last fully connected layer (Fc6) has K branches, one for the training samples of each of the K sequences. Both Fc4 and Fc5 have 512 output units, and Fc6 contains K binary classification layers. The network is trained with a softmax cross-entropy loss function defined over the training samples, where $y_i$ is the label of sample $i$, $m$ is the total number of training samples, and $a_i$ denotes the value of the softmax layer, which is calculated from $z_i$, the corresponding softmax layer output of sample $i$ with label $y_i$.
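For orientation, a common way of writing a softmax cross-entropy loss with these symbols is sketched below. This is an assumed standard form rather than a transcription of the method: the class index $c$ over the two outputs of a binary branch, the per-class score $z_{i,c}$, and the averaging factor $1/m$ are introduced here for illustration only, with $z_i = z_{i,y_i}$.

```latex
a_i = \frac{\exp(z_i)}{\sum_{c \in \{0,1\}} \exp(z_{i,c})},
\qquad
L = -\frac{1}{m} \sum_{i=1}^{m} \log a_i
```

Under this form, the loss is small when the branch assigns a high softmax value to the labelled class of every training sample.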
To learn common properties across different scenes, the multi-domain network is updated in the $k$-th iteration with the training samples of the $k$-th sequence, cycling through the K sequences. The training process is repeated until the network converges or the predefined maximum number of iterations is reached. The shared convolutional layers are retained as the observation model after offline training, and the generic features are obtained from the shared convolutional layers during online tracking.
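The domain-cycling schedule described above can be illustrated with a deliberately small toy model. The sketch below is not the network used in this paper: the single linear shared layer, the two-parameter branches, the learning rate, and the function name `train_multi_domain` are all hypothetical, and a sigmoid is used as the two-class equivalent of a softmax. It only shows that the shared parameters are updated in every iteration while each domain branch is updated only in its own iterations.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def train_multi_domain(domains, dim, iterations, lr=0.1, seed=0):
    """Toy multi-domain SGD: one shared layer, one binary branch per domain.

    domains: list of K lists of (feature_vector, label) pairs, label in {0, 1}.
    Returns the shared weights, the K branches and the update count per branch.
    """
    rng = random.Random(seed)
    num_domains = len(domains)
    shared = [rng.uniform(-0.1, 0.1) for _ in range(dim)]  # shared layer (hypothetical)
    branches = [[1.0, 0.0] for _ in range(num_domains)]     # [scale, bias] per domain
    updates = [0] * num_domains

    for it in range(iterations):
        k = it % num_domains               # domain handled in this iteration
        x, y = rng.choice(domains[k])      # one training sample of domain k
        h = sum(w * xi for w, xi in zip(shared, x))
        scale, bias = branches[k]
        p = sigmoid(scale * h + bias)      # two-class softmax equals a sigmoid
        grad = p - y                       # d(cross-entropy) / d(logit)

        # The shared layer is updated in every iteration ...
        for j in range(dim):
            shared[j] -= lr * grad * scale * x[j]
        # ... while only the branch of the current domain is touched.
        branches[k][0] -= lr * grad * h
        branches[k][1] -= lr * grad
        updates[k] += 1

    return shared, branches, updates
```

After training, only `shared` would be kept, which mirrors the way the shared convolutional layers are retained as the observation model while the domain-specific branches are discarded.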

C. SIMILARITY ESTIMATION
The tracking task can be regarded as a patch matching problem between the ground-truth and candidate samples. Euclidean distance is used to measure the similarity, from which the probability of each candidate sample is obtained. Assume that the feature map of a given ground-truth is $x \in \mathbb{R}^{m \times n \times l}$ and the feature map of a candidate sample is $y \in \mathbb{R}^{m \times n \times l}$, where $m$, $n$ and $l$ are the dimensions of the output feature map. The 3-dimensional feature map is transformed into a feature vector, $\mathbb{R}^{m \times n \times l} \to \mathbb{R}^{D}$, where $D = m \times n \times l$ is the dimension of the obtained vector, and min-max normalization is adopted to map each dimension into $(0, 1)$. Samples close to the location of the ground-truth with different scales are used as reference samples, denoted as $X = \{x_p \mid x_p \in \mathbb{R}^D, p = 1, 2, \ldots, N\}$, while candidate samples are denoted as $Y = \{x_q \mid x_q \in \mathbb{R}^D, q = 1, 2, \ldots, M\}$. The Euclidean distance between the reference samples and each candidate sample is then calculated, where $p$ refers to the $p$-th reference sample and $q$ refers to the $q$-th candidate sample, and $N$ and $M$ are the total numbers of reference samples and candidate samples. The probability that a candidate sample is the target is defined as $P_q$, and it is inversely related to this distance.
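A minimal sketch of this similarity estimation is given below. Two choices in it are assumptions made only for illustration and are not taken from the paper: the distances to the N reference samples are aggregated by their mean, and the inverse relationship is realized as $1 / (1 + d)$ followed by normalization over the candidates. The feature vectors are assumed to be already flattened and min-max normalized.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def candidate_probabilities(references, candidates):
    """Score each candidate by its distance to the set of reference samples.

    references: list of N feature vectors x_p.
    candidates: list of M feature vectors x_q.
    Returns M probabilities that sum to 1; a smaller distance gives a larger value.
    """
    # Assumption: aggregate the N distances of a candidate by their mean.
    distances = [
        sum(euclidean(x_p, x_q) for x_p in references) / len(references)
        for x_q in candidates
    ]
    # Assumption: 1 / (1 + d) is one simple inverse relationship.
    scores = [1.0 / (1.0 + d) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]
```

For example, with reference vectors at the origin, a candidate at the origin receives a higher probability than a candidate at distance 5, which is the behaviour the matching step relies on.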

D. SCALE ADAPTIVE PARTICLE FILTER
The proposed scale-adaptive particle filter combines traditional Bayesian sequential importance sampling with stochastic sampling, where the stochastic sampling strategy generates samples with different states from a Gaussian distribution. The particle filter recursively estimates the state of the target (the location and scale of an object) with a finite set of weighted particles. Let $s_t$ denote the state variable of an object at time $t$ and $o_t$ the observation variable. The density $p(s_t \mid o_{1:t-1})$ can be recursively calculated by

$$p(s_t \mid o_{1:t-1}) = \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid o_{1:t-1})\, ds_{t-1}$$

where $p(s_{t-1} \mid o_{1:t-1})$ has been obtained at time $t-1$, $p(s_t \mid s_{t-1})$ predicts the state at the current time $t$, and $o_t$ is the observation feature at time $t$, which is calculated through the CNN. When $o_t$ is available, the state can be updated through

$$p(s_t \mid o_{1:t}) = \frac{p(o_t \mid s_t)\, p(s_t \mid o_{1:t-1})}{p(o_t \mid o_{1:t-1})}$$

where $p(o_t \mid s_t)$ is the likelihood function. The posterior $p(s_t \mid o_{1:t})$ can be approximated by $n$ particles $\{s_t^i\}_{i=1}^{n}$ through

$$p(s_t \mid o_{1:t}) \approx \sum_{i=1}^{n} w_t^i\, \delta(s_t - s_t^i)$$

where $w_t^i$ denotes the weight of particle $i$ at time $t$ and $\delta(\cdot)$ denotes the Dirac delta measure. The weight of particle $i$ is calculated by

$$w_t^i \propto w_{t-1}^i\, \frac{p(o_t \mid s_t^i)\, p(s_t^i \mid s_{t-1}^i)}{q(s_t^i \mid s_{1:t-1}^i, o_{1:t})}$$

where $q(\cdot)$ is the importance density function, which is set equal to $p(s_t^i \mid s_{t-1}^i)$. It follows that $w_t^i \propto w_{t-1}^i\, p(o_t \mid s_t^i)$. To avoid the degeneracy problem, a sequential importance sampling strategy with re-sampling is adopted to retain the heavy particles [42]. In this case, the weight of each re-sampled particle is reset to $w_{t-1}^i = 1/n$ for all $i$. The importance weights are then proportional to the likelihood function $p(o_t \mid s_t^i)$:

$$w_t^i \propto p(o_t \mid s_t^i)$$

The re-sampling strategy retains the important particles according to the weights calculated in the previous step, and the particles are then updated by the likelihood function.
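The weight update and re-sampling step can be sketched as follows. This is a generic illustration of importance weighting with multinomial re-sampling, not the authors' implementation; the function names and the use of `random.Random.choices` are choices made here for the example, and particles are represented by arbitrary Python objects.

```python
import random


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


def update_weights(prev_weights, likelihoods):
    """w_t^i proportional to w_{t-1}^i * p(o_t | s_t^i), then normalized."""
    return normalize([w * l for w, l in zip(prev_weights, likelihoods)])


def resample(particles, weights, rng):
    """Multinomial re-sampling: heavy particles are duplicated, light ones dropped.

    After re-sampling every particle carries the same weight 1 / n, so the next
    weight update is proportional to the likelihood alone.
    """
    n = len(particles)
    chosen = rng.choices(particles, weights=weights, k=n)
    return chosen, [1.0 / n] * n
```

With uniform previous weights, `update_weights` returns the normalized likelihoods, which matches the proportionality between the weights and the likelihood function after re-sampling.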
To adapt to the scale variation of the target, candidate samples with surrounding locations and scales are generated around each particle through a Gaussian distribution. For each candidate sample, the similarity is computed through function (4), and the response is denoted as $R(s_t^i)$. Then, each particle is shepherded towards the most similar location and scale among its candidate samples. The process is defined as a scale estimation (SE) operator $S_{SE}: \mathbb{R}^D \to \mathbb{R}^1$, where $D$ is the dimensionality of the state space of each particle. The state of particle $i$ is shepherded as $s_t^i = S_{SE}(s_t^i)$, and the response $R_{SE}(s_t^i)$ of the SE operator for particle $i$ is defined as the response of the most similar state. The likelihood is then set as $p(o_t \mid s_t^i) = R_{SE}(s_t^i)$. The weight of each shepherded particle is obtained as

$$w_t^i \propto p(o_t \mid s_t^i) = R_{SE}(s_t^i)$$

As a result, the final state of the target at time $t$ is estimated as

$$E[s_t \mid o_{1:t}] \approx \sum_{i=1}^{n} w_t^i\, S_{SE}(s_t^i)$$
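The shepherding and state estimation steps can be sketched as below, under stated assumptions: a state is represented as a tuple such as (x, y, scale), the candidates of each particle have already been generated, and `response` is any callable that returns a non-negative similarity for a candidate state. These names and the tuple representation are hypothetical and serve only to illustrate the weighted combination of shepherded particles.

```python
def shepherd(candidates, response):
    """Move a particle to its most similar candidate state.

    candidates: candidate states generated around one particle.
    response:   callable mapping a state to a non-negative similarity.
    Returns the selected state and its response.
    """
    best = max(candidates, key=response)
    return best, response(best)


def estimate_state(candidates_per_particle, response):
    """Shepherd every particle, weight it by its response, and average.

    Returns the estimated state, the shepherded particles and their
    normalized weights.
    """
    shepherded, weights = [], []
    for candidates in candidates_per_particle:
        state, score = shepherd(candidates, response)
        shepherded.append(state)
        weights.append(score)

    total = sum(weights)
    if total <= 0.0:
        raise ValueError("all responses are zero")
    weights = [w / total for w in weights]

    dim = len(shepherded[0])
    estimate = tuple(
        sum(w * s[d] for w, s in zip(weights, shepherded)) for d in range(dim)
    )
    return estimate, shepherded, weights
```

Because each weight equals the normalized response of the shepherded state, particles whose best candidate matches the reference samples poorly contribute little to the final estimate.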

IV. THE IMPLEMENTATION DETAILS
The implementation details of the tracking framework are described in this section.

A. SAMPLES GENERATE STRATEGY
To track a target in the subsequent frames with the proposed algorithm, it is necessary to stochastically generate a set of candidate predictions. To simplify the computation, the reference samples and particles are drawn from a Gaussian distribution whose mean is the center location and scale of the state at time $t-1$. In this distribution, $r$ denotes the mean of the width and height of the state at time $t-1$, and $a$ and $b$ are preset parameters that determine the range of the locations and scales of the generated samples. For offline training, the positive samples are defined as the candidates with an intersection-over-union (IoU) overlap > 0.7, and the negative samples are those with IoU < 0.5. For online tracking, the reference samples are generated in the first frame as positive samples, and the initial particles are generated with a fixed scale that is the same as that of the ground-truth. For shepherding the particles, the candidate samples are generated around each particle with various scales.
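A sketch of such a sampling and labelling scheme is shown below. The exact covariance of the Gaussian is not reproduced here, so the following are assumptions for illustration: the translation noise has standard deviation `a * r`, the scale factor is `exp(g)` with `g` drawn with standard deviation `b`, and boxes are stored as (cx, cy, w, h). Only the IoU thresholds 0.7 and 0.5 come from the text.

```python
import math
import random


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2.0, box_a[1] - box_a[3] / 2.0
    ax2, ay2 = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx1, by1 = box_b[0] - box_b[2] / 2.0, box_b[1] - box_b[3] / 2.0
    bx2, by2 = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0.0 else 0.0


def draw_samples(prev_box, count, a, b, rng):
    """Draw boxes around the previous state (assumed Gaussian parameterization)."""
    cx, cy, w, h = prev_box
    r = (w + h) / 2.0                      # mean of previous width and height
    samples = []
    for _ in range(count):
        dx = rng.gauss(0.0, a * r)         # assumed translation noise
        dy = rng.gauss(0.0, a * r)
        scale = math.exp(rng.gauss(0.0, b))  # assumed scale noise, always positive
        samples.append((cx + dx, cy + dy, w * scale, h * scale))
    return samples


def training_label(sample, ground_truth, pos_thr=0.7, neg_thr=0.5):
    """Return 1 for a positive, 0 for a negative, None for an ignored sample."""
    overlap = iou(sample, ground_truth)
    if overlap > pos_thr:
        return 1
    if overlap < neg_thr:
        return 0
    return None
```

Samples whose overlap falls between the two thresholds are neither positive nor negative and would simply be skipped when building a training batch.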

B. OFFLINE TRAINING
Video frames from several different scenarios are adopted to train the CNN, and the generic features of the shared convolutional layers are learned. In the training process, the input images are warped to a size of 107×107×3 before being fed into the convolutional layers. For offline training, 50 positive samples and 200 negative samples are selected from the labeled frames of each sequence. The learning rates for the convolutional layers and the fully connected layers are set to 0.0001 and 0.001, with 100K iterations in total in the offline training process, where K is the number of selected video scenes.

C. BOUNDING BOX REGRESSION
The location matched through the above-mentioned function is coarse because of the high-level abstraction of CNN-based features and our sampling strategy. A bounding box regression technique is applied to refine the matched bounding box [43]. The Conv3 features are used to train our linear regression model, and it is trained only in the first frame to avoid increasing the complexity.
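As an illustration of what a bounding box regressor learns, the sketch below encodes and decodes regression targets in the widely used R-CNN style parameterization. Whether the regressor in this paper uses exactly this parameterization is an assumption; the function names and the (cx, cy, w, h) box format are hypothetical. A linear model trained on Conv3 features would predict the four deltas, which are then applied to the coarse box.

```python
import math


def encode_deltas(proposal, target):
    """Regression targets that map a proposal box onto a target box.

    Boxes are (cx, cy, w, h). Offsets are normalized by the proposal size
    and sizes are compared in log space (assumed R-CNN style targets).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def apply_deltas(proposal, deltas):
    """Refine a coarse box with predicted deltas (inverse of encode_deltas)."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    return (px + dx * pw, py + dy * ph, pw * math.exp(dw), ph * math.exp(dh))
```

Applying the encoded deltas of a proposal and target pair to the proposal recovers the target box, so a perfect regressor would fully correct the coarse localization.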

D. ONLINE TRACKING ALGORITHM
The online tracking inference is described as follows:
(1) 500 reference samples with IoU > 0.7 are selected around the ground-truth in the first frame of the tracking video, and 1000 samples are selected from the first frame to train the bounding box regression model.
(2) The feature maps of the reference samples are flattened into 4096 × 1 vectors, and the vectors are retained for the particle filter sampling.
(3) The initial particles are generated around the target with a fixed scale that is the same as that of the initial state. Then 256 candidates are generated around each particle to predict the possible scales and locations in the next frame.
(4) Each particle is shepherded to a weighted coarse state by estimating the similarity between its candidates and the reference samples.
(5) The state is estimated from those weighted states, and the particles are updated through the importance sampling strategy.
(6) The trained bounding box regression model is applied to obtain the final location.

V. EXPERIMENTS
In this section, the proposed MDSPF algorithm is evaluated on the OTB, UAV123 and LaSOT datasets. The datasets are introduced in subsection V-A. Then a series of experiments on the OTB dataset is conducted in V-B to analyze the different settings of our algorithm. Finally, our proposed MDSPF tracker is compared with state-of-the-art trackers on the OTB, UAV123 and LaSOT datasets in V-C. The algorithm is implemented using Matlab and the MatConvNet toolbox [44] and runs at a speed of around 2 fps on an i7-9500 CPU.

A. THE OTB, UAV123 AND LASOT DATASETS
The OTB Dataset: The OTB dataset [45], [46] is a popular benchmark in the field of single target tracking. There are three versions: OTB-50, which contains 50 labeled tracking sequences with bounding box annotations, OTB2013, which contains 50 sequences, and OTB2015 (OTB-100), which contains 100 sequences. The OTB dataset takes 11 attributes into consideration, such as target deformation, illumination variation, and occlusion.
The UAV123 Dataset: The UAV123 dataset [47] is a tracking benchmark captured from an unmanned aerial vehicle (UAV). It contains 123 sequences from an aerial viewpoint with an average length of almost 1,000 frames. Different from the OTB datasets, 12 challenging aspects are taken into consideration in the UAV123 dataset.
The LaSOT Dataset: The LaSOT dataset [48] aims to provide a dedicated platform for training deep trackers and assessing long-term tracking performance. The testing subset includes 280 sequences with an average of 2500 frames per sequence.
Evaluation Method: The success and precision plots are adopted to evaluate different methods on the above datasets. For the success plot, the estimated bounding box in one frame is regarded as successfully predicted if its IoU overlap ratio with the ground-truth exceeds a certain threshold. For the precision plot, the estimated bounding box in one frame is considered successful if the pixel distance between its center and the center of the ground-truth is less than a certain threshold. The overall plots are then obtained with varying threshold values, and the performances of different tracking methods are finally ranked based on the area under curve (AUC) score for the success evaluation and the precision score at a certain threshold for the precision evaluation. In our experiments, the threshold of the precision score is 20 pixels (Prec@20).
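The two ranking scores can be computed from per-frame results as sketched below. The threshold grid of 21 evenly spaced values in [0, 1] and the approximation of the AUC by the mean success rate over that grid are assumptions made for this example, as are the function names; the inputs are one IoU value and one center distance per frame.

```python
def success_rate(ious, threshold):
    """Fraction of frames whose IoU with the ground-truth exceeds the threshold."""
    return sum(1 for o in ious if o > threshold) / len(ious)


def success_auc(ious, num_thresholds=21):
    """Approximate area under the success plot.

    Assumption: thresholds are evenly spaced in [0, 1] and the area is
    approximated by the mean success rate over these thresholds.
    """
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [success_rate(ious, t) for t in thresholds]
    return sum(rates) / len(rates)


def precision_at(center_errors, threshold=20.0):
    """Fraction of frames whose center distance is less than the threshold."""
    return sum(1 for e in center_errors if e < threshold) / len(center_errors)
```

Calling `precision_at(errors)` with the default threshold yields the Prec@20 score, while `success_auc(ious)` gives the single number used to rank trackers on the success plot.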

B. MODEL ANALYSIS
Different settings of our tracking method are tested to verify the contribution of each component. Without the bounding box regression and the particle filter, a series of experiments is conducted to test the performance of different similarity estimation algorithms and of features from different layers. Then the results of using a single reference sample, generative reference samples, bounding box regression and the particle filter are validated. The experimental results are discussed as follows.
The Convolutional Feature of Different Layers: In these experiments, the performance of features from different shared convolutional layers is examined on the OTB2013 dataset, and Table 1 shows the overall AUC and Prec@20 scores of our comparison experiments. The features of Conv3 obtain better results than the features of Conv1 or Conv2, because a deep semantic understanding with discriminative features is learned in Conv3 through the multi-domain learning strategy. This also means that the features in Conv1 and Conv2 include redundant information for discriminating the target from the background; as a result, fusing the features of Conv3 with the features of the other two layers causes a decrease in both AUC and Prec@20 scores. As shown in Table 1, the Conv3 features perform significantly better than the fused features. In addition, max pooling and average pooling are added after Conv3 to reduce the dimension. The resulting AUC and Prec@20 scores are clearly lower than those obtained with the Conv3 features, because the pooling operations cause a loss of discriminative information.
Different Similarity Estimation Algorithms: The results of different matching functions are compared on the OTB2013 dataset. In these experiments, Euclidean distance, Manhattan distance, Chebyshev distance and Cosine similarity are taken into consideration, and the Conv3 features are selected for testing the performance of the different matching functions. As shown in Table 1, the similarity estimation functions based on Manhattan distance, Cosine similarity and Euclidean distance obtain close AUC and Prec@20 scores on the OTB2013 dataset, and they achieve large improvements over the scores of Chebyshev distance. The underlying reason is that Chebyshev distance estimates the similarity from only the maximum difference among all dimensions, which ignores the global differences in the multi-dimensional features. According to the results in Table 1, Euclidean distance is adopted for the following experiments.
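The four matching functions compared above follow their standard definitions and can be written compactly as below. The helper names are chosen here for illustration. The small example after the code shows the effect discussed in the text: Chebyshev distance looks only at the largest per-dimension difference and therefore cannot separate two vectors that differ from the reference in one dimension or in all dimensions by the same amount.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev(u, v):
    # Only the largest per-dimension difference contributes.
    return max(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


reference = [0.0, 0.0, 0.0, 0.0]
one_dim_off = [1.0, 0.0, 0.0, 0.0]   # differs in a single dimension
all_dims_off = [1.0, 1.0, 1.0, 1.0]  # differs in every dimension

# Chebyshev cannot tell the two apart, Euclidean and Manhattan can.
same_chebyshev = chebyshev(reference, one_dim_off) == chebyshev(reference, all_dims_off)
```

In this toy case the Euclidean distances are 1 and 2 and the Manhattan distances are 1 and 4, whereas both Chebyshev distances equal 1.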
Bounding Box Regression and Reference Samples Evaluation: Without bounding box regression, experiments using generative reference samples or a single reference sample are conducted on OTB-50, OTB2013 and OTB2015, and the comparison results are shown in Figure 3. From Figure 3, we can conclude that using generative reference samples significantly improves the success rate, since considering more reference samples makes the results more adaptive to the challenges. The bounding box regression strategy is then added on top of the setting with generative samples. It can be observed that the bounding box regression also contributes an improvement in success rate. The experimental results indicate that bounding box regression and generative reference samples can effectively improve the tracking performance.
Scale-Adaptive Particle Filter Evaluation: The model settings of the particle filter (PF), the single domain scale-adaptive particle filter (SPF) and the MDSPF algorithm are evaluated on the OTB datasets. The results in Figure 3 demonstrate that, benefiting from the multi-domain CNN architecture and the scale-adaptive strategy, the MDSPF method significantly improves the success rate. Furthermore, the experimental results with the scale-adaptive particle filter are better than those using only the similarity estimation.

C. PERFORMANCE COMPARISON
To examine the effectiveness of the proposed method, the MDSPF tracker is compared with some state-of-the-art trackers on the OTB2015, UAV123 and LaSOT datasets in this section.
As shown in Figure 4, the results indicate that our proposed MDSPF method achieves good performance, with an AUC score of 0.605 and a Prec@20 score of 0.811. Compared with the hand-crafted trackers Staple, KCF, DSST, LCT, and MEEM, our offline trained CNN has a great advantage in representing the appearance of targets. Compared with the CNN-based trackers TRACA, SiamTri, CNN-SVM, DCFNet, SiamFC, and CFNet, our MDSPF method also achieves better robustness in locating targets.
Tables 2 and 3 summarize the precision and success scores of all algorithms on the 11 challenging attributes provided in the OTB2015 dataset; the scores in bold type indicate the attributes in which our tracker ranks best. It can be seen that our MDSPF algorithm significantly improves the AUC and Prec@20 scores in most attributes. The improvement in AUC and Prec@20 scores demonstrates that our multi-domain training strategy obtains a robust observation model that can adapt to different challenges in tracking tasks. Note that our method achieves the best performance in the scale variation attribute, which indicates that our method can estimate the scales of targets more precisely because of the scale-adaptive particle filter. Some qualitative evaluation results on different sequences for the top 5 state-of-the-art trackers in our experiments are shown in Figure 6. In the sequences Vase, Jump and MotorRolling, the MDSPF method shows good robustness to the deformation of the targets. In Vase and Human9, the MDSPF method can handle the scale variation of targets. In Girl2 and MotorRolling, the MDSPF method can re-locate the targets well when drifting occurs.
State-of-the-Art Comparison on LaSOT: Figure 9 shows the success and precision plots on the LaSOT dataset.
The proposed method is compared with some state-of-the-art trackers on the testing subset without retraining the CNN model. Those trackers include STRCF [58], ECO_HC [36], CFNet [53], BACF [57], TRACA [49], PTAV [59], CSRDCF [60], Staple_CA [27], and fDSST [29]. Among the compared algorithms, our MDSPF method achieves an AUC score of 0.304 and a Prec@20 score of 0.293. These results show that our MDSPF method obtains relatively good performance compared with those state-of-the-art trackers, which indicates that our method also has an advantage in long-term tracking tasks.

VI. CONCLUSION
To build a robust online tracking framework for visual tracking tasks, this paper presents a novel tracking method based on a deep network and a particle filter. Compared with the existing algorithms of recent years, our method adopts a multi-domain training strategy to learn general discriminative features that can efficiently distinguish targets from the background. The tracking task is addressed with a scale-adaptive particle filter, and the proposed scale-adaptive strategy can precisely predict the possible location and scale of targets. In addition, we compared our proposed MDSPF algorithm with several state-of-the-art trackers on the OTB, UAV123 and LaSOT datasets. The experimental results demonstrate that our MDSPF algorithm achieves better performance than those trackers and shows good robustness across the attributes and challenge factors of the OTB, UAV123 and LaSOT datasets. In future work, we plan to accelerate our method in the following two ways: (1) we will port our code to the GPU to achieve acceleration; (2) we will attempt to improve the performance of the classifier with fewer scale samples.