Tracking With CNN Based Correlation Filters on Spherical Manifolds

The correlation filter tracker admits features from multiple channels. Fusing these features by simply summing over them in a Euclidean space destroys the inherent geometry among them and loses their phase information, which is crucial to the tracking task. To better fuse features from multiple layers of a convolutional neural network (CNN) in the classical CNN-based correlation filter algorithm, this article introduces spherical manifolds and the computation of the intrinsic mean on them, so that feature fusion and the online update of filter kernels can be carried out on a spherical manifold. In addition, we introduce a random projection method, applied to the CNN features before fusion, to compress the features and thereby reduce computational and modeling complexity. Extensive experiments on the OTB-50 dataset demonstrate that the proposed algorithm outperforms state-of-the-art methods in both precision and success rate.


I. INTRODUCTION
The main task of tracking is to estimate the location of a visual target in each frame of an image sequence. It has various practical applications, especially in human-machine interaction, visual surveillance and unmanned control systems [1]-[3]. Although significant progress has been achieved in recent years, object tracking remains one of the most challenging problems in computer vision owing to factors such as partial occlusion, deformation, scale variation, illumination variation, background clutter, in-plane/out-of-plane rotation and motion blur [4]-[6].
In recent years, correlation filter (CF) based discriminative algorithms have attracted considerable attention owing to their high tracking accuracy and low computational complexity [7]. To estimate an object's translation in the spatial domain efficiently, the classical CF method exploits dense sampling in the target region via a circulant shift matrix, by which the tracking algorithm reduces to element-wise operations in the Fourier domain, and the target location is inferred by searching for the maximum of the response map. Meanwhile, CF has been extended to process multi-channel features.

The associate editor coordinating the review of this manuscript and approving it for publication was Hiu Yung Wong.
With the rapid development of deep learning, many tracking algorithms [9]-[12] based on features extracted from deep convolutional neural networks (CNNs) have been proposed, achieving great improvements over traditional methods. It has been shown that CNN-based trackers perform well against methods using hand-crafted features such as the Scale Invariant Feature Transform (SIFT) [13], HOG [14] and color histograms. For CNN-based correlation filters, the response map of a layer with D channels is computed by

f(z) = \mathcal{F}^{-1}\left( \frac{\hat{y} \odot \sum_{d=1}^{D} \hat{x}_d^{*} \odot \hat{z}_d}{\sum_{d=1}^{D} \hat{x}_d^{*} \odot \hat{x}_d + \lambda} \right), \quad (1)

where \hat{x} \in C^{P \times Q \times D} is the target object represented by CNN features from multiple channels, \hat{y} \in C^{P \times Q} denotes the Gaussian-shaped label matrix, \hat{z} \in C^{P \times Q \times D} denotes the candidate object, and \lambda > 0 is a regularization parameter; all are in the Fourier domain. The operator \mathcal{F}^{-1} denotes the inverse Fast Fourier Transform (FFT).
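The response computation of Eq. 1 can be sketched in a few lines of NumPy; the function name, default regularization value and array layout below are our own choices, not the paper's.

```python
import numpy as np

def cf_response(x, z, y_hat, lam=1e-4):
    """Multi-channel correlation-filter response (a sketch of Eq. 1).

    x, z  : (P, Q, D) real arrays -- training and candidate features.
    y_hat : (P, Q) FFT of the Gaussian label matrix.
    """
    x_hat = np.fft.fft2(x, axes=(0, 1))
    z_hat = np.fft.fft2(z, axes=(0, 1))
    num = y_hat * np.sum(np.conj(x_hat) * z_hat, axis=2)  # cross-correlation kernel
    den = np.sum(np.conj(x_hat) * x_hat, axis=2) + lam    # autocorrelation kernel + reg.
    return np.real(np.fft.ifft2(num / den))
```

When the candidate equals the training sample and the regularization is small, the response essentially reproduces the Gaussian label, which is a quick sanity check of the implementation.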
In the above CNN-feature approaches, an essential step is computing the kernels (i.e., the numerator and denominator in Eq. 1; we call the numerator the cross-correlation kernel and the denominator the autocorrelation kernel). To construct a kernel, the CF algorithm simply sums, in the Fourier domain, over the kernels built from each single-channel feature, which amounts to fusing the kernels of the multi-channel features.
Since visual tracking focuses only on the trajectory of the target object, phase information plays a significant role. It is well known that an object's translation is determined solely by the phase information of the response map; the corresponding amplitude information is redundant. Adopting the fusion of Eq. 1 introduces the unwanted energy information of each feature, which brings no benefit to tracking performance.
To address this issue, spherical manifolds are introduced to fuse the kernels from multiple channels on a unit hypersphere, substituting the mean on the hypersphere for the mean in ordinary Euclidean space. Thus, the phase information of the features from multiple channels is preserved, which benefits the tracking task. However, computing the mean on a hypersphere involves an iterative process, so the computational load would be heavy for high-dimensional features. To reduce computational complexity, a random projection algorithm is introduced to compress the dimension of the CNN features before feature fusion.
The main contributions of this article are as follows:
1) We propose a novel method to generate a random projection matrix based on spectral graph theory, which avoids the time-consuming Gram-Schmidt orthogonalization during the random projection process.
2) We propose a novel CNN-feature-based CF algorithm on spherical manifolds.
3) We propose an online update strategy for CF kernels based on the geometry of spherical manifolds.
4) We carry out extensive experiments on a large-scale benchmark dataset to demonstrate the effectiveness of the proposed algorithm in comparison with state-of-the-art trackers.

II. RELATED WORKS
This work is closely related to CF algorithms. In this section, a brief overview of CF tracking is presented. In [16], multi-channel color attribute features were used instead of intensity features to achieve better tracking results. In 2015, Henriques et al. [4] proposed a multi-channel version of the tracker using HOG features, representing objects through 31-dimensional features. Considering the unwanted boundary effects that can severely degrade the tracking model, Danelljan et al. proposed the Spatially Regularized Discriminative Correlation Filters (SRDCF) algorithm [7] in the same year.
Owing to the rapid development of deep CNNs, several trackers [12], [17], [18] based on Siamese networks were introduced in 2016, and became a new research hotspot because of their simplicity and competitive performance. In 2017, Valmadre et al. [19] improved the Fully-Convolutional Siamese Network (SiamFC) [12] tracker by integrating discriminative correlation filters into the Siamese framework. Among methods that integrate convolutional features from a fixed pre-trained deep network, Danelljan et al. proposed the Deep SRDCF algorithm [9] in 2015, the Continuous Convolution Operator Tracker (C-COT) [20] in 2016, and the Efficient Convolution Operator (ECO) tracker [11] in 2017; all are based on deep CNN features and achieved remarkable performance compared with methods based on hand-crafted features. In 2016, Ma et al. designed an effective correlation filter tracker, Hierarchical Convolutional Features for Visual Tracking (HCFT) [21], on each CNN layer, and obtained the target location from the multi-level response maps in a coarse-to-fine fashion. In 2018, they proposed the robust tracker HCFT* [22], which applies the classifier to two types of region proposals for scale estimation and for target re-detection after tracking failures. In 2016, Qi et al. combined several weak CNN trackers from multiple convolutional layers into a stronger one, the Hedged Deep Tracking (HDT) algorithm [23], which introduces an improved Hedge algorithm that accounts for the historical performance of the weak trackers. In 2018, Qi et al. introduced a Siamese network to improve the original HDT algorithm, defining the loss of each weak tracker for the proposed hedging method [24]. The CNN model has proved well suited to building robust appearance models for the tracking task, owing to its powerful feature extraction ability [25].
Generally speaking, CF based on hand-crafted features tends to be less accurate or robust in complex scenarios, because hand-crafted features are usually designed for a specific purpose, although it runs at high speed. Conversely, CF using CNN features achieves great gains in accuracy and robustness, but is limited in tracking speed.

III. PREVIEW OF THE CNN BASED CORRELATION FILTERS
The CNN-based Correlation Filters (TCCF) tracker [10] employed the same hedging method as HDT [23], additionally equipped with a scale estimation module. It is the baseline of our proposed tracker, consisting of a Local Correlation Filter (LCF) and a Scale Correlation Filter (SCF), as shown in Fig. 1 [10]. For the LCF, CNN features from multiple layers of the pre-trained VGG-16 [26] are used to represent the target appearance. The parameter settings of VGG-16 are given in Table 1.

A. LOCATION ESTIMATION
Let x^k \in R^{P \times Q \times D} denote the feature map extracted from the k-th convolutional layer, and let the regression target y be a 2D Gaussian-shaped label matrix. Let \hat{x}^k = \mathcal{F}(x^k) and \hat{y} = \mathcal{F}(y), where \mathcal{F} denotes the Discrete Fourier Transform (DFT). In the Fourier domain, the k-th desired filter can be computed by

\hat{w}^k = \arg\min_{\hat{w}} \left\| \sum_{d=1}^{D} \hat{w}_d \odot \hat{x}^k_d - \hat{y} \right\|^2 + \lambda \sum_{d=1}^{D} \left\| \hat{w}_d \right\|^2. \quad (2)

The solution to Eq. 2 is given by

\hat{w}^k_d = \hat{y} \odot \hat{\alpha}^k_d, \quad (3)

where \odot denotes the element-wise product and division is also performed element-wise. With \hat{x}^{k*}_i denoting the complex conjugate of \hat{x}^k_i, \hat{\alpha}^k has the form

\hat{\alpha}^k_d = \frac{\hat{x}^{k*}_d}{\sum_{i=1}^{D} \hat{x}^{k*}_i \odot \hat{x}^k_i + \lambda}. \quad (4)

Given the test data z^k, transformed to the Fourier domain by \hat{z}^k = \mathcal{F}(z^k), the response map is computed by

f^k = \mathcal{F}^{-1}\left( \sum_{d=1}^{D} \hat{w}^k_d \odot \hat{z}^k_d \right). \quad (5)

Then the k-th tracker outputs the target position with the largest response,

(x_k, y_k) = \arg\max_{(x, y)} f^k(x, y). \quad (6)

Considering that K convolutional layers are used, the final location (x^{*}, y^{*}) is obtained by

(x^{*}, y^{*}) = \sum_{k=1}^{K} w_k \, (x_k, y_k), \quad (7)

where w_k \geq 0 and \sum_{k=1}^{K} w_k = 1, determined according to the performance of each tracker [23].
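The per-layer estimation (train on x, respond on z, take the argmax) and the weighted fusion across layers can be sketched as follows; the function names and the regularization default are ours, not the paper's.

```python
import numpy as np

def layer_response(x, z, y, lam=1e-4):
    """Train the k-th layer's filter on x and evaluate it on z (Eqs. 2-5)."""
    x_hat = np.fft.fft2(x, axes=(0, 1))
    z_hat = np.fft.fft2(z, axes=(0, 1))
    num = np.fft.fft2(y) * np.sum(np.conj(x_hat) * z_hat, axis=2)
    den = np.sum(np.conj(x_hat) * x_hat, axis=2) + lam
    return np.real(np.fft.ifft2(num / den))

def layer_location(resp):
    # Eq. 6: position of the largest response
    return np.unravel_index(np.argmax(resp), resp.shape)

def fused_location(locations, weights):
    # Eq. 7: convex combination of the per-layer estimates
    w = np.asarray(weights, dtype=float)
    assert w.min() >= 0 and abs(w.sum() - 1.0) < 1e-9
    return tuple((w[:, None] * np.asarray(locations, dtype=float)).sum(axis=0))
```

A circularly shifted candidate should move the response peak by exactly the shift, which is a convenient way to check the correlation machinery.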

B. SCALE ESTIMATION AND MODEL UPDATE
According to the Discriminative Scale Space Tracker (DSST), the SCF can be applied after the location estimation, independently of the translation filters.
Here, the HOG feature is used; a brief description of HOG for scale estimation can be found in [27]. For the SCF, a set of scale factors is predefined, \{\alpha_j = \theta^{\lceil J/2 \rceil - j} \mid j = 1, 2, \dots, J\}, with \theta > 1. Given a training sample, J image patches are cropped around the estimated target position; for a scale factor \alpha_j, the corresponding image patch has size \alpha_j P \times \alpha_j Q, where P \times Q is the size of the target in the previous frame. After resizing these scaled image patches, a one-dimensional correlation filter is applied and the desired scale is obtained.
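The scale pyramid can be generated in a couple of lines; the exponent convention below is our reading of the formula, and the default values of J and theta are illustrative rather than the paper's settings.

```python
import math

def scale_factors(J=33, theta=1.02):
    """Scale set {theta**(ceil(J/2) - j) : j = 1..J}, centred on scale 1."""
    c = math.ceil(J / 2)
    return [theta ** (c - j) for j in range(1, J + 1)]

def patch_size(P, Q, alpha):
    # a patch of roughly alpha*P x alpha*Q pixels is cropped around the target
    return round(alpha * P), round(alpha * Q)
```

For odd J the middle factor is exactly 1, so the current target size is always among the tested scales.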
For more details about the scale estimation and model update, refer to [10], [27].

IV. PROPOSED TRACKER

A. FUSION METHOD OF KERNELS BASED ON SPHERICAL MANIFOLD GEOMETRY
Visual tracking focuses on the trajectory of the target object, which is determined solely by the phase of the response map. Therefore, to remove the impact of the amplitude of each kernel, we consider a linear kernel with normalized energy, expressed as

K^{xx}_i = \frac{\hat{x}^{*}_i \odot \hat{x}_i}{\left\| \hat{x}^{*}_i \odot \hat{x}_i \right\|_2}, \quad i = 1, 2, \dots, D, \quad (8)

where \|\cdot\|_2 denotes the L_2 norm and D is the number of channels. Since \|K^{xx}_i\|_2 = 1, a natural geometric structure arises for such unit-norm kernels: they are embedded on a unit hypersphere. Taking the geometry of spherical manifolds into account, the summation in Eq. 1 must be carried out in a spherical space, substituting the mean on the hypersphere for the mean in ordinary Euclidean space.

1) SPHERICAL MANIFOLDS AND OPERATORS ON SPHERICAL MANIFOLDS
A low-dimensional spherical manifold is embedded in a higher-dimensional Euclidean space. Let C^N be an N-dimensional Euclidean space; the unit sphere is given by

S^{N-1} = \{ p \in C^{N} : \|p\|_2 = 1 \}.

Two basic operators, Log and Exp, are defined on spherical manifolds.
To project a point q \in S^{N-1} to the tangent space of p \in S^{N-1}, denoted T_p S^{N-1}, the Log operator is defined as [28]

\mathrm{Log}_p\, q = \frac{\theta}{\sin\theta}\left( q - \langle p, q \rangle\, p \right), \quad \theta = \arccos\langle p, q \rangle. \quad (9)

To map g \in T_p S^{N-1} back onto the spherical manifold, the Exp operator is defined as [28]

\mathrm{Exp}_p\, g = p \cos(\|g\|_2) + \frac{g}{\|g\|_2} \sin(\|g\|_2). \quad (10)

From Eq. 9 and Eq. 10, \mathrm{Log}_p\, q \in T_p S^{N-1} and \mathrm{Exp}_p\, g \in S^{N-1}.
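The two operators of Eq. 9 and Eq. 10 are a few lines each in NumPy; the helper names and numerical tolerances below are our own.

```python
import numpy as np

def sphere_log(p, q):
    """Log map (Eq. 9): tangent vector at p pointing towards q, whose length
    equals the great-circle angle between p and q."""
    c = np.clip(np.dot(p, q), -1.0, 1.0)
    v = q - c * p
    n = np.linalg.norm(v)
    return (np.arccos(c) / n) * v if n > 1e-12 else np.zeros_like(p)

def sphere_exp(p, g):
    """Exp map (Eq. 10): walk from p along the tangent vector g on the sphere."""
    n = np.linalg.norm(g)
    if n < 1e-12:
        return p.copy()
    return np.cos(n) * p + np.sin(n) * (g / n)
```

The two maps are mutually inverse at p (for non-antipodal points), and the Log output is orthogonal to p, i.e., genuinely tangent.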

2) FUSION METHOD OF KERNELS BASED ON SPHERICAL MANIFOLD GEOMETRY
Let \{p_1, p_2, \dots, p_n\} \subset S^{N-1} denote points on a spherical manifold. We fuse them by computing their intrinsic mean, defined as

\bar{p} = \arg\min_{p \in S^{N-1}} \sum_{i=1}^{n} d(p, p_i)^2, \quad (11)

where d(\cdot, \cdot) denotes the Riemannian distance on S^{N-1}. That is, the intrinsic mean is the point on the spherical manifold with the minimum sum of squared distances to all the given points; in this sense it is analogous to the cluster center in clustering methods. By Eq. 11, computing \bar{p} is an optimization problem, minimizing the sum-of-squared-distance function

f(p) = \sum_{i=1}^{n} d(p, p_i)^2. \quad (12)

Under the assumption that all points are confined to a strongly convex neighborhood, Karcher [29] showed that the gradient of Eq. 12 is

\nabla f(p) = -2 \sum_{i=1}^{n} \mathrm{Log}_p\, p_i. \quad (13)

The minimum of f(p) is attained where the gradient vanishes, i.e., at a stationary point. According to Eq. 11, the intrinsic mean has the minimum sum of squared distances to \{p_1, \dots, p_n\} \subset S^{N-1}, so we have

\nabla f(\bar{p}) = 0, \quad (14)

which means

\sum_{i=1}^{n} \mathrm{Log}_{\bar{p}}\, p_i = 0. \quad (15)

Thus, computing \bar{p} amounts to finding the stationary point of Eq. 15, which leads to the iterative process [30]

p_{m+1} = \mathrm{Exp}_{p_m}\!\left( \frac{1}{n} \sum_{i=1}^{n} \mathrm{Log}_{p_m}\, p_i \right), \quad (16)

where m denotes the iteration number. Although the convergence of this process is not guaranteed on a general manifold, it is well behaved on the hypersphere [31]. Considering the kernels in Eq. 8, we apply the same iterative process to fuse them under the constraint of the spherical manifold structure; the procedure is given in Algorithm 1. The initial value of the iteration, \bar{K}^{xx}_0, is taken as the normalized Euclidean mean

\bar{K}^{xx}_0 = \frac{\sum_{i=1}^{D} K^{xx}_i}{\left\| \sum_{i=1}^{D} K^{xx}_i \right\|_2}. \quad (17)

At iteration m, the mean of the kernels in the tangent space is

\Delta \bar{K}^{xx}_{m-1} = \frac{1}{D} \sum_{i=1}^{D} \mathrm{Log}_{\bar{K}^{xx}_{m-1}}\, K^{xx}_i. \quad (18)

Algorithm 1 Fusion of the kernels on the spherical manifold
Input: the unit-norm kernels K^{xx}_1, \dots, K^{xx}_D; tolerance \varepsilon > 0.
Output: the fused kernel \bar{K}^{xx}.
1) Initialize \bar{K}^{xx}_0 by Eq. 17.
2) for m = 1, 2, \dots do
3) Project the D kernels onto the tangent space of \bar{K}^{xx}_{m-1} by Eq. 9.
4) Compute the tangent-space mean \Delta \bar{K}^{xx}_{m-1} by Eq. 18.
5) Map back to the manifold: \bar{K}^{xx}_m = \mathrm{Exp}_{\bar{K}^{xx}_{m-1}}(\Delta \bar{K}^{xx}_{m-1}) by Eq. 10.
6) if \|\Delta \bar{K}^{xx}_{m-1}\|_2 \leq \varepsilon then
7) break.
8) end
9) end
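The intrinsic-mean iteration of Algorithm 1 can be sketched as below. For self-containment the Log/Exp helpers are repeated; the initialization here uses the first point rather than the normalized Euclidean mean, which is a simplification of ours.

```python
import numpy as np

def _log(p, q):
    c = np.clip(np.dot(p, q), -1.0, 1.0)
    v = q - c * p
    n = np.linalg.norm(v)
    return (np.arccos(c) / n) * v if n > 1e-12 else np.zeros_like(p)

def _exp(p, g):
    n = np.linalg.norm(g)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * (g / n)

def intrinsic_mean(points, eps=1e-10, max_iter=100):
    """Karcher-mean iteration: average the Log-mapped points in the tangent
    space, step back with Exp, stop when the tangent step is below eps."""
    p = points[0] / np.linalg.norm(points[0])  # simplified initialisation
    for _ in range(max_iter):
        g = np.mean([_log(p, q) for q in points], axis=0)
        if np.linalg.norm(g) <= eps:
            break
        p = _exp(p, g)
    return p
```

Two points placed symmetrically about a known axis should average to that axis, which checks both the tangent averaging and the convergence test.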

B. RANDOM PROJECTION FOR THE DIMENSION REDUCTION
As mentioned in the introduction, when tracking uses features extracted by a fine-tuned pre-trained CNN, the dimension of these features is usually very high, and computing the mean on a hypersphere is therefore time-consuming. For a more efficient computation, we need to reduce the dimension of the CNN features. Here, random projection is considered. Random projection is a powerful means of dimensionality reduction, in which the original high-dimensional data is projected into a lower-dimensional subspace using a random matrix. Unlike most transform-based dimensionality reduction techniques, which are highly data-dependent, random projection is data-independent and computationally more efficient than other widely used methods such as principal component analysis and the maximum noise fraction transform. Random projection has been applied to various areas with good performance, including information retrieval, machine learning and remote sensing [32]-[35].
Random projection consists of random matrix creation and matrix multiplication [34]. The projection matrix is usually chosen to be orthogonal. For a high-dimensional feature, Gram-Schmidt orthogonalization would be very time-consuming [33]. In this work, spectral graph theory is employed to generate a random orthogonal matrix very quickly. Consider an undirected, connected, weighted graph g = \{v, \varepsilon, W\}, where v is a finite set of vertices with |v| = D, \varepsilon is a set of edges, and W \in R^{D \times D} is a weighted adjacency matrix. The non-normalized graph Laplacian is defined as

L = \mathbf{D} - W,

where \mathbf{D} is the diagonal degree matrix with \mathbf{D}_{ii} = \sum_j W_{ij}. As the graph Laplacian L is a real symmetric matrix, it has a complete set of orthonormal eigenvectors, from which an orthogonal projection matrix can be generated as follows.
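The construction can be sketched as follows: build a random weighted graph, form its Laplacian, and take eigenvectors as the projection matrix. The orthonormality comes free from the symmetric eigendecomposition, with no Gram-Schmidt step. The particular random graph and the random choice of eigenvectors are our assumptions, not the paper's exact recipe.

```python
import numpy as np

def graph_projection_matrix(D, d, seed=0):
    """D x d orthonormal projection built from a random graph Laplacian."""
    rng = np.random.default_rng(seed)
    W = rng.random((D, D))
    W = (W + W.T) / 2.0            # symmetric weighted adjacency
    np.fill_diagonal(W, 0.0)       # no self-loops
    Deg = np.diag(W.sum(axis=1))   # degree matrix
    L = Deg - W                    # non-normalised graph Laplacian
    _, U = np.linalg.eigh(L)       # orthonormal eigenvectors (L is symmetric)
    cols = rng.choice(D, size=d, replace=False)  # pick d eigenvectors at random
    return U[:, cols]
```

Any d distinct columns of U form an orthonormal set, so the randomness only selects which orthogonal directions to keep.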

C. UPDATING CORRELATION FILTERS ON THE GEOMETRY OF SPHERICAL MANIFOLDS
In a video sequence, the target can often change appearance by changing its rotation, pose or lighting conditions. The strategy to update filter plays an important role during the tracking process.

Algorithm 2 The projection algorithm
Input: the feature X \in R^{P \times Q \times D}, projection matrix P \in R^{D \times d} (d \ll D).
Output: the feature Y after random projection.
1) Reformat the feature into a feature matrix X = [x_1, x_2, \dots, x_D].
2) Compute the mean feature \bar{x} = \frac{1}{D} \sum_{i=1}^{D} x_i.
3) Centralize the features in X by x_i = x_i - \bar{x} (i = 1, 2, \dots, D).
4) Project the centered features into the random subspace by Y = XP = [y_1, y_2, \dots, y_d].
5) Obtain the reduced set of features Y = [y_1 + \bar{x}, y_2 + \bar{x}, \dots, y_d + \bar{x}].
6) Reformat Y into a P \times Q \times d tensor.
7) end

Algorithm 3 The update strategy of the filter
1) Project K^{x_t x_t} onto the tangent space of \bar{K}^{x_{t-1} x_{t-1}} by Eq. 9 and get K^{x_t x_t *}.
2) Project \eta K^{x_t x_t *} back to the spherical manifold by Eq. 10 and get the updated autocorrelation kernel \bar{K}^{x_t x_t}.
3) Compute the updated filter by Eq. 24.
4) end
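Algorithm 2 can be sketched in NumPy as below; the function name and tensor layout are ours.

```python
import numpy as np

def random_project_features(X, P):
    """A sketch of Algorithm 2: reshape the P0 x Q0 x D feature tensor to a
    (P0*Q0) x D matrix, centre it over the channels, project with the D x d
    matrix P, add the mean back, and reshape to P0 x Q0 x d."""
    P0, Q0, D = X.shape
    M = X.reshape(P0 * Q0, D)               # one column per channel
    mean = M.mean(axis=1, keepdims=True)    # per-position mean over the D channels
    Y = (M - mean) @ P + mean               # centre, project, restore the mean
    return Y.reshape(P0, Q0, P.shape[1])
```

With the identity as the "projection" the routine returns the input unchanged, which makes the centring/restoring steps easy to verify.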
The classical correlation filter algorithm updates the filter by a weighted combination in ordinary Euclidean space:

\hat{\alpha}_t = (1 - \eta)\, \hat{\alpha}_{t-1} + \eta\, \hat{\alpha}'_t, \quad (22)

where t is the frame index and \eta > 0 is the learning rate. Since the autocorrelation kernel (i.e., the denominator in Eq. 1) has been embedded in a spherical manifold, the update strategy is divided into two steps: update the autocorrelation kernel, then compute the filter. K^{x_t x_t} is updated as

\bar{K}^{x_t x_t} = \mathrm{Exp}_{\bar{K}^{x_{t-1} x_{t-1}}}\!\left( \eta\, \mathrm{Log}_{\bar{K}^{x_{t-1} x_{t-1}}}\, K^{x_t x_t} \right), \quad (23)

where \bar{K}^{x_{t-1} x_{t-1}} denotes the autocorrelation kernel of frame t - 1, K^{x_t x_t} that of frame t, and \eta > 0 is the learning rate; this procedure is illustrated in Fig. 2. Hence, in analogy with Eq. 4, the updated filter \hat{\alpha}_t can be learnt as

\hat{\alpha}_t = \frac{\hat{x}^{*}_t}{\bar{K}^{x_t x_t} + \lambda}. \quad (24)
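The spherical kernel update is a single geodesic step; the sketch below treats the kernel as a real unit-norm vector for clarity (the helper names and that simplification are ours).

```python
import numpy as np

def _log(p, q):
    c = np.clip(np.dot(p, q), -1.0, 1.0)
    v = q - c * p
    n = np.linalg.norm(v)
    return (np.arccos(c) / n) * v if n > 1e-12 else np.zeros_like(p)

def _exp(p, g):
    n = np.linalg.norm(g)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * (g / n)

def update_kernel_on_sphere(k_prev, k_new, eta=0.01):
    """Move a fraction eta of the way along the geodesic from the old unit-norm
    kernel towards the new one -- the spherical analogue of the Euclidean rule
    (1 - eta) * k_prev + eta * k_new."""
    return _exp(k_prev, eta * _log(k_prev, k_new))
```

Unlike the Euclidean rule, the result stays exactly on the unit sphere for every eta, and the endpoints eta = 0 and eta = 1 reproduce the old and new kernels.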

D. WHOLE FRAMEWORK FOR THE PROPOSED TRACKER
Putting the above algorithms together, we first obtain the weak tracker of the k-th convolutional layer and its corresponding target-position estimate, as in Algorithm 4.

Algorithm 4 The k-th weak tracker at frame t
1) Generate the target candidate region z_t and extract the CNN feature z^k_t \in C^{P \times Q \times D} according to the location estimated in frame t - 1.
2) Compute the dimensionality-reduced feature \hat{z}^{kr}_t \in C^{P \times Q \times d} by Algorithm 2.
3) Compute \hat{K}^{x^{kr}_{t-1} z^{kr}_t} by Algorithm 1.
4) Compute the response map over the candidate region by Eq. 5.
5) Estimate the location as the candidate with the maximum filter response.
6) Update the filter by Algorithm 3.
7) end
The final location estimate is obtained as a weighted sum of the target positions of the weak trackers, as in [10]. Meanwhile, the scale CF is applied after the location estimation; for more details about the scale estimation, refer to [27]. An overview of the overall architecture is given in Fig. 3.

V. EXPERIMENTAL RESULTS
The proposed algorithm is based on TCCF, to which several modules are added. We denote TCCF with our graph-based random projection as TCCF_rp, and TCCF_rp on the spherical manifold as TCCF_rps.
In addition, to further demonstrate the advantage of the proposed kernel fusion, we also apply it to the classical Dual Correlation Filter (DCF) [4], denoted DCF_s. DCF_s is implemented in MATLAB 2016a and runs on the same CPU with the standard parameters provided by the authors.
We perform the experiments on OTB-50 [38] benchmark and compare with baseline trackers and the state-of-the-art methods. The features and experimental setup of proposed algorithm and baseline trackers are listed in Table. 2.

A. QUANTITATIVE ANALYSIS
Here we provide a quantitative comparison of our approach with state-of-the-art trackers. Two criteria are used on the OTB-50 benchmark: the precision plot and the success plot. The precision plot measures the Euclidean distance between the estimated target center and the ground truth, plotted over a range of thresholds. The success plot reports the percentage of frames whose overlap between the estimated bounding box and the ground-truth box exceeds a range of overlap thresholds.
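For concreteness, the two OTB criteria can be computed as below; the function names and box convention (x, y, w, h) are our own choices.

```python
import numpy as np

def center_precision(pred, gt, threshold=20.0):
    """Fraction of frames whose predicted centre lies within `threshold`
    pixels of the ground-truth centre (one point on the precision plot)."""
    d = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=1)
    return float(np.mean(d <= threshold))

def iou(a, b):
    """Overlap (intersection over union) of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter)

def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames with box overlap above `threshold` (one point on
    the success plot)."""
    return float(np.mean([iou(p, g) > threshold
                          for p, g in zip(pred_boxes, gt_boxes)]))
```

Sweeping the thresholds and plotting the resulting fractions yields the precision and success curves used throughout this section.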
Overall performance evaluation: Fig. 4 shows the precision and success plots. TCCF_rps performs best in both precision and success rate on OTB-50, outperforming the TCCF tracker by 9.6% and 13.4% in these two measures, respectively. The TCCF_rp tracker also gains in both success and precision compared with TCCF, even though it uses features with less than one tenth of the dimensionality of the original deep features.
As for the HOG features, DCF_s improves on the DCF tracker by 2.6% and 1.4% in the two measures, respectively.
Attribute-based evaluation: We also compare the trackers on every attribute, namely illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC) and low resolution (LR). The results are shown in Table 3. Fig. 5 and Fig. 6 show the distance precision and success rate over eight attributes, respectively. As the attribute-based results show, among the trackers using CNN features, TCCF_rps achieves the best performance on all 11 attributes. Comparing DCF_s with DCF and KCF, the proposed DCF_s performs best overall among the three trackers using HOG features. The performance is indeed improved once the difference in energy across the channel features is eliminated.

B. QUALITATIVE ANALYSIS
We present some tracking results in Fig. 7, with challenging frames selected from the 51 image sequences, including Ironman, Shaking, Jogging2, Lemming, Subway, Couple, MotorRolling and Coke. In most complex scenes, our algorithm locates the target more accurately than the baseline algorithms, which shows that introducing the geometric structure of each kernel component helps tracking. As an example with HOG features, Fig. 8 shows the results and response maps of DCF_s and DCF on the jogging video sequence, in which the pedestrian is occluded by the background. DCF_s keeps tracking the target robustly throughout the sequence, while DCF fails.
As an example with CNN features, the subway sequence, characterized by OCC, BC and DEF, is shown in Fig. 9; TCCF_rps again tracks the target successfully.
It is well known that, to estimate the target location more accurately, the response map needs a sharper peak. As Fig. 8 and Fig. 9 show, DCF_s and TCCF_rps suffer less distraction in their response maps than DCF and TCCF, owing to the geometric constraint of the spherical manifold. In addition, a frame-by-frame comparison of the proposed algorithm on example sequences is presented, showing the center location error in pixels. The results for the trackers using HOG features and CNN features are given in Fig. 10 and Fig. 11, respectively. Our tracker still identifies the target stably as the video sequences approach their end.
We also compare the mean frames per second (FPS) of each method; the results are shown in Table 4 and Table 5. Table 4 shows that, despite some drop in frame rate, our tracker still operates beyond real time. Table 5 demonstrates that dimensionality reduction by random projection is essential for computational efficiency.

VI. CONCLUSION
In this article, we propose a novel correlation filter tracking algorithm based on spherical manifold geometry. By embedding each kernel on the spherical manifold, we substitute the mean on the hypersphere for the mean in ordinary Euclidean space, thereby achieving a fusion of multiple kernels. For consistency, a corresponding online update strategy is also proposed based on the geometry of spherical manifolds. The experiments show that fusing kernels with normalized amplitude through the geometry of spherical manifolds does help to handle the tracking challenges more effectively. Our approach is generic and can be extended to many CF-based tracking methods with multi-channel features.
XUANDE ZHANG was born in Ningxia, China, in 1979. He received the B.S. degree in computational mathematics from Ningxia University, Ningxia, in 2000, and the M.S. and Ph.D. degrees in applied mathematics from Xidian University, Xi'an, China, in 2006 and 2013, respectively. He is currently a Professor of electrical and information engineering with the Shaanxi University of Science and Technology. His research interests include numerical analysis, machine learning, and its application in computer vision.