Superpixel Tensor Pooling for Visual Tracking using Multiple Midlevel Visual Cues Fusion

In this paper, we propose a method called the superpixel tensor pooling tracker, which fuses multiple midlevel cues captured by superpixels into sparse pooled tensor features. Our method first uses superpixel segmentation to produce patches from the target template or candidates. It then encodes midlevel cues such as HSI color, RGB color, and the spatial coordinates of each superpixel into a histogram matrix to construct a new feature space. Next, these matrices are assembled into a 3-order tensor, the tensor is pooled into a sparse representation, and incremental positive and negative subspace learning is performed. Our method combines the good characteristics of midlevel cues and sparse representation; it is therefore more robust to large appearance variations and captures a compact and informative appearance of the target object. To validate the proposed method, we compare it with eight state-of-the-art methods on 24 sequences with multiple visual challenges. Experimental results demonstrate that our method outperforms them significantly.


I. INTRODUCTION
The study of visual tracking has achieved great success in recent years. However, because of heavy occlusion, drift, fast motion, severe scale variation, large shape deformation, etc., visual tracking is still a challenge in computer vision [1], [2], [3].
To overcome these challenges, several effective visual tracking methods have been proposed. Different levels of appearance and spatial cues have been successfully applied in visual tracking [4], [5]. Compared to high-level structure cues and low-level visual cues, midlevel visual cues have been shown to be more effective in representing the structure of the image [4], [5], [6], [7]. Some researchers applied superpixel methods to obtain midlevel cues in visual tracking, and their methods are robust against heavy occlusion and drift [4], [5], [6], [7], [8]. However, because of the use of superpixels, the dimensionality of the original data is reduced (a matrix is reduced to a vector), which causes the loss of spatial information about the target object. Hence, how to fuse spatial information into an appearance model constructed from midlevel cues is still an open problem [8]. Some researchers used the Euclidean distances from the target to the candidates as weights to integrate spatial information [4], [5]. This preserves spatial information to some extent; however, spatial information such as the shape of the target is still lost. Hence, some of these methods are more sensitive to color variations than to spatial variations, which may lead to poor performance in situations such as background clutter. It has been shown that integrating more information into a sparse representation can improve tracking performance [9], and some researchers fused the depth cue with superpixel-based target estimation using graph-regularized sparse coding, improving the discriminative ability of their trackers [8]. Moreover, different color channels suit different tracking scenarios, so integrating different color channels into a unified framework also helps to improve the robustness of a tracker. Hence, it is of great interest to develop an elegant method for fusing multiple midlevel cues in a sparse representation.
In this paper, we propose a visual tracking method called the superpixel tensor pooling tracker (SPTPT), which integrates multiple midlevel cues (such as information from different color channels, spatial coordinates, and shape) obtained by superpixels into a unified sparse coding tensor form. By combining sparse representation with midlevel cues, our method enjoys the good characteristics of both. Moreover, through fusing multiple midlevel cues, our method is more robust than several state-of-the-art methods under large appearance variations. The contributions of this paper can be summarized as follows: 1) This is the first attempt to use superpixels to obtain tensor-pooled sparse features. The patches obtained with superpixels are more meaningful than those obtained with a sliding window. 2) Our method provides an effective way to fuse multiple midlevel cues in a unified sparse representation; hence the constructed discriminative appearance model can take advantage of different midlevel cues.
To validate our method, we select eight state-of-the-art tracking methods (TPT [5], SPT [3], STRUCK [10], TLD [11], VTD [12], CSK [13], SMS [14], and CT [15]) and 24 sequences with multiple tracking challenges from the benchmark in [16], and we compare their reasonable lower-bound performances (we used one default parameter setting for all sequences without any tuning). Experimental results show that the lower-bound performance of our method is significantly better than that of existing methods.
The rest of this paper is organized as follows. In Section 2, we first introduce the superpixel segmentation method used in our method; then we describe the fusion model and the incremental positive and negative subspace learning method. In Section 3, we first describe the experiment settings and evaluation metrics, and then analyze the experimental results. Finally, we present the conclusion in Section 4.

Fig. 1 outlines the proposed method:

Step 1: Use a particle filter (PF) and affine transformation to produce candidates.
Step 2: Segment each candidate into superpixels.
Step 3: Compute the histograms of color and spatial information in each superpixel.
Step 4: Arrange the histograms into a 3-order tensor.
Step 5: Repeat Steps 3-4 until the features of all superpixels of all candidates are obtained.
Step 6: Determine whether the current frame is the 1st frame. If yes, produce the dictionary matrix, store the tensor in an updating sequence, and go to Step 1; otherwise go to Step 7.
Step 7: Use the dictionary matrix to pool the tensors obtained in Steps 3-5.
Step 8: Evaluate the likelihood of the pooled tensors in the positive and negative subspaces.
Step 9: According to the likelihood, update the discriminative appearance model (if the maximum likelihood > 0, store the tensor with the maximum likelihood in the updating sequence and use the PF to draw negative samples to update the negative subspace; when the algorithm reaches the update rate u, use IRTSA to update the positive subspace). Continue Steps 1-9 until the last frame.

II. SUPERPIXEL TENSOR POOLING TRACKER
In this section, we will introduce four main parts of the proposed SPTPT.

A. Patches Extraction using Superpixel Segmentation
Producing meaningful patches is very important for constructing tensor-pooled sparse features. Compared to patches obtained by a sliding window [2], [3], superpixels are more meaningful because they preserve the image structure and reduce redundancy. Hence, we use superpixel segmentation to obtain patches (superpixels) in SPTPT. To construct tensor-pooled sparse features, we need to control the number of patches precisely, so a superpixel method that can control the superpixel number exactly is required. We select simple non-iterative clustering (SNIC) [17], which can generate a precise number of superpixels (compactness coefficient: 20; number of superpixels: 30, in this paper).
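As an illustration of the idea behind this step, the sketch below clusters pixels on joint (color, position) features, which is the principle shared by SNIC and SLIC. It is not SNIC itself (SNIC grows clusters non-iteratively from seeds with a priority queue, which is omitted here); the function name and the simple k-means loop are our own, and only the parameter values (30 superpixels, compactness 20) mirror the text.

```python
# Simplified stand-in for SNIC-style superpixel segmentation:
# k-means over [R, G, B, x, y] features. Hypothetical helper, not the
# actual SNIC algorithm used in the paper.
import numpy as np

def superpixels(img, n_segments=30, compactness=20.0, n_iter=10, seed=0):
    h, w, _ = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Expected superpixel spacing; larger compactness -> more regular shapes.
    s = np.sqrt(h * w / n_segments)
    feats = np.concatenate(
        [img.reshape(-1, 3).astype(float),
         (compactness / s) * np.stack([xs.ravel(), ys.ravel()], axis=1)],
        axis=1)
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), n_segments, replace=False)]
    for _ in range(n_iter):
        # Assign each pixel to its nearest center in the joint feature space.
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for k in range(n_segments):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(axis=0)
    return labels.reshape(h, w)
```

Unlike this sketch, SNIC guarantees connected superpixels and needs no iteration, which is why it is preferred when the superpixel count must be controlled exactly.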

B. Fusion Model for Multiple Midlevel Cues
After producing the superpixels, we apply local sparse codes to encode features. Multiple midlevel cues (HSI color channels, RGB color channels, and spatial information are used in this paper) of each superpixel are first converted into several histogram vectors of the same size, one vector per cue per superpixel,

f = [f_1, f_2, ..., f_n]^T ∈ R^n,

where n is the number of bins of each histogram. Each element f_i of f represents the frequency of bin i in the superpixel region and is calculated as

f_i = c_i / r,

where c_i is the number of pixels corresponding to bin i in the superpixel region and r is the total number of pixels in this superpixel region. These vectors are combined to construct the feature matrix of each superpixel,

F = [f^(1), f^(2), ..., f^(m)] ∈ R^(n×m),

where m is the number of cues; in this paper, m = 8 (H, S, I, R, G, B, x, y).
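The per-superpixel feature construction above can be sketched as follows. The helper names and the bin count are ours; only the normalized-histogram definition f_i = c_i / r and the n-by-m matrix layout come from the text.

```python
# Sketch of the per-superpixel feature matrix: one n-bin normalized
# histogram per cue (H, S, I, R, G, B, x, y), stacked column-wise.
import numpy as np

def cue_histogram(values, n_bins=8, vrange=(0.0, 1.0)):
    """n-bin histogram of one cue over a superpixel; f_i = c_i / r."""
    c, _ = np.histogram(values, bins=n_bins, range=vrange)
    return c / max(values.size, 1)  # frequencies sum to 1

def superpixel_feature_matrix(cues, n_bins=8):
    """Stack one histogram per cue into an (n_bins x m) matrix F."""
    return np.stack([cue_histogram(v, n_bins) for v in cues], axis=1)
```

Each `cues[j]` is assumed to be the 1-D array of values of cue j over the pixels of one superpixel, so `F[:, j]` is the histogram vector f^(j).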
After that, we reshape the feature matrix into a feature vector a ∈ R^B for each superpixel, where B = n × m, and compute its sparse coefficient vector h ∈ R^z by solving

min_h (1/2) ||a − Dh||_2^2 + λ ||h||_1,

where D ∈ R^(B×z) is the dictionary matrix learned by clustering the vectors a of the superpixels obtained in the 1st frame, z is the number of cluster centroids, and s is the number of superpixels.
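A minimal sketch of this sparse coding step is given below, using ISTA (iterative shrinkage-thresholding) to solve the standard l1-regularized least-squares problem; the paper does not specify its solver, so ISTA, the function names, and the λ value are our assumptions. The loop that codes every superpixel of every candidate and stacks the results into the z × s × v tensor follows the spatial-ordering idea of the text.

```python
# Hypothetical sketch: l1 sparse coding of superpixel features against a
# dictionary D (B x z), pooled into a 3-order tensor of shape (z, s, v).
import numpy as np

def ista_sparse_code(a, D, lam=0.1, n_iter=200):
    """Solve min_h 0.5*||a - D h||^2 + lam*||h||_1 with ISTA."""
    L = np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the gradient
    h = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = h - (D.T @ (D @ h - a)) / L                      # gradient step
        h = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0)  # soft threshold
    return h

def pool_candidates(A, D, lam=0.1):
    """A: (v, s, B) raw features -> tensor T of shape (z, s, v)."""
    v, s, _ = A.shape
    T = np.zeros((D.shape[1], s, v))
    for j in range(v):
        for k in range(s):
            T[:, k, j] = ista_sparse_code(A[j, k], D, lam)
    return T
```

Keeping the superpixels in a fixed spatial order along the second axis is what lets the tensor retain the layout information that a plain bag of codes would lose.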
To preserve the spatial order of the superpixels, we arrange the sparse coefficient vectors h of the superpixels of each candidate in a unified 3-order tensor T ∈ R^(z×s×v), according to the spatial order of the corresponding superpixels in the candidate templates, where v is the number of candidate templates.

C. Incremental Positive and Negative Subspaces Learning
The update and learning scheme used in SPTPT is based on incremental subspace learning [18]. To make the tracker more robust against drift, following [2], [3], we introduce a discriminative component, negative subspace learning, into the learning scheme. Hence, the learning scheme used in SPTPT is called incremental positive and negative subspace learning.
If the algorithm reaches the update rate u (in this paper, u is set to 5), a 3-order tensor T ∈ R^(z×s×u) corresponding to the positive subspace is constructed. We use the IRTSA algorithm [18] to find its dominant projection subspaces; the details of IRTSA can be found in [18]. To exploit the positive subspace, it is necessary to evaluate the likelihood between a candidate sample and its approximation in the learned positive subspace. Given the 3-order tensor J ∈ R^(z×s×1) of a candidate in the new frame, its likelihood in the learned positive subspace is determined by the reconstruction error

RE = γ Σ_{i=1,2} ||(J_(i) − M_(i)) − U^(i) U^(i)T (J_(i) − M_(i))||_F^2 + (1 − γ) ||(J_(3) − M_(3)) − U^(3) U^(3)T (J_(3) − M_(3))||_F^2,

where M is the mean tensor of T, M_(i) (i = 1, 2) is the column mean and M_(3) is the row mean of the mode-i unfolding matrix of T, J_(i) is the mode-i unfolding matrix of J, U^(i) is the learned projection matrix of mode i, and γ is the control weight; in this paper, γ = 0.5.
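The mode-i unfolding and the subspace reconstruction error can be sketched as below. This is a batch (non-incremental) stand-in: IRTSA maintains the projection matrices incrementally, whereas here, purely for illustration, they come from a plain SVD of each unfolding, only the two spatial modes are used, and all helper names and ranks are our assumptions.

```python
# Hypothetical batch sketch of the subspace reconstruction error used to
# score a candidate tensor J against a learned positive subspace.
import numpy as np

def unfold(T, mode):
    """Mode-i unfolding: mode-i fibers become the columns of a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def learn_subspaces(batch, M, ranks=(3, 3)):
    """Top singular vectors of each mode unfolding of the mean-removed
    batch tensor (stand-in for IRSTA's incremental update)."""
    return [np.linalg.svd(unfold(batch - M, i), full_matrices=False)[0][:, :r]
            for i, r in enumerate(ranks)]

def reconstruction_error(J, M, Us, modes=(0, 1)):
    """Energy of (J - M) lying outside the learned subspaces U^(i)."""
    err = 0.0
    for U, i in zip(Us, modes):
        X = unfold(J - M, i)
        err += np.linalg.norm(X - U @ (U.T @ X)) ** 2
    return err
```

A small sanity check: a candidate taken from the training batch itself has (near) zero reconstruction error when the subspaces are kept at full rank.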
For negative subspace learning, in contrast to the positive subspace, which collects one positive sample per frame using the sparse pooled tensor features of the tracked frames and is learned incrementally by IRTSA, the negative samples are collected only in the last tracked frame, by extracting superpixels a certain distance (several pixels) away from the estimated location of the target [2], [3]. Since these negative samples come from a single frame rather than a sequence of frames, the negative subspace is learned directly by tensor decomposition (TD) of the sparse pooled tensors of these samples. The likelihood is then calculated analogously from the reconstruction error.

D. Motion Model based on Bayesian Inference
The motion model of SPTPT is based on Bayesian inference. Let X_t = {x_t, y_t, ϑ_t, s_t, β_t, φ_t} represent the state (affine transformation parameters) at the t-th frame, where x_t is the x translation, y_t is the y translation, ϑ_t is the rotation angle, s_t is the scale, β_t is the aspect ratio, and φ_t is the skew direction, and let S_t represent the set of observations {S_1, S_2, ..., S_t} up to time t. The posterior probability is calculated as

p(X_t | S_t) ∝ p(S_t | X_t) ∫ p(X_t | X_{t−1}) p(X_{t−1} | S_{t−1}) dX_{t−1},

where p(S_t | X_t) is the observation model (here, the likelihood function) and p(X_t | X_{t−1}) denotes the dynamic model between states X_t and X_{t−1}. We apply a particle filter [19] to generate samples (number of positive samples: 600; number of negative samples: 200, in this paper) by estimating this distribution. The optimal state is obtained by maximum a posteriori (MAP) estimation,

X̂_t = arg max_{X_t^i} p(X_t^i | S_t), i = 1, 2, ..., b,

where b is the number of samples and X_t^i is sample i of state X_t. The dynamic model p(X_t | X_{t−1}) is formulated as a random walk [3],

p(X_t | X_{t−1}) = N(X_t; X_{t−1}, Ω),

where Ω is a diagonal covariance matrix whose diagonal elements are σ_x^2, σ_y^2, σ_ϑ^2, σ_s^2, σ_β^2, and σ_φ^2, respectively. Finally, the likelihood of a candidate in both the positive and negative subspaces is formulated as

p(S_t | X_t) = exp(RE^(−) − RE^(+)),

where RE^(+) and RE^(−) are the reconstruction errors in the positive and negative subspaces. To make the tracker more robust and avoid overfitting, SPTPT uses this likelihood to control the learning: only when the best candidate's likelihood exp(RE^(−) − RE^(+)) > Φ is it accepted into the updating scheme, where Φ is a threshold; in this paper, it is set to 0. The details of SPTPT are shown in Fig. 1.
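One particle-filter step with the random-walk dynamic model above can be sketched as follows. The function name and the σ values in `sigmas` are hypothetical (the paper does not list them); `likelihood` stands for the candidate score exp(RE^(−) − RE^(+)).

```python
# Hypothetical sketch of one tracking step: draw particles from
# N(X_{t-1}, Omega), score each with the likelihood, return the MAP state.
import numpy as np

def track_frame(prev_state, likelihood, n_particles=600,
                sigmas=(4.0, 4.0, 0.01, 0.01, 0.002, 0.001), seed=0):
    rng = np.random.default_rng(seed)
    # Random walk: X_t^i ~ N(X_{t-1}, diag(sigmas^2)).
    particles = prev_state + rng.normal(
        0.0, sigmas, size=(n_particles, len(sigmas)))
    weights = np.array([likelihood(p) for p in particles])
    return particles[np.argmax(weights)]  # MAP estimate over the samples
```

Because the covariance is diagonal, each affine parameter is perturbed independently, and the MAP state is simply the highest-scoring particle.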

III. EXPERIMENTS

A. Evaluation Metrics
Precision plots and success plots [16] are used to evaluate the overall performance and robustness of the trackers. A precision plot shows the ratio of frames whose center location error is within a threshold distance of the ground truth. A success plot shows the percentage of frames for which the overlap rate between the tracked result and the ground truth is larger than a certain threshold. The final rank is determined by the area under the curve (AUC) of each tracker. To compare the performance of the trackers on each sequence, the number of successfully tracked frames and the center location error [20] are also used.

B. Performance Analysis

Fig. 2 shows the visual comparison of SPTPT, SPT (a midlevel-cue-based method without midlevel cue fusion), and TPT (a tensor pooling tracker that uses a sliding window to obtain patches, also without midlevel cue fusion). It shows that SPTPT combines the good characteristics of SPT and TPT and, at the same time, outperforms both. Table I lists the number of successfully tracked frames and the center location error obtained by each tracker on each sequence; the best value is in bold face and the second best value is in underlined italic type. Our tracker achieves the best or the second best score on most sequences. Figs. 3-4 show the overall performance of the nine trackers. The AUCs of the success plot and precision plot of our tracker are significantly higher than those of the other trackers, which demonstrates its robustness. As for running time, SPTPT (2.8 s per frame) is faster than TPT (3.0 s per frame).
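The center location error and overlap metrics used in the evaluation can be sketched as follows; the helper names are ours, and boxes are assumed to be (x, y, w, h) tuples.

```python
# Sketch of the two standard tracking metrics: center location error and
# intersection-over-union overlap, plus one point of a success plot.
import numpy as np

def center_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ca = np.array([box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2])
    cb = np.array([box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2])
    return float(np.linalg.norm(ca - cb))

def overlap(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred, gt, thr=0.5):
    """Fraction of frames whose overlap exceeds thr (one success-plot point)."""
    return float(np.mean([overlap(p, g) > thr for p, g in zip(pred, gt)]))
```

Sweeping `thr` from 0 to 1 and plotting `success_rate` at each threshold yields the success plot whose AUC is used for the final ranking.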
IV. CONCLUSION

In this paper, we have proposed a visual tracking method that fuses multiple midlevel cues obtained by superpixels to construct tensor-pooled sparse features. Our method has the good characteristics of both midlevel cues and sparse representation. In the validation, our method is more robust against different visual tracking challenges than state-of-the-art methods.
ACKNOWLEDGMENT

This work was supported by the Hong Kong Research Grants Council (Project C1007-15G) and City University of Hong Kong (Project 9610034).