Target Tracking Method Based on Adaptive Structured Sparse Representation With Attention

To address motion blur, partial occlusion, and fast motion in target tracking, a target tracking method based on adaptive structured sparse representation with attention is proposed. Under the particle filtering framework, the performance of high-quality templates is enhanced through an attention mechanism. Structured sparsity is used to build the candidate target sets and the sparse models between candidate samples and local patches of target templates. Combined with the sparse residual method, the reconstruction error is reduced. After the model is solved, the particle with the highest similarity is selected as the predicted target, and the most appropriate scale is selected according to the multiscale factor method. Experiments show that the proposed algorithm performs strongly under motion blur, fast motion, and partial occlusion.


I. INTRODUCTION
Target tracking automatically locates a target in subsequent frames according to the known state of the target in the initial frame. As one of the research hotspots in computer vision, target tracking plays an important role in intelligent transportation, medicine, the military, and intelligent surveillance. According to how the target observation model is established, target tracking methods [1]-[5] can be divided into two categories: discriminative methods and generative methods. A discriminative method establishes an observation model for the foreground and background of the initial frame and separates the target from the background in subsequent video frames to achieve target tracking. Discriminative methods mainly include correlation filtering methods [6], [7], [26]-[29] and deep learning methods [8], [9], [30]-[37]. A generative method represents the target through a learned appearance model and selects the candidate patch with the smallest reconstruction error as the target area of the next frame. Generative methods mainly include sparse representations [10], [11], [44], mean shift [12], [13], and particle filtering [14]. Although these methods have been proven to achieve good results, they still face challenges from occlusion, deformation, scale change, fast motion, motion blur, lighting change, and background change.
The associate editor coordinating the review of this manuscript and approving it for publication was Jinjia Zhou.
The sparse representation of images is based on the over-complete dictionary theory proposed by Mallat and Zhang [15] in 1993. Since Mei and Ling [16] applied sparse representation to target tracking in 2009, it has been proven applicable to this task. References [17]-[22], [43]-[48] used local, global, or joint sparse models to classify trackers. In [17], [22], the template T represents each target candidate area $x_i$ as a sparse linear combination, and a dynamic update method is used to describe the appearance model of the target. Although this type of method has achieved good results, its tracking efficiency decreases sharply under occlusion because of the global sparse model. In [11], the target candidate region was divided into k patches $x_i^k$ of the same size, which are represented by sparse templates. Reference [16] represented local patches in candidate regions as linear combinations of dictionaries by solving an $\ell_1$ minimization problem. Most of these methods are based on static local sparseness: once similar objects appear in the scene or the target is occluded, the tracked target is easily lost. Zhang et al. [24] proposed a tracking algorithm based on a structural sparse appearance model and a particle filtering framework to represent particles and the corresponding local patches jointly.
This paper makes two main contributions. (1) We propose a novel sparse representation model that combines structured sparse representation with sparse residuals and adds an attention mechanism to the model. The model significantly improves the performance and reliability of the algorithm. (2) A kernel density feature method is used to handle target scale change during tracking. The experiments show that this strategy is effective.

II. RELATED WORKS
For the convenience of subsequent descriptions, a brief review of the work related to this article is given in this section.

A. STRUCTURAL SPARSE REPRESENTATION TARGET TRACKING MODEL
Mei and Ling [16] proposed robust tracking based on $\ell_1$ norm minimization. The tracking problem was solved as a sparse approximation problem in the particle filtering framework. Each candidate target is sparsely represented over the target template set and the trivial template set. In addition, $\ell_1$ regularized least squares is used to solve the sparse problem of the candidate targets, and the candidate with the smallest reconstruction error is then selected as the location of the target in the next frame. On this basis, Xu et al. [23] proposed a tracking method based on a structured sparse appearance model, which divides the image into n patches of equal size; each local patch represents a fixed part of the target object, and all local patches together represent the overall structure of the target.
However, this sparse model has the following disadvantages: (i) Although the $\ell_1$ norm can make the coefficients sufficiently sparse, the sparse assumption often does not hold under the interference of complex backgrounds and lighting changes, and the $\ell_1$ norm has higher computational complexity. (ii) It cannot exploit the correlation between different particles, which has a considerable impact on the robustness of the model.
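The $\ell_1$ tracker described at the start of this subsection can be illustrated with a minimal sketch. It assumes a plain ISTA solver and a single identity block as the trivial templates, which is a simplification of [16]; the function names, `lam`, and `n_iter` are illustrative choices rather than values taken from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_l1(D, x, lam=0.05, n_iter=200):
    """Minimize 0.5 * ||x - D c||_2^2 + lam * ||c||_1 with plain ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ c - x)           # gradient of the least-squares term
        c = soft_threshold(c - grad / L, lam / L)
    return c

def pick_candidate(templates, candidates, lam=0.05):
    """Return the index of the candidate with the smallest reconstruction error.

    templates:  (d, m) matrix, one target template per column
    candidates: (d, n) matrix, one candidate (particle) per column
    """
    d, m = templates.shape
    # Assumption: trivial templates are a single identity block.
    D = np.hstack([templates, np.eye(d)])
    errors = []
    for x in candidates.T:
        c = ista_l1(D, x, lam)
        # Error is measured against the target templates only, so a candidate
        # that needs many trivial templates (e.g. occlusion) scores poorly.
        errors.append(float(np.linalg.norm(x - templates @ c[:m])))
    return int(np.argmin(errors)), errors
```

A candidate that lies close to the span of the target templates is reconstructed almost entirely by them and therefore receives a small error, whereas a candidate explained mostly by trivial templates receives a large one.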

B. ROBUST TARGET TRACKING BASED ON SPARSE REPRESENTATION
Zhang et al. [24] proposed a novel object tracking method of structural sparse representation, which not only makes full use of the inherent relationship between candidate targets and their local patches and learns their joint sparse representation but also preserves the spatial layout in local patches within each candidate target structure, and improves tracking performance by using the internal relationship between particles.
However, when the target deforms, the gray features used by this algorithm are extremely sensitive to the change, the correlation between column vectors is affected, and it is doubtful whether the coefficients remain sparse. Moreover, this algorithm lacks effective strategies for dealing with fast motion. Because it cannot adaptively adjust the window size, it learns noise as part of the target when the target changes in scale.

III. OUR APPROACH
The sparse structured target tracking method utilizes the relationship between the local patches of candidate targets and retains the spatial layout between the local patches of each candidate target to improve the robustness of the algorithm. In the prototype-based tracking algorithm [25], the base vectors are orthogonal, so the coefficients Z corresponding to the orthogonal base vectors are dense. The above model can be solved by iterative optimization.
Based on the advantages and disadvantages of the above two methods, this paper combines structural sparse representation and prototype-based sparse representation and adds an attention mechanism to the objective function to improve the robustness of the algorithm. Finally, the multiscale factor method is used to handle target scale change. The main structure of the algorithm is shown in Fig. 1.

A. STRUCTURAL SPARSE REPRESENTATION MODEL COMBINING AN ATTENTION MECHANISM
In this paper, the target template is selected according to the method in [24]. The target object image in the specified frame is divided into n × n subpatches (Fig. 2). All subpatches are vectorized and combined into the target template D. The model in [24] can be described as (1). Z is obtained by solving (1) with the help of the Lagrange multiplier method.
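The construction of the template D from subpatches can be sketched as follows. This is a minimal illustration that assumes the image is cut into an n × n grid of equal-size patches and that one dictionary is collected per patch position across several template images; the function names are hypothetical and do not come from [24].

```python
import numpy as np

def split_into_patches(image, n):
    """Split a 2-D image into an n x n grid of equal-size patches.

    Returns a (patch_dim, n*n) matrix with one vectorized patch per column,
    in row-major grid order. Rows and columns that do not fit evenly into
    the grid are ignored.
    """
    h, w = image.shape
    ph, pw = h // n, w // n
    cols = []
    for r in range(n):
        for c in range(n):
            patch = image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            cols.append(patch.reshape(-1))
    return np.stack(cols, axis=1)

def build_patch_dictionaries(templates, n):
    """Collect the k-th patch of every template image into one dictionary.

    templates: list of equally sized 2-D arrays
    Returns a list of n*n matrices, each of shape (patch_dim, num_templates).
    """
    per_template = [split_into_patches(t, n) for t in templates]
    stacked = np.stack(per_template, axis=2)   # (patch_dim, n*n, num_templates)
    return [stacked[:, k, :] for k in range(n * n)]
```

With n = 2, as in the experimental setup, each template contributes four vectorized patches, and each of the four resulting dictionaries holds the same patch position from every template.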
The sparse residuals used in [25] effectively reduce the reconstruction errors in the tracking process. On this basis, to improve the performance of high-quality templates on targets, an attention mechanism [41], [42] is added to (2), and formula (3) is obtained.
Z is a sparse representation coefficient, and e is a reconstruction error.
where W represents the weight of the template. For the output y at a certain moment, W represents its attention on each part of the input x, that is, the weight of the contribution of each part of the input x to the output at that moment. The template $D^k$ that represents the target in the current frame more strongly is given a higher weight $W_{high}$, which improves the performance of the target template in subsequent frames and enhances the robustness of the algorithm. Finally, considering the sparseness constraints on both the global image and the local image patches, this paper uses a combination of structural sparseness methods to enhance the sparseness of the coefficient Z and to make full use of the inherent relationship between candidate targets. Thus, combining (1) and (3), we propose a new target tracking model, as shown in (4).
Here, $X^k$ consists of the k-th patch of all n candidate targets, $D^k$ represents the target template of the k-th patch, Z is the local observation representation of the k-th patch of the target template, $W^k$ represents the weight coefficient of $D^k$, and $e^k$ represents the reconstruction error of the k-th patch. $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are nonnegative parameters of the regularization terms. The $\ell_{p,q}$ mixed norm is defined as

$$\|Z\|_{p,q} = \Big( \sum_i \Big( \sum_j |Z_{ij}|^p \Big)^{q/p} \Big)^{1/q},$$

where $Z_{ij}$ denotes the element in the i-th row and the j-th column. The $\ell_{2,1}$ mixed norm is used on the row groups of P so that the relevant local color patches have similar representations; the group lasso penalty is used on the column groups of Q to identify outliers at the same time.
We divide the solution of (4) into two steps. The first step uses the accelerated proximal gradient (APG) algorithm to solve for P and Q. Applying composite gradient mapping to (4) gives the function that is minimized in each APG iteration; in the m-th iteration, the solution $(R_m, S_m)$ is obtained by solving the minimization problem (9). The solution of (9) can be divided into two parts, one for P and one for Q.
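The mixed norm and the kind of proximal step that an APG iteration typically applies to row and column groups can be sketched as follows. This is a minimal illustration under the assumption that the P and Q subproblems reduce to the standard group-shrinkage operator of the $\ell_{2,1}$ norm; the exact forms of the paper's subproblems may differ, and the function names are hypothetical.

```python
import numpy as np

def mixed_norm(Z, p=2, q=1):
    """l_{p,q} mixed norm: l_p norm of each row, then l_q norm across rows."""
    inner = np.sum(np.abs(Z) ** p, axis=1) ** (q / p)
    return float(np.sum(inner) ** (1.0 / q))

def prox_l21_rows(V, tau):
    """Solve argmin_P 0.5 * ||P - V||_F^2 + tau * sum_i ||P[i, :]||_2.

    Each row is shrunk toward zero as a group, so rows with a small norm
    vanish entirely (row-wise sparsity).
    """
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return V * scale

def prox_l21_cols(V, tau):
    """Same group shrinkage applied to columns, e.g. to flag outlier particles."""
    return prox_l21_rows(V.T, tau).T
```

Row-wise shrinkage encourages all particles to share the same few templates, while column-wise shrinkage lets an entire particle be marked as an outlier, which matches the roles the text assigns to P and Q.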
In the second step, after P, Q, and Z have been found, we fix them and solve for W and e. Because W is constrained by a ridge regression term, it can be derived directly.
Then, with W, P, Q, and Z fixed, e can be acquired by minimizing $f(e) = \|X - WDZ - e\|_F^2 + \|e\|_1$, which is essentially a convex optimization problem whose global minimum can be obtained with the contraction operator [25]. Thus e can be obtained from (14):

$$e = \beta_\tau(X - WDZ) \tag{14}$$

where $\beta_\tau$ is the contraction operator, which in its standard elementwise (soft-thresholding) form is $\beta_\tau(x) = \mathrm{sgn}(x)\max(|x| - \tau, 0)$.
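The update of e can be illustrated with a short sketch of the standard soft-thresholding operator. For the objective as written above, $\|R - e\|_F^2 + \|e\|_1$ with residual $R = X - WDZ$, the elementwise minimizer is soft-thresholding with $\tau = 1/2$; the function names and the default value of `tau` are assumptions made for this illustration.

```python
import numpy as np

def shrink(x, tau):
    """Contraction (soft-thresholding) operator applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def update_error(residual, tau=0.5):
    """Update e from the residual R = X - W D Z as in (14).

    tau = 0.5 is the exact minimizer of ||R - e||_F^2 + ||e||_1; a different
    weight on the l1 term would change tau accordingly.
    """
    return shrink(residual, tau)

def objective(residual, e):
    """Value of ||R - e||_F^2 + ||e||_1, used here only as a sanity check."""
    return float(np.sum((residual - e) ** 2) + np.sum(np.abs(e)))
```

Entries of the residual whose magnitude is below the threshold are set to zero, so e keeps only large, localized errors such as those caused by occlusion.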

B. HANDLING OF SCALE CHANGES
Scale change of the target is a key issue in target tracking. Existing methods for dealing with scale change mainly use the scale pyramid method (SAMF) [39] and the multiscale factor method (DSST) [40] to obtain the best-performing template scale. In this paper, the predicted target is multiplied by scale factors of different sizes to extract the corresponding kernel density features. The optimal value obtained by solving (17) with template matching gives the optimal scale of the predicted target. The reference target model is represented by the density estimation feature $\hat{q}$ in the feature space [26], as shown in (15). The target candidate is defined at position y and is characterized by the density estimation feature $\hat{p}(y)$, as shown in (16). The k-th block feature dictionary $D^k$ is obtained by collecting the density estimation features $\hat{q}$ of the k-th block of the target model. The k-th block of the i-th candidate test sample, $x_i^k$, is formed from the density estimation feature $\hat{p}(y)$ collected for the i-th target candidate at y. Then, the feature dictionary $D^k$ and the test sample $x_i^k$ are substituted into (17) to obtain the similarity value between the target candidate and the target model, and the scale with the largest similarity value is selected as the predicted target scale.
Here, $y_t$ represents the observed value, and $s_t$ denotes the system state; α and β are constant coefficients.
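The scale search described in this subsection can be sketched as follows. This is a minimal illustration with two loudly stated assumptions: the density estimation feature is taken to be a kernel-weighted gray-level histogram with an Epanechnikov profile, as is common in kernel-based tracking, and the Bhattacharyya coefficient stands in for the similarity that the paper obtains from (17). All function names, the bin count, and the scale factors are hypothetical.

```python
import numpy as np

def kernel_histogram(patch, n_bins=16):
    """Kernel-weighted gray-level histogram of a patch with values in [0, 255].

    Pixels near the patch center receive larger weights (Epanechnikov
    profile), so the feature is less affected by background near the border.
    """
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r2 = ((ys - cy) / (h / 2.0)) ** 2 + ((xs - cx) / (w / 2.0)) ** 2
    weights = np.maximum(1.0 - r2, 0.0)
    bins = np.minimum((patch.astype(float) * n_bins / 256.0).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=weights.ravel(), minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

def crop(frame, center, size):
    """Crop a (height, width) window around center, clipped to the frame."""
    cy, cx = center
    h, w = size
    y0 = max(int(round(cy - h / 2.0)), 0)
    x0 = max(int(round(cx - w / 2.0)), 0)
    y1 = min(y0 + h, frame.shape[0])
    x1 = min(x0 + w, frame.shape[1])
    return frame[y0:y1, x0:x1]

def bhattacharyya(p, q):
    """Stand-in similarity between two normalized histograms."""
    return float(np.sum(np.sqrt(p * q)))

def select_scale(frame, center, base_size, q_ref, scales=(0.5, 1.0, 2.0)):
    """Return the scale factor whose candidate feature best matches q_ref."""
    best_scale, best_sim = None, -1.0
    for s in scales:
        size = (max(int(round(base_size[0] * s)), 1),
                max(int(round(base_size[1] * s)), 1))
        sim = bhattacharyya(kernel_histogram(crop(frame, center, size)), q_ref)
        if sim > best_sim:
            best_scale, best_sim = s, sim
    return best_scale, best_sim
```

In practice the scale factors would be close to 1, since the target size changes little between consecutive frames; the coarse factors above only keep the example easy to follow.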
The main process of the proposed adaptive structured sparse representation is summarized in Algorithm 1.

IV. EXPERIMENT
In this section, we show the performance of the proposed algorithm on mainstream video datasets and compare it with other algorithms, as well as some technical details in the implementation process.

A. EXPERIMENTAL SETUP
In the experiment, each video frame was converted into a grayscale image. The image was initially divided into 2 × 2 local patches of the same size. The template was selected from the video frame, and the template size was consistent with the size of the candidate target local patches. The search radius of the candidate targets is 1.5 times the target size, and the number of candidate targets is 200. After the target is tracked in each frame, the template is updated by comparing the error between the predicted target and each template patch: the template patch with the largest error is replaced with the target patch from the current frame, and the remaining templates are updated with the learning rate η. The experimental environment is MATLAB R2018a on a host with a 3.60 GHz CPU and 8 GB of memory.
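The template update rule described above can be sketched as follows. This is a minimal illustration that assumes the learning rate acts as a linear blend between each remaining template and the current target patch; the paper does not spell out the blend, so this form and the function name are assumptions. The default η = 0.7 is the value listed in Algorithm 1.

```python
import numpy as np

def update_templates(D, target, eta=0.7):
    """Update a template dictionary after one tracked frame.

    D:      (d, m) matrix, one vectorized template patch per column
    target: (d,) vectorized target patch from the current frame
    eta:    learning rate (assumed weight kept from the old template)
    """
    # Error between the predicted target and every template patch.
    errors = np.linalg.norm(D - target[:, None], axis=0)
    worst = int(np.argmax(errors))
    # Assumed blend: keep eta of the old template, take (1 - eta) of the target.
    D_new = eta * D + (1.0 - eta) * target[:, None]
    # The template with the largest error is replaced outright.
    D_new[:, worst] = target
    return D_new, worst
```

Replacing only the worst template keeps most of the appearance history, while the blend lets the remaining templates drift slowly toward the current appearance.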

B. EXPERIMENTAL EVALUATION INDICATORS
There are three kinds of evaluation indexes in this experiment: average overlap rate, center position accuracy, and accuracy rate. For a given frame, the ground-truth bounding box $R_{boundary}$ is compared with the predicted bounding box $P_{boundary}$ obtained in the experiment. Let their center positions be $R_{center}$ and $P_{center}$; the center position error is $E_{center} = \|R_{center} - P_{center}\|_2$, and the average overlap ratio is defined from the overlap between $R_{boundary}$ and $P_{boundary}$. The accuracy rate is determined by whether the distance Dist between the real coordinate $R_{center}$ and the predicted coordinate $P_{center}$ is less than 20 pixels.

Algorithm 1 Algorithm Process
1. Initialize the tracking target position pos(1) and obtain the template dictionary D. After multiple experiments, the results are best when the parameters are set as follows: the attention mechanism parameter W is set to 1, the window size padding is 1.5 times the target size, the reconstruction error e is initialized to 0, and the constants are λ1 = 0.001, λ2 = 0.001, λ3 = 5, λ4 = 1.
For i = 2 : imgnum (video frames)
   2. Use good point set sampling to obtain the candidate target set X. Solve (4) in two steps:
      a. First step: fix W and e, and solve for Z using the APG algorithm. While t < T (T: the maximum number of iterations, t: the number of iterations), in the m-th iteration, solve for the two parts P and Q ((11), (12)); then Z = P + Q.
      b. Second step: fix Z, P, and Q, and solve for W and e in sequence according to (13) and (14).
   3. According to (17), select the candidate with the highest similarity as the tracking target to obtain the target center position loc.
   4. Based on loc, substitute the candidates obtained by scaling the target selection box with different proportions into (17) to obtain the final predicted target size of the current frame. Save the target position and size in pos(i).
   5. Update the template with learning rate η = 0.7.
   6. Return to step 2 and save pos(i).
7. Output pos.
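The evaluation indicators described in this subsection can be sketched as follows. This is a minimal illustration that assumes boxes are stored as (x, y, width, height) rows and that the overlap ratio is the common intersection-over-union of the two boxes; both the box layout and the function names are assumptions made for this example.

```python
import numpy as np

def center_error(gt, pred):
    """Euclidean distance between box centers, one value per frame."""
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0
    return np.linalg.norm(gt_c - pred_c, axis=1)

def overlap_ratio(gt, pred):
    """Intersection area divided by union area, one value per frame."""
    x1 = np.maximum(gt[:, 0], pred[:, 0])
    y1 = np.maximum(gt[:, 1], pred[:, 1])
    x2 = np.minimum(gt[:, 0] + gt[:, 2], pred[:, 0] + pred[:, 2])
    y2 = np.minimum(gt[:, 1] + gt[:, 3], pred[:, 1] + pred[:, 3])
    inter = np.clip(x2 - x1, 0.0, None) * np.clip(y2 - y1, 0.0, None)
    union = gt[:, 2] * gt[:, 3] + pred[:, 2] * pred[:, 3] - inter
    return inter / union

def accuracy_rate(gt, pred, threshold=20.0):
    """Fraction of frames whose center error is below the pixel threshold."""
    return float(np.mean(center_error(gt, pred) < threshold))

def success_rate(gt, pred, threshold):
    """Fraction of frames whose overlap ratio exceeds the threshold."""
    return float(np.mean(overlap_ratio(gt, pred) > threshold))
```

Averaging `center_error` and `overlap_ratio` over a video gives per-video summaries of the kind reported in the tables, and sweeping the thresholds of `accuracy_rate` and `success_rate` gives curves of the kind shown in the precision and overlap plots.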
In this paper, some challenging videos with low resolution, plane rotation, scale change, deformation, background change, light change, motion blur, and fast motion are selected as experimental videos. STC, UDT, SRDCF, DSST, TADT, Struck, BACF, CN, CSK, L1APG, SCT4 and STRCF are selected as the benchmark algorithms for comparison. The comparison results on 17 videos are shown in Tables 1, 2, and 3. For each video, the results of the three best-performing algorithms are labeled with superscripts 1, 2 and 3, respectively. Table 1 compares the average center location error of the different algorithms; a lower average center location error indicates better tracking results. Table 2 compares the average overlap ratio of the different algorithms on the video dataset; the higher the average overlap ratio, the higher the accuracy of the tracking results. Table 3 shows the average tracking accuracy of the different algorithms within a 20-pixel error; a higher value indicates better accuracy. On most of the videos in Table 1, our method is the best.
In all of the videos, the center location error of our method is lower than 10 pixels; only TADT and UDT achieve similar results.
On the videos Car4, Crossing, and Man, the center location error of our method is lower than 2 pixels. This indicates that our method is effective and robust on the test videos.
According to the data in Table 2, the proposed algorithm has the highest average overlap rate on Car4, David2, Faceocc1, Jumping, Mountain Bike and other video sets. On the Shaking video, the overlap rate of our method reaches 0.911, which is much higher than the 0.862 of the second-ranked TADT. On the Crossing video set, the overlap rate of our method reaches 0.991, which is significantly higher than the 0.937 of the second-ranked STRCF method. Before modeling, our algorithm segments the template and the tracking target into local blocks, which enhances the local information of the target and mitigates the impact of target changes on the tracking result. The introduction of an attention mechanism further improves the accuracy and robustness of the proposed algorithm.
As seen in Table 3, the proposed algorithm has the highest center accuracy on the BlurCar2, Boy, and Faceocc1 video sets, and the tracking accuracy is the highest. Within the range of 20 pixels, the tracking accuracy of the proposed algorithm on most of the videos, such as Crossing, Deer and Jumping, reached 1; even on the Faceocc1 video, our accuracy reached 0.98 and ranked first. The accuracy of most of the other algorithms, such as UDT, is less than 0.9. When these videos show fast motion, motion blur and partial occlusion, the proposed algorithm can still track the target accurately. This is mainly due to the sparse model used in the proposed algorithm, which makes full use of the local and global information of the target, greatly reduces the error caused by changes in target appearance, and improves the robustness of the algorithm.
As can be seen from the tables above, in Boy, Car4, Football, Jumping, Deer and other videos, our method achieves better and more robust performance than the other methods. Our method also worked well for some of the fast-moving videos, such as Boy, and for the partial occlusion videos. The structured sparse representation method considers not only the spatial layout of the image blocks inside each target candidate region but also the internal relations between the target candidate regions and between the local blocks. On this basis, this paper adds an attention mechanism to strengthen the online learning of the target in the template and to continuously weaken the influence of the background on the tracking results. The attention mechanism helps to obtain more discriminative information from the sparse coding coefficients. With the continuous updating of the template and attention parameters, the reconstruction error of the moving target is continuously reduced, so the proposed method performs better on moving targets with motion blur, such as Deer, and with partial occlusion. Because good point sets are sampled relatively uniformly, they help to collect more evenly distributed samples during the sampling process. This makes it possible to quickly determine the approximate location of the target, decrease the running time, and improve the efficiency of the algorithm. In videos such as Car4, a reasonable scaling strategy helps to predict the change in the target scale from the size in the previous frame and the information in the current frame. The kernel density feature is not easily affected by changes in illumination and scale, which helps the system to obtain higher reliability.
From Figs. 3, 4, and 5, it can be seen that the proposed algorithm has excellent performance in terms of overlap and center accuracy and distinguishes itself from the other comparison algorithms. The proposed algorithm benefits from the attention mechanism and therefore has stronger robustness. In the case of motion blur, the proposed algorithm can accurately track the target. At the same time, because good point set sampling is used in this paper, the sampling points are more evenly distributed over a larger sampling range, so the algorithm can better handle challenges such as fast motion. When dealing with partial occlusion and partial deformation, structured sparse representation considers the commonality between particles and the spatial structure of local blocks, so it is strongly robust to partial occlusion.
As seen in Fig. 4, when the threshold is approximately 0.1, the overlap ratio of the proposed algorithm is close to 98%, while TADT, UDT, and STRCF reach only approximately 90% to 95%. When the threshold is 1, our method shows a greater advantage than the other algorithms. According to Fig. 5, in the face of multiple challenges, the proposed algorithm makes full use of effective scaling strategies to ensure tracking accuracy. Because of the effective adjustment of the target size, the center error is greatly reduced. It can be seen in Fig. 5 that when the threshold is low, the gap between the proposed algorithm and the comparison algorithms is small; when the threshold is larger than 15, the average center accuracy of the proposed algorithm exceeds 0.9, which is higher than the 0.8 accuracy of the rest of the comparison algorithms. Table 1, Table 2, Table 3, Fig. 4, and Fig. 5 show that the proposed algorithm, which combines structured sparse representation with the tracking method based on the sparse prototype, handles partial occlusion, fast motion, light change and motion blur better than the other algorithms and performs better in terms of accuracy, center error and overlap. The attention mechanism improves the robustness of the algorithm. Fig. 3 shows a comparison of the actual tracking results of the proposed algorithm and the other algorithms on multiple video sets.
We can see from Fig. 4 that our method shows an advantage over the other methods from the beginning, but this advantage decreases as the threshold increases. This indicates that our strategy is effective for dealing with size, but there is room for improvement, and the larger the threshold is, the smaller the gap between methods. The slowly changing smooth curves in Fig. 4 and Fig. 5 also indicate that our method is robust and that adding an attention mechanism to the sparse structure is a sound choice. A reasonable weighting mechanism makes the attention mechanism more robust, so that templates that performed better in previous frames receive more weight. Some trackers use an intensive sampling method to cover the state of the target object, but this can cause other problems. First, it is hard for them to sample all possible particles that may include the object states. In contrast, the more uniform sampling of good point sets greatly reduces the possibility of incomplete collection of target samples. Second, compared with methods that use only simple template updates in tracking, our method reduces the possibility of replacing or updating valid target templates because of the added attention mechanism and online update strategy. Third, simple features such as gray features are disturbed by external information, while feature extraction methods based on kernel density are less susceptible to other information.
The experiments show that the proposed algorithm achieves good results. It achieves excellent results in dealing with fast motion of targets, changes in lighting, and motion blur, and it is reasonably effective when dealing with partial occlusion and deformation. However, the algorithm still lacks effective coping strategies when the target is out of view, is completely occluded, or changes shape drastically.

V. CONCLUSION
In this paper, we proposed a new moving target tracking method that combines structural sparse representation and prototype-based sparse tracking and introduces an attention mechanism. The proposed method effectively improves the accuracy rate and the overlap rate of tracking, and the adopted scale change strategy allows the algorithm to perform well under target scale change without reducing the tracking accuracy, thus greatly enhancing the robustness of the algorithm. In the solving process, we used the APG algorithm to solve the target model step by step and then solved for the optimal scale of the predicted target through the similarity between the template and the kernel density features of the predicted target at different scales. The algorithm realized robust tracking of the target and updated the target template according to the tracking results. Experimental results show that this method achieves the goal of stable target tracking. In future work, we will extend our idea and methodology to other multimedia applications such as segmentation [49], detection [50], recommenders [51] and dehazing [52].
JIE WANG is currently pursuing the master's degree with the Guangxi University for Nationalities, Nanning, China. His main research interests include image processing and pattern recognition.
SHIBIN XUAN received the Ph.D. degree in computer science and technology from Sichuan University, Chengdu, China, in 2011. He is currently a Full Professor and a Master's Supervisor with the School of Information Science and Engineering, Guangxi University for Nationalities, Nanning, China. His main research interests include image processing and pattern recognition.
HAO ZHANG is currently pursuing the master's degree with the Guangxi University for Nationalities, Nanning, China. His main research interests include image processing and deep learning.
XUYANG QIN is currently pursuing the master's degree with the Guangxi University for Nationalities, Nanning, China. His main research interests include image processing and pattern recognition. VOLUME 8, 2020