High-Performance Visual Tracking Based on High-Order Pooling Network

Convolutional Neural Network (CNN) features have been widely used in visual tracking due to their powerful representation ability. As an important component of a CNN, the pooling layer plays a critical role, but the max/average/min operation only exploits first-order information, which limits the discrimination ability of CNN features in some complex situations. In this paper, a high-order pooling layer is integrated into the VGG16 network for visual tracking. In detail, a high-order covariance pooling layer replaces the last max-pooling layer to learn discriminative features, and the network is trained on the ImageNet and CUB200-2011 data sets. In the tracking stage, multiple levels of feature maps are extracted as the appearance representation of the target. The extracted CNN features are then integrated into the correlation filter framework for on-the-fly tracking. Experimental results show that the proposed algorithm achieves excellent performance in both success rate and tracking accuracy.

The challenges of this task are mainly concentrated on two aspects. Studies have shown that good feature representation plays a very important role in dealing with the above complex scenarios [51]. In order to obtain a good feature representation of the target, researchers have designed a large number of excellent hand-crafted features, such as HOG, SIFT and so on. However, these features only perform well in simple scenes and struggle to deal with complex scenes. In recent years, deep learning has been widely applied across computer vision [3], [17], [44], and has brought great performance improvements in tasks such as image classification [23], [43], object detection [44], image segmentation [34], and object recognition [17]. Because of the strong correlation between visual tracking and tasks such as object detection and image recognition, especially in feature representation, it is natural to apply deep features to visual tracking.

There are many studies on using deep features for tracking, which can be divided into two categories according to whether the features need to be updated online. The first category comprises methods that do not need online updates, which can be further divided into two types. The first type combines a pre-trained classification network and a correlation filter [18], [36], [42]. For example, Danelljan et al. [36] use the VGG19 network as the feature extractor and adopt multi-layer feature fusion to improve tracking performance. However, no explanation is given for the selection of the fusion weights, and only a fixed-weight fusion method is used. Meanwhile, high-order pooling has obvious advantages in feature representation in tasks such as person re-identification [6] and semantic segmentation [5], where objects of the same class are highly similar.
In order to improve the discrimination ability of the tracking algorithm for similar targets, this paper makes appropriate improvements to the feature extraction network, replacing the original first-order max pooling with a second-order covariance pooling method, so that the network can extract more high-order statistics of the target features and improve the model's discriminative power. As shown in Fig. 1, in the Biker video sequence our deep features focus more closely on the target to be tracked. In the CarScale and Dog sequences, the traditional first-order pooling CNN model cannot extract the characteristics of the target itself well and is easily disturbed by surrounding objects, whereas the deep model in this paper produces a stronger response on the target.

At the same time, the network is also retrained on ImageNet [45] and the fine-grained CUB200-2011 data set [50], which further improves the model's ability to distinguish similar objects.

Starting from improving the representational ability of target features, this paper proposes a visual tracking algorithm based on a high-order pooling network (HOPNet). Firstly, the high-order pooling network is constructed. Then, large-scale data sets are used to train this model so that the learned features are more discriminative. Lastly, the algorithm still uses a particle filter for sample sampling and learns a correlation filter over the extracted features, whose ideal output satisfies

g = f \otimes h,

where f represents the training samples, h denotes the filters, \otimes is the spatial correlation operator, and g is the ideal output, generally set as a Gaussian window function.

VOLUME 10, 2022

We then calculate the covariance matrix of the feature map X = [x_1, \ldots, x_n]:

P = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T,

where \mu is the mean of the n local descriptors. Eigenvalue decomposition is then used to process the obtained covariance matrix to obtain eigenvalues and eigenvectors:

P = U \Lambda U^T,

where \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d) is a diagonal matrix, \lambda_i is an eigenvalue, and U = [u_1, \ldots, u_d] is the matrix of corresponding eigenvectors. Through the above eigenvalue decomposition, we can convert the power of the matrix into powers of the eigenvalues.
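As a rough illustration, the covariance pooling step above (covariance matrix, eigen-decomposition, eigenvalue exponentiation) can be sketched in NumPy. The function name and the eigenvalue-clipping guard are our own additions, and the default power alpha = 0.5 (matrix square root) is one common choice rather than a value stated here:

```python
import numpy as np

def covariance_pooling(X, alpha=0.5):
    """Second-order (covariance) pooling of a feature map.

    X: (d, n) matrix of n local descriptors with d channels.
    alpha: power applied to the eigenvalues (0.5 = matrix square root;
           an assumed default, not prescribed by the paper).
    """
    d, n = X.shape
    mu = X.mean(axis=1, keepdims=True)
    P = (X - mu) @ (X - mu).T / n           # d x d covariance matrix
    lam, U = np.linalg.eigh(P)              # P = U diag(lam) U^T
    lam = np.clip(lam, 0.0, None)           # guard against tiny negative eigenvalues
    return U @ np.diag(lam ** alpha) @ U.T  # P^alpha = U diag(lam^alpha) U^T
```

Because the covariance matrix is symmetric positive semi-definite, `eigh` suffices and the matrix power reduces to element-wise powers of the eigenvalues, exactly as described above.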
The matrix power is then computed through the exponentiation of the eigenvalues:

P^{\alpha} = U \Lambda^{\alpha} U^T = U \, \mathrm{diag}(\lambda_1^{\alpha}, \ldots, \lambda_d^{\alpha}) \, U^T.

Now the forward propagation of the covariance pooling layer has been deduced. For back-propagation, according to the chain rule, the gradients \partial l / \partial P and \partial l / \partial X are computed from the gradient of the loss l with respect to the layer output. The objective function of the correlation filter is given in Eq. 12.

Aiming at the problem that the dimension of the last convolution layer is too high, this paper uses a 1 × 1 convolution to reduce the channel dimension of the last convolution layer from 512 to 256 before sending it to the covariance pooling layer. In the testing stage, the third, fourth and fifth layer features of the network are selected as the appearance representation of the target, and the extracted features are multiplied by a cosine window to eliminate boundary discontinuity. The target search box is a square area centered on the target, with a side length determined by the length and width of the target. For the regularization parameters in Eq. 12, λ = 1 and µ = 16. The scale estimation adopts the same parameter settings as in reference [26], and the weights of the multi-layer fusion are set to 1, 0.5 and 0.5 from the fifth layer to the third layer.

In addition to the overall performance on the data set, we also need to examine tracking performance in different tracking scenarios. As mentioned above, the 100 videos of OTB2015 can be divided into 11 attributes. Fig. 5 and Fig. 6 show the tracking accuracy and success rate of the different algorithms under each attribute. It can be clearly seen from Fig. 5 and Fig. 6 that the algorithm in this paper achieves excellent tracking results on almost all attributes.
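The cosine windowing and the fixed-weight multi-layer fusion (weights 1, 0.5, 0.5 from the fifth to the third layer) mentioned above might be sketched as follows. The per-layer max-normalization before summation is our assumption, not a detail specified in the paper:

```python
import numpy as np

def cosine_window(h, w):
    # 2-D Hann window: tapers feature maps toward zero at the edges
    # to suppress the boundary discontinuity of circular correlation
    return np.hanning(h)[:, None] * np.hanning(w)[None, :]

def fuse_responses(responses, weights=(1.0, 0.5, 0.5)):
    # Fixed-weight sum of per-layer correlation response maps,
    # ordered from the fifth to the third convolution layer.
    fused = np.zeros_like(responses[0], dtype=float)
    for resp, w in zip(responses, weights):
        # assumption: normalize each layer's response to [0, 1] range
        fused += w * resp / (np.abs(resp).max() + 1e-12)
    return fused
```

In use, each layer's feature map would be multiplied by `cosine_window` before correlation, and the target location taken as the argmax of the fused response.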
Especially in dealing with background clutters (BC), the algorithm in this paper achieves 0.705 in success rate and 0.902 in tracking accuracy, far superior to similar tracking algorithms based on deep features (HCF, HDT, etc.). This further verifies that the high-order pooling network used in this paper has a good ability to distinguish similar objects. In addition, in low-resolution target tracking, the accuracy of the algorithm is 0.977 and the success rate is 0.702, which is mainly due to the further training of the feature extraction network on the fine-grained image data set, improving the representation ability of the network. In complex scenes such as illumination change, rotation and motion blur, the algorithm in this paper achieves the best tracking accuracy and success rate.

In order to show the effectiveness of this algorithm more intuitively, it is compared with the three most relevant comparison algorithms on five challenging video sequences: Girl2, Bolt2, Biker, Lemming and Human3. As shown in Fig. 7, the algorithm in this paper handles these complex scenes well.

(1) Fast Motion: because the correlation filter algorithm detects and tracks the target within a fixed search area, it is difficult to cope well when the target moves fast. At the same time, fast movement leads to obvious changes in the target's appearance, which further aggravates the difficulty of tracking. For example, in Biker there is a fast and obvious appearance change of the target, and the compared algorithms' tracking accuracy and success rate are lower than those of the algorithm in this paper.

FIGURE 5. Comparison results of sub-attribute success rate on the OTB2015 data set, where HOPNet represents the algorithm in this paper.

(2) Background Clutters: the interference of similar objects in the background makes even trackers based on a pre-trained network fail to track.

(3) Occlusion: although the algorithm in this paper does not use a dedicated occlusion detection mechanism, thanks to better feature representation it can accurately recapture the target when it reappears, while the HCF and HDT algorithms cannot detect the target again because of their weak feature discrimination.

(4) Low Resolution: in the field of computer vision, the detection and tracking of low-resolution targets have always been difficult problems. Due to the low resolution, less information is available and the feature representation is weaker. The targets tracked in Biker and Human3 are small, and the one in Human3 is basically a pure black object, which places higher requirements on feature representation.

FIGURE 8. Overall performance comparison plots on the Temple-Color128 data set, where HOPNet represents the algorithm in this paper.

On the Temple-Color128 data set, the proposed algorithm is also significantly better than the CF2 tracker [36]. In general, compared with the most advanced tracking algorithms on Temple-Color128, the algorithm proposed in this paper still has good performance and competitiveness.

The TrackingNet data set includes 30,000 videos covering a wide range of target categories and complex tracking scenarios. As shown in Table 2, our tracker achieves 0.642, 0.597 and 0.738 in success rate, precision and normalized precision, respectively. Compared with the correlation filter algorithm ECO [9], which is based on the same pre-trained model, our algorithm shows an obvious performance improvement, which fully verifies that the high-order pooling network has better feature expression ability. However, it is worth noting that, compared with Siamese-network-based and transformer-based tracking algorithms, the algorithm in this paper still has a large performance gap. The main reason is that TransT [7] and SiamRPN++ [24] are trained on large-scale data sets, and their backbone networks are also better than ours. In the future, we may consider combining high-order pooling with the Siamese network.

As shown in Fig. 9, for the Twinings sequence, when the target encounters rotation, rapid scale change and serious interference from similar objects, the algorithm in this paper can still track it stably.

Finally, the feature extraction network and the correlation filter algorithm are combined to achieve the best tracking performance on large-scale data sets.

Due to the successful application of the end-to-end idea in tracking tasks, we will continue to study how to introduce higher-order information into the Siamese network structure to further enhance feature discrimination.