A Visual Tracking Algorithm Combining Parallel Network and Dual Attention-Aware Mechanism

To address semantic loss and inaccurate boundary detection in object tracking, this paper proposes a visual tracking algorithm that combines a parallel structure with a dual attention-aware mechanism. The backbone is a parallel structure organized as a Convolution and Attention Cooperative (CAC) processing module and is used for feature extraction. Because this structure captures the local and global information of the target simultaneously, it alleviates the loss of semantic information. A Dual Attention-aware Network (DAN), composed of target-aware attention and boundary-aware attention, is used for feature enhancement. An online template-update strategy improves template quality, and an effective score-prediction module, the Template Elimination Mechanism (TEM), is designed in the CAC processing module to select high-quality templates. The resulting tracker, which combines local and global information, is called TrackCAC. Evaluation on several datasets shows that the algorithm maintains high tracking precision and success rate across scenarios, and the performance evaluation on the VOT datasets demonstrates good robustness and accuracy.


I. INTRODUCTION
In recent years, visual object tracking has played an important role in video surveillance, unmanned delivery, drone piloting, and infrared information perception [1], [2], [3]. Research on visual object tracking originated with correlation filtering algorithms [4]; convolutional neural networks were then applied to tracking, with representative algorithms including DLT [5], FCNT [6], and HCFT [7]. With the introduction of the SiamFC [8] network in 2016, object tracking improved further and could better handle tracking drift in real scenes. In 2018, Li et al. [9] proposed the SiamRPN algorithm, composed of a Siamese network and a region proposal network, which addresses the scale-variation and similar-object-interference problems of SiamFC while running at a much higher frame rate. In 2019, Li et al. [10] proposed SiamRPN++ based on SiamRPN: because of the serious imbalance between the parameters of the classification and regression branches during tracking, the algorithm mainly proposes a sampling strategy to break the limitation of translation invariance, which also maintains a high recognition rate on long video sequences. In 2019, Guo et al. [11] proposed the SiamCAR object tracking network, which includes a Siamese subnetwork and a classification-regression network: classification obtains the location and scale information of the target, and regression predicts the offset between the predicted and ground-truth bounding boxes; the model is trained offline and evaluated by online tracking. Many improved algorithms were later derived, such as DaSiamRPN [12], SiamRCNN [13], SiamMask [14], and so on.
However, convolutional neural networks alone cannot accurately capture the global information of the target. The attention mechanism proposed by the Google team [15] provided great inspiration for visual tracking. Wang et al. [16] took the lead in introducing self-attention into computer vision, proposing a non-local network that achieved great success in video understanding and object detection, although its contribution to object tracking was limited. A series of improved algorithms followed, such as EMANet [17], CCNet [18], HamNet [19], and stand-alone self-attention networks [20], which improved speed, result quality, and generalization ability. However, there is still no mature method for fusing convolutional neural networks with attention mechanisms to extract local and global information for object tracking. In 2020, Malong Technology proposed the SiamAttn algorithm [21], which includes a deformable self-attention mechanism and a cross-attention mechanism; the cross-attention gathers semantic information between the target template and the search image and further realizes template updating. In recent years, various pure self-attention networks (vision Transformers) [22], [23], [24], [25], [26], [27] have emerged one after another, showing the great potential of attention-based models. The core idea of the Transformer network is the attention mechanism; it is built on multi-head attention and has achieved great success, differing from plain attention in its encoder structure and residual feed-forward network. In visual tracking, the Transformer is often used as the backbone for feature extraction, so designing a good backbone network is very important. For example, Lin et al.
[28] designed a pyramid structure for feature extraction, which modifies the original backbone by adding or deleting convolution layers. To satisfy detection in specified scenes, Redmon et al. [29] designed DetNet, a backbone network specifically for object detection. The emergence of lightweight networks makes it possible to move tracking algorithms from theory to practice, for example MobileNet [30], ShuffleNet [31], SqueezeNet [32], Xception [33], and MobileNetV2 [34]. To improve accuracy and efficiency, deeper networks such as ResNet [35] and ResNeXt [36] can be used. The Siamese networks mentioned above extract the local information of an image, while attention mechanisms and Transformer networks capture long-distance global information. However, how to better combine local and global information during object tracking remains unsolved, as does how to correctly and efficiently detect the boundary between the target and the background. Our tracker proposes a feature-extraction method that fuses local and global information and introduces DAN. For the contour information of the target, we learn a boundary attention map for each boundary of the target box via target-aware and boundary-aware attention, respectively. Fig. 1 shows several representative visual response maps. As the figure shows, compared with convolution alone, under the coupled global representation the CNN branch gradually provides detailed local features, the patch embeddings of the attention branch retain important local features, and the integration of local and global information is achieved through the different sampling strategies of the BM.
As can be seen, the attention region produced by our method is more complete while the background is significantly suppressed, which means the CAC module has a higher discrimination ability for learned features and all decision modules become more efficient under our optimization scheme. The proposed fusion strategy produces a more robust and reliable tracking response.
Through the above analysis, the main contributions of this paper are as follows: (1) To address semantic loss, a parallel structure is proposed as a backbone network to extract features from the target image and the template image, respectively. This structure uses convolution operations and a self-attention mechanism to enhance representation learning, inheriting the general advantages of CNNs and visual attention mechanisms. The two branches are connected by a Bridging Module (BM), which interactively fuses local features and global representations at different resolutions, preserving both to the maximum extent and yielding higher discrimination ability. For an introduction to the principle of the BM, please refer to Section III.A.3.
(2) To address inaccurate boundary detection between target and background, DAN is proposed, built around target-aware and boundary-aware attention. After feature fusion by the Depth Cross-Correlation network (DCC), the boundary-aware attention identifies the boundary information of the object, while the target-aware attention mines the spatial information in the features from different angles. By combining boundary and object detection, the discrimination between target and background is effectively enhanced.
(3) To address online template updating, an effective score-prediction module, TEM, is designed in the CAC module to select high-quality templates, thus realizing an efficient online tracker based on a CNN and an attention mechanism.
(4) Evaluation on different datasets shows that the tracking accuracy of this combined method is greatly improved compared with using a convolutional neural network alone. The proposed CAC module, DAN module, and TEM template update are verified through an ablation study and heat-map analysis, and the experimental results show large improvements in tracking precision and success rate. The performance evaluation on the VOT datasets shows that the proposed parallel structure is very effective at improving object tracking.
The rest of this article is arranged as follows. Section II reviews related work, and Section III introduces the object tracking algorithm TrackCAC in detail. Section IV first presents an ablation study verifying the effectiveness of the proposed algorithm and then analyzes experiments and results on several recent datasets. Finally, Section V summarizes this article.

II. RELATED WORK
Based on the above research and inspired by the SiamCAR [11] method and target-box detection-and-tracking methods, this paper proposes an object tracking model based on a parallel structure and a dual attention-aware network. It uses a CNN and an attention mechanism to process feature extraction in parallel [37], [38], [39], constructs the RoI, accurately estimates the target-box boundary through the double depth cross-correlation module and the attention model, and learns to exploit spatial information from different angles. The representation ability of the features is enhanced, yielding higher accuracy of target-box estimation. Finally, template quality is improved through the online template-update strategy. The model achieves advanced performance on different datasets, with a higher modeling dimension for boundary detection and higher overall performance.

A. BACKBONE-PARALLEL STRUCTURE
In visual object tracking, the commonly used backbone is the Convolutional Neural Network (CNN), including the VGG series, GoogLeNet/Inception networks, the ResNet residual-network series, and a series of lightweight networks such as the MobileNet and ShuffleNet families. Among the VGG series, the classic VGG16 [40] network proposed by Simonyan and Zisserman has strong feature-fitting ability by virtue of its 16-layer depth; precisely because of its large number of parameters, however, its computational cost is too high for practical deployment. All the networks mentioned above are based on convolutional neural networks.
Recently, owing to the outstanding performance of the Transformer in natural language processing, it has also been applied to computer vision, for example in the ViT network [22], Swin Transformer [24], and DETR [27]. Since 2020, applying the Transformer to image processing has improved on the ability of convolutional neural networks to extract global information. ViT lacks the inductive bias of CNNs: if trained directly on ImageNet, a ViT of comparable size is not as good as ResNet, but if pretrained on a larger dataset and then fine-tuned on a specific smaller dataset, it outperforms ResNet. The existing Transformer-based classification model ViT must be pretrained on massive data (JFT-300M, 300 million images) and then fine-tuned on ImageNet to match CNN methods, which requires substantial computing resources and limits the further application of the ViT method. Touvron et al. [41] adopted a distillation method, using a teacher model to guide the learning of the Transformer-based DeiT. Compared with ViT, DeiT mainly adds a distillation token whose output is trained to be as close as possible to that of the teacher model. Without massive pretraining data, SOTA results can be achieved using only ImageNet data and fewer training resources; however, because DeiT needs complex training strategies, it is not suitable for training on some small datasets. Liu et al. [24] proposed the Swin Transformer, a hierarchical design with a shifted-window operation that limits attention computation to one window: on the one hand it introduces the locality of CNN convolution, and on the other it saves computation. Inspired by the Swin Transformer, we propose a parallel structure as the backbone network to extract the features of the target image and template image, respectively.
This structure uses convolution operations and a self-attention mechanism to enhance representation learning, inheriting the general advantages of CNNs and visual attention mechanisms. The BM connects the two branches and interactively fuses local features and global representations at different resolutions, preserving both to the maximum extent and yielding higher discrimination ability.

B. BOUNDARY DETECTION-DAN
Generally, the target in tracking is represented directly as a rectangular bounding box. Although many methods for target-box regression have appeared in recent years, accurately estimating the target box remains a challenging problem. Wang et al. [42] first used an RPN to obtain several candidate target boxes, then used a fully connected layer to screen these candidates while fine-tuning the box parameters. The method of [43] first estimated the target position using an online-updated classification model, also obtaining a number of candidate boxes; it then built an IoU-maximization prediction network to maximize the IoU between these boxes and the ground truth, fine-tuned the box parameters repeatedly, and finally selected the box with the largest IoU as the result. Woo et al. [44] put forward a plug-and-play lightweight attention module and experimented with the ordering of channel attention and spatial attention, finding that placing channel attention before spatial attention achieves better results; however, its generalization ability is not strong and it does not always improve network performance. Wang et al. [45] proposed combining real-time object detection and depth estimation based on a deep convolutional neural network; this combination is not integrated into the feature extraction of the CNN, which increases the computational load of the network. Lin et al. [46] proposed a multi-pedestrian tracking framework based on image input, which can track only one class; when there is more than one target class, training the model is very difficult. The object tracking algorithm proposed in this paper, combined with the boundary detection of the dual attention-aware network, maintains high precision and success rate in different scenarios.
The contour information of the object is computed by pixel-level correlation operations, and the boundary-aware attention module learns a boundary-aware attention map for each boundary of the target box from features in two directions, thereby enhancing the features of the boundary region and helping locate the boundary accurately. The dual attention network proposed in this paper is therefore boundary-aware: it can mine the spatial information of features from different angles, effectively improving the accuracy of boundary detection.

III. OBJECT TRACKING MODEL BASED ON DUAL ATTENTION AND PARALLEL NETWORK
As shown in Fig. 2, the tracking model in this paper adopts the structure of a traditional Siamese network: the upper branch processes the template patch and the lower branch the search patch. The convolution-and-attention cooperation module processes the target image and the search image, fully retaining local features and the global representation and providing high discriminative ability. The RoI of the search-image features is constructed during feature extraction, and the double depth cross-correlation module is then used for feature fusion. After fusion, the DAN further enhances the features at the boundary level with respect to the target's contour information. Finally, the target is tracked by accurately predicting its boundary. In addition, this paper introduces an online template-update strategy, which selects high-quality templates online by scoring and eliminates inferior templates, further improving the accuracy and stability of the tracker. The tracker can be trained end-to-end and achieves both high performance and high efficiency.

A. CONVOLUTION AND ATTENTION COOPERATIVE PROCESSING MODULE (CAC)
The CAC is composed of a CNN branch and an attention-mechanism branch, following the designs of ResNet [35] and ViT [26], respectively. It contains local convolution blocks, self-attention modules, and MLP units. The module lets the CNN-based local features and the attention-based global representations learn from each other, enhancing representation learning and improving the discriminative ability of the features.
The CAC module consists of a stem module, two branches, BMs bridging the two branches, and two classifiers (fc layers), one per branch. The stem module is a 7 × 7 convolution with stride 2 followed by 3 × 3 max pooling with stride 2; it extracts the initial local features, which are then fed to both branches. The CNN branch and the attention-mechanism branch are composed of N repeated convolution blocks and attention blocks, respectively. This parallel structure means the CNN branch and the attention branch can retain local features and global representations, respectively, to the maximum extent. The BM is proposed to integrate the local features in the CNN branch with the global representation in the attention branch. Because the initial features of the two branches are the same, the bridging module is applied from the second block onward. Along the branches, the BM progressively fuses feature maps and patch embeddings in an interactive way. Finally, for the CNN branch, all features are pooled and fed to one classifier; for the attention branch, the class token is taken out and fed to the other classifier. The outputs of the two classifiers are taken as the result of feature extraction and enhancement. The details are shown in Fig. 3, and the three structures are described below.
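The stem described above can be sketched as follows in PyTorch. This is a minimal illustration, not the authors' implementation; the input size of 224 × 224 and the 64 output channels are assumptions for demonstration (with them, the stem's two stride-2 stages produce the 56 × 56 map that a 4 × 4 patch projection later reduces to 14 × 14).

```python
import torch
import torch.nn as nn

class CACStem(nn.Module):
    """Shared stem: 7x7 conv (stride 2) + 3x3 max pool (stride 2), as in the CAC description."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=7, stride=2, padding=3)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        # Each stage halves the spatial resolution: 224 -> 112 -> 56.
        return self.pool(self.act(self.bn(self.conv(x))))

stem = CACStem()
feat = stem(torch.randn(1, 3, 224, 224))   # initial local features fed to both branches
# feat has shape (1, 64, 56, 56)
```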

1) CNN BRANCH
As shown in Fig. 3, the CNN branch adopts a feature-pyramid structure [28]: the resolution of the feature map decreases with network depth while the number of channels increases. In this paper the branch is divided into four stages, each consisting of several convolution blocks, and each convolution block contains n_c bottlenecks. Following the definition in ResNet, a bottleneck comprises a 1 × 1 down-projection convolution, a 3 × 3 spatial convolution, a 1 × 1 up-projection convolution, and a residual connection between its input and output. In our experiments, n_c is set to 1 in the first convolution block and satisfies n_c ≥ 2 in the subsequent N − 1 convolution blocks. A vision Transformer projects image patches into vectors in a single step, which loses local detail, whereas in a CNN the convolution kernel slides over the feature map with overlap, making it possible to preserve fine local detail. Therefore, the CNN branch can continuously provide local feature details to the attention-mechanism branch.
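The bottleneck just described follows the standard ResNet pattern; a minimal sketch (channel counts are illustrative assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet-style bottleneck: 1x1 down-projection, 3x3 spatial conv,
    1x1 up-projection, plus a residual connection from input to output."""
    def __init__(self, ch, mid):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, ch, 1, bias=False), nn.BatchNorm2d(ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # The residual connection lets the block preserve local detail.
        return self.act(x + self.body(x))

blk = Bottleneck(ch=256, mid=64)
y = blk(torch.randn(2, 256, 56, 56))   # shape is preserved: (2, 256, 56, 56)
```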

2) ATTENTIONAL MECHANISM BRANCH
This branch contains N repeated attention blocks. As shown in Fig. 3, each attention block consists of a multi-head self-attention module and an MLP block, with LayerNorm [47] applied before each layer and residual connections in both the self-attention layer and the MLP block. For tokenization, the feature map generated by the stem module is compressed into 14 × 14 non-overlapping patch embeddings through a linear projection layer, implemented as a 4 × 4 convolution with stride 4. A class token is then appended to the patch embeddings for classification. (The CNN branch, by contrast, uses 3 × 3 convolutions.)
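The tokenization step can be sketched as below. The embedding dimension of 384 is an illustrative assumption; the 4 × 4 stride-4 convolution and the prepended class token follow the description above (a 56 × 56 stem output yields 14 × 14 = 196 patch tokens plus one class token).

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Tokenization: non-overlapping 4x4 conv with stride 4, then prepend a class token."""
    def __init__(self, in_ch=64, embed_dim=384):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=4, stride=4)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):                               # x: (B, C, 56, 56) from the stem
        t = self.proj(x).flatten(2).transpose(1, 2)     # (B, 14*14, E) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)  # one class token per sample
        return torch.cat([cls, t], dim=1)               # (B, 196 + 1, E)

pe = PatchEmbed()
tokens = pe(torch.randn(2, 64, 56, 56))   # (2, 197, 384)
```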

3) BRIDGING MODULE
To solve the problem that the feature maps in the CNN branch and the patch embeddings in the attention-mechanism branch cannot be aligned directly, this paper proposes the BM to continuously connect local features and global representations in an interactive way. On the one hand, the feature dimensions of the CNN and the Transformer are inconsistent. A CNN feature map has dimension C × H × W, where C, H, and W are the channels, height, and width, respectively, while the patch embeddings have shape (K + 1) × E, where K, 1, and E are the number of image patches, the class token, and the embedding dimension, respectively. When a feature map is sent to the attention branch, it first passes through a 1 × 1 convolution to align the channel count with the patch-embedding dimension; a down-sampling module then completes the spatial alignment; finally, the result is added to the patch embeddings, as shown in Fig. 3. When feeding back from the attention branch to the CNN branch, the patch embeddings are up-sampled to align the spatial scale, the channel dimension is aligned with that of the CNN feature map by a 1 × 1 convolution, and the result is added to the feature map. LayerNorm and BatchNorm modules are used to regularize the features.
On the other hand, there is an obvious semantic gap between feature maps and patch embeddings: feature maps are collected by local convolution operators, while patch embeddings are aggregated by the global self-attention mechanism. Therefore, the BM is applied at every block to gradually close this semantic gap.
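The two alignment paths described above can be sketched as follows. This is an interpretation under stated assumptions, not the authors' code: channel counts and the 14 × 14 token grid are illustrative, average pooling stands in for the unspecified down-sampling module, and bilinear interpolation for the up-sampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BridgingModule(nn.Module):
    """Exchange information between a CNN map (C, H, W) and tokens ((K+1), E)."""
    def __init__(self, c_dim, e_dim, grid=14):
        super().__init__()
        self.grid = grid
        self.c2e = nn.Conv2d(c_dim, e_dim, 1)   # 1x1 conv: channel alignment, CNN -> tokens
        self.e2c = nn.Conv2d(e_dim, c_dim, 1)   # 1x1 conv: channel alignment, tokens -> CNN
        self.ln = nn.LayerNorm(e_dim)           # regularize on the token side

    def cnn_to_tokens(self, fmap, tokens):
        x = self.c2e(fmap)                                  # align channels to E
        x = F.adaptive_avg_pool2d(x, self.grid)             # spatial alignment to the token grid
        x = x.flatten(2).transpose(1, 2)                    # (B, K, E)
        out = tokens.clone()
        out[:, 1:] = self.ln(tokens[:, 1:] + x)             # add to patches; skip the class token
        return out

    def tokens_to_cnn(self, tokens, fmap):
        g = self.grid
        x = tokens[:, 1:].transpose(1, 2).reshape(-1, tokens.size(2), g, g)
        x = F.interpolate(x, size=fmap.shape[2:], mode="bilinear", align_corners=False)
        return fmap + self.e2c(x)                           # align channels and add back

bm = BridgingModule(c_dim=256, e_dim=384)
fmap, tokens = torch.randn(2, 256, 56, 56), torch.randn(2, 197, 384)
tokens2 = bm.cnn_to_tokens(fmap, tokens)   # (2, 197, 384)
fmap2 = bm.tokens_to_cnn(tokens, fmap)     # (2, 256, 56, 56)
```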

B. FEATURE FUSION NETWORK
The RoI constructed by the CAC module is a reduced search area. Although the target is already quite accurate relative to the search image, when there is similar-object interference or the target is very small, the target-box boundary detection model still has difficulty distinguishing the target from the background in this area, because plain RoI features lack target appearance information: the model struggles to determine which target is being tracked, let alone to distinguish the boundary of the tracked target.
To solve this problem, this paper proposes a template fusion module to fuse the information of the template and the RoI. The template target and the target in the RoI are different states of the same target. The template fusion module should not only introduce template information but also strengthen the information conducive to target-box boundary detection without destroying the original information. The proposed module combines the pixel-level correlation results of the template and the RoI with the original RoI features, so that the merged features supplement the target's contour information on top of the original RoI features, which is more conducive to detecting the boundary of the target box. The structure of the module is shown in Fig. 4. With the template feature F_z and the RoI feature F_x, the pixel-level correlation is

R = F_z ⋆_p F_x,

where ⋆_p denotes correlation at the pixel level; the process is shown in Fig. 4. In the correlation, each pixel of the template feature is regarded as a convolution kernel of size 1 × 1, and the operation computes the similarity between each template pixel and the entire RoI. Each pixel corresponds to one position of the template, each correlation map highlights the corresponding position of the target in the RoI, and the whole correlation result contains the outline information of the target. Because our method detects the boundary of the target box, the correlation results are taken directly as features and combined with the RoI features, supplementing the original RoI features and extracting the spatial information of corners. After the correlation, the template fusion module uses two 1 × 1 convolution layers to map the RoI features and the correlation results to features with the same number of channels.
The two mapped features are then merged, and their information is fused through a 1 × 1 convolution layer to obtain the final enhanced RoI feature F_r. In this way, the template fusion module not only introduces template information but also learns the target's contour information, which is more conducive to target-box boundary detection.
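The pixel-level correlation and fusion steps can be sketched as follows. Feature sizes (an 8 × 8 template feature, a 31 × 31 RoI, 256 channels) are illustrative assumptions; the structure follows the description above: each template pixel acts as a 1 × 1 kernel, producing one similarity map per template position.

```python
import torch
import torch.nn as nn

def pixel_correlation(fz, fx):
    """Treat each template pixel as a 1x1 kernel and correlate it with the whole RoI.

    fz: template feature (B, C, Hz, Wz); fx: RoI feature (B, C, Hx, Wx).
    Returns (B, Hz*Wz, Hx, Wx): one similarity map per template pixel.
    """
    kernels = fz.flatten(2).transpose(1, 2)              # (B, Hz*Wz, C)
    return torch.einsum("bkc,bchw->bkhw", kernels, fx)   # dot product over channels

class TemplateFusion(nn.Module):
    """Map correlation maps and RoI features to the same channel count, then fuse."""
    def __init__(self, roi_ch, tpl_pixels, out_ch):
        super().__init__()
        self.proj_corr = nn.Conv2d(tpl_pixels, out_ch, 1)  # 1x1 conv on correlation maps
        self.proj_roi = nn.Conv2d(roi_ch, out_ch, 1)       # 1x1 conv on RoI features
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)       # 1x1 conv fusing both

    def forward(self, fz, fx):
        corr = pixel_correlation(fz, fx)
        merged = torch.cat([self.proj_corr(corr), self.proj_roi(fx)], dim=1)
        return self.fuse(merged)                           # enhanced RoI feature F_r

fusion = TemplateFusion(roi_ch=256, tpl_pixels=8 * 8, out_ch=256)
fr = fusion(torch.randn(2, 256, 8, 8), torch.randn(2, 256, 31, 31))  # (2, 256, 31, 31)
```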

C. TRACKING ALGORITHM
The global feature and the template target feature are extracted by the CAC module, and feature fusion is carried out using the double depth cross-correlation module [10]. The dual attention-aware network is then applied to the fused features. As shown in Fig. 5, for the contour information of the target, a boundary attention map is learned for each boundary of the target box by target-aware and boundary-aware attention, so that the model attends to boundary positions, enhances the features there, and improves boundary localization accuracy.

1) DUAL ATTENTION-AWARE NETWORK
The target-aware attention module uses a double pyramid network to learn a spatial attention map. First, two 3 × 3 convolution layers capture global spatial information and enlarge the receptive field; the feature resolution is then improved by transposed convolution; in the fourth layer, a spatial attention map with values in [0, 1] is obtained through the sigmoid activation function. The target-aware network can adaptively attend to the information-rich target area and effectively distinguish the target from the background. The attention maps M_t^x and M_t^y in the two directions are learned as

M_t^x = δ(H^x(F_r)),  M_t^y = δ(H^y(F_r)),

where the superscripts x and y denote the two directions, H^x and H^y are the two hourglass networks, and δ is the sigmoid function. M_t^x is normalized vertically and M_t^y horizontally; the normalized attention maps are still written M_t^x and M_t^y. The attention maps are then multiplied with the RoI feature and aggregated to obtain the direction-aware features. Fig. 5 shows the structure used to detect and learn the two boundary-aware features horizontally. In the last layer, the final boundary-aware attention maps are likewise obtained by a sigmoid mapping:

M^d = δ(H^d(F_r)),  d ∈ {l, r, t, b},

where l, r, t, and b denote the left, right, top, and bottom directions, respectively. Finally, the boundary is detected by independently learning a heat map for each target-box boundary. Two transposed-convolution layers increase the resolution of the boundary-related features, a fully connected layer then reduces the number of channels and learns the heat map, and the final heat map is normalized by Softmax.
The heat map for each target-box boundary is a vector in which the value at each position represents the probability that the boundary lies at that position, so the sub-pixel coordinate of each boundary can be obtained by computing the expected position:

x_l = Σ_p p · h_l(p),  x_r = Σ_p p · h_r(p),  y_t = Σ_p p · h_t(p),  y_b = Σ_p p · h_b(p),

where p is the position, h is the heat map, the sums run over the heat-map lengths L_l, L_r, L_t, and L_b in the four directions, and x_l, x_r, y_t, and y_b are the coordinates of the left, right, top, and bottom boundaries, respectively. The tracked target box can therefore be represented as (x_l, x_r, y_t, y_b).

FIGURE 6. Description of the TEM. It includes two attention blocks and a three-layer perceptron. The score token predicts the score through the attention mechanism, with 0.5 as the threshold; when the score is lower than 0.5, the template is eliminated.
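The expectation over a Softmax-normalized heat map is a soft-argmax; a minimal sketch (the heat-map length of 11 is an arbitrary example):

```python
import torch

def boundary_from_heatmap(h):
    """Soft-argmax: expected position under a Softmax-normalized boundary heat map.

    h: raw heat-map logits of shape (L,). Returns a sub-pixel coordinate in [0, L-1].
    """
    prob = torch.softmax(h, dim=0)                       # normalize to a distribution
    positions = torch.arange(h.numel(), dtype=prob.dtype)
    return (prob * positions).sum()                      # expected position

# A sharp peak at index 5 yields a coordinate near 5.
h = torch.full((11,), -10.0)
h[5] = 10.0
coord = boundary_from_heatmap(h)
```

Because the result is an expectation rather than an argmax, the coordinate is differentiable and can resolve positions between integer heat-map bins.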

2) ONLINE TEMPLATE UPDATE
Online templates play an important role in capturing temporal information and handling object deformation and appearance changes, and poor-quality templates lead to poor tracking performance. Therefore, this paper introduces a template elimination mechanism, shown in Fig. 6, to select reliable online templates according to a predicted confidence score. The TEM consists of two attention blocks and a three-layer perceptron. First, a learnable score token is used as the query to attend to the search RoI tokens, enabling the score token to encode the mined target information. Next, the score token attends to all positions of the initial template in the CAC module to implicitly compare the mined target with the initial target. Finally, the score is produced by the MLP layers and a sigmoid activation function. When the predicted score of an online template is below 0.5, that template is eliminated. TEM training is performed after backbone training, using the standard cross-entropy loss:

L_score = −[y_i log p_i + (1 − y_i) log(1 − p_i)],

where y_i is the ground-truth label and p_i is the predicted confidence score. During tracking, several templates, including one static template and N dynamic online templates, are fed into the CAC module together with the cropped search area to generate the target bounding box and confidence score. The online templates are updated only when the update interval is reached, selecting the sample with the highest confidence score.
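The score head and elimination rule can be sketched as follows. The token dimension of 384 and the hidden sizes are illustrative assumptions; only the three-layer perceptron, the sigmoid score, the 0.5 threshold, and the cross-entropy training target come from the description above.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Three-layer perceptron mapping the score token to a confidence in (0, 1)."""
    def __init__(self, dim=384):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True),
            nn.Linear(dim, dim), nn.ReLU(inplace=True),
            nn.Linear(dim, 1),
        )

    def forward(self, score_token):
        return torch.sigmoid(self.mlp(score_token)).squeeze(-1)

head = ScoreHead()
score = head(torch.randn(4, 384))        # one confidence per candidate template
keep = score >= 0.5                      # templates scoring below 0.5 are eliminated

# Training target: standard (binary) cross-entropy against the real labels y_i.
labels = torch.ones(4)                   # hypothetical labels for illustration
loss = nn.functional.binary_cross_entropy(score, labels)
```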

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we verify the proposed algorithm in detail, including qualitative analysis, quantitative analysis, an ablation study, and heat-map analysis.

A. EXPERIMENTAL SETUP
The algorithm is implemented in Python using the deep-learning framework PyTorch. Running platform: Intel i9-10900K CPU (10 cores, 20 threads, 3.70 GHz) and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. We first train on two large datasets, LaSOT [48] and GOT-10k (http://got-10k.aitestunion.com/index), to obtain the initial weights, and then train on COCO and GOT-10k. To improve generality, several data-augmentation methods are used during training, including random color jitter, conversion to grayscale with 25% probability, random translation of the image within a range of 64 pixels, and random scale changes in the range [0.82, 1.18]. The network is trained end-to-end with the following loss function:

L = λ_1 L_sim + λ_2 L_reg + λ_3 L_bdet,

where λ_1, λ_2, and λ_3 balance the three losses and are set to 1, 1, and 0.3, respectively, according to the scales of the three loss functions; L_sim and L_reg denote the similarity loss and the regression loss, and L_bdet is the boundary-detection loss [49]. The optimizer is Stochastic Gradient Descent (SGD) with weight decay 0.0005 and momentum 0.9. Training runs for 20 epochs with a warm-up strategy: in the first 5 epochs the learning rate increases linearly from 0.0005 to 0.001, and in the remaining 15 epochs it decays exponentially from 0.001 to 0.0001.
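The loss weighting and the warm-up schedule above can be sketched as plain functions. This is an illustration of the stated schedule, not the authors' code; the exact interpolation within each phase is an assumption (linear per epoch during warm-up, geometric decay afterwards).

```python
def total_loss(l_sim, l_reg, l_bdet, lam=(1.0, 1.0, 0.3)):
    """Weighted sum L = lam1*L_sim + lam2*L_reg + lam3*L_bdet with lam = (1, 1, 0.3)."""
    return lam[0] * l_sim + lam[1] * l_reg + lam[2] * l_bdet

def learning_rate(epoch, total=20, warmup=5, lr_lo=5e-4, lr_hi=1e-3, lr_end=1e-4):
    """Warm-up: linear 5e-4 -> 1e-3 over the first 5 epochs,
    then exponential decay 1e-3 -> 1e-4 over the remaining 15."""
    if epoch < warmup:
        # Linear ramp hitting lr_hi at the last warm-up epoch.
        return lr_lo + (lr_hi - lr_lo) * epoch / (warmup - 1)
    # Geometric decay reaching lr_end at the final epoch.
    frac = (epoch - warmup) / (total - 1 - warmup)
    return lr_hi * (lr_end / lr_hi) ** frac

schedule = [learning_rate(e) for e in range(20)]  # starts at 5e-4, ends at 1e-4
```

In practice such a schedule would typically be attached to the SGD optimizer via a per-epoch learning-rate update (e.g. a lambda scheduler).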
In the experiments, the trained tracker is first tested and evaluated on the short-term benchmark OTB100, then long-term tracking is analyzed on the LaSOT test set, and finally performance is assessed on the VOT2020 dataset.

B. ABLATION STUDY
To verify the effect of the designed modules on different backbone networks, an ablation study was carried out on trackers based on CAC and ResNet-50. This paper validates the effectiveness of each module of SiamCAR and of the different parts of the proposed algorithm, and analyzes the influence of each module on tracking performance using the OTB100 dataset. The first and second rows of Table 1 show that using the CAC module for feature extraction improves success and precision by 2.7% and 4.9%, respectively, compared with using ResNet-50. The third and fourth rows show that, with deep cross-correlation networks already in place, adding the dual attention-aware network (DAN) for feature fusion improves success and precision by 2.0% and 5.0%, respectively. The fifth row shows that combining the CAC module's feature enhancement and deep cross-correlation with the TEM during tracking improves success and precision by 3.5% and 5.0% over using only deep cross-correlation and ResNet-50. Comparing the fourth and fifth rows shows that success and precision improve substantially after adding the DAN and TEM modules. Therefore, each module improves tracker performance across different backbone networks.
C. QUANTITATIVE EVALUATION
1) ON OTB100
OTB100 (http://cvlab.hanyang.ac.kr/etracker_benchmark/index.html) contains the results of 29 trackers evaluated on 100 sequences, with an average sequence length of 589 frames. These sequences are annotated with nine attributes that represent challenging aspects of visual tracking, including illumination variation, low resolution, occlusion, fast motion, and motion blur. For the experiments on the OTB100 benchmark, this paper evaluates the algorithm mainly through tracking precision and success.
As shown in Fig. 7, the precision plot measures the Euclidean distance error between the tracked result and the ground truth. The horizontal axis is a series of distance thresholds (in pixels), and the vertical axis is the percentage of frames whose distance error is below the threshold. The success curve is based on the estimated overlap rate (IoU); its vertical axis is the percentage of frames whose overlap rate exceeds the threshold. The area under the curve serves as the evaluation score: the larger the area, the higher the score. The exact values and rankings are also given in the figure. To avoid clutter, this paper only compares against the nine leading algorithms: SiamRPN++, SiamDWrpn, Ocean, DaSiamRPN, SiamRPN, SiamDWfc, SRDCF, DeepSRDCF, and SiamFC.
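A minimal sketch of how these two OTB-style metrics are typically computed from per-frame center errors and IoU values is shown below. The 20-pixel precision point and the 21-step threshold grid are conventions of this sketch, not values taken from the paper.

```python
import numpy as np

def precision_at(center_errors, threshold=20):
    """Fraction of frames whose center-location error (in pixels) is
    within the threshold; the 20-px point is the commonly reported value."""
    e = np.asarray(center_errors, dtype=float)
    return float((e <= threshold).mean())

def success_auc(ious):
    """Area under the success curve: for each overlap threshold in [0, 1],
    take the fraction of frames whose IoU exceeds it, then average."""
    ious = np.asarray(ious, dtype=float)
    thresholds = np.linspace(0.0, 1.0, 21)
    return float(np.mean([(ious > t).mean() for t in thresholds]))
```

Rankings in plots such as Fig. 7 are then obtained by sorting trackers by these two scalar scores.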
As can be seen from Fig. 7, the proposed tracker performs well on this benchmark. Compared with SiamRPN++, success and precision are improved by 3.4% and 3.9%, respectively, owing to the large amount of semantic information retained by the DAN and the deep cross-correlation network. Fig. 8 shows the success and precision plots under occlusion, illumination variation, background clutter, and motion blur, the conditions most likely to occur during tracking. It can be seen that TrackCAC performs well in these different scenarios.

2) ON LASOT
To test the performance of TrackCAC in long-term tracking, this paper further evaluates it on LaSOT, a very large public training and testing dataset consisting of 1400 video sequences across 70 categories. The LaSOT test set comprises 280 video sequences with different tracking challenges picked from LaSOT. The average sequence length is about 2500 frames, longer than in other test sets, which makes tracking more challenging. The LaSOT toolbox provides results for a range of trackers on the LaSOT benchmark, including mainstream trackers such as SiamFC, ATOM, and TLD. In addition, this paper compares TrackCAC with SiamCAR.
For clarity, only eight typical, well-performing trackers are shown in the legend. Fig. 9 shows the success, precision, and normalized precision plots of all trackers on the LaSOT test set, showing that the proposed TrackCAC achieves better performance. On the three evaluation criteria, TrackCAC is 14.3%, 17.3%, and 12.5% higher than SiamCAR, and our tracker outperforms all the plotted algorithms on this benchmark. Fig. 10 shows the qualitative evaluation of our method and ten other state-of-the-art trackers on challenging sequences. These sequences are from OTB100, and each has different attributes such as occlusion, motion blur, similar-object interference, low resolution, and illumination variation.

D. QUALITATIVE ASSESSMENT
This part qualitatively compares the proposed method with ten trackers, including SiamCAR, on OTB100, as shown in Fig. 10. Compared with the other nine trackers, TrackCAC can locate and track the target accurately because it makes full use of both local and global features. The sequences are chosen for the challenging factors that may appear in long-term tracking; from top to bottom they are the occlusion sequence woman, the illumination-variation sequence singer1, the background-clutter sequence bolt, and the motion-blur sequence jumping. In the sequence woman, the person is occluded by cars for several frames while moving irregularly. In the sequence singer1, the target is constantly deformed and blurred by illumination changes, but TrackCAC can still track it accurately. In the sequence bolt, although there are many similar targets undergoing drastic deformation and rapid motion, TrackCAC tracks the target accurately thanks to its global search. In the sequence jumping, owing to the high regression accuracy, TrackCAC stays tightly locked on the boy even when motion blur occurs. The CAC module plays an important role here: by integrating local and global information, it can recapture the target faster and more accurately after occlusion.

E. PERFORMANCE EVALUATION(VOT2020)
The VOT benchmark (https://votchallenge.net/index.html) contains 60 challenging tracking videos. The robustness score is inversely proportional to tracking quality, while the accuracy and expected average overlap scores are proportional to it. EAO (Expected Average Overlap) is the expected value of the no-reset average overlap of each tracker over short image sequences, and is the main VOT index for evaluating trackers. The comparison results are shown in Table 2. Compared with SiamMask, the proposed tracker increases the EAO, accuracy, and robustness by 0.8%, 11.4%, and 5.8%, respectively. Therefore, while meeting real-time requirements, the overall performance of the tracking algorithm is improved.

V. CONCLUSION
In this paper, an object tracking model based on a parallel structure and dual attention is proposed. The convolution and attention cooperative (CAC) processing module integrates local and global information simultaneously: the convolutional neural network mainly extracts local image information, while the attention mechanism mainly computes global image information. This object tracking algorithm combining local and global information is called TrackCAC. Then, through the construction of the ROI, we use the dual deep cross-correlation network and attention model to estimate the target bounding box accurately, learn to use spatial information from different angles, and enhance the representation ability of the features, yielding higher bounding-box estimation accuracy. The template quality is further improved through an online template update strategy. Finally, the evaluation results on three datasets show that, compared with using a convolutional neural network alone for object tracking, the tracking accuracy of this combined method is greatly improved; the modeling of boundary detection is richer and the overall performance is higher.
HAIBO GE is a third-class Professor, engaged in teaching, scientific research and management with the Xi'an University of Posts and Telecommunications, where he is currently the Director of the Innovation and Entrepreneurship Center for college students, School of Electronic Engineering. He has presided over one key industrial chain-industrial field project in Shaanxi Province, two natural science fund projects, one key laboratory project, and more than 30 horizontal projects of the Chinese Academy of Sciences, and published nearly 50 academic articles in domestic and foreign journals (it has won the second prize for scientific and technological progress at the ministerial level). His research interests include photoelectric sensor detection, the Internet of Things technology and information processing, and computer vision. He is also the Vice Chairman of the Shaanxi Optical Society and the Editorial Board of Optical Communication Technology (core). He is also the Director of the Chinese Electronic Education Society.
SHUXIAN WANG is currently pursuing the master's degree with the Xi'an University of Posts and Telecommunications. She has published one paper as the first author in the ICNLP International Conference and one in the Computer Engineering and Applications Journal (both papers are related to object tracking). Her main research interest includes computer vision-object tracking.
CHAOFENG HUANG is currently pursuing the master's degree with the Xi'an University of Posts and Telecommunications. His main research interests include object tracking and its applications in daily life.
YU AN received the B.Sc. degree from the Xi'an University of Posts and Telecommunications, where she is currently pursuing the master's degree in electronic science and engineering. Her main research interest includes object tracking.