Attention-Based Siamese Region Proposals Network for Visual Tracking

I. INTRODUCTION
Visual tracking is one of the fundamental problems in image processing and computer vision. With the growing demand for artificial intelligence, visual tracking has been widely used in intelligent transportation [1], pavement detection [2], video monitoring [3] and other applications. However, visual tracking still faces challenges in complex scenarios, such as occlusion, illumination variation, background clutter and deformation.
Most tracking algorithms can be divided into two categories: generative and discriminative approaches. Generative methods describe the appearance characteristics of the target and search the candidates for the one that minimizes the reconstruction error. Representative algorithms include sparse coding [4], [5], density estimation [6] and principal component analysis [7]. Generative methods focus only on the target and ignore the background information, so it is easy to lose the target if its appearance changes drastically. Discriminative methods distinguish the target from the background by training classifiers; this kind of method is also called tracking-by-detection. Representative algorithms include multiple instance learning [8], boosting [9] and structured SVM [10]. Discriminative approaches are more robust than generative approaches. Nevertheless, the discriminative capability of both kinds of approach is greatly restricted because they depend on low-level hand-crafted features.
The associate editor coordinating the review of this manuscript and approving it for publication was Yizhang Jiang.
With the continuous improvement of the computing power of modern intelligent equipment, deep learning has attracted extensive attention from researchers. To improve the accuracy and robustness of visual tracking, more and more researchers have started to use convolutional neural networks (CNNs) for visual object tracking. Several aspects, including network structures [11]-[14] and updating mechanisms [15], [16], have been studied. However, online network updating and sample generation take plenty of time, which severely limits tracking speed. One kind of CNN-based tracking algorithm, the siamese network, abandons the online updating process and is instead pretrained on large image datasets to obtain a strong feature representation ability. The siamese network mitigates the time-consumption problem and thus achieves real-time tracking. However, some problems remain. Firstly, the target template is not updated during long-term tracking, which can easily lead to drift when target deformation or severe occlusion happens. Secondly, the siamese network can only predict the target location and cannot estimate the current scale. Besides, it has difficulty distinguishing the foreground target from the semantic background, which is also likely to cause drift.
To solve the problem of multi-scale target representation, the siamese region proposals network (SiamRPN) combines the siamese network with a region proposals network. However, it does not improve the ability to distinguish the foreground from the semantic background. In this paper, we introduce a soft attention mechanism on the basis of SiamRPN to build an adaptive appearance model, and thus improve the ability to discriminate the foreground from the semantic background. The attention mechanism includes spatial attention and channel attention. On the one hand, an hourglass-shaped residual network is constructed to learn the planar weights and focus on the salient areas of the two-dimensional feature maps. The main idea of the spatial attention network is to enhance the foreground and suppress the semantic background by assigning different importance weights to different locations. On the other hand, a channel attention network is constructed to learn the dimensional weights and focus on different feature types. The main idea of the channel attention network is to eliminate redundant, noisy feature maps and activate the feature maps that are highly relevant to the target. As a result, the proposed method can efficiently distinguish the foreground from the semantic background and avoid drift.
The contributions can be summarized as threefold:
1) We introduce a soft attention mechanism on the basis of SiamRPN to build an adaptive appearance model, and thus improve the ability to distinguish the foreground from the semantic background. The spatial attention network enhances the foreground and suppresses the semantic background to highlight the differences between them. At the same time, the channel attention network discards redundant information to obtain an efficient feature representation.
2) Because of their structural differences, the spatial attention network and the channel attention network deal with different feature levels. Specifically, the spatial attention network deals with low-level features to learn appearance similarity, and the channel attention network deals with high-level features to learn semantic classification characteristics.
3) Our experimental results demonstrate the performance of the attention-based multi-scale object tracking algorithm. The proposed method improves the ability to distinguish the foreground from the background and prevents the tracking results from quickly deviating from the real target. The proposed method is compared with state-of-the-art tracking algorithms on the public benchmarks OTB [17] and VOT [32].
The rest of the paper is organized as follows. We first review related work in Section II, and then discuss the proposed method in detail in Section III. Section IV presents the experimental results on the public tracking benchmarks.

II. RELATED WORK

A. TRACKERS BASED ON CNN
Convolutional neural networks have been widely applied in object detection and recognition [18]-[23]. In recent years, more and more tracking algorithms have been based on CNNs. Initially, Wang and Yeung [24] applied a deep model to visual tracking and acquired the target features from a pre-trained Stacked Denoising Autoencoder (SDAE). They then proposed the SO-DLT algorithm, which obtains the features by using a CNN and achieves tracking by using long-term and short-term CNNs. It is a successful application of CNNs in visual tracking. To improve the accuracy of CNN-based tracking algorithms, researchers generally study feature representation, network modelling and update mechanisms. For example, Nam et al. [14] proposed the MDNet tracker, which is divided into shared layers and domain-specific layers. The tracking network is pre-trained by using the different domain-specific layers to avoid destabilizing the network. In [11]-[13], the network is pre-trained on ImageNet and other large-scale image datasets to obtain efficient features, as in the well-known VGG-Net [25]. Nam et al. [14] proposed a tree structure that includes several CNN models to prevent unreliable samples from degrading the whole network. Held et al. [26] proposed the GOTURN tracker, which performs offline training without online update. It is fast, but it cannot adapt to target deformation. Li et al. [27] update the network in a lazy way: it is not updated until the target appearance changes substantially. In addition, there are other network types for visual tracking, such as the Siamese Network [28] and the Recurrent Neural Network [29].

B. TRACKERS BASED ON SIAMESE NETWORK
The essence of siamese network-based tracking algorithms is similarity comparison. This kind of tracker balances accuracy and speed. Bertinetto et al. [28] proposed a fully convolutional siamese network to solve the similarity learning problem. The siamese network is pretrained offline on a large number of samples, and then predicts the tracking result according to the response map obtained through cross-correlation. Afterwards, CFNet [30] added a correlation filter to the template branch to simplify the original siamese network and make it more efficient. However, both of them need multi-scale sample generation, which makes them time-consuming. To solve this problem, Li et al. [31] proposed the siamese region proposals network (SiamRPN), which uses a region proposal subnetwork to generate multi-scale candidates. Thanks to the region proposal subnetwork, the traditional multi-scale test and online update can be discarded to achieve real-time tracking. However, it does not improve the ability to distinguish the foreground from the semantic background, so it cannot effectively mitigate drift. Zhu et al. [38] proposed the distractor-aware siamese network, which introduces an effective sampling strategy to make the model focus on semantic distractors and thus achieves accurate object tracking. Li et al. [39] proposed a simple and effective space-aware sampling strategy and successfully trained a tracker with significant performance improvements. ATOM [40] divides visual tracking into two parts: target classification and target evaluation. The former is used for coarse positioning and the latter for fine positioning. This two-stage tracking method can improve the accuracy of the tracker. Fan and Ling [41] proposed to concatenate a series of RPNs from high level to low level in a Siamese network framework to solve the positioning problem. Zhang and Peng [42] proposed to enhance the robustness and accuracy of tracking by using deeper and wider convolutional neural networks.
In this paper, we introduce the soft attention mechanism on the basis of SiamRPN to build an adaptive appearance model, which aims to enhance the foreground and suppress the semantic background.

III. ATTENTION-BASED SIAMESE REGION PROPOSALS NETWORK FOR VISUAL TRACKING
In this section, we describe the proposed attention-based siamese region proposals network in detail. An overview of the proposed method is visualized in Fig. 1. The overall framework consists of the attention network and the multi-scale region proposals network. The former learns the planar and dimensional weights by constructing the spatial attention network and the channel attention network, respectively. The latter constructs the anchor-based region proposals network to achieve multi-scale object tracking.
In the following, we first describe the overall procedure of our method. Afterwards, we elaborate on the spatial attention network and the channel attention network. Lastly, we describe in detail the attention-based multi-scale tracking algorithm built on the original siamese region proposals network.

A. OVERVIEW OF OUR APPROACH
The proposed method consists of the attention network and the multi-scale region proposals subnetwork. The overall procedure of the proposed method is visualized in Fig. 2. The attention network includes the spatial attention network and the channel attention network. The former learns the planar weights by constructing an hourglass-shaped residual network to capture the differences between the foreground and the background. The latter learns the dimensional weights to eliminate redundant, noisy feature maps and activate the feature maps that are highly relevant to the target. The region proposals network consists of the classification module and the regression module. It transforms the feature maps obtained from the attention network, and the transformed target template is then cross-correlated with the transformed search region to obtain the response maps of classification probability and location regression.
The overall algorithm can be summarized as follows. Firstly, the siamese network is used to extract the features of the initial target and the search region. Next, the attention network is applied to enhance the foreground and suppress the semantic background, so as to eliminate redundant, noisy feature maps and simplify the appearance representation. Afterwards, the anchor-based region proposals network is used to achieve multi-scale target tracking. Finally, the search region is redetermined according to the predicted target location. The target template is fixed during long-term tracking, and the above steps are repeated until the end of the test sequence. The proposed method improves the ability to distinguish the foreground from the background and prevents the tracking results from quickly deviating from the real target, so as to effectively alleviate drift.
VOLUME 8, 2020

B. ATTENTION NETWORK
The essence of the attention mechanism is to analyse the information obtained from vision and focus on the salient regions or objects, and then make use of the secondary information to assist in scene understanding, content recognition and other tasks. The proposed method introduces the attention mechanism to focus on the difference between the foreground and the semantic background, so as to improve the ability to discriminate among different objects. Attention mechanisms based on deep convolutional networks can be divided into hard attention and soft attention. The non-differentiable nature of hard attention makes it unsuitable for back-propagation in deep networks. Therefore, we introduce the soft attention mechanism on the basis of the original siamese region proposals network to obtain attention weights quickly. The attention network includes the spatial attention network and the channel attention network.

1) SPATIAL ATTENTION NETWORK
The spatial attention network adopts an hourglass-shaped residual network to highlight the foreground and suppress the semantic background. It first reduces the size of the feature maps by convolution and down-sampling to highlight the high-level semantic characteristics corresponding to the global receptive field. Afterwards, it expands the size of the feature maps by convolution and up-sampling to amplify the activated salient foreground, so as to suppress the background and highlight the differences between the foreground and the background. The structure of the spatial attention network is shown in Fig. 3.
As Fig. 3 shows, the high-level characteristics are extracted from the input feature maps through a series of convolution and down-sampling operations, and the size of the feature maps is then restored through deconvolution and up-sampling operations. At this point, the pixel values of the restored maps are the weights of the corresponding positions in the original feature maps. The Sigmoid activation function is used to limit these weights to between 0 and 1. As a result, the weighting does not change the feature maps abruptly, and it also suppresses interfering background information. The weighted feature maps are obtained by element-wise multiplication of the original feature maps and the corresponding weights, which reduces their pixel values. To prevent repeated weighting from destroying the data characteristics, the final spatial attention feature maps are obtained by adding the weighted feature maps to the original feature maps. Let $F_o(x)$ denote the original feature maps, $M(x)$ the weights output by the Sigmoid, $F_w(x)$ the weighted feature maps, $F_s(x)$ the final spatial attention feature maps, * pixel-level multiplication and + pixel-level addition. The calculation process can be expressed as:
$$F_w(x) = M(x) * F_o(x), \qquad F_s(x) = F_o(x) + F_w(x)$$
In the extreme case where the weighted feature maps $F_w(x) = 0$, the spatial attention feature maps are identical to the original feature maps, which reflects the identity mapping idea of the residual network. The spatial attention mechanism enhances the foreground and suppresses noisy background characteristics, so as to effectively improve the ability to distinguish the foreground from the semantic background.

2) CHANNEL ATTENTION NETWORK
The channel attention network learns the dimensional weights to activate the feature types that are highly relevant to the target and to suppress insignificant feature channels, or even eliminate noisy feature maps, so as to obtain an efficient appearance representation. The high-level convolutional feature maps essentially carry semantic characteristics, which are helpful for classification. The semantic characteristics are robust to deformation, but they adapt poorly to appearance changes. By processing the high-level characteristics, the channel attention network can significantly improve the ability to distinguish specific objects. The structure of the channel attention network is shown in Fig. 4.
As Fig. 4 shows, the channel attention network learns the dimensional weights through pooling and fully connected operations. It also uses the Sigmoid function to limit the weights to between 0 and 1. The channel feature selection is then completed by element-wise multiplication of each channel and its corresponding weight.
The design principle of the channel attention network is that different channels contribute differently to the representation of the target, that is to say, different objects activate different channels of the feature maps. The role of the channel attention network is to increase the weights of channels that are highly relevant to the target and to suppress the weights of channels that are weakly relevant or noisy. The weights obtained from the channel attention network according to the initial target state remain fixed during tracking. Therefore, the whole network can not only enhance the difference between the foreground and the background to improve the discrimination ability, but also significantly reduce the time consumption.
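The channel weighting can be sketched as pooling, two fully connected layers and a Sigmoid, followed by per-channel rescaling. The sketch below is a simplification under stated assumptions: it uses one global max-pooled value per channel rather than a pooled grid, and the weight matrices `w1`, `w2` and biases are hand-picked hypothetical values rather than learned parameters.

```python
import math


def sigmoid(v):
    """Logistic function; maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def dense(vec, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def channel_attention(feature_maps, w1, b1, w2, b2):
    """Return per-channel weights and the rescaled channels.

    feature_maps -- list of channels, each channel a list of rows
    """
    # One descriptor per channel (global max pooling, an assumed simplification).
    pooled = [max(max(row) for row in channel) for channel in feature_maps]
    hidden = [max(0.0, h) for h in dense(pooled, w1, b1)]  # ReLU
    weights = [sigmoid(o) for o in dense(hidden, w2, b2)]  # each in (0, 1)
    scaled = [[[w * v for v in row] for row in channel]
              for channel, w in zip(feature_maps, weights)]
    return weights, scaled


channels = [
    [[1.0, 2.0], [3.0, 4.0]],   # strongly activated channel
    [[0.5, 0.1], [0.2, 0.3]],   # weakly activated channel
]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[2.0, 0.0], [0.0, -2.0]], [0.0, 0.0]
weights, scaled = channel_attention(channels, w1, b1, w2, b2)
```

With these toy parameters the first channel keeps almost all of its magnitude and the second is scaled down to roughly a quarter. Because the weights depend only on the input features, they can be computed once from the initial target and reused, which matches the fixed-weight behaviour described above.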

C. SIAMESE REGION PROPOSALS NETWORK
The siamese region proposals network consists of the siamese network and the region proposals network. The former performs feature extraction, and the latter generates multi-scale candidates for object tracking. The siamese region proposals network treats the tracking task as single-sample (one-shot) detection. It encodes the target appearance information into the correlated feature maps, that is, it extracts the candidates on the correlated feature maps of the target template and the search region to achieve multi-scale visual tracking. The region proposals network consists of the classification module and the regression module. Let $z$ denote the target template, $x$ the search region, $\varphi(z)$ the feature maps of the target template, $\varphi(x)$ the feature maps of the search region, $[\varphi(z)]_c$ and $[\varphi(x)]_c$ their feature maps in the classification module, $[\varphi(z)]_r$ and $[\varphi(x)]_r$ their feature maps in the regression module, and * the convolution operator. Then the correlated feature maps of the classification module and the regression module can be expressed as:
$$H_c = [\varphi(x)]_c * [\varphi(z)]_c, \qquad H_r = [\varphi(x)]_r * [\varphi(z)]_r$$
where $H_c$ represents the positive and negative activations of the anchor bounding boxes and $H_r$ represents the distances between the anchor bounding boxes and the ground-truth bounding boxes. The siamese region proposals network learns the network weights by generating positive and negative anchors. An anchor is labeled positive when its overlap rate with the ground truth exceeds the upper threshold, and negative when its overlap rate with the ground truth is below the lower threshold. The loss function of the siamese region proposals network is composed of the classification loss and the regression loss. The classification loss is the cross entropy, and the regression loss is the smooth L1 loss.
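The correlation step can be sketched for a single channel as sliding the template features over the search features and summing the element-wise products. The sizes below follow the 6 × 6 template and 22 × 22 search feature maps reported in the implementation details; the data are toy values, and `cross_correlate` is a hypothetical helper rather than the authors' code.

```python
def cross_correlate(search, template):
    """Valid 2-D cross-correlation of a template over a search map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            row.append(sum(search[i + u][j + v] * template[u][v]
                           for u in range(th) for v in range(tw)))
        response.append(row)
    return response


# Toy data: an all-ones 6x6 template and a 22x22 search map of zeros.
search = [[0.0] * 22 for _ in range(22)]
template = [[1.0] * 6 for _ in range(6)]
# Plant a copy of the template with its top-left corner at row 5, column 9.
for u in range(6):
    for v in range(6):
        search[5 + u][9 + v] = 1.0

response = cross_correlate(search, template)
# Strongest response as (value, row, column).
peak = max((val, i, j)
           for i, row in enumerate(response)
           for j, val in enumerate(row))
```

The response map has 22 - 6 + 1 = 17 rows and columns, and its maximum sits at the offset where the planted patch lines up with the template. In the full network the same operation runs over many channels, with the template features acting as the kernel.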
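Let $H_x$, $H_y$, $H_w$ and $H_h$ denote the centre coordinates and scale of an anchor bounding box, and $G_x$, $G_y$, $G_w$ and $G_h$ the centre coordinates and scale of the ground-truth bounding box. Following SiamRPN [31], the normalized distance can be expressed as:
$$\delta_x = \frac{G_x - H_x}{H_w}, \quad \delta_y = \frac{G_y - H_y}{H_h}, \quad \delta_w = \ln\frac{G_w}{H_w}, \quad \delta_h = \ln\frac{G_h}{H_h}$$
The smooth L1 loss function can be expressed as:
$$\mathrm{smooth}_{L1}(x, \sigma) = \begin{cases} 0.5\,\sigma^2 x^2, & |x| < 1/\sigma^2 \\ |x| - 1/(2\sigma^2), & \text{otherwise} \end{cases}$$
The classification loss $L_{cls}$ is the cross entropy, and the regression loss can be expressed as:
$$L_{reg} = \sum_{t \in \{x, y, w, h\}} \mathrm{smooth}_{L1}(\delta_t, \sigma)$$
The overall loss function is the weighted sum of the classification loss and the regression loss:
$$L = L_{cls} + \mu L_{reg}$$
where $\mu$ is the hyper-parameter that balances the two terms. During tracking, the correlated feature maps of the classification module and the regression module can be expressed as point sets:
$$H_c = \{(x_i^{c}, y_j^{c}, s_p^{c})\}, \qquad H_r = \{(x_i^{r}, y_j^{r}, dx_p^{r}, dy_p^{r}, dw_p^{r}, dh_p^{r})\}$$
where $i \in [0, w)$, $j \in [0, h)$, $p \in [0, 2k)$ in the correlated feature maps of the classification module, and $i \in [0, w)$, $j \in [0, h)$, $p \in [0, k)$ in the correlated feature maps of the regression module. Assume that the siamese region proposals network needs to generate $K$ candidates. The odd channels of the classification feature maps represent the positive activation, so the $K$ points with the highest scores are retained, and the resulting set can be expressed as:
$$C^{cls} = \{(x_i^{c}, y_j^{c}, s_p^{c})\}_{i \in I,\, j \in J,\, p \in P}$$
where $I$, $J$ and $P$ are the corresponding index sets, and $i$, $j$ and $p$ index the location and scale of the anchor bounding boxes. The selected anchors can then be expressed as:
$$C^{anc} = \{(x_i^{anc}, y_j^{anc}, w_p^{anc}, h_p^{anc})\}_{i \in I,\, j \in J,\, p \in P}$$
Similarly, the corresponding outputs of the regression module can be expressed as:
$$C^{reg} = \{(x_i^{r}, y_j^{r}, dx_p^{r}, dy_p^{r}, dw_p^{r}, dh_p^{r})\}_{i \in I,\, j \in J,\, p \in P}$$
The $K$ candidates are calculated from the anchor bounding boxes and the regression outputs by inverting the normalized distance:
$$x_i^{pro} = x_i^{anc} + dx_p^{r}\, w_p^{anc}, \quad y_j^{pro} = y_j^{anc} + dy_p^{r}\, h_p^{anc}, \quad w_p^{pro} = w_p^{anc}\, e^{dw_p^{r}}, \quad h_p^{pro} = h_p^{anc}\, e^{dh_p^{r}}$$
To obtain a more accurate predicted position and scale, the bounding box regression strategy is used to adjust the candidates.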
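The anchor arithmetic in this subsection can be sketched as follows. The sketch assumes the usual RPN parameterization with boxes stored as (centre x, centre y, width, height), applies the smooth L1 loss to the difference between predicted and target offsets, and selects proposals by score alone; all function names and the sample boxes are hypothetical.

```python
import math


def encode(anchor, gt):
    """Normalized distance from an anchor to a ground-truth box (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))


def decode(anchor, delta):
    """Invert encode(): turn regression offsets back into a box."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = delta
    return (ax + dx * aw, ay + dy * ah, aw * math.exp(dw), ah * math.exp(dh))


def smooth_l1(x, sigma=1.0):
    """Quadratic near zero and linear elsewhere; sigma sets the switch point."""
    if abs(x) < 1.0 / sigma ** 2:
        return 0.5 * (sigma * x) ** 2
    return abs(x) - 0.5 / sigma ** 2


def regression_loss(delta_pred, delta_target, sigma=1.0):
    """Sum of smooth L1 terms over the four box offsets."""
    return sum(smooth_l1(p - t, sigma) for p, t in zip(delta_pred, delta_target))


def top_k_proposals(scores, anchors, deltas, k):
    """Decode the k anchors with the highest positive-class score."""
    order = sorted(range(len(scores)), key=lambda n: scores[n], reverse=True)[:k]
    return [decode(anchors[n], deltas[n]) for n in order]


anchor = (100.0, 100.0, 64.0, 32.0)
ground_truth = (108.0, 96.0, 80.0, 40.0)
target = encode(anchor, ground_truth)
recovered = decode(anchor, target)

anchors = [(10.0, 10.0, 8.0, 8.0), (20.0, 20.0, 16.0, 8.0), (30.0, 30.0, 8.0, 16.0)]
zero = (0.0, 0.0, 0.0, 0.0)
proposals = top_k_proposals([0.1, 0.9, 0.4], anchors, [zero, zero, zero], k=2)
```

Decoding the encoded target reproduces the ground-truth box, which is the property that lets the regression branch be trained on offsets and read out as boxes. With all offsets set to zero, the two proposals are simply the two highest-scoring anchors.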

D. ATTENTION-BASED MULTI-SCALE VISUAL TRACKING
The multi-scale visual tracking based on the attention network mainly consists of attention feature selection and multi-scale candidate bounding box generation. The former constructs the spatial attention network and the channel attention network to deal with different planar regions and different feature types, and the latter constructs the region proposals network to generate multi-scale samples. The siamese network is an AlexNet pretrained on the ImageNet dataset. The overall network is trained offline on the ILSVRC and YouTube-BB datasets. During training, the weights of the earlier layers in the siamese network are fixed, and only the weights of the last two layers are updated. To balance speed and adaptability, the network is fine-tuned online by using the initial target state only, and the network weights are fixed in the subsequent frames. The role of the spatial attention network is to improve the adaptability of the high-level semantic characteristics to target deformation. The spatial attention network enhances the foreground and suppresses the semantic background by constructing a residual-style network. The role of the channel attention network is to eliminate redundant channels and retain the significant feature types. To prevent the pooling operation from filtering out useful information, the channel attention network is used to optimize the high-level semantic characteristics, so as to improve the ability to distinguish the target foreground from the semantic background.

IV. EXPERIMENT
Our method is implemented in Python based on the PyTorch framework and runs on a Titan X GPU with 6GB memory.
The proposed method is compared to the state-of-the-art tracking algorithms on the standard datasets, which are the online tracking benchmark (OTB) [17] and the visual object tracking benchmark (VOT) [32]. The experimental result shows the effectiveness and stability of the attention-based siamese region proposals network.

A. IMPLEMENTATION DETAILS
The input sizes of the template patch and the search patch are 127 × 127 × 3 and 255 × 255 × 3, respectively. After passing through the siamese network, the template branch and the search branch produce feature maps with dimensions of 6 × 6 × 256 and 22 × 22 × 256, respectively. The spatial attention network and the channel attention network are then applied to achieve attention feature selection. The proposed spatial attention module applies two max-pooling operations with kernel sizes of 3 × 3 and 2 × 2 and then performs two up-sampling operations with output sizes of 3 × 3 and 6 × 6, followed by a ReLU activation and a Sigmoid activation. The proposed channel attention module applies a max-pooling operation with a kernel size of 3 × 3 and performs a fully connected operation with an input dimension of 3 × 3 × 256 and an output dimension of 256. It then applies a ReLU activation and performs a second fully connected operation with an input dimension of 256 and an output dimension of 256, followed by a Sigmoid activation. The feature maps are then fed into the RPN to obtain a classification feature map with a dimension of 17 × 17 × 2k and a regression feature map with a dimension of 17 × 17 × 4k, where k is the number of anchor ratios. The anchor ratios we adopt are [0.33, 0.5, 1, 2, 3]. In the offline training phase, stochastic gradient descent (SGD) with a momentum of 0.9 is used to train the model. The initial learning rate is set to 1e-3 and the weight decay is set to 5e-5. The model is trained for 100 epochs with a maximum iteration number of 10000.
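The sizes quoted above can be checked with a few lines of arithmetic. The sketch assumes that the response map comes from a valid cross-correlation of the 6 × 6 template features over the 22 × 22 search features, that an anchor ratio means height divided by width, and that all anchors share one base area; the base area of 64 × 64 pixels is an assumed value that the text does not specify.

```python
import math


def response_size(search_size, template_size):
    """Side length of a valid cross-correlation response map."""
    return search_size - template_size + 1


def anchor_shapes(ratios, base_area):
    """One (w, h) pair per ratio, with h / w = ratio and w * h = base_area."""
    shapes = []
    for r in ratios:
        w = math.sqrt(base_area / r)
        shapes.append((w, w * r))
    return shapes


ratios = [0.33, 0.5, 1, 2, 3]
k = len(ratios)                      # anchors per spatial location
side = response_size(22, 6)          # spatial size of the RPN outputs
cls_channels = 2 * k                 # foreground and background score per anchor
reg_channels = 4 * k                 # four box offsets per anchor
num_anchors = side * side * k        # anchors evaluated per frame
shapes = anchor_shapes(ratios, base_area=64 * 64)  # assumed base area
```

Under these assumptions the outputs are 17 × 17 × 10 for classification and 17 × 17 × 20 for regression, covering 1445 anchors per frame.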

B. DATASET
We evaluate the proposed method on the public benchmark datasets OTB and VOT. Both benchmarks contain plenty of sequences with ground-truth labels and cover various challenging scenes, such as background clutter, motion blur, illumination variation, scale variation, occlusion and deformation. Fig. 5 shows the ground-truth bounding boxes in some of the video sequences of the OTB dataset.

C. EVALUATION METHODOLOGY

1) EVALUATION METHODOLOGY OF OTB
The evaluation methodology is mainly based on the precision plot and the success plot. The precision plot describes the centre location error, which is the Euclidean distance between the centre of the tracking result and the centre of the ground-truth bounding box. A frame is counted as correctly tracked when:
$$\lVert E_p - E_g \rVert < T_p$$
where $E_p$ represents the centre position of the predicted target, $E_g$ represents the centre position of the ground truth, $T_p$ represents the threshold, and $\lVert \cdot \rVert$ represents the Euclidean distance. The precision is defined as the percentage of frames whose centre location error is less than the corresponding threshold, so it varies with the threshold. The precision at the threshold $T_p = 20$ pixels is reported as the final precision.
The success plot describes the overlap ratio. We consider a frame successful if the overlap ratio is larger than the corresponding threshold, which can be expressed as:
$$\frac{|S_p \cap S_g|}{|S_p \cup S_g|} > T_s$$
where $S_p$ represents the predicted bounding box, $S_g$ represents the ground truth, $T_s$ represents the threshold, $\cap$ represents intersection and $\cup$ represents union. Generally, we rank the results by the Area Under Curve (AUC) of the success plot. To evaluate the proposed method accurately, the comparative experiment employs one-pass evaluation (OPE).
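The two OTB measures can be sketched as follows, assuming boxes are given as (x, y, w, h) with (x, y) the top-left corner. Approximating the AUC by averaging the success rate over 21 evenly spaced thresholds is an assumption about the evaluation protocol, and the three toy frames are hypothetical.

```python
import math


def centre_error(pred, gt):
    """Euclidean distance between the centres of two (x, y, w, h) boxes."""
    px, py = pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0
    gx, gy = gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0
    return math.hypot(px - gx, py - gy)


def overlap(pred, gt):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = min(pred[0] + pred[2], gt[0] + gt[2]) - max(pred[0], gt[0])
    ih = min(pred[1] + pred[3], gt[1] + gt[3]) - max(pred[1], gt[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def precision(preds, gts, threshold=20.0):
    """Fraction of frames whose centre error is below the threshold."""
    errors = [centre_error(p, g) for p, g in zip(preds, gts)]
    return sum(e < threshold for e in errors) / len(errors)


def success_rate(preds, gts, threshold):
    """Fraction of frames whose overlap exceeds the threshold."""
    ious = [overlap(p, g) for p, g in zip(preds, gts)]
    return sum(o > threshold for o in ious) / len(ious)


def success_auc(preds, gts, steps=21):
    """Mean success rate over evenly spaced thresholds in [0, 1]."""
    thresholds = [n / (steps - 1) for n in range(steps)]
    return sum(success_rate(preds, gts, t) for t in thresholds) / steps


gts = [(0.0, 0.0, 10.0, 10.0)] * 3
preds = [(0.0, 0.0, 10.0, 10.0),      # perfect match
         (5.0, 0.0, 10.0, 10.0),      # shifted by half a box width
         (100.0, 100.0, 10.0, 10.0)]  # complete miss
p20 = precision(preds, gts)
auc = success_auc(preds, gts)
```

In this toy sequence two of the three frames fall within 20 pixels of the ground-truth centre, so the precision is 2/3, while the AUC is lower because the shifted box only overlaps the ground truth by one third.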

2) EVALUATION METHODOLOGY OF VOT
The evaluation methodology is mainly based on accuracy and robustness. The accuracy measures how well the predicted bounding box overlaps the ground truth. It can be expressed as:
$$\phi_t = \frac{|A_t^{g} \cap A_t|}{|A_t^{g} \cup A_t|}$$
where $A_t^{g}$ represents the ground truth of the t-th frame, and $A_t$ represents the bounding box predicted by the tracker at the t-th frame. Assume that the tracker runs multiple times on a sequence. Define $\phi_t(i, k)$ as the accuracy of the i-th tracker on the t-th frame in the k-th repetition. With $N_{rep}$ repetitions, the accuracy at the t-th frame is defined as:
$$\phi_t(i) = \frac{1}{N_{rep}} \sum_{k=1}^{N_{rep}} \phi_t(i, k)$$
The average accuracy of the i-th tracker is defined as:
$$\rho_A(i) = \frac{1}{N_{valid}} \sum_{t=1}^{N_{valid}} \phi_t(i)$$
where $N_{valid}$ is the number of valid frames. The robustness is used to evaluate the stability of the tracker. The larger the value is, the worse the stability is. If the intersection of the predicted bounding box and its ground truth in a frame is 0, the frame is considered a tracking failure. $F(i, k)$ is defined as the number of failures of the i-th tracker in the k-th repetition. The average robustness of the i-th tracker is defined as:
$$\rho_R(i) = \frac{1}{N_{rep}} \sum_{k=1}^{N_{rep}} F(i, k)$$
Based on the two metrics, the Expected Average Overlap (EAO) is used to evaluate the overall performance of trackers.
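The averaging over repetitions can be sketched in a few lines. The sketch assumes that per-frame overlaps have already been computed for each repetition and that every frame passed in is a valid frame; the function names and the numbers are hypothetical.

```python
def per_frame_accuracy(overlaps_by_run):
    """Average each frame's overlap over all repetitions.

    overlaps_by_run -- list of repetitions, each a list of per-frame overlaps
    """
    n_rep = len(overlaps_by_run)
    n_frames = len(overlaps_by_run[0])
    return [sum(run[t] for run in overlaps_by_run) / n_rep
            for t in range(n_frames)]


def average_accuracy(overlaps_by_run):
    """Mean of the per-frame accuracies over the valid frames."""
    per_frame = per_frame_accuracy(overlaps_by_run)
    return sum(per_frame) / len(per_frame)


def average_robustness(failures_by_run):
    """Mean number of tracking failures per repetition."""
    return sum(failures_by_run) / len(failures_by_run)


# Two repetitions over a three-frame sequence.
overlaps = [[0.8, 0.6, 0.4],
            [0.6, 0.6, 0.2]]
acc = average_accuracy(overlaps)
rob = average_robustness([1, 3])
```

Here the per-frame accuracies are about 0.7, 0.6 and 0.3, giving an average accuracy of about 0.53, and the tracker fails twice per repetition on average. A higher accuracy is better, whereas a higher robustness value indicates a less stable tracker.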

D. RESULTS ON OTB
The comparative experiment compares the proposed method not only with the benchmark SiamRPN algorithm, but also with state-of-the-art trackers including TLD [33], OAB [34], MIL [8], CXT [35], Struck [10], SCM [36] and ASLA [37]. The precision plots and success plots show the effectiveness and stability of the proposed method. Fig. 6 illustrates the precision and success plots based on centre location error and bounding box overlap ratio, respectively. As Fig. 6 shows, the proposed method outperforms the other trackers in the comparison, with a precision ratio of 66.7% and a success ratio of 45.4%. To objectively evaluate the performance of the proposed method, it is also compared with the benchmark SiamRPN to illustrate the improvement brought by the attention mechanism. Compared with the state-of-the-art trackers, the proposed method makes use of the spatial attention network and the channel attention network to extract the significant characteristics, so as to obtain an efficient appearance representation. The spatial attention network is used to suppress the background and highlight the differences between the foreground and the background. The channel attention network is used to activate the feature types that are highly relevant to the target and suppress insignificant feature channels, or even eliminate noisy feature maps.
In addition, the proposed method is evaluated under challenging attributes. The comparison curves are shown in Fig. 7 and Fig. 8. As Fig. 7 and Fig. 8 show, the proposed method has good accuracy and stability in complex tracking scenes, such as deformation (DEF), background clutter (BC) and occlusion (OCC). The reason is that the attention network enhances the foreground while suppressing the semantic background to highlight the differences between them, so as to improve the ability to distinguish appearance characteristics.
To show the comparative experimental results more intuitively, the average precision scores and success scores are listed in Table 1 and Table 2. The proposed method achieves an overall precision score of 0.667 and an overall success score of 0.454, whereas the scores of the benchmark SiamRPN are 0.515 and 0.353, respectively.

E. RESULTS ON VOT
The proposed tracker is compared on VOT with eight state-of-the-art trackers: OAB [34], MIL [8], CT [43], Struck [10], SiamRPN [31], MEEM [44], STC [45] and DSST [46]. Experimental results show that the proposed tracker achieves excellent performance in terms of accuracy and robustness. Fig. 9 shows the EAO curve evaluated on the VOT dataset. To show the experimental results more intuitively, the accuracy, robustness and EAO scores are listed in Table 3. It can be seen that the proposed tracker performs well in terms of accuracy and robustness and shows a competitive EAO compared to other trackers. In the baseline experiment, the proposed tracker achieves the best EAO score of 0.272. Besides, the proposed method achieves a robustness score of 0.245 and an accuracy score of 0.596, whereas the scores of the benchmark SiamRPN are 0.317 and 0.546, respectively. This demonstrates the effectiveness of the proposed attention-based Siamese region proposals network, which helps to distinguish the target foreground from the interfering background.

F. QUALITATIVE RESULTS
To show the qualitative behaviour of the proposed method, Fig. 10 presents detailed tracking results on some of the test sequences, namely CarDark, Couple, Faceocc1, Ironman and Singer2. It can be seen from the qualitative results that some trackers show a large deviation between the predicted target and the real target in challenging tracking scenes. For example, the CarDark sequence has the scene attribute of background clutter, which makes it easy to misjudge the semantic background as foreground and leads to drift. The Couple sequence has the scene attribute of occlusion, and the tracking target deforms frequently, which increases the tracking difficulty. In the Faceocc1 sequence, the foreground is blocked by the semantic background for a long time, which increases the prediction error and makes it difficult to accurately estimate the location and scale. In the Ironman sequence, the tracking target moves fast and irregularly, which can result in drift. The Singer2 sequence has the scene attributes of background clutter and scale variation, which make it hard to predict the target state accurately and easily cause failure.

TABLE 1. The average precision scores among trackers on different attributes. The best and the second-best results are in red and green colours, respectively.

TABLE 2. The average success scores among trackers on different attributes. The best and the second-best results are in red and green colours, respectively.

TABLE 3. Experimental results on the VOT dataset. The best and the second-best results are in red and green colours, respectively.
The proposed method introduces the attention mechanism to focus on the difference between the foreground and the semantic background. The spatial attention network and channel attention network are used to obtain the salient characteristic representation of different target regions. As Fig. 10 shows, the proposed method can predict the location and scale information more accurately than other tracking methods and significantly reduce the prediction error, so as to improve the accuracy and robustness and achieve long-term object tracking.

V. CONCLUSION
In this paper, we propose a tracking method based on the attention mechanism, which focuses on the differences between the foreground and the background. The method enhances the foreground and suppresses the semantic background to improve the ability to distinguish between them. The spatial attention network and the channel attention network are constructed to realize salient feature selection. The former learns the planar weights by constructing the hourglass-shaped residual network, and the latter learns the dimensional weights to focus on different feature types. Because of the structural differences between these two attention networks, the spatial attention network deals with the low-level feature maps to focus on appearance similarity, and the channel attention network deals with the high-level feature maps to focus on semantic classification characteristics. The attention mechanism simplifies the feature representation and improves the ability to distinguish the foreground from the semantic background. The experimental results show that the proposed method outperforms the benchmark SiamRPN tracker and several state-of-the-art methods on the public tracking benchmarks.
The proposed tracking algorithm essentially belongs to the template matching methods. The update strategy of the target template directly affects the performance of visual tracking. In the future, we plan to design a rational and efficient template updating mechanism to improve the tracking performance.