Efficient Visual Tracking With Stacked Channel-Spatial Attention Learning

Template-based learning, particularly with Siamese networks, has recently become popular because it balances accuracy and speed. However, preserving tracker robustness in challenging scenarios at real-time speed remains a primary concern for visual object tracking. Siamese trackers struggle to handle continual target appearance changes because they learn limited discrimination between target and background information. This paper presents stacked channel-spatial attention within Siamese networks to improve tracker robustness without sacrificing tracking speed. The proposed channel attention strengthens target-specific channels by increasing their weights while reducing the importance of irrelevant channels with lower weights. Spatial attention focuses on the most informative region of the target feature map. We integrate the proposed channel and spatial attention modules to enhance tracking performance with end-to-end learning. The proposed tracking framework learns what to highlight and where, so that important target information is emphasized for efficient tracking. Experimental results on the widely used OTB100, OTB50, VOT2016, VOT2017/18, TC-128, and UAV123 benchmarks verify that the proposed tracker achieves outstanding performance compared with state-of-the-art trackers.


I. INTRODUCTION
Visual object tracking is a fundamental and challenging task for a wide range of computer vision applications, including intelligent surveillance [1], autonomous vehicles [2], game analysis [3], and human-computer interface [4]. An object bounding box is usually provided in the first frame of a video, and the tracking algorithm predicts new object locations in succeeding frames. Although many frameworks have been proposed, it remains an arduous task to develop a generic object tracker to handle various tracking challenges such as scale variation, illumination variation, fast motion, motion blur, occlusion, deformation, and background clutter.
Generative and discriminative strategies are commonly employed to solve the visual tracking problem. Generative strategies construct an analogous appearance representation for the target to find candidate positions in successive frames using neighborhood location searches around the existing target [8]. Discriminative strategies consider classification or regression frameworks to discriminate foreground from background for solving the tracking problem [9].
The associate editor coordinating the review of this manuscript and approving it for publication was Paolo Remagnino.
However, predicting target locations using discriminative methods classically requires large datasets for training or online updating to ensure acceptable classifier performance. This situation has changed somewhat with the introduction of the minimum output sum of squared error (MOSSE) [10] filter, which allows adaptive training schemes to perform robust and efficient object tracking. The MOSSE filter uses a Fourier transform to minimize the sum of the squared error between actual and desired output. Several previous studies have proposed approaches based around the MOSSE filter, e.g., CSK [11] uses kernel methods to improve the underlying MOSSE filter, and the CN tracker [12] employs color attributes to improve input data representation. However, handling challenges using hand-crafted features, such as histogram of oriented gradients (HOG) and color histograms, with discriminative correlation filters (DCFs) significantly reduces performance due to circular boundary effects, hence they are unsuitable for active tracking.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Recently, due to their powerful feature representation ability, convolutional neural networks (CNNs) have gained considerable research attention in many computer vision fields, such as semantic segmentation [13], object detection [14], and activity recognition [15]. Deep features are exploited to address these challenges within DCF based trackers, including HDT [16], deepSRDCF [17], and ECO [18], or within deep tracking frameworks such as MDNet [19], CNNSI [20], and FCNT [21]. However, using a pre-trained model as the tracker backbone for feature extraction is unsuitable due to inconsistencies between tracking and other visual tasks. Although CNNs provide better tracking, their data-hungry characteristics require considerable effort to collect sufficient data to train the end-to-end network. Ultimately, real-time tracking with good accuracy also needs to be considered when designing the tracker.
Several trackers have been proposed to overcome these difficulties. For example, [5], [22], [23] consider tracking as a matching problem and learn similarity measures end-to-end. The main advantage of similarity learning is that it employs offline end-to-end training to balance speed and accuracy. A Siamese network can match the most analogous patch based on the target template, hence Siamese networks have shown great success in tracking. Recently, SiamFC [5] has gained enormous popularity for tracking due to its balanced performance using a simple Siamese architecture. Among other Siamese based trackers, DSiamM [7] learns background suppression and appearance variations from earlier frames using a fast transformation learning model, whereas DCFNet [6] integrates a discriminative correlation filter (DCF) within a lightweight architecture and drives back-propagation to adjust the DCF layer using the probability heat map of the target location. However, these approaches lack robustness and are weak at handling challenging scenarios, particularly when object appearance changes, as shown in Fig. 1.
To handle the aforementioned tracking challenges, we propose an extended underlying Siamese architecture incorporating stacked channel-spatial attention (SCSAtt) in the template branch with an end-to-end trainable architecture. The SCSAtt channel attention module enhances target adaptability by utilizing different weights for channels depending on their contribution. After computing channel attention, we employ a spatial attention module to emphasize the most informative region on the feature map and hence identify the target location. The overall attention mechanism helps to improve feature representation power and discriminative ability, ensuring high tracking performance. We employ offline training to learn the similarity map, providing computational efficiency during tracking. The overall attention mechanism is flexible and easy to integrate with the Siamese architecture. We validated the effectiveness of our proposed framework using several challenging benchmarks [24]-[30] and compared performance results with other state-of-the-art trackers.
FIGURE 1. Comparison of our proposed tracker with Siamese based trackers, including SiamFC [5], DCFNet [6], and DSiamM [7], for MotorRolling (left column), Soccer (middle column), and skating2-2 (right column).
The main contributions from this study are as follows.
• We present stacked channel-spatial attention within a Siamese framework to learn effective feature representation and discrimination ability for high tracking performance.
• Rather than a single attention module, we combine multiple attention modules with residual skip connection in a specific order to enhance feature fusion training and target adaptability.
• We evaluated optimal attention module placement within fully convolutional single or multiple layers to enhance end-to-end training benefits for efficient tracking.
• We conducted extensive experiments using OTB100, OTB50, VOT2016, VOT2017/18, TC-128, and UAV123 benchmarks to validate the proposed approach, achieving 61 frames per second (fps) real-time processing speed and high accuracy compared with state-of-the-art tracking methods. To facilitate further studies, models and results are available at https://github.com/maklachur/SCSAtt.

II. RELATED WORK
Many visual object tracking frameworks have been proposed over the last decade. A comprehensive survey of all trackers is beyond the scope of this work; interested readers can find detailed overviews of tracking frameworks in the survey studies [31]-[33]. This section provides short outlines for deep feature based trackers [18], [21], [34]-[36], Siamese based trackers [5], [7], [37]-[39], and attention based trackers [22], [40]-[45].

A. DEEP FEATURE BASED TRACKERS
The superior ability of deep neural networks boosts tracker performance by extracting significant features from the images. These deep features are then utilized by correlation filter tracking frameworks to improve performance, including DeepSRDCF [35], CF2 [36], and HDT [16]. Features from continuous convolution filters are also used to build trackers, such as ECO [18] and C-COT [34]. FCNT [21] selects features using regression, obtaining good accuracy, but cannot run in real time due to its high dimensional convolutional feature representation. DeepTrack [46] considers tracking as a classification problem and learns feature weights by online training using an iterative stochastic gradient descent (SGD) approach. Although these trackers have outstanding feature representation power, they are difficult to train offline on large benchmarks. Thus, these online approaches diminish the network richness, which affects overall tracking performance, and particularly tracker speed.

B. SIAMESE BASED TRACKERS
Siamese architecture formulates a similarity learning problem where two parallel convolutional layer streams share parameters and calculate similarity loss between two input images to train the network through back-propagation. This network was first developed for signature verification [47]. Siamese based trackers [5], [7], [37]- [39] solve tracking as a similarity learning problem between target and search images and have become popular recently within the tracking community due to their balanced performance in terms of accuracy and speed.
For example, GOTURN [37] formulates tracking as relative motion estimation, which it solves as a regression problem. SiamFC [5] casts tracking as a template matching problem where the network learns similarity from embedded features. Although SiamFC is one of the most popular and pioneering approaches for visual object tracking due to its steady speed and accuracy, it struggles with various challenges, including appearance changes, background clutter, and deformations. Therefore, many subsequent studies have improved SiamFC to enhance tracking performance. CFNet [39] integrates the correlation filter at the end of the template branch in a closed-form equation. SiamMCF [38] and DSiam [7] incorporate cross-correlation on multiple layers to solve the similarity problem.
We modify the underlying Siamese architecture with different input image sizes and more embedded feature channels, providing an appropriate complement for incorporating attention mechanisms.

C. ATTENTION BASED TRACKERS
Attention mechanisms within neural networks have become an important approach for computer vision applications, such as image classification [48], [49], object detection [50], and segmentation [51]. Attention, or focusing on important image features, is an effective mechanism to help solve object tracking problems, and has attracted strong research interest within the tracking community, with several attention based trackers proposed [22], [40]-[45]. SA-Siam [22] integrates channel attention in the semantic branch to compute channel-wise weights around the object location. RASNet [41] combines three attention modules to enhance tracker discriminative competence and adaptability.
FICFNet [42] computes channel attention on both Siamese pipeline branches to weight feature channels. IMG-Siam [43] fuses the target foreground using channel attention and a super-pixel based matting algorithm to provide enhanced target appearance with structural information. FlowTrack [44] uses temporal attention to capture target temporal information. MemTrack [40] and MemDTC [45] use a long short term memory (LSTM) attention based controller to govern the feature map read and write operations using memory.
In contrast, we propose an end-to-end trainable attention mechanism where channel attention emphasizes 'what' informative part of the target image to focus on and spatial attention is responsible for 'where' the informative part is located. Therefore, combining these two attention modules learns 'what' and 'where' to emphasize or suppress in the target information by efficiently refining intermediate features as they flow through the network.

III. PROPOSED METHOD
This section describes the proposed tracker methodology. The proposed tracking framework incorporates the stacked channel-spatial attention mechanism in the Siamese architecture target branch to improve tracker discrimination ability that helps to locate the target object in the search region efficiently. We also alter the underlying fully convolutional SiamFC [5] with different input sized images and internal architecture suitable for integrating the proposed attention mechanism to enhance target feature representation power. Fig. 2 shows the proposed tracker pipeline.

A. SIAMESE NETWORK FOR FEATURE LEARNING
The basic SiamFC framework includes two fully convolutional symmetric branches that learn features through weight sharing. SiamFC performs cross-correlation at the end of the feature extraction network between target and search image features to compute the similarity score map, where the maximum similarity score is taken as the predicted object location on the search image. This architecture can be expressed mathematically as
$$f(z, x) = \phi(z) * \phi(x) + b \cdot \mathbb{1},$$
where $z$ and $x$ are the target and search images, $\phi(\cdot)$ represents the fully convolutional network, $b \cdot \mathbb{1}$ is a bias term taking the value $b \in \mathbb{R}$ at every location, and $*$ represents the cross-correlation that computes the response map between the target and search image feature maps. During Siamese object tracking, the target branch remains stationary: it takes frozen weights from the offline trained model and processes only the first frame of the video sequence, named the template (target image). The target object's location is estimated for subsequent frames by matching with the template at the highest similarity score on the response map. The generalization ability of the target branch helps to improve tracker quality because it is static. Since object location in Siamese based tracking is predicted from the similarity score, we concentrated on generating the most robust and discriminative features for similarity learning to build an efficient tracker.
FIGURE 2. The overall architecture of the proposed tracker. The shaded region represents the stacked channel-spatial attention block, where channel and spatial attention modules are integrated after the feature extractor for the target branch. The output of channel attention is forwarded as input to the spatial attention module. Finally, attention features are fused with a skip connection for efficient discriminative features. A response map is constructed using cross-correlation between the target and search image feature maps. The red square in the response map marks the highest similarity score, which represents the target location in the search image.
However, basic Siamese tracker frameworks are unable to handle challenging tracking cases due to their reduced discrimination ability. To improve tracker discrimination ability, we used asymmetric fully convolutional branches by integrating stacked channel-spatial attention in the target branch. In particular, we altered the underlying Siamese tracking architecture as follows to ensure high tracking performance,
$$f(z, x) = \mathcal{A}(\phi(z)) * \phi(x) + b \cdot \mathbb{1},$$
where $\mathcal{A}(\cdot)$ denotes the stacked channel-spatial attention mechanism for the target feature map $\phi(z)$, which learns to effectively highlight appearance and refine the location feature for the object.
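The matching step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy feature shapes, and the explicit loops are assumptions chosen for readability, and a real tracker would run this as a batched convolution on the GPU.

```python
import numpy as np

def cross_correlate(template_feat, search_feat, bias=0.0):
    """Slide template features over search features (no padding, stride 1).

    template_feat: (C, h, w) array standing in for the target-branch features.
    search_feat:   (C, H, W) array standing in for the search-branch features.
    Returns an (H - h + 1, W - w + 1) response map with a scalar bias added.
    """
    c, h, w = template_feat.shape
    c2, H, W = search_feat.shape
    assert c == c2, "both branches must emit the same number of channels"
    response = np.empty((H - h + 1, W - w + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            window = search_feat[:, i:i + h, j:j + w]
            response[i, j] = np.sum(window * template_feat) + bias
    return response

def peak_location(response):
    """Row and column of the highest similarity score."""
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(row), int(col)

# Toy example (hypothetical data): plant the template in an empty search map.
rng = np.random.default_rng(0)
template = rng.standard_normal((4, 3, 3))
search = np.zeros((4, 10, 10))
search[:, 5:8, 2:5] = template
response = cross_correlate(template, search)
peak = peak_location(response)
```

In this toy case the peak falls where the template was planted, because any window that only partly overlaps the planted patch has a smaller inner product with the template than the patch itself.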

B. STACKED CHANNEL-SPATIAL ATTENTION
We were inspired by human visual perception, which does not concentrate on the whole scene but rather focuses on a specific object, perceiving its informative parts to understand the appropriate visual pattern [52]. Similarly, attention mechanisms prioritize important features to understand salient object parts [41]. Since single object tracking resembles focusing on the most salient feature, it is beneficial to concentrate on crucial regions of the target image. Unlike other attention-based trackers, we integrated the attention mechanism only in the target branch to reduce the overall parameter overhead. This enables us to preserve fast tracking speed and keep the overall tracking process simple. Our attention mechanism can easily be integrated into any convolutional layer of the network. Moreover, during tracking, we require only a pre-trained model and the first frame of the video to track the sequence. On the other hand, existing attention-based trackers rely on additional machinery: MemTrack [40] and MemDTC [45] maintain a memory of the tracked object and update it accordingly; IMG-Siam [43] uses super-pixel based matting to extract the target foreground; FlowTrack [44] utilizes historical frames for model update; and FICFNet [42] integrates an attention module into both target and search branches.
Therefore, to construct an efficient object tracker, we propose a stacked channel-spatial attention mechanism inside the Siamese framework, named SCSAtt, to enhance feature representation power and improve tracker discrimination ability. This attention approach sequentially combines two popular attention modules, channel and spatial attention. The channel attention module measures the weight contribution of the channels, whereas spatial attention focuses on salient object regions in the feature maps.
The proposed attention mechanism first employs channel attention on the output feature map $F_M$ computed from the last convolution layer, yielding the channel attention feature map $C_A$. The output $C_A$ is forwarded to the spatial attention module, yielding the spatial attention feature map $S_A$. To keep our network efficient, we fuse $S_A$ with $F_M$ using a residual skip connection.
We can summarize the process steps as
$$C_A = \varphi_c(F_M), \qquad S_A = \varphi_s(C_A),$$
and
$$\mathcal{A}(\phi(z)) = S_A + F_M,$$
where $\mathcal{A}(\phi(z))$ is the final stacked channel-spatial attention; $\varphi_c(\cdot)$ and $\varphi_s(\cdot)$ represent channel and spatial attention, respectively; and $F_M = \phi(z)$ is the fully convolutional feature map of $z$.

1) CHANNEL ATTENTION
Each feature channel represents a particular visual pattern. During training, the channels of the convolutional feature map do not represent an object equally, with some channels representing an object's visual pattern better than others. Therefore, most previous attention models, e.g., [22], [41], [42], and [53], use either global average or max pooling with a multilayer perceptron (MLP) to calculate the channel gains. In contrast, rather than a single pooling operation, we consider global average and max pooling together to construct a channel attention module that learns fused features. The global max-pooling operation focuses on distinctive and finer object features, whereas global average pooling provides overall knowledge of the feature map for channel attention.
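After computing both pooling operations, we pass each pooled vector through its own MLP, which uses a rectified linear unit (ReLU) layer to learn the non-linearity between two fully-connected layers with 128 and 512 nodes, respectively. Hence, we obtain two feature vectors $F_{max}^{1 \times 1 \times C}$ and $F_{avg}^{1 \times 1 \times C}$ for max and average pooling, respectively. Before applying sigmoid activation for normalization, we fuse both feature vectors using element-wise summation. Finally, we compute the product with a skip connection to propagate the effect onto the original feature map, providing the final channel attention feature map $C_A^{H \times W \times C}$, as shown in Fig. 3. The channel attention component can be expressed as
$$F_{max}^{1 \times 1 \times C} = \mathrm{MLP}(\mathrm{MaxPool}(F_M)), \qquad F_{avg}^{1 \times 1 \times C} = \mathrm{MLP}(\mathrm{AvgPool}(F_M)),$$
and
$$C_A^{H \times W \times C} = \varphi_c(F_M) = \sigma\left(F_{max}^{1 \times 1 \times C} + F_{avg}^{1 \times 1 \times C}\right) \otimes F_M,$$
where $\otimes$ denotes element-wise multiplication with the channel weights broadcast over the spatial dimensions, and $\sigma$ represents the usual sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$.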
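As a rough illustration of the channel attention computation described above, the following NumPy sketch applies max and average pooling, a separate two-layer MLP with ReLU per pooled vector, summation, sigmoid, and a channel-wise product. The weight matrices are random placeholders, biases are omitted, and the toy sizes (8 channels, hidden width 2) are assumptions that only mirror the 128-to-512 ratio mentioned in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(v, w1, w2):
    """Two fully connected layers with a ReLU in between (biases omitted)."""
    return w2 @ np.maximum(w1 @ v, 0.0)

def channel_attention(feat, mlp_max, mlp_avg):
    """feat: (C, H, W) feature map; mlp_max / mlp_avg: (w1, w2) weight pairs.

    Returns the re-weighted feature map and the per-channel weights.
    """
    f_max = mlp(feat.max(axis=(1, 2)), *mlp_max)    # global max pooling -> (C,)
    f_avg = mlp(feat.mean(axis=(1, 2)), *mlp_avg)   # global average pooling -> (C,)
    weights = sigmoid(f_max + f_avg)                # fused, normalized to (0, 1)
    return weights[:, None, None] * feat, weights   # product with the input map

# Toy example with hypothetical sizes and random (untrained) weights.
rng = np.random.default_rng(0)
C, hidden = 8, 2
feat = rng.standard_normal((C, 6, 6))
mlp_max = (0.1 * rng.standard_normal((hidden, C)), 0.1 * rng.standard_normal((C, hidden)))
mlp_avg = (0.1 * rng.standard_normal((hidden, C)), 0.1 * rng.standard_normal((C, hidden)))
attended, weights = channel_attention(feat, mlp_max, mlp_avg)
```

Each channel is scaled by a single scalar, so the spatial layout inside a channel is unchanged; only the relative importance of channels is altered.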

2) SPATIAL ATTENTION
In contrast to channel attention, spatial attention highlights where the informative features of the object lie in an image [48] in order to spot the target location, which makes it a good complement to channel attention. Previously, Qin and Fan [43] constructed a spatial mask using super-pixels to exploit target representation. Li and Yang [53] utilized global max pooling to encode spatial attention in their model. We exploit the inter-spatial relationship among channel features to construct spatial attention. Pooling in the channel dimension highlights the informative area [54], which helps locate the desired target on the image by comparing overall weight gains. To formulate this attention, we compute global max pooling $S_{max}^{H \times W \times 1}$ and average pooling $S_{avg}^{H \times W \times 1}$ on the feature maps and fuse them in the channel domain. Since convolution is a local operation, and as we observed empirically, this approach focuses on target information.
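We apply a convolution layer $\psi_1^{3 \times 3}$ after concatenating the two pooled features, experimentally choosing a 3 × 3 convolutional filter for best results, and reduce the number of feature channels to 1 to obtain a single channel feature map. After passing this convolved feature map through the sigmoid operation and broadcasting it, we compute a product with the previously acquired channel attention feature map $C_A^{H \times W \times C}$ to obtain the final spatial attention feature map $S_A^{H \times W \times C}$, as shown in Fig. 4. This attention feature map is calculated as
$$S_{max}^{H \times W \times 1} = \mathrm{MaxPool}(C_A), \qquad S_{avg}^{H \times W \times 1} = \mathrm{AvgPool}(C_A),$$
and
$$S_A^{H \times W \times C} = \varphi_s(C_A) = \sigma\left(\psi_1^{3 \times 3}\left(\left[S_{max}^{H \times W \times 1}; S_{avg}^{H \times W \times 1}\right]\right)\right) \otimes C_A^{H \times W \times C},$$
where both pooling operations act along the channel dimension, $[\cdot\,;\cdot]$ denotes concatenation, and $\psi_1^{3 \times 3}$ is the convolution operation with a 3 × 3 kernel, stride = 1, and padding = 1.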
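The spatial attention step and the final residual fusion can be sketched in NumPy as follows. This is an illustrative sketch under assumptions: the 3 × 3 kernel is an untrained placeholder, the convolution is written as explicit loops, and `c_a` is simply any (C, H, W) array standing in for the channel attention output.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3x3_same(x, kernel, bias=0.0):
    """x: (2, H, W), kernel: (2, 3, 3). Stride 1, zero padding 1 -> (H, W)."""
    _, H, W = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + 3, j:j + 3] * kernel) + bias
    return out

def spatial_attention(c_a, kernel):
    """c_a: (C, H, W). Returns the masked feature map and the (H, W) mask."""
    s_max = c_a.max(axis=0)                 # max pooling along channels
    s_avg = c_a.mean(axis=0)                # average pooling along channels
    pooled = np.stack([s_max, s_avg])       # concatenate -> (2, H, W)
    mask = sigmoid(conv3x3_same(pooled, kernel))
    return mask[None, :, :] * c_a, mask     # broadcast the mask over channels

def stacked_attention(f_m, c_a, kernel):
    """Fuse the spatial attention output with the original map f_m (residual skip)."""
    s_a, _ = spatial_attention(c_a, kernel)
    return s_a + f_m

# Toy example with hypothetical shapes and a random (untrained) kernel.
rng = np.random.default_rng(0)
f_m = rng.standard_normal((8, 6, 6))
c_a = 0.5 * f_m                             # placeholder for channel attention output
kernel = 0.1 * rng.standard_normal((2, 3, 3))
s_a, mask = spatial_attention(c_a, kernel)
fused = stacked_attention(f_m, c_a, kernel)
```

Because the mask has a single channel, every feature channel at a given location is scaled by the same factor, which complements the per-channel scaling of channel attention.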

C. IMPLEMENTATION DETAILS
We adopted an AlexNet-like [55] backbone for the proposed tracker framework to extract the feature map, with 135 × 135 × 3 and 263 × 263 × 3 target and search image sizes, respectively. Table 1 shows network architectural details for deep feature extraction.
During data curation, we use the SiamFC strategy to crop the target and search images z and x, respectively. We consider the target object as the center of both images because it reflects the most challenging sub-windows that are influential to tracker performance. Since the tracker is fully convolutional, we need not worry about the model learning a central bias [5]. We trained the model using the GOT10k [56] and ImageNet Large Scale Visual Recognition Challenge-2015 (ILSVRC15) VID [57] benchmarks.

1) TRAINING
To train the model, we randomly selected training image pairs (z, x) from a sequence and adopted the logistic loss function,
$$L(g, f(z, x)) = \frac{1}{|M|} \sum_{m \in M} \log\left(1 + \exp\left(-g[m] \, f(z, x)[m]\right)\right),$$
where $M$ is the set of possible locations on the response map, $f(z, x)[m]$ is the similarity score, and $g[m] \in \{+1, -1\}$ is the ground truth corresponding to location $m$. To learn the Siamese network parameters $\theta$, we used SGD to minimize the following function over the $N$ training samples,
$$\arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L\left(g_i, f(z_i, x_i; \theta)\right).$$
We experimentally selected batch size = 32 and randomly chose 10 image pairs (z, x) from each video sequence of the training benchmarks. We limited the maximum distance between z and x to 100 frames when selecting the image pairs, to ensure robustness to appearance changes. We used SGD to optimize network weights with momentum = 0.9, decayed the learning rate exponentially from $10^{-2}$ to $10^{-5}$, and set weight decay = $5 \times 10^{-4}$.
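A minimal sketch of the logistic loss over a response map and of an exponential learning-rate decay between the stated end points is given below. The label layout (a single positive cell), the map size, and the number of epochs are assumptions for illustration only; the text does not specify them.

```python
import numpy as np

def logistic_loss(scores, labels):
    """Mean of log(1 + exp(-g * f)) over all response-map locations.

    np.logaddexp(0, t) evaluates log(1 + exp(t)) in a numerically stable way.
    """
    return float(np.mean(np.logaddexp(0.0, -labels * scores)))

def exponential_lr(epoch, num_epochs, start=1e-2, end=1e-5):
    """Learning rate decayed geometrically from `start` to `end` over training."""
    return start * (end / start) ** (epoch / (num_epochs - 1))

# Toy 5 x 5 response map: +1 at the centre, -1 elsewhere (assumed label layout).
labels = -np.ones((5, 5))
labels[2, 2] = 1.0

loss_uninformed = logistic_loss(np.zeros((5, 5)), labels)   # equals log(2)
loss_confident = logistic_loss(5.0 * labels, labels)        # scores agree with labels

# Hypothetical 50-epoch schedule; the epoch count is not given in the text.
schedule = [exponential_lr(e, 50) for e in range(50)]
```

Scores that agree in sign with the labels drive the loss toward zero, while an all-zero response map yields log(2) at every location.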

2) TESTING
Similarly to SiamFC, we performed tracking by treating the first video frame as a stationary template, with subsequent frames considered as changing search images. The response map was calculated independently for each frame by matching the fixed template against the search image. The tracker predicted the target position in subsequent frames from the maximum response map score. Finally, we used bicubic interpolation to estimate the target location more precisely. We also used a scale penalty of 0.9745 with image scales $1.0375^{\{-1, 0, +1\}}$ to address target scale changes.
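The scale search can be sketched as below. The scale step and penalty are the values stated above; how the penalty is applied is an assumption here, following the common SiamFC-style convention of multiplying the peak score of the rescaled candidates by the penalty so that a scale change is chosen only when it scores clearly higher.

```python
import numpy as np

SCALE_STEP = 1.0375      # value stated in the text
SCALE_PENALTY = 0.9745   # value stated in the text
EXPONENTS = np.array([-1.0, 0.0, 1.0])
SCALES = SCALE_STEP ** EXPONENTS   # three candidate scales

def select_scale(responses):
    """responses: one response map per candidate scale, in the order of SCALES.

    Assumed rule: penalize the peaks of the rescaled candidates, then take
    the candidate with the highest penalized peak.
    """
    peaks = np.array([float(np.max(r)) for r in responses])
    penalties = np.where(EXPONENTS == 0.0, 1.0, SCALE_PENALTY)
    best = int(np.argmax(peaks * penalties))
    return best, float(SCALES[best])

# Hypothetical response maps (constant maps, so each peak equals the fill value).
small_gain = [np.full((3, 3), 1.00), np.full((3, 3), 0.99), np.full((3, 3), 0.50)]
large_gain = [np.full((3, 3), 1.10), np.full((3, 3), 1.00), np.full((3, 3), 0.50)]
idx_small, scale_small = select_scale(small_gain)
idx_large, scale_large = select_scale(large_gain)
```

In the first toy case the smaller scale has a marginally higher raw peak, but the penalty keeps the current scale; in the second the margin is large enough to switch.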
We implemented the proposed tracker in Python with the PyTorch deep learning framework and performed all experiments on a desktop with an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz and an Nvidia GeForce RTX 2080 Super GPU. We achieved an average tracker speed of 61 fps during testing.

IV. EXPERIMENTS
Before comparing results on the full benchmarks, we use the response map to visualize the effect of each variant by fusing it as a heatmap onto the corresponding search image. Fig. 5 shows these visualization results for the channel attention module and the spatial attention module within the Siamese architecture, denoted CAtt and SAtt, respectively, and for SCSAtt. The proposed SCSAtt localizes the target region better than CAtt and SAtt by significantly reducing distractor and background information. Thus, SCSAtt can ensure higher tracking performance than the other variants of the proposed tracker.
We also found that the benefit of SCSAtt over existing attention based trackers is its fast tracking speed while maintaining high tracking accuracy. Table 2 compares the average tracking speed among attention based trackers; our proposed method achieved 61 fps, which is superior to the others. Hence, the proposed tracker would be more applicable to real-time tracking applications. Furthermore, we evaluated the proposed tracker experimentally on the OTB100, OTB50 [24], [25], VOT2016 [26], VOT2017/18 [27], [28], Temple-color-128 (TC-128) [29], and UAV123 [30] benchmarks. The experimental results were computed using the OTB and VOT toolkits.

A. EVALUATION ON OTB100 BENCHMARK
The popular OTB100 [24], [25] benchmark comprises 100 annotated video sequences covering 11 challenging attributes: illumination variation (IV), scale variation (SV), occlusion (OC), deformation (DF), motion blur (MB), fast motion (FM), in-plane rotation (IR), out-of-plane rotation (OR), out-of-view (OV), background clutter (BC), and low resolution (LR). We employed one pass evaluation (OPE) to compute success and precision plots. Success plots show the percentage of frames whose overlap score exceeds a threshold, whereas precision plots show the percentage of frames whose center error distance between the ground-truth and predicted bounding boxes is within a threshold. To keep our comparison fair, we included various tracker types: Siamese based trackers (SiamFC [5], SiamTri [23], SIAMRPN [58], and CFNet [39]), attentional Siamese trackers (MemTrack [40] and MemDTC [45]), correlation filter based trackers (STAPLE [59], CREST [60], SRDCF [61], and DSAR-CF [62]), and others (UDT [63], DSiamM [7], and MLT [64]). Fig. 6 compares the success and precision outcomes of the proposed tracker with those of the other considered trackers on the OTB100 dataset. The proposed tracker SCSAtt achieves the best performance for both measurement criteria while running faster than real time. The proposed model achieved 64.1% and 85.5% scores for the success and precision plots, respectively, which are 10.14% and 10.89% higher than those of the baseline SiamFC tracker. The proposed model also increased success and precision by 2.40% and 4.27% compared with the memory attention Siamese tracker MemTrack [40], and by 7.19% and 8.37% compared with the correlation filter based tracker SRDCF [61].
FIGURE 5. Comparison of the similarity score heatmap visualization results for the corresponding search images using CAtt, SAtt, and SCSAtt. The response maps between target and search images are fused onto the corresponding search images to produce these visualization results. The SCSAtt framework localizes the target region better than the others by significantly reducing distractor and background information. The target and search image sequences are taken from the OTB100 benchmark.
We also compared our proposed tracker with the most recent trackers, including DSAR-CF [62], MLT [64], and UDT [63]. The proposed tracker achieved 2.76%, 7.28%, and 12.5% improvement in precision score and 0.31%, 6.66%, and 9.20% improvement in success score compared to the DSAR-CF, MLT, and UDT trackers, respectively. Moreover, DSAR-CF and MLT run at 16 fps and 48 fps, respectively, whereas the proposed tracker runs at 61 fps. Therefore, SCSAtt maintains a balanced performance in terms of speed and accuracy, which is the main objective of the proposed tracker.
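The two OPE measures used in this evaluation can be sketched as follows. This is not the OTB toolkit: the (x, y, w, h) box format, the 20-pixel precision threshold, and the 21 evenly spaced overlap thresholds are assumptions that follow common OTB practice, and the toy boxes are made up.

```python
import numpy as np

def iou(box_a, box_b):
    """Overlap score (intersection over union) of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def center_error(box_a, box_b):
    """Euclidean distance between the centres of two (x, y, w, h) boxes."""
    dx = (box_a[0] + box_a[2] / 2.0) - (box_b[0] + box_b[2] / 2.0)
    dy = (box_a[1] + box_a[3] / 2.0) - (box_b[1] + box_b[3] / 2.0)
    return float(np.hypot(dx, dy))

def success_rate(pred, gt, threshold):
    """Fraction of frames whose overlap exceeds `threshold`."""
    return float(np.mean([iou(p, g) > threshold for p, g in zip(pred, gt)]))

def precision_rate(pred, gt, threshold=20.0):
    """Fraction of frames whose centre error is within `threshold` pixels."""
    return float(np.mean([center_error(p, g) <= threshold for p, g in zip(pred, gt)]))

def success_auc(pred, gt, thresholds=np.linspace(0.0, 1.0, 21)):
    """Area under the success plot, approximated by averaging over thresholds."""
    return float(np.mean([success_rate(pred, gt, t) for t in thresholds]))

# Toy three-frame sequence (hypothetical boxes).
gt = [(0, 0, 10, 10)] * 3
pred = [(0, 0, 10, 10), (5, 0, 10, 10), (40, 40, 10, 10)]
overlap_frame2 = iou(pred[1], gt[1])
precision_at_20 = precision_rate(pred, gt)
auc = success_auc(pred, gt)
```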
Furthermore, to demonstrate the effectiveness of our proposed tracker, we present the tracker performance for each of the 11 challenges individually in Table 3 and Table 4. The proposed tracker performed consistently well across the compared challenging attributes. Thus, the tracker provides consistent performance even in challenging circumstances. Fig. 10 provides a frame-wise qualitative comparison for visual understanding. The proposed tracker significantly improves performance compared with state-of-the-art trackers in several challenging sequences.
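The percentage gains quoted for OTB100 read most naturally as relative improvements over the compared tracker's score, not as differences in percentage points. The short sketch below works through that arithmetic under this assumption. The SiamFC baseline scores are not stated in the text; here they are back-computed from the reported gains and should be treated as implied values only.

```python
def relative_gain(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, in percent."""
    return 100.0 * (new_score - baseline_score) / baseline_score

def implied_baseline(new_score, gain_percent):
    """Baseline score implied by a reported relative gain (assumed definition)."""
    return new_score / (1.0 + gain_percent / 100.0)

# Reported SCSAtt scores on OTB100 and reported gains over SiamFC.
success, precision = 64.1, 85.5
gain_success, gain_precision = 10.14, 10.89

# Back-computed (not reported) baseline scores.
baseline_success = implied_baseline(success, gain_success)
baseline_precision = implied_baseline(precision, gain_precision)

# Round trip: recompute the gains from the rounded implied baselines.
check_success = relative_gain(success, round(baseline_success, 1))
check_precision = relative_gain(precision, round(baseline_precision, 1))
```

Under this reading, a 10.14% gain in success corresponds to a difference of roughly six percentage points, which is worth keeping in mind when comparing the figures across benchmarks.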

B. EVALUATION ON OTB50 BENCHMARK
The OTB50 benchmark is a subset of OTB100, comprising the 50 most challenging video sequences. Fig. 7 compares overall performance for the considered trackers on the OTB50 benchmark. We evaluated the same trackers on OTB50 that we compared on the OTB100 benchmark. The proposed SCSAtt tracker ranks first among these trackers on the OTB50 benchmark. It exhibits 16.67% and 19.65% increases over the baseline SiamFC in success and precision scores, respectively. SCSAtt also achieved 7.31%, 5.99%, 11.69%, and 7.31% gains in success score and 10.55%, 4.68%, 13.11%, and 6.84% gains in precision score over the MemTrack [40], CREST [60], SRDCF [61], and DSiamM [7] trackers, respectively.
Moreover, the proposed method improves precision and success by 5.34% and 2.56%, 4.28% and 3.03%, 10.99% and 8.66%, and 23.21% and 17.58% over the most recent trackers, DSAR-CF [62], MemDTC [45], MLT [64], and UDT [63], respectively. SCSAtt therefore consistently outperforms the compared trackers on both success and precision metrics, which demonstrates the robustness of our tracker.

D. EVALUATION ON UAV123 BENCHMARK
In contrast with typical visual object tracking datasets, including OTB [24], [25], VOT [26]-[28], and TC-128 [29], the Unmanned Aerial Vehicle (UAV) benchmark [30] provides low altitude aerial videos for object tracking. UAV123 is one of the largest object tracking benchmarks, comprising 123 video sequences with more than 110,000 frames, whereas OTB100, OTB50, and TC-128 together contain about 90,000 frames. UAV123 has become more popular recently due to its real-life applications, such as navigation, wildlife monitoring, and crowd surveillance. Trackers with a good balance between accuracy and real-time speed will be more useful for these objectives. The proposed tracker operates in real time with a 54.7% success score and a 77.6% precision score, which are 4.19% and 4.72% higher than those of ECO, one of the prominent trackers on this benchmark, as shown in Fig. 9. ECO itself does not run in real time. Its ECOhc variant (60 fps) does, but the proposed SCSAtt tracker achieved 8.10% and 7.03% improvements in success and precision, respectively, over ECOhc.

E. EVALUATION ON VOT2016 BENCHMARK
The VOT2016 benchmark [26] comprises 60 sequences. In this evaluation, the three most important measures, accuracy (A, higher is better), robustness (R, lower is better), and expected average overlap (EAO, higher is better), are computed to assess tracker performance. We compared the proposed tracker with the top performing trackers, including C-COT [34], Staple [59], DNT [74], MDNet_N (a variation of MDNet) [19], SRDCF [61], SiamFC [5], SO-DLT [75], ASMS [8], and MvCFT [76], on the VOT2016 benchmark. From Table 5, we observed that SCSAtt performs better than the other trackers in terms of accuracy and robustness. The proposed tracker SCSAtt ranked second for EAO; C-COT ranked first, but its accuracy and robustness are inferior to those of the proposed tracker. Compared with the underlying SiamFC [5], the proposed tracker achieves a 28.51% increase in EAO score over the baseline.

G. ABLATION STUDY
An appropriate channel and spatial attention configuration is important for the proposed SCSAtt tracker. To validate the selected configuration, we empirically evaluated the performance of several alternative designs. In particular, we measured the standalone performance of the proposed channel and spatial attention modules and then considered the order in which the modules are integrated. Finally, we investigated how best to incorporate the stacked channel-spatial attention mechanism in a single convolution layer or in multiple layers. To keep the comparison fair across variants, we trained all variants, including the proposed model, on the GOT10k and ILSVRC15 benchmarks and measured their performance on the OTB100 benchmark. Fig. 11 compares success and precision for these variants on the challenging OTB100 benchmark, where SAtt achieved 62.8% success and 83.0% precision, and CAtt achieved 63.1% success and 84.1% precision. For the spatial-first attention (SFAtt) case, we first computed spatial attention and then stacked channel attention on it. We empirically compared SFAtt with the channel-first ordering to settle on the stacked channel-spatial attention (SCSAtt) module.
We also validated the proposed tracker by adding the stacked channel-spatial attention mechanism to different convolutional layers. SCSAtt1-5 placed the mechanism in all convolution layers, on the reasoning that every layer contributes to learning the target features and that no layer's information should be lost. However, the performance of this configuration was significantly lower than that of the other designs. We also experimented with integrating stacked channel-spatial attention in the third and fifth convolution layers (SCSAtt35), which achieved competitive performance because the later layers capture the most discriminative features. Incorporating the stacked channel-spatial attention mechanism solely in the final layer achieved the best performance.
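The success and precision scores used to rank the ablation variants can be sketched as follows, assuming the usual OTB protocol: precision is the fraction of frames whose center location error is within 20 pixels, and success is the area under the success plot over overlap thresholds from 0 to 1. The function names and the inputs (per-frame overlaps and center errors) are illustrative, and the official toolkit may differ in details such as threshold sampling.

```python
def precision_score(center_errors, threshold=20.0):
    """Fraction of frames with center location error within `threshold` pixels."""
    return sum(e <= threshold for e in center_errors) / len(center_errors)


def success_score(overlaps, num_thresholds=21):
    """Area under the success plot.

    For each overlap threshold in [0, 1], the success rate is the fraction
    of frames whose overlap exceeds it; the score is the mean of these rates.
    """
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


# Toy per-frame measurements for a three-frame sequence.
overlaps = [0.8, 0.6, 0.0]
center_errors = [5.0, 15.0, 40.0]
precision = precision_score(center_errors)
success = success_score(overlaps)
```

Reporting both scores matters for the ablation because a variant can keep the target center close (high precision) while estimating its extent poorly (lower success).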

H. DISCUSSION
In this article, we used the Siamese tracking framework to exploit deep features and improve the robustness of the tracker. We proposed a channel attention module to recalibrate the deep feature channels for better target feature representation, whereas the spatial attention module is used to highlight the important spatial regions in each deep feature channel. We integrated the channel and spatial attention modules within the Siamese tracking framework using a residual skip connection (called SCSAtt), as shown in Fig. 2.
SCSAtt learns the most discriminative features by adapting the target features through the channel and spatial attention networks. Because each module of SCSAtt has a different function, the order in which they are arranged affects overall tracker performance. From the spatial feature point of view, the channel attention network is applied globally, while the spatial attention network works locally on the feature map. The combined attention tells the tracker where to focus and also enhances the representation of the regions of interest. Therefore, the proposed tracker improves its representation ability by using the attention mechanism to highlight important features and suppress unnecessary ones.
We performed an ablation study to show the impact of several design configurations of the tracker within the Siamese tracking framework. The visualization results for CAtt, SAtt, and SCSAtt in Fig. 5 show that SCSAtt localizes the target region more effectively than CAtt and SAtt by significantly reducing distractor and background information.
To demonstrate its effectiveness, we also compared the proposed SCSAtt tracker with many state-of-the-art trackers. The comparison showed that SCSAtt delivers improved performance while tracking in real time at 61 fps across benchmarks including OTB50, OTB2015, VOT2016, VOT2018, UAV123, and TC-128. Therefore, the proposed tracker maintains a balance between speed and accuracy.

V. CONCLUSION
This paper proposed a stacked channel-spatial attention mechanism inside the fully convolutional Siamese architecture to suppress irrelevant information and concentrate on object appearance with effective location feature refinement during tracking. The proposed channel attention focuses on important feature channels, whereas the spatial attention module is responsible for highlighting the object location. We used a cross-feature blending attention mechanism to enhance feature representation power and boost tracking performance. We performed extensive experiments to validate the effectiveness of the proposed SCSAtt method on several challenging benchmarks, including OTB100, OTB50, VOT2016, VOT2017/18, TC-128, and UAV123.