Siamese Visual Tracking with Residual Fusion Learning

Multi-stage feature fusion is an effective way for deep Siamese trackers to improve tracking performance. Unfortunately, conventional fusion approaches, such as weighted averaging, are too simple to combine features with diverse characteristics. In addition, the fusion module is generally optimized together with the Siamese network module, which may degrade the performance of the whole tracker. In this paper, we propose a novel feature fusion network for Siamese trackers, named SiamRFL, which exploits the expressive capacity of residual learning. Specifically, the network takes the deep-layer features as direct input to semantically distinguish the object from the background, and refines the object state with local detail patterns by feeding the shallow-layer features through a residual channel. The classification and the regression features are fused separately by deploying multiple fusion units. To avoid the degradation problem, we also present an ensemble training framework for our tracker, in which different loss functions are introduced to individually optimize the Siamese and the fusion modules. In extensive experiments on several recent datasets, including OTB100, VOT2019, UAV123, LaSOT and GOT10k, the proposed tracker achieves state-of-the-art performance, outperforming other approaches by a clear margin.


I. INTRODUCTION
Visual tracking is one of the most fundamental research directions in computer vision. It aims to infer the state of an arbitrary object throughout a sequence, given only its initial state in the first frame as reference. The technique is required by various vision applications, such as visual surveillance [1], robotics [2], human-computer interaction [3] and augmented reality [4]. Although great progress has been made, most trackers still struggle with several challenging factors, such as background clutter, occlusion and illumination variation.
With the development of Convolutional Neural Networks, more efficient tracking paradigms, such as the Siamese network, have been presented to address the above difficulties. A Siamese network matches the features of the template and search region patches to predict the object state, which offers a significant advantage in both speed and precision. Following the seminal works of SiamFC [5] and SINT [6], massive efforts have been made to further improve tracking performance. Some works [7], [8] improved the quality of feature representation by introducing effective backbone networks, such as ResNet [9] and GoogLeNet [10]. Others pursued more reliable decisions by designing powerful matching modules, e.g., Region Proposal Networks (RPN) [11], [12] and anchor-free networks [13], [14]. In addition, a number of training approaches [15], [16] and online update strategies [17], [18] were explored to achieve better tracking results.
Within one neural network, modules at different depths vary in abstraction level and receptive field, so they learn features with diverse attributes. The features from shallow layers contain abundant local detail patterns, which are valuable for perceiving the location variations of the object, while deep-layer features with high-level semantic information are important for discriminating the object from the background. For this reason, most previous Siamese trackers [7], [13], [19] fuse multi-layer features to benefit from their complementary attributes. However, the existing fusion approach, i.e., weighted average, still suffers from several drawbacks. Firstly, the method is so simple that it cannot aggregate features in an adaptive way, even though the aggregation weights are trainable. Since it ignores the attribute differences between multi-layer features and treats them equally, these features may disturb each other during fusion, which makes trackers fail to adapt to drastic appearance variations of the object. Moreover, conventional works usually train the whole network jointly, so that both the Siamese network module and the fusion module are optimized by only one loss function. This manner is insufficient to ensure the training quality of every module, and degrades the performance of Siamese trackers.
FIGURE 1. Classification response maps of SiamRPN++ [7] and our SiamRFL on some typical videos. The first two rows show the results of SiamRPN++, while the remaining rows show our responses. The first to fourth columns show the response maps output by the three RPN modules (conv-3, conv-4 and conv-5) and the fusion module, respectively.
In this paper, we propose a novel feature fusion framework for Siamese trackers based on the attributes of different-stage features, which is composed of multiple residual units. When tracking an object, an appropriate strategy is to first explore abstract semantic patterns to discriminate the tracked object from a global view, and then use spatial detail patterns to refine the state of the object. Inspired by this idea, the features from deep layers are adopted as the direct component of our residual unit to coarsely identify the object from the background, while the shallow-layer features are fed into the residual channel to eliminate the prediction deviations of the direct channel. The proposed fusion module contains two sub-networks that combine the classification and the regression features, respectively, and each sub-network is constructed by cascading the fusion units. With this fusion architecture, a Siamese tracker is able to predict the object state in a coarse-to-fine manner.
Furthermore, an ensemble training approach is presented for our tracker to avoid performance degradation in the testing phase. Concretely, several basic losses are adopted to optimize the Siamese network module, including the backbone and decision networks, and a fusion loss is used to train only the fusion module. By decoupling the optimization of the different blocks, each module of the presented tracker can be trained with high quality. Figure 1 shows several representative response maps, illustrating that all decision modules become more effective under our optimization scheme, and that the proposed fusion strategy produces more robust and reliable tracking responses. The major contributions of this work are as follows.
1. We propose a novel feature fusion scheme by exploiting residual learning, which has an ability to take full advantage of the attribute information of multi-layer features, and generate more reliable tracking results.
2. An ensemble training approach is designed to optimize our Siamese tracker. By using multiple loss functions to separately train different network modules, it effectively improves the training quality of the proposed tracker.
3. Extensive experiments on several challenging benchmark datasets show that the proposed tracker achieves very promising performance and is superior to a number of state-of-the-art trackers.
The rest of the paper is organized as follows. We first review the related works in Section II, and then describe the Siamese tracker with our residual fusion network as well as its training approach in Section III. The experiments and results on several recent datasets are analyzed in Section IV, in which our tracker is compared with a range of state-of-the-art methods. Finally, the paper is concluded in Section V.

II. RELATED WORKS
In this section, we briefly review recent research related to our work, including Siamese trackers, feature fusion approaches and loss functions. A more elaborate introduction to visual tracking can be found in the surveys [20], [21].

A. SIAMESE TRACKERS
Siamese networks are a popular tracking paradigm that has received intensive attention in the last few years. Inspired by the pioneering work of SiamFC [5], which presented a cross-correlation layer to compare the features of template and search patches, abundant strategies were exploited to raise the potential of these networks. Among these, a very representative direction is to study how to predict the object state. Concretely, SiamRPN [11] combined the Siamese network with a Region Proposal Network [22] to perform object-background classification and bounding box regression in parallel, realizing high-speed and impressive tracking. Following this example, several more complicated and successful structures were developed, such as SPM [23] and C-RPN [12]. Furthermore, anchor-free networks were explored to avoid the complex hyperparameters of RPN modules. SiamBAN [13] presented a box adaptive network without anchors, which can detect the bounding box of an object in a per-pixel manner. SiamFC++ [24] described a set of guidelines for object state prediction, while Ocean [14] designed an object-aware anchor-free network for tracking. Moreover, other decision blocks, such as a segmentation network [25] and a corner detection network [26], were also proved to be powerful. Another important evolution of Siamese networks is the introduction of deeper backbone networks for more abstract feature representation. SiamRPN++ [7] used spatial-aware sampling to overcome the negative influence of the padding operation, and employed ResNet-50 [9] as backbone. SiamDW [8] instead proposed a novel residual unit without padding. In addition to network design, both adversarial learning [16] and distractor-aware sampling [15] were used to improve training quality, while some online update methods [17], [18], [27] were adopted to help trackers achieve satisfactory performance.

B. FEATURE FUSION APPROACHES
Feature aggregation is a valuable way to improve the tracking performance of neural networks, and it has been widely applied in previous works. A popular solution is to feed multi-layer convolutional features into Discriminative Correlation Filters (DCF) [28]-[30], which combine these features to form a kernel that recognizes the object. In addition, FCNT [31] presented a switch mechanism to alternately select the features from different stages for tracking. Nevertheless, these methods are all hand-designed, so they can neither benefit from large-scale training datasets nor satisfy challenging tracking requirements. In Siamese networks, it is more meaningful to aggregate multi-stage features when using deeper backbone networks such as ResNet, since the abstraction levels and receptive fields vary a lot [7]. As a result, SiamRPN++ [7], SiamBAN [13] and some other Siamese trackers [19], [32] accumulate the tracking responses computed on different-layer features using weight ratios. However, this linear averaging strategy is so simple that trackers are incapable of taking full advantage of features with diverse attributes, even though the weights are trainable. This drawback limits the role of feature fusion to some extent.

C. TRAINING LOSS FUNCTIONS
The loss function plays a vital role in guiding the optimization of neural networks, and a variety of losses have been proposed to train Siamese networks. In several initial studies, e.g., SiamFC [5] and CFNet [17], a simple classification loss was used to generate the similarity confidence map. Then, Dong et al. [33] described a triplet loss to capture the relative relationship among exemplars, positive instances and negative instances. For some state-of-the-art Siamese networks, such as SiamRPN [11] and SiamRPN++ [7], both classification and regression losses are required to discriminate the object and predict its location state. As a result, they usually sum a classification loss and a regression loss as the final training loss. Although these losses are mature and effective, they are not suitable for optimizing our proposed tracker. The core reason is that all of them adopt only one loss to train all network modules, which cannot endow every network module with a different capability. PG-Net [32] put forward a multi-stage loss function, where multiple sub-losses were introduced to train the corresponding decision modules, while a fusion loss was adopted for the whole network. This loss is more specific, but it still cannot completely separate the training procedures of the Siamese network module and the fusion module. Therefore, we need to design a novel optimization scheme to train the proposed tracker more efficiently.

III. SIAMRFL TRACKER
In this section, we describe the proposed SiamRFL tracker in detail. After giving an overview of the overall architecture, we introduce the baseline SiamRPN++ tracker [7] and present our residual fusion network. Next, we analyze the main drawback of the conventional training procedure, and describe our ensemble training method for model optimization. Finally, the implementation details of offline training and online testing are explained.

A. OVERVIEW
The architecture of the proposed tracker is depicted in Figure 2. Concretely, the tracker first extracts the template and the search region features with a weight-shared backbone network, and then matches their features at different stages using three RPN blocks. Subsequently, the multi-layer output response features are aggregated by the residual fusion network, which consists of two subnetworks that fuse the classification and the regression features, respectively. The fused results are used to predict the final state of the object. In the offline training phase, the Siamese network module (i.e., the backbone and the Region Proposal Networks) and the fusion module are optimized with different loss functions, which is very effective in improving the performance of our tracker.

B. SIAMRPN++ TRACKER
Siamese networks generally infer the state of the object by comparing the candidate samples in the search region $x$ with the initial template $z$, which can be formulated as

$$f(z, x) = G\big(\phi(z), \phi(x)\big) + b \tag{1}$$

in which $\phi$ represents the weight-shared backbone network for feature extraction, while $G$ indicates the similarity matching module that is used to find the candidate sample most similar to the template. $b$ is a bias factor and $f$ denotes the matching results of all candidate samples. Among previous works, SiamRPN++ [7] is an important development in the field of Siamese visual tracking, which exploits a deeper backbone module and aggregates features from multiple stages to find the tracked object. Owing to its more powerful feature expression, this work produces very promising tracking results. To describe our feature fusion network and validate its effectiveness, we take this tracker as the baseline, whose main modules, i.e., the backbone network and the Region Proposal Network, are introduced as follows.

1) Backbone
SiamRPN++ has shown that Siamese networks can benefit from more abstract feature representation, and thus employs ResNet-50 [9] as the backbone. Besides, it adjusts the backbone with several extra modifications to make it more appropriate for tracking. Specifically, the sampling strides in the fourth and the fifth residual blocks, i.e., the conv-4 and conv-5 blocks, are reduced to 1 pixel to increase the spatial resolution of the feature maps, while dilated convolution is introduced into these blocks to maintain the receptive fields. To boost tracking ability using features with different attributes, this tracker takes the outputs of the last three residual blocks as features, to each of which an additional 1 × 1 convolutional layer is appended to align the channels to 256. For template samples, only the features in the central 7 × 7 region are used to represent the object.

2) Region Proposal Network
The Region Proposal Network is a typical anchor-based decision block. It was proposed for object detection [22], but has gradually become popular in the visual tracking domain owing to its prediction precision. There are two task branches in the block, i.e., a classification branch for distinguishing the object from the background and a regression branch for finding the bounding box of the object. After adjusting the input features, a depth-wise cross-correlation layer is first used in each branch to match a pair of input features. Then, a decision head is constructed to perform object classification or regression. In SiamRPN++, three RPN blocks are employed, corresponding to the output layers of the backbone, whose function can be formulated as

$$C_i = H_{cls}\big(\alpha_{cls}(\phi_i(z)) * \beta_{cls}(\phi_i(x))\big), \qquad L_i = H_{reg}\big(\alpha_{reg}(\phi_i(z)) * \beta_{reg}(\phi_i(x))\big) \tag{2}$$

where $\phi_i(z)$ and $\phi_i(x)$ are the template and the search region features, respectively. $\alpha$ and $\beta$ each denote a 1 × 1 convolutional layer that adjusts the features, and $i \in \{3, 4, 5\}$ indexes the output stages. $*$ denotes the cross-correlation operation, while $H$ represents a classification or regression head, with the subscripts $cls$ and $reg$ marking the branch. $C_i$ and $L_i$ are the classification and the regression results of the different layers, respectively.
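The depth-wise cross-correlation step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `depthwise_xcorr`, the random inputs and the 31 × 31 search feature size are assumptions made for illustration, while the 256 channels and the 7 × 7 template crop follow the text.

```python
import numpy as np

def depthwise_xcorr(search, kernel):
    """Correlate each channel of `search` with the same channel of `kernel`.

    search: (C, Hs, Ws) search-region features
    kernel: (C, Hk, Wk) template features, used as a per-channel filter
    returns: (C, Hs - Hk + 1, Ws - Wk + 1) response features
    """
    c, hs, ws = search.shape
    ck, hk, wk = kernel.shape
    assert c == ck, "template and search features need equal channel counts"
    out = np.zeros((c, hs - hk + 1, ws - wk + 1), dtype=search.dtype)
    for dy in range(out.shape[1]):
        for dx in range(out.shape[2]):
            window = search[:, dy:dy + hk, dx:dx + wk]
            # Sum over the spatial window only; channels stay separate.
            out[:, dy, dx] = (window * kernel).sum(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
template_feat = rng.standard_normal((256, 7, 7))   # central 7x7 template crop
search_feat = rng.standard_normal((256, 31, 31))   # assumed search feature size
response = depthwise_xcorr(search_feat, template_feat)
print(response.shape)  # (256, 25, 25)
```

Because the channels are never mixed, the response keeps one map per channel, and the subsequent head is what turns these maps into classification scores or box offsets.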

C. RESIDUAL FUSION NETWORK
Since the above RPN blocks can perform object state prediction using features with different characteristics, more precise and reliable tracking results will be produced if we combine the response outputs of these blocks in a proper way. The feature expression gradually becomes more abstract as the network depth increases. As a result, the deep-layer features that encode more high-level semantic patterns are suitable for discriminating the object from the background globally, while the features provided by shallow layers, with their massive local detail patterns, should be used to refine the tracking results of the deep-layer blocks. Inspired by this observation, we propose a residual fusion network, named RFNet, to utilize multi-layer features. The network is composed of several residual fusion units, each of which consists of two cascaded 1 × 1 convolutional layers. Specifically, the first layer compresses the channels of the features by half, and the second adjusts the features without reducing the number of channels. A ReLU activation layer is inserted between the two convolutional layers to enhance nonlinearity. The unit requires the features from two different stages simultaneously in order to learn how to aggregate them. Consequently, these features are first concatenated and then fed into the unit for forward propagation. Based on residual learning, we add the results to the original deep-layer features to remove the tracking errors of the deep-layer modules, which can be formulated as

$$y_r = R\big([y_i, y_j]\big) + y_j \tag{3}$$

where $y_i$ and $y_j$ indicate the features from the shallow and the deep layers, respectively, and $[\cdot, \cdot]$ denotes channel-wise concatenation. $R$ depicts the residual fusion unit, while $y_r$ is the fusion result. In our framework, there are two subnetworks that aggregate the classification and the regression features, respectively. Every subnetwork is composed of two residual units, which together combine the features from three stages. The features from the first two stages are fed into the first unit, whose outputs are adopted as the shallow-layer inputs of the last unit. In one subnetwork, the last unit produces the final results of feature fusion:

$$C_f = R_s\big([R_d([C_i, C_{i+1}]) + C_{i+1},\; C_{i+2}]\big) + C_{i+2}, \qquad L_f = R_s\big([R_d([L_i, L_{i+1}]) + L_{i+1},\; L_{i+2}]\big) + L_{i+2} \tag{4}$$

in which $R_d$ and $R_s$ represent the first and the second residual units in a subnetwork, while $i = 3$ denotes the first output stage. $C_f$ and $L_f$ denote the fusion results of the classification and regression features, respectively. In essence, multi-layer feature fusion is a kind of ensemble learning technique, for which one of the most important issues is to design an ensemble module that combines several weak sub-learners into a stronger learner. The technique has been widely discussed and proved to be effective in some previous trackers [34], [35]. For a Siamese network, every decision block can be regarded as a sub-learner, while the fusion approach plays the role of the ensemble module. From this view, it is easy to observe that previous fusion methods [19], [32] are too simple to adaptively integrate sub-learners and maximize the advantage of ensemble learning. In contrast, the proposed fusion network is designed by analyzing the characteristics of each sub-learner, and is able to benefit from training on large-scale image datasets. Therefore, it can accomplish more efficient feature aggregation.
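The residual fusion unit and its two-unit cascade can be sketched in NumPy as follows. This is only a minimal forward-pass illustration under stated assumptions: the class name `ResidualFusionUnit`, the helper `fuse_three_stages`, the random weight initialisation and the 10-channel, 25 × 25 response shape are all hypothetical; the structure (concatenate, 1 × 1 convolution halving the channels, ReLU, 1 × 1 convolution, add the deep-layer input) follows the description above.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution, i.e. a per-pixel linear map over channels. x: (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

class ResidualFusionUnit:
    """Hypothetical stand-in for one fusion unit (forward pass only)."""

    def __init__(self, channels, rng):
        # First 1x1 conv: 2C concatenated channels -> C (halves the channels).
        self.w1 = rng.standard_normal((channels, 2 * channels)) * 0.01
        self.b1 = np.zeros(channels)
        # Second 1x1 conv: C -> C (keeps the channel count).
        self.w2 = rng.standard_normal((channels, channels)) * 0.01
        self.b2 = np.zeros(channels)

    def __call__(self, shallow, deep):
        x = np.concatenate([shallow, deep], axis=0)          # channel-wise concat
        x = np.maximum(conv1x1(x, self.w1, self.b1), 0.0)    # ReLU
        residual = conv1x1(x, self.w2, self.b2)
        return deep + residual                               # deep features are the direct path

def fuse_three_stages(first_unit, second_unit, y3, y4, y5):
    """Cascade: fuse stages 3 and 4, then use that result as the shallow input with stage 5."""
    mid = first_unit(y3, y4)
    return second_unit(mid, y5)

rng = np.random.default_rng(0)
channels = 10  # assumed layout, e.g. 2 scores x 5 anchors for classification
c3, c4, c5 = (rng.standard_normal((channels, 25, 25)) for _ in range(3))
unit_a = ResidualFusionUnit(channels, rng)
unit_b = ResidualFusionUnit(channels, rng)
c_f = fuse_three_stages(unit_a, unit_b, c3, c4, c5)
print(c_f.shape)  # (10, 25, 25)
```

One property worth noting in this form: if the second convolution outputs zeros, the unit returns the deep-layer response unchanged, so the shallow features can only act as a correction to the deep-layer prediction.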

D. ENSEMBLE TRAINING WITH MULTIPLE LOSSES
At present, Siamese networks are generally optimized under a standard training framework, in which only one loss function is used to train the whole network model. However, the tracking performance of our proposed tracker degrades severely if we follow this traditional training route. The core reason is that one loss function is insufficient to guide all network modules to master their corresponding capabilities. For instance, the decision and the fusion modules are directly cascaded in our Siamese tracker. If there is only one loss for training them, the optimizer may regard them as one functional block and give them the same capability. A possible extreme situation is that the fusion module is mistaken for a part of the decision layers, and learns how to predict the object state rather than how to fuse the multi-layer features. Moreover, the depths of the different decision blocks are unbalanced in this condition, as displayed in Figure 3, which may further reduce tracking quality.
Analyzing this problem from the viewpoint of ensemble learning, we find that in the conventional training paradigm all sub-learners, i.e., the RPN blocks, and the ensemble module, i.e., the fusion module, are optimized simultaneously using only one loss function. This manner is inappropriate since it cannot ensure the basic performance of the sub-learners and the validity of ensemble learning. To solve the problem, we present an ensemble training framework for our Siamese tracker, as shown in Figure 4. In the framework, every RPN block and its corresponding feature extraction layers are individually optimized by one basic loss function, and a fusion loss function is adopted to optimize only the proposed residual fusion module.

1) Basic Loss
The role of the basic loss functions is to guide the sub-learners to learn how to track an object, so there are multiple basic losses corresponding to the different sub-learners. In practice, we adopt the training loss presented in SiamRPN++ [7] as the basic loss function, which consists of a classification loss for identifying the object and a regression loss for estimating the bounding box of the object. One RPN block and its feature extraction layers are optimized with the loss

$$\mathcal{L}_i = \mathcal{L}_{cls}(C_i, cls) + \lambda\, \mathcal{L}_{reg}(L_i, reg) \tag{5}$$

in which $C_i$ and $L_i$ denote the classification and the regression results at stage $i \in \{3, 4, 5\}$, respectively. $\mathcal{L}_{cls}$ is the cross-entropy loss for classification, and $\mathcal{L}_{reg}$ is the standard smooth L1 loss for regression. $cls$ represents the binary label of classification, while $reg$ depicts the ground-truth bounding box of the object. $\lambda$ denotes a weight factor for balancing the two kinds of losses. Then, the basic losses of all stages are aggregated as

$$\mathcal{L}_m = \sum_{i=3}^{5} \mathcal{L}_i \tag{6}$$

where $\mathcal{L}_m$ denotes the aggregated result of the multiple basic losses. The optimization of the backbone network and all Region Proposal Networks is completed by backward propagating the gradient of this loss.
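As a worked illustration of the basic loss, the following plain-Python sketch evaluates a cross-entropy term plus a weighted smooth L1 term for a single anchor at each of the three stages and adds the per-stage values. It is a simplified sketch rather than the training code: the function names, the per-stage predictions and the plain summation over stages are assumptions, and a real implementation would average over many sampled anchors and apply the regression term to positive anchors only.

```python
import math

def cross_entropy(scores, label):
    """Softmax cross entropy for one anchor; `scores` = [background, object] logits."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]

def smooth_l1(pred, target, beta=1.0):
    """Standard smooth L1, averaged over the box offsets."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def basic_loss(cls_scores, cls_label, reg_pred, reg_target, lam=1.2):
    """Classification loss plus lambda-weighted regression loss for one stage."""
    return cross_entropy(cls_scores, cls_label) + lam * smooth_l1(reg_pred, reg_target)

# Hypothetical predictions of the three RPN stages for one positive anchor.
stage_outputs = {
    3: ([0.2, 1.0], [0.30, -0.10, 0.20, 0.00]),
    4: ([0.1, 1.5], [0.10, 0.05, 0.10, -0.05]),
    5: ([-0.3, 2.0], [0.05, 0.00, 0.02, 0.01]),
}
reg_target = [0.0, 0.0, 0.0, 0.0]

stage_losses = {i: basic_loss(s, 1, r, reg_target) for i, (s, r) in stage_outputs.items()}
loss_m = sum(stage_losses.values())  # aggregated basic loss (summation is assumed)
print({i: round(v, 4) for i, v in stage_losses.items()}, round(loss_m, 4))
```

The default `lam=1.2` mirrors the value of λ reported in the implementation details.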

2) Fusion Loss
In addition to training the sub-learners with the basic losses, an extra loss function is required to guide the fusion module to combine the decision results of the sub-learners. To keep consistent with the training process of the sub-learners, we optimize the residual fusion network with the same form of loss

$$\mathcal{L}_f = \mathcal{L}_{cls}(C_f, cls) + \lambda\, \mathcal{L}_{reg}(L_f, reg) \tag{7}$$

where $C_f$ and $L_f$ represent the fused classification and regression results output by the fusion network, respectively. During offline training, all network modules are optimized jointly. Specifically, two different optimizers are constructed to train the Siamese network module and the fusion module, respectively. In every batch, we extract several sample pairs of templates and search regions, and then forward propagate them to compute the aggregated basic loss and the fusion loss. Next, we backward propagate the gradients of the basic loss and use the first optimizer to train the backbone and the RPN blocks. The gradients of the fusion loss are backward propagated and applied by the other optimizer to train the residual fusion network. By combining the two kinds of losses, the whole network is trained in an end-to-end manner.
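The decoupled update can be illustrated on a deliberately tiny scalar model with hand-written gradients. This toy is an assumption-laden sketch, not the tracker's training loop: the scalar `w` stands in for the backbone and RPN parameters, `v` stands in for the fusion parameters, the squared-error losses replace the classification and regression losses, and the two learning rates play the role of the two optimizers. The point it shows is that the fusion loss treats the sub-learner output as a constant, so it never changes `w`.

```python
def train_step(w, v, x, target, lr_siamese=0.1, lr_fusion=0.1):
    """One decoupled update on a scalar toy model.

    Sub-learner:  s = w * x        (stands in for backbone + RPN)
    Fusion unit:  f = s + v * s    (residual form, stands in for the fusion network)
    """
    s = w * x
    # Basic loss (s - target)^2 drives only the sub-learner parameter w.
    grad_w = 2.0 * (s - target) * x
    # Fusion loss (f - target)^2 drives only v; s is treated as a constant,
    # so no gradient of the fusion loss reaches w.
    f = s + v * s
    grad_v = 2.0 * (f - target) * s
    return w - lr_siamese * grad_w, v - lr_fusion * grad_v

w, v = 0.5, 0.0
for _ in range(200):
    w, v = train_step(w, v, x=1.0, target=2.0)
print(round(w, 3), round(v, 3))
```

In an automatic-differentiation framework, one way to obtain the same separation would be to stop gradients at the inputs of the fusion module before computing the fusion loss, and to give each optimizer only its own parameter group; whether the original implementation does exactly this is not stated in the text.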

E. IMPLEMENTATION DETAILS

1) Training
The proposed Siamese network is optimized on the training sets of ImageNet VID [36], YouTube-BoundingBoxes [37], COCO [38], ImageNet DET [36], LaSOT [39] and GOT10k [40]. We extract a pair of template and search region samples from different frames of a video sequence, or from a still image with different data augmentations, where the sizes of the object template and search region patches are set to 127 and 255, respectively. The anchor boxes in the RPN blocks are deployed in the way described in [11]. An anchor is labelled as a positive sample if its IoU with the ground truth is larger than 0.6, and as a negative sample if the IoU is lower than 0.3. From one training image pair, we extract only 16 positive and 32 negative samples for network optimization.
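The anchor labelling rule can be written down in a few lines of plain Python. This is a minimal sketch under assumptions: the helper names are hypothetical, boxes are taken as `(x1, y1, x2, y2)` corners, anchors whose IoU falls between the two thresholds are assumed to be ignored, and the subsampling is assumed to be uniform at random.

```python
import random

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt, pos_thr=0.6, neg_thr=0.3):
    """Return 1 (positive), 0 (negative) or -1 (assumed: ignored during training)."""
    overlap = iou(anchor, gt)
    if overlap > pos_thr:
        return 1
    if overlap < neg_thr:
        return 0
    return -1

def subsample(labels, max_pos=16, max_neg=32, seed=0):
    """Keep at most 16 positive and 32 negative anchor indices per image pair."""
    rng = random.Random(seed)
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    keep_pos = rng.sample(pos, min(max_pos, len(pos)))
    keep_neg = rng.sample(neg, min(max_neg, len(neg)))
    return sorted(keep_pos), sorted(keep_neg)

gt_box = (0.0, 0.0, 10.0, 10.0)
anchors = [(1.0, 0.0, 11.0, 10.0), (5.0, 0.0, 15.0, 10.0), (8.0, 0.0, 18.0, 10.0)]
print([label_anchor(a, gt_box) for a in anchors])  # [1, -1, 0]
```

The three example anchors overlap the ground truth with IoU of about 0.82, 0.33 and 0.11, which is why they land in the positive, ignored and negative groups, respectively.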
After initializing the backbone module with the parameters pretrained on the ImageNet dataset [36], we optimize our network using Stochastic Gradient Descent (SGD) with a weight decay of 0.0005 and a momentum of 0.9. The network is trained for 20 epochs with a minibatch size of 32, and one million sample pairs are used in each epoch. We use a warm-up learning rate for network optimization. Concretely, the learning rate increases from 0.001 to 0.005 in the first 5 epochs, and decays from 0.005 to 0.00005 in the last 15 epochs. Moreover, the first two residual blocks of the backbone network are frozen throughout training, and the remaining residual blocks are optimized only in the last 10 epochs. The learning rate of the backbone is 10 times smaller than that of the other network modules. The hyperparameter λ of the losses in Eq. (5) and Eq. (7) is set to 1.2.
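The learning-rate schedule can be sketched as a function of the epoch index. The text gives only the end points of the warm-up and decay phases, so the shapes used here (linear warm-up, decay that is linear in log space) are assumptions, as are the function names; the end points, the 10× smaller backbone rate and the unfreezing of the backbone in the last 10 epochs follow the description above.

```python
import math

def learning_rate(epoch, warmup_epochs=5, total_epochs=20,
                  lr_start=0.001, lr_peak=0.005, lr_end=0.00005):
    """Learning rate of the non-backbone modules for a 0-based epoch index."""
    if epoch < warmup_epochs:
        # Assumed linear warm-up from lr_start to lr_peak.
        frac = epoch / (warmup_epochs - 1)
        return lr_start + frac * (lr_peak - lr_start)
    # Assumed log-space decay from lr_peak to lr_end over the remaining epochs.
    decay_epochs = total_epochs - warmup_epochs
    frac = (epoch - warmup_epochs) / (decay_epochs - 1)
    log_lr = math.log(lr_peak) + frac * (math.log(lr_end) - math.log(lr_peak))
    return math.exp(log_lr)

def backbone_learning_rate(epoch, unfreeze_epoch=10, scale=0.1):
    """Trainable backbone blocks: frozen before `unfreeze_epoch`, then 10x smaller rate."""
    if epoch < unfreeze_epoch:
        return 0.0
    return scale * learning_rate(epoch)

for epoch in (0, 4, 5, 10, 19):
    print(epoch, f"{learning_rate(epoch):.6f}", f"{backbone_learning_rate(epoch):.6f}")
```

A different decay shape (for example step-wise or cosine) would satisfy the same end points, so this sketch should be read as one plausible instantiation.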

2) Inference
Following previous works [7], we extract the template features with the backbone network only in the initial frame, and do not update them during tracking for the sake of stability. In each subsequent frame, we extract the search region sample based on the object state in the previous frame, and compare its features with the template features. After aggregating the response maps of the multiple RPN blocks with the proposed residual fusion network, a cosine window penalty and a scale change penalty are adopted to re-rank the classification scores of all anchors [11]. The anchor with the highest classification score is selected to regress the bounding box of the object. The target size is updated by linear interpolation so that the shape changes smoothly. The hyperparameters of the above penalty and linear interpolation operations are automatically computed using the tracking toolkit [13]. The classification and regression results are displayed in Figure 5, where we find that the proposed tracker provides very accurate and robust tracking results by adaptively fusing multi-layer features.
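The re-ranking and size-smoothing steps can be sketched in plain Python. The formulas below follow the form commonly used in SiamRPN-style trackers and are an assumption about what the penalties look like here; the function names and the values of `penalty_k`, `window_influence` and `lr` are hypothetical placeholders, since the text states that the actual hyperparameters are searched with a toolkit.

```python
import math

def hanning(n):
    """1-D cosine (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def change(r):
    """Symmetric ratio: equal to 1 when nothing changed, larger otherwise."""
    return max(r, 1.0 / r)

def padded_size(w, h):
    """Context-padded scale of a box, as used for the scale penalty."""
    pad = (w + h) * 0.5
    return math.sqrt((w + pad) * (h + pad))

def rerank(scores, sizes, window, prev_size, penalty_k, window_influence):
    """Penalise scale/ratio changes, then blend each score with the cosine window."""
    prev_s = padded_size(*prev_size)
    prev_r = prev_size[0] / prev_size[1]
    ranked = []
    for score, (w, h), win in zip(scores, sizes, window):
        s_c = change(padded_size(w, h) / prev_s)
        r_c = change(prev_r / (w / h))
        penalty = math.exp(-(s_c * r_c - 1.0) * penalty_k)
        ranked.append(penalty * score * (1.0 - window_influence) + win * window_influence)
    return ranked

def smooth_size(prev_size, new_size, lr):
    """Linear interpolation between the previous and the newly regressed size."""
    return tuple(p * (1.0 - lr) + n * lr for p, n in zip(prev_size, new_size))

# Two candidates with equal raw scores: one keeps the previous shape, one is 4x wider.
ranked = rerank(scores=[0.9, 0.9], sizes=[(10.0, 10.0), (40.0, 10.0)],
                window=[0.5, 0.5], prev_size=(10.0, 10.0),
                penalty_k=0.1, window_influence=0.3)
best = max(range(len(ranked)), key=ranked.__getitem__)
print(best, smooth_size((10.0, 10.0), (20.0, 20.0), lr=0.5))  # 0 (15.0, 15.0)
```

With equal raw scores, the candidate whose size and aspect ratio match the previous frame wins, which is the intended effect of the scale change penalty.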

IV. EXPERIMENTS AND DISCUSSION
To evaluate the performance of the proposed Siamese tracker, we conduct extensive experiments on several popular public benchmark datasets, including OTB-2015 [41], VOT-2019 [42], UAV123 [43], LaSOT [39] and GOT-10k [40]. Our tracker is first compared with some state-of-the-art trackers to highlight its superiority, where the comparison results with other Siamese trackers demonstrate the advantage of our fusion scheme. Besides, we perform ablation experiments on the LaSOT dataset to show the role of each contribution in our method. In all experiments, the evaluation protocols presented by the above benchmarks are followed rigorously.

A. COMPARISON WITH THE STATE-OF-THE-ART TRACKERS

1) OTB-100
The Online Tracking Benchmark is a classic benchmark for visual tracking, and its latest version, i.e., OTB-100 [41], consists of 100 fully-annotated video sequences. These sequences cover 11 kinds of challenging attributes, such as background clutter, motion blur and occlusion. Both center location error and overlap ratio are used to evaluate the performance of trackers in the standard protocol. Concretely, center location error indicates the distance between the predicted location and the ground-truth center, and the Precision metric is computed as the percentage of frames in which the center location error is within a given threshold. Overlap ratio measures the Intersection over Union (IoU) of the predicted and ground-truth bounding boxes, and the Success metric represents the percentage of frames in which the overlap ratio is larger than a given threshold. We conduct the evaluation under the One-Pass Evaluation (OPE) protocol.
We compare our tracker with twelve state-of-the-art trackers: TransT [44], SiamBAN [13], SiamR-CNN [45], SiamCAR [19], SiamRPN++ [7], SiamRPN [11], UDT [46], DiMP [47], ATOM [48], ECO [30], CREST [49] and MDNet [50]. To be specific, the first six trackers belong to Siamese tracking frameworks, while the others are discriminative trackers. The overall comparison results in terms of success and precision plots are displayed in Figure 6. It is worth noting that the proposed tracker achieves the best performance on both the Success and Precision metrics. Compared to the baseline SiamRPN++ tracker, our SiamRFL framework gains a 1.3% improvement on Success with an AUC score of 0.709. Compared with SiamR-CNN, which ranks second in terms of Success score, our method is better by 2.8% on Precision. Among these comparison algorithms, SiamBAN and SiamCAR also take ResNet-50 as backbone and output convolutional features from the last three residual blocks. It can be seen that our SiamRFL is superior to them because of its feature fusion ability.
To analyze the performance of all trackers more carefully, we also give success and precision plots for multiple challenging attributes, as displayed in Figure 7 and Figure 8. The results show that our tracker achieves very satisfactory performance on these attributes. In particular, for Illumination Variation (IV), Deformation (DEF) and Out-of-Plane Rotation (OPR), the proposed method ranks first on both Success and Precision. For the Success score, our approach exceeds the second-ranked tracker by 1.8% on the DEF attribute and 1.4% on the OPR attribute. Compared with SiamRPN++, SiamRFL obtains gains of more than 1.0% on several attributes, including Fast Motion (FM), Motion Blur (MB), Low Resolution (LR) and Scale Variation (SV). These results demonstrate that the SiamRFL tracker is able to adapt to many kinds of complex appearance variations. This is because the proposed fusion network can aggregate the multi-layer features with diverse attributes more effectively, which helps the tracker achieve robust object classification and accurate object localization.
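The Precision and Success metrics can be computed from per-frame boxes as in the following plain-Python sketch. It is an illustrative simplification rather than the benchmark toolkit: the function names are hypothetical, boxes are assumed to be `(x, y, w, h)`, the 20-pixel precision threshold and the 21 evenly spaced overlap thresholds follow the common OTB convention, and the success AUC is taken as the mean success rate over those thresholds.

```python
import math

def center_error(pred, gt):
    """Euclidean distance between box centres; boxes are (x, y, w, h)."""
    pcx, pcy = pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0
    gcx, gcy = gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0
    return math.hypot(pcx - gcx, pcy - gcy)

def overlap(pred, gt):
    """IoU of two (x, y, w, h) boxes."""
    ix = max(0.0, min(pred[0] + pred[2], gt[0] + gt[2]) - max(pred[0], gt[0]))
    iy = max(0.0, min(pred[1] + pred[3], gt[1] + gt[3]) - max(pred[1], gt[1]))
    inter = ix * iy
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

def precision(preds, gts, threshold=20.0):
    """Fraction of frames whose centre error is within `threshold` pixels."""
    errors = [center_error(p, g) for p, g in zip(preds, gts)]
    return sum(e <= threshold for e in errors) / len(errors)

def success_auc(preds, gts, steps=21):
    """Mean success rate over overlap thresholds 0, 0.05, ..., 1."""
    ious = [overlap(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(o > t for o in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)

# Two frames: a perfect prediction and a complete miss.
preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
gts = [(0.0, 0.0, 10.0, 10.0), (30.0, 0.0, 10.0, 10.0)]
print(precision(preds, gts), round(success_auc(preds, gts), 3))  # 0.5 0.476
```

In this two-frame example the success rate is 0.5 at every threshold below 1 and 0 at the threshold of exactly 1, which gives the AUC of 10/21.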

3) UAV123
The UAV123 [43] dataset consists of 123 aerial videos captured from a low-altitude UAV platform, whose average length is about 915 frames. Tracking objects in this dataset is challenging due to frequent difficulties such as fast motion, scale change, illumination variation and occlusion. We compare our SiamRFL tracker with several recently proposed methods and present the success and precision plots in Figure 10. The proposed tracker exhibits satisfactory results and surpasses most recent approaches on both metrics. The only exception in the comparison results is the SiamBAN tracker [13], which is top-performing among all trackers by exploiting an anchor-free network for object state prediction.

4) LaSOT
LaSOT [39] is a recent public large-scale tracking benchmark dataset containing 1400 fully-annotated video sequences, where 280 sequences belonging to 70 diverse classes are selected for testing. The dataset is more challenging than typical short-term tracking datasets [41], [42] due to much longer sequences whose average length is about 2500 frames. We validate our proposed tracker following the standard One-Pass Evaluation (OPE). The success and normalized precision plots are illustrated in Figure 11, in which the stateof-the-art SiamBAN [13], SiamRPN++ [7], ATOM [48], SiamMask [25], SiamDW [8], VITAL [56], C-RPN [12], MDNet [50], DSiam [18] and ECO [30] trackers are adopted for comparison. Our SiamRFL tracker outperforms all aforementioned trackers by a significant margin. In comparison with the baseline SiamRPN++, our tracker produces substantial gains of 2.5% on Success and 3.8% on Normalized Precision. These results demonstrate that the proposed fusion network is more effective than the fusion strategy in SiamRPN++, i.e., weighted average. In addition, our method performs better than SiamBAN tracker, which achieves the leading performance among all comparison methods.

5) GOT-10k
The GOT-10k dataset [40] is a recent high-diversity benchmark for generic object tracking, comprising 10k video sequences for training and 180 sequences for testing. The testing videos cover 84 types of objects in the wild with diverse motions.

B. COMPARISON WITH SIAMESE TRACKERS
To highlight the potential of the proposed fusion network, we compare our SiamRFL with several typical Siamese trackers on the OTB-100 dataset. Among these methods, SiamFC [5], SA-Siam [52], StructSiam [57], SiamRPN [11], DaSiamRPN [15], C-RPN [12] and SPM [23] adopt the features output by the last convolutional layer of AlexNet [58], while the remaining trackers, SiamRPN++ [7], PG-Net [32], SiamBAN [13] and SiamCAR [19], employ ResNet-50 [9] as the backbone and combine features from multiple layers for tracking. According to Table 3, our tracker achieves the leading performance on both Success and Precision scores. We observe that fusing multi-layer features from a deeper backbone is very effective in lifting the tracking performance of Siamese trackers, but existing approaches [7], [13], [19], [32] do not fully exploit the benefit of feature aggregation. In contrast, the proposed fusion scheme is more adaptive and powerful, and its effectiveness is verified by the comparison results.

C. ABLATION STUDIES
We compare four variants of the proposed tracker on the LaSOT dataset [39] to show the impact of our contributions: Baseline, Baseline+EnsTrain, Baseline+RFNet and SiamRFL. Concretely, "Baseline" represents the original SiamRPN++ tracker [7] under the standard optimization paradigm, while "Baseline+EnsTrain" denotes the same tracker trained with the proposed ensemble training method. For "Baseline+RFNet", we replace the traditional fusion strategy in SiamRPN++ with our residual fusion network, but still train the network using a standard optimizer. "SiamRFL" indicates our final tracker, in which both the residual fusion network and the ensemble training framework are employed. The success and precision plots of the ablation study on LaSOT are shown in Figure 12. Compared with "Baseline", our ensemble training framework (EnsTrain) lifts the tracking performance by 0.5% on Success and 1.1% on Normalized Precision, which shows that the framework is also useful for simple fusion mechanisms such as weighted average. However, "Baseline+RFNet" suffers drops of 2.4% on Success and 2.3% on Normalized Precision. That is, the performance of the tracker degrades severely if we adopt the proposed residual fusion network without adjusting the training procedure. The final SiamRFL tracker surpasses all other variants, obtaining gains of 2.4% on Success and 3.3% on Precision over the baseline. This indicates that our fusion network is very powerful for visual tracking once an appropriate training method, i.e., the presented ensemble training, is introduced.
To further highlight the advantages of our fusion mechanism, we present the tracking results for aggregating features from different layers and compare them with weighted average (WA), as shown in Table 4. When aggregating two stages, "WA" yields a slight improvement when combining conv-4 and conv-5, but no improvement on the other two combinations. This suggests that weighted averaging cannot fully exploit the benefit of feature fusion. In contrast, our residual network improves the tracking performance more significantly. Taking conv-3 and conv-4 as an example, our fusion method exceeds the conventional weighted average by 2.6% on AUC score and 2.9% on Normalized Precision score. It is even better than the model that combines conv-3, conv-4 and conv-5 via weighted average. In addition, the best results are achieved by using our fusion network to combine all three stages.
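The structural difference between the two fusion strategies compared above can be illustrated with a toy Python sketch. It is not the authors' implementation: response maps are flattened to plain lists, the names `weighted_average` and `residual_fusion` are hypothetical, and the residual branch is reduced to a per-layer scalar `scales`, standing in for whatever learned layers the real fusion unit uses. The sketch only shows that weighted averaging mixes all layers symmetrically, whereas the residual form passes the deep-layer response through unchanged and adds a correction derived from the shallow layers.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of raw scalar weights."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_average(maps, raw_weights):
    """Fuse per-layer response maps with softmax-normalized scalar weights.

    Every layer is treated the same way; only its scalar weight differs.
    """
    ws = softmax(raw_weights)
    size = len(maps[0])
    return [sum(w * m[i] for w, m in zip(ws, maps)) for i in range(size)]


def residual_fusion(deep, shallow_maps, scales):
    """Toy residual fusion: identity path for the deep-layer response plus
    a residual correction computed from the shallow-layer responses.

    `scales` is a hypothetical stand-in for a learned residual branch.
    """
    fused = list(deep)  # identity path: deep-layer response as direct input
    for m, s in zip(shallow_maps, scales):
        for i, v in enumerate(m):
            fused[i] += s * v  # residual path: shallow-layer refinement
    return fused
```

With all `scales` set to zero, `residual_fusion` returns the deep-layer response unchanged, so in this toy form the residual branch only has to model a refinement rather than the full response.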

D. QUALITATIVE RESULTS
The qualitative tracking results of some recent trackers on a subset of OTB-100 [41] sequences are shown in Figure 13. These results demonstrate that our SiamRFL tracker achieves satisfactory visual performance and performs better than the other compared methods. The main reason is that the presented fusion mechanism fuses low-level detail features and high-level semantic features in an adaptive and efficient way, which makes our tracker more robust and accurate when facing various kinds of interference.
In the Fleetface sequence, our approach handles the challenge of in-plane and out-of-plane rotations well and tracks the object closely. In the Jump and Trans sequences, the objects undergo severe scale and deformation variations. The presented tracker successfully adapts to these variations and precisely infers the bounding boxes, whereas other trackers suffer from significant scale and shape drifts. In the Singer2 sequence, our SiamRFL tracker accurately distinguishes the object from the background, which shows that our tracker is robust to background clutter. In the Skating1 sequence, our method identifies the object more robustly even though it is frequently occluded by other similar objects, because our approach can effectively perceive the detailed and semantic differences between two objects with the proposed fusion framework.

V. CONCLUSION
In this paper, we proposed a novel residual fusion network for Siamese trackers, which can aggregate multi-stage features in a powerful way. Specifically, the network utilizes deep-layer features as direct input to identify the object from the background from a semantic view, and refines the object state by exploiting the local detail patterns in shallow-layer features through a residual channel. When incorporating the network into a Siamese tracker, an ensemble training approach was presented to address the degradation problem, which optimizes the Siamese network and the fusion network separately by arranging multiple loss functions. The experimental results on five popular benchmark datasets demonstrated the effectiveness of our residual fusion network and showed that the proposed tracker performs favorably against state-of-the-art trackers.