FPSiamRPN: Feature Pyramid Siamese Network With Region Proposal Network for Target Tracking

Target tracking based on Siamese networks has reached state-of-the-art performance but is still limited in semantic feature extraction. In this paper, we propose a novel method to distinguish positive and negative samples. Taking a deep neural network as the backbone, we fuse the feature maps from different layers and feed them to a Region Proposal Network (RPN). In addition, we add a loss term to the loss function to achieve self-adjustment and to learn more discriminative embedding features for target objects with similar semantics. In the tracking stage, one-shot detection is used as the reference: the template computed from the first frame is fixed as the weight for tracking the subsequent frames. Our method achieves outstanding performance on several benchmark datasets, such as OTB2015, VOT2016, VOT2018, and VOT2019.


I. INTRODUCTION
Visual target tracking has received increasing attention in the past decades and has become a very active research field. It has been widely used in fields such as visual monitoring [1], human-computer interaction [2], pedestrian tracking [43], and augmented reality [3]. Despite recent advances, it is still recognized as a challenging task due to a variety of factors, including changes in illumination, occlusion, and background clutter.
In recent years, most visual tracking algorithms have been related to deep learning [4], [5]. Compared with correlation filtering methods [6]-[9], deep learning methods are more popular. In particular, single-target tracking methods based on the Siamese network [10]-[15] have attracted extensive attention. In the initial offline stage, the SiamFC tracker [11] adopts a fully convolutional network structure to train a deep convolutional network, aiming to solve the more common similarity learning problem, and then conducts online evaluation during the tracking process. (The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.)
To ensure tracking efficiency, the Siamese similarity function learned offline is usually fixed at run time [10], [11]. The CFNet tracker [13] and DSiam tracker [12] update the tracking model with a running-average template and a fast transformation module, respectively. The SiamRPN tracker [14] introduces a region proposal network after the Siamese network and performs joint classification and regression for tracking. The DaSiamRPN tracker [15] further introduces a distractor-aware module and improves the discrimination capability of the model. The SiamRPN++ tracker [16] eliminates the damage caused by strict translation invariance and breaks the limitation of spatial invariance when using deep networks. These Siamese trackers cast the visual object tracking problem as learning a general similarity map from the relationship between the feature representations of the target template and the search area. In SiamMask [17], tracking an object amounts to segmenting the object mask: the mask is extracted first, and the object is then tracked according to its mask. SiamDW [18] studied different kinds of backbones in detail and pointed out how padding affects precision during training for target tracking. ATOM [19] divides the work into two tasks: a classification task, which separates the foreground from the background to obtain a rough target location, and an assessment task, which predicts the state of the target through the bounding box. By decomposing visual tracking into two sub-problems, per-pixel classification and regression of the object bounding box at each pixel, SiamCAR [44] proposed a novel fully convolutional Siamese network to solve end-to-end visual tracking in a per-pixel way.
To avoid the parameter complexity caused by the introduction of RPN, SiamBAN [40] dispenses with many hyperparameters and is more flexible. The re-identification work [41] combines an identification loss with a triplet loss; rather than simply adding weight coefficients before the two losses to form the total loss, it proposes its own dynamic loss training. Zhong et al. [42] proposed a hierarchical tracker that learns movement and tracking by combining a coarse-level data-driven search with a fine-level coarse-to-fine verification. At the coarse level, a data-driven motion model learned by deep recurrent reinforcement learning provides a rough location for their tracker.
Although these Siamese trackers have achieved excellent tracking performance, especially in balancing precision and speed, even a well-performing Siamese tracker such as SiamRPN [14] still shows a significant drop in accuracy in the presence of distractors that resemble the target. These trackers operate on the cross-correlation between the feature maps generated by the two branches of the network, and they ignore the influence of the feature maps generated in the middle of the network on different categories of tracked objects. Inspired by this observation, we analyze the existing Siamese trackers and find that the core reason is that convolutional layers at different levels represent different aspects of the target: the deep feature maps contain more semantic features and can serve as a detector of similar categories, while the lower levels contain more discriminative information and can better separate the target from the background.
To solve this problem and obtain a more generalized Siamese tracker, a feature pyramid Siamese network is proposed, and it is experimentally shown that feature maps of different depths have different representation characteristics. Experimental results are shown in Figure 1. The deeper convolutional layers capture more semantic features of objects, while the lower layers provide more detailed appearance features that better distinguish objects from the background. The proper fusion of these different features helps to separate the target from distractors.
In addition, few of these methods put forward effective solutions for distinguishing between targets and distractors. To address this problem, this paper proposes a loss function that increases the separation between targets and distractors, and experiments show that our method is effective in distinguishing distractors.
Our main contributions are summarized as follows: 1. A feature pyramid fusion method is proposed, which is combined with a deep network backbone to retain more features of the tracked target.
2. Depth-wise cross-correlation is adopted to solve the asymmetry problem, and a discriminative instance embedding loss term is introduced into the loss function for distinguishing objects of the same semantic class or with similar appearance.
3. The proposed tracking network is tested on OTB2015, VOT2016, and VOT2018. The expected average overlap on VOT2016 improves by 2.9% compared to SiamRPN. The precision on OTB2015 improves by 5.3% compared to DaSiamRPN.

II. RELATED WORKS
We mainly introduce recent trackers, especially those based on Siamese networks, and briefly review three aspects related to our work: deep network trackers based on Siamese networks, RPN detection and pyramid feature extraction, and one-shot learning.

A. DEEP NETWORK ANALYSIS
Recently, Siamese networks have attracted great attention in the field of visual tracking because of their balanced accuracy and speed [11]-[13], [20], [21]. GOTURN [21] used a Siamese network as the feature extractor and a fully connected layer as the fusion tensor; by using the predicted bounding box from the last frame as the only proposal, it is a regression method. Re3 [20] used recurrent networks to obtain better features from the template branch. Inspired by related methods, SiamFC [11] introduced a correlation layer as the fusion tensor and greatly improved the accuracy. CFNet [13] added a correlation filter to the template branch, making the Siamese network shallower but more efficient. However, both SiamFC and CFNet use shallow networks.
VOLUME 8, 2020
Since a shallow network cannot fully capture the feature information of the object, it is hard for previous Siamese trackers [10], [11] to achieve good performance: their feature extraction is single and one-sided. However, some research has shown that simply training Siamese trackers with deeper networks such as ResNet does not improve performance. SiamRPN++ [16] points out two inherent limitations when utilizing deep networks for tracking training: 1) The contracting part and feature extractor used in the Siamese tracker have an inherent restriction of strict translation invariance, f(z, x[Δτ_j]) = f(z, x)[Δτ_j] (1), where Δτ_j is a translation shift sub-window operator; this restriction ensures effective training and inference.
2) The contracting part has an inherent restriction of structural symmetry, f(z, x) = f(x, z), which is appropriate for similarity learning.

B. FP AND RPN
The Feature Pyramid (FP) was proposed in the FPN [22] network; it handles targets with similar semantics well because it utilizes context (high-level semantic) information through a top-down pathway. For targets with similar semantic features, FPN increases the resolution of the feature maps (i.e., it operates on larger feature maps to obtain more useful information about similar targets). This method is used in our tracking network to enrich the semantic feature extraction for different objects. The region proposal network was first proposed in Faster R-CNN [23].
Compared with RPN, traditional proposal extraction methods were time-consuming; for example, selective search [24] takes two seconds to process an image. Moreover, the resulting proposals are not of sufficient quality for detection. The enumeration of multiple anchors and the shared convolutional features enable RPN proposal extraction to achieve both high quality and time efficiency. RPN is able to extract more accurate proposals thanks to its foreground-background classification and bounding box regression supervision. There are several Faster R-CNN variants with RPN: R-FCN [25] takes the location information of components into account, and compared with two-stage detectors, improved RPN versions such as SSD [26] and YOLO9000 [27] are efficient detectors. Because of its high speed and excellent performance, RPN has many successful applications in detection and is feasible in tracking.

C. ONE-SHOT LEARNING
In recent years, one-shot learning in deep learning has attracted more and more attention. One-shot learning was first used in face recognition, where a similarity function is trained to detect and match a single sample at a time. Bayesian statistical methods and meta-learning methods are the two main approaches to this problem. In [28], a probabilistic model represents the object category, and Bayesian estimation is adopted in the inference stage. The meta-learning approach acquires the ability to learn, realizing and controlling its own learning.
One-shot detection is regarded as a discrimination task in [29]. Its purpose is to find the parameters W that minimize the average loss L of the prediction function ψ(x; W), computed over a dataset of n samples x_i and corresponding labels ℓ_i: W = argmin_W (1/n) Σ_{i=1}^{n} L(ψ(x_i; W), ℓ_i).
A meta-learning process is used to learn the parameters W of the predictor from a single template z: a feed-forward function ω maps (z; W') to the predictor parameters W. With z_i denoting a batch of template samples, the problem can then be expressed as minimizing (1/n) Σ_{i=1}^{n} L(ψ(x_i; ω(z_i; W')), ℓ_i) over W'. As mentioned above, let z denote the template patch, x the detection patch, ϕ the Siamese feature extraction subnetwork, and ζ the region proposal subnetwork; the one-shot detection task can then be expressed as minimizing (1/n) Σ_{i=1}^{n} L(ζ(ϕ(x_i; W); ϕ(z_i; W)), ℓ_i) over W. The template branch of the Siamese subnetwork can thus be reinterpreted as training parameters that predict the kernel of the local detection task, which is essentially learning to learn. In this interpretation, the template branch embeds the category information into the kernel, and the detection branch performs detection using the embedded information. During the training phase, the meta-learner does not need any supervision other than pairs of bounding boxes. In the inference phase, the Siamese framework is pruned so that only the detection branch remains after the initial frame, which makes it very fast: the target patch from the first frame is sent to the template branch and pre-computed into the detection kernel, so one-shot detection can be performed on the other frames.
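The pruned one-shot inference described above can be sketched as follows. This is an illustrative toy, not the paper's network: a single convolution stands in for the feature extractor ϕ, and all shapes are assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in feature extractor phi (the real backbone is ResNet50 + FPN).
backbone = nn.Conv2d(3, 8, kernel_size=3)

def precompute_kernel(template):           # template: (1, 3, 15, 15)
    # Template branch runs ONCE on the first frame; its embedding is
    # frozen and reused as a detection kernel for every later frame.
    with torch.no_grad():
        return backbone(template)          # (1, 8, 13, 13)

def detect(search, kernel):                # search: (1, 3, 31, 31)
    feat = backbone(search)                # (1, 8, 29, 29)
    # One-shot detection: correlate search features with the fixed kernel.
    return F.conv2d(feat, kernel)          # (1, 1, 17, 17) response map

z = torch.randn(1, 3, 15, 15)              # exemplar patch from frame 1
x = torch.randn(1, 3, 31, 31)              # search patch from a later frame
k = precompute_kernel(z)
score = detect(x, k)
print(score.shape)                         # torch.Size([1, 1, 17, 17])
```

After the first frame only `detect` runs, which is why pruning the template branch makes inference fast.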

III. OUR NETWORK: FPSiamRPN FRAMEWORK
In this section, we describe our proposed FPSiamRPN framework, as shown in Figure 2. Similar to SiamRPN, the framework includes a Siamese subnetwork for feature extraction and a region proposal subnetwork for proposal generation. In our work, the deep network ResNet50 [30], augmented with feature pyramid extraction, is adopted as the backbone. The region proposal network subnetwork (RPN subnet) includes two branches: one is responsible for foreground-background classification, and the other refines the candidate boxes. A depthwise separable structure (depth-wise cross-correlation) is adopted for classification and regression, which uses 10 times fewer parameters than the original RPN network while achieving the same performance.
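The depth-wise cross-correlation mentioned above can be implemented with a grouped convolution, correlating each template channel only with its matching search channel; this is what keeps the parameter count far below an up-channel correlation. A minimal sketch, with assumed feature shapes:

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat, template_feat):
    # search_feat: (B, C, Hs, Ws); template_feat: (B, C, Ht, Wt)
    b, c, h, w = search_feat.shape
    # Fold batch into channels so groups=B*C correlates channel-by-channel.
    x = search_feat.reshape(1, b * c, h, w)
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    out = F.conv2d(x, kernel, groups=b * c)
    return out.reshape(b, c, out.shape[2], out.shape[3])

x = torch.randn(2, 256, 29, 29)   # search-branch features (assumed sizes)
z = torch.randn(2, 256, 5, 5)     # template-branch features
resp = depthwise_xcorr(x, z)
print(resp.shape)                 # torch.Size([2, 256, 25, 25])
```

The multi-channel response map is then fed to small heads that produce the 2k classification and 4k regression channels.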

A. OUR METHOD WITH SIAMESE NETWORK
Tracking algorithms [10], [11] based on the Siamese network cast visual tracking as a cross-correlation problem and learn a similarity score map from a deep model with a Siamese structure, in which one branch learns the feature representation of the target and the other that of the search area. The target box is usually given in the first frame of the sequence and can be regarded as an exemplar z. The goal is to find the most similar instance in each frame x in an embedded semantic space φ(·). The feature mapping relationship is f(z, x) = φ(z) ⋆ φ(x) + b, where b is an offset of the similarity value. Training pairs can be obtained from datasets of annotated videos by extracting an exemplar and a search image centered on the target, as shown in Figure 3. The two images are extracted from two frames of a video that both contain the object, and they are cropped and resized to fit the input size of our network architecture. Object classes are ignored during training.
Here z represents the exemplar and x the search image. When a sub-window extends beyond the image, the missing portions are filled with the mean RGB value. The influence of center bias is removed by overcoming the translation-invariance restriction, so a deep network can be used for visual tracking; ResNet50 is applied as the backbone in our work. The original ResNet [30] has a stride of 32 pixels, which is not suitable for dense Siamese network prediction; the original ResNet50 network is shown in Table 1. In our work, the conv4 and conv5 blocks are modified to have a unit spatial stride, reducing the effective stride of the last two blocks from 16 and 32 pixels to 8 pixels, and the receptive field is enlarged by dilated convolution. We then fuse the outputs of the conv2, conv3, and conv4 blocks with the up-sampled features of conv3, conv4, and conv5, and reduce the channels to 256 by 1 × 1 convolution, as shown in Figure 4. In addition, we find that careful fine-tuning of ResNet improves performance: by setting the learning rate of the ResNet extractor 10 times smaller than that of the RPN part, the feature representation becomes more suitable for tracking tasks.
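The top-down fusion described above can be sketched as an FPN-style module: 1 × 1 lateral convolutions squeeze every block to 256 channels, then each deeper map is up-sampled and added to the shallower one. The channel widths (256/512/1024/2048 for ResNet50 conv2-conv5) and spatial sizes below are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusion(nn.Module):
    """Illustrative FPN-style fusion of conv2..conv5 outputs."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions reduce every block to 256 channels.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, feats):  # feats: [conv2, conv3, conv4, conv5]
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        fused = []
        # Top-down: up-sample the deeper map and add it to the shallower one.
        for low, high in zip(lat[:-1], lat[1:]):
            up = F.interpolate(high, size=low.shape[2:], mode='nearest')
            fused.append(low + up)
        return fused  # [conv2+conv3, conv3+conv4, conv4+conv5]

# conv4/conv5 share a spatial size here, mimicking the unit-stride change.
feats = [torch.randn(1, c, s, s) for c, s in
         [(256, 64), (512, 32), (1024, 16), (2048, 16)]]
outs = PyramidFusion()(feats)
print([tuple(o.shape) for o in outs])
```

Each fused map thus carries both low-level appearance detail and high-level semantics before being passed to the RPN heads.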

B. A NOVEL LOSS TERM FOR DISTINGUISHING SIMILAR OBJECTS
We use depth-wise cross-correlation to obtain the classification and regression channels. For k anchors, the network needs to output 2k channels for classification and 4k channels for regression. The goal of our learning algorithm is to train a discriminative feature embedding applicable to multiple objects of the same category. Existing Siamese networks cannot extract the deeper semantic features well. Although DaSiamRPN [15] reduces the effect of similar distractors by proposing a distractor-aware model and uses NMS for box selection, if objects are too close to each other only the box with the highest score is retained and all surrounding boxes are discarded by NMS. Therefore, when many distractors are close to the tracked object, most of the distractors are lost and their influence on the target cannot be eliminated well.
At the present stage, most target tracking algorithms cannot distinguish distractors from the target well, which is a major problem in target tracking. To reduce the influence of distractors, a discriminative instance embedding loss is proposed to distinguish similar objects. First, the template branch p of the Siamese subnetwork is cross-correlated with the search branch z to obtain the target score, denoted s(p, z). Then, m anchors are generated around the target in the search branch z, and the scores of all anchor regions d_i with the search area z are computed as Σ_{i=1}^{m} s(d_i, z); the resulting scores are fed into a softmax function for binary classification, which separates the tracked target from the surrounding objects. The proposed formula is σ_inst(p, z) = exp(s(p, z)) / (exp(s(p, z)) + Σ_{i=1}^{m} exp(s(d_i, z))), where σ_inst(·) compares the positive score of the tracked target with those of all generated anchor objects (including the target object). By the definition of softmax, the larger the value of σ_inst(·), the greater the probability of being the target. For a batch of N samples, the following discriminative instance embedding loss is proposed, in which θ is a hyperparameter that smooths the loss during training: L_inst = −(1/N) Σ log(σ_inst + θ). The value inside the log is the softmax score of the correctly classified sample, so minimizing this loss drives the softmax value larger.
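The instance loss above can be sketched as a softmax over the target score and the distractor scores. This is a hedged reconstruction from the description, not the authors' code; how θ enters the log is an assumption:

```python
import torch

def instance_loss(s_target, s_distr, theta=1e-6):
    # s_target: s(p, z) per batch item, shape (N,)
    # s_distr:  s(d_i, z) for m surrounding anchors, shape (N, m)
    # theta:    small smoothing hyperparameter (placement is an assumption)
    logits = torch.cat([s_target.unsqueeze(1), s_distr], dim=1)  # (N, m+1)
    # Probability assigned to the true target among target + distractors.
    sigma_inst = torch.softmax(logits, dim=1)[:, 0]              # (N,)
    return -(sigma_inst + theta).log().mean()

s_t = torch.tensor([3.0, 2.0])                 # target scores
s_d = torch.tensor([[1.0, 0.5], [2.0, 2.0]])   # distractor scores
loss = instance_loss(s_t, s_d)
print(float(loss))  # positive; shrinks as the target outscores distractors
```

Raising the target score relative to the distractor scores increases σ_inst and therefore lowers the loss, which is exactly the discriminative behavior the loss is meant to enforce.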

1) ANCHOR GENERATION DETAILS
It has been shown that obtaining candidate objects from an image frame essentially amounts to detecting the objects in the image. A lightweight network, similar to a detection network, is proposed to extract the distractors around the target, generating anchors and selecting candidate boxes. Our network structure consists of two convolutional layers, a pooling layer, and a batch normalization layer. The kernel size of the convolutional layers is 3 × 3, the padding is set to 1, and the stride is set to 1. The anchor ratios follow those used in RPN. The proposed network structure is shown in Figure 5.
By feeding the feature map into the lightweight detection network, bounding boxes of distractors are generated on the image, and the proposed cross-correlation scores s(d_i, z) are then computed. This embeds the unique features of the tracked target and can effectively distinguish similar objects that may appear around it.
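A guess at the lightweight proposal network of Figure 5 is sketched below: two 3 × 3 convolutions (padding 1, stride 1), one pooling layer, and one batch-norm layer. The exact layer ordering, channel widths, and head size are assumptions, since the paper only lists the layer types:

```python
import torch
import torch.nn as nn

class LightweightProposalNet(nn.Module):
    """Sketch of the distractor-proposal network (widths are assumed)."""
    def __init__(self, in_ch=256, mid_ch=64, num_anchors=5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, stride=1, padding=1),  # conv 1
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                    # pooling
            # conv 2: 4 box offsets per anchor at each location
            nn.Conv2d(mid_ch, 4 * num_anchors, 3, stride=1, padding=1),
        )

    def forward(self, feat):
        return self.body(feat)

net = LightweightProposalNet()
boxes = net(torch.randn(1, 256, 24, 24))
print(boxes.shape)  # torch.Size([1, 20, 12, 12])
```

The five anchors per location match the five aspect ratios used elsewhere in the paper.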
For classification and regression, two branches [φ(z)]_cls and [φ(z)]_reg are computed from the template image z, and they are convolved with the corresponding branches [φ(x)]_cls and [φ(x)]_reg of the search area x. The resulting classification score map has dimension w × h × 2k and the regression map w × h × 4k. For the classification loss, the cross-entropy loss L_cls is used as follows.
L_cls = −[y log ŷ + (1 − y) log(1 − ŷ)] (8), where y represents the ground truth and ŷ the estimated value. We use A_x, A_y, A_w, A_h to denote the center point and shape of the anchor and T_x, T_y, T_w, T_h those of the ground-truth box, and obtain the normalized distances δ[0] = (T_x − A_x)/A_w, δ[1] = (T_y − A_y)/A_h, δ[2] = ln(T_w/A_w), δ[3] = ln(T_h/A_h) (9). When training the network with multiple anchors, we use the smooth L1 loss on the normalized regression coordinates. Finally, the loss function is optimized as loss = L_cls + λ L_reg + α L_inst (10), where λ and α are hyperparameters balancing the three parts, and the regression loss is L_reg = Σ_{i=0}^{3} smooth_L1(δ[i]).
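The combined objective loss = L_cls + λ·L_reg + α·L_inst can be sketched as follows, using the standard anchor normalization δ0 = (Tx − Ax)/Aw, δ1 = (Ty − Ay)/Ah, δ2 = ln(Tw/Aw), δ3 = ln(Th/Ah). The sample values and default weights are illustrative only:

```python
import torch
import torch.nn.functional as F

def regression_targets(anchor, gt):
    # anchor/gt as (cx, cy, w, h); returns the normalized deltas.
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = gt
    return torch.stack([(tx - ax) / aw, (ty - ay) / ah,
                        torch.log(tw / aw), torch.log(th / ah)])

def total_loss(cls_logits, cls_labels, pred_delta, target_delta,
               l_inst, lam=1.0, alpha=1.0):
    # L_cls: cross entropy; L_reg: smooth L1; L_inst: instance loss term.
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_labels)
    l_reg = F.smooth_l1_loss(pred_delta, target_delta)
    return l_cls + lam * l_reg + alpha * l_inst

anchor = torch.tensor([10.0, 10.0, 8.0, 8.0])
gt = torch.tensor([12.0, 9.0, 10.0, 7.0])
delta = regression_targets(anchor, gt)
loss = total_loss(torch.tensor([2.0]), torch.tensor([1.0]),
                  delta, torch.zeros(4), l_inst=torch.tensor(0.1))
print(delta)  # tensor([ 0.2500, -0.1250,  0.2231, -0.1335])
```

In training, λ and α would be tuned to balance the three terms.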

C. ONLINE TRACK
The output of the template branch is used as the weights for tracking the subsequent frames. The two kernels generated by the template branch are pre-computed on the initial frame and fixed during the whole tracking period. On the detection frame, we obtain the classification and regression outputs by forward propagation and generate multiple candidate boxes; as in SiamRPN [14], candidate boxes are extracted from the proposals. Meanwhile, a cosine window and a scale-change penalty are used to re-rank the scores of the candidate boxes and obtain the best one. After outliers are discarded, the cosine window suppresses large displacements. The proposed penalty term, which controls size and ratio changes, is penalty = exp(k · max(r/r′, r′/r) · max(s/s′, s′/s)) (11), where k is a hyperparameter, r is the ratio between the height and width of the proposal, and r′ is the ratio of the last frame; s and s′ represent the overall size of the proposal and of the last frame, calculated as (12): (w + p)(h + p) = s², where w and h are the width and height of the target and p is the padding, with value (w + h)/2. After that, the classification scores are multiplied by the temporal penalty, the top k candidate boxes are re-ranked, and non-maximum suppression is performed to obtain the final tracking bounding box. After the final bounding box is selected, the target size is updated by linear interpolation.
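The score re-ranking above can be sketched as follows. This follows the common SiamRPN-style implementation, where the penalty is written as exp(−k·(change − 1)) so that it equals 1 when the ratio and size are unchanged; the constant k and the window weight are assumed values:

```python
import numpy as np

def overall_size(w, h):
    # s from (w + p)(h + p) = s^2, with padding p = (w + h) / 2.
    p = (w + h) / 2.0
    return np.sqrt((w + p) * (h + p))

def penalty(w, h, w_last, h_last, k=0.04):
    # change >= 1 measures how much ratio and size drift between frames.
    r, r_last = w / h, w_last / h_last
    s, s_last = overall_size(w, h), overall_size(w_last, h_last)
    change = max(r / r_last, r_last / r) * max(s / s_last, s_last / s)
    return np.exp(-k * (change - 1))  # exactly 1 when nothing changed

def rerank(scores, pens, window, w_inf=0.4):
    # Penalized score blended with a cosine window to suppress big jumps.
    return scores * pens * (1 - w_inf) + window * w_inf

print(round(penalty(10, 20, 10, 20), 4))  # 1.0 (no change)
print(penalty(10, 20, 14, 18) < 1.0)      # True (shape drifted)
```

Proposals whose size or aspect ratio jumps between frames are thus down-weighted before the top-k selection and NMS.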

IV. EXPERIMENT
A. EXPERIMENTAL DETAILS
1) TRAINING
The backbone of our architecture is pre-trained on ImageNet [31] image labels. We train the network on the training sets of the COCO [32], ImageNet DET [31], and ImageNet VID [31] datasets; the training set size is more than 150 GB. In training and testing, a single-scale image representation is used, with 127 pixels for the template and 255 pixels for the search area. After pre-training the Siamese subnetwork on ImageNet, the stochastic gradient descent (SGD) optimizer is used to train FPSiamRPN end-to-end. Some data augmentation, such as affine transformation, is used to train the regression branch. Since the same object changes little between two adjacent frames, we select fewer anchors in the tracking task than in a detection task: only one scale of anchors with different aspect ratios is used. In our experiment, the anchor ratios are set to [0.33, 0.5, 1, 2, 3].
The strategy for selecting positive and negative training samples is also very important in our framework. Here we use the criteria from the object detection task: IoU together with two thresholds th_hi and th_lo is used as the measurement. Positive samples are defined as anchors with IoU > th_hi with the corresponding ground truth, and negative samples as anchors satisfying IoU < th_lo. The parameter th_lo is set to 0.3 and th_hi to 0.6. We also allow at most 16 negative samples and 64 samples in total per training pair. Our experiments are implemented in PyTorch on a PC with an Intel i7 CPU, 8 GB RAM, and an NVIDIA GTX 2080 Ti GPU.
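The anchor-labeling rule above can be sketched directly; anchors between the two thresholds are ignored, as is standard in detection-style training (the ignore label −1 is an assumption on the encoding):

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2).
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt, th_lo=0.3, th_hi=0.6):
    # 1 = positive, 0 = negative, -1 = ignored during training.
    labels = np.full(len(anchors), -1, dtype=int)
    for i, a in enumerate(anchors):
        v = iou(a, gt)
        if v > th_hi:
            labels[i] = 1
        elif v < th_lo:
            labels[i] = 0
    return labels

gt = (10, 10, 50, 50)
anchors = [(12, 12, 52, 52), (40, 40, 90, 90), (200, 200, 240, 240)]
print(label_anchors(anchors, gt))  # [ 1  0  0]
```

A subsequent sampling step would then cap the negatives and total samples per training pair as described above.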

2) EVALUATION
We focus on short-term single-target tracking on OTB2015 [33], VOT2016 [34], and VOT2018 [35]. Each VOT dataset has 60 videos, and OTB2015 has 100 videos. All tracking results use the reported results to ensure a fair comparison.

B. ABLATION EXPERIMENTS
1) BACKBONE ARCHITECTURE
The choice of feature extractor is important, as the number of parameters and the types of layers directly affect the memory, speed, and performance of the tracker. We compare different network architectures for visual tracking.

FIGURE 6. Further qualitative results of our method on sequences from the visual object tracking benchmark OTB2015. The green box represents the ground truth and the yellow box represents our tracking box.
In our work, AlexNet, ResNet18, ResNet34, ResNet50, and ResNet-FPN (our backbone) are used as backbones. We report performance by the area under curve (AUC) of the success plot on OTB2015 with respect to their top accuracy on ImageNet. Table 2 illustrates that replacing AlexNet with our backbone improves performance considerably on the VOT2018 dataset. Besides, experimental results show that fine-tuning the backbone is critical and yields a great improvement in tracking performance.

2) PYRAMID FEATURE AGGREGATION
To investigate the impact of pyramid feature aggregation, we first train three variants with a single RPN on ResNet50. We empirically found that conv4 of ResNet50 alone can achieve a competitive performance of 0.344 in EAO. In comparison, pyramid feature aggregation (combining L3, L4, and L5) yields an EAO score of 0.363 on VOT2018, which is 7.7% higher than that of the single-layer baseline.

3) DEPTHWISE CORRELATION
We compare the original up-channel cross-correlation layer with the proposed depthwise cross-correlation layer.

FIGURE 9. Performance and speed of our tracker and some state-of-the-art trackers on VOT2016. Closer to the top means higher precision, and closer to the right means faster. FPSiamRPN ranks 1st in EAO.

As shown in Table 2, the proposed depthwise correlation gains a 2.2% improvement on VOT2018 and a 1.2% improvement on OTB2015, which demonstrates the importance of depthwise correlation. This is partly because the balanced parameter distribution of the two branches makes the learning process more stable and converge better.

C. RESULTS ON OTB2015
OTB2015 [33] contains 100 video sequences for tracking and is a mature, authoritative benchmark. The evaluation relies mainly on two indicators: precision and success rate. The precision plot shows the percentage of frames whose tracking result is within 20 pixels of the target. The success plot shows the ratio of successful frames as the threshold varies from 0 to 1, where a successful frame is one whose overlap is larger than the given threshold. The area under the curve of the success plot is used to rank tracking algorithms.
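The two OTB metrics described above can be computed as follows; the per-frame center errors and overlaps below are made-up example values:

```python
import numpy as np

def precision_at(center_errors, threshold=20.0):
    # Fraction of frames whose center error is within the pixel threshold.
    errors = np.asarray(center_errors, dtype=float)
    return float((errors <= threshold).mean())

def success_auc(overlaps, n_thresholds=21):
    # Success rate at overlap thresholds swept from 0 to 1, then averaged,
    # which approximates the area under the success plot.
    overlaps = np.asarray(overlaps, dtype=float)
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    success = [(overlaps > t).mean() for t in thresholds]
    return float(np.mean(success))

errs = [5, 12, 25, 8, 40]          # per-frame center errors in pixels
ious = [0.9, 0.7, 0.3, 0.8, 0.1]   # per-frame overlap (IoU) values
print(precision_at(errs))          # 0.6
print(success_auc(ious))
```

Averaging the success curve over evenly spaced thresholds is a standard discretization of the AUC used to rank trackers.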
The standardized OTB benchmark provides a fair and robust testing platform. Siamese-based trackers formulate tracking as a one-shot detection task without any online update, so their performance is inferior on this benchmark without resetting. Moreover, the limited representation power of shallow networks is the main obstacle preventing Siamese trackers such as SiamFC [11] from exceeding the top-performing methods.
In the experiment, we compared our method with a series of related tracking methods. Qualitative results of FPSiamRPN on OTB2015 sequences are shown in Figure 6. As shown in Figure 7, compared with GradNet [38], SiamRPN++ [16], DaSiamRPN [15], SRDCF [39], SiamFC [11], and CFNet [13], FPSiamRPN ranks at the top in both the success and precision plots (the SiamRPN++ result is trained with our training datasets). From Figure 7, the proposed algorithm achieves high precision and success rates.

D. RESULTS ON VOT2016
The VOT2016 [34] dataset consists of 60 sequences. Performance is evaluated based on accuracy (the average overlap during successful tracking) and robustness (failure times). The expected average overlap (EAO), which considers both kinds of precision, is used to evaluate the overall performance. Besides, the speed is evaluated with a normalized speed (EFO). We compared several published state-of-the-art trackers: Figure 8 illustrates the EAO ranking, and detailed information about their performance on VOT2016 is shown in Table 3. To show that our tracker achieves superior performance while operating at high speed, Figure 9 reports the performance and speed of the state-of-the-art trackers. Qualitative results of FPSiamRPN on VOT2016 sequences are shown in Figure 11; most of the selected sequences contain distractors.

FIGURE 11. Further qualitative results of our method on sequences from the visual object tracking benchmark VOT2016. The green box represents the ground truth and the yellow box represents our tracking box.

TABLE 4. Comparison with the state-of-the-art in terms of expected average overlap (EAO), robustness, and accuracy on the VOT2018 benchmark. Our tracker performs well among the top-ranked methods. Red, blue, and green represent 1st, 2nd, and 3rd, respectively.

E. RESULTS ON VOT2018 AND VOT2019
We test our FPSiamRPN tracker on the VOT2018 dataset [35] in comparison with some state-of-the-art methods. VOT2018 is one of the most recent datasets for evaluating online model-free single-object trackers and includes 60 public sequences with different challenging factors. Following the VOT2018 evaluation protocol, the expected average overlap (EAO), accuracy, and robustness are adopted to compare different trackers. The detailed comparisons are reported in Table 4, and Figure 10 illustrates the EAO ranking on VOT2018. As Figure 10 shows, in some cases the proposed FPSiamRPN method ranks second best compared with the state-of-the-art methods; however, the experimental results of the proposed method are very good in most cases.
From Table 4, the proposed FPSiamRPN method achieves the top-ranked performance on accuracy and the second-ranked performance on the expected average overlap criterion. In particular, our FPSiamRPN tracker outperforms all compared trackers in accuracy, achieving a substantial improvement over SiamRPN with a gain of 1.7%. Qualitative results of FPSiamRPN on VOT2018 sequences are shown in Figure 12.
In our work, the proposed method focuses on reducing the impact of distractors on the target object and ignores some detailed features of the target. In some experimental results on the VOT2019 benchmark [46], the EAO, accuracy, and failure performance are not very good: our method is not able to distinguish multiple targets in a blurred line of sight, and there is still some drift when distinguishing objects with high similarity. The experimental results are shown in Table 5; from Table 5, our EAO and accuracy are lower than those of the traditional methods.

V. CONCLUSION
In this paper, a Siamese region proposal network based on hierarchical pyramid feature fusion (FPSiamRPN) is proposed, which is trained offline end-to-end with large-scale image pairs from COCO and ImageNet. FPSiamRPN automatically adjusts the bounding boxes and obtains more accurate proposals by applying a box refinement procedure. In the tracking stage, one-shot detection is used as the reference. In the experiments, our method achieves robust and strong real-time performance at high speed on OTB2015, VOT2016, and VOT2018.
Because the candidate box selection for target detection adopts the principle of distinguishing the target from the distractors, the proposed method still has some limitations: if the selection of the candidate box is not accurate, the computation of the target probability with softmax is affected. Meanwhile, the learning of the target features could be made more general. We show some failure cases in Figure 13: the proposed method is not able to distinguish multiple targets in a blurred line of sight, and the results still show some drift in distinguishing objects with high similarity.