Object Tracking Using Siamese Network-Based Reinforcement Learning

Object tracking is a technique for following a specific object appearing in a video sequence while observing its features or changes. Recently, many high-performance algorithms have emerged by applying the Siamese network to the object tracking field. A Siamese network is designed to learn the similarity between two images. In object tracking, the Siamese network tracks the object by finding the location in the search image that is most similar to the target image. Algorithms based on Siamese networks are vulnerable to partial and total occlusion of objects. In addition, since the object is tracked using only the similarity with the image obtained using the ground-truth bounding box of the first frame, if the object is missed once, then errors accumulate, and the tracker frequently drifts away from the object of interest. Therefore, in this paper, we propose a reinforcement learning model that maximizes the reward for tracking success after partial and total occlusion of an object. We also propose a dynamic template exchange method that uses a template successfully tracked in a recent frame to solve the drift problem. When the proposed model is applied to existing tracking models and evaluated quantitatively on the representative object tracking benchmarks VOT2018 and OTB50, the accuracy improves and the number of tracking failures decreases compared to the existing methods. As a result, an accuracy of 0.618, robustness of 0.234, and expected average overlap (EAO) of 0.416 are achieved in VOT2018, and success of 0.673 and precision of 0.881 are achieved in OTB50.


I. INTRODUCTION
Visual object tracking is a fundamental computer vision task. In this field, it is possible to infer the correlation of target objects between frames in a video sequence. It is used as a basic work of video application in fields such as robot vision [1], [2], self-driving [3], [4], and surveillance systems [5], [6]. Although tracking algorithms are used in various applications, problems such as partial and full occlusion of objects, scale changes, and object/camera motion remain challenges to be solved [7]. That is, spatial features and temporal features must be present so that the initially selected object of interest can be tracked to the end. It is necessary to solve the occlusion and drift problems that occur during the tracking process. However, occlusion is difficult to define for annotating as ground-truth in the training dataset. Most of the training datasets constructed thus far are only annotated with 1 or 0 in the frame in which the occlusion occurs. We need data or a model that can effectively learn about the occlusion and drift. Reinforcement learning is mainly used in tasks where training data are scarce or ground-truth setting is difficult. In object tracking, reinforcement learning can experience success and failure through tracking simulation. Therefore, in this paper, we propose a reinforcement learning model that can be applied to the existing tracker by defining the state, action, and reward to solve the occlusion and drift problems. Our proposed reinforcement learning model integrates the channels divided into foreground and background into a single channel. Then, the agent learns to select where the tracking can be successful in the feature of the occlusion situation. Existing methods using reinforcement learning to be described in Section 2 are designed to move the bounding box. The model then has a prior experience of the location of the bounding box. 
This tends to keep track of the intact objects, and it is more likely to lose the target object in the event of an occlusion. The proposed method in this paper allows the agent to pre-experience tracking hindrances such as occlusion and drift. From the feature map at that time, a feature that can be successfully tracked is selected. This can be learned in a way that maximizes the rewards the agent can earn in a reinforcement learning environment.
Recently, the Siamese network [8] has been applied to tracking tasks, showing balanced performance in speed and accuracy, and various applications continue to appear. Typically, in [9]-[12], the ground-truth of the object of interest in the first frame was maintained as a template, and the object was tracked until the end of the video sequence. These models were designed as CNNs, so they mainly capture spatial features. It is therefore difficult for them to solve the continuous tracking problem caused by temporal features within a sequence [13]. References [14]-[16] solved it by matching several templates with the object of interest during tracking. However, this requires an additional model to be designed. To simplify this, in this paper, the dynamic template exchange method of Yan et al. [13] is applied to a Siamese network to capture temporal information, thereby solving the temporal problem.
In general, object tracking models are trained using the coordinates of the bounding box representing the object location. The predicted coordinates have various influences, such as the starting point for inference in the next frame and the motion model. The tracking model makes inferences every frame. This is why a single inference affects the tracking until the end of the sequence. When an occlusion occurs, the tracking of a part of the object causes errors to accumulate and drift or leads to tracking success. Even if the overlap ratio between the ground-truth and the estimated result in one frame is measured to be high, it may not be a successful inference in the entire sequence. Fig. 1 shows the tracking results of the red box tracker and the green box tracker initialized with the ground-truth (cyan box) in the first column. The green box tracking results (GOTURN [17], ATOM [18], and DiMP50 [19]) in the second column have a higher intersection over union (IoU) with the ground-truth than the red-box tracking result. In the second column, the IoU value between the red box and ground-truth falls below 0.5. This means that tracking fails. However, similar to the third and fourth columns, when the occlusion of the tracking object is finished, the complete object shape is found due to the position of the bounding box, and when the sequence ends, the tracking can be successful. In this way, it can be confirmed that the inference result for every frame has a great influence on the accuracy and robustness of the tracking model. As with most tracking models, pretrained models and CNNs in the backbone network tend to track larger and intact objects. When the tracking object is obscured by other objects, it will track other objects that appear intact. Therefore, tracking results such as green boxes occur frequently.
In this paper, to solve this problem, the tracking performance is improved by learning the experience of tracking success and failure through a reinforcement learning model that rewards when tracking is successful in the last frame of the sequence and gives a penalty when it fails. By combining a Siamese network and a region proposal network, the similarity score map output from the object tracking model and the vector for the moving direction of the object are set as the state, and the selection of the location of the object on the score map is set as the action. The reward is given according to the success of tracking the last frame in the learning sequence. As a result, state-of-the-art performance is achieved by linking the reinforcement learning model and dynamic template exchange method proposed in the VOT2018 [20] and OTB50 [21] benchmarks with the existing tracker.
The main contributions of this work are as follows:
• We propose a new reinforcement learning framework to solve the occlusion problem.
• We propose a dynamic template exchange method applicable to Siamese network-based tracking algorithms to solve the drift problem.
• Our proposed method outperforms the state-of-the-art methods in VOT2018 and OTB50.
The structure of this paper is as follows. Section 2 introduces the study of applying visual object tracking and deep reinforcement learning to the tracking model. Section 3 introduces the reinforcement learning model proposed in this study, the problem settings in reinforcement learning, and the dynamic template exchange method in the Siamese network. Section 4 compares the performance with existing studies on the tracking benchmark and verifies that the performance is improved. Finally, a conclusion is drawn in Section 5.
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.

II. RELATED WORKS
The purpose of this study is to improve the performance of the existing Siamese network-based object tracking model using reinforcement learning. In this section, we review the existing object tracking models and representative methods in the field of object tracking using reinforcement learning. In the object tracking section, we briefly introduce CNN-based tracking models. Additionally, the object tracking models used as the starting point of this paper, SiamRPN [10], SiamRPN++ [11], and SiamMask [12], are explained. In the reinforcement learning-based object tracking section, existing studies on how the reinforcement learning model is applied to the object tracking model are described.

A. VISUAL TRACKING
Until the Siamese network-based object tracking model was developed, many tracking models using the basic structure of CNN were developed. C-COT [22] proposed a new structure that uses a continuous convolution filter instead of a discriminative correlation filter for learning, greatly improving the tracking performance. ECO [23] improved the accuracy and speed performance by optimizing the key factors that degrade the tracking performance in C-COT. MDNet [24] significantly improved tracking performance by suggesting shared layers to obtain a general target representation and a domain-specific layer structure for a binary classifier that identifies targets in each domain. VITAL [25] proposed a new structure for applying a generative adversarial network [26] to object tracking. GOTURN [17] proposed a structure that is similar to the Siamese network but shares features extracted from CNNs from two input images in a fully connected layer. Although the above CNN-based tracking algorithms have been proposed in various network structures, SiamFC [9], a tracking model based on the Siamese network, shows a balanced performance in terms of accuracy and speed and changes the paradigm of the object tracking algorithm. SiamFC is a study aimed at proving the efficiency of the Siamese network. Without adding any additional cues, the output of the model was used without bounding box regression. Therefore, various studies (SiamVGG [27], SE-SiamFC [28], and SiamDW [29]) were conducted based on SiamFC research, and its application to thermal infrared images (HSSNet [30], MLSSNet [31], and MMNet [32]) as well as RGB images shows high performance. Among them, SiamRPN [10], which applied the region proposal network [33] to SiamFC, and SiamMask, which added a mask branch to SiamRPN, are representative algorithms that greatly improved the performance of the Siamese network-based object tracking algorithm. 
The Siamese network framework as above has been used as a starting point for various studies (SiamMask_E [34], THOR [16], and Siam R-CNN [35]) until recently. In this paper, SiamMask, which shows higher performance, is used as the starting point.
First, SiamRPN [10] inputs the results of the Siamese network to the classification and regression branches of the region proposal network. Then it outputs anchor box positions and classification scores for objects and backgrounds through cross-correlation. Because only offline learning is performed, it runs at a fairly high speed. During training, the classification branch outputs two channels (positive and negative) for each anchor, and the cross-entropy loss ($L_{cls}$) is used. The regression branch outputs the center coordinates, width, and height of each anchor as 4 channels ($dx$, $dy$, $dw$, and $dh$). The smooth $L_1$ loss function ($L_{reg}$) is adopted for regression. The input to the loss function is the normalized coordinates ($\delta$) of the ground-truth box ($T_x, T_y, T_w, T_h$) and the anchor box ($A_x, A_y, A_w, A_h$), defined as in (1). Finally, the total loss of SiamRPN is given by (2). Here, $\lambda$ ($\geq 0$) is a hyperparameter.
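For reference, the quantities referred to as (1) and (2) can be written out as below. This is a reconstruction following the notation of the original SiamRPN paper [10], not a copy of this paper's typesetting: $(A_x, A_y, A_w, A_h)$ is assumed to denote the center, width, and height of an anchor box, $(T_x, T_y, T_w, T_h)$ those of the ground-truth box, and $\sigma$ the parameter of the smooth $L_1$ loss.

```latex
\delta[0] = \frac{T_x - A_x}{A_w}, \quad
\delta[1] = \frac{T_y - A_y}{A_h}, \quad
\delta[2] = \ln\frac{T_w}{A_w}, \quad
\delta[3] = \ln\frac{T_h}{A_h}
\tag{1}

\mathrm{smooth}_{L_1}(x, \sigma) =
\begin{cases}
0.5\,\sigma^2 x^2, & |x| < \dfrac{1}{\sigma^2} \\[6pt]
|x| - \dfrac{1}{2\sigma^2}, & |x| \ge \dfrac{1}{\sigma^2}
\end{cases}
\qquad
L_{reg} = \sum_{i=0}^{3} \mathrm{smooth}_{L_1}(\delta[i], \sigma)

loss = L_{cls} + \lambda L_{reg}
\tag{2}
```

The center offsets are normalized by the anchor size and the width and height by a log ratio, so the regression targets stay on a comparable scale for anchors of different sizes.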
SiamRPN++ [11] is an improved model that explores some disadvantages of SiamRPN. It shows the highest performance among contemporary object tracking models by removing padding to maintain spatial invariance and reducing parameters by changing the cross-correlation method to depthwise cross-correlation.
SiamMask extends SiamRPN with a mask branch and a corresponding loss function to encode the features required for outputting the binary segmentation mask of an object. Since a mask can be obtained for the object, the gap between visual object tracking and video object segmentation is reduced, and tracking performance is greatly improved. The binary mask is predicted from a target image ($z$) and a search image ($x$) through a mask branch ($h_\phi$). That is, the binary mask corresponding to the feature map obtained by the depthwise cross-correlation layer can be expressed as shown in (3), so that a mask can also be generated for the search image.
The loss function for the mask during training is defined as a logistic loss between the pixel-wise annotated ground-truth and the predicted mask. Finally, as shown in (4), SiamMask's total loss adds a loss function ($L_{mask}$) for the mask branch to the two loss functions used for SiamRPN training. The model is trained using COCO [36], ImageNet-VID [37], and YouTube-VOS [38], with the hyperparameters following Pinheiro et al. [39]. Here, $\lambda_1 = 32$ and $\lambda_2 = \lambda_3 = 1$ are set.
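The expressions referred to as (3) and (4) can be sketched as follows. The notation is taken from the original SiamMask paper [12] and is an assumption about how they appear here: $g_\theta^n(z, x)$ is the depthwise cross-correlation response at the $n$-th candidate location, $m_n$ the predicted mask of size $w \times h$, $y_n \in \{\pm 1\}$ the label of that location, and $c_n^{ij} \in \{\pm 1\}$ the ground-truth label of pixel $(i, j)$.

```latex
m_n = h_\phi\left(g_\theta^n(z, x)\right)
\tag{3}

L_{mask}(\theta, \phi) = \sum_n \left( \frac{1 + y_n}{2wh} \sum_{ij}
  \log\left(1 + e^{-c_n^{ij} m_n^{ij}}\right) \right)

L_{3B} = \lambda_1 \cdot L_{mask} + \lambda_2 \cdot L_{score} + \lambda_3 \cdot L_{box}
\tag{4}
```

The factor $(1 + y_n)/2$ is zero for negative locations, so the mask loss is accumulated only over positive candidate locations.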
In inference, the binary mask of the object is predicted at the index that outputs the highest score in the score branch. The search region is cropped by referring to the bounding box location from the box branch of the corresponding index. Although it showed high performance in the VOT2018 benchmark, tracking performance is still poor when tracking motion blur and nonobjects. The reason is that, as the author mentioned, the training dataset focused on intact objects. Supervised learning repeatedly learns well-refined data despite the use of data augmentation. Because it learns while reducing the error between the inferred result and ground-truth on the annotated data, it self-limits the actual test data. As mentioned in Section 1, this was overcome by performing tracking without ground-truth through tracking simulation in reinforcement learning. Yan et al. [13] presented the problem that if only convolutional operation is used, training on temporal features is difficult and vulnerable to large-scale changes in objects. To solve this problem, Transformer [40], which is mainly used in the natural language processing (NLP) task, is used. In this paper, the concept of a dynamic template proposed by Yan et al. is applied to a Siamese convolutional network to overcome the problem of capturing temporal features.

B. REINFORCEMENT LEARNING
ADNet [41], the most representative algorithm applying reinforcement learning to object tracking, was a great inspiration for this study. Yun et al. [41] pointed out the inefficiency of the search algorithm of MDNet, which selects the best candidate by matching the tracking model after searching the region of interest. In addition, they raised the problem that labels must be annotated on all frames to train the model. To solve these problems, an algorithm combining supervised learning and reinforcement learning was proposed. Silver et al. [42] showed that the performance of a reinforcement learning policy network can be significantly improved if it is pretrained through supervised learning. Similarly, in ADNet, the network is first trained by supervised learning with action labels annotated according to states, and its parameters are then updated through reinforcement learning. ADNet tracks the object by controlling the bounding box in the sequence through successive actions selected by the model. The action is an 11-dimensional vector, and the movement and scale adjustment of the bounding box are defined as shown in Fig. 2. A state is defined as a tuple containing the image patch within the bounding box and a vector of the previous 10 actions. When the model chooses the stop action, the agent is rewarded and then transitions to the initial state of the next frame. The reward is given by comparing the result of the sequential actions with the ground-truth using IoU, and the parameters of the model are updated through stochastic gradient ascent (SGA) [43] to maximize the reward. Additionally, even if the ground-truth exists only intermittently in the video sequence, rewards need to be given only for the frames where the ground-truth exists. Due to this advantage, semi-supervised learning is possible, which can overcome the limitations on test data.
Zhang et al. [44] proposed a method to learn spatial and temporal information by applying reinforcement learning to a network combined with CNN and LSTM [45]. Similar to ADNet, they used the reinforcement learning algorithm proposed by Williams [43] and designed a CNN to encode the features extracted from the frame and an RNN to regress the position of the target object in time steps.
TRASFUST [46] designed a model by combining knowledge distillation (KD) [47] and reinforcement learning. TRASFUST defines a state as a pair of image patches cropped by the bounding box. Unlike ADNet, the action is defined as a vector of the amount of change in the motion of the bounding box. Using KCF [48], MDNet [24], ECO [23], and SiamRPN [10], which have significantly improved performance in the tracking field, as teacher networks, the student learns the movement of the bounding box predicted by the teacher and transitions the state. This showed state-of-the-art performance on benchmarks such as GOT-10k [49] and UAV123 [50] but showed low performance in VOT2019 [51]. This is because the VOT2019 benchmark was built to evaluate which algorithm estimates the best bounding box by defining the center of an object rather than an intact object as a ground-truth bounding box. However, since TRASFUST tends to track whole objects, it appears to perform better than other object tracking algorithms in qualitative evaluation.
As mentioned above, object tracking algorithms using reinforcement learning have been actively studied. Most object tracking models using reinforcement learning are designed to refine the location of bounding boxes. However, in this paper, it is designed to select a better feature rather than refine the position of the box. The complexity of the input image is reduced by using the feature map of the score branch output by the Siamese network-based tracking model as input. Similar to ADNet, the performance of the tracking model was improved by designing to maximize the reward for tracking success.

FIGURE 2. ADNet's action set.
In this section, we first describe the model structure of the reinforcement learning framework proposed in Section A. It takes the score map of the Siamese-network-based tracker as input, passes through two convolution layers, and outputs the action to succeed in tracking until the end of the sequence. Section B describes how to define the problems of state, action, and reward in reinforcement learning. Finally, the detailed implementation, learning method, and reasoning process are described.

A. PROPOSED MODEL
The Siamese network-based tracking algorithm tracks the object only by similarity to the ground-truth of the first frame. As a result, if the model misses the tracking object once, then errors accumulate and the tracker tends to drift to the wrong place. To compensate for these shortcomings, this paper proposes a tracking model that applies reinforcement learning to the score branch of the Siamese network-based RPN. In addition, by applying a dynamic template exchange to the Siamese network-based tracking model, it is designed to ensure accurate tracking by increasing the robustness to temporal variation of objects. Fig. 3 shows the proposed model structure. The reinforcement learning model in this paper follows the Markov decision process (MDP) strategy. The state of the MDP is defined by $s \in S$, the action as $a \in A$, the state transition function as $s' = f(s, a)$, and the reward as $r(s, a)$. Reinforcement learning performs well in games with similar backgrounds, similar objects, and set rules. However, in object tracking, numerous objects and backgrounds appear in the input image. Therefore, if the input image is used for reinforcement learning as it is, then an infinite number of states is created, so it is difficult to determine an action according to the state. To solve this problem, the input of the reinforcement learning model should be simplified as much as possible. Therefore, the features extracted from the score map are used as input to the reinforcement learning model. Fig. 4 is an example of a score map used as an input.

[Figure caption fragment] In the next frame, the result of cropping around the output bounding box is used as the search image. When the value of the selected score is less than 0.1, the tracker determines that the tracking object has been lost and uses the bounding box previously tracked with the highest score as a template. After that, when the template is input to the tracker, the similarity between the dynamic template and the search image is compared with the similarity between the target image and the search image, and the higher similarity score is used as an input for reinforcement learning.
The state is set by the score map and the movement direction of the object. Here, the movement direction of the object is used to weight the score map. Furthermore, as shown in Fig. 5, the action is defined in the same way as choosing whether a game character moves in one of eight directions from its initial position. The object tracking process is thus abstracted as a simple game: if the tracking is successful in the last frame of the sequence, a reward is given, just as a score is obtained when a stage of the game is cleared. The final bounding box is output by passing the index to the box or mask branch according to the selected action. The background in Fig. 5 is the mask obtained for the selected index. Here, it can be seen that when a low-score index is selected, a mask containing the background is obtained rather than the human-shaped mask of the tracking object.
In Section B, the action, state, state transition function, and reward are described in detail.

1) ACTION
The action is defined over 9 discrete positions as in (6). Equation (5) gives the position where the highest score is output in the score branch when the target image and the search image are input. The action A is defined as a 9-dimensional vector consisting of the position of (5) and its 8 adjacent positions in the same channel, as in (6). The feature map output from the score branch is input to the softmax function, and an action is selected according to the resulting score-based probability and used for training. By passing the selected index to the box or mask branch, the bounding box that expresses the position of the object at the corresponding index can be predicted.
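The action definition above can be illustrated with a minimal sketch. It assumes a single-channel score map stored as a 2-D list, a row-major ordering of the 9 candidates, and clamping at the map border; these choices, and the function names, are hypothetical and not taken from the paper. The softmax is applied directly to the 9 candidate scores here, which simplifies the actual model in which a learned network produces the action output.

```python
import math
import random

# (row, col) offsets of the 9 candidates: the max-score cell and its 8 neighbours.
# The row-major ordering is an assumption made for this sketch.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           (0, -1), (0, 0), (0, 1),
           (1, -1), (1, 0), (1, 1)]


def argmax_2d(score_map):
    """Return (row, col) of the highest score in a 2-D list."""
    best = max((v, r, c) for r, row in enumerate(score_map) for c, v in enumerate(row))
    return best[1], best[2]


def candidate_actions(score_map):
    """Return the 9 candidate positions around the max-score position."""
    rows, cols = len(score_map), len(score_map[0])
    r0, c0 = argmax_2d(score_map)
    cands = []
    for dr, dc in OFFSETS:
        # Clamp to the map border so every candidate is a valid index.
        r = min(max(r0 + dr, 0), rows - 1)
        c = min(max(c0 + dc, 0), cols - 1)
        cands.append((r, c))
    return cands


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(score_map, rng=random):
    """Sample one of the 9 candidates with probability given by softmax of its score."""
    cands = candidate_actions(score_map)
    probs = softmax([score_map[r][c] for r, c in cands])
    idx = rng.choices(range(len(cands)), weights=probs, k=1)[0]
    return idx, cands[idx], probs
```

The returned position would then be passed to the box or mask branch to obtain the bounding box for that index.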

2) STATE
The state is defined as (7) in the form of a 2-tuple with the score map and the vector for the moving direction of the object.
A score map is used to minimize the information appearing in the actual image. The direction of the object can be inferred using the bounding box estimated by the previously selected action. The unit vector of the movement direction, extracted from the positions of the bounding box, is used. The movement directions from the previous 10 frames to the current frame are stored in vector form. Finally, in (7), the first element is the similarity score map, and the second element is a vector containing the moving direction of the bounding box. Therefore, the state includes the location information of the part most similar to the target

3) REWARD
In most offline learning-based tracking algorithms, if tracking fails once within a sequence, then errors accumulate and the tracker drifts to another target or the background. It can be assumed that the tracking algorithm has tracked well when the tracking is successful in the last frame. Therefore, the reward is defined through the IoU between the ground-truth bounding box of the last frame and the estimated bounding box. There are several ways to give the reward in this task. For example, the result can be compared with the ground-truth in every frame, the output score value can be given as it is, or the position difference between the two boxes can be given. However, if an overly accurate value is given as a reward using the ground-truth, then the difference from supervised learning becomes ambiguous. The purpose of this study is to learn effectively in the sections where an object occlusion occurs. To achieve this purpose, as in ADNet, the reward is defined as in (8) so that if the IoU is 0.7 or more in the last frame, a reward of 1 is given; otherwise, a penalty of -1 is given.
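The reward rule described above can be sketched as follows. The sketch assumes axis-aligned boxes in (x1, y1, x2, y2) form, whereas VOT2018 annotates rotated boxes, so the IoU routine is a simplification; the function names are hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def terminal_reward(gt_last, pred_last, threshold=0.7):
    """+1 if the last-frame IoU reaches the threshold, otherwise -1."""
    return 1.0 if iou(gt_last, pred_last) >= threshold else -1.0
```

Because the reward depends only on the last frame, a tracker that loses the object during an occlusion but recovers it before the sequence ends is still rewarded.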

4) STATE TRANSITION FUNCTION
After an action is selected in an arbitrary state, the state is changed to the next state, as shown in (9), by a state transition function based on the action.
If an action is selected in the current state, then the bounding box is estimated by the mask or box branch at the location of the action. After that, the similarity score of the next state is obtained. The next state is created by adding the movement direction of the object obtained by the previously selected action to the direction vector of the state.

FIGURE 7. Expected Average Overlap Rankings at VOT2018.

C. IMPLEMENTATION
The reinforcement learning model is designed to extract the features of the score map with two 3x3 convolution layers and then output the nine previously defined actions through a fully connected layer. The average direction over 10 frames is obtained as the average of the moving direction vectors in the state. Actions are selected by elementwise multiplication of the score map with weights, one for each action index in [0, 8], determined by the average direction. This is expressed as (10), and an example is shown in Fig. 6.
As shown in Fig. 6, if the average direction is to the right, the right-side element (index 6) is given a higher weight than the remaining elements, so that the average direction of the previous 10 frames in the state is taken into account when selecting an action on the score map.
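A minimal sketch of the direction weighting follows. It assumes that the moving direction is computed from consecutive bounding-box centers, that candidates are keyed by their (dx, dy) offset from the max-score position rather than by the index layout of Fig. 6, and that the favoured candidate gets a hypothetical weight `boost` while all others keep weight 1. The actual weight values are not given in the text.

```python
import math


def unit(vx, vy):
    """Normalize a 2-D vector; the zero vector is returned unchanged."""
    n = math.hypot(vx, vy)
    return (vx / n, vy / n) if n > 0 else (0.0, 0.0)


def average_direction(centers):
    """Mean of the unit displacement vectors between consecutive box centers."""
    steps = [unit(x1 - x0, y1 - y0)
             for (x0, y0), (x1, y1) in zip(centers, centers[1:])]
    if not steps:
        return (0.0, 0.0)
    return (sum(s[0] for s in steps) / len(steps),
            sum(s[1] for s in steps) / len(steps))


def direction_weights(avg_dir, boost=1.5):
    """Weights for the 9 offsets; the offset best aligned with avg_dir gets `boost`."""
    offsets = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    weights = {off: 1.0 for off in offsets}
    if avg_dir != (0.0, 0.0):
        moving = [o for o in offsets if o != (0, 0)]
        best = max(moving,
                   key=lambda o: unit(*o)[0] * avg_dir[0] + unit(*o)[1] * avg_dir[1])
        weights[best] = boost
    return weights


def weighted_scores(neighbour_scores, weights):
    """Elementwise product of the 9 candidate scores and their direction weights."""
    return {off: neighbour_scores[off] * weights[off] for off in neighbour_scores}
```

For example, with 11 centers moving steadily to the right (10 displacement vectors), `average_direction` returns (1.0, 0.0) and the candidate at offset (1, 0) is the one whose score is amplified.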
The dynamic template exchange method uses the result tracked with the highest score in the previous frames as a template when it is determined that tracking has failed. The dynamic template is correlated with the search image in the correlation layer, in the same way as the target image. Of the two results, the one tracked with the higher score is used as the final output. Since the probability of missing the target is low at the beginning of tracking and there is no significant change in the object, the dynamic template exchange method is designed to be used only after a certain frame interval.

D. TRAINING
The pretrained SiamMask model is used to output the score map during training. TrackingNet [52] is used as the dataset, and approximately 2,000 sequences that include object occlusion are used for training. As mentioned earlier, Silver et al. [42] stated that the accuracy of reinforcement learning models can be improved through supervised learning. Therefore, before reinforcement learning, supervised learning is first performed so that a more accurate action can be selected. For supervised learning, the data are customized so that the proposed model can be trained using a part of the training data. By inputting the sequence into SiamMask, a total of 9 inferences are performed per frame on the max score index and its adjacent indices. The score index with the highest overlap ratio in the last frame of the sequence is set as the ground-truth class, using the same classes as (6). After supervised learning, if there are no significant factors that hinder tracking, the max score index is mostly selected as the action. However, when occlusion or motion blur occurs, the probability of selecting the max score index decreases drastically. This situation is then experienced during reinforcement learning, so that when an obstacle to tracking appears, an action that can succeed in tracking is selected.
Reinforcement learning models are trained by rewards obtained through actions in a specific environment state. In this research, the environment is set to a randomly selected sequence. As mentioned earlier, the reward is obtained in the last frame of the sequence belonging to the environment during training, and the learning parameters are updated by this reward. Therefore, as in (11), the training parameters of the reinforcement learning model are updated using the SGA used in ADNet.
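The update by a reward given only in the last frame can be sketched with a REINFORCE-style stochastic gradient ascent step [43]. The sketch below is not the paper's network: it assumes a hypothetical linear softmax policy over 9 actions with weight matrix `W`, and it applies the terminal reward to every step of the episode, which is one common way to handle a single end-of-sequence reward.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def policy(W, x):
    """Action probabilities of a linear softmax policy: softmax(W @ x)."""
    logits = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    return softmax(logits)


def reinforce_update(W, episode, reward, lr=0.01):
    """One gradient-ascent step on sum_t log pi(a_t | s_t) * reward.

    episode: list of (features, action_index) pairs for one sequence.
    reward:  terminal reward of the sequence (+1 or -1).
    """
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, a in episode:
        p = policy(W, x)
        for k in range(len(W)):
            # d log pi(a|x) / d logit_k = 1[k == a] - p_k
            coeff = (1.0 if k == a else 0.0) - p[k]
            for j, xj in enumerate(x):
                grad[k][j] += coeff * xj
    # Apply the accumulated gradient, scaled by the terminal reward.
    for k in range(len(W)):
        for j in range(len(W[0])):
            W[k][j] += lr * reward * grad[k][j]
    return W
```

With a positive reward the probabilities of the actions taken during the sequence increase, and with a negative reward they decrease, which is how success and failure experiences shape the policy.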

E. INFERENCE
Fig. 3 shows a flowchart of the inference process. First, the search image and target image are input to the Siamese network-based tracker, and the score map is input to the reinforcement learning model. The reinforcement learning model predicts the final bounding box by transferring the state of the score map and search image and delivering the selected index to the box/mask branch. At this time, when the output of the score map is less than 0.1, it is determined that the model has missed the target object. In that case, the tracked object in the frame with the highest score among the most recent previous frames is used as a dynamic template. Here, the number of previous frames considered is set to 10, the same as the number of stored actions in Yun et al. [41]. We experimentally confirm that it takes approximately 3 frames when the tracker misses the object. Therefore, if the number is set to a small value below 10, there is a risk of performance degradation because there is a high possibility of using a template from the moment of missing the tracking object. When it is set to 5, the EAO of SiamMask_R decreases by approximately 0.03 under the VOT performance evaluation method. If it is set as high as 20, the tracking object within 10 frames after initialization is mainly used as a template. This drastically reduces the use of the target image initialized with the ground-truth, greatly increasing the number of missing objects. The EAO is 0.04 lower when it is 20 than when it is 10. The dynamic template is used only after the first 50 frames, because it is assumed that the first 50 frames are tracked well owing to the initialization with the ground-truth.
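The template exchange rule used at inference can be sketched as a small buffer. The values 0.1, 10, and 50 come from the text; the class name, method names, and the representation of a template as an arbitrary object are hypothetical, and the tracker itself is not modeled.

```python
from collections import deque


class DynamicTemplateBuffer:
    """Keeps the recent (score, template) pairs and proposes a dynamic template."""

    def __init__(self, window=10, lost_threshold=0.1, warmup_frames=50):
        self.history = deque(maxlen=window)   # only the most recent `window` frames
        self.lost_threshold = lost_threshold
        self.warmup_frames = warmup_frames
        self.frame_idx = 0

    def update(self, score, template):
        """Store the selected score and tracked patch of the frame just processed."""
        self.frame_idx += 1
        self.history.append((score, template))

    def dynamic_template(self, current_score):
        """Return the best recent template if the target seems lost, else None."""
        if self.frame_idx < self.warmup_frames:
            return None                       # early frames rely on the initial target
        if current_score >= self.lost_threshold:
            return None                       # tracking is considered successful
        if not self.history:
            return None
        return max(self.history, key=lambda st: st[0])[1]


def fuse_scores(score_with_target, score_with_dynamic):
    """Keep the higher of the two similarity scores as the final result."""
    return max(score_with_target, score_with_dynamic)
```

In use, `dynamic_template` would be called with the current frame's score before `update`, so that the buffer holds only previous frames when a replacement template is chosen.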

IV. EXPERIMENTS
In this section, experiments to verify the performance of the proposed algorithm and an analysis of the results are conducted. First, the experimental environment and the dataset used for performance evaluation will be described. Then, the experimental results are analyzed.

A. SETUP
The operating system of the experimental environment is Ubuntu 18.04, and the computer has the specifications of Intel i7-10700K CPU, Geforce RTX 2080 Ti (x2), and 32 GB of RAM. All proposed algorithms are written in Python, and PyTorch is used as the framework for deep learning.

B. DATASET
Many benchmarks are available as datasets for the objective quantitative evaluation of tracking methods, such as OTB50/100 [21], [60], VOT2016/2018/2019, LaSOT [61], and UAV123. In this paper, the VOT2018 benchmark, which has been used in many studies, is used together with no-reset-based performance evaluation. OTB50, which provides various evaluation results, is also adopted. VOT2018 was built with a total of 60 sequences considering many factors that interfere with tracking, such as illuminance, occlusion, motion, and scale, and the ground-truth was annotated with a rotated bounding box. The models applying the proposed method to SiamRPN, SiamRPN++, and SiamMask are denoted as SiamRPN_R, SiamRPN++_R, and SiamMask_R, respectively.

1) VOT2018
In the VOT Challenge, the tracking algorithm is evaluated using accuracy, robustness, and EAO [62].
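To make the first two measures concrete, the following Python sketch computes a simplified accuracy (mean overlap over frames where the target is still tracked) and a failure count (frames with zero overlap). It rests on several assumptions: boxes are axis-aligned `(x, y, w, h)` tuples rather than the rotated boxes used in VOT2018, the function names are hypothetical, and the re-initialization gap and burn-in handling of the official VOT Toolkit are omitted. EAO is not covered here.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predictions, ground_truth):
    """Simplified VOT-style measures for one sequence (illustrative only)."""
    overlaps = [iou(p, g) for p, g in zip(predictions, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    # Accuracy: average overlap over frames where the target is still tracked.
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    # Failure: the predicted box no longer overlaps the ground-truth at all.
    failures = sum(1 for o in overlaps if o == 0.0)
    return accuracy, failures
```

A higher accuracy and a lower failure count are better; the robustness values reported in this paper are derived from failures by the toolkit, not by this sketch.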
Table 1 compares SiamRPN, SiamMask, and SiamRPN++ before and after the proposed framework is applied. Accuracy, robustness, and EAO are all used to compare performance, and average overlap (AO) is additionally used for the one-pass evaluation (OPE). In the reset-based evaluation, every metric improves except the robustness of SiamMask. In terms of EAO, SiamRPN improves by 2.6%, SiamMask by 1%, and SiamRPN++ by 0.2%. In the no-reset-based evaluation, only SiamRPN++ improves, by 1.3%; the lowered performance of the other models is analyzed together with the qualitative results in Fig. 8. Table 2 lists the VOT2018 results. We compare our models with 12 state-of-the-art trackers [20]: DaSiamRPN [53], SA_Siam_R [54], CSRDCF [55], STRCF [56], DLSTpp [57], CPT, DRT, RCO, UPDT, MFT [58], LADCF [59], and SiamFC [9]. All results are obtained with the official VOT Toolkit. As shown in Table 2, when the proposed method is applied to SiamRPN++, it surpasses all existing trackers in both EAO and accuracy, including the top-ranked tracker of the VOT2018 Challenge. Compared with LADCF, which won the challenge, it is 9.7% higher in accuracy and 2.7% higher in EAO. Although the EAO of SiamMask_R, which applies the proposed method to SiamMask, is lower than that of SiamRPN++, it outperforms the existing tracking algorithms and has the highest accuracy, 4.9% higher than that of DaSiamRPN, which has the best accuracy among the existing trackers.
In Table 1, every metric except the robustness of SiamMask_R improves in the reset-based evaluation. As seen in the book and helicopter sequences in Fig. 8, the bounding box of SiamMask_R is larger than those of the other trackers. The reinforcement learning model passes the selected index to the mask branch, which estimates the final bounding box from the mask. If an index with a low score is selected, the mask includes the background, as shown in Fig. 5, and the final bounding box becomes large enough to include the background. These errors accumulate until the tracking fails, which lowers the robustness. However, in frames where tracking succeeds, SiamMask_R follows the ground-truth more tightly than the existing SiamMask, which improves the accuracy.
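The effect of a mask that leaks into the background can be seen in a toy Python example. It assumes the simplest box estimate, an axis-aligned min-max box around all foreground pixels, which is only one of several ways to derive a box from a mask; the function names and the masks are made up for illustration.

```python
def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) around all non-zero pixels, or None."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def box_area(box):
    """Area in pixels of an inclusive (x_min, y_min, x_max, y_max) box."""
    x0, y0, x1, y1 = box
    return (x1 - x0 + 1) * (y1 - y0 + 1)


# A clean 2x2 object mask.
clean = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
# The same mask with one stray background pixel marked as foreground.
leaky = [row[:] for row in clean]
leaky[3][5] = 1

clean_box = mask_to_box(clean)
leaky_box = mask_to_box(leaky)
```

In this toy case a single stray pixel enlarges the box from 4 to 15 pixels, so the overlap with the ground-truth drops even though the object itself is segmented correctly.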
In the Flamingo1, soccer2, wiper, and helicopter sequences, it is confirmed that the reinforcement learning model robustly copes with occlusion by selecting an index different from SiamRPN and SiamRPN++.
Although the speed (fps) decreases because the added framework increases the computational cost, the tracker still runs faster than real time.

2) OTB50
The Object Tracking Benchmark (OTB) uses success and precision to evaluate performance. Success is the percentage of frames in which the overlap ratio between the tracking result and the ground-truth exceeds a threshold. Precision is the percentage of frames in which the distance, in pixels, between the center of the tracking result and the center of the ground-truth is within a threshold. In addition, the success metric is evaluated separately for each of 11 attributes to examine per-attribute performance. Performance is evaluated with the one-pass evaluation. The hyperparameters are the same as those used for VOT2018.
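The two OTB measures can be sketched in a few lines of Python. This is an illustrative sketch rather than the benchmark toolkit: boxes are assumed to be axis-aligned `(x, y, w, h)` tuples, the function names are hypothetical, and the default thresholds (0.5 overlap, 20 pixels) are the commonly used single operating points, whereas the benchmark plots sweep the thresholds.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_distance(box_a, box_b):
    """Euclidean distance in pixels between the centers of two boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))


def success(predictions, ground_truth, overlap_threshold=0.5):
    """Fraction of frames whose overlap exceeds the threshold."""
    hits = [iou(p, g) > overlap_threshold for p, g in zip(predictions, ground_truth)]
    return sum(hits) / len(hits) if hits else 0.0


def precision(predictions, ground_truth, pixel_threshold=20.0):
    """Fraction of frames whose center error is within the threshold."""
    hits = [center_distance(p, g) <= pixel_threshold
            for p, g in zip(predictions, ground_truth)]
    return sum(hits) / len(hits) if hits else 0.0
```

The two measures can disagree: a box centered on the target but with the wrong size scores well on precision and poorly on success, which is why both are reported.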
First, as shown in Table 3, SiamMask_R improves on the existing model by 1% in success and 2.2% in precision, and SiamRPN++_R improves on the previous model by 0.8% and 0.6%, respectively. Additionally, as shown in Fig. 9, SiamRPN++_R using the proposed method shows the highest performance in all attributes except low resolution, motion blur, deformation, and scale variation. Although the performance of SiamRPN++_R decreases in these four attributes, SiamMask_R shows higher performance than the existing SiamMask model. In particular, both SiamRPN++_R and SiamMask_R show high performance in occlusion and out of view. This is because, when occlusion occurs, the reinforcement learning model selects the action that leads to successful tracking by considering the moving direction of the object and the score map. In the occlusion attribute, there is a performance improvement of 1.4% for SiamRPN++_R and 1.6% for SiamMask_R compared to the existing model. Because the dynamic template holds the template of the recent frame tracked with the highest similarity, there is a substantial performance improvement of 9.3% for SiamRPN++_R and 2.6% for SiamMask_R compared to the existing model in the out-of-view attribute, where the object disappears from view.

FIGURE 10. Example of performance degradation at low resolution and motion blur, where the green box is ground-truth and the yellow box is SiamRPN++_R.
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3182792 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In Fig. 9, our method degrades the performance of SiamRPN++ for the low resolution and motion blur attributes. In Fig. 10, the first row is a sequence with low resolution attributes, the second row is a sequence with both low resolution and motion blur attributes, and the last row is a sequence with motion blur attributes.
Our method aims to keep tracking the object successfully through the last frame of the sequence. In sequences with the low resolution attribute, the model chooses a location that completely misses the tracking object when the object moves quickly or when occlusion occurs. However, it continues to take actions to find the tracking object, and the tracking succeeds at the end of the sequence through the dynamic template. In sequences with the motion blur attribute, as shown in the third column of the frisbee sequence in Fig. 8, the bounding box is predicted ahead of the object in its moving direction. In the second row of Fig. 10, the bounding box is likewise located at the leading end of the object in its direction of movement. In the third row, the upper part of the object is tracked, and the target object is tracked again even in the worst case of occlusion by other objects.
Finding the target object again takes slightly longer in sequences with the motion blur and low resolution attributes than in sequences in which the object appears clearly. Therefore, intervals with low overlap with the ground-truth occur frequently. Although the performance decreases in some sequences, the object is tracked again without reinitialization and without drift, as intended by the design.

V. CONCLUSION
In this paper, we proposed a reinforcement learning model and a dynamic template method to improve the performance of existing Siamese network-based trackers. The proposed reinforcement learning model addresses the occlusion problem by taking the action with the higher expected reward, learned through experience of tracking successes and failures. The dynamic template exchange method prevents the drift problem by updating the template when the tracking model determines that the tracking object is lost. The proposed method outperforms existing state-of-the-art methods on VOT2018 and OTB50.