Visual Tracking by Adaptive Continual Meta-Learning

We formulate visual tracking as a semi-supervised continual learning problem, where only an initial frame is labeled. In contrast to conventional meta-learning-based approaches that regard visual tracking as an instance detection problem with a focus on finding good weights for model initialization, we consider both the initialization and online update processes simultaneously under our adaptive continual meta-learning framework. The proposed adaptive meta-learning strategy dynamically generates the hyperparameters needed for fast initialization and online update, achieving more robustness by adaptively regulating the learning process. In addition, our continual meta-learning approach, based on a knowledge distillation scheme, helps the tracker adapt to new examples while retaining its knowledge of previously seen examples. We apply our proposed framework to a deep learning-based tracking algorithm and obtain noticeable performance gains and competitive results against recent state-of-the-art tracking algorithms while running at real-time speeds.


I. INTRODUCTION
Visual tracking, which is one of the fundamental computer vision problems, has seen practical applications in robotics, automated surveillance, and autonomous driving. Given the initial video frame with a bounding box label of the target object, the goal of the visual tracking problem is to track the target object throughout the subsequent video frames without losing the target object. However, conventional tracking algorithms face several challenges in various circumstances such as scale change, occlusion, illumination change, deformation, background clutter, and motion blur.
Recently, with the advances in the application of deep convolutional neural networks (CNNs) to image classification and object detection tasks [1]-[4], visual tracking algorithms have also achieved large performance improvements, owing to the representation power of their deep backbone networks [5], [6] and the object detection framework [7], [8]. However, there is a misalignment between the goals of the object detection and visual tracking problems: object detection aims to locate all objects of the same semantic class, whereas visual tracking aims to locate a specific object instance. To bridge this gap, visual tracking algorithms apply some form of domain adaptation to the object detection framework, such as online network fine-tuning using stochastic gradient descent (SGD)-based methods [5], [9]-[11] or a Siamese network structure [6], [7], [12]-[14] that generates a target-specific convolutional kernel from the initial frame. (The associate editor coordinating the review of this manuscript and approving it for publication was Inês Domingues.)
While recent tracking algorithms have been successful in achieving high performance on several visual tracking benchmarks [15]-[17], the online adaptation process has often been overlooked despite its crucial role in visual tracking. In particular, the tracker may need to update its model since the appearance of the target object constantly changes and similar distractor objects can appear in a given scene. These aspects are further emphasized in long-term tracking scenarios [18]-[20], where the target can be absent for a prolonged time interval. Online updates were often achieved by incorporating hand-crafted regularization schemes and meticulous hyperparameter selection, due to the lack of training samples and label uncertainty. In most Siamese network-based trackers, online adaptation was ignored entirely in order to achieve faster, real-time speeds. To address these issues, several recent trackers employed meta-learning-based adaptation schemes [21]-[28] in order to learn the adaptation process itself. However, most of them either focused only on finding a good initialization for the tracker [21], [23] or only on regulating the online update process [22], [25]-[27].
In this paper, we introduce a more generalized visual tracking framework in which we model both the adaptation and continual learning processes under our adaptive continual meta-learning framework. We adopt an adaptive scheme in which hyperparameters can change dynamically to deal with various tracking scenarios. In addition, our continual meta-learning scheme employs an adaptive knowledge distillation-based update strategy to help the tracker adapt to newly obtained examples while retaining the necessary knowledge of previously seen examples. During offline training, our integrated framework trains (1) network weights that are good for both initialization and future online updates; (2) a hyperparameter generator network that adaptively generates the learning rates, instance-wise weights, and loss hyperparameter for controlling the adaptation process; and (3) a knowledge distiller network for regulating the balance between learning new examples and retaining the knowledge from the previous step.
To validate the effectiveness of our proposed framework, we apply our method to the Siamese network-based tracking algorithm TACT [29], which is a two-stage detector-based tracker. We compare our method to other state-of-the-art trackers on the test splits of large-scale visual tracking datasets, including LaSOT [18], OxUvA [19], TLP [20], TrackingNet [16], and GOT-10k [17]. We further demonstrate the effectiveness of our framework through component-wise ablation analyses on the LaSOT dataset. Our framework requires minimal computational overhead, achieving real-time speeds. The motivation for our tracking framework is shown in Figure 1.

II. RELATED WORK
A. DEEP NEURAL NETWORK-BASED TRACKING ALGORITHMS
Contemporary visual tracking algorithms solve visual tracking via tracking-by-detection, where they attempt to locate the target by finding the position at which the classifier produces the highest classification score for the target class. With the powerful representation capacity of deep neural networks, recent trackers employ deep neural networks for feature extraction and classification. While features from a denoising autoencoder are used in [30], MDNet [5] used VGG [3] features with multi-task learning, which RT-MDNet accelerated to real-time speed using ROIAlign [31]. Correlation filter-based trackers [32], [33] have also been widely used on top of pretrained deep features, such as C-COT [34] and ECO [35], which use a continuous convolutional operator for the fusion of multi-resolution CNN feature maps. Other approaches include spatially regularized filters [36], the fusion of multi-level features [37], group feature selection [38], and spatial transformers [39].
Recently, Siamese network-based trackers have gained traction due to their simplicity and high performance [6]-[8], [14], [40]-[47]. SiamFC [6] introduced a fully convolutional end-to-end approach with increased speed and accuracy. SiamRPN [7] added a region proposal network for more accurate localization and size estimation, while DaSiamRPN [40] enhanced its discriminability by introducing negative pairs during training to suppress distractors. Both [41] and [42] utilized deeper and wider feature extractors based on [1] and [48] for further performance gains. Other works include a general transformation learning model [43], local pattern detection for structure-based prediction [44], cascaded region proposals for sequential refinement [45], and a recurrent optimization-based model [46]. Recent approaches employ inverted residual networks [49], attentional cascade keypoints [13], saliency information [50], and relation networks [51] for Siamese networks. There have also been approaches that automatically find lightweight networks for efficient matching, such as [52] and [53], inspired by network architecture search (NAS) methods.
The introduction of transformers [54] and self-attention [55] to computer vision applications also enabled the utilization of additional temporal information for visual tracking. Recent approaches include end-to-end fully-convolutional networks [56], incorporating rich scene information [57], the use of space-time memory networks [58], and feature fusion with transformers [59].

B. META-LEARNING FOR VISUAL TRACKING
To overcome the issues of conventional model adaptation in visual tracking, noteworthy methods have been proposed to improve tracking performance by employing meta-learning-based adaptation at test time. Among meta-learning algorithms, model-agnostic meta-learning (MAML) [60]-based approaches [61]-[65] have recently gained attention owing to their simplicity and versatility. MAML aims to find good model weights that can be trained to generalize well with a small number of SGD steps and a small amount of training data. Meta-Tracker [21] was one of the first to apply MAML to MDNet [5] for fast adaptation, reducing the number of SGD iterations. MetaRTT [24] extended the idea by simultaneously finding the learning rates for initialization and online update. In addition, [23] used MAML to convert a modern detection network into a tracker. However, all of the above methods used fixed learning rates for all tracking scenarios and thus lack the adaptiveness to deal with diverse individual scenarios, which come with training examples of varying degrees of label uncertainty. Additionally, they do not address erroneous updates performed with uncertain and mislabeled examples. In contrast, our proposed method adaptively changes the hyperparameters to deal with these scenarios.
Other meta-learning-based tracking algorithms introduce a separate meta-learner network to regulate the adaptation process. [27] and [22] used loss gradient information obtained during tracking to update the target feature representation. Moreover, [25] used a separate update module to acquire the updated accumulated template. An optimization-based architecture with a model predictor was used in [11] to predict the filter weights, while a similar approach using a recurrent neural optimizer was proposed in [46]. However, a majority of the aforementioned methods focus mainly on short-term tracking scenarios and are not designed for long-term tracking. An exception is [26], which used a meta-updater network that takes multiple cues as input to make a binary decision on whether or not to update the baseline tracker.

C. INCREMENTAL OBJECT CLASSIFICATION AND DETECTION
The conventional training setting for the classification problem assumes that abundant labeled training examples are always available for all classes at any point in training. By contrast, the incremental/continual setting assumes that new examples or new classes arrive sequentially, so the model has to be trained incrementally to prevent catastrophic forgetting, a phenomenon in which the performance of the model on previously seen examples significantly degrades over time. Recent approaches for deep neural networks include iCaRL [66], which learns the classifier and representation simultaneously based on a replay memory; EWC [67], which selectively slows down learning of weights based on their importance; and LwF [68], where task-specific parameters from previous tasks are utilized with a knowledge distillation loss to prevent the network from forgetting while improving the performance on a new task. Related to LwF, incremental learning of object classification and detection models based on the knowledge distillation [69] scheme to prevent catastrophic forgetting has recently emerged [70]-[73].
Inspired by the aforementioned approaches, we employ a knowledge distillation-based continual meta-learning scheme for our visual tracking framework. Different from conventional continual learning settings, where new examples are given sequentially along with their corresponding ground truth labels, these labels are not available in the standard visual tracking setting. Since labels for new examples have to be obtained in a self-supervised manner, the chance of adapting the model based on mislabeled examples persists. To alleviate this issue, we introduce two solutions. (1) When performing an online update at a certain time step, we always start from the initially adapted weights, with the previous weights used for knowledge distillation. This reduces error accumulation and overfitting to a small number of training examples, while increasing the flexibility of the tracker. (2) We introduce an adaptive knowledge distiller network that predicts the importance weights for each previous frame, where the magnitude of a weight determines the degree of knowledge distilled from that frame. By controlling these weights, the tracker can choose between learning new examples and retaining the previous knowledge.

III. PROPOSED METHOD
Our proposed framework consists of two main components: the baseline tracking algorithm and the adaptive continual meta-learner module. The following subsections delineate the training procedure for our proposed adaptive continual meta-learner module.
Assuming a baseline tracker f θ 0 with its default weights θ 0 , the meta-learner network g controls the learning process by adaptively generating the hyperparameters that modify the direction and magnitude of the loss gradients. The meta-learner network g contains four sub-networks, g α , g β , g γ , and g δ , where each sub-network generates the learning rate α, instance weight β, focal loss hyperparameter γ , and knowledge distillation hyperparameter δ, respectively. Our objective is to train the default weights θ 0 and the network weights for g. To train both, we construct a simulated tracking episode to perform the initial and online adaptation processes and assess how well these adaptations are conducted by evaluating the loss on future frames. Our training scheme extends the basic meta-learning formulation of dividing the training set into support and query sets and then performing inner-loop and outer-loop optimizations for meta-training, as in [60]. Formally, each episode is divided into four datasets D i = (I i , B i ), i = 1, . . . , 4,
where each dataset D i contains frame images I i and GT box labels B i . Initial adaptation is performed using the initial frame and label in D 1 , and online adaptations are conducted using self-supervised labels B̂ 2 , B̂ 3 in D̂ 2 , D̂ 3 . Afterwards, outer-loop optimization is performed by evaluating each adapted weight θ i on D i+1,··· for meta-training.
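As a toy illustration (hypothetical code, not the paper's implementation), the inner/outer-loop structure with a state-dependent learning rate generator can be sketched as follows, using a one-parameter regression model and numerical gradients in place of backpropagation:

```python
# Sketch of the bilevel scheme: an inner adaptation step whose learning rate
# is produced by a meta-learner g, followed by an outer-loop evaluation on
# held-out (query) data. All names and the toy model are illustrative only.

def loss(theta, data):
    # toy regression loss: data is a list of (x, y) pairs, model is y = theta * x
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def grad(theta, data, eps=1e-5):
    # numerical gradient, standing in for backpropagation
    return (loss(theta + eps, data) - loss(theta - eps, data)) / (2 * eps)

def g_alpha(learning_state, phi):
    # stand-in for the sub-network g_alpha: a 1-parameter "network" phi
    # mapping the learning state (here, gradient magnitude) to a learning rate
    return phi / (1.0 + abs(learning_state))

def inner_step(theta, data, phi):
    g = grad(theta, data)
    alpha = g_alpha(g, phi)  # adaptive, state-dependent learning rate
    return theta - alpha * g

def outer_loss(theta0, phi, support, query):
    theta1 = inner_step(theta0, support, phi)  # simulated adaptation
    return loss(theta1, query)                 # generalization on future frames

support = [(1.0, 2.0), (2.0, 4.0)]  # "initial frame" examples, y = 2x
query = [(3.0, 6.0)]                # "future frame" examples
theta0, phi = 0.0, 0.5
adapted = inner_step(theta0, support, phi)
```

In the full framework, the outer loss would be differentiated with respect to both θ 0 and the meta-learner weights φ; here the sketch only shows the forward structure of one simulated episode.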

A. META-TRAINING WITH SIMULATED EPISODES 1) TRACKER SETTING
The baseline tracking algorithm f θ with weights θ takes a video frame image I t as an input and outputs K candidate bounding boxes b t θ with corresponding confidence scores c t θ ,

(b t θ , c t θ ) = f θ (I t ). (1)

Tracking is performed by choosing the bounding box with the highest confidence value as the output. Online updates are conducted by training the tracker using this chosen output box, where the other boxes are labeled as positive if they have high overlap with the output box (IoU > τ p ) and negative otherwise (IoU < τ n ). We chose overlap threshold values τ p = 0.5 and τ n = 0.3 for training and testing.
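A minimal sketch of this labeling rule (with hypothetical helper names; the actual tracker operates on network proposals) could look like:

```python
# Sketch of labeling candidate boxes against the most confident output box.
# Boxes are (x1, y1, x2, y2) tuples; function names are illustrative only.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def label_candidates(boxes, scores, tau_p=0.5, tau_n=0.3):
    # choose the most confident box as the pseudo ground truth, then label
    # the candidates by overlap: positive (1), negative (0), or ignored (-1)
    pseudo_gt = boxes[max(range(len(boxes)), key=lambda i: scores[i])]
    labels = []
    for b in boxes:
        o = iou(b, pseudo_gt)
        labels.append(1 if o > tau_p else (0 if o < tau_n else -1))
    return pseudo_gt, labels
```

Boxes whose overlap falls between the two thresholds are ignored here, which is one common convention; the paper only specifies the positive and negative cases.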

2) EPISODE SETTING
Given a training video sequence V of length T with frame images {I 1 , I 2 , . . . , I T } and ground truth target bounding box annotations {b 1 , b 2 , . . . , b T }, the video sequence is divided into four time-ordered video segments I 1 , I 2 , I 3 , and I 4 with corresponding bounding box label sets B 1 , B 2 , B 3 , and B 4 . Each video segment and label set are then paired to form four datasets D 1 , D 2 , D 3 , and D 4 . Given a baseline tracker f θ 0 with the default weights θ 0 , the initial adaptation is first performed using the dataset D 1 = (I 1 , B 1 ) to obtain the adapted weights θ 1 . Then, using the initialized tracker f θ 1 on images in I 2 , we can obtain estimated labels B̂ 2 to form the dataset for self-supervised online update D̂ 2 = (I 2 , B̂ 2 ), with which the tracker is updated from θ 1 to θ 2 . Lastly, using the adapted tracker f θ 2 , the online update is performed again with dataset D̂ 3 = (I 3 , B̂ 3 ) to obtain θ 3 . After simulating a tracking episode, we obtain intermediate weights θ 0 , θ 1 , θ 2 , and θ 3 for the tracker. To train our overall framework, we evaluate the tracker with each intermediate weight on different combinations of datasets and then perform outer-loop optimization on the loss to train the default weights θ 0 and the network weights for the meta-learner g. An overview of the training process of our proposed framework is depicted in Figure 2.
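The episode construction above can be sketched as follows (hypothetical helper, using the segment sizes from the training details in Sec. IV: one frame for D 1 and N = 4 frames for each of D 2 , D 3 , D 4 ):

```python
# Sketch of splitting a sampled 13-frame training clip into the four
# time-ordered datasets used for one simulated tracking episode.

def build_episode(frames, labels):
    assert len(frames) == 13 and len(labels) == 13
    sizes = [1, 4, 4, 4]  # |D1|, |D2|, |D3|, |D4|
    datasets, start = [], 0
    for n in sizes:
        datasets.append((frames[start:start + n], labels[start:start + n]))
        start += n
    return datasets  # [D1, D2, D3, D4]
```

Only D 1 keeps its ground truth labels at adaptation time; the labels of D 2 and D 3 are replaced by the tracker's own estimates during the simulated episode.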

3) INITIAL ADAPTATION
Our tracker first performs the initial adaptation process using the initial frame and label, D 1 . At each adaptation step, the meta-learner g receives the current learning state as input and generates the learning rate α i , instance weight β i , focal loss hyperparameter γ i , and knowledge distillation hyperparameter δ i , where these hyperparameters control the adaptation process θ i → θ i+1 . Starting from θ 0 , the adapted weights θ 1 are obtained by

θ 1 = θ 0 − α 0 ⊙ ∇ θ 0 L init (θ 0 ), (2)

where α 0 = g α (τ 0 ) is the predicted per-parameter learning rate and the input τ 0 is the learning state based on the layer-wise means of gradients and kernels [∇ θ 0 L, θ 0 ], as defined in [65]. The loss function for the initial adaptation L init is defined as

L init (θ 0 ) = β 0 · FL(c 1 θ 0 , B 1 ; γ 0 ), (3)

where FL denotes the focal loss [74] evaluated using the initially given bounding box label B 1 , β 0 = g β (c 1 θ 0 ) is the instance weight for the initial frame, and γ 0 = g γ (c 1 θ 0 ) is the focusing hyperparameter used in the focal loss to control the balance between the losses of easy and hard samples. β 0 and γ 0 are both scalar values.
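To make the role of the focusing hyperparameter concrete, here is a small self-contained sketch of the binary focal loss (hypothetical code; in the framework, β and γ would come from g β and g γ rather than being fixed):

```python
import math

def focal_loss(probs, targets, gamma):
    # binary focal loss: FL = -(1 - p_t)^gamma * log(p_t), averaged over samples,
    # where p_t is the probability assigned to the correct class
    total = 0.0
    for p, y in zip(probs, targets):
        p_t = p if y == 1 else 1.0 - p
        total += -((1.0 - p_t) ** gamma) * math.log(max(p_t, 1e-12))
    return total / len(probs)

easy = [0.95, 0.9]  # confident, correct predictions
hard = [0.55, 0.6]  # uncertain predictions
targets = [1, 1]
```

A larger γ down-weights easy samples more aggressively, which is why generating γ adaptively lets the meta-learner shift the emphasis between easy and hard examples per frame.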

4) ONLINE ADAPTATIONS
Using the initially adapted tracker f θ 1 and frames in I 2 , online adaptation is performed using D̂ 2 to obtain the updated parameters θ 2 as in

θ 2 = θ 1 − α 1 ⊙ ∇ θ 1 L on (θ 1 ), (4)

where α 1 = g α (τ 1 ). The loss function for the online update is defined as

L on (θ 1 ) = Σ i [ β i 1 · FL(c i θ 1 , B̂ 2 ; γ i 1 ) + δ i 1 · KD(c i θ 1 , c i θ 0 ) ], (5)

where FL is evaluated using the self-labeled bounding boxes in B̂ 2 as labels, β i 1 = g β (c i θ 1 ), and γ i 1 = g γ (c i θ 1 ). KD is the knowledge distillation loss, equivalent to the standard binary cross-entropy loss, and is used to measure the discrepancy between predictions made by the models with parameters θ 0 and θ 1 . To control the degree of knowledge distilled from a certain example, the knowledge distillation hyperparameter δ i 1 = g δ ([c i θ 1 , c i θ 0 ]) is predicted, where δ i 1 is a scalar value. Afterwards, further online adaptation is performed using D̂ 3 , where B̂ 3 is obtained by evaluating the tracker f θ 2 on frames in I 3 . The updated parameters θ 3 can be acquired by evaluating equations analogous to the previous step as in Eq. (4) and Eq. (5),

θ 3 = θ 1 − α 2 ⊙ ∇ θ 1 L on (θ 1 ), (6)

where L on is now evaluated with D̂ 3 and the soft labels of the teacher f θ 2 (Eq. (7)), and where the online adaptation is performed from the initially adapted parameters θ 1 rather than the previous-step parameters θ 2 to reduce the effect of erroneous updates. For our proposed meta-learning framework, knowledge distillation is used in the temporal domain to enforce long-term memory on the tracker. We use the tracker after the k-th online update, f θ k , as the teacher network and utilize it to generate soft labels for frames in D̂ k .
When updating the tracker to obtain f θ k+1 , knowledge from f θ k is transferred to f θ k+1 , where the amount of knowledge distilled is controlled by scaling the KD loss term in Eq. (5) and (7) with δ k , which is generated by the meta-learner. Also, updating from θ 1 reduces error accumulation and overfitting to a small number of training examples, while increasing the flexibility of the tracker. By controlling δ k , the tracker can choose between learning new examples and retaining the previous long-term knowledge.

Algorithm 1 Visual Tracking With Meta-Learner
Input: Tracking algorithm f θ with default weights θ 0 ;
  trained meta-learner network g;
  tracking sequence of length L, {I 1 , I 2 , . . . , I L };
  initial target bounding box coordinates b 1
Output: Target bounding box coordinates for each frame
# Initialization at t = 1
Form dataset D 1 = (I 1 , b 1 ) for initial adaptation
Perform model initialization from θ 0 using D 1 as in Eq. (2) and (3), updating θ ← θ 1
# For later frames in the tracking sequence
for t = 2 to L do
  Obtain candidate boxes b t θ i and confidence scores c t θ i from input frame I t as in Eq. (1)
  Choose the box with the highest confidence score as output b̂ t
  If the output is confident (PSR < τ on ), store the corresponding frame and output box (I t , b̂ t ) in the dataset for online update D̂ on = (I on , B̂ on )
  if t mod U = 0 and |I on | ≥ N then
    Perform online update from θ 1 using θ i and N training samples from D̂ on as in Eq. (6) and (7), updating θ ← θ i+1 and i ← i + 1
    Clear the buffer for dataset D̂ on
  end if
end for
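The interplay between the self-supervised focal term and the δ-scaled distillation term can be sketched as follows (hypothetical code; scalar confidences stand in for the tracker's candidate-box confidence vectors):

```python
import math

def bce(p_student, p_teacher):
    # knowledge distillation loss: binary cross-entropy against the teacher's
    # soft label, as stated in the text (KD is equivalent to standard BCE)
    p = min(max(p_student, 1e-12), 1 - 1e-12)
    return -(p_teacher * math.log(p) + (1 - p_teacher) * math.log(1 - p))

def online_loss(student_conf, teacher_conf, pseudo_labels, beta, gamma, delta):
    # per-sample: beta * focal(student, pseudo label) + delta * KD(student, teacher)
    total = 0.0
    for p_s, p_t, y in zip(student_conf, teacher_conf, pseudo_labels):
        p_y = p_s if y == 1 else 1.0 - p_s
        fl = -((1.0 - p_y) ** gamma) * math.log(max(p_y, 1e-12))
        total += beta * fl + delta * bce(p_s, p_t)
    return total / len(student_conf)

s, t, y = [0.7, 0.4], [0.9, 0.1], [1, 0]
```

With δ = 0 the update ignores the teacher entirely (pure adaptation to new pseudo-labels); a larger δ penalizes drifting away from the previous model's predictions, which is how the generated δ k trades plasticity against retention.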

5) OUTER-LOOP OPTIMIZATION
After completing the simulated tracking episode on the input video sequence V, we obtain the intermediate tracker weights θ 0 , θ 1 , θ 2 , and θ 3 . We meta-train our overall tracking framework by evaluating the tracker with each intermediate weight and performing outer-loop optimization, using different combinations of held-out datasets with ground truth annotations to reduce overfitting and improve the generalization of each adaptation process. The overall loss function L outer for outer-loop optimization is given as

L outer = λ 0 L θ 0 + λ 1 L θ 1 + λ 2 L θ 2 + λ 3 L θ 3 , (8)

where λ 0 , λ 1 , λ 2 , and λ 3 are stage-wise weighting hyperparameters that sum to 1 and a superscript on D indicates a combination of the respective datasets (i.e., D i,j = D i ∪ D j ). Each individual loss term L θ i with respect to each weight θ i is defined as

L θ i = Σ j FL(c j θ i , b j ; γ outer ), (9)

where the focal loss FL is evaluated for the binary class predictions c j θ i obtained from the tracker with weights θ i , using the ground truth bounding box b j , and γ outer is fixed to 0.5. Each loss L θ i , except for L θ 0 , is evaluated on dataset D i+1,··· to measure the generalization performance on unseen future frames, assessing the quality of the adaptation conducted from the previous weights θ i−1 using the meta-learner network g. It also encourages the tracker with weights θ i to make more accurate predictions on subsequent frames in I i+1 , enabling better future self-supervised updates using the estimated labels, D̂ i+1 = (I i+1 , B̂ i+1 ). Note that for all aforementioned focal loss terms FL, an additional IoU loss term evaluated on b t θ for bounding box regression is omitted for simplicity.
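The combination of stage-wise losses can be sketched as a weighted sum (hypothetical code; `stage_losses` stands in for the already-evaluated L θ i terms on their respective dataset combinations):

```python
# Sketch of combining per-stage outer-loop losses with stage weights that
# must sum to one, as in the outer-loop objective described above.

def combine_outer_loss(stage_losses, lambdas):
    assert len(stage_losses) == len(lambdas)
    assert abs(sum(lambdas) - 1.0) < 1e-9, "stage weights must sum to 1"
    return sum(l * loss for l, loss in zip(lambdas, stage_losses))

# e.g. losses for theta_0..theta_3, each evaluated on its held-out datasets
losses = [0.8, 0.5, 0.4, 0.3]
lambdas = [0.1, 0.3, 0.3, 0.3]
total = combine_outer_loss(losses, lambdas)
```

In the actual framework the gradient of this scalar flows back through every inner-loop step into both θ 0 and the meta-learner weights φ.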
The process of outer-loop optimization is identical to the gradient-based bilevel optimization process of MAML [60] and its variants [61]-[65], where MAML aims to find good model weights that can be trained to generalize well on unseen tasks with a small amount of training data. The key difference between the original MAML and our proposed method is that, in addition to finding good model weights θ 0 for generalization, our method incorporates the meta-learner network g, which is also trained at the outer-loop optimization stage since it participated in generating the intermediate weights θ 1 , θ 2 , θ 3 . By obtaining the gradients ∇ θ 0 L outer , ∇ θ 1 L outer , ∇ θ 2 L outer , ∇ θ 3 L outer , and ∇ φ L outer , where φ represents the weights of the meta-learner network g, we can train our framework using an off-the-shelf optimizer.

B. VISUAL TRACKING WITH META-LEARNER
Herein, we describe visual tracking with our adaptive continual meta-learner. The tracking process is purposely kept simple to retain the speed of the original backbone tracking algorithm while requiring as little memory overhead as possible. Given an input tracking sequence of length L, the proposed initial adaptation process is performed using the initial frame I 1 and bounding box b 1 to obtain the initial weights θ 1 for update. During the tracking process, frames that yield output confidence values with peak-to-sidelobe ratios (PSR) smaller than τ on = 0.7 are considered confident frames and stored in the dataset for online update D̂ on . An online update is performed every U = 100 frames by employing the N most confident frames from the dataset D̂ on and the initial weights θ 1 , updating the weights θ i−1 to θ i ; the buffer for D̂ on is cleared after every update. The overall tracking procedure is described in Algorithm 1, and Figure 3 shows the diagram for our proposed online adaptation process.
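The buffering and update schedule of this procedure can be sketched as follows (hypothetical code: `is_confident` and `update_fn` stand in for the PSR test and the meta-learned online update):

```python
# Sketch of Algorithm 1's buffer/update schedule: confident frames are
# accumulated, and an online update fires every U frames once at least N
# samples are buffered; the buffer is cleared after each update.

def track(num_frames, is_confident, update_fn, U=100, N=4):
    buffer, updates = [], 0
    for t in range(2, num_frames + 1):
        if is_confident(t):
            buffer.append(t)  # store (I_t, b_t) for the online update
        if t % U == 0 and len(buffer) >= N:
            # the paper selects the N most confident frames; the most
            # recent N are used here as a stand-in
            update_fn(buffer[-N:])
            buffer.clear()
            updates += 1
    return updates
```

Usage: on a 300-frame sequence where every frame passes the confidence test, this schedule triggers exactly three online updates (at frames 100, 200, and 300).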

IV. EXPERIMENTS
In this section, we elaborate on the implementation details for the backbone tracker and the proposed meta-learning framework, followed by the experimental results to validate the performance gains obtained by our framework on five large-scale visual tracking benchmark datasets. We also demonstrate the results for attribute-wise and module-wise ablation experiments to further analyze the effectiveness of our proposed method.

A. IMPLEMENTATION DETAILS 1) BACKBONE TRACKER
We employ a Siamese network-based backbone tracking algorithm based on TACT [29], a variant of a two-stage detection network. TACT is based on GlobalTrack [12] and (1) is a long-term oriented tracker, which fits our purpose; (2) is a full-frame search-based tracker, so we can consider all potential distractors in a scene for the update; and (3) has no hand-crafted motion smoothness constraints.
Considering the aforementioned aspects, we can verify that the performance changes come solely from our proposed meta-learning framework, without influence from other potential variables. While freezing the weights of the feature extractor layers, the region proposal layers, and the context embedding layers, we perform meta-training on the last ROI classification and refinement layers, starting from the original weights of TACT. We refer to the modified trackers as ConTACT-18 and ConTACT-50, which are extensions of TACT-18 and TACT-50, respectively, with adaptive continual meta-learners.

2) META-LEARNER ARCHITECTURE
For the meta-learner g, its sub-networks g α , g β , g γ , and g δ are all 3-layer MLPs with group normalization [88] and ReLU activations between the linear layers. The numbers of intermediate hidden units for the sub-networks are 128, 256, 256, and 512, respectively. Assuming L is the number of layers involved in the adaptation process, the adaptive learning rate generator g α takes the 2L-dimensional learning state as input and returns L-dimensional layer-wise multipliers, which are then multiplied with the per-parameter base learning rate α base , similar to [61], and applied to each layer for SGD. Elaborating on the 2L-dimensional learning state τ , its dimension is determined by the number of layers L of the backbone tracker network f θ . The backbone tracker f θ , TACT [29], is based on a two-stage object detection framework, and we chose the final ROI classification and refinement layers of TACT for the adaptation process, which consist of 5-layer CNNs, resulting in L = 5 for our implementation of ConTACT. The learning state τ is constructed in a similar manner as in [65], where we concatenate the L-dimensional layer-wise mean of kernel weights and the L-dimensional layer-wise mean of kernel gradients. L is fixed throughout the adaptation process.
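The construction of the 2L-dimensional learning state can be sketched as follows (hypothetical code; per-layer lists of floats stand in for flattened kernel and gradient tensors):

```python
# Sketch of building the learning state tau: the concatenation of layer-wise
# means of kernel weights and layer-wise means of kernel gradients, giving a
# 2L-dimensional vector for L adapted layers (L = 5 in ConTACT).

def learning_state(kernels, grads):
    assert len(kernels) == len(grads)
    mean = lambda xs: sum(xs) / len(xs)
    return [mean(k) for k in kernels] + [mean(g) for g in grads]

tau = learning_state(
    kernels=[[0.1, 0.3], [0.2], [0.0, 0.4], [0.5], [0.1, 0.1]],
    grads=[[0.01], [0.02], [0.0], [0.03, 0.01], [0.05]],
)
```

The resulting vector is what g α consumes to produce the L layer-wise learning rate multipliers.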
The adaptive instance weight generator g β and the adaptive focusing hyperparameter generator g γ both take the confidence values c t θ i ∈ R K obtained from a given frame as input and return scalar values β t i and γ t i . The adaptive knowledge distiller g δ takes the concatenation [·, ·] of two confidence vectors obtained from two different models as input and outputs a single scalar value δ t i .

3) TRAINING DETAILS
Dimensions of the input images are the same as in [29], and the overall framework is trained with the training splits of ImageNet-VID [89], GOT-10k [17], LaSOT [18], and TrackingNet [16], from which a video sequence V is randomly chosen. T = 13 frames, in turn, are uniformly sampled inside a time window of 500 frames within V, along with their bounding box annotations. Among the sampled frames, the first frame is used as D 1 . As for the remaining 12 frames, N = 4 frames are assigned to each of D 2 , D 3 , and D 4 in sequential order. For all frames and annotations in V, random image augmentations, such as Gaussian noise, blur, horizontal flips, and bounding box jittering, are applied. For online adaptation, we choose the box with the highest confidence among the K = 64 candidate boxes for a given frame and use this box as self-supervision. Self-supervision is performed based on the estimated bounding box, where the best candidate box chosen for a single image is considered the pseudo ground-truth box. Based on this pseudo GT box, the K = 64 candidate boxes estimated in an image can be labeled for classification (positive or negative) by calculating their IoU scores with the pseudo GT box. Candidate boxes with scores larger than τ p = 0.5 are labeled positive, and boxes with scores less than τ n = 0.4 are labeled negative, where classification losses are enforced. For positive candidate boxes, a bounding box regression loss (IoU loss) is calculated and used for adaptation. Our online training scheme is identical to the training scheme of the original tracker [29] and Faster R-CNN [4]. For a single dataset D̂ i with N = 4 images in I i , the corresponding N = 4 best candidate boxes B̂ i are first estimated using the tracker, and these boxes are used as the pseudo ground-truth boxes for each frame.
Using D̂ i = (I i , B̂ i ), we can train our tracker with self-supervision, where the classification loss and regression loss are enforced on the N × K = 256 estimated candidate boxes following the aforementioned procedure.
For both initial and online adaptations, the per-parameter base learning rate α base is initialized to 10 −3 and a single-step SGD update is performed for faster speed. For the outer-loop optimization, the Adam [90] optimizer with a learning rate of 10 −5 and a weight decay of 10 −5 is used, trained for 5 × 10 5 iterations with a batch size of 4.

B. QUALITATIVE AND QUANTITATIVE EVALUATION 1) EVALUATION DATASETS AND METRICS
We conducted evaluations of our trackers on the test splits of five large-scale visual tracking benchmark datasets: LaSOT [18], OxUvA [19], TLP [20], TrackingNet [16], and GOT-10k [17]. LaSOT, OxUvA, and TLP are long-term visual tracking benchmarks with average sequence lengths longer than 1 min., whereas TrackingNet and GOT-10k have shorter sequences but larger numbers of sequences with more varied semantic classes of objects. The LaSOT [18] dataset is a large-scale, long-term tracking dataset with 1,400 video sequences for training and testing, with an average length of 2,512 frames (≈ 83 secs), annotated with target bounding boxes. We evaluated our trackers on the test split (Protocol II) of 280 video sequences, and report the performance metrics of area-under-curve (AUC) of the success plot, location precision, and normalized precision for comparison. The OxUvA [19] dataset is focused on the long-term tracking performance of a tracker; its dev and test splits have 200 and 166 sequences, respectively, with an average length of 4,260 frames (≈ 142 secs). Since the target can leave and reappear in a frame under the long-term tracking scenario, trackers must report the target bounding boxes as well as whether the target is present or absent in a given frame. The performance metric is the maximum geometric mean (MaxGM) over the true positive rate (TPR) and the true negative rate (TNR), with an IoU threshold of 0.5. The TLP [20] dataset also evaluates long-term tracking performance; it contains 50 HD real-world videos, with an average sequence length of 13,500 frames (≈ 450 secs). AUC of the success plot is used as the performance metric. TrackingNet [16] is a large-scale tracking dataset with more than 30,000 videos gathered from YouTube, of which 511 sequences are assigned to the test split. Similar to the other tracking benchmarks, location precision, normalized precision, and AUC of the success plot are used as performance metrics.
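A simplified sketch of the MaxGM idea (hypothetical code; the official OxUvA protocol additionally mixes in a "never report absent" operating point, which is omitted here) could look like:

```python
# Simplified sketch of MaxGM: the geometric mean of TPR and TNR, maximized
# over the tracker's operating points. A tracker that never reports "absent"
# has TNR = 0, so the metric rewards reliable absence detection.

def max_gm(operating_points):
    # operating_points: list of (tpr, tnr) pairs at different thresholds
    return max((tpr * tnr) ** 0.5 for tpr, tnr in operating_points)

points = [(0.9, 0.0), (0.8, 0.5), (0.6, 0.7)]
```

Note that the point with the highest TPR contributes nothing here because its TNR is zero; the best geometric mean comes from a more balanced operating point.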
GOT-10k [17] is a dataset composed under a one-shot experiment setting where the training and test splits have disjoint sets of object classes. It contains 10,000 video sequences, of which 420 are used in the test split. Performance metrics are the success rate (SR, with thresholds 0.5 and 0.75) and average overlap (AO).

2) COMPARISON TO OTHER TRACKERS
Evaluation results for our trackers on the LaSOT test set are provided in Table 1. Applying the proposed adaptive continual meta-learner to both variants of TACT, denoted ConTACT-18 and ConTACT-50, yields consistent and noticeable gains on all performance metrics while retaining real-time speeds of 52 fps and 38 fps, respectively. Both variants outperform many recent ResNet-based tracking algorithms, including GlobalTrack [12], ATOM [10], DiMP [11], SiamRPN++ [41], SPLT [75], and Ocean [8]. To further evaluate long-term tracking capabilities, we evaluated our tracker on the OxUvA test set and present the results in Table 2. To detect the absence of the target, we simply use a confidence threshold of 0.97, labeling the target as absent whenever its confidence falls below this value. The proposed method shows substantial gains in the MaxGM and TNR metrics compared to TACT, and the gains are more pronounced on long-term sequences. Evaluations on the relatively short-term, large-scale tracking benchmarks TrackingNet and GOT-10k are shown in Tables 4 and 5.
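The absence-detection rule above amounts to a single comparison per frame; a minimal sketch follows, where the function name and list-based interface are illustrative rather than part of our implementation.

```python
def label_presence(confidences, tau=0.97):
    """Label the target present (True) or absent (False) per frame.

    confidences: per-frame tracker confidence scores in [0, 1].
    tau: threshold below which the target is declared absent
         (0.97 in our OxUvA experiments).
    """
    return [conf >= tau for conf in confidences]
```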
Both of our trackers show consistent performance gains on all metrics for both datasets, validating the effectiveness of the proposed meta-learner on both long-term and short-term tracking, with the improvements more pronounced in long-term applications. The baseline tracker of our algorithm is TACT [29], which builds on GlobalTrack [12]. GlobalTrack is a full-frame search-based tracker that uses none of the hand-crafted motion smoothness constraints (local search, cosine window penalty, linear interpolation between bounding boxes, etc.) common in other tracking algorithms, and it requires minimal hyperparameter tuning. Owing to these characteristics, TACT and GlobalTrack perform better on long-term tracking benchmarks such as LaSOT and OxUvA, and the gains made by our proposed algorithm are less pronounced on short-term benchmarks such as TrackingNet and GOT-10k. Despite these differences, the proposed ConTACT-18 and ConTACT-50 improve the baseline tracker TACT by noticeable margins, with competitive performance even against other recently published tracking algorithms. Qualitative comparisons with other trackers, TACT [29], GlobalTrack [12], ATOM [10], SiamRPN++ [41], and SPLT [75], are shown in Figure 6.

C. ANALYSIS
1) ABLATION STUDY
a: ATTRIBUTE-WISE ABLATION
To further analyze the effectiveness of the proposed continual meta-learner, Table 6 reports attribute-wise AUC performance on the LaSOT test set across six challenge attributes. While gains appear on all attributes, the largest improvement comes from the BC (background clutter) attribute, which validates the effectiveness of our initial and online adaptation strategy in suppressing hard negatives during tracking. Additional attribute-wise success plots comparing against other tracking algorithms are shown in Figure 7; both variants of the proposed algorithm perform competitively on multiple challenge attributes against other state-of-the-art trackers.

b: COMPONENT-WISE ABLATION
To verify the contribution of each component of our meta-learning framework, component-wise ablation results are shown in Table 7, where we sequentially remove each adaptive learning component in (2)-(5). The results suggest that every component contributes to the performance gain, with adaptive instance weighting contributing the most. The results in (8) show that our adaptive learning approach is effective even without any online adaptation, i.e., when only the initial adaptation is performed adaptively. Regarding online adaptation, the results in (6), obtained by naïvely fine-tuning TACT online with a learning rate of 10^-3, show reduced performance, possibly due to erroneous updates and overfitting. The results in (7) further suggest that performing online adaptation from the initial weights θ_1 rather than the previous weights θ_{i-1} contributes a large performance gain, owing to reduced error accumulation.
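The difference between the two online adaptation strategies in (7) can be sketched schematically: restarting each update from the initially adapted weights θ_1 prevents errors in self-labeled samples from compounding across updates, whereas chaining updates from θ_{i-1} accumulates them. The `fine_tune` callback here is a hypothetical stand-in for the tracker's gradient-based update; the functions are illustrative, not our implementation.

```python
def online_adapt_from_initial(theta_1, sample_batches, fine_tune):
    """Each online update starts from the initially adapted weights theta_1,
    limiting error accumulation from noisy self-labeled samples."""
    thetas = [theta_1]
    for batch in sample_batches:
        thetas.append(fine_tune(theta_1, batch))
    return thetas

def online_adapt_sequential(theta_1, sample_batches, fine_tune):
    """Each update continues from the previous weights theta_{i-1},
    so errors from earlier updates propagate forward."""
    theta = theta_1
    thetas = [theta]
    for batch in sample_batches:
        theta = fine_tune(theta, batch)
        thetas.append(theta)
    return thetas
```

With scalar "weights" and an additive toy update, the sequential variant visibly compounds every past batch into the current weights, while the from-initial variant keeps each update independent.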

2) VISUALIZING THE ADAPTIVE LEARNING
In Figure 4, we show five video examples of online adaptation with self-labeled training samples, where erroneous predictions in future frames are corrected after the adaptation. During the adaptation process, the β, γ, and δ values that the meta-learner predicts for each training sample change dynamically. The meta-learner assigns relatively lower β and δ values to examples with less confident, uncertain predictions, while the negative γ value consistently directs the learner to maximize the class margin for confident examples, paying less attention to ambiguous examples that might otherwise lead the tracker to fail in future frames.
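The role of the per-sample hyperparameters can be illustrated with a schematic loss aggregation, not the paper's exact equations (2)-(5): β scales the instance weight of each sample's classification loss, γ modulates a margin term, and δ weights the distillation term. All field and function names are assumptions made for illustration.

```python
def adaptive_batch_loss(samples):
    """Aggregate per-sample losses using meta-predicted hyperparameters.

    samples: list of dicts, each holding the per-sample loss terms
    (cls_loss, margin_term, distill_loss) and the meta-learner's
    predicted hyperparameters (beta, gamma, delta) for that sample.
    """
    total = 0.0
    for s in samples:
        total += (s["beta"] * s["cls_loss"]       # instance-weighted loss
                  + s["gamma"] * s["margin_term"]  # margin modulation
                  + s["delta"] * s["distill_loss"])  # distillation weight
    return total / len(samples)
```

Under this sketch, lowering β and δ for an uncertain sample shrinks its influence on both the task loss and the distillation loss, while a negative γ rewards larger margins on confident samples.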

V. CONCLUSION
In this paper, we proposed a novel adaptive continual meta-learning framework for visual tracking that dynamically generates the hyperparameters needed for initialization and online updates with self-labeled examples. In addition, our continual meta-learning approach, based on a knowledge distillation scheme, helps the tracker adapt to new examples while retaining its knowledge of previously seen examples. We applied the proposed framework to a deep learning-based tracking algorithm, where our ConTACT-18 and ConTACT-50 achieve noticeable performance gains and competitive results against recent state-of-the-art tracking algorithms on all five large-scale visual tracking benchmarks, while running at real-time speeds.