OneShotDA: Online Multi-Object Tracker With One-Shot-Learning-Based Data Association

Tracking multiple objects in a video sequence can be accomplished by identifying the objects appearing in the sequence and distinguishing between them. Therefore, many recent multi-object tracking (MOT) methods have utilized re-identification and distance metric learning to distinguish between objects by computing the similarity/dissimilarity scores. However, it is difficult to generalize such approaches for arbitrary video sequences, because some important information, such as the number of objects (classes) in a video, is not known in advance. Therefore, in this study, we applied a one-shot learning framework to the MOT problem. Our algorithm tracks objects by classifying newly observed objects into existing tracks, irrespective of the number of objects appearing in a video frame. The proposed method, called OneShotDA, exploits the one-shot learning framework based on an attention mechanism. Our neural network learns to classify unseen data samples using labels from a support set. Once the network has been trained, it predicts correct labels for newly received detection results based on the set of existing tracks. To analyze the effectiveness of our method, we tested it on the MOTChallenge benchmark datasets (MOT16 and MOT17). The results reveal that the performance of the proposed method is comparable with that of current state-of-the-art methods. In particular, it is noteworthy that the proposed method ranked first among the online trackers on the MOT17 benchmark.


I. INTRODUCTION
Multi-object tracking (MOT) is considered one of the most challenging problems in computer vision research. Recently, tracking-by-detection methods have attracted significant interest, because they can isolate the problem of object detection from object tracking, which helps them focus on the tracking tasks, such as track management, initiation, and termination, as well as data association.
There are several methods available for track initiation/termination. Several studies [1]-[4] have adopted a straightforward rule wherein tracking starts if there is a detection result, and ends if there is no detection result. A subtle difference between such approaches is the number of repeated detections (misdetections) used for track initiation (termination). In [1]-[4], a new track hypothesis was generated at every frame for each detection result that was not associated with an existing track. A track was terminated if the number of consecutive misdetections exceeded a predefined threshold. To eliminate false trajectories, tracks shorter than a predefined threshold were deleted from the track set after the tracking process was completed.
The associate editor coordinating the review of this manuscript and approving it for publication was Victor Sanchez.
Another strategy involves optimizing an objective function over the space of trajectories [11]-[13]; this approach performs track management and data association simultaneously, based on the optimization results. Zhang et al. [11] proposed a network-flow-based global optimization method for MOT. They constructed a network using a set of detection results from a video, and computed the global best trajectories by identifying the min-cost flow of the network. Initialization and termination of trajectories were handled intrinsically after the solution had been computed. Pirsiavash et al. [12] used an approach similar to that of Zhang et al., except that they adopted a greedy algorithm (shortest path) for a flow network. In [13], the authors used multiple-hypothesis tracking (MHT) for track management. The MHT saves track proposals in a tree structure that grows with new detections for each frame, that is, the tree describes all the possible data association results originating from a single detection result. Track initiation, track termination, and data association in the MHT are treated as an optimization problem, namely the maximum weighted independent set (MWIS) problem, solved within a certain time window.
In the tracking-by-detection paradigm, data association entails connecting detection outputs across video frames and screening misdetections. This problem can be considered a form of statistical estimation such as the likelihood estimation of p(Z|T ), where Z is the set of detections and T is the set of trajectories. The distribution of likelihood determines the probability of associated detections belonging to the same object when the track proposal T has been satisfied. Online tracking methods recursively estimate the likelihood based on the detection set up to the current frame [1]- [4], [13]. In contrast, offline/batch methods [11], [12] use the detection results for an entire video sequence.
Recently, MOT has been accomplished by identifying and distinguishing between objects appearing in the sequence. Therefore, many recent MOT methods have focused on re-identification and distance metric learning to distinguish between objects by computing the similarity/dissimilarity scores between them. However, it is difficult to generalize such approaches for arbitrary video sequences, because some important information, e.g., the number of objects (classes) in a video, is not known in advance. In this study, we propose a novel data association strategy called OneShotDA that exploits one-shot learning frameworks such as those in [14]-[16]. In such frameworks, the class of a query sample is determined by the samples in a gallery set. For example, in [16], predictions for the samples in a query set are obtained based on a relation module that computes the distance between a query feature and the features in a gallery set. By following the protocol of the one-shot framework, our method classifies a newly received detection result (query sample) into an existing track (gallery set), or vice versa; either the detection results or the existing tracks can be used as the query set. Specifically, our model can predict the label of a query sample based on the labels of the gallery set by using an attention mechanism that indicates a corresponding sample in the gallery set.
Let $Q$ be the query set, and $x_q^{(i)}$ be the $i$-th query sample. $G$ is the gallery set; $x_g^{(j)}$ and $y_g^{(j)}$ are the $j$-th sample and label in the gallery set, respectively. In Figure 1(a), $T_{k-1}$, which is the track set at frame $k-1$, represents the query set, and $Z_k$, which is the detection set at frame $k$, represents the gallery set. The feature embedding network (FEN) in OneShotDA takes samples from both $Q$ and $G$ as inputs, and generates a feature vector $f(\cdot)$ for each. Note that the FEN processes samples in $Q$ and $G$ using the same weights. Next, conditional embedding networks (CENs) are used to embed feature vectors to generate more robust features. CEN_Q and CEN_G are the CENs for the query set and gallery set, respectively. Additionally, we include a network called TD_clf that estimates the probability of accurate detection for a given input (detection response). We shall detail each component of OneShotDA in the following sections.
As shown in Figure 1(b), OneShotDA maps the class of a query sample to labels in the gallery set, which is defined by $p(y_g \mid x_q, G)$. The left table in Figure 1(b) contains the probability distribution of data associations for the scenario depicted in Figure 1(a). For example, $q_3$, an embedded vector of $T_{k-1}^{(3)}$, has a low probability of data association, because the object is completely occluded in frame $k$. The right table in Figure 1(b) contains the probabilities of accurate detection for the gallery set, which are estimated by TD_clf. We also demonstrate that the proposed data association mechanism can be easily integrated with any online MOT system. In the experiments section, we track objects using a combination of the MHT framework [7] and the proposed OneShotDA. Tables 1 and 2 summarize the notations and abbreviations used in this article.
The contributions of this study can be summarized as follows:
• We propose a novel data association mechanism called OneShotDA that can classify newly generated detection outputs into existing tracks using one-shot classification.
• We adopt a training strategy that is customized for one-shot learning and suitable for data association tasks. We also demonstrate the way training samples are generated using MOT datasets such as the MOTChallenge datasets [8].
• We demonstrate that OneShotDA can be easily integrated with any online MOT system (MHT in this study).
• We demonstrate that the performance of the proposed MOT system matches that of the state-of-the-art methods. It is noteworthy that the proposed method ranks first among online trackers when evaluated on the MOT17 benchmark.

II. RELATED WORKS
Modern MOT methods based on the tracking-by-detection paradigm can be categorized into offline and online methods. Offline methods utilize the detection results from all the video frames to construct robust trajectories, whereas online approaches process videos sequentially in a frame-by-frame manner by recursively updating existing tracks with new detection results. The offline methods are commonly set up by representing the problem as a graph wherein each detection result represents a node, and the edges represent possible links.
Ma et al. [17] formulated the problem as hierarchical correlation clustering (HCC), a modified form of the correlation-clustering-based tracking method [18]. Tang et al. [19] used a combination of the lifted multicut problem (LMP) formulation and body pose information. Henschel et al. [20] also utilized body pose information; however, they formulated the problem as a min-cost graph labeling problem [21].
It is known that online methods are often impeded by long-term occlusion issues, because only the current frame and previous frames are available [22]. The MHT [7], a robust online tracking method, attempts to resolve this issue by constructing a track tree that describes all the possible data association results within a particular time window. Even if this time window causes the MHT to produce delayed tracking results, it is still considered an online tracker because only the current detection set is used to update the scores of previous tracks [7]. In other words, the MHT recursively estimates the likelihood of the previous tracks reoccurring based on current observations. Recently, various MHT algorithms [13], [22], [23] have been proposed for MOT tasks. In [13], long-term appearance modeling was incorporated into the MHT, where the tracker estimated the online appearance features for each track. In [23], an LSTM network was adopted to score track proposals in the MHT. The authors also proposed a bilinear LSTM model, a modified version of the original LSTM model [6] for a gating network. In [22], the authors proposed an iterative MWIS algorithm for the MHT, making it possible to solve the MWIS problem based on the solutions of previous frames.
There have also been many recent studies on MOT. Sun et al. [24] tracked objects using a novel deep network that can infer object affinities across different frames by analyzing exhaustive permutations of the extracted features. Their network also accounts for the appearance and disappearance of objects between video frames. Yang et al. [25] proposed an online MOT algorithm that fuses trajectory dynamics information and uses a two-step data association framework combined with an improved sparse-based appearance affinity model and a rank-based motion affinity model. He et al. [26] proposed a tracking-by-animation framework to achieve both label-free and end-to-end learning for MOT, unlike tracking-by-detection frameworks, which isolate the detection task from the tracking task. Their differentiable neural network first tracks objects in input frames, and then animates the tracked objects in reconstructed frames. Learning is driven by the reconstruction error through backpropagation. Zhang et al. [27] tracked objects in multi-modal scenarios by adopting a deep architecture that can be trained in an end-to-end manner, thereby enabling the joint optimization of the base feature extractors of each modality and an adjacency estimator for cross-modality. Wen et al. [28] proposed an MOT algorithm based on a non-uniform hypergraph that can model different degrees of dependency among tracklets for a unified objective. Their method can model higher-order dependencies among objects and tracklets. Voigtlaender et al. [29] extended MOT to MOT and segmentation. They also presented a tracking method that jointly addresses detection, tracking, and segmentation using a single convolutional network. Long et al. [30] tackled unreliable detection by selecting candidates from the outputs of both detection and tracking. They also demonstrated that the identification ability of their tracker could be improved by using appearance representations trained on a person re-identification dataset. In [31], the authors tracked the objects using recurrent neural networks (RNNs), and demonstrated that RNNs can effectively address the problems of trajectory estimation and data association.
One(few)-shot learning aims to train a classifier that can recognize classes unseen during training using only a single (few) labeled example(s). Because many deep learning systems require hundreds or thousands of samples, one(few)-shot learning has attracted significant interest [14]-[16], [32]. In such frameworks, the class of a query sample is determined by the samples in a gallery set. In [32], the authors proposed a novel strategy for one-shot classification using Siamese neural networks for verification. In [14], the MatchingNet model was proposed to map a small labeled gallery set and an unlabeled query sample to the correct label. MatchingNet compares the cosine distances between a query feature and each gallery feature. Snell et al. [16] proposed ProtoNet, which also predicts the class of a query sample based on distance; however, ProtoNet uses the Euclidean distance between a query and the gallery. Sung et al. [15] presented RelationNet, which, apart from replacing distance with a learnable relation module, is based on a similar concept.
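To make the difference between the two distance-based rules concrete, the following minimal Python sketch classifies a query feature by its nearest gallery feature under each measure. It is an illustration only: the function names and toy vectors are hypothetical, the features are assumed to be already embedded, MatchingNet's weighting over the gallery is reduced to an arg max, and in the one-shot case each ProtoNet prototype is assumed to reduce to the single gallery feature of its class.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify_by_cosine(query, gallery):
    """Index of the gallery feature most similar to the query (cosine rule)."""
    return max(range(len(gallery)),
               key=lambda j: cosine_similarity(query, gallery[j]))


def classify_by_euclidean(query, gallery):
    """Index of the gallery feature closest to the query (Euclidean rule)."""
    return min(range(len(gallery)),
               key=lambda j: euclidean_distance(query, gallery[j]))


# Toy example (hypothetical values): one gallery feature per class.
gallery = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
query = [2.0, 1.8]

cosine_label = classify_by_cosine(query, gallery)
euclidean_label = classify_by_euclidean(query, gallery)
print(cosine_label, euclidean_label)
```

In this toy case the two rules disagree: the cosine rule ignores vector magnitude and picks the gallery feature with the most similar direction, whereas the Euclidean rule picks the feature that is closest in absolute position.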
In OneShotDA, we adopt the one-shot learning framework to classify new samples based on known examples, which solves the data association problem in MOT. There are multiple available options for this task, including existing frameworks such as MatchingNet, RelationNet, or ProtoNet. Therefore, we consider the one-shot framework as a data association solver for MOT.

III. MULTIPLE HYPOTHESIS TRACKING
In this section, we briefly review the MHT model presented in [7] and discuss how we combine MHT with the proposed OneShotDA.
MHT maintains track proposals by constructing a track tree that describes all possible data association results originating from a single detection result (i.e., root node) and its branches. Each node in the track tree can either correspond to a real detection result obtained from an object detector or be a dummy detection result representing a misdetection. Let $T_k^{(i)}$ be the $i$-th track proposal at frame $k$ maintained by the MHT and let $t$ be its length. Then, $T_k^{(i)} = \{z_{k-t+1}^{[i]}, \ldots, z_k^{[i]}\}$, where $z_l^{[i]}$ is a detection result chosen by the $i$-th track proposal at frame $l$. As mentioned previously, $z_l^{[i]}$ can be a real detection result from $Z$ or a dummy detection result. Note that $z_l^{(j)}$ represents the $j$-th detection at frame $l$, meaning $z_l^{(j)} \in Z_l$, in which $Z_l$ is the set of detections at frame $l$ (i.e., $Z_l \subset Z$).
VOLUME 8, 2020
Following the original formulation in [7], the score for each track proposal is defined as the likelihood ratio (LR) between the true-track (H 1 ) and false alarm (H 0 ) hypotheses (Eq. 1).
In Eq. 1, $P_0(H_1)$ and $P_0(H_0)$ are the prior probabilities of the true-track and false alarm hypotheses, respectively. Based on the chain rule, the likelihood is factorized (Eq. 2) under the assumption that detection results are conditionally independent given the false alarm hypothesis, where $t$ is the length of $T_k^{(i)}$. This equation can be further factorized based on the independence assumption between kinematic and appearance terms (Eq. 3), where app(·) is a function that returns the appearance feature of a given input and kin(·) returns the kinematic component of an input (e.g., the coordinates of the bounding box). The kinematic term of $LR(T_k^{(i)})$ at frame $(k-l+1)$ under the true-track hypothesis is assumed to be Gaussian and is estimated via Kalman filtering. Constants are set for the false alarm hypotheses of both the appearance and kinematic terms ($p_A(\mathrm{app}(z_{k-l+1}^{[i]}) \mid H_0)$ and $p_K(\mathrm{kin}(z_{k-l+1}^{[i]}) \mid H_0)$ are set to $C_{app}$ and $C_{kin}$, respectively).
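For reference, one form of Eqs. 1 to 3 that is consistent with the definitions in this section can be written as follows. This is a reconstruction from the surrounding text rather than a verbatim copy of the equations; in particular, conditioning each detection on the track up to the previous frame, $T_{k-l}^{(i)}$, is an assumption.

```latex
\begin{align}
LR\big(T_k^{(i)}\big)
  &= \frac{p\big(T_k^{(i)} \mid H_1\big)\, P_0(H_1)}
          {p\big(T_k^{(i)} \mid H_0\big)\, P_0(H_0)} \tag{1} \\
  &= \frac{P_0(H_1)}{P_0(H_0)} \prod_{l=1}^{t}
     \frac{p\big(z_{k-l+1}^{[i]} \mid T_{k-l}^{(i)}, H_1\big)}
          {p\big(z_{k-l+1}^{[i]} \mid H_0\big)} \tag{2} \\
  &= \frac{P_0(H_1)}{P_0(H_0)} \prod_{l=1}^{t}
     \frac{p_K\big(\mathrm{kin}(z_{k-l+1}^{[i]}) \mid \mathrm{kin}(T_{k-l}^{(i)}), H_1\big)\,
           p_A\big(\mathrm{app}(z_{k-l+1}^{[i]}) \mid \mathrm{app}(T_{k-l}^{(i)}), H_1\big)}
          {p_K\big(\mathrm{kin}(z_{k-l+1}^{[i]}) \mid H_0\big)\,
           p_A\big(\mathrm{app}(z_{k-l+1}^{[i]}) \mid H_0\big)} \tag{3}
\end{align}
```

With the false alarm terms fixed to $C_{kin}$ and $C_{app}$, the denominator of each factor in Eq. 3 becomes the constant $C_{kin} C_{app}$.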
We use the log likelihood ratio for track scoring by taking the logarithm of Eq. 3. Additionally, the track initiation score, defined as $\ln[P_0(H_1)/P_0(H_0)]$, is set to a constant $C_\beta$. Track termination is performed after a track is updated with dummy detection results for $\tau_{miss}$ consecutive frames. To bound the number of feasible track proposals, the track tree is pruned such that it does not exceed a tree depth of $\tau_D$. Tree pruning can be performed after finding the best set of proposals. Once the best set is identified, we select the ancestor of the best proposal at a distance of $\tau_D$ and prune the subtrees diverging from that node.
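The scoring and termination rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the per-frame true-track likelihoods are assumed to be given (e.g., from a Kalman filter and from OneShotDA), dummy detections are represented by `None`, and how dummy detections contribute to the score is left out because the text does not specify it.

```python
import math


def track_score(kin_likelihoods, app_likelihoods, c_beta, c_kin, c_app):
    """Log likelihood ratio of a track proposal.

    kin_likelihoods / app_likelihoods hold the true-track likelihoods of the
    kinematic and appearance terms, one entry per real detection in the track.
    c_kin and c_app are the constant false alarm likelihoods, and c_beta is
    the track initiation score ln[P0(H1) / P0(H0)].
    """
    score = c_beta
    for p_kin, p_app in zip(kin_likelihoods, app_likelihoods):
        score += math.log(p_kin) - math.log(c_kin)  # kinematic log ratio
        score += math.log(p_app) - math.log(c_app)  # appearance log ratio
    return score


def should_terminate(track, tau_miss):
    """True if the last tau_miss entries of the track are dummy detections."""
    if tau_miss <= 0 or len(track) < tau_miss:
        return False
    return all(z is None for z in track[-tau_miss:])


# Toy example; the constants follow the values reported in Section V-A,
# while the likelihood values are made up for illustration.
score = track_score([0.5, 0.4], [0.9, 0.8], c_beta=2.0, c_kin=0.1, c_app=0.1)
print(round(score, 4))
print(should_terminate(["z1", None, None], tau_miss=2))
```

Working in the log domain turns the product in Eq. 3 into a sum, so a track's score can be updated incrementally as each new detection is appended.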
Finally, the likelihood for the appearance term at frame $l$ under the true-track hypothesis is estimated by OneShotDA. The true-track likelihood at frame $l$ is given by Eq. 4, where cls(z, Z) is a function that retrieves the label of a detection result $z$ determined by $Z$. Therefore, the right side of Eq. 4 estimates the probability that a track $T_{l-1}^{(i)}$ is classified as $\mathrm{cls}(z_l^{[i]}, Z_l)$.
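A form of Eq. 4 that agrees with this description, and with the association probability $p(y_g \mid x_q, G)$ used by OneShotDA, is sketched below. It is a reconstruction under the assumption that the track $T_{l-1}^{(i)}$ plays the role of the query sample and the detection set $Z_l$ plays the role of the gallery set.

```latex
\begin{equation}
p_A\big(\mathrm{app}(z_l^{[i]}) \mid \mathrm{app}(T_{l-1}^{(i)}), H_1\big)
  = p\big(\mathrm{cls}(z_l^{[i]}, Z_l) \mid T_{l-1}^{(i)}, Z_l\big) \tag{4}
\end{equation}
```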

IV. DATA ASSOCIATION WITH ONE-SHOT LEARNING
For the data association task of MOT, we decided to adopt the one-shot architecture of MatchingNet [14] because its contextual embedding provides robust input features, particularly when two difficult examples are very close to each other in the feature space [14]. However, simply applying MatchingNet directly to our domain is not possible because the data association problem does not always match a detection to a track and the proposed system must identify false alarms. Let $Q$ and $G$ be a set of query samples and a set of gallery samples, respectively. Each sample is an image-label pair, i.e., $(x_q^{(i)}, y_q^{(i)}) \in Q$ and $(x_g^{(j)}, y_g^{(j)}) \in G$; $|Q|$ and $|G|$ represent the sizes of sets $Q$ and $G$, respectively. The labels $y_g^{(j)}$ in the gallery set are $|G|$-sized vectors, each of which is one-hot encoded such that the $j$-th component is set to 1.
OneShotDA estimates the probability distribution of the label of the $i$-th query sample over the labels in the gallery set (Eq. 5), where $\langle \cdot, \cdot \rangle$ is the inner product between two vectors, and $q_i$ and $g_j$ are the corresponding embedding vectors of $x_q^{(i)}$ and $x_g^{(j)}$, respectively. In Eq. 5, the probability distribution of the query sample's label $\hat{y}_q^{(i)}$ is computed by applying the softmax function over the gallery set's labels. Therefore, the class of the query sample is the gallery label with the maximum probability (i.e., $\arg\max \hat{y}_q^{(i)}$). This process can be viewed as an attention mechanism pointing to a corresponding sample in the gallery set. It is important to note that the estimated label of the query sample is not the same as the label in the query set (i.e., $\hat{y}_q^{(i)} \neq y_q^{(i)}$). This is because the label of a query sample only represents its class [14]. In addition, as mentioned previously, the labels in the query set are used for network training, meaning we can train the network twice per query-gallery pair by swapping the two sets.
In Eq. 5, $q_i$ and $g_j$ are embedding vectors mapped from the image space into a latent space. One potential method for performing this mapping is to train an embedding network $f$ (e.g., a CNN) and apply it to each sample independently. We use $f$ as the FEN and searched the ResNet family [33] to find the best-performing FEN structure. However, embedding each sample independently means we cannot encode information regarding the entire set, so the classification function in Eq. 5 is simply nearest neighbor classification based on an inner product. To resolve this issue, we train a CEN that embeds the feature vector further by incorporating all other samples. This can improve the accuracy of classification, particularly in cases where some samples are very close to each other (i.e., hard samples).
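The attention-based classification described for Eq. 5 can be illustrated with a short Python sketch. It assumes that the embeddings $q_i$ and $g_j$ are already computed and that Eq. 5 takes the usual form of a softmax over the inner products $\langle q_i, g_j \rangle$ applied to the one-hot gallery labels; the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inner(u, v):
    """Inner product between two vectors."""
    return sum(a * b for a, b in zip(u, v))


def associate(query_embeddings, gallery_embeddings):
    """Label distribution and arg max gallery index for every query sample.

    Because the gallery labels are one-hot with the j-th component set to 1,
    the attention weights over the gallery are the label distribution itself.
    """
    probs = [softmax([inner(q, g) for g in gallery_embeddings])
             for q in query_embeddings]
    labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return probs, labels


# Toy example: two tracks (queries) and three detections (gallery).
queries = [[1.0, 0.0], [0.0, 1.0]]
gallery = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
probs, labels = associate(queries, gallery)
print(labels)
```

Note that neither the number of queries nor the number of gallery samples is fixed in this formulation, which is what allows the number of objects to vary from frame to frame.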
Specifically, CEN_Q is the CEN for set $Q$, which reads samples in $G$ through the softmax function over the cosine similarity measures. Therefore, the conditional embedding vectors $q_i$ are defined by Eqs. 6 and 7, where $f$ is the FEN (ResNet [33]). The last layer in $f$ is activated by the tanh function. In Eq. 6, $f_q$ is a fully connected network whose output has the same size as $f(x_q^{(i)})$, and $[\cdot, \cdot]$ is the concatenation operator between two vectors. The output of $f_q$ is also activated by the tanh function. In Eq. 7, $a_{ij}$ represents the $j$-th component of $a_i$. Therefore, the conditional embedding vector $q_i$ incorporates all elements in the set $G$ through the weighted average of the $g_j$.
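One way to read the description of CEN_Q is sketched below in Python. The exact form of Eqs. 6 and 7 is an assumption here: the attention vector $a_i$ is taken to be a softmax over the cosine similarities between $f(x_q^{(i)})$ and each $g_j$, and $q_i$ is taken to be $f_q$ applied to the concatenation of $f(x_q^{(i)})$ with the attention-weighted average of the $g_j$. The single-layer `weight`/`bias` parameterization of $f_q$ and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v, eps=1e-12):
    """Cosine similarity; eps guards against zero-norm vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)


def cen_q(query_feat, gallery_embs, weight, bias):
    """Conditional embedding of one query feature given the gallery set."""
    # Attention over the gallery (assumed form of Eq. 7).
    attention = softmax([cosine(query_feat, g) for g in gallery_embs])
    # Attention-weighted average of the gallery embeddings.
    dim = len(gallery_embs[0])
    readout = [sum(a * g[d] for a, g in zip(attention, gallery_embs))
               for d in range(dim)]
    # Fully connected layer with tanh on the concatenation (assumed Eq. 6).
    concat = list(query_feat) + readout
    return [math.tanh(sum(w * x for w, x in zip(row, concat)) + b)
            for row, b in zip(weight, bias)]


# Toy example with 2-D features and a 2 x 4 weight matrix.
query_feat = [1.0, 0.0]
gallery_embs = [[0.0, 1.0]]
weight = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
bias = [0.0, 0.0]
q_i = cen_q(query_feat, gallery_embs, weight, bias)
print(q_i)
```

The output has the same size as the input feature, so the conditional embedding can be used in the inner product of Eq. 5 in place of the raw feature.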
Next, we present the CEN for the set $G$, which is denoted as CEN_G. $g_j$ is generated by embedding a sample together with the samples in set $G$ using a bidirectional LSTM (Bi-LSTM) [6], where $\overrightarrow{h}$ and $\overleftarrow{h}$ are the outputs of the forward and backward LSTMs, respectively, and $\overrightarrow{c}$ and $\overleftarrow{c}$ are the cells of the corresponding LSTM networks. CEN_G is similar to the network in [14] in that we also add a skip connection between the input and output.
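For orientation, the gallery embedding used in MatchingNet [14], on which CEN_G is modeled, has the following form. Whether the equations of CEN_G match this term by term is an assumption; the sketch only illustrates how the forward and backward LSTM outputs and the skip connection from the FEN feature combine.

```latex
\begin{align}
\overrightarrow{h}_j,\ \overrightarrow{c}_j
  &= \mathrm{LSTM}\big(f(x_g^{(j)}),\ \overrightarrow{h}_{j-1},\ \overrightarrow{c}_{j-1}\big) \\
\overleftarrow{h}_j,\ \overleftarrow{c}_j
  &= \mathrm{LSTM}\big(f(x_g^{(j)}),\ \overleftarrow{h}_{j+1},\ \overleftarrow{c}_{j+1}\big) \\
g_j &= \overrightarrow{h}_j + \overleftarrow{h}_j + f(x_g^{(j)})
\end{align}
```

Because the LSTM outputs are added to $f(x_g^{(j)})$, they must have the same size as the feature vector.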

A. TRAINING OneShotDA
To train the network, we generated training samples from the training sets in the MOTChallenge datasets [8]. Specifically, we used subsets of the MOT16 and MOT17 datasets (Table 3). We used the public detection results and classified the samples as true detection results and false alarms. A true detection result is a detection result whose intersection over union (IoU) with a ground truth bounding box is greater than the threshold $\tau_{IoU}$, where each ground truth bounding box has at most one true detection result. This is an assignment problem based on the maximum total IoU score, which we solved using the Hungarian algorithm [34]. The remaining detection results that were not chosen by the Hungarian method were classified as false alarms. Note that because objects in the dataset are frequently occluded by each other, we filter out small ground truth bounding boxes using non-maximum suppression with an IoU threshold $\tau_{IoU}^{GT}$ prior to identifying true detection results.
Next, a training sample is constructed using two consecutive video frames from a video sequence. Let $l$ be a particular frame; then $Q$ and $G$ are the detection sets from $l$ and $l+1$, respectively. Additionally, $Q$ and $G$ can be the detection sets from $l+1$ and $l$, respectively. In this manner, we compute the loss twice using one query-gallery pair by swapping the two sets. Let $L_{CE}$ be the cross entropy loss where $Q$ consists of the detection set at $l$, and $L'_{CE}$ be the loss where $Q$ consists of the detection set at $l+1$. These losses measure the classification error for the query set. If a query sample is either a false alarm or missing in the gallery set, that sample is not used to compute the loss. Note that the size of the query set $|Q|$ will be the batch size and the size of the gallery set $|G|$ will be the number of classes. Therefore, the class size and batch size are not fixed. This flexibility is made possible by the attention mechanism we adopted.
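The labeling of detections as true detection results or false alarms can be sketched as follows. This is a minimal illustration, not the authors' code: the maximum-total-IoU assignment is solved here by exhaustive search as a stand-in for the Hungarian algorithm (adequate only for a handful of boxes), boxes are assumed to be `(x1, y1, x2, y2)` tuples, the non-maximum suppression step on the ground truth is omitted, and the threshold in the example is illustrative.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_detections(gt_boxes, det_boxes, tau_iou):
    """Split detections into true detection results and false alarms.

    Returns (matches, false_alarms), where matches maps a detection index to
    its ground truth index and false_alarms lists the remaining detections.
    """
    n_gt, n_det = len(gt_boxes), len(det_boxes)
    # Pairs whose IoU does not exceed the threshold cannot be matched.
    gated = [[s if s > tau_iou else 0.0
              for s in (iou(g, d) for d in det_boxes)] for g in gt_boxes]
    best_total, best_pairs = -1.0, []
    if n_gt <= n_det:
        candidates = (list(enumerate(p))
                      for p in permutations(range(n_det), n_gt))
    else:
        candidates = ([(g, d) for d, g in enumerate(p)]
                      for p in permutations(range(n_gt), n_det))
    for pairs in candidates:  # each candidate is a one-to-one assignment
        total = sum(gated[g][d] for g, d in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    matches = {d: g for g, d in best_pairs if gated[g][d] > 0.0}
    false_alarms = [d for d in range(n_det) if d not in matches]
    return matches, false_alarms


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(1, 1, 11, 11), (21, 19, 31, 29), (50, 50, 60, 60)]
matches, false_alarms = label_detections(gt, dets, tau_iou=0.5)
print(matches, false_alarms)
```

In practice the exhaustive search would be replaced by a polynomial-time assignment solver; the labels it produces serve as targets for both the association loss and the true-detection classifier.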
Additionally, identifying false alarms is crucial because many false alarms are present in detector outputs. To that end, we attach a fully connected layer (TD_clf) with a size of one following the FEN. Therefore, the network takes a feature vector f (·) from the FEN and outputs a prediction p TD that indicates whether or not the input is a true detection result. We then define L TD as the binary cross entropy loss, which measures the classification error between the label of a true detection result and the predicted probability of the true detection result (i.e., p TD ).
Finally, the final loss $L$ is defined in terms of the losses above, where $\lambda_{TD}$ is the weight for $L_{TD}$. Therefore, training OneShotDA, including the FEN, CEN, and TD_clf, is accomplished using a single training sample by minimizing $L$ in one step.
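A plausible composition of the final loss is sketched below in Python. The additive form (two cross entropy terms, one per query-gallery ordering, plus the weighted true-detection loss) and the averaging over samples are assumptions, since the text only states that $\lambda_{TD}$ weights $L_{TD}$; the function names are hypothetical. Query samples that are false alarms or have no match in the gallery are marked with a target of `None` and skipped, as described above.

```python
import math


def cross_entropy(prob_rows, targets):
    """Mean negative log-probability of the target gallery index.

    A target of None marks a query that is a false alarm or has no match in
    the gallery; such samples do not contribute to the loss.
    """
    terms = [-math.log(p[t]) for p, t in zip(prob_rows, targets)
             if t is not None]
    return sum(terms) / len(terms) if terms else 0.0


def binary_cross_entropy(p_td, is_true, eps=1e-12):
    """Mean binary cross entropy between p_TD and true-detection labels."""
    terms = [-(y * math.log(max(p, eps))
               + (1 - y) * math.log(max(1.0 - p, eps)))
             for p, y in zip(p_td, is_true)]
    return sum(terms) / len(terms) if terms else 0.0


def total_loss(probs_fwd, targets_fwd, probs_bwd, targets_bwd,
               p_td, is_true, lambda_td=1.0):
    """Assumed form: both association losses plus the weighted TD loss."""
    return (cross_entropy(probs_fwd, targets_fwd)
            + cross_entropy(probs_bwd, targets_bwd)
            + lambda_td * binary_cross_entropy(p_td, is_true))


# Toy example: one query in each direction and one detection for TD_clf.
loss = total_loss([[0.5, 0.5]], [0], [[0.5, 0.5]], [1], [0.5], [1])
print(round(loss, 4))
```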

B. CORE COMPONENTS OF OneShotDA
As discussed above, we use a one-shot learning framework for MOT because it has the ability to unravel the association problem between an unseen object and existing tracks.
In this work, our one-shot framework was derived from MatchingNet [14] with many modifications. In this section, we summarize the role of each component in OneShotDA, namely the FEN, CEN (CEN_Q and CEN_G), and TD_clf.
• FEN: This network generates a feature vector for a given input (i.e., a sample $x$ from $Q$ or $G$). We use the ResNet family for this network and investigate the performance of each residual network in terms of tracking accuracy (Section V-C).
• CEN: This network further embeds a feature vector that is outputted by the FEN to generate more robust features. The features outputted by this network are sufficient for distinguishing samples from each other, even when they are close to each other (Section V-C).
  - CEN_Q: CEN for the set $Q$. This network helps a sample in $Q$ read all elements of $G$ and outputs a conditional embedding vector for a given input.
  - CEN_G: CEN for the set $G$. A conditional embedding vector for each element in $G$ is generated using a bidirectional LSTM over the set $G$ itself.
• TD_clf: This network takes a feature vector $f(\cdot)$ and outputs a prediction $p_{TD}$, indicating whether or not the input is a true detection result.

V. EXPERIMENTS AND ANALYSIS
In this section, we analyze the tracking performance of the OneShotDA tracker on the MOTChallenge datasets (MOT16 and MOT17). Additionally, ablation analysis is performed to identify the best hyperparameter settings.

A. IMPLEMENTATION DETAILS
In all the experiments, the values of the parameters $\tau_D$, $\tau_{IoU}$, and $\tau_{IoU}^{GT}$ were 30, 0.334, and 0.5, respectively. $\tau_{miss}$ was set to 2.3fps, where fps is the frames per second for each sequence. Additionally, we exhaustively searched $C_{app}$, $C_{kin}$, and $C_\beta$ to find good parameters, and set them to 0.1, 0.1, and 2.0, respectively. Input images were resized to 288 × 96 pixels and normalized to the range of 0 to 1. We also augmented the training set with uniform random rotation in the angle range of [−8.5, 8.5], random horizontal flipping, and random brightness changes. As mentioned previously, the ResNet family was used for the FEN. In the ablation analysis section, we investigate the performance of the ResNet family (ResNet34, ResNet50, and ResNet101), as well as their output sizes (feature vectors). Note that the size of a feature vector determines the size of the Bi-LSTM in CEN_G because the output of the Bi-LSTM and the corresponding feature vector are added inside CEN_G. The ResNets were initialized with weights pre-trained on the ImageNet dataset, except for the final fully connected layers. The final layer was replaced with our feature embedding layer for various output sizes (see the ablation analysis section). The stochastic gradient descent optimizer with a momentum of 0.9 was used for training all the networks, and the initial learning rate was set to $10^{-4}$. The learning rate decreased after every 3000 iterations based on exponential decay with a decay rate of 0.95 until the minimum learning rate of $10^{-7}$ was reached. $\lambda_{TD}$ was set to 1.
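The learning rate schedule can be written compactly as below. The sketch assumes a staircase interpretation, in which the rate is multiplied by 0.95 once every 3000 iterations and clamped at the minimum; the function name is hypothetical.

```python
def learning_rate(iteration, initial_lr=1e-4, decay_rate=0.95,
                  decay_steps=3000, min_lr=1e-7):
    """Staircase exponential decay with a lower bound on the learning rate."""
    lr = initial_lr * decay_rate ** (iteration // decay_steps)
    return max(lr, min_lr)


for it in (0, 2999, 3000, 6000, 135 * 3000):
    print(it, learning_rate(it))
```

Under this reading, the lower bound of $10^{-7}$ is reached after 135 decay steps, that is, after 405,000 iterations.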

B. MOTCHALLENGE DATASETS AND METRICS
In this study, we used the MOTChallenge datasets [8] to train our OneShotDA network and test the tracking performance of the OneShotDA tracker. The training and validation dataset separation is detailed in Table 3. The test set of MOT17 includes a total of seven sequences, each of which comes with three sets of public detection results. These three public sets come from different detectors, namely the deformable part model (DPM) [5], Faster R-CNN [9], and scale-dependent pooling (SDP) [10]. The MOT16 test set consists of the same sequences as those in MOT17, but it only contains the DPM detection set. It must be noted that the ground truth labels are not shared for the same sequences across MOT16 and MOT17.
The metrics used for measuring tracker performance are the same as those used in [8]. MOT accuracy (MOTA) measures performance by aggregating three error sources, namely false positives, missed targets, and identity switches. IDF1 [35] computes the ratio of correctly identified detection results over the average number of ground truth and computed detection results. MOTA and IDF1 are considered the main criteria for tracker performance. We also report mostly tracked (MT) objects, mostly lost (ML) objects, the total number of false positives (FP), false negatives (FN), and identity switches (IDsw), and the total number of times a trajectory is fragmented (Frag).
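The two main metrics can be computed from aggregate counts as in the sketch below. It uses the standard definitions, MOTA = 1 − (FN + FP + IDsw) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN), where GT is the total number of ground truth boxes; the function names and the example counts are hypothetical and do not come from the benchmark tables.

```python
def mota(false_positives, false_negatives, id_switches, num_gt):
    """MOT accuracy from aggregated error counts over all frames."""
    return 1.0 - (false_positives + false_negatives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """Correctly identified detections over the average number of ground
    truth (idtp + idfn) and computed (idtp + idfp) detections."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)


print(mota(false_positives=100, false_negatives=300, id_switches=10,
           num_gt=1000))
print(idf1(idtp=600, idfp=200, idfn=400))
```

Note that MOTA can become negative when the number of errors exceeds the number of ground truth boxes, whereas IDF1 always lies between 0 and 1.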

C. ABLATION ANALYSIS
We conducted ablation studies with different hyperparameter settings to achieve optimal performance on the validation set and its subsets. In Figure 2, we consider the ResNet family and corresponding output size for the subsets of our validation set (i.e., MOT17-05-{F,S} and MOT17-09-{F,S}). It is important to note that not only does the architecture of the FEN affect the tracking performance, but the size of the feature vectors is also a crucial aspect for performance. In this study, all networks were trained for 5 epochs. The CENs were consistently initialized for each setting and retrained from the beginning. The sizes of the feature vectors were sampled from a logarithmic scale ranging from 64 to 1024 (i.e., {64, 128, 256, 512, 1024}). The analysis results for various FEN settings are presented in Figure 2. According to these results, we selected ResNet50 with 512 outputs for our FEN. Our model achieves the maximum MOTA (59.3) using ResNet50-512. Furthermore, it is important to note that the model seems to suffer from overfitting when the parameter size is greater than that of ResNet50 (25.6 M) or the feature size is greater than 512. To determine if this assumption was correct, we trained a ResNet101-1024 network with additional epochs because the low performance of such a large model could potentially result from a low convergence rate caused by its large parameter size. However, we found that the MOTA of the large model continued to decrease or remained constant while its training loss consistently decreased during continued training.
Next, we investigated the contribution of the CEN by measuring its performance in terms of MOTA and IDF1 on the validation set. The results are presented in Figure 4. The CEN matters most for DPM detections: OneShotDA, without the CEN, struggles to associate objects identified by the DPM detector, whose outputs are much noisier in comparison with those of Faster R-CNN and SDP. Additionally, performance is consistently improved by the CEN for the MOT17 dataset (Figure 4(b)).
The performance in terms of predicting true detection results ($p_{TD}$) was also investigated. This analysis helped us in selecting a good threshold value for identifying true detection results in the detection outputs. Figure 3 presents the average-precision (AP) score for each threshold value. The threshold values are evenly spaced at intervals of 0.15. We achieve an AP of 0.981 at a threshold value of 0.45. Therefore, we chose 0.45 as the threshold value for $p_{TD}$ when testing our OneShotDA tracker on the test set.

D. MOT PERFORMANCE ANALYSIS
In this experiment, we used a ResNet50-512 network as the FEN and trained the network with additional epochs. Our network was trained for a total of 8 epochs.
We first present the performance analysis of our network as a binary classifier. Each prediction is considered the output of a classification representing how likely it is for two objects to be assigned the same identity. As shown in Figure 5, the average precision of the precision-recall curve is 0.8957. The classification results for the validation set indicate that OneShotDA is trained properly and makes precise association predictions.
Finally, we present performance comparisons between the OneShotDA tracker and existing state-of-the-art methods such as HCC [17], LMP [19], GCRA [36], KCF16 [37], MOTDT [30], JBNOT [20], eHAF17 [39], TLMHT [22], EAGS16 [38], MHT_DAM [13], MHT_bLSTM [23], and EDMT17 [40]. These methods were evaluated on the MOTChallenge server. To provide a reasonable comparison, only officially published and peer-reviewed entries in the MOT16 and MOT17 benchmarks were considered. Additionally, we collected MHT-based trackers, categorized as online trackers in this study, for the purpose of simple comparisons. Trackers were grouped according to their tracking mode (offline and online). As shown in Tables 4 and 5, our method exhibits performance comparable to that of existing state-of-the-art methods. It is noteworthy that OneShotDA ranks first among online trackers on the MOT17 benchmark. Our tracker outperforms all other online trackers in the MOT17 group, with a margin of 0.5% over MOTDT. However, our method did not outperform JBNOT, the state-of-the-art offline method on MOT17.
FIGURE 6. [...] [20], and in the bottom row, (d)-(f) present our results for the same sequence. The man wearing a black jacket in the 100th frame is consistently tracked by our tracker. However, JBNOT fails to track the man because he is occluded by other objects.
FIGURE 7. [...] [20]. In the bottom row, (d)-(f) present our results for the same sequence. After the occlusion occurs, many ID-switches take place in the JBNOT tracker, but our tracker consistently tracks objects, even during occlusions (e.g., ID-64, ID-65, ID-69, and ID-72).
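For reference, the average precision reported for the binary association classifier can be computed from pairwise similarity scores and same-identity labels. The sketch below uses a common step-wise definition, AP = Σ (R_n − R_{n−1}) · P_n; the text does not state which AP variant was used, so this choice, the function names, and the toy data are assumptions.

```python
from typing import List, Sequence, Tuple


def precision_recall_points(
    scores: Sequence[float], labels: Sequence[int]
) -> List[Tuple[float, float]]:
    """Return (recall, precision) pairs, one per distinct score threshold.

    labels[i] is 1 if the pair shares an identity and 0 otherwise.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Visit samples from the highest to the lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = []
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Step-wise AP: sum of precision weighted by the increase in recall."""
    ap = 0.0
    prev_recall = 0.0
    for recall, precision in precision_recall_points(scores, labels):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy example: five association scores with ground-truth same-identity labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.3]
labels = [1, 0, 1, 1, 0]
ap = average_precision(scores, labels)  # 1/3 + 2/9 + 1/4 = 29/36
```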
TABLE 4. Results on the MOT16 dataset. We grouped methods according to their tracking mode (offline and online). The red numbers for each metric represent the best performance (offline/online) and the blue numbers represent the second best performance (online). The methods marked with * are MHT-based trackers. (Accessed on August 1, 2019.)
TABLE 5. Results on the MOT17 dataset. We grouped methods according to their tracking mode (offline and online). The red numbers for each metric represent the best performance (offline/online) and the blue numbers represent the second best performance (online). The methods marked with * are MHT-based trackers. (Accessed on August 2, 2019.)
Our tracker seems to prefer DNN-based detectors to traditional detectors, given that it ranks first among online trackers on the MOT17 dataset but ranks lower on the MOT16 dataset. We determined that our simple implementation of the function app(·) for tracks could degrade model performance on the MOT16 dataset. The function app(T_l^{(i)}) simply retrieves an image patch of T_l^{(i)}, namely the image from its latest update with a detection result. Because the detector outputs in MOT16 are comparatively noisy, we believe this function is insufficient for returning an image feature typifying a track. Because the function app(·) can be of any type, incremental updates of appearance features can resolve this issue.
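To illustrate the difference between the current app(·) and an incrementally updated variant, the sketch below contrasts a "latest observation" appearance with an exponential moving average (EMA). The text does not specify how incremental updates would be performed; the EMA, the `Track` class, the smoothing factor, and the use of feature vectors in place of image patches are all assumptions made purely for illustration.

```python
from typing import List, Optional, Sequence


class Track:
    """Hypothetical track record holding appearance features of its detections."""

    def __init__(self, alpha: float = 0.9) -> None:
        self.alpha = alpha  # EMA smoothing factor (assumed value)
        self.latest: Optional[List[float]] = None
        self.smoothed: Optional[List[float]] = None

    def update(self, feature: Sequence[float]) -> None:
        """Record the feature of a newly associated detection."""
        feature = list(feature)
        self.latest = feature
        if self.smoothed is None:
            self.smoothed = feature[:]
        else:
            # Blend the running appearance with the new observation.
            self.smoothed = [
                self.alpha * s + (1.0 - self.alpha) * f
                for s, f in zip(self.smoothed, feature)
            ]


def app_latest(track: Track) -> List[float]:
    """Appearance from the latest update only, as described in the text."""
    if track.latest is None:
        raise ValueError("track has no associated detection yet")
    return track.latest


def app_ema(track: Track) -> List[float]:
    """Incrementally updated appearance (an assumed alternative)."""
    if track.smoothed is None:
        raise ValueError("track has no associated detection yet")
    return track.smoothed


# Toy example: a clean observation followed by a noisy one.
track = Track(alpha=0.5)
track.update([1.0, 0.0])
track.update([0.0, 1.0])
latest_feature = app_latest(track)   # reflects only the noisy observation
smoothed_feature = app_ema(track)    # retains part of the earlier appearance
```

With noisy detector outputs, the latest-only variant is fully determined by the most recent, possibly poor, detection, whereas the averaged variant dampens its influence.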
We further examined the performance of our method by incorporating qualitative analysis. In Figures 6 and 7, the robustness of our tracker against ID-switches is analyzed via frame-by-frame investigation. We compare the results to JBNOT [20], which achieved the top rank on the MOT17 dataset in terms of MOTA but with inferior ID-switch performance compared to our tracker (Table 5). In Figure 6, the top row presents partial results for JBNOT on the MOT17-09-FRCNN dataset, and the bottom row presents the results for our method on the same sequence. This figure suggests that our tracker consistently tracks the man with the black jacket (ID-6 in our results). However, this object is lost by JBNOT, which initiates a new track with ID-13 after the object is occluded by other objects. Figure 7 presents the results on the MOT17-11-SDP dataset, where the top row represents JBNOT and the bottom row represents our model. These results demonstrate the robustness of our tracker against ID-switches. Our tracker consistently tracks objects, even during heavy occlusions (e.g., ID-64, ID-65, ID-69, and ID-72 in our results), while many switches occur for the JBNOT tracker. Finally, we present the qualitative results in Figure 8.

VI. CONCLUSION AND FUTURE WORK
In this study, a novel data association mechanism called OneShotDA was presented and integrated with MHT to perform online MOT. The proposed network classifies existing tracks by pointing to corresponding detection results using an attention mechanism. OneShotDA can solve the data association problem of MOT and identify false positives in detector outputs. To train the proposed network, we employed a novel training strategy tailored for one-shot learning that is suitable for data association tasks. We also demonstrated how training samples can be generated from MOTChallenge datasets. In a series of experiments, our OneShotDA tracker delivered performance comparable to that of existing state-of-the-art methods. Additionally, our tracker ranked first among online trackers on the MOT17 dataset. For future work, we plan to devise an incremental learning method to learn the appearance function of a track (i.e., app(T_l^{(i)}) in this work). In online tracking mode, detection results arrive sequentially, so incremental learning can help our tracker update learned appearance features based on newly associated detection results.