Efficient Online Tracking-by-Detection with Kalman Filter

Fast and reliable visual tracking of multiple objects in videos has a broad range of applications in manufacturing, construction, traffic, logistics, and beyond, especially in large-scale settings where attaching markers to many objects, as required by traditional marker-enabled tracking methods, is not feasible. This paper presents a new approach, the Kalman-intersection-over-union (KIOU) tracker, for multi-object tracking in videos that integrates a Kalman filter with IOU-based track association. The performance of the proposed KIOU tracker is quantitatively evaluated on UA-DETRAC, an open real-world multi-object detection and tracking benchmark. Experimental results show that the KIOU tracker outperforms the leading tracking methods. Additionally, the KIOU tracker offers speed comparable to simple area-overlap-based track association and quality comparable to methods with much higher computational costs, demonstrating its potential for online, real-time multi-object tracking.


I. INTRODUCTION
Tracking of assets and related personnel is of practical value in a variety of use cases, including traffic, healthcare, storage, logistics, construction, and manufacturing sites, and efficient tracking methods remain a significant topic of research. Different tracking technologies, such as barcodes and radio-frequency identification (RFID) tags [1], Ultra-WideBand (UWB) technology [2], global and indoor positioning systems [3], ultrasonic sensors [4], RGB [5], RGBD [6], or depth-only cameras [7], as well as 3D structures computed by Structure-from-Motion (SfM) algorithms [8], have been developed, deployed, and commercialized at various sites. Among these, RFID and UWB systems have seen the most success, with some solutions entering commercialization [9] thanks to their accuracy, reliability, and relatively low cost. These solutions are usually enabled by a marker, either passive (e.g., RFID) or active (e.g., UWB and wireless local area network), that is attached to valuable assets in the scene.
In contrast, vision-based tracking was usually deemed impractical in earlier reviews [10]-[12]. Reviewers valued marker-enabled indoor tracking methods, including RFID and UWB, for their reliability and localization range over camera tracking enabled by basic machine vision applications such as barcodes and QR codes.
However, vision-based tracking systems provide application opportunities that the alternatives fail to offer. A vision-based tracking system does not need a marker attached to each tracked object, which is a prerequisite for almost all other tracking methods. Adopting a vision-based tracking system spares the user the daunting task of instantiating, documenting, and maintaining numerous passive or active markers for each tracked asset. Although advancements in mobile sensor technology allow easier housekeeping of the markers, users are limited to marking only the most valuable assets. In some other use cases, e.g., a parking lot or a logistics/distribution center, it may become impossible to maintain a marker for each object, and it is generally unacceptable to track personnel with the aforementioned markers. In such cases, a vision-based, marker-less tracking system offers the potential to track significantly more objects at no additional cost, and may see application in high-flow-rate scenarios like traffic analysis.
Following the rapid improvement of neural-network-based object detectors, including Fast/Faster/Mask R-CNN [13]-[16], the single shot multibox detector (SSD) [17], and fully convolutional networks (FCN) [18], [19], tracking of general objects in uncontrolled environments has also seen increased interest, particularly in construction and manufacturing sites. Recent advances in computer vision, including point cloud analysis [8], tracking-by-detection [20], [21], and object re-identification across views [22], have all been found valuable in these usage scenarios.
Early literature formulates the multi-object tracking-by-detection task as identifying the optimal association of detections into tracks through global optimization [23], [24]. For example, the Simple Online and Realtime Tracker (SORT) [25] solves the global track association problem using the Hungarian algorithm. As an extension to SORT, Wojke et al. [26] incorporate visual information to aid in making associations over longer time gaps. Although the above tracking methods involving global optimization provide good performance, they do not permit on-line operation, which is preferred given the proliferation of fast and highly accurate image classifiers and detectors that run in nearly real time. Many on-line tracking-by-detection approaches have been proposed [25], [27], [28] in pursuit of real-time, on-line tracking performance. These approaches perform frame association using a simple spatiotemporal overlap measure to maximize frame rate, while relying on high-quality detections from a separate detector to achieve preferable tracking performance. The overlap is usually measured by the IOU of the detection bounding boxes between consecutive frames.
Among the various on-line tracking-by-detection methods, the use of probabilistic models [29], [30], particle filters [31], [32], and probability hypothesis density (PHD) filters [33]-[36] has been extensively discussed. Instead of relying on probabilistic models, the IOU tracker [27] achieves high speed by using only detection bounding boxes as input while delivering performance improvements over previous optimization-based approaches. An extension to the IOU tracker, named V-IOU [28], runs a single-object visual tracker for each track in an effort to repair tracks fragmented by intermittent missed detections across frames. This approach helps alleviate track fragmentation when minor, intermittent missed detections are present, but when tracks fail to associate with new detections for an extended period, e.g., due to occlusion, more costly techniques including visual re-identification must be performed in the hope of reacquiring the missed object.
In this work, we extend the approach of [27] to further address the problem of track fragmentation by instantiating a Kalman filter [37] for each tracked object, so that object dynamics are learned and tracked in a prediction-and-correction manner. A recursive Bayesian estimator for linear Gaussian problems, the Kalman filter has seen extensive application in general position tracking for more than 40 years [38]-[41] and, more recently, in video-based single- [42] or multiple- [43] object tracking. The Kalman filter assumes that object dynamics are linear and the errors follow a Gaussian distribution, and is thus theoretically limited compared to non-Gaussian filtering methods such as the particle filter [44]. However, we see value in the Kalman filter's low computational cost as an effective and inexpensive remedy to track fragmentation in tracking-by-detection problems. We evaluate our approach, the KIOU tracker, on the UA-DETRAC dataset in comparison with [27], [28] and other leading methods in the benchmark, and demonstrate a salient performance advantage over the state-of-the-art. We further show that this approach is suitable for real-time applications, as it achieves performance comparable to the current leading result in multi-object tracking accuracy (MOTA), track fragmentation (FM), and identity switch (IDs) metrics while saving up to 80% of computational resources by using a lower input frame rate.
The remainder of this paper is organized as follows. Section II presents the proposed methodology. Main results are presented and discussed in Section III. Finally, Section IV concludes the paper. The source code for the proposed method has been made available online.

II. METHODOLOGY
The KIOU tracker is an extension to the IOU tracker [27], which provides competitive performance on the UA-DETRAC dataset [45] at minimal computational cost. In this section, we provide an overview of the IOU tracker and describe the improvements made by the proposed method.

A. IOU TRACKER
With the continued performance improvements achieved by object detection and semantic segmentation methods, it is idealistically claimed that bounding boxes for generic objects in an image frame can be generated in a reliable and consistent manner, such that track association for multi-object tracking tasks becomes trivial [27].
Based on this belief, the IOU tracker operates on a list of bounding box locations in each frame, and performs track association in a greedy manner. Namely, for each existing track, the tracker calculates the IOU metric between the last known location of the object and all detections in the current frame. If the highest-overlapping detected bounding box also passes the pre-set IOU threshold, the object is associated with the corresponding track. Given a detection $d_f$ and a track whose last known location is $t_{f-1}$, the IOU measure is defined as

$$\mathrm{IOU}(d_f, t_{f-1}) = \frac{\mathrm{Area}(d_f \cap t_{f-1})}{\mathrm{Area}(d_f \cup t_{f-1})}.$$

For speed and simplicity, the IOU tracker does not tolerate missed detections. Tracks that fail to match with a current detection are terminated immediately, and all detections that are not matched with an existing track are initiated as new tracks at the end of the current iteration.
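As an illustration, the IOU measure for two axis-aligned boxes can be computed in a few lines. The (x1, y1, x2, y2) box format and the function name below are our assumptions for the sketch, not the IOU tracker's released code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Empty intersection clamps to zero area.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Greedy association then reduces to taking, for each track, the detection maximizing this value and checking it against the threshold.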
When paired with a modern object detector that provides high-quality detections, the IOU tracker delivers strong performance at surprisingly high speed. On the dense traffic scenes featured in the UA-DETRAC dataset, the IOU tracker achieved leading performance while maintaining a frame rate of more than 100,000 frames per second (FPS).
The published results of the IOU tracker in a tracking-by-detection task rest on two assumptions: (1) the detector must generate as few missed detections as possible; and (2) the detector and tracker must collectively run at high frame rates, such that detections of an object in consecutive frames have a high IOU. Nevertheless, these assumptions cannot be easily fulfilled when the algorithm is applied to on-line tracking.
First, it is unrealistic to assume that a perfect detector is used in tandem with the tracker. Missed detections and false positives exist in the output of nearly every detector, and therefore a good tracker must address these flaws. Additionally, the high frame rates at which many trackers are evaluated may not be available in an on-line tracking setup. The EB detector [46] for vehicle detection reports 9-13 FPS on a high-end Nvidia TITAN X GPU, well below the 25 FPS at which the UA-DETRAC dataset [45] is recorded. Therefore, the tracker may face at least a 50% reduction in available frames, which invalidates the high-FPS assumption set above.

B. KALMAN FILTER
The Kalman filter [37] has long been treated as the go-to solution for various tracking and navigational tasks [47], and is also theoretically well poised for the visual tracking problem. We may assume that a tracked object's movement has the form

$$x_{k+1} = \Phi x_k + w_k,$$

where $x_k$ is the state vector containing the position and speed of the object at time $k$; $\Phi$ is the state transition matrix of the object movement from time $k$ to $k+1$, assuming a constant speed; and $w_k$ is the process noise, which follows a normal distribution.
Observations on the position of the object can be modeled in the form

$$d_k = H x_k + v_k,$$

where $d_k$ is the observed coordinate of the object at time $k$; $H$ is the observation matrix connecting the state vector and the coordinate vector; and $v_k$ is the measurement noise, which follows a normal distribution. The covariances of the process noise and the measurement noise are given by

$$Q = E[w_k w_k^T], \qquad R = E[v_k v_k^T].$$

The goal of the Kalman filter is to estimate the location and speed of the object, $\hat{x}$. The error covariance matrix at time $k$ is given by

$$P_k = E\left[(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T\right].$$

The Kalman filter is useful in visual tracking as one can obtain a prior estimate of the object coordinates from process knowledge, at least in the short term, without the current measurement:

$$\hat{x}_{k+1}^- = \Phi \hat{x}_k, \qquad P_{k+1}^- = \Phi P_k \Phi^T + Q.$$

Whenever a new measurement has been obtained, i.e., the coordinate of the object at time $k$ has been found, the estimated state $\hat{x}$ can be updated as follows:

$$\hat{x}_k = \hat{x}_k^- + K_k \left(d_k - H \hat{x}_k^-\right),$$

where $K_k$ is the Kalman gain for time $k$, which can be obtained given the current measurement:

$$K_k = P_k^- H^T \left(H P_k^- H^T + R\right)^{-1},$$

and the error covariance matrix can be updated by

$$P_k = (I - K_k H) P_k^-.$$

The Kalman filter recursively updates $K_k$, $P_k$, and $\hat{x}_k$, and can be used to provide a smoothed trajectory for a track $t = \{\hat{d}_0, \hat{d}_1, \ldots, \hat{d}_k\}$. When that track fails to match with a detection at time $k$, process knowledge can be used to project the track $\hat{d}_{k+1}$ in the short term, reducing the track fragmentation caused by occasional missed detections, as shown in a later part of this paper.
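The predict/correct recursion above can be written as a minimal filter. The class below follows the notation in the text (Phi, H, Q, R) but is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

class KalmanFilter:
    """Minimal linear Kalman filter following the notation in the text."""

    def __init__(self, phi, H, Q, R, x0, P0):
        self.phi, self.H, self.Q, self.R = phi, H, Q, R
        self.x, self.P = x0, P0  # state estimate and error covariance

    def predict(self):
        # A priori estimate: project state and covariance forward in time.
        self.x = self.phi @ self.x
        self.P = self.phi @ self.P @ self.phi.T + self.Q
        return self.x

    def update(self, d):
        # Kalman gain, then correct the prediction with the measurement d.
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (d - self.H @ self.x)
        self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P
        return self.x
```

For a 1D constant-velocity object observed through its position only, feeding a few noiseless positions makes the velocity estimate converge to the true slope, after which `predict()` extrapolates the track without a measurement.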

C. KIOU TRACKER
In view of the aforementioned challenges faced by the IOU tracker, we propose the KIOU tracker, extending the IOU tracker to address the challenges of missed detections and low FPS while preserving the merits of the original approach. Namely, the proposed approach leverages the Kalman filter's ability to consider a series of measurements in the presence of noise. By instantiating a separate Kalman filter for each track, the KIOU tracker estimates object dynamics along with location information, and uses the estimated speed to perform track association in a prediction-and-correction manner.
As with the original implementations in [27], [28], track association is done by evaluating the IOU measure of a track against candidates from each frame. Searching for the detections with the highest overlap with existing tracks is expected to lead to correct track association, provided that the frame rate is high enough that the difference between two consecutive frames is as small as possible. At lower frame rates, the IOU threshold σ_IOU must be significantly relaxed, which makes the results prone to identity switches and false track associations. This complication can be alleviated, though, as a Kalman filter may estimate the trajectory and project the track to the next time steps. It takes two associated frames for the Kalman filter to start making predictions, after which the IOU measure can be evaluated between the current detections d_k and the projected tracks.
Incorporating a Kalman filter into the KIOU tracker brings pronounced benefits. First, a track remains current even in the presence of intermittent missed detections: the Kalman filter may continue the track without an observation should the detection be missed in the current frame. Thanks to this feature, the KIOU tracker can keep current tracks pending for extended periods before termination, resulting in fewer track fragments and identity switches. Gaps in a track resulting from missed detections can be easily filled by interpolation if needed.
The ability to predict subsequent frames further opens opportunities to save computational cost and meet the performance requirements of on-line tracking. This is not to claim that our method has a lower computational complexity than that of [27]; in fact, the KIOU tracker runs at 160.85 FPS on the DETRAC-Train dataset, which is not comparable with the IOU tracker's 10,329 FPS on the same hardware setup. Rather, we claim that when the Kalman filter is combined with a suitable state transition matrix that correctly characterizes the objects' motion pattern, the tracker can achieve little performance loss while using a fraction of the frames, effectively mitigating the computational burden from the detector. We believe this merit is of practical value considering that the computational complexity of an object detector is much larger than that of the KIOU tracker. The EB detector used for detecting vehicles in the DETRAC-Train dataset runs at 9-13 FPS as reported in [46], while more versatile detectors may run below 10 FPS on a high-end GPU [15]. When the KIOU tracker is supplied with detections from the EB detector, skipping 50% of the frames brings the duo into real-time tracking range, and tracking performance does not suffer significant degradation until more than 70% of the frames are dropped. We present detailed results in Section III.
The KIOU algorithm is presented in detail in Algorithm 1. Detections from each frame D_k are first filtered by the detection confidence threshold σ_l. Track association is facilitated by evaluating the IOU measure between detections d and active tracks t_k, further guarded by the IOU threshold σ_IOU. Tracks that fail to associate with a current detection are kept pending for up to σ_f frames, after which they are terminated, retaining only tracks longer than σ_t.
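A compact sketch of this per-frame loop is shown below, using the threshold names from the text (σ_l, σ_IOU, σ_f, σ_t). To keep the sketch self-contained, the per-track Kalman filter is abstracted as a simple constant-velocity box predictor; the Track class, greedy matching order, and data layout are illustrative assumptions, not Algorithm 1 verbatim.

```python
class Track:
    """Illustrative track; a constant-velocity box predictor stands in
    for the per-track Kalman filter described in the text."""

    def __init__(self, box):
        self.box, self.vel = box, (0.0, 0.0, 0.0, 0.0)
        self.misses, self.length = 0, 1

    def predict(self):
        # Project the last box forward by the estimated per-frame velocity.
        return tuple(b + v for b, v in zip(self.box, self.vel))

    def update(self, box):
        self.vel = tuple(n - o for n, o in zip(box, self.box))
        self.box, self.misses, self.length = box, 0, self.length + 1


def iou(a, b):
    """Intersection-over-union of (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def kiou_step(tracks, finished, detections, sigma_l, sigma_iou, sigma_f, sigma_t):
    # Keep only detections above the confidence threshold sigma_l.
    dets = [(box, score) for box, score in detections if score >= sigma_l]
    for track in list(tracks):
        pred = track.predict()
        best = max(dets, key=lambda d: iou(pred, d[0]), default=None)
        if best is not None and iou(pred, best[0]) >= sigma_iou:
            track.update(best[0])        # correct the predictor with the match
            dets.remove(best)
        else:
            track.misses += 1            # keep the track pending on a miss
            if track.misses > sigma_f:
                tracks.remove(track)
                if track.length >= sigma_t:
                    finished.append(track)
    tracks.extend(Track(box) for box, _ in dets)  # unmatched -> new tracks
```

Calling `kiou_step` once per frame with the current detection list reproduces the pending/termination behavior described above.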
We use a constant-velocity linear motion model for the Kalman filter to match object movements in the UA-DETRAC dataset. In this case, the Kalman filter projects the object location by assuming that the object moves at a constant speed and in the same direction as estimated from previous observations. Other motion models can be used where appropriate. Although no significant impact on tracking results can be seen when there are few or no missed detections, an unsuitable motion model that does not capture the object dynamics will eventually lead to deviations in position estimation and false track associations when a track remains unobserved for extended periods.
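Under a constant-velocity model, the state transition matrix Φ simply advances position by velocity at each step. Below is a sketch for an assumed state layout of (cx, cy, vx, vy); the actual state vector used in the paper may differ.

```python
import numpy as np

def constant_velocity_phi(dt=1.0):
    """State transition matrix for a state (cx, cy, vx, vy): position
    advances by velocity * dt, velocity is held constant. The state
    layout here is an illustrative assumption."""
    return np.array([
        [1.0, 0.0,  dt, 0.0],
        [0.0, 1.0, 0.0,  dt],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ])
```

Swapping in a different Φ (e.g., constant acceleration with an enlarged state) is the only change needed to use another linear motion model.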

III. RESULTS
In this section, we demonstrate the performance of the KIOU tracker by quantitatively evaluating the algorithm with publicly available object tracking datasets, and compare the results with major open-source implementations.

A. DATASET
We evaluated the performance of the proposed tracker on the DETRAC benchmark [45]. Consisting of over 10 hours of dense traffic scene recordings under various weather and lighting conditions, the DETRAC benchmark targets multi-object detection and tracking tasks. The videos were recorded at 25 FPS with a resolution of 960 by 540 pixels, and baseline detections from DPM, ACF, R-CNN, and CompACT are available.

B. EXPERIMENTAL ENVIRONMENT
The performance of a tracking-by-detection algorithm is highly dependent on the upstream detector. In general, the more consistently a detector generates high-quality bounding boxes, the more likely a track association algorithm will generate high-quality tracks. Although more advanced detectors, including Faster R-CNN, SSD, and Mask R-CNN, may lead to higher performance on the DETRAC dataset, we opt not to report them because our goal is to compare the performance of the downstream tracking algorithm rather than the upstream detector. As the performance of a tracker is dependent on that of the detector, using a more advanced detector would give us an unfair advantage over the other methods being compared. Therefore, we report results generated with detections from the CompACT, EB, and Mask R-CNN detectors; the first is a detector with reasonably good performance that was used by all methods across the leaderboard, while the current best results were generated with the latter two as of the time of writing. We do not report results based on DPM, ACF, and R-CNN, as their detections are of such low quality that effective track association cannot be performed.

The behavior of the proposed tracker can be controlled by a series of tunable parameters. These parameters are determined by performing a comprehensive search in the parameter space using the training data provided by the DETRAC dataset. It is necessary to tune these parameters if the algorithm is applied to a different detector, as the output of different detectors can exhibit vastly different behaviors. For example, it may be necessary to use a low σ_IOU if a detector fails to generate bounding boxes of consistent sizes, and the confidence threshold σ_l may be increased should a detector generate many low-confidence false positive detections.
Parameters used to generate the reported results are listed in Table 2.

To evaluate the performance of object detection and tracking jointly, the multi-object tracking literature [48]-[50] has introduced a set of performance metrics including identity switches (IDs), mostly tracked (MT), mostly lost (ML), false positives (FP), false negatives (FN), multi-object tracking accuracy (MOTA), and multi-object tracking precision (MOTP). IDs reflects the accuracy of tracks by counting the number of times the identity of the object being tracked changes within a track. A track generated by the evaluated method is labeled as mostly tracked if it contains at least 80% of the frames in the ground-truth track; similarly, a track is labeled as mostly lost if it contains less than 20% of the frames of the ground truth. FP and FN are obtained by comparing the tracker output with the ground truth on a per-frame basis. Additionally, two high-level metrics are provided. The MOTA metric summarizes the performance of the evaluated method by combining several error counts over the evaluated video sequences:

$$\mathrm{MOTA} = 1 - \frac{\sum_{v,t}\left(FN_{v,t} + FP_{v,t} + IDS_{v,t}\right)}{\sum_{v,t} GT_{v,t}},$$

where $FN_{v,t}$ is the number of false negative track frames, $FP_{v,t}$ is the number of false positive track frames, $IDS_{v,t}$ is the number of identity switches, and $GT_{v,t}$ represents the total number of ground-truth objects at time $t$ of sequence $v$. The MOTP metric, on the other hand, is calculated as the overlap between correctly assembled tracks and the ground truth.
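As a sanity check on the MOTA definition, the score can be computed directly from per-frame error counts; the helper below is an illustrative sketch, not the benchmark's evaluation code.

```python
def mota(fn, fp, ids, gt):
    """MOTA = 1 - (sum of FN + FP + IDS) / (sum of ground-truth objects),
    with each argument a sequence of per-frame counts summed over all
    frames of all sequences."""
    errors = sum(fn) + sum(fp) + sum(ids)
    return 1.0 - errors / sum(gt)
```

Note that MOTA can go negative when the total error count exceeds the number of ground-truth objects, which is why dense, difficult benchmarks report low absolute values.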
The DETRAC dataset uses custom versions of the metrics mentioned above, evaluating the tracker under increments of the detection confidence threshold and integrating along the detector's precision-recall curve to obtain the final metric scores. Notably, the number of detections available to the tracker decreases as the detection confidence threshold increases, so the leading performance numbers are lower than those of other datasets of similar difficulty.

C. EVALUATION
We present the results generated by the KIOU tracker when evaluated on the DETRAC-Test dataset in Table 3. The algorithm is implemented in Python as a single-threaded program. Per the guidelines of the UA-DETRAC dataset, only the location, size, and confidence score of the detection boxes in each frame are available as input to the tracker. Note that although a tracker may be evaluated with multiple detectors, only the best tracker-detector combinations are reported to the benchmark for evaluation on the DETRAC-Test set.
In summary, the KIOU tracker outperforms the other available approaches in tracking accuracy (PR-MOTA), percentage mostly tracked (PR-MT), percentage mostly lost (PR-ML), and false negative detections (PR-FN). These results indicate that the proposed method maintains the highest number of valid tracks among the methods compared. The results reaffirm our observation that the performance of a tracker is highly dependent on the underlying detector, as the results of the IOU tracker vary greatly depending on the detector used. Namely, the results from the EB detector have a higher tracking accuracy but relatively low precision compared with those from the R-CNN detector, which indicates that the EB detector generates more detections that are generally less precise than those of the R-CNN detector. In this case, it is more reasonable to compare tracker performance on the same detector. On the EB detector, the KIOU tracker also achieves fewer identity switches (IDs) and less fragmentation (FM) than the IOU tracker, a benefit brought by the Kalman filter's ability to continue tracks correctly under occasional missed detections.
To further inspect the effect of the Kalman filter on track continuation, Table 4 presents a comparison on the DETRAC-Train dataset between the original IOU tracker, the V-IOU tracker, which extends the IOU tracker and at the time of writing (January 2021) is the best publicly available tracker on this dataset, and our method. Two detectors are used in the comparison: EB, which is evaluated in the official DETRAC-Test results, and Mask R-CNN, which allows overall higher performance than the former. Our method is evaluated in multiple runs while dropping a variable portion of the frames, with the most difficult run using only 20% of the available frames. Considering that the UA-DETRAC evaluation metrics penalize each missed frame, the bounding boxes in the skipped frames are generated off-line by simple linear interpolation for a fair comparison. One may argue that if a tracker is capable of tracking the location of an object at every other frame, then the true locations at the skipped frames are more or less unimportant, especially when the frames are separated by only 0.04 seconds (at 25 FPS). Therefore, the tracker can be used on-line, without post-interpolation, while delivering the same level of performance as the reported state-of-the-art.
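The off-line gap filling described above amounts to per-coordinate linear interpolation between the boxes bracketing the skipped frames. The function below is a sketch under an assumed (x1, y1, x2, y2) box format; the name and signature are ours.

```python
def interpolate_boxes(box_a, frame_a, box_b, frame_b):
    """Linearly interpolate boxes for the frames strictly between
    frame_a and frame_b (boxes as (x1, y1, x2, y2) tuples).
    Returns a dict mapping frame index to the interpolated box."""
    filled = {}
    span = frame_b - frame_a
    for f in range(frame_a + 1, frame_b):
        t = (f - frame_a) / span  # fraction of the way from box_a to box_b
        filled[f] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
    return filled
```

At 25 FPS the bracketing boxes are at most a few hundredths of a second apart, so this linear fill is a close approximation for the evaluation's per-frame matching.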
The comparison in Table 4 shows that the KIOU tracker provides a significant performance gain in the MOTA, IDs, and FM metrics over the baseline method. When using detections from the EB detector, the IDs metric was reduced by 89.8% and FM by 85.7% compared to the IOU tracker. With Mask R-CNN detections, the reductions in the IDs and FM metrics are 46.5% and 29.1%, respectively, which indicates that Mask R-CNN generally produces fewer missing or incorrect detections than EB. This also suggests that, with a sufficiently sophisticated detector that generates high-quality detections, one may resort to simpler frame association methods in favor of higher frame rates.
The visualized results in Table 5 also show that the Kalman filter aids in reducing track fragmentation and identity switches resulting from missed detections or occlusion. We used the best settings reported in [28] for the IOU tracker, and the settings reported in Table 2 for the KIOU tracker. The detector used in this visualization is Mask R-CNN, which has the lowest number of missing or incorrect detections among all detectors evaluated in this paper. For each missed detection caused by occlusion, the IOU tracker must initiate a new track when the object is detected again in a following frame away from its last known location, while the KIOU tracker can estimate and predict the location of the object, thus increasing the likelihood of continuing the track.
Another notable observation from the test results is that the performance of our method is not significantly impacted by the reduced number of available frames. Skipping 1/2 or 2/3 of the frames offers a significant gain in effective frame rate at the cost of a minor degradation in the MT, ML, and MOTA metrics. Some other metrics, most significantly IDs and FM, actually benefit from skipped frames. The most probable cause of this improvement is that some erroneous or misleading detections are skipped along with the dropped frames.
The above observations do not suggest that an unlimited portion of frames can be skipped. Skipping an increasing number of frames offers diminishing performance improvements and in turn magnifies the deterioration in tracking accuracy and other major metrics. Most notably, the overlap threshold, which is the primary means of associating frames, must be lowered to accommodate an increasing number of missing frames. From Table 2 one may observe that for Mask R-CNN detections, the σ_IOU threshold is decreased to as low as 0.1 from the normal threshold of 0.6 used when no frames are skipped. Further reducing the number of input frames may prevent the tracker from making any correct track associations. Judging from the results in Table 4, the KIOU tracker outperforms the baseline in most metrics when up to 2/3 or 3/4 of the frames are dropped from the EB and Mask R-CNN detections, respectively, and dropping more frames beyond that may result in severe performance loss, most significantly in the mostly tracked (MT) and tracking accuracy (MOTA) metrics.
Although highly dependent on the nature of the data, dropping 1/2 or 2/3 of the frames appears to be a reasonable choice in practice, as it provides a significant improvement in effective frame rate at a modest cost in performance. At the same time, dropping some of the frames enables real-time processing of the videos in the DETRAC dataset. We present in Table 6 a brief runtime analysis with the EB detector on the 25 FPS videos available in the DETRAC dataset. The EB detector is selected as it is relatively light-weight and runs at 9-13 FPS as reported in [46], significantly faster than Mask R-CNN [15]. We claim that the KIOU tracker can outperform the baseline while using only 1/3 of the frames. In this configuration the proposed tracker delivers the reported performance at 25-36 FPS, even though the detector runs at only 9-13 FPS.

IV. CONCLUSION
Simple track association approaches based on a spatiotemporal overlap measure provide decent tracking performance at high speed but are generally unable to reliably handle missed detections. We present a method to perform track association under missed detections through a Kalman filter. We further exploit this characteristic by deliberately dropping frames in exchange for significantly higher processing speed. Our approach outperforms the state-of-the-art on the evaluated tracking dataset while using up to 80% less computational cost in real-time operation. This high-speed, marker-less tracking approach may prove suitable for many use cases in manufacturing, traffic, logistics, and beyond, where tracking is predominantly performed with markers.