e-TLD: Event-based Framework for Dynamic Object Tracking

This paper presents a long-term object tracking framework with a moving event camera under general tracking conditions. A first of its kind for these revolutionary cameras, the tracking framework uses a discriminative representation for the object with online learning, and detects and re-tracks the object when it comes back into the field-of-view. One of the key novelties is the use of an event-based local sliding window technique that tracks reliably in scenes with cluttered and textured background. In addition, Bayesian bootstrapping is used to assist real-time processing and boost the discriminative power of the object representation. On the other hand, when the object re-enters the field-of-view of the camera, a data-driven, global sliding window detector locates the object for subsequent tracking. Extensive experiments demonstrate the ability of the proposed framework to track and detect arbitrary objects of various shapes and sizes, including dynamic objects such as a human. This is a significant improvement compared to earlier works that simply track objects as long as they are visible under simpler background settings. Using the ground truth locations for five different objects under three motion settings, namely translation, rotation and 6-DOF, quantitative measurement is reported for the event-based tracking framework with critical insights on various performance issues. Finally, real-time implementation in C++ highlights tracking ability under scale, rotation, view-point and occlusion scenarios in a lab setting.


I. INTRODUCTION
STANDARD video cameras struggle to capture crisp images of scenes characterized by high dynamic range and motion, returning blurred or saturated images. To overcome these limitations, event cameras aim to emulate the important asynchronous property of the human retina, thus earning themselves the name "silicon retinas". Hence, an event camera has no global clock or shutter to record images in the traditional sense. Instead, each pixel individually adapts and responds to temporal changes in log intensity, and outputs an asynchronous event with the pixel address, which gets a precise timestamp on the order of microseconds.
An event is characterized by a spatial location (x, y), timestamp t and a binary-valued polarity p, i.e., on-events (p = 1) are caused by a positive change in log-intensity and off-events (p = 0) by a negative change. In both cases, events triggered by brightness changes are likely to occur at the edges that delineate the structure of the scene, which removes redundancy and yields a much lower data rate compared to a standard VGA resolution video at 30 fps. Although redundancy is absent in the event stream, the higher time-resolution should in principle contain all the information of a standard video without bounds on frame-rate and dynamic range. The image reconstruction from a pure event stream lends support to this idea [1].
Despite object tracking being a major research topic in computer vision, applicability has been limited by the low camera dynamics of standard vision sensors. Increasing the frame rate only burdens the computation techniques [2]-[4], preventing a dynamic, low-latency formulation of object tracking. On the other hand, the discontinuous motion information captured using a standard 30 fps video camera is an obvious disadvantage for frame-based object tracking algorithms. This paper introduces a simple and efficient object tracking framework, consisting of a local tracker and a global detector, by taking advantage of the sparsity and higher temporal resolution of the event camera. In other words, the position of the object in the field-of-view of the event camera changes with negligible spatio-temporal discontinuity (5-10 µs). Therefore, the key idea is to spatially limit the search region of the tracked object while the temporal limits are imposed by the rate at which events arrive within the search region. In particular, the tracker search is modeled as a discriminative classification scheme (object vs. background) using the event-based descriptor proposed in [5]. Therefore, given the initial location of the object within a short time-interval, the training phase of the tracker learns a binary classifier. Subsequently, an object detector is learned using the training samples of the tracker.
The proposed framework is similar in spirit to the tracking-learning-detection (TLD) system for frame-based cameras [6], but differs significantly in its methods, which are suited to event-based vision. We term our event-based object tracking framework e-TLD. This is one of the first works to introduce a general purpose method to track object data from event cameras, which can be efficiently implemented in software at least, in contrast to the ever-growing neural network paradigms that potentially require hours of re-training for online learning. The e-TLD online learning process is the incremental SVM update stage that is well-documented to be a very efficient process [7]. Apart from the online learning ability of e-TLD, the core training process is the codebook learning step that requires under a minute for 500ms worth of data on a standard PC using efficient sampling strategies [8]. This also requires significantly lower resources in contrast to Siamese deep neural network object tracking paradigms [9] that require ASIC implementations for real-time inference on each video frame. In summary, the objective is to develop a real-time long-term tracking system to achieve: 1) Continuous, long-term robust tracking under background change, illumination change and scale change. 2) Re-capturing the target after it is temporarily occluded by other objects or when it re-appears after exiting.

II. EVENT-BASED PROCESSING
We use the commercial event camera, the Dynamic and Active-pixel Vision Sensor (DAVIS) [11] shown in Fig. 1(a). It has 240 × 180 resolution, 130 dB dynamic range and 3 microsecond latency, and communicates with a host computer using USB 2.0. It concurrently outputs a stream of events and frame-based intensity read-outs using the same pixel array. As mentioned earlier in Sec. I, an event consists of a pixel location, a binary polarity value for positive or negative change in log intensity and a timestamp in microseconds. The event camera output can be visualized as shown in Fig. 1(b) by accumulating events within a short time-period (40ms in this case). In our work, the polarity of the events is not considered, so both on-events and off-events are shown in white and the black regions correspond to inactive pixels. Note that only the event data of the DAVIS is used in this work.
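The accumulation used for visualization can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the event tuple layout `(x, y, t_us, polarity)` and the function name `accumulate_events` are assumptions made for the example, while the 240 × 180 resolution and the 40 ms window come from the text.

```python
from typing import Iterable, List, Tuple

# (x, y, timestamp in microseconds, polarity); this field order is an assumption.
Event = Tuple[int, int, int, int]


def accumulate_events(events: Iterable[Event], width: int = 240, height: int = 180,
                      t_start_us: int = 0, window_us: int = 40_000) -> List[List[int]]:
    """Return a binary frame: 1 where at least one event fired inside the window."""
    frame = [[0] * width for _ in range(height)]
    t_end_us = t_start_us + window_us
    for x, y, t, _polarity in events:  # polarity is ignored, as in the text
        if t_start_us <= t < t_end_us and 0 <= x < width and 0 <= y < height:
            frame[y][x] = 1
    return frame


# Toy stream: two events on the same pixel, one at the far corner, one just outside the window.
events = [(10, 20, 5, 1), (10, 20, 900, 0), (239, 179, 39_999, 1), (0, 0, 40_000, 1)]
frame = accumulate_events(events)
active_pixels = sum(sum(row) for row in frame)
```

Because on-events and off-events are treated alike, the frame is binary: white for active pixels and black for inactive ones.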

A. Related Work
The recent deep learning revolution in computer vision has also influenced neuromorphic vision with many works primarily in machine learning [12]- [16]. Besides learning, simultaneous localization and mapping (SLAM) is a trending robotics application using silicon retinas [10], [17]- [21]. On a larger front, these revolutionary cameras allow new perspectives in reformulating traditional vision problems, such as object detection and tracking, which largely remains an unexplored area of research.
A few works have used the event camera for object tracking with focus on specific application scenarios. One of the first object tracking applications demonstrated using the commercial dynamic vision sensor (DVS) was to track and control the position of a pencil balanced on a robot arm using a fast event-based Hough transform [22]. Other works focused on event-based algorithms for traffic monitoring [23], [24] from a static sensor point-of-view, and consequently, tracking can be treated without background modeling as only dynamic objects are picked out by the static event camera. Similarly, the robot goalie application [25] also takes advantage of the stationary DVS camera for tracking multiple balls. Recently, [26] proposed an event-based algorithm that can perform tasks such as detection and tracking designed specifically for space situational awareness applications.
A handful of works have attempted to tackle tracking of objects from a moving event camera. Using the DAVIS, [27]-[29] use a convolutional neural network (CNN) to detect likely target locations for tracking from a moving platform. However, a hybrid approach with frames and events naturally loses the advantages of a low-latency, purely event-driven approach, although it can provide energy savings in hardware implementations [30]. On the other hand, [31] uses a parametric model to motion-compensate for the camera, without explicit feature tracking or optical flow computation, and subsequently moving objects that do not conform to the model are detected in an iterative fashion. Nonetheless, data association and redetection for long-term tracking remain missing components in these approaches.
Compared to the above works, general purpose object tracking works [32], [33] using event-based approaches have been proposed to track incoming blobs of events based on local shape properties. Although these methods are capable of adapting their shape and position to the distribution of incoming events, they carry motion assumptions such as a bivariate Gaussian distribution. Thus, the algorithm parameters were defined experimentally according to the target to track, as acknowledged in [32]. Moreover, these systems are not suited to tracking a set of patterns or an object as a whole. Finally, [32], [33] are not suited for long-term tracking, because there is no detector to re-initialize the tracker after a failed track.
In contrast to the above works, the event-based long-term object tracking and detection framework proposed in [34] has the following limitations. As acknowledged in [34], our prior method works only when there is a clean background surrounding the target object. Secondly, the training phase of the detector [34] is not data-driven (less reliable in practice) and uses computationally expensive image processing approaches for locating the most probable object candidate. Thus, our previous work is suitable only for simple shapes, as shown in the accompanying video results.
In this paper, we propose a general purpose discriminative tracking system using a local sliding window approach, whose parameters are intuitive and can be easily generalized for a wide variety of objects having different shapes and sizes in cluttered settings, as shown in Fig. 1. The classifier used is a support vector machine (SVM) with an additive $\chi^2$ kernel. For efficient implementation, finite dimensional linear approximations of the kernel are used, as introduced in [35]. Such maps are efficient linear representations of popular kernels, such as the intersection, $\chi^2$, and Jensen-Shannon kernels. Moreover, with a computationally easier online update, SVM is preferred over other classifiers and deep learning approaches. Lastly, we propose a data-driven approach for training the object detector and a global sliding window method for locating the object, which allows detecting even small objects, like the drinking cup near the monitor in Fig. 1(b).
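For reference, the exact additive $\chi^2$ kernel between two non-negative histograms is $k(\mathbf{x}, \mathbf{y}) = \sum_k 2 x_k y_k / (x_k + y_k)$. The sketch below evaluates this exact form only; it does not implement the finite dimensional linear approximation of [35], and the function name `chi2_additive_kernel` is hypothetical.

```python
from typing import Sequence


def chi2_additive_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    """Exact additive chi-squared kernel for non-negative histograms."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # bins that are empty in both histograms contribute nothing
            total += 2.0 * a * b / (a + b)
    return total


h1 = [0.5, 0.5, 0.0]
h2 = [0.5, 0.0, 0.5]
k_same = chi2_additive_kernel(h1, h1)  # identical L1-normalized histograms
k_diff = chi2_additive_kernel(h1, h2)  # histograms sharing only the first bin
```

For L1-normalized histograms the kernel equals 1 for identical inputs and decreases as the histograms share less mass.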

B. Contribution
This paper is an extended version of the preliminary work [36]. Novel contributions over [36] include quantitative analysis on the full-length recordings of the event camera dataset [10], tested for the first time using event-based sensors to the best of our knowledge (Sec. IV-B), and robustness analysis using hand-held experiments (Sec. IV-D) with critical insights into the system performance for various hyper-parameters (Sec. V). We also release full-length annotations for the dynamically captured data, i.e., moving objects captured with a moving camera setting. Additionally, this work includes a comprehensive comparison to existing state-of-the-art event-based tracking method e-LOT [34] (Sec. IV-C) and further provides new implementation details in Sec. III, including a free-running mode implementation capable of a detection output at any point in time, as opposed to periodically operating on a set of events [34], [36]. Finally, we have tightly integrated the tracker and detector with parameter sharing and in the process also obtain better performance compared to [36].

III. METHODOLOGY
The proposed e-TLD framework integrates a tracker and detector, as shown in Fig. 2. The event-based object tracker (Sec. III-A) is a local search that requires initialization and outputs smooth trajectories. However, it cannot recover from failure on its own. The event-based object detector (Sec. III-B), on the other hand, is a global search that does not assume anything about the previous position of the object, and is relatively slower compared to the tracker. However, we can achieve real-time processing by activating the detector only when tracker failure happens.
During the tracking process, online learning is needed to account for the changes in object appearance. In particular, the binary classifier used by the tracker is updated when the region-of-interest (ROI) is classified as the object. Updating the tracker mitigates the drifting issue, but it is only done when the tracking confidence is higher than a percentage of the mean tracking score. If tracking failure happens, a higher confidence value is needed to re-activate the tracker. In other words, the target will be re-tracked only when it can pass both the detector and a more "strict" tracker. The following subsections describe the e-TLD framework to jointly track and detect the object.
Fig. 3. Local sliding window for object tracking using event cameras. A small padding ensures the sliding area contains the object in the next instance of classification. As shown in the example, a padding of two pixels in the x and y directions creates 25 candidate windows (best viewed on monitor).

A. Event-based object tracker
Each time an ROI is classified as the object, a small padding ensures the search area contains the object at the next instance of classification, as shown in Fig. 3. The position of the object is then updated with the candidate ROI with the highest classification score. This process is simple, yet it works well in challenging cluttered conditions due to the high temporal resolution of the event camera. Note that we set the classification period in terms of the number of events received within the ROI, instead of explicitly choosing a time period. In particular, this number is chosen as a small fraction of the bounding box size and thus allows a dynamic classification rate for different object shapes and sizes.
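The local sliding window can be illustrated as follows. This is a minimal sketch under stated assumptions: boxes are `(x, y, width, height)` tuples, the helper names are hypothetical, and the classification scores are placeholders rather than SVM outputs. A padding of two pixels yields the 25 candidate windows mentioned for Fig. 3.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); layout is an assumption


def candidate_windows(roi: Box, pad: int = 2) -> List[Box]:
    """Shift the ROI by every offset in [-pad, pad] along x and y (row-major order)."""
    x, y, w, h = roi
    return [(x + dx, y + dy, w, h)
            for dy in range(-pad, pad + 1)
            for dx in range(-pad, pad + 1)]


def best_candidate(windows: Sequence[Box], scores: Sequence[float]) -> Box:
    """Return the candidate window with the highest classification score."""
    best = max(range(len(windows)), key=lambda j: scores[j])
    return windows[best]


windows = candidate_windows((100, 80, 40, 30))
scores = [0.0] * len(windows)
scores[13] = 1.0  # placeholder: pretend the window shifted by (+1, 0) scores highest
new_roi = best_candidate(windows, scores)
```

The number of candidates grows as $(2 \cdot \text{pad} + 1)^2$, so a small padding keeps the per-update cost low.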
We employ the feature descriptor proposed in [5], and thus, each event is encoded as a local descriptor. The notation $e_i = (x_i, y_i, t_i, p_i, \mathbf{x}_i^T)^T$ denotes an event with pixel location $x_i$ and $y_i$, timestamp $t_i$, polarity $p_i$ and the feature vector $\mathbf{x}_i$.
We denote by $N$ the number of candidate windows, and by $X_j = (\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_{n_i})$ the collection of event descriptors contained within a candidate window $W_j$, where $\mathbf{x}_l \in \mathbb{R}^d$, $l = 1, 2, \cdots, n_i$, is a descriptor in feature space $S$.
Inspired by the bag-of-words (BOW) model in computer vision [37], each feature vector $\mathbf{x}_l$ is quantized into one of $K$ different visual words that are obtained from the training phase. The mapping to a visual word $\mathbf{v}_k \in S$ is achieved using a quantization function $f_k(\mathbf{x}) : S \to \{0, 1\}$. Each quantization function $f_k(\mathbf{x})$ essentially computes the distance of the feature vector to $\mathbf{v}_k$ and allows the assignment if it is minimal,
$$f_k(\mathbf{x}) = I\big(k = \arg\min_{k'} \rho(\mathbf{x}, \mathbf{v}_{k'})\big), \quad (1)$$
where the indicator function $I(z)$ outputs 1 when $z$ is true and 0 otherwise, and $\rho$ is the Euclidean distance, $\rho(\mathbf{x}, \mathbf{v}_k) = \|\mathbf{x} - \mathbf{v}_k\|$.
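The quantization step amounts to a nearest-neighbor assignment, and counting the assignments per visual word gives a bag-of-words histogram. The sketch below is illustrative only: the function names and the toy two-dimensional descriptors are assumptions, and a real codebook would hold the $K$ cluster centers learned during training.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_word(x: Vector, words: Sequence[Vector]) -> int:
    """Index k of the visual word closest to x in Euclidean distance."""
    return min(range(len(words)), key=lambda k: math.dist(x, words[k]))


def bow_histogram(descriptors: Sequence[Vector], words: Sequence[Vector]) -> List[int]:
    """Count how many descriptors are assigned to each visual word."""
    hist = [0] * len(words)
    for x in descriptors:
        hist[nearest_word(x, words)] += 1
    return hist


# Toy codebook with K = 3 words and four descriptors in a 2-D feature space.
words = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
descriptors = [(1.0, 1.0), (9.0, 1.0), (0.0, 9.0), (2.0, 0.0)]
hist = bow_histogram(descriptors, words)
```

Since each event touches a single bin, the histogram can be updated incrementally as events arrive.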
Given $K$ visual words, or $K$ quantization functions $\{f_k(\mathbf{x})\}_{k=1}^{K}$, a codeword representation is computed as,
$$h_j^k = \sum_{l=1}^{n_i} f_k(\mathbf{x}_l). \quad (2)$$
The tracker representation for $W_j$ is expressed by the vector,
$$\mathbf{h}_j = (h_j^1, h_j^2, \cdots, h_j^K)^T. \quad (3)$$
Each incoming event in a candidate window $W_j$ is then used to update the tracker representation $\mathbf{h}_j \in \mathbb{R}^K$. The scalar-valued discriminant function $D(\mathbf{h}_j)$ indicates the presence (class $\omega_1 \Rightarrow +1$) or absence of the object (class $\omega_2 \Rightarrow -1$),
$$D(\mathbf{h}_j) \geq 0 \Rightarrow \mathbf{h}_j \in \omega_1, \qquad D(\mathbf{h}_j) < 0 \Rightarrow \mathbf{h}_j \in \omega_2, \quad (4)$$
where $\mathbf{h}_j \in \{\omega_1, \omega_2\}$. During the training phase, where the user specifies a tight ROI in space-time that contains the object, all the events including ones outside the ROI are used to obtain the parameters of $D$, which is the problem of constructing a classifier for two classes: object vs. background.
1) Training phase: When the user specifies the spatiotemporal position of the object, the first step is to create the visual words $\{\mathbf{v}_k\}_{k=1}^{K} \in S$, which are the cluster centers generated using k-means clustering of the event descriptors inside and outside the ROI. In other words, the codebook is an unsupervised learning step. Then the events within the ROI, represented by the set of descriptors $X_{\omega_1} = (\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_{C_1})$, can be used to generate a tracker representation $\mathbf{h}_{\omega_1}$, given by eq. (3). Similarly, the events outside the ROI, $X_{\omega_2} = (\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_{C_2})$, can be used for obtaining $\mathbf{h}_{\omega_2}$. However, training a classifier with just one sample from each class ($\mathbf{h}_{\omega_1}$ and $\mathbf{h}_{\omega_2}$) is pointless.
To solve the low sample problem, statistical bootstrapping [38] can be used to generate new subsets of descriptors $\{X^{\omega_1}_1, X^{\omega_1}_2, \cdots, X^{\omega_1}_{n_1}\}$ and $\{X^{\omega_2}_1, X^{\omega_2}_2, \cdots, X^{\omega_2}_{n_2}\}$. Specifically, bootstrapping $X_{\omega_1}$ is the process of random sampling of a subset out of the $C_1$ descriptors belonging to the ROI, one at a time such that all descriptors have an equal probability of being selected, i.e., $1/C_1$.
However, storing a set of events or descriptors for bootstrapping ($X_{\omega_1}$ and $X_{\omega_2}$) is impractical for online learning on an event-by-event basis [34]. Thus, we propose bootstrapping to be interpreted in a Bayesian framework [39] that re-weights the histogram representations ($\mathbf{h}_{\omega_1}$ and $\mathbf{h}_{\omega_2}$). Let $P \sim U([0, 1])$ be a uniformly distributed random variable. Mathematically,
$$h^k_{1\omega_1} = \lfloor P \, h^k_{\omega_1} \rfloor, \quad k = 1, 2, \cdots, K, \quad (5)$$
where the above clipping operator is a floor operation. Thus, the first bootstrapped histogram representation for the ROI is expressed by the vector,
$$\mathbf{h}_{1\omega_1} = (h^1_{1\omega_1}, h^2_{1\omega_1}, \cdots, h^K_{1\omega_1})^T. \quad (6)$$
It is to be noted that eq. (6) is not a true bootstrap procedure since the maximum values of $h^k_{1\omega_1}$ need not be clipped to the corresponding maximum values of $h^k_{\omega_1}$, as seen in eq. (5). However, the Bayesian bootstrap is operationally and inferentially similar to the true bootstrap [39]. Let $N_1$ and $N_2$ denote the number of samples after bootstrapping belonging to class $\omega_1$ and $\omega_2$ respectively. Then, the collection of the bootstrapped representations $\{\mathbf{h}_{1\omega_1}, \mathbf{h}_{2\omega_1}, \cdots, \mathbf{h}_{N_1\omega_1}\}$ and $\{\mathbf{h}_{1\omega_2}, \mathbf{h}_{2\omega_2}, \cdots, \mathbf{h}_{N_2\omega_2}\}$ is used to train the SVM classifier $D(\cdot)$ with a $\chi^2$ kernel [35].
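The histogram re-weighting can be sketched as below. This is an illustration under explicit assumptions, not the authors' procedure: each bin is scaled by a uniform draw and floored, and the draw is taken independently per bin, which the text does not specify. The function name `bayesian_bootstrap` and the toy histogram are hypothetical.

```python
import math
import random
from typing import List, Sequence


def bayesian_bootstrap(hist: Sequence[int], n_samples: int,
                       rng: random.Random) -> List[List[int]]:
    """Generate re-weighted copies of a histogram without storing descriptors."""
    samples = []
    for _ in range(n_samples):
        # Assumption: an independent P ~ U([0, 1)) per bin, followed by a floor.
        samples.append([math.floor(rng.random() * count) for count in hist])
    return samples


rng = random.Random(0)  # fixed seed so the sketch is reproducible
roi_hist = [12, 0, 7, 30]  # toy ROI histogram with K = 4 visual words
boot = bayesian_bootstrap(roi_hist, n_samples=5, rng=rng)
```

Only the $K$-dimensional histogram has to be kept in memory, which is what makes this variant practical for event-by-event online learning.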
2) Tracking Phase: The candidate windows $\{W_j\}_{j=1}^{N}$ each output a tracker representation $\mathbf{h}_j$. The best candidate window is chosen to be the tracker state $B_t$ when $D(\mathbf{h}_j)$ is maximized,
$$B_t = \arg\max_{W_j} D(\mathbf{h}_j),$$
provided that the discriminant function satisfies the tracking-confidence condition described below. The number of events for the ROI update ("waiting time" between two instances of classification) is set as $\tau \times$ height $\times$ width of the ROI $X_j$, where $\tau \in [0, 1]$ is set to 0.05 in our experiments. Thus, when the sliding area contains 5% of the events relative to the number of pixels contained within the ROI, the next instance of classification is triggered (see Fig. 3). The average SVM score after several instances is used to determine whether the next instance of tracking is successful. If the SVM score falls below a fraction, $\tau_t$, of the average score, the object detector is instantiated to globally search for the object.
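The event-count trigger and the failure test can be sketched as follows. The class and function names are hypothetical, and rounding the threshold to the nearest integer is an assumption; the value $\tau = 0.05$ comes from the text.

```python
class ClassificationTrigger:
    """Fires once tau * width * height events have arrived in the sliding area."""

    def __init__(self, roi_width: int, roi_height: int, tau: float = 0.05):
        # Rounding to the nearest integer (at least 1) is an assumption of this sketch.
        self.threshold = max(1, round(tau * roi_width * roi_height))
        self.count = 0

    def add_event(self) -> bool:
        """Register one event; return True when a classification is due."""
        self.count += 1
        if self.count >= self.threshold:
            self.count = 0  # restart the waiting time for the next instance
            return True
        return False


def tracking_failed(best_score: float, mean_score: float, tau_t: float) -> bool:
    """True when the best SVM score drops below a fraction tau_t of the mean score."""
    return best_score < tau_t * mean_score


trigger = ClassificationTrigger(roi_width=40, roi_height=30)
fired_at = [i for i in range(1, 121) if trigger.add_event()]
```

Because the trigger counts events rather than elapsed time, faster motion simply produces more frequent classifications.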

B. Event-based object detector
Once the tracker has lost the object, detecting the object is the problem of obtaining a candidate ROI and continuing the tracking process. Therefore, detection is a global sliding window search compared to the local sliding window search of the tracker. Fig. 4 illustrates the detection process that is described in detail below.
1) Training phase: Similar to the training phase of the tracker (Sec. III-A1), the object detector uses the ROI initialization by the user. Let $o$ denote the number of quantized clusters to which the object samples $X_{\omega_1}$ are frequently mapped, and the corresponding cluster indices be $O_{\omega_1} = \{k_1, k_2, \cdots, k_o\}$ where $o < K$. The objective of the proposed detector training phase is to obtain $O_{\omega_1}$ in a data-driven fashion without relying on ad-hoc threshold parameters.
The main idea is to deduce clusters that are important to $X_{\omega_1}$ while rejecting quantization results that are common to $X_{\omega_1}$ and $X_{\omega_2}$. By making use of the Bayesian bootstrapped representations, a new vector $\mathbf{h}^{\text{diff}}_{\omega_1\omega_2} \in \mathbb{R}^K$ is used to obtain object clusters for the detection process. The positive values in $\mathbf{h}^{\text{diff}}_{\omega_1\omega_2}$ represent codewords that have been assigned to the object more times than they have been assigned to the background. Therefore, these codewords are simply chosen to be $O_{\omega_1}$. This data-driven approach of training the detector ensures that the ROI events have the highest probability of detection compared to choosing cluster centers that have a high percentage of ROI events, as was done in [34]. In other words, cluster centers that are selected as detection landmarks do not ensure ROI events belong to the codewords.
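One way to realize this selection is sketched below. The exact form of the difference vector is not given here, so this example assumes it is the bin-wise difference between the summed object histograms and the summed background histograms; the function name `detector_clusters` and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def detector_clusters(object_hists: Sequence[Sequence[int]],
                      background_hists: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Return the codeword indices with a positive object-minus-background count."""
    K = len(object_hists[0])
    obj = [sum(h[k] for h in object_hists) for k in range(K)]
    bg = [sum(h[k] for h in background_hists) for k in range(K)]
    diff = [o - b for o, b in zip(obj, bg)]  # assumed form of the difference vector
    return [k for k, d in enumerate(diff) if d > 0], diff


# Toy bootstrapped histograms with K = 4 codewords.
object_hists = [[5, 1, 0, 3], [4, 2, 0, 2]]
background_hists = [[1, 4, 6, 3], [0, 5, 7, 2]]
clusters, diff = detector_clusters(object_hists, background_hists)
```

No threshold has to be tuned: a codeword is kept only when its difference is strictly positive, so ties and background-dominated codewords are rejected.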
2) Detection phase: Algorithm 1 outlines the proposed event-based object detection approach. If the event camera output contains $h$ rows and $w$ columns, a detection matrix $M \in \mathbb{R}_{+}^{h \times w}$ is used to keep track of events that may belong to the object. For every incoming event, the quantization function, defined in eq. (1), determines whether it belongs to the detector clusters $\{k_1, k_2, \cdots, k_o\}$, and the corresponding location of the event is used to increment $M$. The detector threshold, $\tau \times h \times w$, determines if enough events have been accumulated within the detection matrix $M$, and represents a percentage of the pixels from the ROI. The parameter $\tau$ is the same as the one for the local search tracker, set to 0.05, meaning at least 5% of the events have occurred globally for the detection process.
A global sliding window process is then performed on M to determine the region with maximal activation due to the presence of the object (if any). If the previous successful object state B t has m rows and n columns, then the size of the activation map after the global sliding window operation will be h − m + 1 rows, and w − n + 1 columns.
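The detection matrix update and the global sliding window can be sketched as follows. Using a summed-area table (integral image) to evaluate all windows is a design choice of this example, not something the text prescribes, and the function names are hypothetical. The activation map has $h - m + 1$ rows and $w - n + 1$ columns, as stated above.

```python
from typing import List, Set, Tuple

Matrix = List[List[int]]


def update_detection_matrix(M: Matrix, x: int, y: int, word: int,
                            object_words: Set[int]) -> None:
    """Increment M at the event location if its codeword belongs to the detector clusters."""
    if word in object_words:
        M[y][x] += 1


def window_sums(M: Matrix, m: int, n: int) -> Matrix:
    """Sum of M over every m-by-n window, computed with a summed-area table."""
    h, w = len(M), len(M[0])
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            integral[r + 1][c + 1] = (M[r][c] + integral[r][c + 1]
                                      + integral[r + 1][c] - integral[r][c])
    return [[integral[r + m][c + n] - integral[r][c + n]
             - integral[r + m][c] + integral[r][c]
             for c in range(w - n + 1)]
            for r in range(h - m + 1)]


def locate_object(M: Matrix, m: int, n: int) -> Tuple[int, int, int]:
    """Top-left (row, col) and score of the window with maximal activation."""
    A = window_sums(M, m, n)
    r, c = max(((r, c) for r in range(len(A)) for c in range(len(A[0]))),
               key=lambda rc: A[rc[0]][rc[1]])
    return r, c, A[r][c]


# Toy sensor with h = 4 rows and w = 5 columns; codewords 7 and 9 belong to the object.
M = [[0] * 5 for _ in range(4)]
for x, y, word in [(2, 1, 7), (2, 1, 7), (2, 1, 7), (3, 2, 7), (3, 2, 9), (2, 2, 7), (0, 0, 3)]:
    update_detection_matrix(M, x, y, word, {7, 9})
activation = window_sums(M, 2, 2)
row, col, score = locate_object(M, 2, 2)
```

With the summed-area table, each window sum costs four look-ups regardless of the window size.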
In the case of dynamic objects, the detection matrix $M$ accounts for both the camera motion and object motion. This results in a trail of object events rather than a crisp detection, as shown in the heat map of Fig. 4. In these cases, our previous system [36] detected very large bounding boxes around the object due to having a different threshold ($\tau_d$ set as 0.25) for the detector that decoupled its behavior from the tracker. This is a seemingly innocuous issue, but one that results in heavy performance loss as shown in the experiments. In this work, the parameter $\tau$ is shared by the tracker and detector, which tightly couples their overall performance.
As shown in Fig. 2, the event-based tracking-learning-detection (e-TLD) framework combines the tracker (Sec. III-A) and the detector (Sec. III-B) to track a desired object indefinitely. This novel framework integrates tracking and detection so that each benefits from the other in solving the long-term tracking problem.
There are three advantages to our proposed framework. First, since the global sliding window update of the detector is relatively time-consuming compared to the local sliding window update of the tracker, it is activated only when the local search tracker fails, which reduces the computational complexity significantly. Second, the robustness of the long-term tracker benefits from treating normal tracking and recovery from detection independently. In particular, when the detector outputs a candidate location, the tracker confidence needs to be above the mean tracking score ($\tau_t = 1$) instead of a fraction of it ($\tau_t < 1$). Third, drifting of the tracker is prevented by only updating when the current tracker score is greater than the mean tracking score.
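The interplay between normal tracking, stricter re-entry after detection, and update gating can be sketched as a small state machine. This is an illustration under assumptions: the class name `TrackerGate`, the running-mean update over accepted scores, and the example value $\tau_t = 0.8$ are not taken from the paper.

```python
from typing import Optional, Tuple


class TrackerGate:
    """Decides whether a score is accepted and whether the classifier may be updated."""

    def __init__(self, tau_t: float = 0.8):  # example value, an assumption
        self.tau_t = tau_t
        self.mean_score: Optional[float] = None
        self.n_scores = 0
        self.tracking = True

    def step(self, score: float) -> Tuple[bool, bool]:
        """Return (accepted, learn) for one classification instance."""
        if self.mean_score is None:  # the first score only initializes the mean
            self.mean_score, self.n_scores = score, 1
            return True, False
        # Normal tracking uses a fraction of the mean; recovery uses the full mean.
        factor = self.tau_t if self.tracking else 1.0
        accepted = score >= factor * self.mean_score
        # Online update only when the score exceeds the mean, to limit drift.
        learn = accepted and score > self.mean_score
        if accepted:  # assumption: the mean is a running average of accepted scores
            self.n_scores += 1
            self.mean_score += (score - self.mean_score) / self.n_scores
        self.tracking = accepted
        return accepted, learn


gate = TrackerGate(tau_t=0.8)
decisions = [gate.step(s) for s in (1.0, 0.9, 0.5, 0.9, 1.0)]
```

In this toy run the score 0.9 is accepted while tracking but rejected after a failure, since recovery requires the score to reach the full mean.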
The main premise in tackling object appearance changes is that the tracker representation (eq. (3)) obtained using the spike context descriptor [5] is robust to gradual scale and rotation changes. Specifically, the spike context descriptor uses a log-polar grid that tolerates moderate scale and rotation variations. Therefore, in object tracking scenarios using event cameras with high temporal resolution, the change in appearance from one instance of tracking to another is smooth and thus online learning enables accurate tracking. In the current e-TLD setup, the detector is not updated on-the-fly, as it requires online dictionary learning, which remains a future direction of research.

IV. EXPERIMENTS
For testing the proposed e-TLD system, the dynamically captured data in the event-camera dataset [10] was used. For each object, the training ROI was manually specified during the first 500ms of the recording and the testing was done until the end of the recording (60s). Using the ground truth annotations we created, it is possible to quantitatively evaluate the tracking performance and this sets up one of the first realistic tracking benchmarks for the neuromorphic vision community. The object location is specified as a bounding box within a short time-interval of 10ms for the full length of the data. The ground truth annotations for quantitative performance evaluation are available online¹.
In general, tracking algorithms are evaluated by two metrics [40], which are center location error (CLE) and overlap success (OS). The first metric, CLE, indicates the average Euclidean distance between the ground-truth and the estimated center location (in pixels). The second metric, OS, is defined as the percentage of times the tracker output overlaps with the ground truth annotations with a minimum overlap of 50%. We use OS as the primary metric for our evaluation and we report the results at a threshold of 50%, which corresponds to the PASCAL evaluation criteria. In addition, we also report CLE when there is an overlap success to show the closeness of ground truth match.
¹ https://github.com/nusneuromorphic/Object Annotations
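The two metrics can be computed as in the sketch below. It assumes that the overlap is measured as intersection-over-union of `(x, y, width, height)` boxes and that a value of at least 0.5 counts as a success; the function names and the toy boxes are hypothetical.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height); layout is an assumption


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between box centers, in pixels."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))


def evaluate(pred: Sequence[Box], gt: Sequence[Box], thr: float = 0.5) -> Tuple[float, float]:
    """Return (OS as a fraction, mean CLE over the successful overlaps)."""
    hits = [(p, g) for p, g in zip(pred, gt) if iou(p, g) >= thr]
    overlap_success = len(hits) / len(gt)
    cle = sum(center_error(p, g) for p, g in hits) / len(hits) if hits else float("nan")
    return overlap_success, cle


truth = (0.0, 0.0, 10.0, 10.0)
predictions = [truth, (5.0, 0.0, 10.0, 10.0), (2.0, 0.0, 10.0, 10.0)]
os_score, cle_score = evaluate(predictions, [truth] * 3)
```

Here the box shifted by 5 pixels has an overlap of 1/3 and fails, while the box shifted by 2 pixels has an overlap of 2/3 and contributes a center error of 2 pixels.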

A. Parameters
For each object, an ROI was manually specified during the first 500ms of the recording (training data) and the rest of the recording was used for testing. A codebook size of K = 500 is used to build the object and the background representation. For the local search tracker, an important parameter is the tracker confidence $\tau_t \in [0, 1]$, which is typically set to a value close to the average tracking score required for a successful track. Nonetheless, we report the tracking performance for various thresholds in the range [0.5, 1]. The SVM training is performed with Bayesian bootstrapping that outputs the same number of samples as the initial number of descriptors. For example, if there are $N_1 = 840$ ROI descriptors at the user initialization state, $N_1$ samples having $N_1/2$ descriptors in each sample are obtained after bootstrapping. The parameter $\tau$ of the detector is the same as the tracker threshold, although the window size is the whole image plane instead of the tracked region. The system performance is also reported by varying $\tau$ in steps.

B. Results on Event Camera Dataset
As shown in Figure 5, e-TLD is able to track and detect objects of various sizes and shapes. In these results, the overlaid markers indicate the position of the tracked object in the field-of-view of the event camera. Although the appearance of the object changed considerably during the translational camera motion, rotation was intentionally kept minimal in these recordings. Separate recordings of the same scene are available for the general 6-DOF camera motion, which induces drastic viewpoint change of the object. Figure 6 shows the tracking of the drone object under drastic view-point variations, showing robustness of the e-TLD system also to induced scale and rotation changes. For a qualitative viewing of the results, the web video² clearly shows the fine grain monitoring of the object as long as it is in the FOV, where rotation and scale changes can be monitored progressively because of the online SVM learning and micro-second sensor resolution.
As seen from the video results, although the local sliding process is dependent on the event activity rather than a time-based tracking process, the tracker loses the object during the initial stages when abrupt changes in location and appearance happen due to rotation and viewpoint changes. Nonetheless, towards the end of each recording, especially for the rotation and 6-DOF motion, the tracker becomes tolerant as the object under different variations has been included by the online learning step. Also worth noting is that, since we are processing a fixed number of events, as a percentage of events inside the tracking window or the global field-of-view for the detector, the faster speed of motion as the recording progresses only results in faster processing, and does not affect functionality. Note that the static drone object has a good performance for all three motion profiles compared to the rest of the objects, being closer to the camera and larger in relative size.
Table I shows the performance of e-TLD using the publicly available event-camera dataset [10]. The translation motion profile results in the best average OS compared to rotation and 6-DOF motion. This is partly due to the local sliding tracker update that is looking for a rectangle within a search space (Fig. 3), so naturally the translation motion entails that objects can be fully captured within the candidate bounding boxes. However, it is interesting to note that the average center location error is highest for the translation case because of incomplete overlap with the actual object. On the other hand, there are no clear indications on how the size of the object affects CLE even though tracking and detecting smaller objects, such as the cup, was difficult. In fact, the OS score for the cup is lowest among the objects, especially for the 6-DOF case.
In addition to the tougher rotation and 6-DOF motion profile, the dynamic human moves farther away from the camera in the recordings after taking the cup in his hand. This induces tracker failure for a longer period of time, and the tracked object also cannot be precisely detected (lower OS). It is worth stating that the low resolution of the DAVIS240C further reduces the event density for such objects and makes detecting farther away objects challenging. Neuromorphic vision sensors with higher resolution [41] could alleviate these issues to a considerable degree. Finally, using the same parameters for the tracker representation and thresholds, [36] obtains a mean overlap score of 0.5841 for the objects ('Head' 0.4591, 'Monitor' 0.7645, 'Drone' 0.8217, 'Cup' 0.2390 and 'Books' 0.6364) compared to a mean OS of 0.7010 in Table I using the dynamic translation data. In other words, the mean OS has improved by roughly 12 percentage points (about 20% in relative terms) compared to our previous framework. This has been made possible by parameter sharing between the tracker and detector as outlined in Sec. III-B2.
² e-TLD demo (updated): https://youtu.be/kkw69aVOoJY
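The comparison with [36] can be checked directly from the per-object scores quoted above. The snippet is a plain arithmetic check; the variable names are illustrative.

```python
# Per-object overlap scores reported for the previous framework [36].
previous_os = {"Head": 0.4591, "Monitor": 0.7645, "Drone": 0.8217,
               "Cup": 0.2390, "Books": 0.6364}

mean_previous = sum(previous_os.values()) / len(previous_os)
mean_current = 0.7010  # mean OS of e-TLD on the dynamic translation data (Table I)

absolute_gain = mean_current - mean_previous   # difference in mean OS
relative_gain = absolute_gain / mean_previous  # gain relative to the previous mean
```

The mean of the five scores is 0.58414, so the absolute gain is about 0.117 in OS, which is about 20% of the previous mean.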

C. Comparison to state-of-the-art
The descriptor proposed in [34] for event cameras was demonstrated with promising results for four different vision problems, namely object classification, tracking, detection and feature matching, as also noted in [42]. However, the event-based long-term object tracking (e-LOT) [34] solution assumes a clean background surrounding the objects for tracking with a less reliable detection approach. Nonetheless, e-LOT is the only comprehensive work for event cameras addressing the problem of long-term object tracking and thus we make a comparison to e-TLD. Table II compares e-LOT with the general purpose e-TLD framework for dynamic object tracking on the event camera dataset using the main OS metric. It is clear that e-LOT does not generalize well to cluttered and more generic data. In all three motion cases, e-TLD comprehensively outperforms e-LOT on the average OS score while performing slightly under par for the 'cup' object. We attribute this anomaly to the tailor-made e-LOT system for tracking objects with cleaner background, which the 'cup' object encounters due to its unique placement in the scene compared to the other objects.

D. Real-time testing
The e-TLD framework was implemented in C++ using the Visual Studio IDE on Windows 10, with several practical design considerations. For instance, the feature descriptor encodes the distribution of the events using a fixed log-polar lattice [5]. Thus, instead of generating the log-polar grid at every new event position, the event is shifted to a known location (say, the top left of the image), and the features are obtained using a pre-computed log-polar grid to save computational time. In addition, the local sliding window operation is efficiently accomplished by maintaining a distinct tracker representation for each candidate window. A look-up table, computed offline, determines whether an event belongs to a rectangular candidate window, and the respective tracker representations are updated accordingly. After each successful track, the tracker representations are reset and then updated according to incoming events. Similarly for the detector, the global sliding window can be implemented by counting the detected events using a look-up table as they arrive, instead of waiting to detect after the threshold τ is reached.
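The pre-computed log-polar grid described above can be sketched as a table indexed by the integer offset of a neighbouring event from the grid centre, which is equivalent to shifting the event to a fixed reference location before binning. This is only an illustrative sketch and not the authors' implementation: the type name LogPolarLut, the grid radius, and the numbers of radial and angular bins are all assumptions introduced here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical pre-computed log-polar look-up table (all names and sizes are
// assumptions). It maps an integer offset (dx, dy) from the grid centre to a
// bin index, or -1 when the offset falls outside the grid.
struct LogPolarLut {
    int radius;            // maximum offset covered, in pixels (assumed)
    int rings;             // number of radial bins (assumed)
    int wedges;            // number of angular bins (assumed)
    std::vector<int> bin;  // (2*radius+1)^2 entries, row-major

    LogPolarLut(int radius_, int rings_, int wedges_)
        : radius(radius_), rings(rings_), wedges(wedges_),
          bin((2 * radius_ + 1) * (2 * radius_ + 1), -1) {
        assert(radius > 1 && rings > 0 && wedges > 0);
        const double kTwoPi = 2.0 * 3.14159265358979323846;
        const double logMax = std::log(static_cast<double>(radius));
        for (int dy = -radius; dy <= radius; ++dy) {
            for (int dx = -radius; dx <= radius; ++dx) {
                const double r = std::hypot(static_cast<double>(dx),
                                            static_cast<double>(dy));
                if (r < 1.0 || r > radius) continue;  // centre and corners stay -1
                // Radial bin: uniform in log(r), clamped to the last ring.
                int ring = static_cast<int>(std::log(r) / logMax * rings);
                if (ring >= rings) ring = rings - 1;
                // Angular bin: uniform in angle over [0, 2*pi).
                double theta = std::atan2(static_cast<double>(dy),
                                          static_cast<double>(dx));
                if (theta < 0.0) theta += kTwoPi;
                int wedge = static_cast<int>(theta / kTwoPi * wedges);
                if (wedge >= wedges) wedge = wedges - 1;
                bin[index(dx, dy)] = ring * wedges + wedge;
            }
        }
    }

    // Row-major position of offset (dx, dy) inside the table.
    int index(int dx, int dy) const {
        return (dy + radius) * (2 * radius + 1) + (dx + radius);
    }

    // Constant-time bin query; no log or atan2 is evaluated per event.
    int lookup(int dx, int dy) const {
        if (dx < -radius || dx > radius || dy < -radius || dy > radius) return -1;
        return bin[index(dx, dy)];
    }
};
```

With such a table, the per-event cost of binning a neighbour reduces to one subtraction per coordinate and one array read, for example lookup(xn - xe, yn - ye) for a neighbour (xn, yn) of the event (xe, ye).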
Real-time testing was carried out using a DAVIS camera, interfaced with and powered by a workstation running an Intel Core i7 3.6 GHz processor. Our implementation, running on a single thread, achieves an average latency of 45 µs per incoming event, which is about 140× faster than [31]. A standard global shutter camera is likely to generate images with motion blur artifacts while the handheld camera is constantly in motion. However, the event camera and our algorithm are able to track the object, as shown in this video 3.
There are three parts to the above demo video. Firstly, handheld testing of the detection system (without tracking) was done to showcase real-time performance with the computationally more intensive global sliding window step. This highlights the possibility of a tracking-by-detection approach and, more importantly, shows that even in extreme cases where the local sliding window is inhibited, the global sliding window detector can still output the object location in real time. Subsequently, we show that e-TLD still works on the simple shapes data, followed by real-time performance on the dynamic data. Note that the shapes and dynamic dataset recordings were fed to the C++ system by simulating the camera interface, thereby enabling real-time output.
3 Real-time demo (updated): https://tinyurl.com/ske6nk7

V. DISCUSSION
In this section, we report the e-TLD system performance by varying crucial algorithm parameters and draw insights for setting them for different object and motion scenarios. The translation data is used to uniformly study the system performance in the experiments reported below.
The dimension of the object representation has a direct impact on the tracker performance (eq. (6)). In general, a higher dimensional discriminative representation is expected to provide better background vs. object separation, as noted for the object classification task in [5]. Fig. 7 confirms this trend, with overlap success increasing for higher dimensions. In particular, objects such as the books, drone and cup exhibit a further increase beyond 1500 dimensions, although saturation is expected eventually. It is worth noting that all the results reported in the previous section were obtained with 500 visual words, as a compromise between tracker performance and running time. Our intention is to foster new research in this niche domain rather than to report the best OS values on the dynamic scenes of the event camera dataset.
Another crucial parameter is the "waiting time" of the tracker and detector, τ, set as a percentage of the events received within the tracker state, or within the whole image plane in the case of the detector. Fig. 8 shows a steady drop in the overlap success as τ increases from 0.1 to 0.3, except for the drone object. This parameter was set to 0.05 universally for the experiments reported in Sec. IV-B. The detector plays a crucial role in re-capturing the object, and in some cases we found that objects uniform in intensity, though large in size, like the computer monitor, could only be partially detected on most occasions due to the correspondingly low event generation. However, objects that are not "hollow", like the drone, can tolerate even high values of τ as the density of events is high. Next, the tracker threshold τ_t that determines track success or failure is varied in Fig. 9. As expected, a value very close to the average tracker score, τ_t > 0.9, is very strict, whereas setting τ_t < 0.6 tolerates background objects. Therefore, τ_t ∈ [0.6, 0.8] is practical in many scenarios where a smooth track is preferred over frequent switching to the detector. In the experiments in Sec. IV-B, a value of 0.8 was used to report the OS and CLE measures. Objects like the head and drink cup benefit from having low thresholds, as detecting them is difficult compared to static objects like the drone.
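The interplay between the waiting time τ and the tracker threshold τ_t can be illustrated with a small decision gate. The sketch below is hypothetical and rests on two loudly stated assumptions: that τ is interpreted as a fraction of the candidate window's pixel count, and that a normalised tracker score in [0, 1] is supplied by the caller. The names TrackGate, Decision and onEvent are introduced here for illustration and do not come from the e-TLD implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

enum class Decision { Waiting, TrackSuccess, SwitchToDetector };

// Hypothetical gate combining the waiting time tau and the threshold tau_t.
struct TrackGate {
    std::size_t required;    // events to wait for before scoring
    double tauT;             // tracker threshold (0.8 in Sec. IV-B)
    std::size_t eventCount;  // events seen since the last decision

    // ASSUMPTION: tau is a fraction of the window's pixel count.
    TrackGate(double tau, double tauT_, std::size_t windowPixels)
        : required(static_cast<std::size_t>(
              std::ceil(tau * static_cast<double>(windowPixels)))),
          tauT(tauT_), eventCount(0) {
        assert(tau > 0.0 && tau <= 1.0);
        assert(tauT > 0.0 && tauT <= 1.0);
        assert(required > 0);
    }

    // Call once per event inside the window. `score` is the tracker score
    // that would be evaluated once enough events have arrived (assumed to be
    // normalised to [0, 1] and computed elsewhere).
    Decision onEvent(double score) {
        if (++eventCount < required) return Decision::Waiting;
        eventCount = 0;  // start a new waiting period
        return score >= tauT ? Decision::TrackSuccess
                             : Decision::SwitchToDetector;
    }
};
```

Under these assumptions, TrackGate(0.05, 0.8, 200) scores the window every 10 events. Raising τ delays each decision, while lowering τ_t makes the gate more tolerant of background events, in line with the trade-off discussed for Fig. 8 and Fig. 9.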
Finally, it is of interest to examine the robustness of the e-TLD system to object initialization offset, which is a possible scenario in practical applications. Since relative motion at object boundaries generates a lot of events, an imperfect initialization that does not capture the entire object extent may cause the tracker to perform worse than expected. Fig. 10 shows a gradually decreasing performance of the e-TLD system up to 25% offset from the ground truth initial location for the monitor object in the translation setting. From this experiment, it is safe to conclude that a reasonable system performance can be expected for up to 10% drift in bounding box coordinates during system initialization.
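To give a sense of what such an offset means geometrically, the following sketch shifts a ground-truth box by a fraction of its own width and height and measures the resulting overlap. It is a minimal illustration under stated assumptions: the offset is taken as a fraction of the box size applied to both coordinates, and the overlap is computed as intersection-over-union, which may differ from the exact offset protocol and OS definition used in the experiments. The names Box, offsetBox and overlap are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Axis-aligned box: top-left corner (x, y), width w and height h.
struct Box { double x, y, w, h; };

// ASSUMPTION: the initialization offset shifts the box by `fraction` of its
// own width and height; the box size is left unchanged.
Box offsetBox(const Box& b, double fraction) {
    assert(fraction >= 0.0);
    return Box{b.x + fraction * b.w, b.y + fraction * b.h, b.w, b.h};
}

// Intersection-over-union of two boxes (assumed stand-in for the overlap
// measure); returns 0 when the boxes do not intersect.
double overlap(const Box& a, const Box& b) {
    assert(a.w > 0.0 && a.h > 0.0 && b.w > 0.0 && b.h > 0.0);
    const double iw = std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x);
    const double ih = std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y);
    if (iw <= 0.0 || ih <= 0.0) return 0.0;
    const double inter = iw * ih;
    return inter / (a.w * a.h + b.w * b.h - inter);
}
```

For a 100 × 100 box, a 10% offset leaves an intersection of 90 × 90 = 8100 and a union of 11900, i.e. an overlap of about 0.68 with the true object, whereas a 25% offset reduces it to 5625 / 14375, about 0.39. Under these assumptions, the tracker starts from a noticeably degraded view of the object at larger offsets.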
While the object representation needs to be learnt within a reasonable 10% drift limit for system initialization, it is not imperative that the initial object location be labeled by a user. As the neuromorphic vision community matures in developing generic object detectors for real-world objects, detecting instances of commonly seen objects without prior annotation will become feasible. In that case, e-TLD can be initialized by a detector, internal or external to the framework. In other words, the e-TLD framework is not inherently limited by the need for user intervention for initialization. Similarly, the binary classification scheme is less of a limitation in the long run, since multiple trackers, each specific to one object, can be run simultaneously in hardware implementations (e.g. [43] runs up to eight trackers concurrently) or in pure software with sufficient parallel processing power.

VI. CONCLUSION
This paper presented a long-term object tracking system for event cameras, showing how an event-based tracker and detector permit the application of an event camera to the important problem of long-term object tracking. We hope this opens the door to similar approaches for other related vision problems. The tracker uses an event-based local sliding window technique that performs reliably in scenes with cluttered and textured background. In addition, Bayesian bootstrapping is used to assist real-time processing and boost the discriminative power of the object representation. On the other hand, when the object re-enters the field-of-view of the camera, a data-driven, global sliding window based detector locates the object under different view-point conditions for subsequent tracking. Extensive experiments on a publicly available event camera dataset demonstrate the ability to track and detect arbitrary objects of different shapes and sizes under various motion profiles. Using the ground truth locations we created, quantitative measurement is reported for the event-based tracking method with critical insights on various performance issues. Finally, we showcase the real-time object tracking performance of e-TLD using a C++ implementation for scale, rotation, view-point and occlusion scenarios in a lab setting. It is worth restating that the data rate of the DAVIS event camera used in our experiments is typically in the order of 150 KB/s, while a standard grayscale VGA camera outputs frames at 30 Hz, or about 10 MB/s. The only information that is important for tracking and detection is how edges move, and the event camera naturally outputs this information while sidestepping the problems of blur, low dynamic range and limited motion information that standard cameras create.
Shihao Zhang is currently an undergraduate at the National University of Singapore, studying in the double degree program in computer engineering and economics. He is also a research intern at Temasek Lab, with a focus on event-based object tracking and real-time problems in event-based visual odometry.
Hong Yang received her Bachelor's degree from the University of Electronic Science and Technology of China (UESTC). She was a master's student at NUS and worked at Temasek Lab under a working scheme, performing research on event-based cameras and real-time pattern recognition problems.
Andres Ussa received his MSc in Embedded Computing Systems from TU Kaiserslautern and the University of Southampton. His research experience has focused on embedded systems design and machine learning applications. He also worked briefly as a software/hardware developer for consumer electronics.
Matthew Ong is currently an undergraduate at the National University of Singapore, studying in the department of electrical and computer engineering. His final year project focuses on event-based vision, dealing with tracking issues under dynamic camera motion profiles.