Fast Event-Based Optical Flow Estimation by Triplet Matching

Event cameras are novel bio-inspired sensors that offer advantages over traditional cameras (low latency, high dynamic range, low power, etc.). Optical flow estimation methods that work on packets of events trade off speed for accuracy, while event-by-event (incremental) methods rely on strong assumptions and have not been tested on the common benchmarks that quantify progress in the field. For applications on resource-constrained devices, it is important to develop optical flow algorithms that are fast, light-weight and accurate. This work leverages insights from neuroscience and proposes a novel optical flow estimation scheme based on triplet matching. Experiments on publicly available benchmarks demonstrate its capability to handle complex scenes, with results comparable to those of prior packet-based algorithms. In addition, the proposed method achieves the fastest execution time ($>$10 kHz) on standard CPUs because it requires only three events per estimate. We hope that our research opens the door to real-time, incremental motion estimation methods and applications in real-world scenarios.


I. INTRODUCTION
Event cameras [1], [2] have led to rethinking visual processing for various computer vision tasks because their operating principle and output data are fundamentally different from those of conventional, frame-based cameras. These bio-inspired sensors naturally respond to the scene dynamics and offer advantages, such as low latency, high dynamic range (HDR) and data efficiency, which need to be unlocked with new algorithms [3]. Neuromorphic principles have been a major source of inspiration for such novel algorithms and hardware, especially in motion estimation tasks [4]-[6].
Event-based optical flow estimation methods can be broadly classified as packet-based or event-by-event, depending on how events are processed to update the estimator's output. Packet-based methods process a batch of events (e.g., events in a fixed time window, say 10-100 ms, or a fixed number of events, typically 30k-1M), hence they require some waiting time before processing (inference) starts [7]-[10]. They trade off the high-speed advantages of event data for accuracy. Prior work has proposed adaptations of classical frame-based methods (block matching [10], Lucas-Kanade [11]), spatiotemporal plane-fitting [8], [12], time-surface matching [13], and contrast-maximization methods [7], [9], [14]. While the above methods are model-based (optimization) methods, Artificial Neural Networks (ANN) [15]-[19] are also batch-based.

1 Department of Electronics and Electrical Engineering, Faculty of Science and Technology, Keio University, Kanagawa, Japan.
On the other hand, event-by-event methods process every event incrementally as it occurs (without waiting time), aiming to leverage the camera's low-latency advantage [5], [22]. Many event-by-event methods, such as Spiking Neural Networks (SNNs), are inspired by the brain (i.e., neuromorphic), since the neural circuits of visual processing are thought to be event-driven. While previous works propose SNN architectures [6], [23], [24], these comprise low-level physiological parameters of neurons (e.g., membrane potentials) that are difficult to interpret, validate and adjust to improve the estimation accuracy. Indeed, insects and mammals have different low-level underlying mechanisms, yet they share similar algorithmic steps to transform light into motion [25]. Hence, it is important to find abstracted logical operations of motion estimation, rather than to mimic the entire physiological properties of neurons. From a practical point of view, most event-by-event methods have been tested on simple scenes, as opposed to the more complex real-world scenes and publicly available benchmarks of batch-based methods [26]. This may be attributed to the use of tailored hardware [4], strong assumptions about the scene, limited problem settings [5] or the difficulty in defining event-by-event benchmarks on real data with µs resolution. Hence, it is important to explore event-by-event motion estimation algorithms that can solve complex, real-world problems.

Fig. 2: Triplet matching algorithm (panels: input events, estimated flow). The algorithm seeks spatially and temporally neighboring events in an event-by-event manner, and provides event-based flow $\mathbf{f}_k$. For ease of visualization we only show the search in $x$ and $t$, but it is actually carried out in $x$, $y$ and $t$. Note this is an example of batch estimation given the input events.
This work leverages insights from neuroscience, especially from the classical Barlow-Levick model [27], and proposes a novel optical flow estimation scheme based on correlation of occurrence. In contrast to previous batch-based methods, it requires only three events (a triplet) for estimation, which opens the door to future real-time incremental motion estimation methods. In contrast to previous event-by-event approaches, it is tested on publicly available optical flow benchmarks to demonstrate its capability to handle real-world scenes with comparable results. Additionally, it is based on logical operations, which enables a simple and efficient data structure implementation and execution on standard CPUs. In summary, our contributions are twofold: (i) we present a novel event-by-event algorithm for optical flow estimation, theoretically derived from neuroscience insights, and (ii) we practically demonstrate that it achieves results comparable to those of prior work while only requiring a CPU and being faster than optimization-based algorithms (Fig. 1).
The signal processing in this work materializes the ideas in current neuroscience models, shedding light on what the strong and weak scenarios are, in order to improve the models.

A. Event Camera
Event cameras acquire visual data in the form of asynchronous per-pixel brightness differences called "events" [1], [3]. An event $e_k \doteq (t_k, \mathbf{x}_k, p_k)$ is triggered as soon as the change in logarithmic brightness at the pixel $\mathbf{x}_k \doteq (x_k, y_k)$ exceeds a preset threshold. Here, $t_k$ is the timestamp of the event with µs resolution, and the polarity $p_k \in \{+1, -1\}$ is the sign of the brightness change (i.e., increase vs. decrease, respectively).
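The event generation model above can be illustrated with a minimal sketch for a single pixel. It assumes an idealized, noise-free sensor driven by sampled brightness values; the `Event` class, the function name `events_from_pixel` and the threshold value 0.2 are hypothetical choices for illustration, not part of the proposed method.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: float  # timestamp in seconds
    x: int    # pixel column
    y: int    # pixel row
    p: int    # polarity: +1 (brightness increase) or -1 (decrease)


def events_from_pixel(times, brightness, x, y, threshold=0.2):
    """Simulate events at one pixel from sampled brightness values.

    An event fires each time the log-brightness has changed by at least
    `threshold` (an assumed contrast threshold) since the last event.
    """
    events = []
    ref = math.log(brightness[0])  # log-brightness at the last event
    for t, b in zip(times[1:], brightness[1:]):
        diff = math.log(b) - ref
        while abs(diff) >= threshold:
            p = 1 if diff > 0 else -1
            events.append(Event(t, x, y, p))
            ref += p * threshold  # move the reference level by one threshold
            diff = math.log(b) - ref
    return events
```

For example, brightness samples `[1.0, 1.3, 1.3, 0.9]` at one pixel yield one positive event at the second sample and one negative event at the last sample.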

B. Triplet Matching
The idea of triplet matching comes from the neuroscience models of Hassenstein-Reichardt [28] and Barlow-Levick [27]. These correlator models estimate motion by computing pairwise correlations of neural activities (e.g., spikes) in space and time [25]. In particular, [29] suggests that triplet correlations (the product of pairwise correlations for three spikes in space-time) improve motion estimation. Here, we introduce the idea of the triplet-matching method as logical operations in space-time coordinates. We build an incremental (event-by-event) estimation algorithm, and extend it into batch mode for testing because benchmarks are specified on a batch basis.

Algorithm 1 Triplet matching algorithm
   Search for triplet candidates (2).
4: Collect triplet $T = (k, i, j)$
5: end for
1) Incremental Estimation: Incremental estimation consists of two main steps: search and update (Algorithm 1). Events are split by polarity, following the idea of ON- and OFF-circuits in the brain [25]. The search step finds triplets of events that are aligned (i.e., correlated) in space-time assuming a constant velocity model (Fig. 2). One of the events in the triplet is the incoming event, and the other two events are searched for within its space-time neighborhoods of size $d_x$, $d_t$. The search has two stages: first, the set of all potential 2nd events is determined; then, the set of all potential 3rd events (compatible with the previous two in the triplet) refines the search. In the update step, every triplet of events is characterized by a different velocity. The velocity (flow) $\mathbf{f}_k$ for the incoming event $e_k$ is computed as the average of the velocities of all triplets. Later, for benchmarking purposes, the flow is voxelized (quantized on a space-time grid) and smoothed.
In the search step, since event data are sorted by timestamp $t$, we use index maps to make the search efficient, with complexity $O(N_e \log N_e)$. The index map $H_k$ of an event $e_k$ consists of the indices of its space-time neighbors (1). Parameters $d_t$ and $d_x$ determine the maximum admissible velocity of the flow, and $\tau$ is a refractory period, which limits the search space by assuming that neighboring events on the moving edge do not occur at the same timestamp. $d_t$ can also be interpreted as the delay in the Barlow-Levick model. For each new event $e_k$, we build a set of index maps and output a set of event triplets $\{T\} \doteq \{(e_k, e_i, e_j)\}$. To find the triplet match, we look for event indices $j$ that have roughly constant velocity with the event pairs $(e_k, e_i)$, where $i \in H_k$ (2).

The update step calculates the flow $\mathbf{f}_k$ and updates the set of index maps: the set for event $k$ is obtained by adding the new $H_k$ to the set for event $k-1$ and removing old index maps (we keep the last 20000 index maps per polarity). The flow $\mathbf{f}_k$ is obtained as the weighted average

$$\mathbf{f}_k = \frac{\sum_{T} w_T \, \mathbf{v}_T}{\sum_{T} w_T}, \qquad (3)$$

where $\mathbf{v}_T \doteq (\mathbf{x}_j - \mathbf{x}_k)/(t_j - t_k)$ is the velocity of each triplet. Since (3) gives accurate flow if the triplet is caused by the same scene edge, we use the weight $w_T$ to estimate the probability that the triplet belongs to the same edge. Assuming constant velocity, if $e_j$ is produced by the same edge that generates $e_k$ and $e_i$, the expected timestamp of $e_j$ is given by $\hat{t}_j = t_i - \delta$, where $\delta \doteq t_k - t_i$. Therefore, to account for errors between the expected timestamp $\hat{t}_j$ and the actual timestamp $t_j$, we set the weight $w_T \doteq \mathcal{N}(t_j; \hat{t}_j, \delta^2)$, where $\mathcal{N}$ is the Gaussian density function. The proposed average flow due to the triplets (3) may not necessarily equal the optical flow, but this strong assumption is justified by the empirical results (Sec. III).
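The search and update steps can be sketched as follows for a single incoming event. This is a brute-force illustration under stated assumptions, not the index-map implementation: the exact neighborhood tests are not reproduced here, so the sketch assumes that a 2nd event $e_i$ must lie within distance $d_x$ and between $\tau$ and $d_t$ in the past, and that a 3rd event $e_j$ must lie at the pixel $2\mathbf{x}_i - \mathbf{x}_k$ (same displacement again). The function name `triplet_flow` and the tuple layout `(t, x, y, p)` are hypothetical.

```python
import math


def triplet_flow(events, k, d_x=math.sqrt(2), d_t=0.1, tau=0.003):
    """Flow (vx, vy) in pix/s for events[k], or None if no triplet is found.

    events: list of (t, x, y, p) tuples sorted by timestamp t (seconds).
    The neighborhood tests below are assumed forms, used only for illustration.
    """
    t_k, x_k, y_k, p_k = events[k]
    past = [e for e in events[:k] if e[3] == p_k]  # same polarity only
    num_x = num_y = den = 0.0
    for t_i, x_i, y_i, _ in past:
        delta = t_k - t_i
        dist2 = (x_k - x_i) ** 2 + (y_k - y_i) ** 2
        # 2nd event: a spatial neighbor, older than tau and newer than d_t.
        if not (0 < dist2 <= d_x ** 2 + 1e-9 and tau < delta <= d_t):
            continue
        # Constant velocity: the 3rd event is expected one more step back.
        x_exp, y_exp = 2 * x_i - x_k, 2 * y_i - y_k
        t_hat = t_i - delta
        for t_j, x_j, y_j, _ in past:
            if (x_j, y_j) != (x_exp, y_exp) or not (tau < t_i - t_j <= d_t):
                continue
            # Gaussian weight N(t_j; t_hat, delta^2).
            w = math.exp(-0.5 * ((t_j - t_hat) / delta) ** 2) / (
                math.sqrt(2 * math.pi) * delta)
            dt = t_j - t_k
            num_x += w * (x_j - x_k) / dt
            num_y += w * (y_j - y_k) / dt
            den += w
    if den == 0.0:
        return None
    return num_x / den, num_y / den
```

For three events at pixels $x = 0, 1, 2$ on one row with timestamps 0, 10 and 20 ms, the sketch returns a flow of about 100 pix/s along $x$ for the last event. The nested loops make the cost quadratic in the number of past events, which is what the index maps are meant to avoid.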
2) Batch Estimation: We extend the incremental (event-by-event) estimator to batch mode because current benchmarks are batch-based. For a set of events $\mathcal{E} \doteq \{e_k\}_{k=1}^{N_e}$, we first create the index maps of all $N_e$ events, which takes $O(N_e^2 \log N_e)$. Then the flow is calculated by looping over each event using Algorithm 1. The overall computational complexity is $O(N_e^2 \log N_e)$. For benchmarking with ground truth, the event-wise flow is converted into a voxel-wise flow, which also enhances space-time coherence. We quantize the time coordinates of $\mathbf{f}_k$ into bins, and take the average of the $\{\mathbf{f}_k\}$ that lie in each voxel. We also apply a non-zero average filter (which averages only the non-zero values) with kernel size 3 × 3 for spatial smoothing.
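The voxelization and the non-zero average filter described above can be sketched with NumPy as follows. The function names, the uniform time binning over $[t_0, t_1]$, and the choice to also fill empty pixels that have non-zero neighbors are assumptions made for illustration; the actual post-processing may differ in these details.

```python
import numpy as np


def voxelize_flow(ts, xs, ys, flows, shape, n_bins, t0, t1):
    """Average event-wise flow into a (n_bins, H, W, 2) space-time grid.

    ts: (N,) timestamps, xs/ys: (N,) integer pixel coordinates,
    flows: (N, 2) per-event flow. Bins are uniform in [t0, t1] (assumed).
    """
    H, W = shape
    acc = np.zeros((n_bins, H, W, 2))
    cnt = np.zeros((n_bins, H, W))
    b = np.clip(((ts - t0) / (t1 - t0) * n_bins).astype(int), 0, n_bins - 1)
    np.add.at(acc, (b, ys, xs), flows)  # unbuffered sum handles repeated voxels
    np.add.at(cnt, (b, ys, xs), 1)
    out = np.zeros_like(acc)
    mask = cnt > 0
    out[mask] = acc[mask] / cnt[mask][:, None]
    return out


def nonzero_mean_filter(flow2d, ksize=3):
    """Average only the non-zero flow vectors in each ksize x ksize window.

    flow2d: (H, W, 2). A pixel counts as non-zero if any component is non-zero.
    """
    pad = ksize // 2
    valid = np.any(flow2d != 0, axis=-1).astype(float)
    fp = np.pad(flow2d, ((pad, pad), (pad, pad), (0, 0)))
    vp = np.pad(valid, pad)
    H, W = valid.shape
    s = np.zeros_like(flow2d, dtype=float)
    n = np.zeros((H, W))
    for dy in range(ksize):
        for dx in range(ksize):
            s += fp[dy:dy + H, dx:dx + W]  # zero pixels add nothing to the sum
            n += vp[dy:dy + H, dx:dx + W]  # count of non-zero pixels in window
    out = np.zeros_like(s)
    m = n > 0
    out[m] = s[m] / n[m][:, None]
    return out
```

Dividing by the count of non-zero pixels, rather than by the kernel area, keeps sparse flow along edges from being diluted by the empty pixels around it.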
The computational complexity of both approaches is summarized in Tab. I. For comparison, we also report those of state-of-the-art methods: Contrast Maximization (CMax) approaches [7], [9] and time-surface matching [13]. The latter methods incur additional complexity that scales with the number of iterations $N_{\text{iter}}$. We report runtime comparisons in Sec. III-C.

III. EXPERIMENTS
A. Datasets and Evaluation Metrics
The MVSEC dataset [30] is a standard dataset for optical flow estimation [6], [9], [16], [17]. The data consist of event camera recordings, LiDAR data, and camera poses. The event camera (mDAVIS346 [31]) provides events, grayscale frames and IMU data (346 × 260 pix). The ground truth optical flow is provided as the motion field computed from the camera velocity and the depth of the scene [15]. The sequences are recorded indoors with a drone and outdoors with a car, and we evaluate on 63.5 million events spanning 265 seconds from both outdoor and indoor sequences. We measure optical flow accuracy to evaluate our method. The metrics are the Average Endpoint Error (AEE) and the percentage of pixels with endpoint error greater than 3 pixels (% Out). The time intervals for evaluation are $\Delta t = 1$ grayscale frame (at ≈ 45 Hz, i.e., 22.2 ms) and $\Delta t = 4$ frames (89 ms). Flow accuracy is evaluated only in pixels with valid ground truth. All experiments use $d_x = \sqrt{2}$ pix, $d_t = 100$ ms and $\tau = 3$ ms. We also show additional results on the ECD dataset [32], which is widely used for motion estimation [33]-[36]. Each sequence provides events, frames, calibration, and IMU data (at 1 kHz) from a DAVIS240C (240 × 180 pix) [37], as well as ground truth camera poses (at 0.2 kHz).

For $\Delta t = 4$, our error on MVSEC (Table II) is roughly four times bigger than for $\Delta t = 1$, which is expected for the longer interval, and it is consistently 2.5-3 times bigger than that of the most accurate method [9] for both $\Delta t = \{1, 4\}$. That batch-based methods achieve higher accuracy than our method for the longer time interval ($\Delta t = 4$) may be attributed to our method being event-by-event: it does not leverage long-term temporal smoothing, which would improve robustness to noise.
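The AEE and % Out metrics used in this section can be computed as in the following minimal sketch. It assumes dense flow arrays of shape (H, W, 2) in pixels and a boolean mask of pixels with valid ground truth; the function name `flow_metrics` is hypothetical.

```python
import numpy as np


def flow_metrics(pred, gt, valid, out_thresh=3.0):
    """Return (AEE, % Out) over the pixels selected by `valid`.

    pred, gt: (H, W, 2) flow in pixels; valid: (H, W) boolean mask.
    % Out is the percentage of valid pixels whose endpoint error
    exceeds `out_thresh` pixels.
    """
    ee = np.linalg.norm(pred - gt, axis=-1)[valid]  # per-pixel endpoint error
    aee = ee.mean()
    pct_out = 100.0 * np.mean(ee > out_thresh)
    return aee, pct_out
```

With endpoint errors of 5, 0 and 1 pixels on three valid pixels, this gives an AEE of 2 pixels and 33.3 % outliers.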

B. Optical Flow Estimation Accuracy
The results in Fig. 3 show that the events displaced using the estimated flow produce sharp images of warped events (IWEs [7], Fig. 3b). The Flow Warp Loss (FWL) [40] measures the sharpness of the IWE: 1.154 for outdoor day1, 1.157 for indoor flying1, and 1.248 for indoor flying2, where FWL > 1 indicates an IWE sharper than the identity warp baseline (i.e., zero flow). The figure also shows the estimated flow (Fig. 3c, and color wheel); notice that our method produces a flow vector for each event (Fig. 2), whereas it is common to display the flow for every pixel (image-based legacy). Hence, Fig. 3 shows a 2D collapsed version of the estimated space-time optical flow field, for visual comparison with the ground truth (Fig. 3d). The flow is most reliably estimated in regions where events happen, i.e., scene edges. Spatial and temporal smoothness could be further enhanced if needed: for example, homogeneous brightness regions between edges could be filled in by some prior, such as a regularizer or an in-painting algorithm.
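To make the FWL values above concrete, the following sketch computes a warp-sharpness ratio for per-event flow. It assumes that FWL is the variance of the IWE obtained with the estimated flow divided by the variance of the IWE obtained with zero flow, and it uses nearest-pixel accumulation for brevity (bilinear voting is a common alternative). The function names are hypothetical.

```python
import numpy as np


def iwe(ts, xs, ys, flows, shape, t_ref):
    """Image of warped events: transport each event along its own flow to
    time t_ref and count events per pixel (nearest-pixel voting)."""
    H, W = shape
    wx = np.rint(xs - flows[:, 0] * (ts - t_ref)).astype(int)
    wy = np.rint(ys - flows[:, 1] * (ts - t_ref)).astype(int)
    inside = (wx >= 0) & (wx < W) & (wy >= 0) & (wy < H)
    img = np.zeros((H, W))
    np.add.at(img, (wy[inside], wx[inside]), 1.0)
    return img


def flow_warp_loss(ts, xs, ys, flows, shape, t_ref):
    """Assumed FWL: variance of the IWE with the given flow divided by the
    variance of the IWE with zero flow. Values above 1 mean a sharper IWE."""
    sharp = iwe(ts, xs, ys, flows, shape, t_ref).var()
    base = iwe(ts, xs, ys, np.zeros_like(flows), shape, t_ref).var()
    return sharp / base
```

If the flow is correct, events generated by the same edge collapse onto the same pixels, which concentrates the counts and raises the variance relative to the unwarped image.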

C. Runtime Comparison
The proposed method runs in an event-by-event manner, hence it trades off accuracy for speed compared with batch-based methods. Tab. I compared computational complexity; here, we compare runtime against several prior methods. We use Python (3.9.12) on CPUs (Mac M1 2020, 8 cores) and average the runtime of processing 300k events incrementally. The results are shown in Fig. 1. Our method achieves the fastest runtime among the compared methods: 0.0934 ms per event (> 10 kHz). Note that many methods in the literature, such as the 2nd and 3rd fastest ones [38], [39], use GPUs, while ours runs natively on CPUs. This is crucial for applications on resource-constrained platforms.
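As a quick check of the reported figures, the runtime converts to an event rate as follows. This assumes that 0.0934 ms is the average time per event and that the 300k events are processed one after another.

```latex
\[
\frac{1}{0.0934\ \text{ms}}
  = \frac{1}{0.0934 \times 10^{-3}\ \text{s}}
  \approx 10.7\ \text{kHz},
\qquad
300\,000 \times 0.0934\ \text{ms} \approx 28\ \text{s}.
\]
```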

D. Effect of Pixel Quantization
A limitation of the proposed method is the quantization of the flow direction, since the search for the second event in the triplet is limited to the 8 neighboring pixels of the current event. To illustrate it, we conduct experiments on the dynamic translation sequence from the ECD dataset [32]. Figure 4 shows the distribution of $\mathbf{v}_T$ over all events (assuming a planar translation model, i.e., constant velocity over all pixels). Similar to the SNN proposed in [23], $\mathbf{v}_T$ is constrained to eight cardinal directions. However, in contrast to [23], which quantizes both the direction and magnitude of the flow, our method can estimate a continuum of magnitudes. The distributions are spread around a main direction and its two neighboring ones, which is due to the small aperture (5 × 5 pix) used for each triplet.
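The quantization described above can be made explicit with a small sketch: with $d_x = \sqrt{2}$ pix only eight pixel offsets are admissible for the second event, so triplet directions fall on multiples of 45°, while the speed depends on the continuous timestamps. The sketch assumes the third event lies at twice the offset of the second one (constant velocity); the function names are hypothetical.

```python
import math


def neighbor_directions(d_x=math.sqrt(2)):
    """Directions (degrees in [0, 360)) of the pixel offsets within d_x."""
    dirs = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dx, dy) != (0, 0) and dx * dx + dy * dy <= d_x ** 2 + 1e-9:
                dirs.append(math.degrees(math.atan2(dy, dx)) % 360)
    return sorted(dirs)


def triplet_speed(offset, dt_kj):
    """Speed (pix/s) of a triplet whose 2nd event is at `offset` (pixels)
    from the incoming event and whose 3rd event is at 2 * offset, where
    dt_kj = t_k - t_j > 0 is the time between the 3rd and incoming events."""
    return 2.0 * math.hypot(*offset) / dt_kj
```

`neighbor_directions()` yields eight directions spaced 45° apart, whereas `triplet_speed((1, 0), dt)` varies continuously with `dt`, which mirrors the distributions in Fig. 4.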

IV. CONCLUSION
We proposed a novel event-based optical flow estimation scheme based on triplet matching, inspired by motion estimation models in neuroscience. The experiments demonstrate that it is fast (> 10 kHz) on standard CPUs while providing results comparable to those of prior batch-based algorithms. We hope that our work opens the door to real-time, realistic, incremental motion estimation methods and event-camera applications on resource-constrained devices.

Fig. 4: Effect of pixel quantization on ECD data. In the top row motion is dominantly horizontal, whereas in the bottom row it is vertical, as can be seen by the thickness of the edges (left) and the velocity distributions (right).

TABLE I: Complexity of algorithms, for batch estimation and event-by-event estimation.
Ours: $O(N_e^2 \log N_e)$ (batch), $O(N_e \log N_e)$ (event-by-event)
Table II reports flow estimation accuracy results on the MVSEC benchmark. The top part of the table reports results for $\Delta t = 1$, and the bottom part reports results for $\Delta t = 4$. The methods in the table are categorized as unsupervised learning-based (USL), i.e., using a Deep Neural Network (DNN) on grid-converted events, and model-based (MB). Results for $\Delta t = 1$ are comprehensive, with our method in the middle accuracy range among all methods. Results for $\Delta t = 4$ are not as complete because the literature does not report them (especially for most model-based methods). While a thorough comparison for $\Delta t = 4$ is difficult, our error is roughly four times bigger than for $\Delta t = 1$.