
Secrets of Event-Based Optical Flow, Depth and Ego-Motion Estimation by Contrast Maximization



Abstract:

Event cameras respond to scene dynamics and provide signals naturally suitable for motion estimation with advantages, such as high dynamic range. The emerging field of event-based vision motivates a revisit of fundamental computer vision tasks related to motion, such as optical flow and depth estimation. However, state-of-the-art event-based optical flow methods tend to originate in frame-based deep-learning methods, which require several adaptations (data conversion, loss function, etc.) as they have very different properties. We develop a principled method to extend the Contrast Maximization framework to estimate dense optical flow, depth, and ego-motion from events alone. The proposed method sensibly models the space-time properties of event data and tackles the event alignment problem. It designs the objective function to prevent overfitting, deals better with occlusions, and improves convergence using a multi-scale approach. With these key elements, our method ranks first among unsupervised methods on the MVSEC benchmark and is competitive on the DSEC benchmark. Moreover, it allows us to simultaneously estimate dense depth and ego-motion, exposes the limitations of current flow benchmarks, and produces remarkable results when it is transferred to unsupervised learning settings. Along with various downstream applications shown, we hope the proposed method becomes a cornerstone on event-based motion-related tasks.
Page(s): 7742 - 7759
Date of Publication: 02 May 2024

PubMed ID: 38696288

SECTION I.

Introduction

Event cameras are novel bio-inspired vision sensors that naturally respond to the motion of edges in image space with high dynamic range (HDR) and minimal blur at high temporal resolution, on the order of \mu\text{s} [1]. These advantages provide a rich signal for accurate motion estimation in real-world scenarios that are difficult for frame-based cameras. However, such a signal is asynchronous and sparse by nature, hence incompatible with traditional computer vision algorithms. This poses the challenge of rethinking visual processing [2], [3]: motion patterns (i.e., optical flow) are no longer obtained by analyzing the intensities of images captured at regular intervals, but by analyzing the stream of per-pixel brightness changes produced by the event camera.

Multiple methods have been proposed for event-based optical flow estimation. They can be broadly categorized into two groups: (i) model-based methods, which investigate the principles and characteristics of event data that enable optical flow estimation, and (ii) learning-based methods, which exploit correlations in the data and/or apply the above-mentioned principles to compute optical flow. One of the challenges of event-based optical flow is the lack of ground truth flow in real-world datasets (at \mu\text{s} resolution and HDR) [2], which makes it difficult to evaluate and compare the methods properly, and to train supervised-learning ones. Ground truth (GT) in de facto standard datasets [4], [5] is obtained from the motion field [6] given additional depth sensors and camera motion. However, such data is limited by the field-of-view (FOV) and resolution (spatial and temporal) of the depth sensor, which do not match those of event cameras. Hence, it is paramount to develop interpretable optical flow methods that exploit the characteristics of event data and that do not need costly and error-prone ground truth.

Among prior work, Contrast Maximization (CM) [7], [8] is a powerful framework that allows us to tackle multiple motion estimation problems (rotational motion [9], [10], [11], [12], homographic motion [7], [13], [14], feature flow estimation [15], [16], [17], [18], motion segmentation [19], [20], [21], [22], and also reconstruction [7], [23], [24]). It maximizes an objective function (e.g., contrast) that measures the alignment of events caused by the same scene edge. The intuitive interpretation is to estimate the motion by recovering the sharp (motion-compensated) image of edge patterns that caused the events. Preliminary work on applying CM to estimate optical flow has reported event collapse [25], [26], producing flows at undesired optima that warp events to few pixels or lines [27]. This issue has been tackled by changing the objective function, from contrast to the energy of an average timestamp image [27], [28], but this loss is not straightforward to interpret [8], [29], and is not without its problems [30].

The state-of-the-art performance of CM on low degrees-of-freedom (DOF) motion estimation tasks, together with its issues on more complex motions (dense flow), suggests that prior work may have rushed to use CM in unsupervised learning of dense flow. There is a gap in understanding how CM can be sensibly extended to estimate dense optical flow accurately. This paper fills this gap and shows a few “secrets” that are also applicable to overcoming the issues of previous approaches.

We propose to extend CM for dense optical flow estimation via a tile-based approach covering the image plane (Fig. 1). We present several distinctive contributions:

  1. A multi-reference focus loss function to improve accuracy and discourage overfitting (Section III-B).

    Fig. 1. DSEC test sequences (interlaken_00b, thun_01a) [5]. Our optical flow estimation method produces sharp images of warped events (IWE) despite the scene complexity, the large pixel displacement and the high dynamic range.

  2. A principled time-aware flow to better handle occlusions, leveraging the solution of transport problems via differential equations (Section III-C).

  3. A multi-scale approach to improve convergence to the solution and avoid getting trapped in local optima (Section III-D).

Optical flow is a fundamental visual quantity related to many others, such as camera motion and scene depth. Hence, in this paper we exploit these connections, in monocular and stereo configurations, and show how a dense flow can serve to tackle various related problems in event-based vision, such as depth estimation, motion segmentation, etc. (Fig. 2). This paper is based on our previous work [31], which we substantially extend in the following points:

  1. We introduce a new objective function that improves both flow and depth estimation (Section III-B1).

    Fig. 2. Overview. The proposed method solely relies on event data. It not only estimates optical flow, but can also estimate scene depth and ego-motion simultaneously from a monocular or stereo event camera setup. Furthermore, the estimated flow enables various downstream applications such as motion segmentation, intensity reconstruction and event denoising.

  2. We tackle stationary scenes, estimating monocular depth and ego-motion jointly (Sections III-F1 and IV-D).

  3. We also address the stereo setup (Sections III-F2 and IV-E).

  4. We discuss current optical flow benchmarks, evaluations and “GT” flow (Section IV-B5).

  5. We provide experiments on downstream applications of optical flow: motion segmentation, intensity reconstruction, and denoising (Section IV-C).

  6. We show experiments on 1Mpixel event cameras, the most recent event camera datasets: TUM-VIE [32] and M3ED [33], both in flow (Section IV-B4) and depth estimation (Section IV-D3).

  7. We extend the discussion on computational performance and limitations (Sections VI and VII).

The results of our experimental evaluation are surprising: the above design choices are key to our simple, model-based tile-based method achieving the best accuracy among all state-of-the-art methods, including supervised-learning ones, on the de facto benchmark of MVSEC indoor sequences [34]. Since our method is interpretable and produces better event alignment than the ground truth flow, both qualitatively and quantitatively, the experiments also expose the limitations of the current “ground truth”. The experiments demonstrate that the above key choices are transferable to unsupervised learning methods, thus guiding future design and understanding of more proficient Artificial Neural Networks (ANNs) for event-based optical flow estimation. Finally, the method allows us to solve many motion-related applications, thus becoming a cornerstone in event-based vision.

Because of the above, we believe that the proposed design choices deserve to be called “secrets” [35]. To the best of our knowledge, they are novel in the context of event-based optical flow, depth and ego-motion estimation, e.g., no prior work considers constant flow along its characteristic lines, designs the multi-reference focus loss to tackle overfitting, or has defined multi-scale (i.e., multi-resolution) contrast maximization on the raw events.

SECTION II.

Related Work

A. Event-Based Optical Flow Estimation

Given the identified advantages of event cameras to estimate optical flow, extensive research on this topic has been carried out. Prior work has proposed adaptations of frame-based approaches (block matching [36], Lucas-Kanade [37]), filter-banks [38], [39], spatio-temporal plane-fitting [40], [41], time surface matching [42], variational optimization on voxelized events [43], and feature-based contrast maximization [7], [15]. For a detailed survey, we refer to [2].

Current state-of-the-art approaches are ANNs [27], [30], [34], [44], [45], [46], largely inspired by frame-based optical flow architectures [47], [48]. Non-spiking–based approaches need to additionally adapt the input signal, converting the events into a tensor representation (event frames, voxel grids, etc.). These learning-based methods can be classified into supervised, semi-supervised, or unsupervised (see Table I). In terms of architectures, the three most common ones are U-Net [34], [49], FireNet [28], and RAFT [44], [50].

TABLE I Results on MVSEC Dataset [34]

Supervised methods train ANNs in simulation and/or on real data [44], [49], [50], [51], [52], [53], [54]. This requires accurate GT flow that matches the space-time resolution of event cameras. While this is not a problem in simulation, it incurs a performance gap when trained models are used to predict flow on real data, often due to a large domain gap between training and test data [52], [55]. Besides, real-world datasets have issues in providing accurate GT flow.

Semi-supervised methods use the grayscale images from a colocated camera (e.g., DAVIS [56]) as a supervisory signal: images are warped using the flow predicted by the ANN and their photometric consistency is used as loss function [34], [45], [46]. While such supervisory signal is easier to obtain than real-world GT flow, it may suffer from the limitations of frame-based cameras (e.g., motion blur and low dynamic range), consequently affecting the trained ANNs. EV-FlowNet [34] pioneered these approaches.

Unsupervised methods rely solely on event data. Their loss function consists of an event alignment error using the flow predicted by the ANN [27], [28], [30], [57], [58], [59]. Zhu et al. [27] extended EV-FlowNet [34] to the unsupervised setting using a motion-compensation loss inspired by the average timestamp images in [19]. This U-Net–like approach has been improved with recurrent blocks in [28], [30]. Paredes-Vallés et al. [28] also proposed FireFlowNet, a lightweight recurrent ANN with no downsampling. More recently, [30] has proposed several variants of EV-FlowNet and FireFlowNet models, and, enabled by the recurrent blocks, has replaced the usual voxel-grid input event representation by sequentially processing short-time event frames. Finally, concurrent work [59] builds upon [30] (sequential processing of event frames), proposing iterative event warping at multiple reference times in a multi-timescale fashion, which allows curved motion trajectories.

B. Event-Based Depth and Ego-Motion Estimation

Having estimated optical flow, one could try to fit, a posteriori, a depth map and camera ego-motion consistent with the flow [60]. Instead, it is better to incorporate the assumption of a still scene and a moving camera into the parameterization of the flow using the motion field equation [6]. While this connection exists, the topic of joint ego-motion and dense depth estimation via the motion field is not as explored as optical flow estimation. The problem is difficult, and often one settles for estimating depth alone, with or without knowledge of the camera motion [23], [61], [62].

Closest to our work are [27], [57] because they estimate a depth-parameterized motion field that best fits the event data. They do so by training ANNs in an unsupervised way. The loss functions are based on the energy of an average timestamp image [27] or on the photometric consistency of edge-maps warped by the predicted flow [57].

Similar to the above-mentioned unsupervised-learning works, our method produces dense optical flow and/or depth and does not need ground truth or additional supervisory signals. In contrast to prior work, we adopt a more classical modeling perspective to gain insights into the problem and discover principled solutions that can subsequently be applied to the learning-based setting. Stemming from an accurate and spatially-dependent contrast loss (the gradient magnitude [8]), we model the problem using a tile of patches (in flow or depth parameters) and propose solutions to several problems: overfitting, occlusions, and convergence. To the best of our knowledge, (i) no prior work has proposed to estimate dense optical flow and/or dense depth from a CM model-based perspective, and (ii) no prior unsupervised learning approach based on motion compensation has succeeded in estimating optical flow without the average timestamp image loss. The latter may be due to event collapse [25], but given recent advances on overcoming this issue [31], we show it is possible to succeed.

SECTION III.

Method

In this section, first we briefly revisit the Contrast Maximization framework (Section III-A). Then, the proposed methods are explained in detail: Section III-B proposes the new data fidelity term of the objective function, which discourages event collapse. Section III-C proposes a principled model for optical flow that considers the space-time nature of events. We also explain the multi-scale parameterization of the flow (Section III-D), the composite objective function (Section III-E), and the application to the problem of depth and ego-motion estimation in monocular and stereo configurations (Section III-F).

A. Event Cameras and Contrast Maximization

Event cameras have independent pixels that operate continuously and generate “events” e_{k} \doteq (\mathbf {x}_{k},t_{k},p_{k}) whenever the logarithmic brightness at the pixel increases or decreases by a predefined amount, called contrast sensitivity. Each event e_{k} contains the pixel-time coordinates (\mathbf {x}_{k}, t_{k}) of the brightness change and its polarity p_{k} \in \lbrace +1,-1\rbrace. Events occur asynchronously and sparsely on the pixel lattice, with a variable rate that depends on the scene dynamics.

The CM framework [7] assumes events \mathcal {E}\doteq \lbrace e_{k}\rbrace _{k=1}^{N_{e}} are caused by moving edges (i.e., brightness constancy), and transforms them geometrically according to a motion model \mathbf {W}, producing a set of warped events \mathcal {E}^{\prime }_{t_\text{ref}} \doteq \lbrace e^{\prime }_{k}\rbrace _{k=1}^{N_{e}} at a reference time t_\text{ref}:
\begin{equation*} e_{k} \doteq (\mathbf {x}_{k},t_{k},p_{k}) \;\,\mapsto \;\, e^{\prime }_{k} \doteq (\mathbf {x}^{\prime }_{k},t_\text{ref},p_{k}). \tag{1} \end{equation*}
The warp \mathbf {x}^{\prime }_{k} = \mathbf {W}(\mathbf {x}_{k},t_{k}; \boldsymbol{\theta }) transports each event from t_{k} to t_\text{ref} along the motion curve that passes through it. The vector \boldsymbol{\theta } parameterizes the motion curves. Transformed events are aggregated on an image of warped events (IWE)
\begin{equation*} I(\mathbf {x}; \mathcal {E}^{\prime }_{t_\text{ref}}, \boldsymbol{\theta }) \doteq \sum _{k=1}^{N_{e}} \delta (\mathbf {x}- \mathbf {x}^{\prime }_{k}), \tag{2} \end{equation*}
where each pixel \mathbf {x} sums the number of warped events \mathbf {x}^{\prime }_{k} that fall within it. The Dirac delta is approximated by a Gaussian, \delta (\mathbf {x}-\boldsymbol{\mu })\approx \mathcal {N}(\mathbf {x};\boldsymbol{\mu },\epsilon ^{2}\mathtt {Id}) with \epsilon =1 pixel. Next, an objective function f(\boldsymbol{\theta }) is computed, such as the contrast of the IWE (2), given by the variance
\begin{equation*} \operatorname{Var}\bigl (I(\mathbf {x};\boldsymbol{\theta })\bigr ) \doteq \frac{1}{|\Omega |} \int _{\Omega } \bigl (I(\mathbf {x};\boldsymbol{\theta })-\mu _{I}\bigr )^{2} d\mathbf {x}, \tag{3} \end{equation*}
with mean \mu _{I} \doteq \frac{1}{|\Omega |} \int _{\Omega } I(\mathbf {x};\boldsymbol{\theta }) d\mathbf {x}. The objective function measures the goodness of fit between the events and the candidate motion curves (warp). Finally, an optimization algorithm iterates the above steps until convergence. The goal is to find the motion parameters that maximize the alignment of events caused by the same scene edge. Event alignment is measured by the strength of the edges of the IWE, which is directly related to image contrast [8].
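
To make the pipeline concrete, the following is a minimal NumPy sketch of one objective evaluation for a global (2-DOF) flow: events are warped to t_\text{ref} with a candidate velocity, accumulated into an IWE (2) with bilinear voting (a common approximation of the smoothed delta), and scored with the variance (3). The function and variable names (warp_events, build_iwe, variance_contrast) are illustrative, not taken from the authors' implementation.

```python
import numpy as np

def warp_events(xy, t, t_ref, v):
    """Warp events to t_ref with a single 2-DOF flow v (cf. (1) and (4))."""
    return xy + (t - t_ref)[:, None] * v[None, :]

def build_iwe(xy_warped, height, width):
    """Image of warped events (2); bilinear voting replaces the Gaussian delta."""
    iwe = np.zeros((height, width))
    x, y = xy_warped[:, 0], xy_warped[:, 1]
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    for dx in (0, 1):
        for dy in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            w = (1 - np.abs(x - xi)) * (1 - np.abs(y - yi))
            valid = (xi >= 0) & (xi < width) & (yi >= 0) & (yi < height)
            np.add.at(iwe, (yi[valid], xi[valid]), w[valid])
    return iwe

def variance_contrast(iwe):
    """Contrast objective (3): variance of the IWE."""
    return np.mean((iwe - iwe.mean()) ** 2)

# Example usage with hypothetical data: xy is (N, 2) pixel coordinates, t is (N,)
# timestamps, v is a (2,) candidate flow in pixels per unit time.
# f = variance_contrast(build_iwe(warp_events(xy, t, t.min(), v), H, W))
```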

For dense optical flow motion, the warp used is [27], [28]
\begin{equation*} \mathbf {x}^{\prime }_{k} = \mathbf {x}_{k} + (t_{k}-t_\text{ref}) \, \mathbf {v}(\mathbf {x}_{k}), \tag{4} \end{equation*}
where \boldsymbol{\theta }= \lbrace \mathbf {v}(\mathbf {x}) \rbrace _{\mathbf {x}\in \Omega } is a flow field on the image plane \Omega at a set time, e.g., t_\text{ref}.

B. Multi-Reference Focus Objective Function

Zhu et al. [27] report that the contrast objective (variance) overfits to the events. This is in part because the warp (4) can describe very complex flow fields, which can push the events to accumulate in few pixels (i.e., event collapse [25], [26]). To mitigate event collapse, we reduce the complexity of the flow field by dividing the image plane into a tile of non-overlapping patches, defining a flow vector at the center of each patch, and interpolating the flow on all other pixels (see Section III-D). Interpolation confers smoothness of the flow field, hence lowering complexity.

However, reducing the complexity of the estimation parameters is not enough. Additionally, we discover that warps that produce sharp IWEs at any reference time t_\text{ref} have a regularizing effect on the flow field, discouraging event collapse. This is illustrated in Fig. 3. In practice we compute the multi-reference focus loss using three reference times: t_{1} (min), t_{\text{mid}}\doteq (t_{1}+t_{N_{e}})/2 (midpoint) and t_{N_{e}} (max). For each set of events, the flow field is defined only at one reference time and then used to warp to \lbrace t_{1}, t_{\text{mid}}, t_{N_{e}}\rbrace.

Fig. 3. Multi-reference focus loss. Assume an edge moves from left to right. Flow estimation with a single reference time (t_{1}) can warp all events into a single pixel, which results in maximum contrast (at t_{1}). However, the same flow would produce low contrast (i.e., a blurry image) if events were warped to time t_{N_{e}}. Instead, we favor flow fields that produce high contrast (i.e., sharp images) at any reference time (here, t_\text{ref}= t_{1} and t_\text{ref}= t_{N_{e}}). See also results in Fig. 20.

Letting G be the objective function at a single reference time (e.g., (3)), the proposed multi-reference focus objective function is the average of the G functions
\begin{equation*} f(\boldsymbol{\theta }) \doteq \bigl (G(\boldsymbol{\theta }; t_{1}) + 2G(\boldsymbol{\theta }; t_{\text{mid}}) + G(\boldsymbol{\theta }; t_{N_{e}})\bigr ) \,/\, 4 G(\mathbf {0}; -), \tag{5} \end{equation*}
normalized by the value of the G function with zero flow (identity warp): G(\mathbf {0}; -). We could choose different convex combinations of normalized G functions and different reference times, but the proposed combination (5) works well in practice. The normalization in (5) provides the same interpretation as the Flow Warp Loss (FWL) [52]: f< 1 implies the flow is worse than the zero-flow baseline, whereas f> 1 means that the flow produces sharper IWEs than the baseline. Such an interpretation is beneficial for model-based and unsupervised-learning methods.
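
The combination (5) is easy to compute once a single-reference focus function is available. Below is a minimal sketch, assuming a callable G(theta, t_ref) that warps the events to t_ref and returns a focus score such as (3) or (6), and a precomputed score G_identity for the zero-flow (identity) warp; the weights follow (5).

```python
def multi_reference_loss(G, theta, t1, t_mid, t_Ne, G_identity):
    """Multi-reference focus objective (5), normalized by the zero-flow score.
    f < 1: worse than zero flow; f > 1: sharper IWEs than zero flow (FWL-like reading)."""
    return (G(theta, t1) + 2.0 * G(theta, t_mid) + G(theta, t_Ne)) / (4.0 * G_identity)
```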

Remark: Warping to two reference times (min and max) was proposed in [27], but with important differences: (i) it was done for the average timestamp loss, hence it did not consider the effect on contrast or focus functions [8], and (ii) it had a completely different motivation: to lessen a back-propagation scaling problem, so that the gradients of the loss would not favor events far from t_\text{ref}.

1) Objective Functions Based on the IWE Gradient

Among the contrast functions proposed in [7], [8], we use two functions based on the gradient of the IWE:
\begin{equation*} G(\boldsymbol{\theta }; t_\text{ref}) \doteq \frac{1}{|\Omega |} \int _{\Omega } \Vert \nabla I(\mathbf {x}; t_\text{ref})\Vert ^{q}\,d\mathbf {x}, \tag{6} \end{equation*}
with q = 1 (the L^{1} norm) and q = 2 (the squared L^{2} norm). Both functions have the following desired properties: (i) they are sensitive to the arrangement (i.e., permutation) of the IWE pixel values, whereas the variance of the IWE (3) is not, (ii) they have top accuracy performance and converge more easily than other objectives we tested, and (iii) they differ from the FWL [52], which is defined using the variance (3) and will be used for evaluation. The two proposed functions have different sensitivities to the number of accumulated events in the IWE, which affects estimation accuracy, especially when the scene has large variations in the number of events per pixel (e.g., scenes with varying depth). We find that using L^{1} improves the results of L^{2} [31] in most cases, as we show in Section IV.
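
A sketch of (6) with NumPy central differences is given below; q=1 gives the L^{1} variant and q=2 the squared-L^{2} variant. The discretization of the gradient is an assumption for illustration.

```python
import numpy as np

def gradient_magnitude_loss(iwe, q=1):
    """Objective (6): mean of ||grad I||^q over the image (q = 1 or q = 2)."""
    gy, gx = np.gradient(iwe)           # central differences along rows / columns
    mag = np.sqrt(gx ** 2 + gy ** 2)
    return np.mean(mag ** q)
```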

C. Time-Aware Flow

State-of-the-art event-based optical flow approaches are based on frame-based ones, and so they use the warp (4), which defines the flow \mathbf {v}(\mathbf {x}) as a function of \mathbf {x} (i.e., a pixel displacement between two given frames). However, this does not take into account the space-time nature of events, which is the basis of CM: not all events at a pixel \mathbf {x}_{0} are triggered at the same timestamp t_{k}, so they need not be warped with the same velocity \mathbf {v}(\mathbf {x}_{0}). Fig. 4 illustrates this with an occlusion example taken from the slider_depth sequence [63]. Instead of \mathbf {v}(\mathbf {x}), the event-based flow should be a function of space-time, \mathbf {v}(\mathbf {x},t), i.e., time-aware, and each event e_{k} should be warped according to the flow value at (\mathbf {x}_{k},t_{k}). Let us propose a more principled warp than (4).

Fig. 4. Time-aware flow. Traditional flow (4), inherited from the frame-based one, assumes per-pixel constant flow \mathbf {v}(\mathbf {x}) = \text{const}, which cannot handle occlusions properly. The proposed space-time flow assumes constancy along streamlines, \mathbf {v}(\mathbf {x}(t),t) = \text{const}, which allows us to handle occlusions more accurately. (See results in Figs. 21 and 24).

To define a space-time flow \mathbf {v}(\mathbf {x},t) that is compatible with the propagation of events along motion curves, we are inspired by the method of characteristics [64]. Mimicking the mainstream assumption about brightness being constant along the true motion curves in image space, we assume the flow is constant along its streamlines: \mathbf {v}(\mathbf {x}(t),t) = \text{const} (Fig. 4). Differentiating in time and applying the chain rule gives a system of partial differential equations (PDEs)
\begin{equation*} \frac{\partial \mathbf {v}}{\partial \mathbf {x}} \frac{d\mathbf {x}}{dt} + \frac{\partial \mathbf {v}}{\partial t} = \mathbf {0}, \tag{7} \end{equation*}
where, as usual, \mathbf {v}= d\mathbf {x}/dt is the flow. The boundary condition is given by the flow at, say, t=0: \mathbf {v}(\mathbf {x},0) = \mathbf {v}^{0}(\mathbf {x}). This system of PDEs states how to propagate (i.e., transport) a given flow \mathbf {v}^{0}(\mathbf {x}) from the boundary t=0 to the rest of space-time. The PDEs have advection terms and others that resemble those of the inviscid Burgers' equation [64], since the flow is transporting itself. We parameterize the flow at t=t_{\text{mid}} (boundary condition), and propagate it to the volume that encloses the current set of events \mathcal {E}. We develop two explicit methods to solve the PDEs, one with upwind differences and one with a conservative scheme adapted to the Burgers' terms [65]. Each event e_{k} is then warped according to a flow \hat{\mathbf {v}} given by the solution of the PDEs at (\mathbf {x}_{k},t_{k}):
\begin{equation*} \mathbf {x}^{\prime }_{k} = \mathbf {x}_{k} + (t_{k}-t_\text{ref}) \, \hat{\mathbf {v}}(\mathbf {x}_{k},t_{k}). \tag{8} \end{equation*}
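
As an illustration of the upwind option, the sketch below performs one explicit first-order upwind step of (7), transporting the flow field (u, v) by itself from one time bin to the next. It assumes a unit pixel grid and periodic boundaries (via np.roll) for brevity, and does not reproduce the authors' Burgers-type conservative scheme; the time step dt must satisfy the usual CFL stability restriction.

```python
import numpy as np

def upwind_step(u, v, dt):
    """One explicit upwind step of (7): advect the flow (u, v) by itself.
    u, v: (H, W) arrays with the x- and y-components of the flow (px per unit time)."""
    def d_upwind(f, a, axis):
        # Upwind difference of f along `axis`, selected by the sign of the advecting speed a.
        fwd = np.roll(f, -1, axis=axis) - f      # forward difference
        bwd = f - np.roll(f, 1, axis=axis)       # backward difference
        return np.where(a > 0, bwd, fwd)
    u_new = u - dt * (u * d_upwind(u, u, axis=1) + v * d_upwind(u, v, axis=0))
    v_new = v - dt * (u * d_upwind(v, u, axis=1) + v * d_upwind(v, v, axis=0))
    return u_new, v_new

# The flow defined at t_mid is propagated through a few bins on each side; each event is
# then warped with the flow sampled at its own (x_k, t_k), as in (8).
```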

D. Multi-Scale Flow Parameterization

Inspired by classical estimation methods, we combine our tile-based approach with a multi-scale strategy. The goal is to improve the convergence of the optimizer in terms of speed and robustness (i.e., avoiding local optima).

Some learning-based works [27], [28], [34] also have a multi-scale component, inherited from the use of a U-Net architecture. However, they work on discretized event representations (voxel grid, etc.) to be compatible with DNNs. In contrast, our tile-based approach works directly on raw events, without discarding or quantizing the temporal information in the event stream.

Our multi-scale CM approach is illustrated in Fig. 5. For an event set \mathcal {E}_{i}, we apply the tile-based CM in a coarse-to-fine manner (e.g., N_{\ell } = 5 scales). There are 2^{l - 1} \times 2^{l - 1} tiles at the lth scale. We use bilinear interpolation to upscale between any two scales. If there is a subsequent set \mathcal {E}_{i+1}, the flow estimated from \mathcal {E}_{i} is used to initialize the flow for \mathcal {E}_{i+1}. This is done by downsampling the finest flow to coarser scales. The coarsest scale initializes the flow for \mathcal {E}_{i+1}. For finer scales, initialization is computed as the average of the upsampled flow from the coarser scale of \mathcal {E}_{i+1} and the same-scale flow from \mathcal {E}_{i}.
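
A sketch of the coarse-to-fine loop is shown below. It assumes a routine optimize_tiles(events, flow_init) that runs the tile-based CM optimization at a fixed tile resolution and returns the refined tile flow; bilinear upscaling between scales is done with scipy.ndimage.zoom. Names and shapes are illustrative.

```python
import numpy as np
from scipy.ndimage import zoom

def coarse_to_fine(events, optimize_tiles, num_scales=5):
    """Tile-based CM from 1 x 1 up to 2^(L-1) x 2^(L-1) tiles (Section III-D)."""
    flow = np.zeros((1, 1, 2))                    # coarsest scale: a single tile
    for level in range(1, num_scales + 1):
        n = 2 ** (level - 1)                      # n x n tiles at the current level
        if flow.shape[0] != n:                    # bilinear upscale of the previous result
            factor = n / flow.shape[0]
            flow = zoom(flow, (factor, factor, 1), order=1)
        flow = optimize_tiles(events, flow)       # refine at the current tile resolution
    return flow
```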

Fig. 5. Multi-scale approach using tiles (rectangles) and raw events. (See results in Fig. 22).

E. Composite Objective Function

To encourage additional smoothness of the flow, even in regions with few events, we include a flow regularizer \mathcal {R}(\boldsymbol{\theta }). The flow is obtained as the solution to the problem
\begin{equation*} \boldsymbol{\theta }^{\ast } = \arg \min _{\boldsymbol{\theta }} \left(\frac{1}{f(\boldsymbol{\theta })} + \lambda \mathcal {R}(\boldsymbol{\theta })\right), \tag{9} \end{equation*}
where \lambda > 0 is the regularizer weight, and we use the total variation (TV) [66] as the regularizer. We use 1/f instead of -f because it is convenient for ANN training (Section IV-B3).
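
A minimal sketch of (9) is shown below, using an anisotropic total variation of the tile flow as the regularizer (the exact TV variant is an implementation choice) and assuming a callable focus_fn that returns the multi-reference focus value (5) for a candidate flow.

```python
import numpy as np

def total_variation(flow):
    """Anisotropic TV of a (H, W, 2) flow field: sum of absolute spatial differences."""
    return np.abs(np.diff(flow, axis=1)).sum() + np.abs(np.diff(flow, axis=0)).sum()

def composite_objective(theta, events, focus_fn, lam=0.0025):
    """Objective (9): 1 / f(theta) + lambda * R(theta)."""
    return 1.0 / focus_fn(theta, events) + lam * total_variation(theta)
```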

F. Depth and Ego-Motion Estimation

1) Monocular

For a still scene and a moving camera, the motion induced on the image plane has fewer DOFs than the most general case considered so far. In this scenario, it is beneficial to parameterize the optical flow in terms of the scene depth Z(\mathbf {x}) and the camera motion (linear velocity \mathbf {V} and angular velocity \boldsymbol{\omega }) via the well-known motion field equation [6]
\begin{equation*} \mathbf {v}(\mathbf {x}) = \frac{1}{Z(\mathbf {x})}A(\mathbf {x})\mathbf {V}+ B(\mathbf {x})\boldsymbol{\omega }, \tag{10} \end{equation*}
where the 2\times 3 matrices A(\mathbf {x}) and B(\mathbf {x}) depend solely on the pixel coordinate. Substituting (10) into (4) or (8) and using it to warp events means that the contrast is now maximized with respect to the depth and camera motion parameters, while the flow \mathbf {v} acts as an intermediate variable.

Similarly to Section III-D, we parameterize the depth Z(\mathbf {x}) using a tile of patches, which results in 6 + N_\text{patch} DOFs (instead of 2N_\text{patch} DOFs). By doing this, we not only reduce the complexity of the estimation but also demonstrate the extensibility of the proposed method to the simultaneous estimation of ego-motion and dense depth. Note that parameters Z(\mathbf {x}) and \mathbf {V} appear in a product in (10), hence there is a scale ambiguity (typical of monocular setups). Furthermore, we apply an exponential parameterization \rho \mapsto Z\doteq e^{a\rho + b} to avoid negative depth predictions. To mitigate isolated patches with very large depth values we apply median filters [35] and a Charbonnier loss [67] for regularization.
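
The sketch below evaluates the motion-field flow (10) at a single pixel in calibrated (normalized) coordinates, together with the exponential depth parameterization Z = e^{a\rho + b} that keeps depth positive. One common sign convention for A(\mathbf {x}) and B(\mathbf {x}) is assumed; conventions differ across references, so this is illustrative rather than the authors' exact implementation.

```python
import numpy as np

def motion_field_flow(x, y, rho, V, omega, a=1.0, b=0.0):
    """Flow (10) at normalized image coordinates (x, y) for a still scene, moving camera.
    rho parameterizes depth through Z = exp(a * rho + b), so Z is always positive."""
    Z = np.exp(a * rho + b)
    A = np.array([[-1.0, 0.0, x],
                  [0.0, -1.0, y]])                  # 2x3, depends only on the pixel
    B = np.array([[x * y, -(1.0 + x * x), y],
                  [1.0 + y * y, -x * y, -x]])       # 2x3, depends only on the pixel
    return (A @ V) / Z + B @ omega                  # 2-vector: optical flow at (x, y)
```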

Note that the motion field parameterization (10) is not supposed to handle independently moving objects (IMOs), although it is effective in many event-based optical flow benchmarks (e.g., Sections IV-B1 and IV-B2). We discuss the validity and the limitations of optical flow benchmarking in Section IV-B5, as well as the comprehensive results in Section IV-D.

2) Stereo

The proposed method can be extended to stereo configurations. Parameterizing scene depth and ego-motion on the left camera and using the extrinsic parameters of the stereo setup, we can compute the depth and the motion on the right camera (e.g., by warping the left depth map onto the right camera using nearest-neighbor interpolation). Having depth and ego-motion on each camera, we define the objective function as the sum
\begin{equation*} \boldsymbol{\theta }^{\ast } = \arg \min _{\boldsymbol{\theta }} \left(\frac{1}{f_{\text{l}}(\boldsymbol{\theta })} + \lambda \mathcal {R}_{\text{l}}(\boldsymbol{\theta }) + \frac{1}{f_{\text{r}}(\boldsymbol{\theta })} + \lambda \mathcal {R}_{\text{r}}(\boldsymbol{\theta }) \right), \tag{11} \end{equation*}
where the parameters \boldsymbol{\theta } are only those of the left camera.

In prior works of stereo depth estimation [68], one of the main challenges is how to find correspondences between event streams from multiple cameras. This is a non-trivial problem and is prone to event noise. The proposed method bypasses the event-to-event correspondence problem by parameterizing the depth densely on the whole image plane of one camera and transferring it to the other camera.

Summarizing Remark: All the proposals in Section III are formulated in the form of an optimization problem, and they are theoretically extensible to learning-based approaches (DNNs), since they are fully differentiable. We will show an example of the learning-based flow estimation in Section IV-B3. Hence, our work provides model-based approaches that can act as baselines for the development of learning-based methods in the context of event-based optical flow, monocular depth, ego-motion, and stereo depth estimation problems.

SECTION IV.

Experiments

We assess the performance of our method on seven datasets, which are presented in Section IV-A. We provide a comprehensive evaluation of optical flow estimation in Section IV-B. Additionally, we demonstrate the learning-based extension (DNN) (Section IV-B3), discuss current optical flow benchmarks (Section IV-B5), and show downstream applications (Section IV-C). The results of depth and ego-motion estimation are presented in Section IV-D (monocular) and Section IV-E (stereo).

A. Datasets, Metrics and Hyper-Parameters

The proposed method works robustly on data comprising different camera motions, scenes, and spatial resolutions. We conduct experiments on the following seven datasets.

Datasets: First, we evaluate our method on sequences from the MVSEC dataset [4], [34], which is the de facto standard dataset used by prior works to benchmark optical flow. The dataset contains sequences recorded indoors with a drone, and outdoors with a car. It provides events, grayscale frames, IMU data, camera poses, and scene depth from a LiDAR [4]. The dataset was extended in [34] to provide ground truth (GT) optical flow, computed as the motion field [6] given the camera velocity and the depth of the scene. Notice that the indoor sequences do not have IMOs, and the outdoor sequences do not include scenes with IMOs in the benchmark evaluation. The event camera has 346 \times 260 pixel resolution [56]. In total, we evaluate on 63.5 million events spanning 265 seconds. We quantitatively and qualitatively show results on flow, depth, and ego-motion estimation.

We also evaluate on a recent dataset that provides ground truth flow: DSEC [44]. It consists of sequences recorded with Prophesee Gen3 event cameras (stereo), of higher resolution (640 \times 480 pixels), mounted on a car. Optical flow is also computed as the motion field, with the scene depth given by a LiDAR. The flow benchmark contains scenes with IMOs, but performance is assessed only in non-IMO pixels (where the GT from the motion field is valid). In total, we evaluate on 3 billion events spanning the 208 s of the test sequences. We quantitatively/qualitatively show results of flow and stereo depth estimation.

Additionally, we carry out experiments on two HD resolution event camera datasets, TUM-VIE [32] and M3ED [33], recorded with stereo Prophesee Gen4 event cameras (1280 \times 720 pixels, i.e., 1 Mpixel). The TUM-VIE dataset consists of indoor and outdoor sequences recorded with the sensor rig mounted on a helmet. In the M3ED dataset the sensor rig is mounted on a car (outdoor), a quadruped robot (outdoor), and a drone (indoor and outdoor). We show qualitative results for the flow and depth estimation since the GT data for M3ED is not available at submission time.

The ECD dataset [63] is a lower resolution, standard dataset to assess camera ego-motion [9], [16], [25], [69], [70], [71], [72]. Each sequence provides events, frames, calibration information, and IMU data from a DAVIS240 C camera (240 \times 180 pixels [73]), as well as ground truth camera poses from a motion capture system (at 200 Hz). We use slider_depth and simulation_3planes sequences for depth and ego-motion estimation. In the first sequence the event camera moves along a motorized linear slider, recording objects at different depths. The second sequence is synthetic with a circular camera trajectory; since it provides ground truth depth, we report quantitative metrics for depth and ego-motion estimation accuracy. In total, we evaluate on 1.1 million events (3 s) of the slider sequence and on 6.8 million events (2 s) of the simulation sequence.

Finally, we also test sequences from two motion segmentation datasets [20], [21]. The sequences in EMSMC [20] are recorded using a hand-held DAVIS240 C camera (240 \times 180 pixels). The sequences in EMSGC [21] are recorded with a hand-held DAVIS346 camera (346 \times 260 pixels). Both datasets consist of small camera motions and several IMOs in the scene. We demonstrate qualitative results of flow estimation and its application to motion segmentation.

Evaluation Metrics: The metrics used to assess optical flow accuracy are the average endpoint error (AEE), the angular error (AE), and the percentage of pixels with \text{AEE}> 3 pixels (denoted by “% Out”), all measured over pixels with valid GT and at least one event in the evaluation interval. We also use the FWL metric (the IWE variance relative to that of the identity warp) to assess event alignment [52].
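
For reference, the sketch below computes AEE, AE and %Out over a validity mask. The angular error is written here in the Middlebury style, on homogeneous (u, v, 1) vectors, which is one common convention; the exact definition used by each benchmark may differ slightly.

```python
import numpy as np

def flow_errors(pred, gt, valid):
    """AEE (px), AE (deg, on (u, v, 1) vectors) and %Out (AEE > 3 px) over a mask
    of pixels with valid GT and at least one event."""
    epe = np.linalg.norm(pred - gt, axis=-1)[valid]
    aee = epe.mean()
    pct_out = 100.0 * np.mean(epe > 3.0)
    num = 1.0 + np.sum(pred * gt, axis=-1)
    den = np.sqrt(1.0 + np.sum(pred ** 2, axis=-1)) * np.sqrt(1.0 + np.sum(gt ** 2, axis=-1))
    ae = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))[valid].mean()
    return aee, ae, pct_out
```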

For depth accuracy evaluation, we use standard metrics following previous work on monocular depth estimation [57], [74]. The depth error metrics are SiLog, Absolute Relative Difference (denoted by “AbsRelDiff”), and the logarithmic RMSE (“logRMSE”). While SiLog is scale-invariant, we substitute the prediction using the mean of the GT for AbsRelDiff and logRMSE. We furthermore report depth accuracy metrics that compute the percentage of pixels whose relative depth with respect to GT is smaller than a threshold. We use three common thresholds: \delta < \lbrace 1.25, 1.25^{2}, 1.25^{3}\rbrace, denoted by “A1”, “A2” and “A3”, respectively.
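
The depth metrics can be sketched as follows; SiLog is written here as the scale-invariant log error with unit weighting, which is one of several conventions in the literature, and the scale alignment by the GT mean mentioned above is assumed to have been applied to `pred` beforehand.

```python
import numpy as np

def depth_metrics(pred, gt, valid):
    """SiLog, AbsRelDiff, logRMSE and the A1/A2/A3 threshold accuracies."""
    p, g = pred[valid], gt[valid]
    d = np.log(p) - np.log(g)
    silog = np.mean(d ** 2) - np.mean(d) ** 2          # scale-invariant log error
    abs_rel = np.mean(np.abs(p - g) / g)
    log_rmse = np.sqrt(np.mean(d ** 2))
    ratio = np.maximum(p / g, g / p)
    a1, a2, a3 = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return silog, abs_rel, log_rmse, a1, a2, a3
```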

Hyper-parameters: For flow estimation our method uses N_{\ell }=5 resolution scales, \lambda = 0.0025 in (9), and the Newton-CG optimization algorithm with a maximum of 30 iterations per scale. The flow at t_{\text{mid}} is transported to each side via the upwind or Burgers’ PDE solver (using 5 bins for MVSEC, 40 for DSEC), and used for event warping (8) (see [31]). In the optimization, we use 30 k events for MVSEC indoor sequences, 40 k events for outdoors, 50 k events for ECD, 1.5 M events for DSEC, 1.8 M events for TUM-VIE and M3ED, and 5 k events for the motion segmentation examples.

The number of events was selected guided by the benchmarks and/or experimentally, based on the variables that affect event generation (the camera's spatial resolution, scene texture, motion, etc.) and the CM method (edges should displace enough, e.g., three pixels, see [18]). The estimated flow is scaled and aligned with the benchmark timestamps, if necessary (e.g., MVSEC). There is a trade-off: with too few events, CM does not work (the data is scarce and there is not enough displacement to produce a good objective function landscape); with too many events, the method may not produce a good fit if the constant optical flow assumption does not hold during the time span of the events.

Since the motion-field parameterization reduces the complexity of the problem, we successfully use finer scales N_{\ell }=6 for MVSEC/DSEC and N_{\ell }=7 for the 1 Mpixel datasets. By increasing the patch level in static scenes, we expect finer and better flow estimates. While we initialize depth between event packets with the same strategy as that of optical flow, we do not propagate the linear velocity to the subsequent packet in order to avoid errors when abrupt motion changes happen (e.g., during velocity sign changes).
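
As a rough sketch of how one scale of the optimization could be run (the text above reports Newton-CG with at most 30 iterations per scale), the snippet below uses SciPy with a numerically approximated gradient; in practice the gradient of the objective would come from automatic differentiation, and objective/optimize_scale are illustrative names.

```python
import numpy as np
from scipy.optimize import minimize, approx_fprime

def optimize_scale(objective, theta0, max_iter=30):
    """Minimize the composite objective (9) for one tile resolution with Newton-CG."""
    fun = lambda th: objective(th)
    jac = lambda th: approx_fprime(th, fun, 1e-6)      # numerical gradient (illustrative)
    res = minimize(fun, theta0.ravel(), method="Newton-CG", jac=jac,
                   options={"maxiter": max_iter})
    return res.x.reshape(theta0.shape)
```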

B. Optical Flow Estimation

1) Results on the MVSEC Benchmark

We first report the results on the MVSEC benchmark (Table I). The different methods (rows) are compared on one outdoor and three indoor sequences (columns). This is because many learning-based methods train on the other outdoor sequence, which is therefore not used for testing. Following Zhu et al. [34], outdoor_day1 is evaluated only on the specified 800 frames. The top part of Table I reports the flow corresponding to a time interval of dt=1 grayscale frame (at \approx 45 Hz, i.e., 22.2 ms), and the bottom part corresponds to dt=4 frames (89 ms).

The table is comprehensive, showing where the proposed methods stand compared to prior work. Our methods provide the best results among all methods in all indoor sequences and are the best among the unsupervised and model-based methods in the outdoor sequence. The errors for dt=4 are about four times larger than those for dt=1, which is sensible given the ratio of time interval durations.

Among the different variations of the proposed methods, we observe that (i) the motion field parameterization achieves better accuracy than the direct parameterization of the flow in indoor sequences, (ii) there are no significant differences between the three versions of the flow warp models, and (iii) the L^{1} loss improves accuracy over L^{2}. Elaborating on these three points: (i) the effectiveness of the motion field estimation indoors is due to a good match between the model assumptions and the data (there are no IMOs in the scene), whereas outdoors depth estimation is generally difficult for driving sequences. (ii) The negligible difference between the flow warp models can be attributed to the fact that the MVSEC dataset does not comprise large pixel displacements or occlusions, which is further discussed in Section V-B. (iii) The L^{1} norm grows more slowly than the L^{2} norm as the number of accumulated events in the IWE increases. This property makes the L^{1} objective function more sensitive to areas with few events (e.g., pixels of far-away objects), resulting in better estimation accuracy.

Qualitative results are shown in Fig. 6, where we compare our method against the state-of-the-art learning-based methods. Our method provides sharper IWEs than the baselines, without overfitting, and the estimated flow resembles the GT. We display flow masked by the events, for consistency with the benchmark. Recall that our method interpolates the flow at pixels with zero events. The USL result [30] is obtained using its official implementation, comprising a recurrent model that sequentially processes sub-partitions of event data. Notice that we use the event mask of the full timestamps (dt=4), which agrees with the quantitative evaluation for a consistent discussion.

Fig. 6. MVSEC results (dt=4) of our method and two state-of-the-art baselines: ConvGRU-EV-FlowNet (USL) [30] and EV-FlowNet (SSL) [34]. For each sequence, the upper row shows the flow masked by the input events, and the lower row shows the IWE using the flow. Our method produces the sharpest motion-compensated IWEs. Note that learning-based methods crop the events to the central 256 × 256 pixels, whereas our method does not. Black points in ground truth (GT) flow maps indicate the absence of LiDAR data. Additional plots are given in [31, Fig. 5].

Ground truth is not available on the entire image plane (see Fig. 6), such as in pixels not covered by the LiDAR's range, FOV, or spatial sampling. Additionally, there may be interpolation issues in the GT, since the LiDAR works at 20 Hz and the GT flow is given at frame rate (45 Hz). In the outdoor sequences, the GT from the LiDAR and the camera motion cannot provide correct flow for IMOs. These issues of the GT are noticeable in the IWEs: they are not as sharp as expected. In contrast, the IWEs produced by our method are sharp.

2) Results on the DSEC Benchmark

Table II gives quantitative results on the DSEC Optical Flow benchmark. No GT flow is available for these test sequences. The proposed methods are compared with an unsupervised-learning method [59] (Section II) and a supervised-learning method E-RAFT [44]. E-RAFT is an ANN that extracts features in event correlation volumes via an iterative update scheme instead of using a U-Net architecture. This version of RAFT [48] was introduced along with the DSEC flow benchmark and showed it can estimate pixel correspondences for large displacements. As expected, E-RAFT is better than ours in terms of flow accuracy because (i) it has additional training information (GT labels), and (ii) it is trained using the same type of GT signal used in the evaluation. Nevertheless, our method provides sensible results and is better in terms of FWL, which exposes similar GT quality issues as those of MVSEC: many pixels have no GT (LiDAR's FOV and IMOs). This is also confirmed in the qualitative results (Fig. 7). Our method provides sharp IWEs, even for IMOs (car) and the road close to the camera. We further discuss the issue of IMOs in the flow benchmarks in Section IV-B5.

TABLE II Results on the DSEC Optical Flow Benchmark [44]

Fig. 7. DSEC results on the interlaken_00b sequence (no GT available). Since GT is missing at IMOs and points outside the LiDAR's FOV, the supervised method [44] may provide inaccurate predictions around IMOs and road points close to the camera, whereas our method produces sharp edges. For visualization, we use 1 M events.

Remarkably, the proposed methods achieve competitive results in terms of flow accuracy with the unsupervised-learning method [59]. Among different variations, the “Flow (L^{1})” achieves the most competitive results for all sequences except for zurich_city_12a, a night sequence. The night scenes have many light-induced events that are not due to motion, and naturally the proposed methods tend to fail.

Notice that both DNN methods [44], [59] train and evaluate on the DSEC dataset, which is dominated by forward driving motion. As a result, these learning-based methods may overfit to the driving data (i.e., tend to predict forward motion) and fail to produce good results for other motions and datasets [55] (e.g., see the E-RAFT rows on the MVSEC indoor sequences in Table I). On the contrary, the proposed methods rely on the principle of event alignment and generalize to various datasets, producing consistently good results.

Similarly to the MVSEC results, the L^{1} loss achieves better accuracy than the L^{2} loss. Contrary to MVSEC, the results of the depth parameterization are generally worse than those of the flow parameterization. This can be attributed to the IMOs: although not included in the evaluation pixels, the scenes include IMOs which directly affect the estimated flow. As expected, the motion field estimation fails since it cannot fit the events caused by IMOs.

We observe that the evaluation intervals (100 ms) are large by optical flow standards. In the benchmark, 80% of the GT flow has a displacement of up to 22 px, which means that 20% of the GT flow is larger than 22 px (at VGA resolution). The apparent motion during such intervals is sufficiently large that it breaks the classical assumption of scene points flowing along linear trajectories (more details in Section IV-B5).

3) Application to Deep Neural Networks (DNN)

The proposed secrets are not only applicable to model-based methods, but also to unsupervised-learning methods. To this end, we train EV-FlowNet [34] in an unsupervised manner on the MVSEC dataset, using (9) as the data-fidelity term and a Charbonnier loss [67] as the regularizer. We convert 40 k events into the voxel-grid representation [27] with 5 time bins. The network is trained for 50 epochs with the Adam optimizer [80], a learning rate of 0.001, and a learning-rate decay of 0.8. To ensure generalization, we train our network on indoor sequences and test on the outdoor_day1 sequence. Since the time-aware flow does not have a significant influence on the MVSEC benchmark (Table I), we do not port it to the learning-based setting.
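
A sketch of the training loss is given below: the data term is the reciprocal focus from (9) and the smoothness term is a Charbonnier penalty on the spatial flow differences. The Charbonnier exponent and epsilon are common choices rather than values reported in the paper, and focus_fn is assumed to compute the multi-reference focus (5) from the predicted flow and the input events.

```python
import numpy as np

def charbonnier(x, alpha=0.45, eps=1e-3):
    """Charbonnier penalty [67], a smooth approximation of the L1 norm."""
    return np.mean((x ** 2 + eps ** 2) ** alpha)

def unsupervised_loss(flow_pred, events, focus_fn, lam=0.15):
    """Unsupervised training loss: 1 / f (data term) + lam * Charbonnier smoothness.
    The weight 0.15 follows the DSEC experiment below; other setups may use other values."""
    smooth = charbonnier(np.diff(flow_pred, axis=1)) + charbonnier(np.diff(flow_pred, axis=0))
    return 1.0 / focus_fn(flow_pred, events) + lam * smooth
```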

Table III shows the quantitative comparison with unsupervised learning methods. Our model achieves the second-best accuracy, following [27], and the best sharpness (FWL) among the existing methods. Notice that [27] was trained on the outdoor_day2 sequence, which is a driving sequence similar to the test one, while the other methods were trained on drone data [81]. Hence [27] might be overfitting to the driving data, while ours is not, by the choice of training data. The qualitative results of our unsupervised learning setting are shown in Fig. 8. We compare our method with the state-of-the-art unsupervised learning method [30]. Our results resemble the GT flow.

TABLE III Results of Unsupervised Learning Methods on MVSEC's outdoor_day1 Sequence

Fig. 8. Results of our DNN on the MVSEC outdoor sequence. Our DNN (EV-FlowNet architecture) trained with (9) outperforms the unsupervised learning method [30].

Additionally, we train the architecture in [59] on DSEC data using the L^{1} loss and the Charbonnier loss (with the regularizer weight of 0.15). The accuracy results, reported in Table II as “Ours (USL, L^{1})”, are on par with the model-based one. The two experiments in this section (Section IV-B3) confirm the transferability of the techniques in Section III to learning-based approaches, reaffirming the importance of our contributions.

4) Results on 1 Mpixel Datasets: TUM-VIE and M3ED

The proposed method generalizes to recent high spatial resolution event cameras. We show qualitative results on the TUM-VIE dataset [32] and the M3ED dataset [33] in Fig. 9. The flow looks realistic and produces sharp IWEs for various motions (forward motion, rotation, translation) and scenes (indoor and outdoor). Also, the flow estimation is stable regardless of the absolute scene intensity, while frames suffer from a limited dynamic range. Hence, we leverage the HDR advantages of event cameras.

Fig. 9. Results on 1 Mpixel event camera data. Sequences are bike-easy, skate-easy (TUM-VIE [32]), and falcon (M3ED [33]).

5) Discussion on Optical Flow Benchmarks and “GT” Flow

Throughout the quantitative evaluation of the event-based optical flow (Sections IV-B1 and IV-B2), we observe some limitations for the current benchmarks: (i) size of the evaluation interval and (ii) independently moving objects (IMOs).

Evaluation intervals and the linearity of optical flow: The time-aware flow is designed to consider the space-time nature of events. Recently, there have also been other proposals aiming to leverage this nature for per-pixel motion estimation. The main difference between our flow (Section III-C) and concurrent proposals [53], [59], [82] is the motion hypothesis and its underlying assumptions: (7) assumes that the flow is constant along its streamlines within short time intervals, which produces linear motion trajectories (Fig. 4). The number of DOFs of the motion is 2N_{p}, and the efficacy of the parameterization for occlusions is shown in Section V-B.

On the other hand, [53], [59] propose non-linear trajectories (e.g., Bézier curves) for the “optical flow”. We suspect that the choice of assuming non-linear trajectories stems from the necessity of reporting good figures on the DSEC benchmark (Table II), which has relatively long evaluation intervals. While it is called an “optical flow” benchmark, the ground truth on time intervals of 100 ms at moderate vehicle speeds can result in curved trajectories. The increased complexity of the non-linear trajectory estimation problem has several challenges to be addressed: (i) accuracy is difficult to evaluate with existing benchmarks, which are based on the standard definition of flow, (ii) there is a trade-off between the increased complexity of possible motions and the tendency to overfit, (iii) it is important to assess the efficacy of the curved trajectory in terms of downstream applications. We show various applications of the linear trajectory in Sections IV-C, IV-D, and IV-E; for curved trajectories, beyond focusing on beating the current benchmark, it would be interesting to show new applications. Finally, it is worth reconsidering the terminology of the estimation task, such as “instantaneous” (short-baseline) optical flow, versus “non-instantaneous” (i.e., large-baseline) curved trajectory estimation.

IMOs: The de facto standard flow benchmarks MVSEC and DSEC ignore pixels corresponding to IMOs (because it is difficult to obtain GT labels for IMOs in the real-world). However, optical flow can describe such motions. Indeed, as Table I shows, the motion-field–parameterized flow achieves better accuracy in still scenes. Training ANNs using only flow from rigid scenes may affect their learning capabilities. To avoid potential pitfalls of optical flow algorithms, it is therefore important that the data used for (training and) evaluation contains IMOs and a variety of ego-motions.

C. Applications of Optical Flow

This section demonstrates three exemplary applications of the estimated optical flow: motion segmentation, intensity reconstruction, and denoising.

1) Motion Segmentation

Motion segmentation is the task of splitting a scene into objects moving with different velocities. Thus, it is natural to address it by clustering optical flow [20]. To this end, we show results on three sequences from [20], [21] in Fig. 10 using k-means with 2 to 3 clusters. In the corridor scene (first row of Fig. 10) there are 3 clusters: two people are walking in opposite directions while the camera is moving (background). In the second example, the scene includes cars with horizontal motion while the camera tilts. The third example (car) has a car moving at a different speed in the same direction as the background, which is the most challenging case among these examples. In all examples, our method successfully provides sensible segmentation masks (last column of Fig. 10) corresponding to the scene objects.

Fig. 10. Motion segmentation. First row: corridor sequence from [21]. Second and third rows: sequences from [20].

Fig. 11 provides detailed analyses of the clustering operation for the corridor and car examples. Since the proposed method uses a tile-based parameterization of the flow, the interpolation between tiles produces flow vectors that fill in the regions between the distinctive cluster centroids. One could use other clustering algorithms, such as DBSCAN [83], to treat such interpolation effects as outliers.
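
A minimal sketch of the clustering step, using k-means from scikit-learn on the per-pixel flow vectors (the function name and shapes are illustrative); as noted above, a density-based method such as DBSCAN could be substituted to treat interpolation-induced in-between vectors as outliers.

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_flow(flow, num_clusters=3):
    """Cluster a (H, W, 2) flow field into motion segments with k-means."""
    h, w, _ = flow.shape
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(flow.reshape(-1, 2))
    return labels.reshape(h, w)     # per-pixel segment ids (e.g., background vs. IMOs)
```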

Fig. 11. Visualization of the flow clustering on the first and third examples in Fig. 10. The stars denote the cluster centroids. Cluster 0 (blue) corresponds to the background, while clusters 1 and 2 are independently moving objects.

2) Image Reconstruction

Events encode the apparent motion of scene edges (i.e., optical flow) as well as their brightness. These two quantities are entangled, and it is possible to use the computed optical flow to recover brightness, i.e., to reconstruct intensity images [24]. We demonstrate this on a 1 Mpixel dataset in Fig. 12. The estimated flow provides sharp IWEs, which successfully aid the reconstruction of intensities such as the checkerboard on the wall, the light and its reflection in the corridor, and the complex structure of the stairs. The results are remarkable despite the noise in the corridor scene (see Section IV-C3). Due to the regularizer in [24], very fine structure (e.g., the poster contents) might not be crisp.
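The reconstruction in [24] is a variational method; as a simplified first step, the sketch below only illustrates how motion-compensated events can be accumulated into a polarity-weighted brightness-increment image, which such reconstructions then integrate spatially (e.g., with a Poisson-type solver and a regularizer). The contrast threshold C and the nearest-pixel accumulation are assumptions of this sketch.

```python
import numpy as np

def brightness_increment_image(xy, t, pol, flow, t_ref, shape, C=0.2):
    """Accumulate signed polarities at motion-compensated event locations.

    Each event contributes +/-C (the assumed contrast threshold) at its position
    warped to t_ref with the estimated per-event flow, approximating the
    brightness change along each edge.
    """
    warped = xy + (t_ref - t)[:, None] * flow
    ix = np.clip(np.round(warped[:, 0]).astype(int), 0, shape[1] - 1)
    iy = np.clip(np.round(warped[:, 1]).astype(int), 0, shape[0] - 1)
    img = np.zeros(shape, dtype=np.float64)
    np.add.at(img, (iy, ix), C * np.where(pol > 0, 1.0, -1.0))
    return img
```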

Fig. 12. Image reconstruction after optical flow estimation. Data from the 1 Mpixel TUM-VIE dataset [32].

3) Denoising Event Data

By extending the idea of [84], which classifies events into signal or noise for temporal upsampling based on a predicted 2-DOF motion, we use the estimated optical flow to identify as noise those events whose motion-compensated location in the IWE accumulates fewer warped events than a threshold (e.g., 3 events). Fig. 13 shows qualitative results. The corridor scene has a large amount of noise due to lighting (i.e., flickering events). The denoised event data looks clearer while retaining the edge structure of the scene.
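A minimal sketch of this criterion, assuming per-event flow vectors and nearest-pixel accumulation (names and the default threshold are illustrative):

```python
import numpy as np

def denoise_events(xy, t, flow, t_ref, shape, min_count=3):
    """Keep events whose motion-compensated location is supported by the IWE.

    Events are warped with the estimated flow; pixels of the IWE with fewer than
    min_count warped events are treated as noise (e.g., flicker or hot pixels),
    and the events landing there are discarded.
    """
    warped = xy + (t_ref - t)[:, None] * flow
    ix = np.clip(np.round(warped[:, 0]).astype(int), 0, shape[1] - 1)
    iy = np.clip(np.round(warped[:, 1]).astype(int), 0, shape[0] - 1)
    iwe = np.zeros(shape, dtype=np.int64)
    np.add.at(iwe, (iy, ix), 1)
    keep = iwe[iy, ix] >= min_count
    return keep  # boolean mask over the input events
```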

Fig. 13. Denoising. The data is the skate-easy sequence from the TUM-VIE dataset. The top row is the image representation of the events, while the bottom row shows them in space-time coordinates (for better visualization, only the bottom-right quarter of the image plane is displayed).

D. Monocular Depth and Ego-Motion Estimation

1) Results on MVSEC

Evaluation on Depth: Table IV summarizes the quantitative results of depth estimation on the MVSEC dataset [34]. Following the convention of [57], we report the indoor metrics as the average over the three indoor sequences. Although prior works use different strategies, such as additional sensor information, different train-test splits, and different evaluation protocols, we provide exhaustive comparisons across the existing methods to date: a model-based method where the pose information is given (EMVS) [23], a supervised-learning method [61] trained on real data (outdoor_day2, denoted “SL (R)”) or in simulation (“SL (S)”), and two unsupervised-learning methods [27], [57].

TABLE IV Depth Evaluation on MVSEC (Mean of Three Indoor Sequences)

The proposed methods achieve overall better accuracy on the indoor sequences and competitive results on the outdoor sequence compared with ECN [57], the closest work to ours. However, ECN uses an 80/20 train-test split within each sequence (i.e., the training data consists of the same sequences as the test data), hence it might suffer from data leakage. For the outdoor sequence, our methods provide better results than the real-world supervised-learning method (“SL (R)”) and competitive results with the other learning-based approaches. We find that outdoor sequences are in general more challenging for the proposed approach. This can be attributed to the facts that (i) the MVSEC outdoor data has considerably sparser events, which affects the convergence of the method, and (ii) the events in a scene comprise a wide range of displacements unevenly distributed over the image plane. Indeed, the L^{1} gradient magnitude loss achieves better results than the L^{2} loss.

Qualitative results are shown in Fig. 14. For completeness, we show the flow (i.e., motion field) computed from the estimated depth and ego-motion. The estimated depth resembles the GT for both sequences, resulting in sharp IWEs. Moreover, similarly to the flow estimation (Section IV-B1), the proposed depth covers the pixels where the GT does not exist, such as the middle board in the indoor scene and poles in the outdoor scene. Also, the estimated depth looks reasonable where LiDAR may fail to produce reliable depth maps due to the differences in the sampling frequency (e.g., the left-most board in the indoor results). Overall, the results illustrate that the proposed method is effective in estimating depth for these standard, real-world sequences.
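For reference, the flow in Fig. 14 is the motion field induced by the estimated depth and camera velocities. A sketch of the standard motion-field equation in normalized camera coordinates follows; whether our implementation uses exactly these sign and coordinate conventions is an assumption of the sketch.

```python
import numpy as np

def motion_field(x, y, inv_depth, V, omega):
    """Motion field induced by depth and ego-motion (normalized coordinates).

    x, y      : arrays of normalized pixel coordinates (x = (u - cx)/fx, etc.)
    inv_depth : per-pixel inverse depth 1/Z
    V, omega  : linear and angular camera velocities (3-vectors)
    Returns (x_dot, y_dot); multiply by (fx, fy) and the time interval
    to obtain pixel displacements.
    """
    Vx, Vy, Vz = V
    wx, wy, wz = omega
    x_dot = (x * Vz - Vx) * inv_depth + x * y * wx - (1.0 + x**2) * wy + y * wz
    y_dot = (y * Vz - Vy) * inv_depth + (1.0 + y**2) * wx - x * y * wy - x * wz
    return x_dot, y_dot
```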

Fig. 14. Depth estimation results on indoor_flying3 and outdoor_day1 sequences of the MVSEC dataset [34]. The 2nd and 3rd columns show the estimation and GT, respectively.

Ego-Motion Estimation: Fig. 15 shows ego-motion estimation results on the indoor_flying1 sequence. The estimated linear velocity is scaled using the GT (IMU). The linear velocities resemble the GT, indicating that our method successfully estimates the camera motion of the freely-moving (6-DOF) drone. The pitch/yaw angular velocities are challenging to estimate since the motion field due to the pitch/yaw rotations is similar to that of a translation.
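Since monocular estimation recovers the linear velocity only up to scale, the estimate must be aligned to the GT before comparison. A common choice, shown below as an assumption rather than the exact procedure used, is the least-squares scale factor.

```python
import numpy as np

def align_scale(v_est, v_gt):
    """Least-squares scale factor aligning estimated linear velocities to GT.

    v_est, v_gt: (N, 3) arrays of per-packet linear velocities; the returned
    scalar minimizes || s * v_est - v_gt ||^2 over all packets.
    """
    return np.sum(v_est * v_gt) / np.sum(v_est * v_est)
```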

Fig. 15. Ego-motion estimation results on the indoor_flying1 sequence from the MVSEC dataset [34].

Quantitative results are reported in Table V. Linear velocity errors are sensible: 20–30 cm/s for the indoor (drone) sequences and 5.9 m/s for the outdoor (car) sequence. Forward-moving motion is more challenging for depth estimation, as the scene contains less parallax than under lateral translational motion, which is confirmed by our results. Angular velocity errors are small in all sequences, as they do not contain rotation-dominant motions. Few prior works report numerical values for comparison. As discussed in Section IV-D1 and Table IV, ECN [57], which reports a very small error (0.7 m/s), might have overfit to this outdoor sequence. In contrast, our results provide consistently reasonable metrics across all sequences. We hope Table V will encourage more works to benchmark monocular ego-motion estimation on these datasets.

TABLE V Pose Evaluation on MVSEC [34]

2) Results on ECD

Depth and ego-motion estimation results on the slider_depth sequence from the ECD dataset [63] are shown in Fig. 16. Our method produces a sharp IWE as well as a reasonable depth map, flow, and poses, handling complex objects with occlusions and at different distances. The camera pose RMS errors are 0.11 m/s (in \mathbf {V}) and 0.94 ^{\circ }/s (in \boldsymbol{\omega }). We observe that the predicted linear velocity stays relatively constant, as expected. Also, the angular velocity error stays small, as the dominant motion of the sequence is translational. This is favorable for a future extension of the proposed method to global adjustment (e.g., SLAM).

Fig. 16. Depth and Ego-motion estimation for the slider_depth sequence (real data) from the ECD dataset [63]. RMS errors: 0.11 m/s (in \mathbf {V}) and 0.94 ^{\circ }/s (in \boldsymbol{\omega }).

Fig. 17 shows the results on a synthetic sequence from [63]. Since it has ground truth poses and depth, we also report the corresponding evaluation metrics: SiLog[x100]: 1.16, AbsRelDiff: 0.09, logRMSE: 0.11, A1: 0.98, A2: 1.0, and A3: 1.0 for depth, and RMS errors of 0.30 m/s (in \mathbf {V}) and 0.199 ^{\circ }/s (in \boldsymbol{\omega }) for the velocities. The estimated depth, flow, and ego-motion resemble the GT, producing a sharp IWE.
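For readers unfamiliar with these depth metrics, the sketch below lists common definitions over valid pixels. Whether an extra square root or weighting factor is applied to SiLog is an assumption here, since conventions vary across benchmarks.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-9):
    """Common monocular-depth metrics (pred, gt: arrays of positive depths)."""
    d = np.log(pred + eps) - np.log(gt + eps)
    silog = np.mean(d**2) - np.mean(d)**2            # scale-invariant log error
    abs_rel = np.mean(np.abs(pred - gt) / gt)        # absolute relative difference
    log_rmse = np.sqrt(np.mean(d**2))                # RMSE in log space
    ratio = np.maximum(pred / gt, gt / pred)
    a1, a2, a3 = [np.mean(ratio < 1.25**k) for k in (1, 2, 3)]  # accuracy ratios
    return {"SiLog[x100]": 100 * silog, "AbsRelDiff": abs_rel,
            "logRMSE": log_rmse, "A1": a1, "A2": a2, "A3": a3}
```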

Fig. 17. Depth and Ego-motion estimation for the simulation_3planes sequence from the ECD dataset [63]. GT flow is generated using GT poses and GT depth. RMS errors: 0.30 m/s (in \mathbf {V}) and 0.199 ^{\circ }/s (in \boldsymbol{\omega }).

3) Results on 1 Mpixel Datasets. TUM-VIE and M3ED

Fig. 18 shows qualitative depth estimation results on the TUM-VIE and M3ED datasets [32], [33]. The estimated depth is realistic even for the challenging corridor sequence, which contains a large amount of noise and large variations of contour displacement due to the forward motion. The resulting flows are reasonable and the IWEs are sharp. Since these datasets do not provide GT depth, we cannot conduct a quantitative evaluation.

Fig. 18. Depth estimation on 1 Mpixel event datasets [32], [33].

E. Stereo Depth Estimation

As explained in Section III-F2, our method can also tackle the event-based stereo scenario. Fig. 19 shows stereo depth estimation results on the DSEC and MVSEC datasets. By parameterizing the depth and ego-motion on one camera only, the proposed model-based method successfully converges and provides sharp IWEs for both event cameras. We observe that, while IMOs are not explicitly modeled, depth estimation becomes more robust against them in the stereo setting. We leave a detailed analysis, evaluation, and benchmarks for future work.

Fig. 19. Stereo depth estimation results on MVSEC (indoor2) and DSEC (zurich_05b) datasets.

SECTION V.

Ablation and Sensitivity Analysis

A. Effect of the Multi-Reference Focus Loss

The effect of the proposed multi-reference focus loss is shown in Fig. 20. The single-reference focus loss can easily overfit to the single reference time, pushing all events into a small region of the image at t_{1} while producing blurry IWEs at the other times (t_{\text{mid}} and t_{N_{e}}). Instead, our proposed multi-reference focus loss discourages such overfitting, as the loss favors flow fields that produce sharp IWEs at any reference time. The difference is also noticeable in the flow: the flow from the single-reference loss is irregular, with large spatial variability in direction (many colors, often on opposite sides of the color wheel). In contrast, the flow from the multi-reference loss is considerably more regular.
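The idea can be sketched as averaging a sharpness score over IWEs built at several reference times; the nearest-pixel accumulation and the plain L1 gradient-magnitude score below are simplifications of the actual loss (which uses additional details such as interpolated voting), so treat this as an assumed minimal form.

```python
import numpy as np

def iwe(xy, t, flow, t_ref, shape):
    """Image of warped events: warp each event to t_ref with its flow and accumulate."""
    warped = xy + (t_ref - t)[:, None] * flow
    ix = np.clip(np.round(warped[:, 0]).astype(int), 0, shape[1] - 1)
    iy = np.clip(np.round(warped[:, 1]).astype(int), 0, shape[0] - 1)
    img = np.zeros(shape)
    np.add.at(img, (iy, ix), 1.0)
    return img

def sharpness(img):
    """Gradient-magnitude sharpness of an IWE (L1 variant; L2 is analogous)."""
    gy, gx = np.gradient(img)
    return np.mean(np.abs(gx) + np.abs(gy))

def multi_reference_loss(xy, t, flow, shape):
    """Average the sharpness of the IWEs at t_1, t_mid and t_Ne, so that no single
    reference time can be overfit (the value is to be maximized)."""
    refs = (t.min(), 0.5 * (t.min() + t.max()), t.max())
    return np.mean([sharpness(iwe(xy, t, flow, r, shape)) for r in refs])
```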

Fig. 20. Effect of the multi-reference focus loss. Top row: single reference (t_{1}). Bottom row: proposed multi-reference.

B. Effect of the Time-Aware Flow

To assess the effect of the proposed time-aware warp (8), we conducted experiments on the MVSEC, DSEC, and ECD [63] datasets. Accuracy results are already reported in Tables I and II. We now report values of the FWL metric in Table VI. For MVSEC, dt=1 is a very short time interval, with small motion and therefore few events; hence the sharpness of the IWE with or without motion compensation is about the same (FWL \approx 1). Instead, dt=4 provides more events and larger FWL values (1.1–1.3), which means that the contrast of the motion-compensated IWE is larger than that of the zero-flow baseline. All three methods provide sharper IWEs than the ground truth. The advantages of the time-aware warp (8) over (4) in producing better IWEs (higher FWL) are most noticeable on sequences like slider_depth [63] and DSEC (see Fig. 21) because of the occlusions and larger motions. Notice that FWL differences below 0.1 are significant, as seen in [52, Fig. 1] (cf. last two columns) and [52, Fig. 3], demonstrating the efficacy of time awareness.
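A minimal sketch of the FWL metric (variance of the IWE under the candidate flow relative to the zero-flow baseline); the nearest-pixel accumulation is an assumption of this sketch.

```python
import numpy as np

def _iwe(xy, t, flow, t_ref, shape):
    """Accumulate warped events into an IWE (nearest-pixel voting)."""
    warped = xy + (t_ref - t)[:, None] * flow
    ix = np.clip(np.round(warped[:, 0]).astype(int), 0, shape[1] - 1)
    iy = np.clip(np.round(warped[:, 1]).astype(int), 0, shape[0] - 1)
    img = np.zeros(shape)
    np.add.at(img, (iy, ix), 1.0)
    return img

def fwl(xy, t, flow, t_ref, shape):
    """FWL = Var(IWE with candidate flow) / Var(IWE with zero flow).
    Values above 1 indicate the flow sharpens the IWE; ~1 indicates little motion."""
    return (np.var(_iwe(xy, t, flow, t_ref, shape))
            / np.var(_iwe(xy, t, np.zeros_like(flow), t_ref, shape)))
```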

TABLE VI FWL (IWE Sharpness) Results on MVSEC, DSEC, and ECD

Fig. 21. Effect of the time-aware flow. Comparison between three flow models: Burgers’, upwind, and no time-aware (4). At occlusions (dartboard in slider_depth [63] and garage door in DSEC [5]), upwind and Burgers’ produce sharper IWEs. Due to the smoothness of the flow conferred by the tile-based approach, some small regions are still blurry.

C. Effect of the Multi-Scale Approach

The effect of the proposed multi-scale approach (Fig. 5) is shown in Fig. 22. This experiment compares the results of the multi-scale approach (in a coarse-to-fine fashion) versus using a single (finest) scale. With a single scale, the optimizer gets stuck in a local extremum, yielding an irregular flow field (see the optical flow rows), which may produce a blurry IWE (e.g., the outdoor_day1 scene). With three scales (the finest tile size and two downsampled ones), the flow becomes less irregular than with a single scale, but there may still be regions with few events where the flow is difficult to estimate. With five scales, the flow becomes smoother and more coherent over the whole image domain, while still producing sharp IWEs.
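The coarse-to-fine schedule can be summarized as warm-starting each finer tile grid with the upsampled solution of the coarser one; the nearest-neighbor upsampling and the `optimize` callback below are assumptions of this sketch, not the exact scale handling of our implementation.

```python
import numpy as np

def upsample_tile_flow(tile_flow):
    """Double the tile-grid resolution (nearest-neighbor; a smoother
    interpolation could be used instead)."""
    return np.repeat(np.repeat(tile_flow, 2, axis=0), 2, axis=1)

def coarse_to_fine(optimize, num_scales=5, coarsest_shape=(1, 1)):
    """Run the optimizer from the coarsest tile grid to the finest, warm-starting
    each scale with the upsampled solution of the previous one.
    `optimize(init)` is assumed to refine an (h, w, 2) tile-flow grid and return it."""
    tile_flow = optimize(np.zeros((*coarsest_shape, 2)))
    for _ in range(num_scales - 1):
        tile_flow = optimize(upsample_tile_flow(tile_flow))
    return tile_flow
```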

Fig. 22. Effect of the multi-scale approach. For each sequence, the top row shows the estimated flow and the bottom row shows the IWEs.

D. The Choice of Loss Function

Table VII shows the results on the MVSEC benchmark for different loss functions. We compare the gradient-based functions (L^{1} and L^{2}), image variance [7], average timestamp [27], and normalized average timestamp [30]. The contrast functions (L^{1}, L^{2}, and variance) yield consistently better accuracy than the two average-timestamp losses. Although the variance gives competitive results, we use the functions based on the IWE gradient for the reasons described in Section III-B1. Both average-timestamp losses are trapped in undesired global optima that push most events out of the image plane (see Fig. 23); hence, they produce very large errors (marked as “> 99” in Table VII). Despite this, they have been successfully used in several learning-based methods.
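The basic form of the compared objectives is sketched below; the actual losses include details such as interpolated voting, polarity handling, and the normalization of [30], so these are assumed minimal forms for illustration only.

```python
import numpy as np

def gradient_magnitude(iwe_img, p=1):
    """L^p gradient-magnitude objective on the IWE (to be maximized)."""
    gy, gx = np.gradient(iwe_img)
    return np.mean(np.abs(gx)**p + np.abs(gy)**p)

def variance(iwe_img):
    """Image variance objective (to be maximized)."""
    return np.var(iwe_img)

def average_timestamp(warped_ix, warped_iy, t, shape):
    """Average-timestamp objective (to be minimized): per-pixel mean timestamp of
    the warped events, squared and summed. Its undesired optimum warps events
    out of the image, the failure mode shown in Fig. 23(c)-(d)."""
    t_sum = np.zeros(shape)
    count = np.zeros(shape)
    np.add.at(t_sum, (warped_iy, warped_ix), t - t.min())
    np.add.at(count, (warped_iy, warped_ix), 1.0)
    avg = np.where(count > 0, t_sum / np.maximum(count, 1), 0.0)
    return np.sum(avg**2)
```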

TABLE VII Sensitivity Analysis on the Choice of Loss Function

Fig. 23. IWEs for different loss functions: (a) Gradient Magnitude (L^{2}); (b) Variance; (c) Avg. timestamp [27]; and (d) Normalized avg. timestamp [30].

Remark: Maximization of (5) does not suffer from the problem mentioned in [30] that affects the average-timestamp loss, namely that the optimal flow warps all events outside the image so as to minimize the loss (the undesired global optima shown in Fig. 23(c)–(d)). If most events were warped outside of the image, then (5) would be smaller than for the identity warp, which contradicts maximization.

E. The Regularizer Weight

Table VIII shows the sensitivity analysis on the regularizer weight \lambda in (9). \lambda =0.0025 provides the best accuracy on the outdoor sequence, while \lambda =0.025 provides slightly better accuracy on the indoor sequences. Weighing these trade-offs, we use the former because it yields the larger overall accuracy gain.
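For intuition, the role of \lambda can be sketched as the weight of a smoothness penalty subtracted from the sharpness objective; a generic total-variation term is assumed below as a stand-in for the exact regularizer in (9).

```python
import numpy as np

def total_objective(sharpness_value, tile_flow, lam=0.0025):
    """Composite objective: IWE sharpness (to be maximized) minus lambda times a
    smoothness penalty on the (H, W, 2) tile-flow grid."""
    tv = (np.abs(np.diff(tile_flow, axis=0)).mean()
          + np.abs(np.diff(tile_flow, axis=1)).mean())
    return sharpness_value - lam * tv
```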

TABLE VIII Sensitivity Analysis on the Regularizer Weight
SECTION VI.

Computational Performance

Each scale of our method has the same computational complexity as CM [7], O(N_{e}+ N_{p}), because the multi-reference warps only introduce a constant scaling factor. Our unoptimized PyTorch (v1.9) implementation running on a GPU (NVIDIA Quadro RTX 8000) without time awareness takes about 9.9 s per batch to converge in the MVSEC experiments (about 3\times more with the Burgers’ scheme) (Section IV-B1). However, if we apply the proposed method to a DNN (EV-FlowNet), training takes about 10 h, preprocessing (center cropping and voxelization) takes 74 ms, and inference takes about 3 ms (Section IV-B3). This inference time is on par with other DNN-based methods.

SECTION VII.

Limitations

Like previous unsupervised works [27], [30], our method is based on the brightness constancy assumption. Hence, it struggles to estimate flow from events that are not caused by motion, such as those due to flickering lights. SL and SSL methods may forego this assumption, but they require a high-quality supervisory signal, which is challenging to obtain due to the HDR and high speed of event cameras.

Like other optical flow methods, our approach may suffer from the aperture problem. The flow could still cause event collapse if tiles become too small (higher DOFs), or if the regularization is too small compared with the texture density that drives the data-fidelity term. This effect can be observed in Fig. 1, where the flow becomes irregular for the tree leaves (in the example on row 2). Optical flow is also difficult to estimate in regions with few events, such as homogeneous brightness regions and regions with small apparent motion. Regularization fills in the homogeneous regions, whereas recurrent connections could help with small apparent motion.

The monocular depth and ego-motion estimation approach considers each event packet (i.e., time interval) independently, hence it only recovers camera velocities. Absolute poses could be estimated if the camera velocities were simultaneously recovered over multiple event packets while sharing a common depth map. The stereo approach enables the recovery of the absolute scale.

While the computational effort of the proposed approach is high in our current (unoptimized) implementation, it allowed us to focus on modeling the problem and uncovering the “secrets” of event-based optical flow, i.e., identifying the successful ingredients for accurate motion estimation. Then, we showed how such knowledge could be transferred to learning-based settings, with the same computational cost and speed as prior work (ms inference time on GPUs).

SECTION VIII.

Conclusion

We have extended the CM framework to estimate dense optical flow, depth, and ego-motion from events alone. The proposed principled method overcomes problems of overfitting, occlusions, and convergence by sensibly modeling the space-time nature of event data. Comprehensive experiments show that our method achieves the best flow accuracy among all methods on the MVSEC indoor benchmark, and among unsupervised and model-based methods on the outdoor sequence. It also provides competitive results on the DSEC optical flow benchmark and generalizes to various datasets, including the latest 1 Mpixel ones, delivering the sharpest IWEs. The method exposes the limitations of current flow benchmarks and produces remarkable results when transferred to unsupervised learning settings. We show downstream applications of the estimated flow, such as motion segmentation, intensity reconstruction, and event denoising. Finally, the method achieves competitive results in depth and ego-motion estimation in both monocular and stereo settings. As demonstrated, the proposed framework is able to handle a broad set of motion-related tasks across multiple datasets and event camera resolutions, hence we believe it can become a cornerstone of event-based vision. We hope our work inspires future model-based and learning-based approaches to these motion-related problems.
