Journals & Magazines >IEEE Transactions on Pattern ... >Volume: 46 Issue: 12

Secrets of Event-Based Optical Flow, Depth and Ego-Motion Estimation by Contrast Maximization

Abstract:

Event cameras respond to scene dynamics and provide signals naturally suitable for motion estimation with advantages, such as high dynamic range. The emerging field of ev...Show More

Metadata

Abstract:

Event cameras respond to scene dynamics and provide signals naturally suitable for motion estimation with advantages, such as high dynamic range. The emerging field of event-based vision motivates a revisit of fundamental computer vision tasks related to motion, such as optical flow and depth estimation. However, state-of-the-art event-based optical flow methods tend to originate in frame-based deep-learning methods, which require several adaptations (data conversion, loss function, etc.) as they have very different properties. We develop a principled method to extend the Contrast Maximization framework to estimate dense optical flow, depth, and ego-motion from events alone. The proposed method sensibly models the space-time properties of event data and tackles the event alignment problem. It designs the objective function to prevent overfitting, deals better with occlusions, and improves convergence using a multi-scale approach. With these key elements, our method ranks first among unsupervised methods on the MVSEC benchmark and is competitive on the DSEC benchmark. Moreover, it allows us to simultaneously estimate dense depth and ego-motion, exposes the limitations of current flow benchmarks, and produces remarkable results when it is transferred to unsupervised learning settings. Along with various downstream applications shown, we hope the proposed method becomes a cornerstone on event-based motion-related tasks.

Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence ( Volume: 46, Issue: 12, December 2024)

Page(s): 7742 - 7759

Date of Publication: 02 May 2024

ISSN Information:

PubMed ID: 38696288

DOI: 10.1109/TPAMI.2024.3396116

Funding Agency:

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Event cameras are novel bio-inspired vision sensors that naturally respond to motion of edges in image space with high dynamic range (HDR) and minimal blur at high temporal resolution, on the order of $\mu$ s [1]. These advantages provide a rich signal for accurate motion estimation in difficult real-world scenarios for frame-based cameras. However, such a signal is asynchronous and sparse by nature, hence not compatible with traditional computer vision algorithms. This poses the challenge of rethinking visual processing [2], [3]: motion patterns (i.e., optical flow) are no longer obtained by analyzing the intensities of images captured at regular intervals, but by analyzing the stream of per-pixel brightness changes produced by the event camera.

Multiple methods have been proposed for event-based optical flow estimation. They can be broadly categorized in two: (i) model-based methods, which investigate the principles and characteristics of event data that enable optical flow estimation, and (ii) learning-based methods, which exploit correlations in the data and/or apply the above-mentioned principles to compute optical flow. One of the challenges of event-based optical flow is the lack of ground truth flow in real-world datasets (at $\mu$ s resolution and HDR) [2], which makes it difficult to evaluate and compare the methods properly, and to train supervised-learning ones. Ground truth (GT) in de facto standard datasets [4], [5] is obtained by the motion field [6] given additional depth sensors and camera motion. However, such data is limited by the field-of-view (FOV) and resolution (spatial and temporal) of the depth sensor, which do not match those of event cameras. Hence, it is paramount to develop interpretable optical flow methods that exploit the characteristics of event data, and that do not need costly and error-prone ground truth.

Among prior work, Contrast Maximization (CM) [7], [8] is a powerful framework that allows us to tackle multiple motion estimation problems (rotational motion [9], [10], [11], [12], homographic motion [7], [13], [14], feature flow estimation [15], [16], [17], [18], motion segmentation [19], [20], [21], [22], and also reconstruction [7], [23], [24]). It maximizes an objective function (e.g., contrast) that measures the alignment of events caused by the same scene edge. The intuitive interpretation is to estimate the motion by recovering the sharp (motion-compensated) image of edge patterns that caused the events. Preliminary work on applying CM to estimate optical flow has reported event collapse [25], [26], producing flows at undesired optima that warp events to few pixels or lines [27]. This issue has been tackled by changing the objective function, from contrast to the energy of an average timestamp image [27], [28], but this loss is not straightforward to interpret [8], [29], and is not without its problems [30].

The state-of-the-art performance of CM in low degrees-of-freedom (DOF) motion estimations and its issues in more complex motions (dense flow) suggests that prior work may have rushed to use CM in unsupervised learning of dense flow. There is a gap in understanding how CM can be sensibly extended to estimate dense optical flow accurately. This paper fills this gap and shows a few “secrets” that are also applicable to overcome the issues of previous approaches.

We propose to extend CM for dense optical flow estimation via a tile-based approach covering the image plane (Fig. 1). We present several distinctive contributions:

A multi-reference focus loss function to improve accuracy and discourage overfitting (Section III-B).
Fig. 1.
DSEC test sequences (interlaken_00b, thun_01a) [5]. Our optical flow estimation method produces sharp images of warped events (IWE) despite the scene complexity, the large pixel displacement and the high dynamic range.
Show All
A principled time-aware flow to better handle occlusions, leveraging the solution of transport problems via differential equations (Section III-C).
A multi-scale approach to improve convergence to the solution and avoid getting trapped in local optima (Section III-D).

Optical flow is a fundamental visual quantity related to many others, such as camera motion and scene depth. Hence, in this paper we exploit these connections, in monocular and stereo configurations, and show how a dense flow can serve to tackle various related problems in event-based vision, such as depth estimation, motion segmentation, etc. (Fig. 2). This paper is based on our previous work [31], which we substantially extend in the following points:

We introduce a new objective function that improves both flow and depth estimation (Section III-B1).
Fig. 2.
Overview. The proposed method solely relies on event data. It not only estimates optical flow, but can also estimate scene depth and ego-motion simultaneously from a monocular or stereo event camera setup. Furthermore, the estimated flow enables various downstream applications such as motion segmentation, intensity reconstruction and event denoising.
Show All
We tackle stationary scenes, estimating monocular depth and ego-motion jointly (Sections III-F1 and IV-D).
We also address the stereo setup (Sections III-F2 and IV-E).
We discuss current optical flow benchmarks, evaluations and “GT” flow (Section IV-B5).
We provide experiments on downstream applications of optical flow: motion segmentation, intensity reconstruction, and denoising (Section IV-C).
We show experiments on 1Mpixel event cameras, the most recent event camera datasets: TUM-VIE [32] and M3ED [33], both in flow (Section IV-B4) and depth estimation (Section IV-D3).
We extend the discussion on computational performance and limitations (Sections VI and VII).

The results of our experimental evaluation are surprising: the above design choices are key to our simple, model-based tile-based method achieving the best accuracy among all state-of-the-art methods, including supervised-learning ones, on the de facto benchmark of MVSEC indoor sequences [34]. Since our method is interpretable and produces better event alignment than the ground truth flow, both qualitatively and quantitatively, the experiments also expose the limitations of the current “ground truth”. The experiments demonstrate that the above key choices are transferable to unsupervised learning methods, thus guiding future design and understanding of more proficient Artificial Neural Networks (ANNs) for event-based optical flow estimation. Finally, the method allows us to solve many motion-related applications, thus becoming a cornerstone in event-based vision.

Because of the above, we believe that the proposed design choices deserve to be called “secrets” [35]. To the best of our knowledge, they are novel in the context of event-based optical flow, depth and ego-motion estimation, e.g., no prior work considers constant flow along its characteristic lines, designs the multi-reference focus loss to tackle overfitting, or has defined multi-scale (i.e., multi-resolution) contrast maximization on the raw events.

SECTION II.

Related Work

A. Event-Based Optical Flow Estimation

Given the identified advantages of event cameras to estimate optical flow, extensive research on this topic has been carried out. Prior work has proposed adaptations of frame-based approaches (block matching [36], Lucas-Kanade [37]), filter-banks [38], [39], spatio-temporal plane-fitting [40], [41], time surface matching [42], variational optimization on voxelized events [43], and feature-based contrast maximization [7], [15]. For a detailed survey, we refer to [2].

Current state-of-the-art approaches are ANNs [27], [30], [34], [44], [45], [46], largely inspired by frame-based optical flow architectures [47], [48]. Non-spiking–based approaches need to additionally adapt the input signal, converting the events into a tensor representation (event frames, voxel grids, etc.). These learning-based methods can be classified into supervised, semi-supervised, or unsupervised (see Table I). In terms of architectures, the three most common ones are U-Net [34], [49], FireNet [28], and RAFT [44], [50].

TABLE I Results on MVSEC Dataset [34]

Supervised methods train ANNs in simulation and/or real-data [44], [49], [50], [51], [52], [53], [54]. This requires accurate GT flow that matches the space-time resolution of event cameras. While this is not a problem in simulation, it incurs a performance gap when trained models are used to predict flow on real data, due to often a large domain gap between training and test data [52], [55]. Besides, real-world datasets have issues in providing accurate GT flow.

Semi-supervised methods use the grayscale images from a colocated camera (e.g., DAVIS [56]) as a supervisory signal: images are warped using the flow predicted by the ANN and their photometric consistency is used as loss function [34], [45], [46]. While such supervisory signal is easier to obtain than real-world GT flow, it may suffer from the limitations of frame-based cameras (e.g., motion blur and low dynamic range), consequently affecting the trained ANNs. EV-FlowNet [34] pioneered these approaches.

Unsupervised methods rely solely on event data. Their loss function consists of an event alignment error using the flow predicted by the ANN [27], [28], [30], [57], [58], [59]. Zhu et al. [27] extended EV-FlowNet [34] to the unsupervised setting using a motion-compensation loss inspired by the average timestamp images in [19]. This U-Net–like approach has been improved with recurrent blocks in [28], [30]. Paredes-Vallés et al. [28] also proposed FireFlowNet, a lightweight recurrent ANN with no downsampling. More recently, [30] has proposed several variants of EV-FlowNet and FireFlowNet models, and, enabled by the recurrent blocks, has replaced the usual voxel-grid input event representation by sequentially processing short-time event frames. Finally, concurrent work [59] builds upon [30] (sequential processing of event frames), proposing iterative event warping at multiple reference times in a multi-timescale fashion, which allows curved motion trajectories.

B. Event-Based Depth and Ego-Motion Estimation

Having estimated optical flow, one could try to fit a depth map and camera ego-motion a posteriori consistent with the flow [60]. Instead, it is better to incorporate the assumption of a still scene and a moving camera on the parameterization of the flow using the motion field equation [6]. While this connection exists, the topic of joint ego-motion and dense depth estimation via the motion field is not as explored as optical flow estimation. The problem is difficult, and often one settles for estimating depth alone, with or without knowledge of the camera motion [23], [61], [62].

Closest to our work are [27], [57] because they estimate a depth-parameterized motion field that best fits the event data. They do so by training ANNs in an unsupervised way. The loss functions are based on the energy of an average timestamp image [27] or on the photometric consistency of edge-maps warped by the predicted flow [57].

Similar to the above-mentioned unsupervised-learning works, our method produces dense optical flow and/or depth and does not need ground truth or additional supervisory signals. In contrast to prior work, we adopt a more classical modeling perspective to gain insights into the problem and discover principled solutions that can subsequently be applied to the learning-based setting. Stemming from an accurate and spatially-dependent contrast loss (the gradient magnitude [8]), we model the problem using a tile of patches (in flow or depth parameters) and propose solutions to several problems: overfitting, occlusions, and convergence. To the best of our knowledge, (i) no prior work has proposed to estimate dense optical flow and/or dense depth from a CM model-based perspective, and (ii) no prior unsupervised learning approach based on motion compensation has succeeded in estimating optical flow without the average timestamp image loss. The latter may be due to event collapse [25], but given recent advances on overcoming this issue [31], we show it is possible to succeed.

SECTION III.

Method

In this section, first we briefly revisit the Contrast Maximization framework (Section III-A). Then, the proposed methods are explained in detail: Section III-B proposes the new data fidelity term of the objective function, which discourages event collapse. Section III-C proposes a principled model for optical flow that considers the space-time nature of events. We also explain the multi-scale parameterization of the flow (Section III-D), the composite objective function (Section III-E), and the application to the problem of depth and ego-motion estimation in monocular and stereo configurations (Section III-F).

A. Event Cameras and Contrast Maximization

Event cameras have independent pixels that operate continuously and generate “events” $e_{k} \doteq (\mathbf {x}_{k},t_{k},p_{k})$ whenever the logarithmic brightness at the pixel increases or decreases by a predefined amount, called contrast sensitivity. Each event $e_{k}$ contains the pixel-time coordinates ( $\mathbf {x}_{k}, t_{k}$ ) of the brightness change and its polarity $p_{k} \in \lbrace +1,-1\rbrace$ . Events occur asynchronously and sparsely on the pixel lattice, with a variable rate that depends on the scene dynamics.

The CM framework [7] assumes events $\mathcal {E}\doteq \lbrace e_{k}\rbrace _{k=1}^{N_{e}}$ are caused by moving edges (i.e., brightness constancy), and transforms them geometrically according to a motion model $\mathbf {W}$ , producing a set of warped events $\mathcal {E}^{\prime }_{t_\text{ref}} \doteq \lbrace e^{\prime }_{k}\rbrace _{k=1}^{N_{e}}$ at a reference time $t_\text{ref}$ $\begin{equation*} e_{k} \doteq (\mathbf {x}_{k},t_{k},p_{k}) \;\,\mapsto \;\, e^{\prime }_{k} \doteq (\mathbf {x}^{\prime }_{k},t_\text{ref},p_{k}). \tag{1} \end{equation*}$ View SourceThe warp $\mathbf {x}^{\prime }_{k} = \mathbf {W}(\mathbf {x}_{k},t_{k}; \boldsymbol{\theta })$ transports each event from $t_{k}$ to $t_\text{ref}$ along the motion curve that passes through it. The vector $\boldsymbol{\theta }$ parameterizes the motion curves. Transformed events are aggregated on an image of warped events (IWE) $\begin{equation*} I(\mathbf {x}; \mathcal {E}^{\prime }_{t_\text{ref}}, \boldsymbol{\theta }) \doteq \sum _{k=1}^{N_{e}} \delta (\mathbf {x}- \mathbf {x}^{\prime }_{k}), \tag{2} \end{equation*}$ View Sourcewhere each pixel $\mathbf {x}$ sums the number of warped events $\mathbf {x}^{\prime }_{k}$ that fall within it. The Dirac delta is approximated by a Gaussian, $\delta (\mathbf {x}-\boldsymbol{\mu })\approx \mathcal {N}(\mathbf {x};\boldsymbol{\mu },\epsilon ^{2}\mathtt {Id})$ with $\epsilon =1$ pixel. Next, an objective function $f(\boldsymbol{\theta })$ is computed, such as the contrast of the IWE (2), given by the variance $\begin{equation*} \operatorname{Var}\bigl (I(\mathbf {x};\boldsymbol{\theta })\bigr ) \doteq \frac{1}{|\Omega |} \int _{\Omega } \bigl (I(\mathbf {x};\boldsymbol{\theta })-\mu _{I}\bigr )^{2} d\mathbf {x}, \tag{3} \end{equation*}$ View Sourcewith mean $\mu _{I} \doteq \frac{1}{|\Omega |} \int _{\Omega } I(\mathbf {x};\boldsymbol{\theta }) d\mathbf {x}$ . The objective function measures the goodness of fit between the events and the candidate motion curves (warp). Finally, an optimization algorithm iterates the above steps until convergence. The goal is to find the motion parameters that maximize the alignment of events caused by the same scene edge. Event alignment is measured by the strength of the edges of the IWE, which is directly related to image contrast [8].

For dense optical flow motion, the warp used is [27], [28] $\begin{equation*} \mathbf {x}^{\prime }_{k} = \mathbf {x}_{k} + (t_{k}-t_\text{ref}) \, \mathbf {v}(\mathbf {x}_{k}), \tag{4} \end{equation*}$ View Sourcewhere $\boldsymbol{\theta }= \lbrace \mathbf {v}(\mathbf {x}) \rbrace _{\mathbf {x}\in \Omega }$ is a flow field on the image plane $\Omega$ at a set time, e.g., $t_\text{ref}$ .

B. Multi-Reference Focus Objective Function

Zhu et al. [27] report that the contrast objective (variance) overfits to the events. This is in part because the warp (4) can describe very complex flow fields, which can push the events to accumulate in few pixels (i.e., event collapse [25], [26]). To mitigate event collapse, we reduce the complexity of the flow field by dividing the image plane into a tile of non-overlapping patches, defining a flow vector at the center of each patch, and interpolating the flow on all other pixels (see Section III-D). Interpolation confers smoothness of the flow field, hence lowering complexity.

However, reducing the complexity of the estimation parameters is not enough. Additionally, we discover that warps that produce sharp IWEs at any reference time $t_\text{ref}$ have a regularizing effect on the flow field, discouraging event collapse. This is illustrated in Fig. 3. In practice we compute the multi-reference focus loss using three reference times: $t_{1}$ (min), $t_{\text{ {mid}}}\doteq (t_{1}+t_{N_{e}})/2$ (midpoint) and $t_{N_{e}}$ (max). For each set of events, the flow field is defined only at one reference time and then used to warp to $\lbrace t_{1}, t_{\text{ {mid}}}, t_{N_{e}}\rbrace$ .

$Fig. 3. - Multi-reference focus loss. Assume an edge moves from left to right. Flow estimation with single reference time ($t_{1}$t1) can warp all events into a single pixel, which results in a maximum contrast (at $t_{1}$t1). However, the same flow would produce low contrast (i.e., a blurry image) if events were warped to time $t_{N_{e}}$tNe. Instead, we favor flow fields that produce high contrast (i.e., sharp images) at any reference time (here, $t_\text{ref}= t_{1}$tref=t1 and $t_\text{ref}= t_{N_{e}}$tref=tNe). See also results in Fig. 20.$

Fig. 3.

Multi-reference focus loss. Assume an edge moves from left to right. Flow estimation with single reference time ( $t_{1}$ ) can warp all events into a single pixel, which results in a maximum contrast (at $t_{1}$ ). However, the same flow would produce low contrast (i.e., a blurry image) if events were warped to time $t_{N_{e}}$ . Instead, we favor flow fields that produce high contrast (i.e., sharp images) at any reference time (here, $t_\text{ref}= t_{1}$ and $t_\text{ref}= t_{N_{e}}$ ). See also results in Fig. 20.

Secrets of Event-Based Optical Flow, Depth and Ego-Motion Estimation by Contrast Maximization

Alerts

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

Introduction

Related Work

A. Event-Based Optical Flow Estimation

B. Event-Based Depth and Ego-Motion Estimation

Method

A. Event Cameras and Contrast Maximization

B. Multi-Reference Focus Objective Function

1) Objective Functions Based on the IWE Gradient

C. Time-Aware Flow

D. Multi-Scale Flow Parameterization

E. Composite Objective Function

F. Depth and Ego-Motion Estimation

1) Monocular

2) Stereo

Experiments

A. Datasets, Metrics and Hyper-Parameters

B. Optical Flow Estimation

1) Results on the MVSEC Benchmark

2) Results on the DSEC Benchmark

3) Application to Deep Neural Networks (DNN)

4) Results on 1 Mpixel Datasets: TUM-VIE and M3ED

5) Discussion on Optical Flow Benchmarks and “GT” Flow

C. Applications of Optical Flow

1) Motion Segmentation

2) Image Reconstruction

3) Denoising Event Data

D. Monocular Depth and Ego-Motion Estimation

1) Results on MVSEC

2) Results on ECD

3) Results on 1 Mpixel Datasets. TUM-VIE and M3ED

E. Stereo Depth Estimation

Ablation and Sensitivity Analysis

A. Effect of the Multi-Reference Focus Loss

B. Effect of the Time-Aware Flow

C. Effect of the Multi-Scale Approach

D. The Choice of Loss Function

E. The Regularizer Weight

Computational Performance

Limitations

Conclusion

Authors

Figures

References

Citations

Keywords

Metrics

Supplemental Items

Footnotes

References

IEEE Account

Purchase Details

Profile Information

Need Help?