Introduction
Event cameras are novel bio-inspired vision sensors that naturally respond to the motion of edges in image space with high dynamic range (HDR) and minimal blur, at a high temporal resolution on the order of microseconds.
Multiple methods have been proposed for event-based optical flow estimation. They can be broadly categorized into two classes: (i) model-based methods, which investigate the principles and characteristics of event data that enable optical flow estimation, and (ii) learning-based methods, which exploit correlations in the data and/or apply the above-mentioned principles to compute optical flow. One of the challenges of event-based optical flow is the lack of ground truth flow in real-world datasets (at the space-time resolution of events).
Among prior work, Contrast Maximization (CM) [7], [8] is a powerful framework that allows us to tackle multiple motion estimation problems (rotational motion [9], [10], [11], [12], homographic motion [7], [13], [14], feature flow estimation [15], [16], [17], [18], motion segmentation [19], [20], [21], [22], and also reconstruction [7], [23], [24]). It maximizes an objective function (e.g., contrast) that measures the alignment of events caused by the same scene edge. The intuitive interpretation is to estimate the motion by recovering the sharp (motion-compensated) image of edge patterns that caused the events. Preliminary work on applying CM to estimate optical flow has reported event collapse [25], [26], producing flows at undesired optima that warp events to few pixels or lines [27]. This issue has been tackled by changing the objective function, from contrast to the energy of an average timestamp image [27], [28], but this loss is not straightforward to interpret [8], [29], and is not without its problems [30].
The state-of-the-art performance of CM in low degrees-of-freedom (DOF) motion estimation and its issues in more complex motions (dense flow) suggest that prior work may have rushed to use CM in unsupervised learning of dense flow. There is a gap in understanding how CM can be sensibly extended to estimate dense optical flow accurately. This paper fills this gap and shows a few “secrets” that can also be applied to overcome the issues of previous approaches.
We propose to extend CM for dense optical flow estimation via a tile-based approach covering the image plane (Fig. 1). We present several distinctive contributions:
A multi-reference focus loss function to improve accuracy and discourage overfitting (Section III-B).
A principled time-aware flow to better handle occlusions, leveraging the solution of transport problems via differential equations (Section III-C).
A multi-scale approach to improve convergence to the solution and avoid getting trapped in local optima (Section III-D).
Optical flow is a fundamental visual quantity related to many others, such as camera motion and scene depth. Hence, in this paper we exploit these connections, in monocular and stereo configurations, and show how a dense flow can serve to tackle various related problems in event-based vision, such as depth estimation, motion segmentation, etc. (Fig. 2). This paper is based on our previous work [31], which we substantially extend in the following points:
We introduce a new objective function that improves both flow and depth estimation (Section III-B1).
Fig. 2. Overview. The proposed method relies solely on event data. It not only estimates optical flow, but can also estimate scene depth and ego-motion simultaneously from a monocular or stereo event camera setup. Furthermore, the estimated flow enables various downstream applications such as motion segmentation, intensity reconstruction, and event denoising.
We tackle stationary scenes, estimating monocular depth and ego-motion jointly (Sections III-F1 and IV-D).
We also address the stereo setup (Sections III-F2 and IV-E).
We discuss current optical flow benchmarks, evaluations and “GT” flow (Section IV-B5).
We provide experiments on downstream applications of optical flow: motion segmentation, intensity reconstruction, and denoising (Section IV-C).
We show experiments on 1 Mpixel event cameras using the most recent event camera datasets, TUM-VIE [32] and M3ED [33], for both flow (Section IV-B4) and depth estimation (Section IV-D3).
We extend the discussion on computational performance and limitations (Sections VI and VII).
The results of our experimental evaluation are surprising: the above design choices are key to our simple, model-based, tile-based method achieving the best accuracy among all state-of-the-art methods, including supervised-learning ones, on the de facto benchmark of MVSEC indoor sequences [34]. Since our method is interpretable and produces better event alignment than the ground truth flow, both qualitatively and quantitatively, the experiments also expose the limitations of the current “ground truth”. The experiments demonstrate that the above key choices are transferable to unsupervised learning methods, thus guiding future design and understanding of more proficient Artificial Neural Networks (ANNs) for event-based optical flow estimation. Finally, the method allows us to solve many motion-related applications, thus becoming a cornerstone in event-based vision.
Because of the above, we believe that the proposed design choices deserve to be called “secrets” [35]. To the best of our knowledge, they are novel in the context of event-based optical flow, depth and ego-motion estimation, e.g., no prior work considers constant flow along its characteristic lines, designs the multi-reference focus loss to tackle overfitting, or has defined multi-scale (i.e., multi-resolution) contrast maximization on the raw events.
Related Work
A. Event-Based Optical Flow Estimation
Given the identified advantages of event cameras to estimate optical flow, extensive research on this topic has been carried out. Prior work has proposed adaptations of frame-based approaches (block matching [36], Lucas-Kanade [37]), filter-banks [38], [39], spatio-temporal plane-fitting [40], [41], time surface matching [42], variational optimization on voxelized events [43], and feature-based contrast maximization [7], [15]. For a detailed survey, we refer to [2].
Current state-of-the-art approaches are ANNs [27], [30], [34], [44], [45], [46], largely inspired by frame-based optical flow architectures [47], [48]. Non-spiking–based approaches need to additionally adapt the input signal, converting the events into a tensor representation (event frames, voxel grids, etc.). These learning-based methods can be classified into supervised, semi-supervised, or unsupervised (see Table I). In terms of architectures, the three most common ones are U-Net [34], [49], FireNet [28], and RAFT [44], [50].
Supervised methods train ANNs in simulation and/or on real data [44], [49], [50], [51], [52], [53], [54]. This requires accurate GT flow that matches the space-time resolution of event cameras. While this is not a problem in simulation, it incurs a performance gap when trained models are used to predict flow on real data, often due to a large domain gap between training and test data [52], [55]. Besides, real-world datasets have issues in providing accurate GT flow.
Semi-supervised methods use the grayscale images from a colocated camera (e.g., DAVIS [56]) as a supervisory signal: images are warped using the flow predicted by the ANN and their photometric consistency is used as loss function [34], [45], [46]. While such supervisory signal is easier to obtain than real-world GT flow, it may suffer from the limitations of frame-based cameras (e.g., motion blur and low dynamic range), consequently affecting the trained ANNs. EV-FlowNet [34] pioneered these approaches.
Unsupervised methods rely solely on event data. Their loss function consists of an event alignment error using the flow predicted by the ANN [27], [28], [30], [57], [58], [59]. Zhu et al. [27] extended EV-FlowNet [34] to the unsupervised setting using a motion-compensation loss inspired by the average timestamp images in [19]. This U-Net–like approach has been improved with recurrent blocks in [28], [30]. Paredes-Vallés et al. [28] also proposed FireFlowNet, a lightweight recurrent ANN with no downsampling. More recently, [30] has proposed several variants of EV-FlowNet and FireFlowNet models, and, enabled by the recurrent blocks, has replaced the usual voxel-grid input event representation by sequentially processing short-time event frames. Finally, concurrent work [59] builds upon [30] (sequential processing of event frames), proposing iterative event warping at multiple reference times in a multi-timescale fashion, which allows curved motion trajectories.
B. Event-Based Depth and Ego-Motion Estimation
Having estimated optical flow, one could try to fit a depth map and camera ego-motion a posteriori consistent with the flow [60]. Instead, it is better to incorporate the assumption of a still scene and a moving camera on the parameterization of the flow using the motion field equation [6]. While this connection exists, the topic of joint ego-motion and dense depth estimation via the motion field is not as explored as optical flow estimation. The problem is difficult, and often one settles for estimating depth alone, with or without knowledge of the camera motion [23], [61], [62].
Closest to our work are [27], [57] because they estimate a depth-parameterized motion field that best fits the event data. They do so by training ANNs in an unsupervised way. The loss functions are based on the energy of an average timestamp image [27] or on the photometric consistency of edge-maps warped by the predicted flow [57].
Similar to the above-mentioned unsupervised-learning works, our method produces dense optical flow and/or depth and does not need ground truth or additional supervisory signals. In contrast to prior work, we adopt a more classical modeling perspective to gain insights into the problem and discover principled solutions that can subsequently be applied to the learning-based setting. Stemming from an accurate and spatially-dependent contrast loss (the gradient magnitude [8]), we model the problem using a tile of patches (in flow or depth parameters) and propose solutions to several problems: overfitting, occlusions, and convergence. To the best of our knowledge, (i) no prior work has proposed to estimate dense optical flow and/or dense depth from a CM model-based perspective, and (ii) no prior unsupervised learning approach based on motion compensation has succeeded in estimating optical flow without the average timestamp image loss. The latter may be due to event collapse [25], but given recent advances on overcoming this issue [31], we show it is possible to succeed.
Method
In this section, first we briefly revisit the Contrast Maximization framework (Section III-A). Then, the proposed methods are explained in detail: Section III-B proposes the new data fidelity term of the objective function, which discourages event collapse. Section III-C proposes a principled model for optical flow that considers the space-time nature of events. We also explain the multi-scale parameterization of the flow (Section III-D), the composite objective function (Section III-E), and the application to the problem of depth and ego-motion estimation in monocular and stereo configurations (Section III-F).
A. Event Cameras and Contrast Maximization
Event cameras have independent pixels that operate continuously and generate “events” asynchronously whenever the logarithmic brightness at a pixel changes by a predefined amount.
The CM framework [7] assumes events $\mathcal{E} \doteq \{e_k\}_{k=1}^{N_e}$ are generated by moving edges, and transforms (warps) them according to a motion model with parameters $\boldsymbol{\theta}$ to a reference time $t_\text{ref}$:
\begin{equation*}
e_{k} \doteq (\mathbf {x}_{k},t_{k},p_{k}) \;\,\mapsto \;\, e^{\prime }_{k} \doteq (\mathbf {x}^{\prime }_{k},t_\text{ref},p_{k}). \tag{1}
\end{equation*}
The warped events $\mathcal{E}^{\prime}_{t_\text{ref}} \doteq \{e^{\prime}_k\}$ are aggregated into an image of warped events (IWE):
\begin{equation*}
I(\mathbf {x}; \mathcal {E}^{\prime }_{t_\text{ref}}, \boldsymbol{\theta }) \doteq \sum _{k=1}^{N_{e}} \delta (\mathbf {x}- \mathbf {x}^{\prime }_{k}), \tag{2}
\end{equation*}
The alignment of the warped events is then measured by the contrast of the IWE, e.g., its variance:
\begin{equation*}
\operatorname{Var}\bigl (I(\mathbf {x};\boldsymbol{\theta })\bigr ) \doteq \frac{1}{|\Omega |} \int _{\Omega } \bigl (I(\mathbf {x};\boldsymbol{\theta })-\mu _{I}\bigr )^{2} d\mathbf {x}, \tag{3}
\end{equation*}
For dense optical flow, the warp used is [27], [28]
\begin{equation*}
\mathbf {x}^{\prime }_{k} = \mathbf {x}_{k} + (t_{k}-t_\text{ref}) \, \mathbf {v}(\mathbf {x}_{k}), \tag{4}
\end{equation*}
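For concreteness, the following minimal Python/NumPy sketch (illustrative only; the helper names and the bilinear voting, which replaces the Dirac delta of (2) with a smooth kernel, are our choices, not the authors' implementation) warps events with the flow warp (4), builds the IWE (2), and evaluates the variance contrast (3).
\begin{verbatim}
import numpy as np

def warp_events(xy, t, flow, t_ref):
    """Warp events to t_ref with a per-pixel constant flow, cf. (4).
    xy: (N,2) pixel coordinates, t: (N,) timestamps [s], flow: (H,W,2) [px/s]."""
    v = flow[xy[:, 1].astype(int), xy[:, 0].astype(int)]  # flow at each event
    return xy + (t - t_ref)[:, None] * v                  # warped locations x'_k

def iwe(xy_warped, img_size):
    """Image of warped events (2), with bilinear voting replacing the delta."""
    H, W = img_size
    img = np.zeros((H, W))
    x0 = np.floor(xy_warped)
    frac = xy_warped - x0
    for dx in (0, 1):
        for dy in (0, 1):
            xx = x0[:, 0].astype(int) + dx
            yy = x0[:, 1].astype(int) + dy
            w = np.abs(1 - dx - frac[:, 0]) * np.abs(1 - dy - frac[:, 1])
            ok = (xx >= 0) & (xx < W) & (yy >= 0) & (yy < H)
            np.add.at(img, (yy[ok], xx[ok]), w[ok])
    return img

def variance_contrast(img):
    """Contrast objective (3): variance of the IWE."""
    return np.mean((img - img.mean()) ** 2)
\end{verbatim}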
B. Multi-Reference Focus Objective Function
Zhu et al. [27] report that the contrast objective (variance) overfits to the events. This is in part because the warp (4) can describe very complex flow fields, which can push the events to accumulate in few pixels (i.e., event collapse [25], [26]). To mitigate event collapse, we reduce the complexity of the flow field by dividing the image plane into a tile of non-overlapping patches, defining a flow vector at the center of each patch, and interpolating the flow on all other pixels (see Section III-D). Interpolation confers smoothness of the flow field, hence lowering complexity.
However, reducing the complexity of the estimation parameters is not enough. Additionally, we discover that warps that produce sharp IWEs at any reference time (rather than only at a single one) are preferable, since sharpness at a single reference time can be achieved by overfitting. Hence, we combine the focus of the IWE at several reference times into a single loss.
Multi-reference focus loss. Assume an edge moves from left to right. Flow estimation with a single reference time can overfit (collapsing events toward that time), whereas enforcing sharpness at multiple reference times discourages such overfitting.
Letting $G(\boldsymbol{\theta}; t_\text{ref})$ denote the focus (sharpness) of the IWE at reference time $t_\text{ref}$ (Section III-B1), we combine the focus at three reference times ($t_{1}$, $t_{\text{mid}}$, $t_{N_{e}}$) and normalize by that of the identity warp, defining
\begin{equation*}
f(\boldsymbol{\theta }) \doteq \bigl (G(\boldsymbol{\theta }; t_{1}) + 2G(\boldsymbol{\theta }; t_{\text{mid}}) + G(\boldsymbol{\theta }; t_{N_{e}})\bigr ) \,/\, 4 G(\mathbf {0}; -), \tag{5}
\end{equation*}
Remark: Warping to two reference times (min and max) was proposed in [27], but with important differences: (i) it was done for the average timestamp loss, hence it did not consider the effect on contrast or focus functions [8], and (ii) it had a completely different motivation: to lessen a back-propagation scaling problem, so that the gradients of the loss would not favor events far from the reference time.
1) Objective Functions Based on the IWE Gradient
Among the contrast functions proposed in [7], [8], we use two functions based on the gradient of the IWE:
\begin{equation*}
G(\boldsymbol{\theta }; t_\text{ref}) \doteq \frac{1}{|\Omega |} \int _{\Omega } \Vert \nabla I(\mathbf {x}; t_\text{ref})\Vert ^{q}\,d\mathbf {x}, \tag{6}
\end{equation*}
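A sketch of the multi-reference focus loss (5) with the gradient-based objective (6) is given below. It assumes the IWEs at the three reference times and for the identity warp (zero flow) have been computed beforehand (e.g., with the helpers of the previous sketch), and uses finite differences (np.gradient) for the image gradient; these are simplifications of ours.
\begin{verbatim}
import numpy as np

def gradient_focus(img, q=1):
    """Focus objective (6): mean IWE gradient magnitude raised to the power q."""
    gy, gx = np.gradient(img)                      # finite-difference gradient
    return np.mean(np.sqrt(gx**2 + gy**2) ** q)

def multi_reference_focus(iwe_t1, iwe_tmid, iwe_tN, iwe_identity, q=1):
    """Multi-reference focus loss (5): weighted focus at t_1, t_mid and t_Ne,
    normalized by the focus of the identity warp (zero flow)."""
    num = gradient_focus(iwe_t1, q) + 2 * gradient_focus(iwe_tmid, q) \
        + gradient_focus(iwe_tN, q)
    return num / (4 * gradient_focus(iwe_identity, q))
\end{verbatim}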
C. Time-Aware Flow
State-of-the-art event-based optical flow approaches are based on frame-based ones, and so they use the warp (4), which defines the flow as a function of the pixel location only, i.e., constant in time over the (short) time spanned by the events.
Time-aware flow. Traditional flow (4), inherited from the frame-based setting, assumes a per-pixel constant flow, whereas the proposed time-aware flow is defined in space-time and transported along its streamlines, which better handles occlusions.
To define a space-time flow, we assume that the flow is constant along its streamlines, i.e., its total derivative along the trajectories it induces vanishes:
\begin{equation*}
\frac{\partial \mathbf {v}}{\partial \mathbf {x}} \frac{d\mathbf {x}}{dt} + \frac{\partial \mathbf {v}}{\partial t} = \mathbf {0}, \tag{7}
\end{equation*}
Solving (7) (e.g., with upwind or Burgers'-type numerical schemes, see Section V-B) yields a propagated flow $\hat{\mathbf{v}}(\mathbf{x},t)$, which defines the time-aware warp
\begin{equation*}
\mathbf {x}^{\prime }_{k} = \mathbf {x}_{k} + (t_{k}-t_\text{ref}) \, \hat{\mathbf {v}}(\mathbf {x}_{k},t_{k}). \tag{8}
\end{equation*}
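The following sketch illustrates one possible discretization of the transport equation (7) (an explicit upwind scheme; a Burgers'-type scheme is also discussed in Section V-B) and the resulting time-aware warp (8). The time stepping, the periodic boundary handling via np.roll, and the per-event lookup are simplifications of ours, not the authors' implementation.
\begin{verbatim}
import numpy as np

def upwind_step(flow, dt):
    """One explicit upwind step of the transport PDE (7): the flow is advected
    by itself, dv/dt + (v . grad) v = 0.  flow: (H,W,2) [px per time unit];
    dt should satisfy a CFL condition (|v| dt <= 1 px) for stability.
    Periodic boundaries (np.roll) are used only for brevity."""
    out = flow.copy()
    for c in range(2):                              # advect each flow component
        f = flow[..., c]
        bx = f - np.roll(f, 1, axis=1); fx = np.roll(f, -1, axis=1) - f
        by = f - np.roll(f, 1, axis=0); fy = np.roll(f, -1, axis=0) - f
        dfdx = np.where(flow[..., 0] > 0, bx, fx)   # upwind selection along x
        dfdy = np.where(flow[..., 1] > 0, by, fy)   # upwind selection along y
        out[..., c] = f - dt * (flow[..., 0] * dfdx + flow[..., 1] * dfdy)
    return out

def time_aware_warp(xy, t, flows, times, t_ref):
    """Warp (8): each event uses the flow field propagated to (approximately)
    its own timestamp.  `flows` are (H,W,2) fields at the timestamps `times`,
    e.g., obtained by repeatedly applying upwind_step to the field at t_ref
    (here assumed to be the start of the event window)."""
    idx = np.clip(np.searchsorted(times, t), 0, len(flows) - 1)
    v_hat = np.stack([flows[i][int(y), int(x)] for (x, y), i in zip(xy, idx)])
    return xy + (t - t_ref)[:, None] * v_hat
\end{verbatim}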
D. Multi-Scale Flow Parameterization
Inspired by classical estimation methods, we combine our tile-based approach with a multi-scale strategy. The goal is to improve the convergence of the optimizer in terms of speed and robustness (i.e., avoiding local optima).
Some learning-based works [27], [28], [34] also have a multi-scale component, inherited from the use of a U-Net architecture. However, they work on discretized event representations (voxel grid, etc.) to be compatible with DNNs. In contrast, our tile-based approach works directly on raw events, without discarding or quantizing the temporal information in the event stream.
Our multi-scale CM approach is illustrated in Fig. 5. For an event set, we proceed in a coarse-to-fine fashion: coarser scales use larger tiles (i.e., fewer flow parameters), and the solution at each scale, suitably upsampled, initializes the next finer scale.
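A coarse-to-fine sketch is given below; the specific tile counts in tile_counts and the bilinear upsampling via scipy.ndimage.zoom are illustrative assumptions (the actual scales and tile sizes used in the experiments may differ), and `optimize` stands for any routine that refines the tile parameters by maximizing the focus objective.
\begin{verbatim}
import numpy as np
from scipy.ndimage import zoom

def tiles_to_dense_flow(tile_flow, img_size):
    """Bilinearly interpolate tile-center flow vectors to a dense (H,W,2) field."""
    H, W = img_size
    th, tw, _ = tile_flow.shape
    return np.stack([zoom(tile_flow[..., c], (H / th, W / tw), order=1)
                     for c in range(2)], axis=-1)

def coarse_to_fine(events, img_size, tile_counts=(4, 8, 16, 32, 64), optimize=None):
    """Coarse-to-fine CM: optimize few tiles first, then upsample the solution
    to initialize the next, finer scale.  `optimize(events, init)` is any routine
    that refines the tile parameters by maximizing the focus objective."""
    tile_flow = np.zeros((tile_counts[0], tile_counts[0], 2))
    for n in tile_counts:
        init = np.stack([zoom(tile_flow[..., c], n / tile_flow.shape[0], order=1)
                         for c in range(2)], axis=-1)
        tile_flow = optimize(events, init) if optimize is not None else init
    return tiles_to_dense_flow(tile_flow, img_size)
\end{verbatim}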
E. Composite Objective Function
To encourage additional smoothness of the flow, even in regions with few events, we include a flow regularizer $\mathcal{R}(\boldsymbol{\theta})$, weighted by $\lambda > 0$. The flow parameters are estimated by solving
\begin{equation*}
\boldsymbol{\theta }^{\ast } = \arg \min _{\boldsymbol{\theta }} \left(\frac{1}{f(\boldsymbol{\theta })} + \lambda \mathcal {R}(\boldsymbol{\theta })\right), \tag{9}
\end{equation*}
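The sketch below assembles (9). The Charbonnier-type total-variation regularizer and the value of the weight are assumptions for illustration (the learning-based variant in Section IV-B3 uses a Charbonnier loss); the exact regularizer is a design choice.
\begin{verbatim}
import numpy as np

def smoothness_regularizer(tile_flow, eps=1e-3):
    """Charbonnier-type total variation on the tile flow (the exact form of the
    regularizer is an assumption of this sketch)."""
    dx = np.diff(tile_flow, axis=1)
    dy = np.diff(tile_flow, axis=0)
    return np.sum(np.sqrt(dx**2 + eps**2)) + np.sum(np.sqrt(dy**2 + eps**2))

def composite_objective(focus_value, tile_flow, lam=0.01):
    """Composite objective (9), to be minimized: 1/f(theta) + lambda * R(theta).
    `focus_value` is the multi-reference focus f(theta) of (5)."""
    return 1.0 / focus_value + lam * smoothness_regularizer(tile_flow)
\end{verbatim}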
F. Depth and Ego-Motion Estimation
1) Monocular
For a still scene but with a moving camera, the motion induced on the image plane has fewer DOFs than the most general case considered so far. In this scenario, it is beneficial to parameterize the optical flow in terms of the scene depth $Z(\mathbf{x})$ and the camera's linear and angular velocities $(\mathbf{V}, \boldsymbol{\omega})$, via the motion field equation [6]:
\begin{equation*}
\mathbf {v}(\mathbf {x}) = \frac{1}{Z(\mathbf {x})}A(\mathbf {x})\mathbf {V}+ B(\mathbf {x})\boldsymbol{\omega }, \tag{10}
\end{equation*}
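For reference, a sketch of (10) in normalized camera coordinates is given below; the interaction matrices A(x) and B(x) follow the standard motion-field model, and the sign conventions (camera velocity expressed in the camera frame) are assumptions that may differ from the exact definitions used in the paper.
\begin{verbatim}
import numpy as np

def motion_field(depth, V, omega, K):
    """Motion field (10) for a still scene: v(x) = (1/Z) A(x) V + B(x) omega.
    depth: (H,W) metric depth, V: (3,) linear velocity, omega: (3,) angular
    velocity (camera frame), K: (3,3) intrinsics.  Returns (H,W,2) flow in
    pixels per time unit; signs follow the standard Longuet-Higgins/Prazdny
    convention."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x, y = (u - cx) / fx, (v - cy) / fy           # normalized coordinates
    # translational part: (1/Z) A(x) V, with A(x) = [[-1, 0, x], [0, -1, y]]
    vx_t = (-V[0] + x * V[2]) / depth
    vy_t = (-V[1] + y * V[2]) / depth
    # rotational part: B(x) omega
    vx_r = x * y * omega[0] - (1 + x**2) * omega[1] + y * omega[2]
    vy_r = (1 + y**2) * omega[0] - x * y * omega[1] - x * omega[2]
    return np.stack([fx * (vx_t + vx_r), fy * (vy_t + vy_r)], axis=-1)
\end{verbatim}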
Similarly to Section III-D, we parameterize the depth in a tile-based, multi-scale fashion (interpolating between tile centers), while the camera velocities $(\mathbf{V}, \boldsymbol{\omega})$ in (10) are global parameters shared by all pixels.
Note that the motion field parameterization (10) is not supposed to handle independently moving objects (IMOs), although it is effective in many event-based optical flow benchmarks (e.g., Sections IV-B1 and IV-B2). We discuss the validity and the limitations of optical flow benchmarking in Section IV-B5, as well as the comprehensive results in Section IV-D.
2) Stereo
The proposed method can be extended to stereo configurations. Parameterizing scene depth and ego-motion on the left camera and using the extrinsic parameters of the stereo setup, we can compute the depth and the motion on the right camera (e.g., by warping the left depth map onto the right camera using the nearest neighbor interpolation). Having depth and ego-motion on each camera, we define the objective function as the sum
\begin{equation*}
\boldsymbol{\theta }^{\ast } = \arg \min _{\boldsymbol{\theta }} \left(\frac{1}{f_{\text{l}}(\boldsymbol{\theta })} + \lambda \mathcal {R}_{\text{l}}(\boldsymbol{\theta }) + \frac{1}{f_{\text{r}}(\boldsymbol{\theta })} + \lambda \mathcal {R}_{\text{r}}(\boldsymbol{\theta }) \right), \tag{11}
\end{equation*}
In prior work on stereo depth estimation [68], one of the main challenges is finding correspondences between the event streams of multiple cameras, a non-trivial problem that is sensitive to event noise. The proposed method bypasses the event-to-event correspondence problem by parameterizing the depth densely on the whole image plane of one camera and transferring it to the other camera.
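A sketch of the depth transfer described above (left depth map warped onto the right camera with nearest-neighbor assignment) might look as follows; the z-buffering, the assumption of equal image sizes, and the extrinsics convention X_r = R_rl X_l + t_rl are ours.
\begin{verbatim}
import numpy as np

def transfer_depth_to_right(depth_left, K_l, K_r, R_rl, t_rl):
    """Warp the left depth map onto the right camera using the stereo extrinsics
    (nearest-neighbor assignment with a z-buffer keeping the closest surface).
    Convention (assumed): X_r = R_rl X_l + t_rl; both images have the same size.
    Returns a right-view depth map with NaN where no left pixel projects."""
    H, W = depth_left.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])    # (3, HW) homogeneous
    X_l = np.linalg.inv(K_l) @ pix * depth_left.ravel()       # back-projection
    X_r = R_rl @ X_l + t_rl.reshape(3, 1)                     # change of frame
    proj = K_r @ X_r
    u_r = np.round(proj[0] / proj[2]).astype(int)             # nearest neighbor
    v_r = np.round(proj[1] / proj[2]).astype(int)
    z_r = X_r[2]
    ok = (z_r > 0) & (u_r >= 0) & (u_r < W) & (v_r >= 0) & (v_r < H)
    depth_right = np.full((H, W), np.nan)
    order = np.argsort(-z_r[ok])                # far-to-near: nearest point wins
    depth_right[v_r[ok][order], u_r[ok][order]] = z_r[ok][order]
    return depth_right
\end{verbatim}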
Summarizing Remark: All the proposals in Section III are formulated as optimization problems, and they are theoretically extensible to learning-based approaches (DNNs), since they are fully differentiable. We show an example of learning-based flow estimation in Section IV-B3. Hence, our work provides model-based approaches that can act as baselines for the development of learning-based methods for event-based optical flow, monocular depth, ego-motion, and stereo depth estimation.
Experiments
We assess the performance of our method on seven datasets, which are presented in Section IV-A. We provide a comprehensive evaluation of optical flow estimation in Section IV-B. Additionally, we demonstrate the learning-based extension (DNN) (Section IV-B3), discuss current optical flow benchmarks (Section IV-B5), and show downstream applications (Section IV-C). The results of depth and ego-motion estimation are presented in Section IV-D (monocular) and Section IV-E (stereo).
A. Datasets, Metrics and Hyper-Parameters
The proposed method works robustly on data comprising different camera motions, scenes, and spatial resolutions. We conduct experiments on the following seven datasets.
Datasets: First, we evaluate our method on sequences from the MVSEC dataset [4], [34], which is the de facto standard dataset used by prior works to benchmark optical flow. The dataset contains sequences recorded indoors with a drone, and outdoors with a car. It provides events, grayscale frames, IMU data, camera poses, and scene depth from a LiDAR [4]. The dataset was extended in [34] to provide ground truth (GT) optical flow, computed as the motion field [6] given the camera velocity and the depth of the scene. Notice that the indoor sequences do not have IMOs, and the outdoor sequences do not include scenes with IMOs in the benchmark evaluation. The event camera has a spatial resolution of 346×260 pixels (DAVIS346).
We also evaluate on a recent dataset that provides ground truth flow: DSEC [44]. It consists of sequences recorded with Prophesee Gen3 event cameras (stereo), of higher resolution (VGA, 640×480 pixels) than MVSEC.
Additionally, we carry out experiments on two HD resolution event camera datasets, TUM-VIE [32] and M3ED [33], recorded with stereo Prophesee Gen4 event cameras (1280×720 pixels).
The ECD dataset [63] is a lower resolution, standard dataset to assess camera ego-motion [9], [16], [25], [69], [70], [71], [72]. Each sequence provides events, frames, calibration information, and IMU data from a DAVIS240C camera (240×180 pixels).
Finally, we also test sequences from two motion segmentation datasets [20], [21]. The sequences in EMSMC [20] are recorded using a hand-held DAVIS240C camera (240×180 pixels).
Evaluation Metrics: The metrics used to assess optical flow accuracy are the average endpoint error (AEE), the angular error (AE), and the percentage of pixels with an endpoint error greater than 3 pixels (“% Out”). To assess event alignment when GT flow is unavailable or unreliable, we also report the Flow Warp Loss (FWL), i.e., the variance of the IWE produced by the estimated flow relative to that of the identity warp (FWL > 1 indicates that the flow sharpens the event image).
For depth accuracy evaluation, we use standard metrics following previous work on monocular depth estimation [57], [74]. The depth error metrics are SiLog, Absolute Relative Difference (denoted by “AbsRelDiff”), and the logarithmic RMSE (“logRMSE”). While SiLog is scale-invariant, we substitute the prediction using the mean of the GT for AbsRelDiff and logRMSE. We furthermore report depth accuracy metrics that compute the percentage of pixels whose relative depth with respect to GT is smaller than a threshold. We use three common thresholds: 1.25, 1.25², and 1.25³ (denoted A1, A2, and A3, respectively).
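For reproducibility, a sketch of the flow metrics is given below; the plain 3-pixel outlier threshold is a simplification (the exact outlier criterion differs per benchmark), and the FWL follows the common definition as a ratio of IWE variances.
\begin{verbatim}
import numpy as np

def flow_error_metrics(flow_pred, flow_gt, valid_mask, outlier_px=3.0):
    """AEE and outlier percentage, evaluated only where GT is valid (masked
    evaluation, as in MVSEC/DSEC)."""
    ee = np.linalg.norm(flow_pred - flow_gt, axis=-1)[valid_mask]
    return ee.mean(), 100.0 * np.mean(ee > outlier_px)

def fwl(iwe_flow, iwe_identity):
    """Flow Warp Loss: IWE variance with the estimated flow relative to the
    identity warp (FWL > 1 means the flow sharpens the event image)."""
    return iwe_flow.var() / iwe_identity.var()
\end{verbatim}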
Hyper-parameters: For flow estimation our method uses
The number of events was selected guided by the benchmarks and/or experimentally, based on the variables that affect event generation (the camera's spatial resolution, scene texture, motion, etc.) and the CM method (edges should displace enough, e.g., three pixels, see [18]). The estimated flow is scaled and aligned with the benchmark timestamps if necessary (e.g., MVSEC). There is a trade-off: with too few events, CM does not work (the data is scarce and the displacement is too small to yield a well-shaped objective landscape); with too many events, the method may not produce a good fit if the constant optical flow assumption does not hold over the time span of the events.
Since the motion-field parameterization reduces the complexity of the problem, we successfully use finer scales
B. Optical Flow Estimation
1) Results on the MVSEC Benchmark
We first report the results on the MVSEC benchmark (Table I). The different methods (rows) are compared on one outdoor and three indoor sequences (columns). This is because many learning-based methods train on the other outdoor sequence, which is therefore not used for testing. Following Zhu et al. [34], outdoor_day1 is evaluated only on a specified subset of 800 frames. The top part of Table I reports the flow corresponding to a time interval of one grayscale frame (dt = 1), and the bottom part to four frames (dt = 4).
The table is comprehensive, showing where the proposed methods stand compared to prior work. Our methods provide the best results among all methods in all indoor sequences and are the best among the unsupervised and model-based methods in the outdoor sequence. The errors for
Among different variations of the proposed methods, we observe that (i) the motion field parameterization achieves better accuracy than the direct parameterization of the flow in indoor sequences, (ii) there are no significant differences between the three versions of the flow warp models, and (iii) the
Qualitative results are shown in Fig. 6, where we compare our method against the state-of-the-art learning-based methods. Our method provides sharper IWEs than the baselines, without overfitting, and the estimated flow resembles the GT. We display flow masked by the events, for consistency with the benchmark. Recall that our method interpolates the flow at pixels with zero events. The USL result [30] is obtained using its official implementation, comprising a recurrent model that sequentially processes sub-partitions of event data. Notice that we use the event mask of the full timestamps (
MVSEC results. Qualitative comparison of the estimated flow and IWEs against the GT and state-of-the-art learning-based baselines (flow is masked by the events, for consistency with the benchmark).
Ground truth is not available on the entire image plane (see Fig. 6), such as in pixels not covered by the LiDAR's range, FOV, or spatial sampling. Additionally, there may be interpolation issues in the GT, since the LiDAR works at 20 Hz and the GT flow is given at frame rate (45 Hz). In the outdoor sequences, the GT from the LiDAR and the camera motion cannot provide correct flow for IMOs. These issues of the GT are noticeable in the IWEs: they are not as sharp as expected. In contrast, the IWEs produced by our method are sharp.
2) Results on the DSEC Benchmark
Table II gives quantitative results on the DSEC Optical Flow benchmark. No GT flow is available for these test sequences. The proposed methods are compared with an unsupervised-learning method [59] (Section II) and a supervised-learning method, E-RAFT [44]. E-RAFT is an ANN that extracts features in event correlation volumes via an iterative update scheme instead of using a U-Net architecture. This version of RAFT [48] was introduced along with the DSEC flow benchmark and showed that it can estimate pixel correspondences for large displacements. As expected, E-RAFT is better than ours in terms of flow accuracy because (i) it has additional training information (GT labels), and (ii) it is trained using the same type of GT signal used in the evaluation. Nevertheless, our method provides sensible results and is better in terms of FWL, which exposes GT quality issues similar to those of MVSEC: many pixels have no GT (LiDAR's FOV and IMOs). This is also confirmed in the qualitative results (Fig. 7). Our method provides sharp IWEs, even for IMOs (car) and the road close to the camera. We further discuss the issue of IMOs in the flow benchmarks in Section IV-B5.
DSEC results on the interlaken_00b sequence (no GT available). Since GT is missing at IMOs and points outside the LiDAR's FOV, the supervised method [44] may provide inaccurate predictions around IMOs and road points close to the camera, whereas our method produces sharp edges. For visualization, we use 1 M events.
Remarkably, the proposed methods achieve competitive results in terms of flow accuracy with the unsupervised-learning method [59]. Among different variations, the “Flow (
Notice that both DNN methods [44], [59] train and evaluate on the DSEC dataset, which consists predominantly of forward driving motion. As a result, these learning-based methods may overfit to the driving data (i.e., tend to predict forward motion) and fail to produce good results for other motions and datasets [55] (e.g., see the E-RAFT rows on the MVSEC indoor sequences in Table I). In contrast, the proposed methods rely on the principle of event alignment and generalize to various datasets, producing consistently good results.
Similarly to the MVSEC results, the
We observe that the evaluation intervals (100 ms) are large for optical flow standards. In the benchmark, 80% of the GT flow has up to 22px displacement, which means that 20% of the GT flow is larger than 22px (on VGA resolution). The apparent motion during such intervals is sufficiently large that it breaks the classical assumption of scene points flowing in linear trajectories (more details in Section IV-B5).
3) Application to Deep Neural Networks (DNN)
The proposed secrets are not only applicable to model-based methods, but also to unsupervised-learning methods. To this end, we train EV-FlowNet [34] in an unsupervised manner on the MVSEC dataset, using (9) as the data-fidelity term and a Charbonnier loss [67] as the regularizer. We convert 40k events into the voxel-grid representation [27] with 5 time bins. The network is trained for 50 epochs with the Adam optimizer [80], a learning rate of 0.001, and a decay factor of 0.8. To ensure generalization, we train our network on indoor sequences and test on the outdoor_day1 sequence. Since the time-aware flow does not have a significant influence on the MVSEC benchmark (Table I), we do not port it to the learning-based setting.
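To illustrate how the loss transfers to the learning-based setting, the sketch below implements a differentiable (PyTorch) version of the multi-reference gradient focus, to be applied to the flow predicted by the network. The bilinear splatting, the timestamp normalization to [0, 1], and the helper names are assumptions of this sketch (not the training code used in the experiments); in practice, the Charbonnier regularizer is added before back-propagation.
\begin{verbatim}
import torch

def bilinear_iwe(xy_warped, img_size):
    """Differentiable IWE: bilinear splatting of warped events (autograd passes
    gradients through the splatting weights)."""
    H, W = img_size
    img = torch.zeros(H * W, device=xy_warped.device)
    x0 = torch.floor(xy_warped)
    frac = xy_warped - x0
    for dx in (0, 1):
        for dy in (0, 1):
            xx = x0[:, 0].long() + dx
            yy = x0[:, 1].long() + dy
            w = (1 - dx - frac[:, 0]).abs() * (1 - dy - frac[:, 1]).abs()
            ok = (xx >= 0) & (xx < W) & (yy >= 0) & (yy < H)
            img = img.index_add(0, yy[ok] * W + xx[ok], w[ok])
    return img.view(H, W)

def unsupervised_focus_loss(flow, xy, t, img_size, t_refs=(0.0, 0.5, 1.0)):
    """Reciprocal of the multi-reference gradient focus (5)-(6), computed on the
    flow (H,W,2) predicted by the network.  Timestamps t are assumed normalized
    to [0,1]; flow is in pixels per normalized time unit."""
    v = flow[xy[:, 1].long(), xy[:, 0].long()]
    eps = 1e-9
    g = []
    for t_ref, wgt in zip(t_refs, (1.0, 2.0, 1.0)):
        img = bilinear_iwe(xy + (t - t_ref).unsqueeze(1) * v, img_size)
        gy, gx = torch.gradient(img)
        g.append(wgt * torch.sqrt(gx**2 + gy**2 + eps).mean())
    gy0, gx0 = torch.gradient(bilinear_iwe(xy, img_size))
    f = sum(g) / (4 * torch.sqrt(gx0**2 + gy0**2 + eps).mean())
    return 1.0 / f   # add a Charbonnier smoothness term on the flow in practice
\end{verbatim}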
Table III shows the quantitative comparison with unsupervised learning methods. Our model achieves the second-best accuracy, following [27], and the best sharpness (FWL) among the existing methods. Notice that [27] was trained on the outdoor_day2 sequence, which is a driving sequence similar to the test one, while the other methods were trained on drone data [81]. Hence, [27] might be overfitting to driving data, while ours is not, due to the choice of training data. The qualitative results of our unsupervised learning setting are shown in Fig. 8. We compare our method with the state-of-the-art unsupervised-learning method [30]. Our results resemble the GT flow.
Additionally, we train the architecture in [59] on DSEC data using the
4) Results on 1 Mpixel Datasets: TUM-VIE and M3ED
The proposed method generalizes to recent high spatial resolution event cameras. We show qualitative results on the TUM-VIE dataset [32] and the M3ED dataset [33] in Fig. 9. The flow looks realistic and produces sharp IWEs for various motions (forward motion, rotation, translation) and scenes (indoor and outdoor). Also, the flow estimation is stable regardless of the absolute scene intensity, while frames suffer from a limited dynamic range. Hence, we leverage the HDR advantages of event cameras.
5) Discussion on Optical Flow Benchmarks and “GT” Flow
Throughout the quantitative evaluation of the event-based optical flow (Sections IV-B1 and IV-B2), we observe some limitations for the current benchmarks: (i) size of the evaluation interval and (ii) independently moving objects (IMOs).
Evaluation intervals and the linearity of optical flow: The time-aware flow is designed to consider the space-time nature of events. Recently, there have also been other proposals aiming to leverage this nature for per-pixel motion estimation. The main difference between our flow (Section III-C) and concurrent proposals [53], [59], [82] is the motion hypothesis and its underlying assumptions: (7) assumes that the flow is constant along its streamlines within short time intervals, which produces linear motion trajectories (Fig. 4). The number of DOFs of the motion is
On the other hand, [53], [59] propose non-linear trajectories (e.g., Bézier curves) for the “optical flow”. We suspect that the choice of assuming non-linear trajectories stems from the necessity of reporting good figures on the DSEC benchmark (Table II), which has relatively long evaluation intervals. While it is called an “optical flow” benchmark, the ground truth on time intervals of 100 ms at moderate vehicle speeds can result in curved trajectories. The increased complexity of the non-linear trajectory estimation problem has several challenges to be addressed: (i) accuracy is difficult to evaluate with existing benchmarks, which are based on the standard definition of flow, (ii) there is a trade-off between the increased complexity of possible motions and the tendency to overfit, (iii) it is important to assess the efficacy of the curved trajectory in terms of downstream applications. We show various applications of the linear trajectory in Sections IV-C, IV-D, and IV-E; for curved trajectories, beyond focusing on beating the current benchmark, it would be interesting to show new applications. Finally, it is worth reconsidering the terminology of the estimation task, such as “instantaneous” (short-baseline) optical flow, versus “non-instantaneous” (i.e., large-baseline) curved trajectory estimation.
IMOs: The de facto standard flow benchmarks MVSEC and DSEC ignore pixels corresponding to IMOs (because it is difficult to obtain GT labels for IMOs in the real-world). However, optical flow can describe such motions. Indeed, as Table I shows, the motion-field–parameterized flow achieves better accuracy in still scenes. Training ANNs using only flow from rigid scenes may affect their learning capabilities. To avoid potential pitfalls of optical flow algorithms, it is therefore important that the data used for (training and) evaluation contains IMOs and a variety of ego-motions.
C. Applications of Optical Flow
This section demonstrates three exemplary applications of the estimated optical flow: motion segmentation, intensity reconstruction, and denoising.
1) Motion Segmentation
Motion segmentation is the task of splitting a scene into objects moving with different velocities. Thus, it is natural to address it by clustering optical flow [20]. To this end, we show results on three sequences from [20], [21] in Fig. 10 using k-means with 2 to 3 clusters. In the corridor scene (first row of Fig. 10) there are 3 clusters: two people are walking in opposite directions while the camera is moving (background). In the second example, the scene includes cars with horizontal motion while the camera tilts. The third example (car) has a car moving at a different speed in the same direction as the background, which is the most challenging case among these examples. In all examples, our method successfully provides sensible segmentation masks (last column of Fig. 10) corresponding to the scene objects.
Fig. 11 provides detailed analyses of the clustering operation for the corridor and car examples. Since the proposed method uses a tile-based parameterization of the flow, the interpolation between tiles produces flow vectors that fill in the regions between the distinctive cluster centroids. One could use other clustering algorithms, such as DBSCAN [83], to treat such interpolation effects as outliers.
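A minimal sketch of the clustering step is shown below; the use of scikit-learn's k-means and the default of 3 clusters are our choices for illustration (DBSCAN, as mentioned above, is a possible alternative).
\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

def segment_by_flow(flow, n_clusters=3, seed=0):
    """Cluster per-pixel flow vectors with k-means to obtain motion segments.
    flow: (H,W,2).  Returns per-pixel labels (H,W) and the cluster centroids."""
    H, W, _ = flow.shape
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(flow.reshape(-1, 2))
    return labels.reshape(H, W), km.cluster_centers_
\end{verbatim}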
2) Image Reconstruction
Events encode the apparent motion of scene edges (i.e., optical flow) as well as their brightness. These two quantities are entangled, and it is possible to use the computed optical flow to recover brightness, i.e., reconstruct intensity images [24]. We demonstrate it on a 1 Mpixel dataset in Fig. 12. The estimated flow provides sharp IWEs, which successfully aid the reconstruction of intensities such as the checkerboard on the wall, the light and its reflection in the corridor, and the complex structure of the stairs. The results are remarkable despite the noise in the corridor scene (see Section IV-C3). Due to the regularizer in [24], very fine structure (e.g., the poster contents) might not be crisp.
3) Denoising Event Data
By extending the idea of [84], which classifies events for temporal upsampling into signal or noise based on a predicted 2-DOF motion, we use the estimated optical flow to identify noise events as those whose warped location accumulates fewer events in the IWE than a given threshold (e.g., 3 events). Fig. 13 shows qualitative results. The corridor scene has a large amount of noise due to lighting (i.e., flickering events). The denoised event data looks clearer while retaining the edge structure of the scene.
Denoising. The data is the skate-easy sequence from the TUM-VIE dataset. The top row is the image representation of the events, while the bottom row shows them in space-time coordinates (for better visualization, only the bottom-right quarter of the image plane is displayed).
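A sketch of this classification rule is given below; the nearest-pixel accumulation (instead of bilinear voting) and the default threshold of 3 events are simplifications of ours.
\begin{verbatim}
import numpy as np

def denoise_events(xy, t, flow, t_ref, img_size, min_count=3):
    """Keep events whose warped location lands on a sufficiently supported IWE
    pixel (at least `min_count` events); the rest are treated as noise."""
    H, W = img_size
    v = flow[xy[:, 1].astype(int), xy[:, 0].astype(int)]
    xw = np.round(xy + (t - t_ref)[:, None] * v).astype(int)
    inside = (xw[:, 0] >= 0) & (xw[:, 0] < W) & (xw[:, 1] >= 0) & (xw[:, 1] < H)
    counts = np.zeros((H, W), dtype=int)
    np.add.at(counts, (xw[inside, 1], xw[inside, 0]), 1)
    keep = inside.copy()
    keep[inside] = counts[xw[inside, 1], xw[inside, 0]] >= min_count
    return keep                                   # boolean mask of signal events
\end{verbatim}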
D. Monocular Depth and Ego-Motion Estimation
1) Results on MVSEC
Evaluation on Depth: Table IV summarizes the quantitative results of depth estimation on the MVSEC dataset [34]. Following the convention in [57], we report the indoor metrics as the average over the three indoor sequences. Although prior works use different strategies, such as additional sensor information, different train-test splits, and different evaluation protocols, we provide exhaustive comparisons across the existing methods to date: a model-based method where the pose information is given (EMVS) [23], a supervised-learning method [61] trained on real data (outdoor_day2, denoted “SL (R)”) or in simulation (“SL (S)”), and two unsupervised-learning methods [27], [57].
The proposed methods achieve overall better accuracy on the indoor sequences and competitive results on the outdoor sequence compared with ECN [57], the closest work to ours. However, ECN uses the 80/20 train-test split within each sequence (i.e., the training data consists of the same sequences as the test data), hence it might suffer from data leakage. For the outdoor sequence, our methods provide better results than the real-world supervised-learning method (“SL (R)”), and competitive results with the other learning-based approaches. We find that outdoor sequences are in general more challenging for the proposed approach. This can be attributed to the facts that (i) the MVSEC outdoor data has considerably sparse events, which affects the convergence of the method, and (ii) events in a scene comprise various displacements with uneven distribution on the image plane. Indeed, the
Qualitative results are shown in Fig. 14. For completeness, we show the flow (i.e., motion field) computed from the estimated depth and ego-motion. The estimated depth resembles the GT for both sequences, resulting in sharp IWEs. Moreover, similarly to the flow estimation (Section IV-B1), the proposed depth covers the pixels where the GT does not exist, such as the middle board in the indoor scene and poles in the outdoor scene. Also, the estimated depth looks reasonable where LiDAR may fail to produce reliable depth maps due to the differences in the sampling frequency (e.g., the left-most board in the indoor results). Overall, the results illustrate that the proposed method is effective in estimating depth for these standard, real-world sequences.
Depth estimation results on indoor_flying3 and outdoor_day1 sequences of the MVSEC dataset [34]. The 2nd and 3rd columns show the estimation and GT, respectively.
Ego-Motion Estimation: Fig. 15 shows ego-motion estimation results on the indoor_flying1 sequence. The estimated linear velocity is scaled using the GT (IMU). The linear velocities resemble the GT, indicating that our method successfully estimates the camera motion of the freely-moving (6-DOF) drone. The pitch/yaw angular velocities are challenging to estimate since the motion field due to the pitch/yaw rotations is similar to that of a translation.
Ego-motion estimation results on the indoor_flying1 sequence from the MVSEC dataset [34].
Quantitative results are reported in Table V. Linear velocity errors are sensible: 20–30 cm/s for the indoor (drone) sequences and 5.9 m/s for the outdoor (car) sequence. Forward motion is more challenging for depth estimation, as the scene contains less parallax than under lateral translational motion, which is also confirmed by our results. Angular velocity errors are small in all sequences, as they do not contain rotation-dominant motions. Few prior works report numerical values for comparison. As discussed in Section IV-D1 and Table IV, ECN [57], which reports a very small error (0.7 m/s), might have overfit to this outdoor sequence. In contrast, our results provide consistently reasonable metrics across all sequences. We hope Table V will encourage more works to benchmark monocular ego-motion estimation on these datasets.
2) Results on ECD
Depth and ego-motion estimation results on the slider_depth sequence from the ECD dataset [63] are shown in Fig. 16. Our method produces a sharp IWE as well as a reasonable depth map, flow, and poses, handling complex objects with occlusions at different distances. The camera pose RMS errors are: 0.11 m/s (in
Depth and Ego-motion estimation for the slider_depth sequence (real data) from the ECD dataset [63]. RMS errors: 0.11 m/s (in
Fig. 17 shows the results on a synthetic sequence from [63]. Since it has ground truth poses and depth, we also report these evaluation metrics, as SiLog[x100]: 1.16, AbsRelDiff: 0.09, logRMSE: 0.11, A1: 0.98, A2: 1.0 and A3: 1.0 for depth, and RMS: 0.30 m/s (in
3) Results on 1 Mpixel Datasets: TUM-VIE and M3ED
Fig. 18 shows the qualitative depth estimation results on the TUM-VIE and M3ED datasets [32], [33]. The estimated depth is realistic, even for the challenging corridor sequence, which contains a large amount of noise and large variations of contour displacement in the scene due to the forward motion. The resulting flows are reasonable and the IWEs are sharp. Since the datasets do not have GT depth, we cannot conduct the quantitative evaluation.
E. Stereo Depth Estimation
As explained in Section III-F2, our method can also tackle the event-based stereo scenario. Fig. 19 shows stereo depth estimation results on the DSEC and MVSEC datasets. By parameterizing the depth and ego-motion on one camera only, the proposed model-based method successfully converges and provides sharp IWEs for both event cameras. We observe that, while IMOs are not explicitly modeled, depth estimation becomes more robust against them in the stereo setting. We leave a detailed analysis, evaluation, and benchmarks for future work.
Ablation and Sensitivity Analysis
A. Effect of the Multi-Reference Focus Loss
The effect of the proposed multi-reference focus loss is shown in Fig. 20. The single-reference focus loss function can easily overfit to the only reference time, pushing all events into a small region of the image at the reference time, whereas the multi-reference loss discourages such event collapse.
Effect of the multi-reference focus loss. Top row: single reference time; bottom row: proposed multi-reference loss.
B. Effect of the Time-Aware Flow
To assess the effect of the proposed time-aware warp (8), we conducted experiments on MVSEC, DSEC and ECD [63] datasets. Accuracy results are already reported in Tables I and II. We now report values of the FWL metric in Table VI. For MVSEC,
Effect of the time-aware flow. Comparison between three flow models: Burgers’, upwind, and no time-aware (4). At occlusions (dartboard in slider_depth [63] and garage door in DSEC [5]), upwind and Burgers’ produce sharper IWEs. Due to the smoothness of the flow conferred by the tile-based approach, some small regions are still blurry.
C. Effect of the Multi-Scale Approach
The effect of the proposed multi-scale approach (Fig. 5) is shown in Fig. 22. This experiment compares the results of using multi-scale approaches (in a coarse-to-fine fashion) versus using a single (finest) scale. With a single scale, the optimizer gets stuck in a local extremum, yielding an irregular flow field (see the optical flow rows), which may produce a blurry IWE (e.g., outdoor_day1 scene). With three scales (finest tile and two downsampled ones), the flow becomes less irregular than with a single scale, but there may be regions with few events where the flow is difficult to estimate. With five scales, the flow becomes smoother and more coherent over the whole image domain, while still producing sharp IWEs.
Effect of the multi-scale approach. For each sequence, the top row shows the estimated flow and the bottom row shows the IWEs.
D. The Choice of Loss Function
Table VII shows the results on the MVSEC benchmark for different loss functions. We compare the gradient-based functions (
Remark: Maximization of (5) does not suffer from the problem mentioned in [30] that affects the average timestamp loss function, namely that the optimal flow warps all events outside the image so as to minimize the loss (undesired global optima, shown in Fig. 23(c)–(d)). If most events were warped outside the image, then (5) would be smaller than its value for the identity warp, which contradicts maximization.
E. The Regularizer Weight
Table VIII shows the sensitivity analysis on the regularizer weight $\lambda$ in (9).
Computational Performance
Each scale of our method has the same computational complexity as CM [7],
Limitations
Like previous unsupervised works [27], [30], our method is based on the brightness constancy assumption. Hence, it struggles to estimate flow from events that are not due to motion, such as those caused by flickering lights. SL and SSL methods may forego this assumption, but they require a high-quality supervisory signal, which is challenging to obtain due to the HDR and high speed of event cameras.
Like other optical flow methods, our approach may suffer from the aperture problem. The flow could still cause event collapse if tiles become too small (higher DOFs), or if the regularization is too small compared with the texture density that drives the data-fidelity term. This effect can be observed in Fig. 1, where the flow becomes irregular for the tree leaves (in the example on row 2). Optical flow is also difficult to estimate in regions with few events, such as homogeneous brightness regions and regions with small apparent motion. Regularization fills in the homogeneous regions, whereas recurrent connections could help with small apparent motion.
The monocular depth and ego-motion estimation approach considers each event packet (i.e., time interval) independently, hence it only recovers camera velocities. Absolute poses could be estimated if the camera velocities were simultaneously recovered over multiple event packets while sharing a common depth map. The stereo approach enables the recovery of the absolute scale.
While the computational effort of the proposed approach is high in our current (unoptimized) implementation, it allowed us to focus on modeling the problem and uncovering the “secrets” of event-based optical flow, i.e., identifying the successful ingredients for accurate motion estimation. Then, we showed how such knowledge could be transferred to learning-based settings, with the same computational cost and speed as prior work (ms inference time on GPUs).
Conclusion
We have extended the CM framework to estimate dense optical flow, depth and ego-motion from events alone. The proposed principled method overcomes problems of overfitting, occlusions, and convergence by sensibly modeling the space-time nature of event data. The comprehensive experiments show that our method achieves the best flow accuracy among all methods in the MVSEC indoor benchmark, and among the unsupervised and model-based methods in the outdoor sequence. It also provides competitive results in the DSEC optical flow benchmark and generalizes to various datasets, including the latest 1 Mpixel ones, delivering the sharpest IWEs. The method exposes the limitations of the current flow benchmarks and produces remarkable results when it is transferred to unsupervised learning settings. We show downstream applications of the estimated flow, such as motion segmentation, intensity reconstruction and event denoising. Finally, the method achieves competitive results in depth and ego-motion estimation in both monocular and stereo settings. As demonstrated, the proposed framework is able to handle a broad set of motion-related tasks across multiple datasets and event camera resolutions, hence we believe it is a cornerstone in event-based vision. We hope our work inspires future model-based and learning-based approaches in these motion-related problems.