Formulating Event-Based Image Reconstruction as a Linear Inverse Problem With Deep Regularization Using Optical Flow

Event cameras are novel bio-inspired sensors that measure per-pixel brightness differences asynchronously. Recovering brightness from events is appealing since the reconstructed images inherit the high dynamic range (HDR) and high-speed properties of events; hence they can be used in many robotic vision applications and to generate slow-motion HDR videos. However, state-of-the-art methods tackle this problem by training an event-to-image Recurrent Neural Network (RNN), which lacks explainability and is difficult to tune. In this work we show, for the first time, how tackling the combined problem of motion and brightness estimation leads us to formulate event-based image reconstruction as a linear inverse problem that can be solved without training an image reconstruction RNN. Instead, classical and learning-based regularizers are used to solve the problem and remove artifacts from the reconstructed images. The experiments show that the proposed approach generates images with visual quality on par with state-of-the-art methods despite only using data from a short time interval. State-of-the-art results are achieved using an image denoising Convolutional Neural Network (CNN) as the regularization function. The proposed regularized formulation and solvers have a unifying character because they can be applied also to reconstruct brightness from the second derivative. Additionally, the formulation is attractive because it can be naturally combined with super-resolution, motion-segmentation and color demosaicing. Code is available at https://github.com/tub-rip/event_based_image_rec_inverse_problem


INTRODUCTION
Event cameras are novel bio-inspired sensors that offer advantages over traditional frame-based cameras (high speed, high dynamic range (HDR), low power, etc.) [1], [2].However they acquire visual data in the form of asynchronous per-pixel brightness changes, called "events", instead of standard brightness images.We tackle the problem of image reconstruction from event-based data, i.e., recovering the brightness signal that caused the events.It is an interesting problem in itself and finds multiple applications: (i) it enables the creation of high-speed and/or HDR videos, (ii) if the reconstructed images are of sufficient quality they can be used as input to a large body of algorithms developed for traditional cameras.Hence, image reconstruction makes event data compatible with mainstream computer vision.Reconstructed images have applications from autonomous driving to smartphone applications for everyday use [3].
Image reconstruction is challenging because events are an unfamiliar representation of visual data, they depend on motion and exhibit a considerable amount of noise and non-ideal effects (caused by pixel fabrication mismatch, the amount of incident light, sub-threshold transistor operation, etc.).Recently, notable progress has been made.Early meth-

Optical flow
Events to voxel grid Fig. 1: Overview of the proposed method (bottom) in comparison with state-of-the-art event-to-image recurrent neural networks (RNNs) (top).End-to-end image reconstruction methods can be replaced by an explainable system that recovers both optical flow and image brightness.We show that, given optical flow, image reconstruction is a linear problem in the unknown brightness, and exploit this knowledge to estimate brightness by means of several classical and recent learning-based solvers.
ods performed image reconstruction as a means to aid other tasks, such as ego-motion estimation [4].They were modelbased, proposing a leaky integrator or filter that would perform some form of spatio-temporal smoothing and denoising [5], [6].Nowadays, due to the outstanding capacity of artificial neural networks (ANNs) for pixel regression, state-of-the-art methods are deep-learning based [7], [8].
They leverage large amounts of data in a supervised or unsupervised manner (e.g., via generative adversarial networks (GANs)) to recover surprisingly high-quality image brightness from a voxelized representation of the event data, at a significantly higher computational cost compared to model-based methods (Fig. 1, top).However, these approaches suffer from several problems: (i) the resulting event-to-image ANNs are black boxes, relying on the networks to learn how to recover images from events; (ii) the ANNs suffer from artifacts and there is no easy way to tune them except for training with more data variation; (iii) the approaches focus only on recovering image brightness, but motion is as equally important to recover and both variables are firmly interconnected in the event data; (iv) events are sparse but they are converted into a voxelized representation for compatibility with ANNs, which causes a filling-in effect that requires considerable memory while most of the voxels may be empty; (v) some methods use a first-order event generation model, which inevitably introduces linearization errors.
Our work provides a new point of view on image reconstruction from event data.If one considers brightness and motion (i.e., optical flow) as equally important entangled variables in the event stream, the goal should be to estimate both (not just one of them) and to leverage existing asymmetries in the process.Specifically, show that, given optical flow, brightness estimation is a linear problem.Therefore, (i) the difficult subproblem of the two is that of estimating accurate optical flow 1 , (ii) easier and more interpretable (explainable) methods than current ones shall be used for image reconstruction (Fig. 1, bottom).In particular, we leverage the fact that linear problems are extensively studied in science, and the corresponding solvers keep improving, benefiting from a growing body of work that exploits large image datasets.This naturally leads to lower dimensional, efficient intermediate representations.Additionally, the proposed image reconstruction method does not need ground truth event, image data pairs, which are required in stateof-the-art supervised learning approaches.
We evaluate the method on standard datasets and the experiments demonstrate that image reconstruction with the proposed technique (linear solver plus deep image priors) produces compelling results, on par with the state of the art in terms of quality, but without using any ground truth image and with notably better interpretability and flexibility.
In summary, our main contributions are: • A novel formulation of the problem of image reconstruction from events, aided by optical flow, in the form of a linear system of equations directly in the unknown brightness image (Sec.3.2).• We propose a variety of solvers and regularizers, including recent deep denoisers, to solve the above problem, using efficient input event representations and without using ground truth labels (Sec.3.4).• Explainability: Our method combines physics (the event generation model used to derive the system of equations) and machine learning (for artifact removal through image denoising) (Secs.3.2 to 3.4).
1.This admits analogies in other domains of computer vision, such as camera self-calibration [9]: given the plane at infinity, recovering the intrinsic parameters of the camera is also a linear problem.Hence, finding the plane at infinity is the difficult (sub)task.• We also show how our method can be combined with motion segmentation (e.g., to recover images of independently moving objects in the scene) and with color event cameras (to reconstruct color images), (Sec.3.7).• A thorough evaluation on standard datasets and comparing with three state-of-the-art event-to-image RNN methods (Sec.4).

IWE
The proposed approach is appealing not only because of the novel point of view (combined estimation of entangled variables while exploiting asymmetries) but also because: (i) it opens a new route to reconstruct brightness based on the first directional derivative (Fig. 2), (ii) its techniques can also be applied to the approach of reconstruction based on second derivatives of the brightness, and (iii) it can be combined with a variety of topics (super-resolution, motion segmentation, demosaicing) in a natural way due to its linear inverse problem formulation.

RELATED WORK
Image brightness reconstruction from events.Event cameras such as the DVS [10], [11], [12], [13] are bio-inspired sensors that capture pixelwise brightness changes, called events, instead of brightness images.An event e k .= (x k , t k , p k ) is triggered as soon as the logarithmic brightness L at a pixel exceeds a preset contrast sensitivity C > 0, where x k .= (x k , y k ) , t k (with µs resolution) and polarity p k ∈ {+1, −1} are the spatio-temporal coordinates and sign of the brightness change, respectively, and ∆t k is the time elapsed since the last event at the same pixel x k .
Early works.Cook et al. [15] defined a network where messages with local update rules were passed to estimate several visual quantities simultaneously.Image brightness was computed based on its relation to the spatial gradient, which is connected to other network quantities.Kim et al. [4], [16] used parallel Bayesian filters to simultaneously estimate camera motion and image gradients.Image brightness was recovered via Poisson integration and passed to the camera tracker.In [18] image brightness was obtained as the supporting variable of a costly, space-time variational functional optimization problem whose primary goal was optical flow estimation.The function comprised data-fidelity and regularization terms for the flow and the brightness, and the solver proceeded in an alternating fashion via a parallelized primal-dual algorithm [24].Barua et al. [22] used dictionary learning on space-time patches, learning atoms via K-SVD [25].Events were used to estimate the spatial brightness gradient via patch-wise sparse coding, and subsequently the Laplacian was computed and Poissonintegrated.
Many of the above methods require knowledge of the camera motion (pure rotation [4], [15]), and/or depth [16], [17].Only recently this requirement has been relaxed [5], [6], [18], [22], achieving brightness reconstruction via temporal filtering [6] or temporal integration with manifold denoising [5].These model-based approaches suffer from artifacts such as ghosting effects and bleeding edges.Compared to [18], our method assumes optical flow is obtained first, for example via contrast maximization [26], [27] or E-RAFT [28], and then it recovers brightness.The goal is to highlight and exploit the asymmetries in the combined problem of recovering flow and brightness, which are entangled variables in the event stream.
Deep learning event-to-image methods.State-of-theart methods outperform the above prior methods and operate by letting ANNs learn the mapping from events to brightness images.Events are typically converted into a dense voxelgrid representation and fed to modern ANNs.High quality results are obtained using supervised learning on recurrent neural networks (RNNs) trained on large amounts of synthetic data [7], [19], [20].A jump in image reconstruction quality was demonstrated by E2VID [7] and its improved trained version [20].Unsupervised generative adversarial network (GAN) [8], [21], [29] and domain adaptation [30] approaches have been also investigated, but results are not as good.Recently, a self-supervised learning approach based on a linearized version of the event generation model (1) has been proposed in [31] to try to leverage large amounts of unlabeled event data for image reconstruction.It consists of two ANNs, one for optical flow estimation [32] and another one for image reconstruction, and the latter is trained using brightness constancy as a loss.
The above event-to-image deep learning methods are to a large extent black boxes, with a lot of effort spent not only on loss design and architecture search but also on dataset preparation [20], [29], [33] to train the ANNs and make them learn the desired transformation.Different from these methods, our approach does not train any event-to-image ANN.We pose the problem as image reconstruction aided by optical flow [31] and show that (1) leads to a linear system of equations.Several explainable methods are used to solve such a system.To achieve best results, we reutilize state-ofthe-art image denoising ANNs, without retraining.

Intuitive 1D Example
To convey the main idea of our method let us consider the simple case of 1D motion (horizontally, along the x axis).For further simplicity, consider that the scene is planar and front-to-parallel with respect to the camera.Hence, all pixels on the image plane move with the same velocity u.
Figure 3 summarizes the main idea and steps of the method.Consider the set of events E .= {e k } Ne k=1 in a volume of the image plane (Fig. 3a).Since events are temporal brightness changes, pixel-wise accumulation of the event polarities gives the brightness increment image in Fig. 3b.If the accumulation interval ∆t = [t 1 , t Ne ] is large, then this image is blurred.Instead, assuming the optical flow is known we may motion-compensate the events, E → E , by transporting them to new pixel locations (horizontally, in this case), where they will be aligned and produce a sharp Image of Warped Events (IWE) (Fig. 3c).This sharp image represents the strength of the moving edges in the scene [26] and is entangled with the motion u.The IWE resembles the x-derivative of the ground truth brightness frame.Such a spatial derivative is obtained from the brightness image using a finite difference operator D x .
In the simple case considered, we have: where ) is a 2-point finite difference approximation to the x-derivative over the pixel grid.More elaborate operators, like Sobel's, are applicable.The key ideas to notice are that each pixel of the IWE provides one equation (3) to recover the unknown brightness L, and that equations are linear in L. Stacking (3) for all pixels, produces a linear system of equations D x = vec(IWE) in the unknown .= vec(L).Hence, we reformulate the problem of image reconstruction as that of solving a linear system  The events E in (a) are motion-compensated using optical flow to create the image of warped events (IWE) (c).The sharp IWE approximates the x-derivative of the ground truth frame better than the uncompensated image (b).Our method reconstructs brightness image (d) from the IWE (c) by solving a linear system of equations with regularization (CNN-based in this example).Texture details smaller than the contrast sensitivity C in (1) cannot be recovered since they do not trigger events.The event data in the figure is from the slider far sequence in [14], consisting of rocks and tree-like textures.
of equations.In the example, it is the problem of finding L whose x-derivative is the IWE.Thus we may study how well-posed the problem is, and test the large collection of linear solvers (GMRES [34], BiCGStab [35], etc.) to estimate L. If there are measurement errors in the equations, we may reformulate the problem as a least-squares one: Figure 3d shows an example of a reconstructed brightness image L by solving the above system of equations with an additional regularization term.This is discussed next.

Generalization to arbitrary Motions and Scenes
In the general case of arbitrary scene and motion we still follow the reasoning above, but with minor modifications.Events E → E are warped (2) according to an image velocity field that may change at each pixel location, u(x).
The resulting IWE is still sharp [26], but to account for different optical flow magnitudes u(x) > 0 (e.g., objects at different depths trigger a different number of events), we divide by the magnitude of the motion.Warped events x k have floating point precision.The delta δ in ( 5) is an idealized model amenable for integration.In practice it is replaced by a kernel, such as a Gaussian δ(x) ≈ N (x; 0, 2 Id), where ≈ 1 pixel and Id is the identity matrix [26], [36].For speed-up, it is implemented using bilinear voting and Gaussian convolution [37].This produces the normalized IWE (NIWE): If u(x) = 0, no events are triggered at pixel location x and we set ∆L (x) = 0. Assuming brightness constancy, the NIWE corresponds to the edge strength of the scene, i.e., instantaneous rate of brightness change in the optical flow direction at pixel x: where ∇L = (∂ x L, ∂ y L) is the spatial gradient and û = u/ u is the unit vector in the direction of u.The 1D example in Sec.3.1 corresponds to the case û(x) = −(1, 0) , so that −∇L • û(x) = ∂ x L, and just like ∂ x L was discretized using a finite difference formula with a 1 pixel step over the pixel grid, we may also discretize ∇L • û(x), leading to: where L(x + û(x)) can be computed by interpolation (e.g., bilinear interpolation).Stacking all equations (8), with the NIWE b .= ∆L (x) computed from the input data (6), produces the anticipated linear system of equations where D is a compact notation for the (directional derivative approximation given by the) right hand side of (8).
Notation: we use D to represent the operator acting on L or on its stacked version .If L ∈ R m×n , ∈ R mn , and D in D is a sparse matrix of size mn × mn, which can be constructed given the pixel-wise optical flow û(x).

The Operator in the Linear System
One main challenge in solving the linear system ( 9) is that the operator D is not full rank.Hence, besides choosing a numerically good discretization for D, additional equations are needed to ensure that the problem is well posed.
Intuitively, in the 1D example of Sec.3.1 we look for the image L whose x-derivative agrees with the data (the NIWE), and we know that inverting a derivative may be illposed.If D is discretized using the 2-point finite difference formula, DL(x, y) ≈ L(x + 1, y) − L(x, y), each pixel of L is only strongly connected to one of its side neighbors, hence information in the linear solvers only propagates along the rows of L. This causes streamline-like artifacts in the reconstructed image (Fig. 4a).
A simple way to mitigate these artifacts is to choose a better numerical discretization of D that encourages crosstalking (i.e., coupling) among more pixel neighbors.Figure 4b shows the effect of replacing the 2-point finite difference formula with Sobel's 9-point formula, which couples each pixel to its 8 neighbors (matrix D is sparse and has at most 9 non-zero diagonals), thus encouraging smoothness of the solution in the direction perpendicular to the flow.
More advanced regularization techniques exist in the literature, such as Tikhonov regularization, total variation (TV), Beltrami, etc., as we discuss in the next section.
Remark: At border pixels, some optical flow vectors may point out of the image boundary.To tackle this issue, we set the optical flow and NIWE at border pixels to zero so that both sides of (8) automatically agree.This operation, however, does not improve the rank-deficiency of matrix D.

Probabilistic Formulation and Priors
If we assume the measurement vector b is corrupted by zero-mean additive white Gaussian noise n with variance σ 2 , the observation process may be written as The Maximum A Posteriori (MAP) estimate of the reconstructed brightness image is given by ˆ = arg max log p(b | ) + log p( ), where log p( ) represents the prior on .The estimation problem can be rewritten as where the first term is the data fidelity term, and the second term is the regularization term.λ ≥ 0 is introduced to control the degree of regularization on the reconstructed image.
The data term makes the estimation satisfy the observation process (i.e., the events), while the regularization term models structural information about the desired solution.
Essentially we formulate the problem in a form that opens the door for the application of many regularizers, leveraging years of development in computer vision.Moreover, we can benefit from learning-based regularizers (i.e., image priors).Priors.Generally, the regularizer in ( 10) can be classified into two categories: model-based priors and learningbased priors.In model-based regularization, the image is assumed to be smooth.Tikhonov regularization [38] penalizes the squared norm of the gradient of the solution, i.e., R( ) = ∇ 2 , where the symbol ∇ represents the spatial gradient of L. Total variation (TV) regularization [39] penalizes the norm, R( ) = ∇ .Alternatively, the Beltrami regularization [40], which interpolates between the TV and Tikhonov, can be used as regularizer.The results of some of these regularizers are shown in Fig. 4.
Among the learning-based priors, convolutional neural network (CNN) image denoisers are a popular choice in recent research.To use a CNN image denoiser in the process of solving (10), we adopt the half quadratic splitting method (HQS) proposed in [41], [42].To this end, we introduce an auxiliary variable z that allows us to split the problem into two decoupled ones, and we couple them by increasing a penalty term linking z to as the iterations proceed.We replace (10) with as µ → ∞.The problem is iteratively solved, alternating: where the data fidelity and regularization terms are decoupled into two separate optimization problems.Notably, (12a) admits a closed-form solution: From a Bayesian estimation point of view, (12b) is an image denoising problem with additive Gaussian noise (of standard deviation λ/µ).Theoretically, any Gaussian denoiser can be applied to (12b) to obtain a solution: Since CNN image denoisers are shown to have a superior performance than other image denoisers, such as BM3D [43], we adopt CNN image denoisers to solve (14).Hence, the algorithm is not strictly linear in its competitive form.The effect of the alternating solver is that (12a) enforces the data fidelity equations while the denoiser (12b) pulls the solution towards the space of natural-looking images.

Image Reconstruction and Super-resolution
The proposed image reconstruction method can be naturally extended to handle super-resolution.This is a consequence of the fact that the linear system (9) holds at any resolution.To achieve super-resolution we just need to provide the input NIWE b and the operator D at the desired resolution.This completely departs from end-to-end super-resolution image reconstruction approaches, which use supervised learning with four residual networks to realize a large encoder-decoder super-resolving network [44], or a 3-phase GAN with tailored and complex training datasets [45].The operator D can be computed by optical flow interpolation.The NIWE is computed by warping events (2), (5), which is done in floating point precision [37].With enough warped events it is possible to provide a suitable NIWE at moderate upscaling factors, e.g., 2×-4×.

Image Reconstruction from its Laplacian
An alternative approach to reconstruct brightness from events is based on Poisson reconstruction [4], [17], [22], [23].In the first step, events are used to estimate the Laplacian image, E → ∇ 2 L (Fig. 2), e.g., by means of an ANN [23].In the second step, the Laplacian image is fed to a Poisson PDE solver to recover image brightness.The key idea is that the first step is local whereas the second step is global.That is, events only need to affect nearby pixels to create the edgemap-like image ∇ 2 L, which can be achieved with an ANN (few layers, small receptive fields).The global step integrates the edgemap, filling in the brightness in regions with no events.
The relation between an image and its Laplacian can be written as a convolution ∇ 2 L(x) = κ(x) * L(x), where κ is the Laplacian kernel.In vectorized form, it is c = k ⊗ , where c .= vec(∇ 2 L), k denotes the Laplacian kernel and ⊗ denotes convolution.It is key to notice that this is, like (9), a linear system of equations in .Hence, one may combine fast Poisson solvers with image priors to recover more perceptually appealing images.The optimization problem to recover from the Laplacian c becomes: This problem can be solved by the HQS method, with similar steps as those in Sec.3.4.Equation ( 14) still applies, and if we assume the reconstructed image has periodic boundary conditions, the subproblem in the data fidelity term has closed-form solution (the analogue of ( 13)): where F(•) represents the Discrete Fourier Transform (DFT), F −1 (•) is the inverse DFT, and F(•) denotes the complex conjugate of F(•).Without periodic boundary conditions, one may replace the DFT with the Discrete Cosine Transform (DCT) or the Discrete Sine Transform (DST) based on different symmetry assumptions.In summary, estimation problems (10) and ( 15) share many similarities: both arise from linear systems of equations on the derivatives of the brightness, encoded by events, and both can be solved using classical techniques (TV regularizer, etc.) as well as more modern ones (image priors), without event-to-image ANNs.There are, of course, differences, since (10) is based on the first derivative and ( 15) is based on the second derivative: (i) the Laplacian c is a motion-invariant representation, whereas the NIWE is not (it is motion dependent); hence (ii) the Laplacian operator k⊗ in ( 15) is isotropic, whereas D in (10) is not.

Extensions
The proposed method is versatile: it can not only handle super-resolution, but also be combined with motion segmentation approaches and used to recover color images.

Motion Segmentation
Our method can be combined with event-based motion segmentation methods [46], [47].Such methods simultaneously classify the input events E into N c clusters corresponding to different moving parts of the scene and estimate the motion of the clusters: Each cluster has an associated IWE, whose contrast is maximized by the events E j that belong to the cluster.The motion parameters θ j provide accurate information about optical flow.Hence, the IWEs and motion parameters obtained via motion segmentation can be used as input to our method.As a result, the brightness of each moving part of the scene is recovered.Event-based motion segmentation with subsequent image reconstruction has the effect of splitting a reconstructed image into its moving parts.

Color Image Reconstruction
Color image reconstruction from events is possible if events are produced by a color event camera such as the color-DAVIS346 [48], which has a Bayer filter mosaic that makes each pixel sensitive to red, green or blue light.A simple strategy to perform demosaicing from the RGB color channels is to reconstruct each channel separately at their undersampled resolution and then upsample to the original resolution using bicubic interpolation [49].However, since our method can naturally handle super-resolution, we perform demosaicing by combining the reconstruction and interpolation steps, as follows: (i) reconstruct and superresolve (by a factor of 2×) each color channel (R,G,B) from their corresponding events (Sec.3.5), (ii) concatenate the resulting (R,G,B) channels to generate a full-size color image.Color may be useful in applications such as object recognition and microscopy imaging.

EXPERIMENTS
This section presents the evaluation of our proposal.Mimicking prior work, a large portion of the experiments (Sec.4.1) focus on assessing the quality of the reconstructed images on standard datasets and comparing to competing baselines in the state of the art.Section 4.2 shows super-resolution experiments.Section 4.3 demonstrates image reconstruction via the brightness Laplacian.Section 4.4 analyzes: the evolution of the three solvers (CNN, TV, Tikhonov) as iterations proceed, the effect of varying the amount of regularization, and a serendipitous application to monitoring.Section 4.5 shows the capabilities of the IWE method to be combined with motion segmentation and color events.Section 4.6 explores the combination of the IWE method with recent dense optical flow estimation methods and evaluates the influence of erroneous optical flow.Finally, Sec.4.7 reports computational performance and Sec.4.8 points outs the limitations of the method.

Datasets, Metrics and Baselines
We evaluate the performance of our approaches on sequences from standard datasets [14], [50].We compare with state-of-the-art image reconstruction methods [7], [20], [31].E2VID [7] and ECNN [20] are event-to-image RNNs, trained in a supervised manner, with ground truth images.ECNN is an improved version of [7] trained on augmented data to reduce the sim-to-real gap.BTEB [31] comprises both an optical flow estimation CNN [32] and an image reconstruction RNN (same architecture as [7]), where the latter is trained in a self-supervised manner, without ground-truth images.
As is standard, we reconstruct images at the timestamps of the ground truth images (e.g., DAVIS frames) and compare in terms of mean square error (MSE), structural similarity (SSIM) [51] and perceptual similarity (LPIPS) [52].The sequences from [14] have 60s duration and an increasing motion speed.Following the sequence cuts proposed by [20], we use the timestamps of the 300 frames in [5,20]s before motion blur starts to corrupt the ground truth.For the slider sequences we reconstructed images at all > 135 timestamps of the frames because there is little motion blur.
We use the last 20k-50k events per IWE, depending on the amount of texture in the scene.Motion was estimated by means of contrast maximization [26], [53] on the same set of events, from which optical flow was computed and fed to our method.The range of values of the regularizer weight is: λ ∈ [0.03, 0.05] for Tikh.and TV, and λ ∈ [0.19, 0.55] (see [41]) for the image prior denoiser.

Results on the Event Camera Dataset
Figure 5 shows qualitative results on the Event Camera Dataset [14].As observed, our method (12) produces results on par with the state of the art.E2VID and ECNN produce high quality results, partly due to the large capacity of these RNNs to predict pixel intensities and because they are supervised methods.BTEB suffers from motion blur and "ghosting" artifacts (most noticeably in textureless regions) [31].Our method also suffers from some visual artifacts, but they are less pronounced than those of BTEB.The reconstructed images have a clean, smooth appearance, which is conferred by the image denoiser within the solver (12b).Our method fills with a smooth clean look the textureless regions (we also observe this in the different variants presented in Fig. 4), whereas BTEB is reported to have limited extrapolation capacity of edge information.This is specially noticeable in the shapes and dynamic sequences.However, our regularizer tends to smooth fine-grained details (e.g., within-rock texture) as they are considered noise.
The last two rows of Fig. 5 show HDR results.In these sequences comparison is only qualitatively since ground truth images are not HDR.Our method is able to reconstruct high quality HDR images, on par with learning-based methods, TABLE 1: Quantitative evaluation of our method and the state of the art on sequences from [14].We report median values (since they are more robust to outliers than the mean) of MSE, SSIM and LPIPS quality metrics over all reconstructed images.Images are equalized before computing the metrics.and showing a smooth look, which can be controlled via the regularizer weight.
Quantitative results using the above metrics for six different methods are provided in Tab. 1.The choice of evaluation metric and protocol highly influence the numbers.For example, our image prior method produces the best results in terms of SSIM, whereas E2VID and ECNN are better in other metrics.Despite not using ground truth images and not having recurrent connections to past events beyond the IWE, our three methods (Tikh., TV and CNN) produce results in line with the state of the art in reconstruction quality.Figure 6  LPIPS values for the six methods in Tab. 1.These violin plots are more informative than the median numbers in Tab. 1. Surprisingly, solutions as simple as classical Tikhonov or TV regularizers acting on a linear system of equations produce already similar quality results as complex learningbased approaches.In scenes like dynamic and shapes, the proposed CNN regularizer provides better SSIM and LPIPS distributions than ECNN.The distributions of the proposed regularizers (Tikh.-TV-CNN)follow monotonic trends in MSE and SSIM, but the trend is not clear in terms of LPIPS.

Analysis without Histogram Equalization
Histogram equalization makes the results of the comparison metrics more invariant to changes in average image intensity.For example, the SSIM metric comprises two parts: one that depends on the means and one that depends on the variances.Without histogram equalization, both terms contribute to the SSIM.However, with histogram equalization, the effect of the term that depends on the means is greatly reduced, and therefore the term that depends on the variances dominates the metric.
Table 2 and Fig. 7 are analogous to Tab. 1 and Fig. 6, but without histogram equalization.The changes in average intensity are noticeable in the MSE metric.ECNN tends to produce dark images, like those of the DAVIS camera in [14], which often only span a 7-bit range of intensities (from 0 to 127).Hence, without histogram equalization ECNN tends to outperform other methods on these sequences just by producing darker images.With histogram equalization, E2VID and ECNN are best in terms of MSE (Tab.1), but without it we see that ECNN is the top one and our image prior approach (CNN) is consistently the second best (Tab.2).
Histogram equalization is a non-linear transformation unrelated to the event generation model (1).Hence, it is reasonable that our methods (Tikhonov, TV and CNN image prior), which aim at solving a system of equations based on (1), produce better values on metrics without image equalization than on metrics with it.The established evaluation protocol [7], [20], [31] is somehow contradictory: estimators are designed to solve some equations (i.e., based on (1)) and then their goodness of fit is measured using a different set of equations, which are non-linearly related.
In summary, the fact that the quantitative results highly depend on the metric, evaluation protocol (e.g., equalized vs. not equalized) and scene texture calls for prudence when interpreting the numbers as the main means to finely rank the performance of image reconstructions methods.

Results on the N-Caltech 101 Dataset
Figure 8 reports results on the N-Caltech 101 dataset [50].
It is complemented by Figs.25-26 and Tabs.4-5 in the supplementary material.We aligned the images with respect to the ground truth using Enhanced Correlation Coefficient maximization [54] to be able to report quantitative results on the usual metrics (MSE, SSIM, LPIPS).This dataset comprises sequences of 300 ms, with three saccades of 100 ms each.In the tests, we leave the first saccade as initialization for RNN baseline methods [7], [20], [31], and compare the results in the middle of the second saccade.In our method, we only use 60 ms of data from the middle saccade.Quantitatively the total variation (TV) regularizer produces very good results in all three metrics using unequalized images (Tab.5), with the CNN image prior being a competitive option in terms of MSE and SSIM, and E2VID in terms of LPIPS.The performance of E2VID and ECNN improves if equalized images are used in the evaluation protocol.Qualitatively, we observe in Figs. 8, 25 and 26 compelling high-quality reconstructed images by our TV and image-prior methods compared to the state of the art.

Super-resolution
Super-resolution is achieved by converting part of the high temporal resolution of the event camera into spatial resolution.Figure 9 shows results on simultaneous brightness reconstruction and super-resolution (Sec.3.5) on sample sequences from [14], [50].Input resolutions (1×) are 240 × 180 pixels for [14] (DAVIS camera) and 304 × 240 pixels for [50] (ATIS camera).The super-resolved images have 2× and 4× more pixels in each dimension (i.e., ≈ 1Mpixel).To build the IWE at resolutions 2×, 4× we slightly increased the number of events (less than doubled).There is a trade-off: using more events improves the SNR and fills in the superresolved pixels of the IWE, but it increases the time spanned, and motion compensation may not be effective assuming constant optical flow [53].In the images, we observe that increasing from 1× to 2× makes a big difference: at 2× one can read the number "48" on the fin of the plane.In the boxes scene, the checkerboard and the texture of the boxes become sharper at 2×.From 2× to 4× the improvement is not as striking: the checkerboard and the contours of the folded foam mattress behind it become sharper, but we cannot read the text below the plane's fin.This may be difficult given the input spatial resolution.
Additionally, we compare with state-of-the-art method ESRI [44] on the same data.The results for the best ESRI 2× profile (7 bins) are shown in Fig. 9. Our method produces overall better-looking images and can handle various super-TABLE 3: Image reconstruction from the Laplacian, using [23] vs. our method (15).

Image Reconstruction via the Laplacian
Here we demonstrate the method of Sec.3.6.First we use the ANN in [23] to predict the Laplacian image c from the events.Subsequently this image is passed to our solver (15).Table 3 and Fig. 10 report results, where we also add Gaussian noise to the Laplacian.The Laplacian is given in the range [−0.6, 0.6], and the noise has standard deviation σ = 0.02.Given the limited spatial input resolution of c provided by the ANN (120 × 90 pixels), the differences between [23] and our method are most noticeable in the way they deal with noise.Our method yields higher contrast, sharper edges and smoother homogeneous regions due to the CNN denoiser.Regarding Tab. 3, the differences are not as large and depend on the metric.Nevertheless, our method is clearly better in terms of SSIM.

Evolution of the Solution with the Iterations
Figure 11 shows the inner workings of the iterative solver (12) on sample sequences.We run the image-prior solver (12) (CNN) for 16 iterations.The figure depicts the evolution of the two variables k , z k as the iterations proceed.They start far apart, but converge to each other with the iterations.The auxiliary variable z k starts as a very smooth and denoised version of the solution, and it achieves higher level of detail (less smoothing) with the iterations.The brightness image k starts with very fine details and also artifacts in the direction of the optical flow (e.g., Fig. 4a), and some of the details get lost along with the artifacts as it is denoised.
In the alternating process, the iterative solver (12a) enforces the data fidelity equations while the denoiser (12b) pulls the solution towards satisfying structural constraints (i.e., in the space of natural-looking images).Figures 12 and 13 show the evolution of cost function (energy) and brightness when TV and Tikhonov regularization are used to guide the solution of the linear system.The objective value is measured in squared units of the IWE (5).We use event count as IWE units, thus discarding the contrast sensitivity C (which can be regarded as an unknown scaling factor).The split Bregman method [55] is used to solve the cost function in the case of the TV regularization, while a sparse least square solver [56], [57] is used to solve the cost function with Tikhonov regularization.The optimization starts from zero brightness (b).The algorithms converge in few iterations and the majority of scene structure comes into view.The regularizer (prior term) hinders the total energy from dropping sharply along with the data fidelity term and thus guides the solution.The refinement of brightness is less pronounced quantitatively as iterations proceed.The evolution in Fig. 13 resembles a heat diffusion process; the integration effect, which decreases the cost of the data term, happens from the edges to the homogeneous regions.The regularization factor λ (10) is set to 0.1 for TV and 0.3 for Tikhonov to create the plots.

Short-time Image Reconstruction for Monitoring
There are scenarios where short-time reconstruction (i.e., image reconstruction, as opposed to video reconstruction) can be particularly useful, such as surveillance or monitoring.Figure 14 shows an example of animal behavior monitoring.In this scenario, the event camera stays almost stationary and provides data bandwidth savings for long term and power-constrained monitoring.Small vibrations of the camera (unintended or triggered) are useful to reconstruct the brightness of the scene sparingly in time.This is a difficult scenario for RNN methods (Fig. 14a), which require an initialization phase with continuous motion to build up the internal state and fill in homogeneous brightness regions, where no events are triggered.Our method (Fig. 14b) is able to fill in homogeneous regions by decreasing the datafidelity cost, and therefore produces good results.

Effect of Varying the Regularizer Weight
The main hyperparameters of the method are N e and the regularizer weight λ.N e is set depending on the scene texture, as mentioned in Sec.4.1.1.The effect of varying the regularizer weight λ is shown in Fig. 15, which reports a trade-off between artifact removal, detail preservation and oversmoothing.A small λ (Fig. 15a) preserves details but does not remove the artifacts in the flow streamlines that arise from the rank deficiency of the operator D. A large λ (Fig. 15c) produces an over-smooth reconstructed image, where only strong edges are reconstructed, albeit in a non-linear way in case of the CNN regularizer due to its anisotropic nature.An intermediate λ (Fig. 15b) provides a good compromise among the two above extreme cases.
Fig. 11: Evolution of the solution ( k ,z k ) in the alternating, iterative solver ( 12) with CNN regularizer.We start 0 from the solution provided by the total variation (TV).As iterations proceed, z k is refined, capturing more fine details, becoming less smooth.The first column contains animations if opened (and clicked on) with Acrobat Reader, with the iteration counter in red.

Motion Segmentation and Image Reconstruction
Figure 16 provides qualitative results on the combination of our method and motion segmentation.The scene consists of a moving car on a street while the camera is panning [46].Figure 16a shows the input events in space-time, which are clustered in two by running [47] (Fig. 16b): events {E 1 , E 2 } and their motion parameters {θ 1 , θ 2 }.The majority of events (E 1 ) are due to the apparent motion of the static parts of the scene (e.g., buildings) as the camera moves.The car, segmented in a separate cluster as an independent moving object with respect to the background, produces fewer events (E 2 ).Our image reconstruction method can be applied to each cluster independently, i.e., to the IWE produced by each cluster (Fig. 16c).As Fig. 16d shows, the method is able to recover the brightness of the foreground and the background separately, as if it had split a grayscale image into its independently moving components.

Color Image Reconstruction
Figure 17 shows that our method can be used to reconstruct color images using data from a color event camera [49], [58].The reconstruction quality may not be as good as that of the recurrent neural network [7], which is a high-capacity supervised learning method that exploits the history of all past events and handles color in a different color space (CIELAB color, by combining full and low-resolution reconstructions for the luminance and chroma channels).Nevertheless, the reconstructed images exhibit HDR characteristics (e.g., one can see through the overexposed windows) and suffer from less spike color spots than the RNN method.While color is not the target of this work, we think it is interesting to highlight the connection: color, super-resolution and reconstruction can be easily combined in our method.

Combination with State-of-the-art Dense Optical Flow Methods
Our IWE method is based on the assumption of known optical flow, like [31].This allowed us to focus on reformulating the image reconstruction part as a linear inverse problem and propose efficient, explainable solvers.The experiments so far used the flow estimated by methods designed for low degrees-of-freedom (DOF) motions [26], [60], [61], [62], which are often a good approximation and produce very accurate results if the true motion satisfies their assumptions [53], [63].More complex scenes may require dense optical flow estimation, as given by [27], [28], [31] (i.e., higher DOFs).
Figure 18 shows the result of our IWE method when combined with two state-of-the-art event-based dense op-  tical flow methods: E-RAFT [28] (top row) and Multireference Contrast Maximization [27] (MCM, bottom row).E-RAFT is a supervised deep learning method based on RAFT [64], and it uses the events in a time window of 100 ms.MCM is a model-based method that uses a fixed number of events (500k in the example).The data comes from the DSEC dataset [59], recorded with a VGA resolution Prophesee Gen3 CD event camera (which does not have grayscale output for comparison).The results are qualitatively similar (both IWEs in Fig. 18 are reasonably sharp and the reconstructed images have many of the natural details of the city scene), showing that our method can be easily combined with modern dense optical flow estimation methods.Looking into the differences, E-RAFT receives more events than MCM and is trained on DSEC data, so its optical flow produces an IWE with higher SNR in stationary parts of the scene, yielding slightly better reconstructions than the flow by MCM.However E-RAFT does not produce accurate flow for independently moving objects (IMOs) such as the pedestrian and the bicycle (since it is trained using the motion field induced by the vehicle), hence the reconstructed IMOs are more blurred than those by MCM.
In short, image reconstruction results are very good if the motion is accurately provided, as shown in low-DOF scenarios [26], [60], [61], which allowed us to explore the possibilities of the framework (Secs.4.1-4.5).The method delivers good results, i.e., degrades gracefully, if combined with modern dense optical flow methods (high-DOF scenar-

Effect of Erroneous Optical Flow
We further analyze the effect that estimated optical flow quality has on image reconstruction.To this end, we use data from [14], with high quality flow ("ground truth") given by CMax [37], and corrupt the optical flow with increasing values of uniform white noise in [−b, b], with b = 1, . . ., 5 pixels (in each coordinate independently).Then, we assess how the quality of the reconstructed image degrades as a function of the noise power.Figure 19 quantitatively summarizes the results (the experiment was repeated 10 times for each noise level and mean values of image quality metrics are reported), while Fig. 20 provides sample reconstructed images (see also suppl.material).
The results look clean up to ±1 pixel additional noise (for a 240 × 180 pixel image).Afterwards, visual image quality may degrade.On the dynamic scene, the degradation of the visual quality is not well captured by the standard metrics used (MSE, SSIM, LPIPS): they all seem to stay approximately constant.The effect is most noticeable quantitatively on the boxes scene (for b > 2 pixels), specially using the SSIM and LPIPS metrics (SSIM decreases and LPIPS increases with noise).The images reconstructed with the CNN regularizer degrade more abruptly than with other regularizers.While this is a contrived experiment (no algorithm would produce flow corrupted by large uniform white noise), it reassures the motivation of this work: a large portion of the problem is already solved by providing accurate optical flow.Hence, this stimulates future research on better event-based dense optical flow estimation methods.

Computational Performance
The proposed method consists of multiple iterations of linear solvers with regularizers (e.g., Tikhonov, TV or CNN).
Runtime is roughly proportional to the number of iterations.We implement the method using the standard library Pylops [65], running on a laptop's CPU (Intel i7-8650U CPU at 1.90 GHZ, single-threaded).The Plug-and-Play module provided by [41] is used as CNN regularizer.The following numbers are calculated by reconstructing 100 images using 30k events/image from [14] (240 × 180 pixel resolution), collecting the runtime and taking the average.Assuming the optical flow and NIWE are known, the image reconstruction time of our off-the-shelf implementation takes 0.4401 s (Tikhonov by regularized inversion algorithm with 100 iterations), 4.0443 s (TV by the split E-RAFT [28] MCM [27] (a) IWE (events) (b) Ours (CNN) Fig. 18: Image reconstruction in combination with off-theshelf dense optical flow methods [28] and [27].The driving scene is from the DSEC dataset [59] (640 × 480 pixels).Bregman algorithm with 20 outer and 10 inner iterations) and 28.3904 s (CNN, 16 iterations).For comparison, RNNs using optimized implementations take: 0.2448 s (E2VID), 0.2839 s (ECNN) and 0.4059 s (BTEB) on the same CPU.The time to covert the events into a voxel grid to feed to the RNNs is 0.01 s.Likewise, the unoptimized MCM method [27] takes 54.544 s to compute optical flow, while the Ev-FlowNet CNN [66] (trained in [20], [27] using different loss functions) takes 0.1521 s (i.e., 350× speed-up).Additional speed gains can be obtained when running the ANNs on a GPU, as reported by the corresponding publications.The proposed method contains elementary operations that are local and parallelizable, hence we envision that with the appropriate hardware and software implementation considerable speed gains are to be achieved.

Limitations
Artifacts: Despite our method not explicitly modeling temporal consistency, some degree of consistency between consecutive image reconstructions is attained, as shown in Fig. 21.We use CNN image priors for regularization and to remove artifacts.While the majority of artifacts are removed, some remain (Fig. 22).For example, when the flow changes direction abruptly within an event packet, motion compensation with a single motion model may fail to give a good NIWE.Hence, two consecutively reconstructed images may change appearance considerably (Fig. 22(a)).Another artifact that may arise is a black/white spot (Fig. 22(b)), produced if the center of rotation lands in the middle of the image (no events in the NIWE).These artifacts are due to the lack of a temporal consistency prior or recurrent connections.Future research could look into mitigating these artifacts via temporal filtering, i.e., targeting video reconstruction instead of image reconstruction.Brightness Constancy: Like other reconstruction methods, ours is based on the brightness constancy assumption (e.g., in (7)).This has pros and cons.Among the advantages, the method does not require supervisory signal (ground truth and/or grayscale frames), and it does not even need to train a network because it can leverage existing ANNs for image denoising.The downside is that it only recovers brightness of moving edges, hence (i) it may struggle to handle events caused by flickering lights, and (ii) it might lead to inconsistencies [31] if the camera is stationary.Supervised or semisupervised learning methods (like [7], [20], [31]) may forego this assumption given sufficient high quality data, which may be difficult to obtain.The stationary camera problem may be solved by combining events and grayscale frames [6].
Slow and fast objects: Dealing with slow and fast motions in the same scene could be problematic in image reconstruction because, in the same time window, few events would be present from the slow motion while many events would be present from the faster moving parts of the scene.While the NIWE partially accounts for this, the brightness of the slow object would not be as reliably reconstructed as that of the fast object if the speeds were too different.Incorporating recurrent connections that propagate information through time could mitigate this issue, but it would introduce others, such as increasing the initialization phase (i.e., not being able to produce short-time reconstructions Fig. 14).There is a trade-off.We think that the above issue could be addressed in a more sensible manner using motion segmentation [46], [47], before building the NIWE (Sec.3.7.1).
Despite the limitations, the proposed method has many appealing properties and we hope it will foster novel research on estimating both flow and intensity, and in noting that flow, while difficult, facilitates the recovery of intensity.

CONCLUSION
Instead of mainstream end-to-end brightness reconstruction from events based on RNNs, we have emphasized the framework of simultaneously estimating both physically entangled quantities in the events: brightness and motion (optical flow).By exploiting the asymmetries of the estimation problem, we have shown how, using motion to generate images of warped events, brightness reconstruction becomes a linear problem, which we have tackled with classical regularizers and image priors.This novel approach is explainable, combines the physics of event generation with years of developments in linear inverse problem solvers, and leverages existing image denoising networks (i.e., does not require ground truth data).In its competitive form (CNN image denoising prior), the resulting method is not

Fig. 2 :
Fig. 2: From events to brightness.Paths followed by image reconstruction methods in terms of the visual quantities estimated.The two predominant categories of methods are either end-to-end or estimate the Laplacian ∇ 2 L and subsequently solve Poisson's equation.We show (i) a new path based on the image of warped events (IWE), and (ii) how to improve Poisson's path by incorporating image priors.Paths are blue; visual quantities are black.

( a )
Fig.3:The events E in (a) are motion-compensated using optical flow to create the image of warped events (IWE) (c).The sharp IWE approximates the x-derivative of the ground truth frame better than the uncompensated image (b).Our method reconstructs brightness image (d) from the IWE (c) by solving a linear system of equations with regularization (CNN-based in this example).Texture details smaller than the contrast sensitivity C in (1) cannot be recovered since they do not trigger events.The event data in the figure is from the slider far sequence in[14], consisting of rocks and tree-like textures.

Fig. 4 :
Fig. 4: Image reconstruction using different regularization terms with increasing complexity (continued from Fig. 3).Solving the linear system (9) without regularization already captures most of the scene content (a), but the solution suffers from discretization artifacts along the optical flow streamlines.The regularization terms (b)-(e) mitigate the artifacts, thus improving image quality, by enforcing structural information in different ways.The last row shows results from state-of-the-art baselines (Sec.4).

Fig. 5 :
Fig. 5: Qualitative comparison with the state of the art on sequences from [14].Histogram equalization is not used.

Fig. 8 :
Fig. 8: Qualitative comparison with the state of the art on sequences from [50].Histogram equalization is not used.

Fig. 9 :
Fig. 9: Super-resolution.Reconstructing and super-resolving brightness by factors of 2× and 4× in each spatial dimension.Details are best viewed by zooming in.

Fig. 10 :
Fig. 10: Reconstruction using the Laplacian.Results using the method of Sec.3.6 with the CNN regularizer.

19 Fig. 12 :
Fig. 12: Evolution of the objective value and brightness image during reconstruction with total variation (TV) regularization (10).(a) The value of the prior term (i.e., regularizer) is λ ∇ .The value of the data term drops significantly in the first iteration and the structure of the scene shows up (c).Then, image details are refined in subsequent iterations (d)-(g).The prior term encourages image smoothness, hence preventing the solution from completely fitting the data term.The first image (b) contains an animation of the evolution.Best seen in Acrobat Reader (possibly clicking on image (b)).

Fig. 13 :
Fig. 13: Evolution of the objective value and brightness image during reconstruction with Tikhonov regularization (i.e., with prior term λ ∇ 2 ).Similar comments as in Fig. 12 apply.Starting from zero brightness (b), the value of the data term drops significantly during the first iterations and the structure of the scene (mostly around edges) appears (c)-(d).Homogeneous regions are filled in subsequent iterations (e)-(g).The animation (b) is best seen in Acrobat Reader.

Fig. 14 :Fig. 15 :
Fig. 14: Brightness reconstruction from a stationary event camera (346 × 260 pixels) observing penguins in Antarctica while it is moved by a wind gust.

Fig. 19 :
Fig. 19: Sensitivity study: Image quality metrics vs. noise in the flow field for the three regularizers: CNN (solid line), TV (dashed line) and Tikhonov (solid-circled line).

TABLE 4 :
[50]titative evaluation of our method and state-of-the-art methods on sequences from N-Caltech[50].We report median values of MSE, SSIM and LPIPS, since they are more robust to outliers than the mean.Images are histogramequalized before computing the metrics.