Event-based Vision: A Survey

Event cameras are bio-inspired sensors that work radically differently from traditional cameras. Instead of capturing images at a fixed rate, they measure per-pixel brightness changes asynchronously. This results in a stream of events, which encode the time, location and sign of the brightness changes. Event cameras possess outstanding properties compared to traditional cameras: very high dynamic range (140 dB vs. 60 dB), high temporal resolution (on the order of microseconds), low power consumption, and freedom from motion blur. Hence, event cameras have a large potential for robotics and computer vision in scenarios that are challenging for traditional cameras, such as high speed and high dynamic range. However, novel methods are required to process the unconventional output of these sensors in order to unlock their potential. This paper provides a comprehensive overview of the emerging field of event-based vision, with a focus on the applications and the algorithms developed to unlock the outstanding properties of event cameras. We present event cameras from their working principle, the actual sensors that are available and the tasks that they have been used for, from low-level vision (feature detection and tracking, optic flow, etc.) to high-level vision (reconstruction, segmentation, recognition). We also discuss the techniques developed to process events, including learning-based techniques, as well as specialized processors for these novel sensors, such as spiking neural networks. Additionally, we highlight the challenges that remain to be tackled and the opportunities that lie ahead in the search for a more efficient, bio-inspired way for machines to perceive and interact with the world.

high dynamic range (140 dB versus 60 dB of standard cameras), and low power consumption. Hence, event cameras have a large potential for robotics and wearable applications in challenging scenarios for standard cameras, such as high speed and high dynamic range. Although event cameras have become commercially available only since 2008 [2], the recent body of literature on these new sensors as well as the recent plans for mass production claimed by companies, such as Samsung [5] and Prophesee, highlight that there is a big commercial interest in exploiting these novel vision sensors for mobile robotics, augmented and virtual reality (AR/VR), and video game applications. However, because event cameras work in a fundamentally different way from standard cameras, measuring per-pixel brightness changes (called "events") asynchronously rather than measuring "absolute" brightness at constant rate, novel methods are required to process their output and unlock their potential.

PRINCIPLE OF OPERATION OF EVENT CAMERAS
In contrast to standard cameras, which acquire full images at a rate specified by an external clock (e.g., 30 fps), event cameras, such as the Dynamic Vision Sensor (DVS) [2], [31], [32], [33], [34], respond to brightness changes in the scene asynchronously and independently for every pixel (Fig. 1b). Thus, the output of an event camera is a variable data-rate sequence of digital "events" or "spikes", with each event representing a change of brightness (log intensity) of predefined magnitude at a pixel at a particular time (Fig. 1b) (Section 2.4). This encoding is inspired by the spiking nature of biological visual pathways (Section 3.3).
Each pixel memorizes the log intensity each time it sends an event, and continuously monitors for a change of sufficient magnitude from this memorized value (Fig. 1a). When the change exceeds a threshold, the camera sends an event, which is transmitted from the chip with the x, y location, the time t, and the 1-bit polarity p of the change (i.e., brightness increase ("ON") or decrease ("OFF")). This event output is illustrated in Figs. 1b, 1e and 1f.
The events are transmitted from the pixel array to the periphery and then out of the camera using a shared digital output bus, typically by using address-event representation (AER) readout [37], [38]. This bus can become saturated, which perturbs the times at which events are sent. Event cameras have readout rates ranging from 2 MHz [2] to 1200 MHz [39], depending on the chip and type of hardware interface.
Event cameras are data-driven sensors: their output depends on the amount of motion or brightness change in the scene. The faster the motion, the more events per second are generated, since each pixel adapts its delta modulator sampling rate to the rate of change of the log intensity signal that it monitors. Events are timestamped with microsecond resolution and are transmitted with sub-millisecond latency, which makes these sensors react quickly to visual stimuli.
The incident light at a pixel is a product of scene illumination and surface reflectance. If illumination is approximately constant, a log intensity change signals a reflectance change. These changes in reflectance are mainly the result of the movement of objects in the field of view. That is why the DVS brightness change events have a built-in invariance to scene illumination [2].
Fig. 1 (caption, partial). (e) A white square on a rotating black disk viewed by the DAVIS produces grayscale frames and a spiral of events in space-time. Events in space-time are color-coded, from green (past) to red (present). (f) Frame and overlaid events of a natural scene; the frames lag behind the low-latency events (colored according to polarity). Images adapted from [4], [35]. A more in-depth comparison of the DVS, DAVIS, and ATIS pixel designs can be found in [36].
Note on nomenclature: "Event cameras" output data-driven events that signal a place and time. This nomenclature has evolved over the past decade: originally they were known as address-event representation (AER) silicon retinas, and later they became event-based cameras. In general, events can signal any kind of information (intensity, local spatial contrast, etc.), but over the last five years or so, the term "event camera" has unfortunately become practically synonymous with the particular representation of brightness change output by DVS's.
Comparing Bandwidths of DVS Pixels and Frame-Based Camera. Although DVS pixels are fast, like any physical transducer, they have a finite bandwidth: if the incoming light intensity varies too quickly, the front-end photoreceptor circuits filter out the variations [40]. The rise and fall time, which is analogous to the exposure time in standard image sensors, is the reciprocal of this bandwidth. Fig. 2 shows an example of measured DVS pixel frequency response (DVS128 in [2]). The measurement setup (Fig. 2a) uses a sinusoidally-varying generated signal to measure the response. Fig. 2b shows that, at low frequencies, the DVS pixel produces a certain number of events per cycle. Above some cutoff frequency, the variations are filtered out by the photoreceptor dynamics, and thus the number of events per cycle drops. This cutoff frequency is a monotonically increasing function of light intensity. At the brighter light intensity, the DVS pixel bandwidth is about 3 kHz, equivalent to an exposure time of about 300 µs. At 1000× lower intensity, the DVS bandwidth is reduced to about 300 Hz. Even when the LED brightness is reduced by a factor of 1,000, the frequency response of DVS pixels is ten times higher than the 30 Hz Nyquist frequency of a 60 fps image sensor. Also, the frame-based camera aliases frequencies above the Nyquist frequency back to the baseband, whereas the DVS pixel does not, owing to its continuous-time response.
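The bandwidth figures in this paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it takes the rise/fall time as the plain reciprocal of the bandwidth, as the text does. The reciprocal of 3 kHz is about 333 microseconds, and the Nyquist frequency of a 60 fps sensor is half its frame rate.

```python
def equivalent_exposure_us(bandwidth_hz: float) -> float:
    """Rise/fall time in microseconds, taken as the reciprocal of the bandwidth."""
    return 1e6 / bandwidth_hz


def nyquist_hz(frame_rate_fps: float) -> float:
    """Highest frequency a frame-based camera samples without aliasing."""
    return frame_rate_fps / 2.0


# Bright condition: 3 kHz pixel bandwidth.
bright_exposure_us = equivalent_exposure_us(3000.0)

# Dim condition (1000x less light): 300 Hz bandwidth vs. a 60 fps camera.
dim_ratio = 300.0 / nyquist_hz(60.0)

print(f"equivalent exposure: {bright_exposure_us:.0f} us")
print(f"DVS bandwidth / frame-camera Nyquist: {dim_ratio:.0f}x")
```

Even in the dim condition the pixel bandwidth stays an order of magnitude above what a 60 fps camera can represent without aliasing.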

Event Camera Designs
This section presents the most common event camera designs. The actual devices (commercial or prototype cameras such as the DAVIS240) are summarized in Section 2.5.
The first silicon retina was developed by Mahowald and Mead at Caltech during the period 1986-1992, in Ph.D. thesis work [41] that was awarded the prestigious Clauser prize. Mahowald and Mead's sensor had logarithmic pixels, was modeled after the three-layer Kuffler retina, and produced as output spike events using the AER protocol. However, it suffered from several shortcomings: each wire-wrapped retina board required precise adjustment of biasing potentiometers; there was considerable mismatch between the responses of different pixels; and pixels were too large to be a device of practical use. Over the next decade the neuromorphic community developed a series of silicon retinas. These developments are summarized in [36], [38], [42], [43].
The DVS event camera [2] had its genesis in a frame-based silicon retina design where the continuous-time photoreceptor was capacitively coupled to a readout circuit that was reset each time the pixel was sampled [44]. More recent event camera technology has been reviewed in the electronics and neuroscience literature [10], [36], [38], [45], [46], [47]. Although surprisingly many applications can be solved by only processing DVS events (i.e., brightness changes), it became clear that some also require some form of static output (i.e., "absolute" brightness). To address this shortcoming, there have been several developments of cameras that concurrently output dynamic and static information.
The Asynchronous Time Based Image Sensor (ATIS) [3], [48] has pixels that contain a DVS subpixel (called change detection, CD) that triggers another subpixel to read out the absolute intensity (exposure measurement, EM). The trigger resets a capacitor to a high voltage. The charge is bled away from this capacitor by another photodiode. The brighter the light, the faster the capacitor discharges. The ATIS intensity readout transmits two more events coding the time between crossing two threshold voltages, as in [49]. This way, only pixels that change provide their new intensity values. The brighter the illumination, the shorter the time between these two events. The ATIS achieves large static dynamic range (> 120 dB). However, the ATIS has the disadvantage that pixels are at least double the area of DVS pixels. Also, in dark scenes the time between the two intensity events can be long and the readout of intensity can be interrupted by new events ([50] proposes a workaround to this problem).
The widely-used Dynamic and Active Pixel Vision Sensor (DAVIS) [4], [51] illustrated in Fig. 1 combines a conventional active pixel sensor (APS) [52] with a DVS in the same pixel. The advantage over ATIS is a much smaller pixel size since the photodiode is shared and the readout circuit only adds about 5 percent to the DVS pixel area. Intensity (APS) frames can be triggered at a constant frame rate or on demand, by analysis of DVS events, although the latter is seldom exploited. However, the APS readout has limited dynamic range (55 dB) and, like a standard camera, it is redundant if the pixels do not change.
Since the ATIS and DAVIS pixel designs include a DVS pixel (change detector) [36], we often use the term "DVS" to refer to the binary-polarity event output or circuitry, regardless of whether it is from a DVS, ATIS or DAVIS design.

Advantages of Event Cameras
Event cameras offer numerous potential advantages over standard cameras:
High Temporal Resolution. Monitoring of brightness changes is fast, in analog circuitry, and the read-out of the events is digital, with a 1 MHz clock, i.e., events are detected and timestamped with microsecond resolution. Therefore, event cameras can capture very fast motions, without suffering from the motion blur typical of frame-based cameras.
Low Latency. Each pixel works independently and there is no need to wait for a global exposure time of the frame: as soon as the change is detected, it is transmitted. Hence, event cameras have minimal latency: about 10 µs on the lab bench, and sub-millisecond in the real world.
High Dynamic Range (HDR). The very high dynamic range of event cameras (> 120 dB) notably exceeds the 60 dB of high-quality, frame-based cameras, making them able to acquire information from moonlight to daylight. This is because the photoreceptors of the pixels operate on a logarithmic scale and each pixel works independently, not waiting for a global shutter. Like biological retinas, DVS pixels can adapt to very dark as well as very bright stimuli.
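To give a feel for what these decibel figures mean, the sketch below converts them to intensity ratios. It assumes the 20·log10 convention commonly used for image-sensor dynamic range; the text does not state the convention, so treat this as an assumption, and the function names are hypothetical.

```python
import math


def db_to_ratio(db: float) -> float:
    """Brightest-to-darkest intensity ratio for a dynamic range given in dB.

    Assumes the 20*log10 convention (an assumption, not stated in the text).
    """
    return 10.0 ** (db / 20.0)


def ratio_to_db(ratio: float) -> float:
    """Inverse of db_to_ratio under the same convention."""
    return 20.0 * math.log10(ratio)


event_camera_ratio = db_to_ratio(120.0)   # six orders of magnitude
frame_camera_ratio = db_to_ratio(60.0)    # three orders of magnitude
print(f"{event_camera_ratio:.0e} vs {frame_camera_ratio:.0e}")
```

Under this convention, the 60 dB gap between the two sensor types corresponds to a factor of one thousand in the range of intensities that can be sensed in a single scene.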

Challenges Due to the Novel Sensing Paradigm
Event cameras represent a paradigm shift in acquisition of visual information. Hence, they pose the challenge of designing novel methods (algorithms and hardware) to process the acquired data and extract information from it in order to unlock the advantages of the camera. Specifically:
1) Coping with different space-time output: The output of event cameras is fundamentally different from that of standard cameras: events are asynchronous and spatially sparse, whereas images are synchronous and dense. Hence, frame-based vision algorithms designed for image sequences are not directly applicable to event data.
2) Coping with different photometric sensing: In contrast to the grayscale information that standard cameras provide, each event contains binary (increase/decrease) brightness change information. Brightness changes depend not only on the scene brightness, but also on the current and past relative motion between the scene and the camera.
3) Coping with noise and dynamic effects: All vision sensors are noisy because of the inherent shot noise in photons and from transistor circuit noise, and they also have non-idealities. This situation is especially true for event cameras, where the process of quantizing temporal contrast is complex and has not been completely characterized.
Therefore, new methods need to rethink the space-time, photometric and stochastic nature of event data. This poses the following questions: What is the best way to extract information from the events relevant for a given task? and How can noise and non-ideal effects be modeled to better extract meaningful information from the events?

Event Generation Model
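An event camera [2] has independent pixels that respond to changes in their log photocurrent $L \doteq \log(I)$ ("brightness").
Specifically, in a noise-free scenario, an event $e_k \doteq (\mathbf{x}_k, t_k, p_k)$ is triggered at pixel $\mathbf{x}_k \doteq (x_k, y_k)^\top$ and at time $t_k$ as soon as the brightness increment since the last event at the pixel, i.e.,
$$\Delta L(\mathbf{x}_k, t_k) \doteq L(\mathbf{x}_k, t_k) - L(\mathbf{x}_k, t_k - \Delta t_k), \tag{1}$$
reaches a temporal contrast threshold $\pm C$ (Fig. 1b), i.e.,
$$\Delta L(\mathbf{x}_k, t_k) = p_k C, \tag{2}$$
where $C > 0$, $\Delta t_k$ is the time elapsed since the last event at the same pixel, and the polarity $p_k \in \{+1, -1\}$ is the sign of the brightness change [2].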
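The noise-free triggering rule can be illustrated for a single pixel with a short simulation. This is a minimal sketch under stated assumptions, not a model of any particular sensor: the intensity is available only at discrete sample times, events are stamped with the sample time at which the threshold crossing is noticed, and the function name `generate_events` and the value of `C` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def generate_events(times: Sequence[float],
                    intensities: Sequence[float],
                    C: float = 0.3) -> List[Tuple[float, int]]:
    """Return (timestamp, polarity) events for one noise-free pixel.

    The pixel memorizes the log intensity at its last event and fires
    whenever the current log intensity differs from it by at least C.
    """
    events = []
    L_ref = math.log(intensities[0])          # memorized log intensity
    for t, intensity in zip(times[1:], intensities[1:]):
        L = math.log(intensity)
        # A large change between two samples may cross the threshold
        # several times, producing several events with the same timestamp.
        while abs(L - L_ref) >= C:
            p = 1 if L > L_ref else -1        # ON (+1) or OFF (-1)
            events.append((t, p))
            L_ref += p * C                    # move the reference by one threshold
    return events


times = [0.0, 1.0, 2.0, 3.0, 4.0]
log_values = [0.0, 0.35, 0.7, 0.7, 0.05]     # brightens, holds, then darkens
intensities = [math.exp(v) for v in log_values]
events = generate_events(times, intensities, C=0.3)
print(events)
```

Note that no event is produced while the brightness holds steady between the third and fourth samples: a static input generates no output.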
The contrast sensitivity $C$ is determined by the pixel bias currents [56], [57], which set the speed and threshold voltages of the change detector in Fig. 1 and are generated by an on-chip digitally-programmed bias generator. The sensitivity $C$ can be estimated knowing these currents [56]. In practice, positive ("ON") and negative ("OFF") events may be triggered according to different thresholds, $C^+$, $C^-$. Typical DVS's [2], [5] can set thresholds between 10 and 50 percent illumination change. The lower limit on $C$ is determined by noise and pixel-to-pixel mismatch (variability); setting $C$ too low results in a storm of noise events, starting from pixels with low values of $C$. Experimental DVS's with higher photoreceptor gain are capable of lower thresholds, e.g., 1 percent [58], [59], [60]; however, these values are only obtained under very bright illumination and ideal conditions. Fundamentally, the pixel must react to a small change in the photocurrent in spite of the shot noise present in this current. This shot noise limitation sets the relation between threshold and speed of the DVS under a particular illumination and desired detection reliability condition [60], [61].
Events and the Temporal Derivative of Brightness. Eq. (2) states that event camera pixels set a threshold on the magnitude of the brightness change since the last event happened. For a small $\Delta t_k$, such an increment (2) can be approximated using Taylor's expansion by $\Delta L(\mathbf{x}_k, t_k) \approx \frac{\partial L}{\partial t}(\mathbf{x}_k, t_k)\,\Delta t_k$, which allows us to interpret the events as providing information about the temporal derivative
$$\frac{\partial L}{\partial t}(\mathbf{x}_k, t_k) \approx \frac{p_k C}{\Delta t_k}. \tag{3}$$
This is an indirect way of measuring brightness, since with standard cameras we are used to measuring absolute brightness. Note that DVS events are triggered by a change in brightness magnitude (2), not by the brightness derivative (3) exceeding a threshold. The above interpretation may be taken into account to design physically-grounded event-based algorithms, such as [7], [23], [24], [28], [62], [63], [64], [65], as opposed to algorithms that simply process events as a collection of points with vague photometric meaning.
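The interpretation of events as samples of the temporal derivative can be made concrete as follows. The sketch assumes a single pixel, a known and constant threshold `C`, and a hypothetical helper name; each event after the first yields one estimate, the polarity times `C` divided by the time elapsed since the previous event at that pixel.

```python
from typing import List, Sequence, Tuple


def derivative_from_events(events: Sequence[Tuple[float, int]],
                           C: float) -> List[Tuple[float, float]]:
    """Estimate dL/dt at each event of one pixel as p_k * C / dt_k.

    `events` is a time-ordered list of (timestamp, polarity) pairs.
    The first event has no predecessor, so it produces no estimate.
    """
    estimates = []
    for (t_prev, _), (t_k, p_k) in zip(events, events[1:]):
        dt_k = t_k - t_prev                   # time since the last event
        estimates.append((t_k, p_k * C / dt_k))
    return estimates


# Two ON events 1 ms apart, then an OFF event 2 ms later (C = 0.2).
pixel_events = [(0.000, +1), (0.001, +1), (0.003, -1)]
estimates = derivative_from_events(pixel_events, C=0.2)
for t, dLdt in estimates:
    print(f"t = {t:.3f} s: dL/dt is about {dLdt:.0f} per second")
```

Short inter-event intervals map to large derivative magnitudes, which is why fast brightness changes produce high event rates.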
Events are Caused by Moving Edges. Assuming constant illumination, linearizing (2) and using the brightness constancy assumption one can show that events are caused by moving edges. For small $\Delta t$, the intensity increment (2) can be approximated by
$$\Delta L \approx -\nabla L \cdot \mathbf{v}\,\Delta t, \tag{4}$$
that is, it is caused by a brightness gradient $\nabla L(\mathbf{x}_k, t_k) = (\partial_x L, \partial_y L)^\top$ moving with velocity $\mathbf{v}(\mathbf{x}_k, t_k)$ on the image plane, over a displacement $\Delta\mathbf{x} \doteq \mathbf{v}\,\Delta t$.
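The step leading to this approximation can be spelled out as follows. This is a sketch of the standard argument, assuming that brightness constancy holds for the log intensity $L$ along the trajectory of an image point.

```latex
\begin{align}
% Brightness constancy: L does not change along the motion of an image point.
\frac{dL}{dt} = \frac{\partial L}{\partial t} + \nabla L \cdot \mathbf{v} = 0
\quad &\Longrightarrow \quad
\frac{\partial L}{\partial t} = -\nabla L \cdot \mathbf{v}, \\
% First-order Taylor expansion of the brightness increment in time.
\Delta L \approx \frac{\partial L}{\partial t}\,\Delta t
\quad &\Longrightarrow \quad
\Delta L \approx -\nabla L \cdot \mathbf{v}\,\Delta t .
\end{align}
```

Because of the dot product, motion parallel to an edge (perpendicular to the gradient) gives $\Delta L \approx 0$ and therefore triggers no events, whereas motion along the gradient produces the largest increment.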
model takes into account sensor noise and transistor mismatch, yielding a mixture of frozen and temporally varying stochastic triggering conditions represented by a probability function, which is itself a complex function of local illumination level and sensor operating parameters. The measurement of such probability density was shown in [2] (for the DVS128), suggesting a normal distribution centered at the contrast threshold $C$. The 1σ width of the distribution is typically 2-4 percent temporal contrast. This event generation model can be included in emulators [73] and simulators [74] of event cameras, and in event processing algorithms [24], [66]. Other probabilistic event generation models have been proposed, such as: the likelihood of event generation being proportional to the magnitude of the image gradient [75] (for scenes where large intensity gradients are the source of most event data), or the likelihood being modeled by a mixture distribution to be robust to sensor noise [7]. Future, even more realistic models may include the refractory period (i.e., the duration in time that the pixel ignores log brightness changes after it has generated an event; the larger the refractory period, the fewer events are produced by fast moving objects), and bus congestion [76].
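A simulator might combine a normally distributed threshold with a refractory period along the following lines. This is a rough sketch under several assumptions that are not taken from the cited works: the threshold is redrawn after every event, the reference level simply follows the input during the refractory period, at most one event is emitted per sample, and all names and parameter values are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def generate_noisy_events(times: Sequence[float],
                          log_intensities: Sequence[float],
                          C: float = 0.2,
                          sigma: float = 0.03,
                          refractory: float = 0.0,
                          rng: Optional[random.Random] = None
                          ) -> List[Tuple[float, int]]:
    """Single-pixel events with a noisy threshold and a refractory period."""
    rng = rng or random.Random(0)             # fixed seed for repeatability
    events = []
    L_ref = log_intensities[0]
    threshold = max(1e-3, rng.gauss(C, sigma))
    t_last = -math.inf                        # time of the last event
    for t, L in zip(times[1:], log_intensities[1:]):
        if t - t_last < refractory:
            L_ref = L                         # changes are ignored (assumption)
            continue
        if abs(L - L_ref) >= threshold:
            events.append((t, 1 if L > L_ref else -1))
            L_ref = L                         # memorize the current level
            t_last = t
            threshold = max(1e-3, rng.gauss(C, sigma))  # redraw the threshold
    return events


# A steadily brightening pixel sampled every millisecond for one second.
times = [i * 0.001 for i in range(1001)]
ramp = [i * 0.005 for i in range(1001)]
free = generate_noisy_events(times, ramp, sigma=0.0)
limited = generate_noisy_events(times, ramp, sigma=0.0, refractory=0.1)
print(len(free), len(limited))
```

With the noise switched off, the run with a refractory period produces clearly fewer events for the same stimulus, matching the qualitative statement in the text.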

Event Camera Availability
Table 1 summarizes the most popular or recent cameras. The numbers therein are approximate since they were not measured using a common testbed. Event camera characteristics are considerably different from other CMOS image sensor (CIS) technology, and so there is a need for an agreement on standard specifications to be better used by researchers. As Table 1 shows, since the first practical event camera [2] there has been a trend mainly to increase spatial resolution, increase readout speed, and add features, such as: gray level output (in ATIS and DAVIS), integration with an Inertial Measurement Unit (IMU) [77] and multi-camera timestamp synchronization [78]. IMUs act as a vestibular sense that may improve camera pose estimation, as in visual-inertial odometry. Only recently has the focus turned more towards the difficult task of reducing pixel size for economical mass production of sensors with large pixel arrays. In this respect, 3D wafer stacking fabrication has the biggest impact in reducing pixel size and increasing the fill factor.
Pixel Size. The most widely used event cameras have quite large pixels: 40 µm (DVS128), 30 µm (ATIS), 18.5 µm (DAVIS240, DAVIS346) (Table 1). The smallest published DVS pixel [68] is 4.86 µm, while conventional global shutter industrial APS pixels are typically in the range of 2 µm to 4 µm. Low spatial resolution is certainly a limitation for applications, although many of the seminal publications are based on the 128 × 128 pixel DVS128 [2]. The DVS with the largest published array size has only about 1 Mpixel spatial resolution (1280 × 960 pixels [39]). Event camera pixel size has shrunk closely following feature size scaling, which is remarkable considering that a DVS pixel is a mixed-signal circuit, and such circuits generally do not scale with technology. However, achieving even smaller pixels is difficult and may require abandoning the strictly asynchronous circuit design philosophy that the cameras started with [79]. Camera cost is constrained by die size (since silicon costs about $5-$10/cm² in mass production), and optics (designing new mass production miniaturized optics to fit a different sensor format can cost tens of millions of dollars).
Fill Factor. A major obstacle for early event camera mass production prospects was the limited fill factor of the pixels (i.e., the ratio of a pixel's light sensitive area to its total area). Because the pixel circuit is complex, less of the pixel area is available for the photodiode that collects light. For example, a pixel with 20 percent fill factor throws away 4 out of 5 photons. Obviously this is not acceptable for optimum performance; nonetheless, even the earliest event cameras could sense high contrast features under moonlight illumination [2]. Early CIS sensors dealt with this problem by including microlenses that focused the light onto the pixel photodiode. What is probably better, however, is to use back-side illumination technology (BSI). BSI flips the chip so that it is illuminated from the back, so that in principle the entire pixel area can collect photons. Nearly all smartphone cameras are now back illuminated, but the additional cost of BSI fabrication has meant that only recently BSI event cameras were demonstrated [39], [68], [69], [80]. BSI also brings problems: light can create additional 'parasitic' photocurrents that lead to spurious 'leak' events [56].
Cost. Currently, a practical obstacle to adoption of event camera technology is the high cost of several thousand dollars per camera, similar to the situation with early time of flight, structured lighting and thermal cameras. The high costs are due to non-recurring engineering costs for the silicon design and fabrication (even when much of it is provided by research funding) and the limited samples available from prototype runs. It is anticipated that this price will drop precipitously once this technology enters mass production, as shown by the "Samsung SmartThings Vision" consumer-grade home monitoring device: it contains an event camera [5] and sells for 100 dollars.

EVENT PROCESSING
One of the key questions of the paradigm shift posed by event cameras is how to extract meaningful information from the event data to fulfill a given task. This is a very broad question, since the answer is application dependent, and it drives the algorithmic design of the task solver.
Event cameras acquire information in an asynchronous and sparse way, with high temporal resolution and low latency. Hence, the temporal aspect, especially latency, plays an essential role in the way events are processed. Depending on how many events are processed simultaneously, two categories of algorithms can be distinguished: (i) methods that operate on an event-by-event basis, where the state of the system (the estimated unknowns) can change upon the arrival of a single event, thus achieving minimum latency, and (ii) methods that operate on groups or packets of events, which introduce some latency. Discounting latency considerations, methods based on groups (i.e., temporal windows) of events can still provide a state update upon the arrival of each event if the window slides by one event. Hence, the distinction between both categories is more subtle: an event alone does not provide enough information for estimation, and so additional information, in the form of past events or extra knowledge, is needed. We review this categorization.
Orthogonally, depending on how events are processed, we can distinguish between model-based approaches and model-free (i.e., data-driven, machine learning) approaches. Assuming events are processed in an optimization framework, another classification concerns the type of objective or loss function used: geometric- versus temporal- versus photometric-based (e.g., a function of the event polarity or the event activity). Each category presents methods with advantages and disadvantages, and current research focuses on exploring the possibilities that each method can offer.

Event Representations
Events are processed and often transformed into alternative representations (Fig. 3) that facilitate the extraction of meaningful information ("features") to solve a given task. Here we review popular representations of event data. Several of them arise from the need to aggregate the little information conveyed by individual events in the absence of additional knowledge. Some representations are simple, hand-crafted data transformations whereas others are more elaborate.
Individual events $e_k \doteq (\mathbf{x}_k, t_k, p_k)$ are used by event-by-event processing methods, such as probabilistic filters and Spiking Neural Networks (SNNs) (Section 3.3). The filter or SNN has additional information, built up from past events or given by additional knowledge, that is fused with the incoming event asynchronously to produce an output. Examples include: [7], [24], [62], [84], [85].
Event Packet. Events $\mathcal{E} \doteq \{e_k\}_{k=1}^{N_e}$ in a spatio-temporal neighborhood are processed together to produce an output. Precise timestamp and polarity information is retained by this representation. Choosing the appropriate packet size $N_e$ is critical to satisfy the assumptions of the algorithm (e.g., constant motion speed during the span of the packet), which varies with the task. Examples are [18], [19], [86], [87].
Event Frame/Image or 2D Histogram. The events in a spatio-temporal neighborhood are converted in a simple way (e.g., by counting events or accumulating polarity pixel-wise) into an image (2D grid) that can be fed to image-based computer vision algorithms. Some algorithms may work in spite of the different statistics of event frames and natural images. Such histograms can provide a natural activity-driven sample rate; see [88] for methods to accumulate such frames for computing flow. However, this practice is not ideal in the event-based paradigm because it quantizes event timestamps, can discard sparsity (but see [89]), and the resulting images are highly sensitive to the number of events used. Nevertheless, the high impact of event frames in the literature [23], [26], [64], [88], [90], [91] is clear because (i) they are a simple way to convert an unfamiliar event stream into a familiar 2D representation containing spatial information about scene edges, which are the most informative regions in natural images, (ii) they inform not only about the presence of events but also about their absence (which is informative), (iii) they have an intuitive interpretation (e.g., an edge map, a brightness increment image) and (iv) they are the data structure compatible with conventional computer vision.
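The two simplest ways of building an event frame, counting events and accumulating polarity per pixel, can be sketched as below. The `(x, y, t, p)` tuple layout and the function name are illustrative assumptions rather than a standard format.

```python
from typing import Iterable, List, Tuple

Event = Tuple[int, int, float, int]           # (x, y, timestamp, polarity)


def event_frame(events: Iterable[Event], width: int, height: int,
                use_polarity: bool = True) -> List[List[int]]:
    """Accumulate a packet of events into a 2D grid (rows indexed by y).

    With use_polarity=True each pixel holds the sum of polarities (a crude
    brightness increment image); otherwise it holds the event count.
    """
    frame = [[0] * width for _ in range(height)]
    for x, y, _t, p in events:
        frame[y][x] += p if use_polarity else 1
    return frame


packet = [(0, 0, 0.001, +1), (0, 0, 0.002, +1), (1, 1, 0.003, -1)]
polarity_frame = event_frame(packet, width=2, height=2)
count_frame = event_frame(packet, width=2, height=2, use_polarity=False)
print(polarity_frame, count_frame)
```

Note how the timestamps play no role in the result: this is the timestamp quantization that the text lists as a drawback of the representation.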
Fig. 3. Several event representations (Section 3.1) of the slider_depth sequence [81]. From left to right: events in space-time, colored according to polarity (positive in blue, negative in red). Event frame (brightness increment image $\Delta L(\mathbf{x})$). Time surface with last timestamp per pixel (darker pixels indicate recent time), only for negative events. Interpolated voxel-grid (240 × 180 × 10 voxels), colored according to polarity, from dark (negative) to bright (positive). Motion-compensated event image [82] (sharp edges obtained by event accumulation are darker than pixels with no events, in white). Reconstructed intensity image by [8]. Grid-like representations are compatible with conventional computer vision methods [83].
Time Surface (TS). A TS is a 2D map where each pixel stores a single time value (e.g., the timestamp of the last event at that pixel [92], [93]). Thus events are converted into an image whose "intensity" is a function of the motion history at that location, with larger values corresponding to a more recent motion. TSs are called Motion History Images in classical computer vision [94]. They explicitly expose the rich temporal information of the events and can be updated asynchronously. Using an exponential kernel, TSs emphasize recent events over past events. To achieve invariance to motion speed, normalization is proposed [95], [96]. Compared to other grid-like representations of events, TSs highly compress information as they only keep one timestamp per pixel, thus their effectiveness degrades on textured scenes, in which pixels spike frequently. To make TSs less sensitive to noise, each pixel value may be computed by filtering the events in a space-time window [97]. More examples include [21], [98], [99], [100].
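A time surface with an exponential kernel can be sketched in a few lines. The decay constant `tau`, the reference time `t_ref`, and the function name are hypothetical choices for illustration; polarity is ignored here, although separate surfaces per polarity are also possible.

```python
import math
from typing import Iterable, List, Tuple

Event = Tuple[int, int, float, int]           # (x, y, timestamp, polarity)


def time_surface(events: Iterable[Event], width: int, height: int,
                 t_ref: float, tau: float = 0.05) -> List[List[float]]:
    """Exponentially decayed map of the most recent event time per pixel.

    Each pixel holds exp(-(t_ref - t_last) / tau): 1.0 for an event at
    t_ref, decaying towards 0.0 for older events, 0.0 if no event occurred.
    """
    last = [[-math.inf] * width for _ in range(height)]
    for x, y, t, _p in events:
        if t <= t_ref:                        # ignore events after t_ref
            last[y][x] = max(last[y][x], t)   # keep one timestamp per pixel
    return [[math.exp(-(t_ref - t_last) / tau) for t_last in row]
            for row in last]


stream = [(0, 0, 0.90, +1), (0, 0, 0.95, -1), (1, 0, 1.00, +1)]
surface = time_surface(stream, width=3, height=1, t_ref=1.0, tau=0.05)
print([round(v, 3) for v in surface[0]])
```

Only the latest timestamp at pixel (0, 0) survives, which illustrates the information compression mentioned above: earlier events at the same pixel leave no trace in the surface.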
Voxel Grid. A voxel grid is a space-time (3D) histogram of events, where each voxel represents a particular pixel and time interval. This representation better preserves the temporal information of the events by not collapsing them onto a 2D grid (Fig. 3). If polarity is used, the voxel grid is an intuitive discretization of a scalar field (polarity $p(x, y, t)$ or brightness variation $\partial L(x, y, t)/\partial t$) defined on the image plane, with absence of events marked by zero polarity. Each event's polarity may be accumulated on a voxel [101], [102] or spread among its closest voxels using a kernel [8], [103], [104]. Both schemes quantize event timestamps but the latter (interpolated voxel grid) provides sub-voxel accuracy.
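The interpolated variant can be sketched with a linear (triangular) kernel along the time axis, so that each event's polarity is split between its two nearest temporal bins. The kernel choice, the bin layout (bin centers at equally spaced times from `t0` to `t1`), and the function name are assumptions for illustration, not the exact scheme of the cited works.

```python
from typing import Iterable, List, Tuple

Event = Tuple[int, int, float, int]           # (x, y, timestamp, polarity)


def voxel_grid(events: Iterable[Event], width: int, height: int,
               num_bins: int, t0: float, t1: float) -> List[List[List[float]]]:
    """Polarity voxel grid indexed as grid[bin][y][x], linear in time.

    Requires num_bins >= 2 and t1 > t0. Events outside [t0, t1] are dropped.
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    for x, y, t, p in events:
        # Continuous bin coordinate in [0, num_bins - 1].
        tn = (t - t0) / (t1 - t0) * (num_bins - 1)
        if tn < 0 or tn > num_bins - 1:
            continue
        b0 = int(tn)                          # lower neighboring bin
        w1 = tn - b0                          # weight of the upper bin
        grid[b0][y][x] += p * (1.0 - w1)
        if b0 + 1 < num_bins:
            grid[b0 + 1][y][x] += p * w1
    return grid


stream = [(0, 0, 0.00, +1), (1, 0, 0.25, -1), (1, 0, 1.00, +1)]
grid = voxel_grid(stream, width=2, height=1, num_bins=3, t0=0.0, t1=1.0)
print(grid)
```

The event at t = 0.25 falls halfway between the first two bins and contributes half of its polarity to each, which is the sub-voxel accuracy that plain accumulation lacks.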
3D Point Set. Events in a spatio-temporal neighborhood are treated as points in 3D space, $(x_k, y_k, t_k) \in \mathbb{R}^3$. Thus the temporal dimension becomes a geometric one. It is a sparse representation, and is used in point-based geometric processing methods, such as plane fitting [21] or PointNet [105].
Point Sets on Image Plane. Events are treated as an evolving set of 2D points on the image plane. It is a popular representation among early shape tracking methods based on mean-shift or ICP [106], [107], [108], [109], [110], where events provide the only data needed to track edge patterns.
Motion-Compensated Event Image [111], [112]. This is a representation that depends not only on events but also on a motion hypothesis. The idea of motion compensation is that, as an edge moves on the image plane, it triggers events on the pixels it traverses; the motion of the edge can be estimated by warping the events to a reference time and maximizing their alignment, producing a sharp image (i.e., histogram) of warped events (IWE) [112]. Hence, this representation (IWE) suggests a criterion to measure how well events fit a candidate motion: the sharper the edges produced by warping events, the better the fit [82]. Moreover, the resulting motion-compensated images have an intuitive meaning (i.e., the edge patterns causing the events) and provide a more familiar representation of visual information than the events. In a sense, motion compensation reveals a hidden ("motion-invariant") map of edges in the event stream. The images may be useful for further processing, such as feature tracking [64], [113]. There are motion-compensated versions of point sets [114], [115] and time surfaces [116], [117].
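The idea of scoring a candidate motion by the sharpness of the warped events can be sketched as follows. Several simplifications are assumed here and are not claimed to match the cited methods: a single constant image-plane velocity for all events, nearest-pixel rounding when warping, image variance as the sharpness measure, and a brute-force search over a handful of candidate velocities. All function names are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

Event = Tuple[int, int, float, int]           # (x, y, timestamp, polarity)
Velocity = Tuple[float, float]                # (vx, vy) in pixels per second


def image_of_warped_events(events: Iterable[Event], velocity: Velocity,
                           width: int, height: int,
                           t_ref: float = 0.0) -> List[List[int]]:
    """Warp events back to t_ref along `velocity` and count them per pixel."""
    vx, vy = velocity
    image = [[0] * width for _ in range(height)]
    for x, y, t, _p in events:
        xw = round(x - vx * (t - t_ref))
        yw = round(y - vy * (t - t_ref))
        if 0 <= xw < width and 0 <= yw < height:
            image[yw][xw] += 1
    return image


def variance(image: Sequence[Sequence[int]]) -> float:
    """Variance of pixel values, used here as a simple sharpness score."""
    values = [v for row in image for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_velocity(events: Sequence[Event], candidates: Sequence[Velocity],
                  width: int, height: int) -> Velocity:
    """Pick the candidate whose warped image is sharpest."""
    return max(candidates, key=lambda v: variance(
        image_of_warped_events(events, v, width, height)))


# Synthetic vertical edge starting at x = 2 and moving right at 10 px/s.
edge_events = [(2 + k, y, k * 0.1, +1) for k in range(6) for y in range(4)]
candidates = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0), (15.0, 0.0)]
estimate = best_velocity(edge_events, candidates, width=10, height=4)
sharp = image_of_warped_events(edge_events, estimate, width=10, height=4)
print(estimate, sharp[0])
```

With the correct velocity all events of a row pile up on the pixel where the edge was at the reference time, recovering the edge pattern that caused them; wrong candidates leave the events smeared over several pixels and score a lower variance.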
Reconstructed Images. Brightness images obtained by image reconstruction (Section 4.5) can be interpreted as a more motion-invariant representation than event frames or TSs, and be used for inference [8], yielding first-rate results.
A general framework for converting event data into some of the above grid-based representations is presented in [83]. It also studies how the choice of representation passed to an artificial neural network (ANN) affects task performance and consequently proposes to automatically learn the representation that maximizes such performance.

Methods for Event Processing
Event processing systems consist of several stages: pre-processing (input adaptation), core processing (feature extraction and analysis) and post-processing (output creation).The event representations in Section 3.1 may occur at different stages: for example, in [111] an event packet is used at preprocessing, and motion-compensated event images are the internal representation at the core processing stage.
The methods used to process events are influenced by the choice of representation and hardware platform available.These three factors influence each other.For example, it is natural to use dense representations and design algorithms accordingly that are executed on standard processors (e.g., CPUs or GPUs).At the same time, it is also natural to process events one-by-one on SNNs (Section 3.3) that are implemented on neuromorphic hardware (Section 5.1), in search for more efficient and low-latency solutions.Major exponents of event-by-event methods are filters (deterministic or probabilistic) and SNNs.For events processed in packets there are also many methods: hand-crafted feature extractors, deep neural networks (DNNs), etc. Next, we review some of the most common methods.
Event-by-Event-Based Methods. Deterministic filters, such as (space-time) convolutions and activity filters, have been used for noise reduction, feature extraction [118], image reconstruction [62], [119] and brightness filtering [63], among other applications. Probabilistic filters (Bayesian methods), such as Kalman and particle filters, have been used for pose tracking in SLAM systems [7], [24], [25], [75], [84]. These methods rely on the availability of additional information (typically "appearance" information, e.g., grayscale images or a map of the scene), which may be provided by past events or by additional sensors. Then, each incoming event is compared against such information and the resulting mismatch provides innovation to update the filter state. Filters are a dominant class of methods for event-by-event processing because they naturally (i) handle asynchronous data, thus providing minimum processing latency, preserving the sensor's characteristics, and (ii) aggregate information from multiple small sources (e.g., events).
The other dominant class of methods takes the form of a multi-layer ANN (whether spiking or not) containing many parameters which must be computed from the event data.Networks trained with unsupervised learning typically act as feature extractors for a classifier (e.g., SVM), which still requires some labeled data for training [15], [93], [120].If enough labeled data is available, supervised learning methods such as backpropagation can be used to train a network without the need for a separate classifier.Many approaches use packets of events during training (deep learning on frames), and later convert the trained network to an SNN that processes data event-by-event [121], [122], [123], [124], [125].Event-by-event model-free methods have mostly been applied to classify objects [15], [93], [121], [122] or actions [16], [17], [126], and have targeted embedded applications [121], often using custom SNN hardware [15], [17] (Section 5.1).SNNs trained with deep learning typically provide higher accuracy than those relying on unsupervised learning for feature extraction, but there is growing interest in finding efficient ways to implement supervised learning directly in SNNs [126], [127] and in embedded devices [128].
Methods for Groups of Events.Because each event carries little information and is subject to noise, several events are often processed together to yield a sufficient signal-to-noise ratio for the problem considered.Methods for groups of events use the above representations (event packet, event frame, etc.) to gather the information contained in the events in order to estimate the problem unknowns, usually without requiring additional data.Hence, events are processed differently depending on their representation.
Many representations just perform data pre-processing to enable the re-utilization of image-based computer vision tools.In this respect, event frames are a practical representation that has been used by multiple methods on various tasks.In [90], [129] event frames allow to re-utilize traditional stereo methods, providing modest results.They also provide an adaptive frame rate signal that is profitable for camera pose estimation [26] (by image alignment) or optical flow computation [88] (by block matching).Event frames are also a simple yet effective input for image-based learning methods (DNNs, SVMs, Random Forests) [22], [91], [130], [131].Few works design algorithms taking into account their photometric meaning (4).This was done in [23], showing that such a simple representation allows to jointly compute several visual quantities of interest (optical flow, brightness, etc.).Intensity increment images (4) are also used for feature tracking [64], image deblurring [28] or camera tracking [65].
Because time surfaces (TSs) are sensitive to scene edges and the direction of motion, they have been utilized for many tasks involving motion analysis and shape recognition. For example, fitting local planes to the TS yields optical flow information [21], [132]. TSs are used as building blocks of hierarchical feature extractors, similar to neural networks, that aggregate information from successively larger space-time neighborhoods; the aggregated information is then passed to a classifier for recognition [93], [97]. TSs provide proxy intensity images for matching in stereo methods [100], [133], where the photometric matching criterion becomes temporal: matching pixels based on event concurrence and similarity of event timestamps across image planes. Recently, TSs have been probed as input to convolutional ANNs (CNNs) to compute optical flow [22], where the network acts both as feature extractor and velocity regressor. TSs are popular for corner detection using adaptations of image-based methods (Harris, FAST) [95], [98], [99] or new learning-based ones [96]. However, their performance degrades on highly textured scenes [99] due to the "motion overwriting" problem [94].
Methods working on voxel grids include variational optimization and ANNs (e.g., DNNs).They require more memory and often more computations than methods working on lower dimensional representations but are able to provide better results because temporal information is better preserved.In these methods voxel grids are used as an internal representation [101] (e.g., to compute optical flow) or as the multichannel input/output of a DNN [103], [104].Thus, voxel grids are processed by means of convolutions [103], [104] or the operations derived from the optimality conditions of an objective function [101].
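To make the voxel-grid representation concrete, here is a minimal sketch that builds a space-time grid by splitting each event's polarity between its two nearest temporal bins (linear interpolation in time). The function name, the normalization of timestamps and the toy events are assumptions made for illustration; the exact discretizations used in [101], [103], [104] may differ.

```python
import numpy as np

def events_to_voxel_grid(xs, ys, ts, ps, num_bins, shape):
    """Accumulate polarities into a (num_bins, H, W) grid.

    xs, ys: integer pixel coordinates; ts: timestamps; ps: polarities (+1/-1).
    Each event contributes to its two nearest temporal bins with linear weights.
    """
    h, w = shape
    grid = np.zeros((num_bins, h, w))
    t0, t1 = ts.min(), ts.max()
    span = max(t1 - t0, 1e-12)                 # avoid division by zero
    tn = (ts - t0) / span * (num_bins - 1)     # timestamps mapped to [0, B-1]
    lo = np.floor(tn).astype(int)
    hi = np.minimum(lo + 1, num_bins - 1)
    w_hi = tn - lo                             # weight of the later bin
    w_lo = 1.0 - w_hi                          # weight of the earlier bin
    np.add.at(grid, (lo, ys, xs), ps * w_lo)
    np.add.at(grid, (hi, ys, xs), ps * w_hi)
    return grid

# Three toy events on a 2 x 3 sensor.
xs = np.array([1, 1, 2])
ys = np.array([0, 0, 1])
ts = np.array([0.0, 0.25, 1.0])
ps = np.array([1.0, -1.0, 1.0])
grid = events_to_voxel_grid(xs, ys, ts, ps, num_bins=3, shape=(2, 3))
print(grid)
```

The second event falls halfway between the first two bins, so its negative polarity is shared equally between them; the total polarity in the grid equals the sum of the input polarities.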
Once events have been converted to grid-like representations, countless tools from conventional vision can be applied to extract information: from feature extractors (e.g., CNNs) to similarity metrics (e.g., cross-correlation) that measure the goodness of fit or consistency between data and task-model hypothesis (the degree of event alignment, etc.).Such metrics are used as objective functions for classification (SVMs, CNNs), clustering, data association, motion estimation, etc.In the neuroscience literature there are efforts to design metrics that act directly on spikes (e.g., event stream), to avoid the issues that arise due to data conversion.
Deep learning methods for groups of events consist of a deep neural network (DNN).Sample applications include classification [134], [135], image reconstruction [8], [102], steering angle prediction [91], [136], and estimation of optical flow [22], [103], [137], depth [137] or ego-motion [103].These methods differentiate themselves mainly in the representation of the input and in the loss functions optimized during training.Several representations have been used, such as event images [91], [131], TSs [22], [117], [137], voxel grids [103], [104] or point sets [105] (Section 3.1).While loss functions in classification tasks use manually annotated labels, networks for regression tasks from events may be supervised by a third party ground truth (e.g., a pose) [91], [131] or by an associated grayscale image [22] to measure photoconsistency, or be completely unsupervised (depending only on the training input events) [103], [137].Loss functions for unsupervised learning from events are studied in [82].In terms of architecture, most networks have an encoder-decoder structure, as in Fig. 4.Such a structure allows the use of convolutions only, thus minimizing the number of network weights.Moreover, a loss function can be applied at every spatial scale of the decoder.
Finally, motion compensation is a technique to estimate the parameters of the motion that best fits a group of events. It has a continuous-time warping model that allows exploiting the fine temporal resolution of events (Section 3.1), and hence departs from conventional image-based algorithms. Motion compensation can be used to estimate ego-motion [111], [112], optical flow [103], [112], [114], [138], depth [19], [82], [112], motion segmentation [116], [138], [139] or feature motion for VIO [113], [115]. The technique in [87] also has a continuous-time motion model, albeit not used for motion compensation but rather to fuse event data with IMU data. To find the parameters of the continuous-time motion models [82], [87], standard optimization methods, e.g., conjugate gradient or Gauss-Newton, may be applied.
Fig. 4. Events in a space-time volume are converted into an interpolated voxel grid (left) that is fed to a DNN to compute optical flow and ego-motion in an unsupervised manner [103]. Thus, modern tensor-based DNN architectures are re-utilized using novel loss functions (e.g., motion compensation) adapted to event data.
The number of events per group (i.e., size of the spatio-temporal neighborhood) is an important hyper-parameter of many methods.It highly depends on the processing algorithm and the available resources, and accepts multiple selection strategies [11], [88], [102], [111], such as constant number of events, constant observation time (i.e., constant frame rate), or more adaptive ones (thresholding the number of events in regions of the image plane) [88].Utilizing a constant number of events fits naturally with the camera's output rate but it does not account for spatial variations of the rate.A constant frame rate selects a varying number of events, which may be too few or too many, depending on the scene.Criteria more adapted to the scene dynamics (in time and space) are often preferred but nontrivial to design.
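The two simplest selection strategies can be contrasted with a short sketch. It assumes sorted integer timestamps in microseconds; the helper names and the synthetic burst of events are hypothetical and only illustrate how a constant event count and a constant time window partition the same stream differently.

```python
import numpy as np

def split_by_count(ts, events_per_group):
    """Index ranges holding a fixed number of events (trailing remainder dropped)."""
    n_groups = len(ts) // events_per_group
    return [(i * events_per_group, (i + 1) * events_per_group)
            for i in range(n_groups)]

def split_by_time(ts, window_us):
    """Index ranges of consecutive fixed-duration windows (ts sorted, in microseconds)."""
    n_windows = int((ts[-1] - ts[0]) // window_us) + 1
    edges = ts[0] + window_us * np.arange(n_windows + 1)
    idx = np.searchsorted(ts, edges, side="left")
    return list(zip(idx[:-1], idx[1:]))

# A burst of 100 events in the first 10 ms, then one event every 10 ms.
ts = np.concatenate([np.arange(0, 10_000, 100),
                     np.arange(10_000, 100_001, 10_000)]).astype(np.int64)

by_count = split_by_count(ts, events_per_group=20)
by_time = split_by_time(ts, window_us=10_000)
counts_time = [int(b - a) for a, b in by_time]
durations_count = [int(ts[b - 1] - ts[a]) for a, b in by_count]
print(counts_time)       # events per fixed-duration window
print(durations_count)   # time spanned by each fixed-count group
```

With this input the fixed-duration windows contain very uneven numbers of events, whereas the fixed-count groups span short time intervals during the burst, which mirrors the trade-off described above.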

Biologically Inspired Visual Processing
Biological principles and computational primitives drive the design of event camera pixels and some of the event-processing algorithms (and hardware), such as Spiking Neural Networks (SNNs).
Visual Pathways.The DVS [2] was inspired by the function of biological visual pathways, which have "transient" pathways dedicated to processing dynamic visual information in the so-called "where" pathway.Animals ranging from insects to humans all have these transient pathways.
In humans, the transient pathway occupies about 30 percent of the visual system.It starts with transient ganglion cells, which are mostly found in retina outside the fovea.It continues with magno layers of the thalamus and particular sublayers of area V1.It then continues to area MT and MST, which are part of the dorsal pathway where many motion selective cells are found [45].The DVS corresponds to the part of the transient pathway(s) up to retinal ganglion cells.Similarly, the grayscale (EM) events of the ATIS correspond to the "sustained" or "what" pathway through the parvo layers of the brain [36], [43].
Event Processing by SNNs. Artificial neurons, such as Leaky Integrate-and-Fire or Adaptive Exponential, are computational primitives inspired by neurons found in the mammalian visual cortex. They are the basic building blocks of artificial SNNs. A neuron receives input spikes ("events") from a small region of the visual space (a receptive field), which modify its internal state (membrane potential) and produce an output spike (action potential) when the state surpasses a threshold. Neurons are connected in a hierarchical way, forming an SNN. Spikes may be produced by pixels of the event camera or by neurons of the SNN. Information travels along the hierarchy, from the event camera pixels to the first layers of the SNN and then through to higher (deeper) layers. Most first layer receptive fields are based on Difference of Gaussians (selective to center-surround contrast), Gabor filters (selective to oriented edges), and their combinations. The receptive fields become increasingly complex as information travels deeper into the network. In ANNs, the computation performed by inner layers is approximated as a convolution. One common approach in artificial SNNs is to assume that a neuron will not generate any output spikes if it has not received any input spikes from the preceding SNN layer. This assumption allows computation to be skipped for such neurons. The result of this visual processing is almost simultaneous with the stimulus presentation [140], which is very different from traditional CNNs, where convolution is computed simultaneously at all locations at fixed time intervals.
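The event-driven behavior of such a neuron can be sketched as follows. This is a minimal leaky integrate-and-fire model under simplifying assumptions (a single input weight, exponential decay between spikes, reset to zero after firing); the class name and parameter values are hypothetical and do not correspond to any specific SNN implementation cited here.

```python
import math

class LIFNeuron:
    """Leaky integrate-and-fire neuron updated only when an input spike arrives."""

    def __init__(self, tau=0.02, threshold=1.0, weight=0.4):
        self.tau = tau              # membrane time constant in seconds
        self.threshold = threshold  # firing threshold
        self.weight = weight        # contribution of one input spike
        self.v = 0.0                # membrane potential
        self.t_last = None          # time of the previous input spike

    def receive(self, t):
        """Process one input spike at time t; return True if the neuron fires."""
        if self.t_last is not None:
            # Decay the potential over the time elapsed since the last input.
            self.v *= math.exp(-(t - self.t_last) / self.tau)
        self.t_last = t
        self.v += self.weight
        if self.v >= self.threshold:
            self.v = 0.0            # reset after the output spike
            return True
        return False

neuron = LIFNeuron()
input_spikes = [0.000, 0.001, 0.002, 0.100, 0.200]
output_spikes = [t for t in input_spikes if neuron.receive(t)]
print(output_spikes)
```

Three closely spaced input spikes drive the potential over the threshold, while the later isolated spikes decay away before they can accumulate. No computation happens between input spikes, which corresponds to the assumption that neurons without input produce no output.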
Tasks.Bio-inspired models have been adopted for several low-level visual tasks.For example, event-based optical flow can be estimated by using spatio-temporally oriented filters [92], [118], [141] that mimic the working principle of receptive fields in the primary visual cortex [142], [143].The same type of oriented filters have been used to implement a spike-based model of selective attention [144] based on the biological proposal from [145].Bio-inspired models from binocular vision, such as recurrent lateral connectivity and excitatory-inhibitory neural connections [146], have been used to solve the event-based stereo correspondence problem [41], [147], [148], [149], [150] or to control binocular vergence on humanoid robots [151].The visual cortex has also inspired the hierarchical feature extraction model proposed in [152], which has been implemented in SNNs and used for object recognition.The performance of such networks improves the better they extract information from the precise timing of the spikes [153].Early networks were hand-crafted (e.g., Gabor filters) [53], but recent efforts let the network build receptive fields through brain-inspired learning, such as Spike-Timing Dependent Plasticity (STDP), yielding better recognition rates [120].This research is complemented by approaches where more computationally inspired types of supervised learning, such as back-propagation, are used in deep networks to efficiently implement spiking deep convolutional networks [127], [154], [155], [156], [157].The advantages of the above methods over their traditional vision counterparts are lower latency and higher efficiency.

ALGORITHMS / APPLICATIONS
In this section, we review several works on event-based vision, presented according to the task addressed.We start with low-level vision on the image plane, such as feature detection, tracking, and optical flow estimation.Then, we discuss tasks that pertain to the 3D structure of the scene, such as depth estimation, visual odometry (VO) and historically related subjects, e.g., intensity image reconstruction.Finally, we consider motion segmentation, recognition and coupling perception with control.

Feature Detection and Tracking
Feature detection and tracking on the image plane are fundamental building blocks of many vision tasks such as visual odometry, object segmentation and scene understanding. Event cameras make it possible to track asynchronously, adapted to the dynamics of the scene and with low latency, high dynamic range and low power (Section 2.2). Thus, they allow tracking in the "blind" time between the frames of a standard camera. To do so, the methods developed need to deal with the unique space-time and photometric characteristics of the visual signal: events report only brightness changes, asynchronously (Section 2.3).
Challenges.Since events represent brightness changes, which depend on motion direction, one of the main challenges of feature detection and tracking with event cameras is overcoming the variation of scene appearance caused by such motion dependency (Fig. 5).Tracking requires the establishment of correspondences between events (or features built from the events) at different times (i.e., data association), which is difficult due to the varying appearance.The second main challenge consists of dealing with sensor noise and possible event clutter caused by the camera motion.
Literature Review. Early event-based feature methods were very simple and focused on demonstrating the low-latency and low-processing requirements of event-driven vision systems. Hence they assumed a stationary camera scenario and tracked moving objects as clustered blob-like sources of events [6], [12], [14], [106], [158], circles [159] or lines [54]. Only pixels that generated events needed to be processed. Simple Gaussian correlation filters sufficed to detect blobs of events, which could be modeled by Gaussian Mixtures [160]. For tracking, each incoming event was associated to the nearest existing blob/feature and used to asynchronously update its parameters (location, size, etc.). Circles [159] and lines [54] were treated as blobs in the Hough transform space. These methods were used in traffic monitoring and surveillance [14], [106], [160], high-speed robotic tracking [6], [12] and particle tracking in fluids [158] or microrobotics [159]. However, they worked only for a limited class of object shapes.
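The nearest-blob association and asynchronous update described above can be sketched in a few lines. This is an illustrative simplification that tracks only blob centers with an exponential moving average and a distance gate; the class name, the mixing factor `alpha` and the gate `max_dist` are hypothetical and are not parameters of the cited trackers.

```python
import numpy as np

class BlobTracker:
    """Event-by-event tracker: each event updates the nearest blob center."""

    def __init__(self, centers, alpha=0.05, max_dist=10.0):
        self.centers = np.asarray(centers, dtype=float)  # rows of (x, y)
        self.alpha = alpha        # how strongly one event moves a center
        self.max_dist = max_dist  # events farther than this are ignored

    def update(self, x, y):
        """Assign the event to the nearest blob; return its index or None."""
        d = np.hypot(self.centers[:, 0] - x, self.centers[:, 1] - y)
        k = int(np.argmin(d))
        if d[k] > self.max_dist:
            return None           # treated as noise or an untracked object
        self.centers[k] += self.alpha * (np.array([x, y]) - self.centers[k])
        return k

tracker = BlobTracker(centers=[(10.0, 10.0), (40.0, 40.0)])

# Events scattered around (12, 10), i.e., the first object has moved by 2 px.
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
for i in range(200):
    dx, dy = offsets[i % 4]
    tracker.update(12 + dx, 10 + dy)

far_event = tracker.update(100, 100)   # too far from every blob
print(tracker.centers, far_event)
```

Only the blob that receives events is updated, so the cost per event is proportional to the number of blobs rather than to the number of pixels.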
Tracking of more complex, high-contrast user-defined shapes has been demonstrated using event-by-event adaptations of the Iterative Closest Point (ICP) algorithm [107], gradient descent [108], Mean-shift and Monte-Carlo methods [161], or particle filtering [162].The iterative methods in [107], [108] used a nearest-neighbor strategy to associate incoming events to the target shape and update its transformation parameters, showing very high-speed tracking (200kHz equivalent frame rate).Other works [161] handled geometric transformations of the target shape (aka "kernel") by matching events against a pool of rotated and scaled versions of it.The predefined kernels tracked the object without overlapping themselves due to a built-in repulsion mechanism.Complex objects, such as faces or human bodies, have been tracked with part-based shape models [163], where objects are represented as a set of basic elements linked by springs [164].The part trackers simply follow incoming blobs of events generated by ellipse-like shapes, and the elastic energy of this virtual mechanical system provides a quality criterion for tracking.In most tracking methods events are treated as individual points (without polarity) and update the system's state asynchronously, with minimal latency.The performance of the methods strongly depends on the tuning of several model parameters, which is done experimentally according to the object to track [161], [163].
The previous methods require a priori knowledge or user input to determine the objects to track.This restriction is valid for scenarios like tracking cars on a highway or balls approaching a goal, where knowing the objects greatly simplifies the computations.But when the space of objects becomes larger, methods to determine more realistic features become necessary.The features proposed in [109], [114] consist of local edge patterns that are represented as point sets.Incoming events are registered to them by means of some form of ICP.Other methods [27], [113] proposed to re-utilize well-known feature detectors [165] and trackers [166] on patches of motion-compensated event images (Section 3.1), providing good results.All these methods allowed to track features for cameras moving in natural scenes, hence enabling ego-motion estimation in realistic scenarios [110], [113], [115].Features built from motion-compensated events (in image form [113] or point-set form [114]) provide a useful representation of edge patterns.However, they depend on motion direction, and, therefore, trackers suffer from drift as event appearance changes over time [64].To track with no drift, motion-invariant features are needed.
Combining Events and Frames. Data association (Fig. 5) simplifies if the absolute intensity of the pattern to be tracked (Fig. 5c, i.e., a motion-invariant representation or "map" of the feature) is available. This is the approach followed by works that leverage the strengths of a combined frame- and event-based sensor (à la DAVIS [4]). The algorithms in [64], [109], [110] automatically detect arbitrary edge patterns (features) on the frames and track them asynchronously with events. The feature location is given by the Harris corner detector [165] and the feature descriptor is given by the edge pattern around the corner: [109], [110] convert Canny edges to point sets used as templates for ICP tracking, thus they assume events are mostly triggered at strong edges. In contrast, the edge pattern in [64] is given by the frame intensities, and tracking consists of finding the motion parameters that minimize the photometric error between the events and their frame prediction using a generative model (4). A comparison of five feature trackers is provided in [64], showing that the generative model is most accurate, with sub-pixel performance, albeit computationally expensive. Finally, [64] also shows the interesting fact that an event-based sensor suffices: frames can be replaced by images reconstructed from events (Section 4.5) and still achieve similar detection and tracking results.
Corner Detection and Tracking. Since event cameras naturally respond to edges in the scene, they shorten the detection of lower-level primitives such as keypoints or "corners". Such primitives identify pixels of interest around which local features can be extracted without suffering from the aperture problem, and therefore provide reliable tracking information. The method in [167] computes corners as the intersection of two moving edges, which are obtained by fitting planes in the space-time stream of events. To deal with event noise, least-squares is supplemented by a sampling technique similar to RANSAC. This method of fitting planes locally to time surfaces has also been profitable to estimate optical flow [21] and "event lifetime" [132], which are obtained from the coefficients of the planes. Recently, extensions of popular frame-based keypoint detectors, such as Harris [165] and FAST [168], have been developed for event cameras [95], [98], [99], by operating on time surfaces (TSs) as if they were natural intensity images. In [98] the TS is binarized before applying the derivative filters of Harris' detector. To speed up detection, [99] replaces the derivative filters with pixelwise comparisons on two concentric circles of the TS around the current event. Moving corners produce local TSs with two clearly separated regions: recent versus old events. Hence, corners are obtained by searching for arcs of contiguous pixels with higher TS values than the rest. The method in [95] improves the detector in [99] and proposes a strategy to track the corners. Assuming corners follow continuous trajectories on the image plane and the detected event corners are accurate, these are threaded by proximity along trajectories, following a tree-based hypothesis graph. The above TS-based hand-crafted corner detectors suffer from variations of the TS due to changes in motion direction. To overcome them, [96] proposes a data-driven method to learn the TS appearance of intensity-image corners. To this end, a grayscale input (from DAVIS or ATIS camera) provides the supervisory signal to label the corners. As a tradeoff between accuracy and speed, a random forest classifier is used. Event corners find multiple applications, such as visual odometry or ego-motion segmentation [169]; yet there are only a few demonstrations.
Fig. 5. [...] Clearly, it is not easy to establish event correspondences between (a) and (b) due to the changing appearance of the edge patterns in (c) with respect to the motion. Image adapted from [64].
Opportunities.In spite of the abundance of detection and tracking methods, they are rarely evaluated on common datasets for performance comparison.Establishing benchmark datasets [170] and evaluation procedures will foster progress in this and other topics.Also, in most algorithms, parameters are defined experimentally according to the tracking target.It would be desirable to have adaptive parameter tuning to increase the range of operation of the trackers.Learning-based feature detection and tracking methods also offer considerable room for research.

Optical Flow Estimation
Optical flow estimation is the problem of computing the velocity of objects on the image plane without knowledge about the scene geometry or motion. The problem is ill-posed and thus requires regularization to become tractable.
Event-based optical flow estimation is challenging because of the unfamiliar way in which events encode visual information (Section 2). In conventional cameras optical flow is obtained by analyzing two consecutive images. These provide spatial and temporal derivatives that are substituted in the brightness constancy assumption (p. 12), which together with smoothness assumptions provide enough equations to solve for the flow at each image pixel. In contrast, events provide neither absolute brightness nor spatially continuous data. Each event does not carry enough information to determine flow, and so events need to be aggregated to produce an estimate, which leads to the unusual question of where in the x-y-t-space of the image plane spanned by the events flow is computed. Ideally one would like to know the flow field over the whole space, which is computationally expensive. In practice, optical flow is computed only at specific points: at the event locations, or at images with artificially-chosen times. Nevertheless, computing flow from events is attractive because they represent edges, which are the parts of the scene where flow estimation is less ambiguous, and because their fine timing information allows measuring high speed flow [11]. Finally, another challenge is to design a flow estimation algorithm that is biologically plausible, i.e., compatible with what is known from neuroscience about early processing in the primate visual cortex, and that can be implemented efficiently in neuromorphic processors.
Literature Review. Table 2 lists some event-based optical flow methods, categorized according to different criteria. Early works [172] tried to adapt classical approaches in computer vision to event-based data (Fig. 6b). These are based on the brightness constancy assumption [166], and discussion focused on whether events carried enough information to estimate flow with such approaches [118]. Events allow estimating the temporal derivative of brightness (3), and so additional assumptions were needed to approximate the spatial derivative $\nabla L$ in order to apply such classical methods [166]. However, due to the potentially very small number of events generated at each pixel as an edge crosses over it, it is difficult to estimate derivatives $(\nabla L, \partial L / \partial t)$ reliably [118], which leads gradient-based methods like [172] to inconclusive flow estimates. Approaches that consider the local distribution of events in the x-y-t-space, as in [21], are more robust and therefore preferred.
The method in [21] reasons about the local distribution of events geometrically, in terms of time surfaces and planar approximations. As an edge moves it produces events that resemble points on a surface in space-time (the time surface, Section 3). The surface slopes in the x-t and y-t cross sections encode the edge motion, thus optical flow is estimated by fitting planes to the surface and reading the slopes from the plane coefficients. In spite of providing only normal flow (i.e., the component of the optical flow perpendicular to the edge), the method works even in the case of only a few generated events. Of course, the goodness of fit depends on the size of the spatio-temporal neighborhood (this remark generalizes to other methods). If the neighborhood is too small then the plane fit may become arbitrary. If the neighborhood is too large then the event stream may not be well approximated by a local plane.

Table 2. Event-based optical flow methods, categorized according to different criteria.

Reference                          Flow     Output   Design   Bio-inspired
Delbruck [92], [171]               Normal   Sparse   Model    Yes
Benosman et al. [171], [172]       Full     Sparse   Model    No
Orchard et al. [141]               Full     Sparse   ANN      Yes
Benosman et al. [21], [171]        Normal   Sparse   Model    No
Barranco et al. [173]              Normal   Sparse   Model    No
Brosch et al. [118]                Normal   Sparse   Model    Yes
Bardow et al. [101]                Full     Dense    Model    No
Liu et al. [88]                    Full     Sparse   Model    No
Gallego [112], Stoffregen [138]    Full     Sparse   Model    No
Haessig et al. [174]               Normal   Sparse   ANN      Yes
Zhu et al. [22], [103]             Full     Dense    ANN      No
Ye et al. [137]                    Full     Dense    ANN      No
Paredes-Vallés [85]                Full     Sparse   ANN      Yes

Some methods provide full motion flow (F) whereas others only its component normal to the local brightness edge (N). The output may be a dense (D) flow field (i.e., optical flow for every pixel at some time) or sparse (S) (i.e., flow computed at selected pixels). According to their design, methods may be model-based or model-free (Artificial Neural Network, ANN), and neuro-biologically inspired or not.
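The plane-fitting idea can be illustrated with a short sketch. It fits t = a x + b y + c to the events of a small neighborhood by ordinary least squares and converts the slopes (a, b), which are in seconds per pixel, into a normal-flow vector (a, b) / (a^2 + b^2) in pixels per second. The function name, the plain least-squares fit without outlier rejection and the synthetic data are assumptions; the method in [21] includes further steps that are omitted here.

```python
import numpy as np

def normal_flow_from_plane(xs, ys, ts):
    """Fit t = a*x + b*y + c to local events and return the normal flow (px/s)."""
    A = np.column_stack([np.asarray(xs, dtype=float),
                         np.asarray(ys, dtype=float),
                         np.ones(len(xs))])
    (a, b, c), *_ = np.linalg.lstsq(A, np.asarray(ts, dtype=float), rcond=None)
    g2 = a * a + b * b
    if g2 < 1e-18:
        return None   # flat surface: no measurable motion in this neighborhood
    # The time gradient (a, b) points along the motion; its inverse length is the speed.
    return np.array([a, b]) / g2

# Synthetic 5 x 5 neighborhood swept by a vertical edge moving at 200 px/s in +x.
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(5), np.arange(5))
xs, ys = gx.ravel(), gy.ravel()
ts = xs / 200.0 + rng.normal(0.0, 2e-5, xs.size)   # small timestamp noise

flow = normal_flow_from_plane(xs, ys, ts)
print(flow)   # close to (200, 0)
```

Shrinking the neighborhood to two or three events makes the fit poorly constrained, and enlarging it over a curved trajectory violates the planar assumption, in line with the remark on neighborhood size.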
A hierarchical architecture for optical flow estimation building on experimental findings of the primate visual system is proposed in [118]. It applies a set of spatio-temporal filters on the event stream to yield selectivity to different motion speeds and directions (à la Gabor filters) while maintaining the sparse representation of events. Such filters are formally equivalent to spatio-temporal correlation detectors. Other biologically-inspired methods [85], [141] can also be interpreted as filter banks sampling the event stream along different spatio-temporal orientations; [141] and [118] define hand-crafted filters, whereas [85] learns them from event data using a novel STDP rule. The SNN in [141] detects motion patterns by delaying events through synaptic connections and employing neurons as coincidence detectors. Its neurons are sensitive to 8 speeds and 8 directions (i.e., 64 velocities) over receptive fields of 5 × 5 pixels. These methods are implementable in neuromorphic hardware, offering low-power, efficient computations.
Methods like [23], [101] estimate optical flow jointly with other quantities, notably image intensity, so that the quantities involved bring in well-known equations and boost each other towards convergence. Knowing image intensity, or equivalently $(\nabla L, \partial L / \partial t)$, is desirable since it can be used on the brightness constancy law to provide constraints on the optical flow. In this respect, [101] combines multiple equations ((2), brightness constancy, smoothness priors, etc.) as penalty terms into an objective function that is optimized via calculus of variations. The method finds the optical flow and image intensity on the image plane that minimizes the objective function, i.e., that best explains the distribution of events in the x-y-t-space (using a voxel grid). Thus, it outputs a dense flow (i.e., flow at every pixel). Flow vectors at pixels where no events were produced (i.e., regions of homogeneous brightness) are due to the smoothness priors, thus they are less reliable than those computed at pixels where events were triggered (i.e., at edges).
The method in [88] estimates optical flow by computing event frames (Section 3) at an adaptive rate and applying video coding techniques (block matching). It can be interpreted as finding the optical flow vector that best matches the distributions of events within two cuboids (collapsed into event frames). Thus, the optical flow problem is posed as that of finding event correspondences, i.e., events triggered by the same scene point (at different times). The method defines two sets of events ("blocks") and a similarity metric to compare them. It is assumed that the appearance of event frames does not change significantly for short times and hence simple metrics, such as sum of absolute differences, suffice to compare them. The method can be implemented in FPGA, trading off efficiency for accuracy.
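Block matching on event frames can be sketched as an exhaustive search over small displacements. The example below assumes two already-accumulated event-count frames, a square block, and the sum of absolute differences as the metric; the function name, block size and search range are hypothetical choices, and the adaptive frame rate of [88] is not modeled.

```python
import numpy as np

def block_match(prev_frame, curr_frame, center, block=7, search=3):
    """Return the displacement (dx, dy) that minimizes the SAD between blocks.

    center is (row, col) and must lie at least block // 2 + search pixels
    away from the image border.
    """
    cy, cx = center
    r = block // 2
    ref = prev_frame[cy - r:cy + r + 1, cx - r:cx + r + 1]
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = curr_frame[cy + dy - r:cy + dy + r + 1,
                              cx + dx - r:cx + dx + r + 1]
            sad = np.abs(ref - cand).sum()   # sum of absolute differences
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best

# Synthetic event frames: the second is the first shifted by 2 px right, 1 px down.
rng = np.random.default_rng(0)
prev_frame = (rng.random((32, 32)) < 0.2).astype(np.int32)   # signed ints: safe subtraction
curr_frame = np.roll(prev_frame, shift=(1, 2), axis=(0, 1))

displacement = block_match(prev_frame, curr_frame, center=(16, 16))
print(displacement)
```

Dividing the displacement by the time between the two event frames gives a flow vector in pixels per second for the block.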
The framework in [82], [112], [138] computes optical flow by maximizing the sharpness of image patches obtained by warping cuboids of events, producing motion-compensated images (Section 3).It can be interpreted as applying an adaptive filter to the events, where the filter coefficients define the spatio-temporal direction that maximizes the filter's response.Motion compensation was also used to compute flow in [114], albeit using point sets.
Recently, deep learning methods have emerged [22], [103], [137].These are based on the availability of large amounts of event data paired with an ANN.In [22], an encoder-decoder CNN is trained using a self-supervised scheme to estimate dense optical flow.The loss function measures the error between DAVIS grayscale images aligned using the flow produced by the network.The trained network is able to accurately predict optical flow from events only, passed as time surfaces and event frames.The work [137] presents the first monocular ANN architecture to estimate dense optical flow, depth and ego-motion (i.e., learning structure from motion) from events only.The input to the ANN consists of events over multiple time slices, given as event frames and time surfaces with average timestamps.This reduces event noise and preserves the structure of the event stream better than [22].The network is trained unsupervised, measuring the photometric error between the events in neighboring time slices aligned using the estimated flow.Later, [22] was extended to unsupervised learning of flow and ego-motion in [103] using a motion-compensation loss function in terms of time surfaces.
Opportunities. Comprehensive datasets with accurate ground truth optical flow in multiple scenarios (varying texture, speed, parallax, occlusions, illumination, etc.) and a common evaluation methodology would be essential to assess progress and reproducibility in this paramount low-level vision task. Providing ground truth event-based optical flow in real scenes is challenging, especially for moving objects not conforming to the motion field induced by the camera's ego-motion. A thorough quantitative comparison of existing event-based optical flow methods would help identify key ideas to develop improved methods.
Fig. [...] [175]. In (a), events (polarity shown in red/blue) are overlaid on a grayscale frame from a DAVIS. (b) shows the sparse optical flow (colored according to magnitude and direction) computed using [166] on brightness increment images. (c) A different scene: dense optical flow of a fidget spinner spinning at 750°/s in a dark environment [103]. Events enable the estimation of optical flow in challenging scenarios.

3D Reconstruction: Monocular and Stereo
Depth estimation with event cameras is a broad field. It can be divided according to the considered scenario and camera setup or motion, which determine the problem assumptions.
Instantaneous Stereo. Most works on depth estimation with event cameras target the problem of "instantaneous" stereo, i.e., 3D reconstruction using events on a very short time (ideally on a per-event basis) from two or more synchronized cameras that are rigidly attached. Being synchronized, the events from different image planes share a common clock. These works follow the classical two-step stereo solution: first solve the event correspondence problem across image planes (i.e., epipolar matching) and then triangulate the location of the 3D point [176]. The main challenge is finding correspondences between events; it is the computationally intensive step. Events are matched (i) using traditional stereo metrics (e.g., normalized cross-correlation) on event frames [129], [177] or time surfaces [133] (Section 3), and/or (ii) by exploiting simultaneity and temporal correlations of the events across sensors [133], [178], [179]. These approaches are local, matching events by comparing their neighborhoods since events cannot be matched based on individual timestamps [154], [180]. Additional constraints, such as the epipolar constraint [181], ordering, uniqueness, edge orientation and polarity may be used to reduce matching ambiguities and false correspondences, thus improving depth estimation [18], [154], [182]. Event matching can also be done by comparing local context descriptors [183], [184] of the spatial distribution of events on both stereo image planes.
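The two-step solution (match, then triangulate) can be sketched as follows. This is a deliberately simplified illustration, not any of the cited methods: it assumes a rectified stereo pair (so that epipolar lines are image rows), matches on temporal coincidence, polarity and uniqueness only, and omits the neighborhood comparison that practical methods rely on. The function names, the greedy strategy and all numeric values are hypothetical.

```python
def match_events(left, right, max_dt, max_disparity):
    """Greedily match left events to right events.

    Events are (x, y, t, polarity). Constraints used: same row (epipolar
    constraint for a rectified pair), same polarity, positive bounded
    disparity, temporal coincidence, and uniqueness (each right event is
    used at most once).
    """
    matches, used = [], set()
    for xl, yl, tl, pl in left:
        best = None
        for j, (xr, yr, tr, pr) in enumerate(right):
            if j in used or yr != yl or pr != pl:
                continue
            disparity, dt = xl - xr, abs(tl - tr)
            if 0 < disparity <= max_disparity and dt <= max_dt:
                if best is None or dt < best[0]:
                    best = (dt, j, disparity)
        if best is not None:
            used.add(best[1])
            matches.append(((xl, yl), best[2]))
    return matches


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulation for a rectified pair: Z = f * b / d."""
    return focal_px * baseline_m / disparity_px


# Toy data (hypothetical values).
left_events = [(50, 10, 0.001000, +1), (30, 20, 0.002000, -1)]
right_events = [(40, 10, 0.001020, +1),   # coincident in time: true match
                (45, 10, 0.001500, +1),   # same row, but too late
                (25, 20, 0.002010, -1),   # true match
                (26, 20, 0.002010, +1)]   # wrong polarity
matches = match_events(left_events, right_events, max_dt=1e-4, max_disparity=30)
depths = [depth_from_disparity(d, focal_px=200.0, baseline_m=0.1) for _, d in matches]
print(matches, depths)
```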
Global approaches produce better depth estimates (i.e., less sensitive to ambiguities) than local approaches by considering additional regularity constraints. In this category, we find extensions of Marr and Poggio's cooperative stereo algorithm [146] for the case of event cameras [41], [148], [149], [150], [185]. These approaches consist of a network of disparity-sensitive neurons that receive events from both cameras and perform various operations (amplification, inhibition) that implement matching constraints (uniqueness, continuity) to extract disparities. They use not only the temporal similarity to match events but also their spatiotemporal neighborhoods, with iterative nonlinear operations that result in an overall globally optimal solution. A discussion of cooperative stereo is provided in [43]. Also in this category are [186], [187], [188], which use Belief Propagation on a Markov Random Field or semi-global matching [189] to improve stereo matching. These methods are primarily based on optimization, trying to define a well-behaved energy function whose minimizer is the correct correspondence map. The energy function incorporates regularity constraints, which enforce coupling of correspondences at neighboring points and therefore make the solution map less sensitive to ambiguities than local methods, at the expense of computational effort. A table comparing different stereo methods is provided in [190]; however, it should be interpreted with caution since the methods were not benchmarked on the same dataset.
Recently, brute-force space-sweeping using dedicated hardware (a GPU) has been proposed [191]. The method is based on ideas similar to [19], [112]: the correct depth manifests as "in focus" voxels of displaced events in the Disparity Space Image [19], [192]. In contrast, other approaches pair event cameras with neuromorphic processors (Section 5.1) to produce fully event-based low-power (100 mW), high-speed stereo systems [149], [190]. There is an efficiency versus accuracy tradeoff that has not been quantified yet.
Most of the methods above are demonstrated in scenes with static cameras and few moving objects, so that correspondences are easy to find due to uncluttered event data. Event matching happens with low latency, at high rate (~1 kHz) and consuming little power, which shows that event cameras are promising for high-speed 3D reconstructions of moving objects or in uncluttered scenes.
Monocular Depth Estimation. Depth estimation with a single event camera has been shown in [19], [25], [112]. It is a significantly different problem from previous ones because temporal correlation between events across multiple image planes cannot be exploited. These methods recover a semi-dense 3D reconstruction of the scene (i.e., 3D edge map) by integrating information from the events of a moving camera over time, and therefore require knowledge of camera motion. Hence they do not pursue instantaneous depth estimation, but rather depth estimation for SLAM [193].
The method in [25] is part of a pipeline that uses three filters operating in parallel to jointly estimate the motion of the event camera, a 3D map of the scene, and the intensity image. Their depth estimation approach requires using an additional quantity (the intensity image) to solve for data association. In contrast, [19] (Fig. 7) proposes a space-sweep method that leverages the sparsity of the event stream to perform 3D reconstruction without having to establish event matches or recover the intensity images. It back-projects events into space, creating a ray density volume [194], and then finds scene structure as local maxima of ray density. It is computationally efficient and used for VO in [26].
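The space-sweep idea (back-project events, accumulate ray density, look for maxima) can be sketched in a reduced setting. The following is not the algorithm of [19]: it assumes a 2D world (x, z), a 1D pinhole camera with known positions translating along x, a small set of fronto-parallel depth planes and a single global maximum instead of local maxima. All names and values are hypothetical.

```python
def ray_density(events, depths, focal_px, cell_m):
    """Vote for the cells crossed by the viewing ray of every event.

    Each event is (u, cam_x): pixel coordinate u (relative to the principal
    point) observed from a camera located at x = cam_x, z = 0, looking along +z.
    """
    votes = {}
    for u, cam_x in events:
        for iz, z in enumerate(depths):
            x = cam_x + u * z / focal_px          # back-projection onto plane z
            key = (round(x / cell_m), iz)
            votes[key] = votes.get(key, 0) + 1
    return votes


def strongest_point(votes, depths, cell_m):
    """Return (x, z, count) of the cell with the highest ray density."""
    (ix, iz), count = max(votes.items(), key=lambda item: item[1])
    return ix * cell_m, depths[iz], count


# Toy data (hypothetical): one scene edge at (x, z) = (1.0, 2.0) m seen from
# six known camera positions; the pixel follows u = f * (X - cam_x) / Z.
focal = 100.0
scene_x, scene_z = 1.0, 2.0
camera_positions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
events = [(focal * (scene_x - c) / scene_z, c) for c in camera_positions]
depth_planes = [1.0, 1.5, 2.0, 2.5, 3.0]

votes = ray_density(events, depth_planes, focal_px=focal, cell_m=0.05)
x_hat, z_hat, count = strongest_point(votes, depth_planes, cell_m=0.05)
print(x_hat, z_hat, count)
```

No event-to-event matching is performed: the rays simply intersect at the true depth and spread out on the other planes.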
Opportunities. Although there are many methods for event-based depth estimation, it is difficult to compare their performance since they are not evaluated on the same dataset. In this sense, it would be desirable to (i) provide a comprehensive dataset and testbed for event-based depth evaluation and (ii) benchmark many existing methods on the dataset, to be able to compare their performance.

Pose Estimation and SLAM
Addressing the Simultaneous Localization and Mapping (SLAM) problem with event cameras has been difficult because most methods and concepts developed for conventional cameras (feature detection, matching, iterative image alignment, etc.) are not applicable or were not available; events are fundamentally different from images. The challenge is therefore to design new SLAM techniques that are able to unlock the camera's advantages (Sections 2.3 and 2.2), showing their usefulness to tackle difficult scenarios for current frame-based cameras. Historically, the design goal of such techniques has focused on preserving the low-latency nature of the data, i.e., being able to produce a state estimate for every incoming event (Section 3). However, each event does not contain enough information to estimate the state from scratch (e.g., the six degrees of freedom (DOF) pose of a calibrated camera), and so the goal becomes that each event be able to asynchronously update the state of the system. Probabilistic (Bayesian) filters [195] are popular in event-based SLAM [7], [24], [75], [196] because they naturally fit with this description. Their main adaptation for event cameras consists of designing sensible likelihood functions based on the event generation process (Section 2.4).
Since events are caused by the apparent motion of intensity edges, the majority of maps emerging from SLAM systems naturally consist only of scene edges, i.e., semi-dense maps (Fig. 8 and [19]). However, note that an event camera does not directly measure intensity gradients but only temporal changes (2), and so the presence, orientation and strength of edges (on the image plane and in 3D) must be estimated together with the camera's motion. The strength of the intensity gradient at a scene point is correlated with the firing rate of events corresponding to that point, and it enables reliable tracking [86]. Edge information for tracking may also be obtained from gradients of brightness maps [7], [24], [25] used in generative models (Section 2.4).
The event-based SLAM problem in its most general setting (6-DOF motion and natural 3D scenes) is a challenging problem that has been addressed step-by-step in scenarios with increasing complexity. Three complexity axes can be identified: dimensionality of the problem, type of motion and type of scene. The literature is dominated by methods that address the localization subproblem first (i.e., motion estimation) because it has fewer degrees of freedom to estimate. Regarding the type of motion, solutions for constrained motions, such as rotational or planar (both being 3-DOF), have been investigated before addressing the most complex case of a freely moving camera (6-DOF). Solutions for artificial scenes in terms of photometry (high contrast) and/or structure (line-based or 2D maps) have been proposed before focusing on the most difficult case: natural scenes (3D and with arbitrary photometric variations). Some proposed solutions require additional sensing (e.g., RGB-D) to reduce the complexity of the problem. This, however, introduces some of the bottlenecks present in frame-based systems (e.g., latency and motion blur). Table 3 classifies the related work using these complexity axes.
Tracking and Mapping. Let us focus on methods that address the tracking-and-mapping problem. Cook et al. [23] proposed a generic message-passing algorithm within an interacting network to jointly estimate ego-motion, image intensity and optical flow from events. However, the system was restricted to rotational motion. Joint estimation is appealing because it allows employing as many equations as possible relating the variables (e.g., (4) and rotational prior) in the hope of finding a better solution to the problem.
An event-based 2D SLAM system was presented in [196] by extension of [84], and thus it was restricted to planar motion and high-contrast scenes. The method used a particle filter for tracking, with the event likelihood function inversely related to the reprojection error of the event with respect to the map. The map of scene edges was concurrently built; it consisted of an occupancy map [195], with each pixel representing the probability that the pixel triggered events. The method was extended to 3D in [197], but it relied on an external RGB-D sensor attached to the event camera for depth estimation. The depth sensor introduced bottlenecks, which deprived the system of the low latency and high-speed advantages of event cameras.
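The per-event filter update described above can be sketched with a toy particle filter. It is not the system of [196]: the state is reduced to a 1D camera position, the map is a hypothetical list of known edge positions rather than an occupancy map, and the Gaussian likelihood, the resampling at every event and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def reprojection_error(particle, pixel_offset, map_edges):
    """Distance between the edge implied by the event and the closest map edge."""
    world_x = particle + pixel_offset
    return min(abs(world_x - edge) for edge in map_edges)


def event_likelihood(particle, pixel_offset, map_edges, sigma):
    """Likelihood that decreases as the reprojection error grows."""
    error = reprojection_error(particle, pixel_offset, map_edges)
    return math.exp(-0.5 * (error / sigma) ** 2)


def update(particles, pixel_offset, map_edges, sigma, jitter, rng):
    """Process one event: weight the particles, resample, then diffuse slightly."""
    weights = [event_likelihood(p, pixel_offset, map_edges, sigma) for p in particles]
    if sum(weights) == 0.0:
        return particles  # event explained by no particle: skip it
    resampled = rng.choices(particles, weights=weights, k=len(particles))
    return [p + rng.gauss(0.0, jitter) for p in resampled]


# Toy data (hypothetical): three map edges, true camera position 1.0.
rng = random.Random(0)
map_edges = [0.0, 3.0, 7.0]
true_position = 1.0
particles = [rng.uniform(0.0, 2.0) for _ in range(300)]
events = [edge - true_position for edge in map_edges] * 10   # observed offsets

for pixel_offset in events:
    particles = update(particles, pixel_offset, map_edges,
                       sigma=0.2, jitter=0.01, rng=rng)

estimate = sum(particles) / len(particles)
print(round(estimate, 2))
```

Each event only nudges the particle set, which mirrors the point made earlier that a single event cannot determine the state but can update it asynchronously.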
tracking was performed by minimization of a photometric error at the event locations given a probabilistic edge map. The map was simultaneously built, and each map point represented the probability of events being generated at that location [196]. Hence it was a panoramic occupancy map measuring the strength of the scene edges.
Recently, solutions to the full problem of event-based 3D SLAM for 6-DOF motions and natural scenes, not relying on additional sensing, have been proposed [25], [26] (Table 3). The approach in [25] extends [24] and consists of three interleaved probabilistic filters to perform pose tracking as well as depth and intensity estimation. However, it suffers from limited robustness (especially during initialization) due to the assumption of uncorrelated depth, intensity gradient, and camera motion. Furthermore, it is computationally intensive, requiring a GPU for real-time operation. In contrast, the semi-dense approach in [26] shows that intensity reconstruction is not needed for depth estimation or pose tracking. The approach has a geometric foundation: it performs space sweeping for 3D reconstruction [19] and edge-map alignment (non-linear optimization with few events per frame) for pose tracking. The resulting SLAM system runs in real-time on a CPU.
Trading off latency for efficiency, probabilistic filters [24], [25], [196] can operate on small groups of events. Other approaches are natively designed for groups, based for example on non-linear optimization [26], [111], [112], and run in real time on the CPU. Processing multiple events simultaneously is also beneficial to reduce noise.
Opportunities. The above-mentioned SLAM methods lack loop-closure capabilities to reduce drift. Currently, the scales of the scenes on which event-based SLAM has been demonstrated are considerably smaller than those of frame-based SLAM. However, trying to match both scales may not be a sensible goal since event cameras may not be used to tackle the same problems as standard cameras; both sensors are complementary, as argued in [7], [27], [64], [75]. Stereo event-based SLAM is another unexplored topic, as well as designing more accurate, efficient and robust methods than the existing monocular ones. Robustness of SLAM systems can be improved by sensor fusion with IMUs [27], [193].

Image Reconstruction
Events represent brightness changes, and so, in ideal conditions (noise-free scenario, perfect sensor response, etc.) integration of the events yields "absolute" brightness. This is intuitive, since events are just a non-redundant (i.e., "compressed") per-pixel way of encoding the visual content in the scene. Event integration or, more generically, image reconstruction (Fig. 9) can be interpreted as "decompressing" the visual data encoded in the event stream. Due to the very high temporal resolution of the events, brightness images can be reconstructed at very high frame rate (e.g., 2 kHz to 5 kHz [8], [199]) or even continuously in time [62].
As the literature reveals, the insight about image reconstruction from events is that it requires regularization. Event cameras have independent pixels that report brightness changes, and, consequently, per-pixel integration of such changes during a time interval only produces brightness increment images. To recover the absolute brightness at the end of the interval, an offset image (i.e., the brightness image at the start of the interval) would need to be added to the increment [81], [200]. Surprisingly, some works have used spatial and/or temporal smoothing [62], [119], [199], [201] to reconstruct brightness starting from a zero initial condition, i.e., without knowledge of the offset image. Other forms of regularization, using learned features from natural scenes [8], [102], [104], [199] are also effective.
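Per-pixel integration and the role of the offset image can be made concrete with a minimal sketch under ideal, noise-free conditions. It assumes that every event changes the (log) brightness of its pixel by the polarity times a fixed contrast threshold; the function names and the numeric values are hypothetical.

```python
def brightness_increment(events, width, height, contrast):
    """Integrate events (x, y, t, polarity) per pixel into an increment image."""
    increment = [[0.0] * width for _ in range(height)]
    for x, y, _t, polarity in events:
        increment[y][x] += polarity * contrast
    return increment


def add_offset(offset, increment):
    """Absolute brightness at the end of the interval = offset image + increment."""
    return [[o + d for o, d in zip(offset_row, increment_row)]
            for offset_row, increment_row in zip(offset, increment)]


# Toy data (hypothetical): a 2x2 sensor with contrast threshold 0.25.
events = [(0, 0, 0.010, +1), (0, 0, 0.020, +1), (1, 0, 0.015, -1)]
increment = brightness_increment(events, width=2, height=2, contrast=0.25)
offset = [[1.0, 1.0], [1.0, 1.0]]          # brightness at the start of the interval
brightness = add_offset(offset, increment)
print(increment, brightness)
```

Without the offset image only the increment is available, which is why reconstruction from events alone needs some form of regularization or prior.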
Literature Review. Image reconstruction from events was first established in [23] under rotational camera motion and static scene assumptions. These assumptions together with the brightness constancy (4) were used in a message-passing algorithm between pixels in a network of visual maps to jointly estimate several quantities, such as scene brightness. Also under the above motion and scene assumptions, [24] showed how to reconstruct high-resolution panoramas from the events, and they popularized the idea of event-based HDR image reconstruction. Each pixel of the panoramic image used a Kalman filter to estimate the brightness gradient (based on (4)), which was then integrated using Poisson reconstruction to yield absolute brightness. The method in [203] exploited the constrained motion of a platform rotating around a single axis to reconstruct images that were then used for stereo depth estimation.
Motion restrictions were then replaced by regularizing assumptions to enable image reconstruction for generic motions and scenes [101]. In this work, image brightness and optical flow were simultaneously estimated using a variational framework that contained several penalty terms (on data fitting (1) and smoothness of the solution) to best explain a space-time volume of events discretized as a voxel grid. This method was the first to show reconstructed video from events in dynamic scenes. Later [119], [199], [201] showed that image reconstruction was possible even without having to estimate motion. This was done using a variational image denoising approach based on time surfaces [119], [201] or using sparse signal processing with a patch-based learned dictionary that mapped events to image gradients, which were then Poisson-integrated [199]. Concurrently, the VO methods in [25], [26] extended the image reconstruction technique in [24] to 6-DOF camera motions by using the computed scene depth and poses: [25] used a robust variational regularizer to reduce noise and improve contrast of the reconstructed image, whereas [26] showed image reconstruction as an ancillary result, since it was not needed to achieve VO. Recently, [62] proposed a temporal smoothing filter for image reconstruction and for continuously fusing events and frames. The filter acted independently on every pixel, thus showing that no spatial regularization on the image plane was needed to recover brightness, although such regularization naturally reduced noise and artefacts at the expense of sacrificing some real detail. More recently, [8], [104] have presented a deep learning approach that achieves considerable gains over previous methods and mitigates visual artefacts. Reflecting back on earlier works, the motion restrictions or hand-crafted regularizers that enabled image reconstruction have been replaced by perceptual, data-driven priors from natural scenes that consequently produced more natural-looking images. Note that image reconstruction methods used in VO or SLAM [23], [24], [25] assume static scenes, whereas methods with weak or no motion assumptions [8], [62], [101], [104], [119], [199], [201] are naturally used to reconstruct videos of arbitrary (e.g., dynamic) scenes.
Fig. 9. Image reconstruction. In the scenario of a car driving out of a tunnel, the frames from a consumer camera (Huawei P20 Pro) (Left) suffer from under- or over-exposure, while events capture a broader dynamic range of the scene, which is recovered by image reconstruction methods (Middle). Events also enable the reconstruction of high-speed scenes, such as an exploding mug (Right). Images courtesy of [8], [202].
Besides image reconstruction from events, another category of methods tackles the problem of fusing events and frames (e.g., from the DAVIS [4]), thus enhancing the brightness information from the frames with high temporal resolution and HDR properties of events [28], [62], [200]. These methods also do not rely on motion knowledge and are ultimately based on (2). The method in [200] performs direct event integration between frames, pixel-wise. However, the fused brightness becomes quickly corrupted by event noise (due to non-ideal effects, sensitivity mismatch, missing events, etc.), and so fusion is reset with every incoming frame. To mitigate noise, events and frames are fused in [62] using a per-pixel, temporal complementary filter that is high-pass in the events and low-pass in the frames. It is an efficient solution that takes into account the complementary sensing modality of events and frames: frames carry slowly varying brightness information (i.e., low temporal frequency), whereas events carry "change" information (i.e., high frequency). The fusion method in [28] exploits the high temporal resolution of the events to additionally remove motion blur from the frames, producing high frame-rate, sharp video from a single blurry frame and events. It is based on a double integral model (one integral to recover brightness and another one to remove blur) within an optimization framework. A limitation of the above methods is that they still suffer from artefacts due to event noise. These might be mitigated if combined with learning-based approaches [8].
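A per-pixel complementary filter of the kind described above can be sketched as follows. This is written in the spirit of [62] but is not taken from it: the update rule (exponential decay of the state toward the latest frame value between measurements, plus a jump of one contrast step per event), the gain `alpha` and the class interface are assumptions made for illustration.

```python
import math


class PixelComplementaryFilter:
    """Fuses events (high-pass) and frames (low-pass) at a single pixel."""

    def __init__(self, initial_log_frame, alpha, contrast, t0=0.0):
        self.state = initial_log_frame      # fused log-brightness estimate
        self.log_frame = initial_log_frame  # latest frame value at this pixel
        self.alpha = alpha                  # hypothetical crossover gain (1/s)
        self.contrast = contrast            # hypothetical contrast threshold
        self.t = t0

    def _decay_to(self, t):
        """Between measurements the estimate relaxes toward the frame value."""
        gain = math.exp(-self.alpha * (t - self.t))
        self.state = self.log_frame + (self.state - self.log_frame) * gain
        self.t = t

    def on_event(self, t, polarity):
        """Events contribute the fast brightness changes."""
        self._decay_to(t)
        self.state += polarity * self.contrast
        return self.state

    def on_frame(self, t, log_frame):
        """Frames contribute the slowly varying brightness level."""
        self._decay_to(t)
        self.log_frame = log_frame
        return self.state

    def read(self, t):
        self._decay_to(t)
        return self.state


pixel = PixelComplementaryFilter(initial_log_frame=0.0, alpha=2.0, contrast=0.2)
after_event = pixel.on_event(0.1, +1)   # the event is reflected immediately
much_later = pixel.read(10.1)           # event noise is forgotten over time
print(after_event, much_later)
```

Because the influence of each event decays, a spurious event does not corrupt the estimate permanently, in contrast to direct integration that must be reset with every frame.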
Applications. Image reconstruction implies that, in principle, it is possible to convert the events into brightness images and then apply mature computer vision algorithms [8], [104], [204]. This can have a high impact on both the event- and frame-based communities. The resulting images capture high-speed motions and HDR scenes, which may be beneficial in some applications, but it comes at the expense of computational cost, latency and power consumption.
Despite image reconstruction having been useful to support tasks such as recognition [199], SLAM [25] or optical flow estimation [101], there are also works in the literature, such as [97], [103], [112], [137], showing that it is not needed to fulfill such tasks. One of the most valuable aspects of image reconstruction is that it provides scene representations (e.g., appearance maps [7], [24]) that are more invariant to motion than events and also facilitate establishing event correspondences, which is one of the biggest challenges of some event data processing tasks, such as feature tracking [64].

Motion Segmentation
Segmentation of moving objects viewed by a stationary event camera is simple because events are solely imputable to the motion of the objects (assuming constant illumination) [106], [108], [161]. The challenges arise in the scenario of a moving camera because events are triggered everywhere on the image plane [13], [116], [139] (Fig. 10), produced by moving objects and the static scene (whose apparent motion is induced by the camera's ego-motion), and the goal is to infer this causal classification for each event. However, each event carries very little information, and therefore it is challenging to perform the mentioned per-event classification.
Overcoming these challenges has been done by tackling segmentation scenarios of increasing complexity, obtained by reducing the amount of additional information given to solve the problem. Such additional information adopts the form of known object shape or known motion, i.e., the algorithm knows "what object to look for" or "what type of motion it expects" and objects are segmented by detecting (in-)consistency with respect to the expectation. The less additional information is provided, the more unsupervised the problem becomes (e.g., clustering). In such a case, segmentation is enabled by the key insight that moving objects produce distinctive traces of events on the image plane and it is possible to infer the trajectories of the objects that generate those traces, yielding the segmented objects [139]. Like clustering, this is a joint optimization problem in the motion parameters of the objects (i.e., the "clusters") and the event-object associations (i.e., the segmentation).
Literature Review. Considering known object shape, [13] presents a method to detect and track a circle in the presence of event clutter caused by the moving camera. It is based on the Hough transform using optical flow information extracted from temporal windows of events. The method was extended in [162] using a particle filter to improve tracking robustness: the duration of the observation window was dynamically selected to accommodate sudden motion changes due to accelerations of the object. More generic object shapes were detected and tracked by [169] using event corners (Section 4.1) as geometric primitives. In this method, additional knowledge of the robot joints controlling the camera motion was required.
Segmentation has been addressed by [116], [138], [139] under mild assumptions leveraging the idea of motion-compensated event images [111] (Section 3). Essentially this technique associates events that produce sharp edges when warped according to a motion hypothesis. The simplest hypothesis is a linear motion model (i.e., constant optical flow), yet it is sufficiently expressive: for short times, scenes may be described as collections of objects producing events that fit different linear motion models. Such a scene description is what the cited segmentation algorithms seek. Specifically, the method in [138] first fits a linear motion-compensation model to the dominant events, then removes these and fits another linear model to the remaining events, greedily. Thus, it clusters events according to optical flow, yielding motion-compensated images with sharp object contours. Similarly, [116] detects moving objects in clutter by fitting a motion-compensation model to the dominant events (i.e., the background) and detecting inconsistencies with respect to it (i.e., the objects). They test the method in challenging scenarios inaccessible to standard cameras (HDR, high-speed) and release a dataset. The work in [139] proposes an iterative clustering algorithm that jointly estimates the event-object associations (i.e., segmentation) and the motion parameters of the objects (i.e., clusters) that produce sharpest motion-compensated event images. It allows for general parametric motion models [112] to describe each object and produces better results than greedy methods [116], [138]. In [117] a learning-based approach for segmentation using motion-compensation is proposed: ANNs are used to estimate depth, ego-motion, segmentation masks of independently moving objects and object 3D velocities. An event-based dataset is provided for supervised learning, which includes accurate pixel-wise motion masks of 3D-scanned objects that are reliable even in poor lighting conditions and during fast motion.
Fig. 10. The iCub humanoid robot from IIT has two event cameras in the eyes. Here, it segments and tracks a ball under event clutter produced by the motion of the head. Right: space-time visualization of the events on the image frame, colored according to polarity (positive in green, negative in red). Image courtesy of [162].
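The greedy strategy (fit a linear motion model to the dominant events, remove them, repeat) can be sketched in one spatial dimension. This is a toy illustration and not the method of [138] or [116]: the sharpness measure (sum of squared counts of warped events), the discrete velocity candidates and the inlier rule based on a minimum bin count are all assumptions.

```python
def warped_bins(events, velocity):
    """Group events (x, t) by the position they warp to at t = 0."""
    bins = {}
    for x, t in events:
        bins.setdefault(round(x - velocity * t), []).append((x, t))
    return bins


def sharpness(bins):
    """Sum of squared event counts: larger when warped events pile up."""
    return sum(len(bucket) ** 2 for bucket in bins.values())


def greedy_segmentation(events, candidates, n_models, min_count):
    """Repeatedly fit the dominant constant-velocity model and remove its events."""
    remaining, segments = list(events), []
    for _ in range(n_models):
        if not remaining:
            break
        velocity = max(candidates,
                       key=lambda v: sharpness(warped_bins(remaining, v)))
        bins = warped_bins(remaining, velocity)
        inliers = [e for bucket in bins.values()
                   if len(bucket) >= min_count for e in bucket]
        segments.append((velocity, inliers))
        inlier_set = set(inliers)
        remaining = [e for e in remaining if e not in inlier_set]
    return segments


# Toy data (hypothetical): object A has two edges moving at +10 pixels/s,
# object B has one edge moving at -20 pixels/s; one event per edge every 0.1 s.
object_a = [(x0 + k, 0.1 * k) for x0 in (10, 14) for k in range(6)]
object_b = [(40 - 2 * k, 0.1 * k) for k in range(6)]
segments = greedy_segmentation(object_a + object_b,
                               candidates=[-20, -10, 0, 10, 20],
                               n_models=2, min_count=3)
print([(velocity, len(inliers)) for velocity, inliers in segments])
```

An iterative clustering scheme such as the one in [139] would instead alternate between updating the event-object associations and the motion parameters, rather than committing to one model at a time.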
Segmentation is a paramount topic in frame-based vision, yet it is rather unexplored in event-based vision. As more complex scenes are addressed and more advanced event-based vision techniques are developed, more works targeting this challenging problem are expected to appear.

Recognition
Algorithms. Recognition algorithms for event cameras have grown in complexity, from template matching of simple shapes to classifying arbitrary edge patterns using either traditional machine learning on hand-crafted features or modern deep learning methods. This evolution aims at endowing recognition systems with more expressibility (i.e., approximation capacity) and robustness to data distortions.
Early research with event-based sensors began with tracking a moving object using a static sensor. An event-driven update of the position of a model of the object shape was used to detect and track objects with a known simple shape, such as a blob [6], circle [53], [205] or line [54]. Simple shapes can also be detected by matching against a predefined template, which removes the need to describe the geometry of the object. This template matching approach was implemented using convolutions in early hardware [53].
For more complex objects, templates can be used to match low-level features instead of the entire object, after which a classifier can be used to make a decision based on the distribution of features observed [93]. Nearest Neighbor classifiers are typically used, with distances calculated in feature space. Accuracy can be improved by increasing feature invariance, which can be achieved using a hierarchical model where feature complexity increases in each layer. With a good choice of features, only the final classifier needs to be retrained when switching tasks. This leads to the problem of selecting which features to use. Hand-crafted orientation features were used in early works, but far better results are obtained by learning the features from the data itself. In the simplest case, each template can be obtained from an individual sample, but such templates are sensitive to noise in the sample data [15]. One may follow a generative approach, learning features that enable accurate reconstruction of the input, as was done in [122] with a Deep Belief Network (DBN). More recent work obtains features by unsupervised learning, clustering the event data and using the center of each cluster as a feature [93]. During inference, each event is associated to its closest feature, and a classifier operates on the distributions of features observed. With the rise of deep learning in frame-based computer vision, many have sought to leverage deep learning tools for event-based recognition, using back-propagation to learn features. This approach has the advantage of not requiring a separate classifier at the output, but the disadvantage of requiring far more labeled data for training. Image recognition with events also suffers from the practical problem of the availability of training data in the event domain [206]. In [207] the authors use wormhole learning, a semi-supervised approach in which, starting from a detector in the RGB domain, one is able to train a detector in the event domain; moreover, in a second phase the teacher becomes the student, and some of the illumination invariance of the event sensor is transferred to the RGB-only detector.
Most learning-based approaches convert events/spikes into (dense) tensors, a convenient representation for image-based hierarchical models, e.g., ANNs (Fig. 11). There are different ways the value of each tensor element can be computed (Section 3.1). Simple methods use the time surfaces, or event histogram frames. A more robust method uses time surfaces with exponential decay [93] or with average timestamps [97]. Image reconstruction methods (Section 4.5) may also be used. Some recognition approaches rely on converting spikes to frames during inference [134], [199], while others convert the trained ANN to an SNN which can operate directly on the event data [121]. Similar ideas can be applied for tasks other than recognition [22], [91]. As neuromorphic hardware advances (Section 5.1), there is increasing interest in learning directly in SNNs [127] or even directly in the neuromorphic hardware itself [128].
Tasks. Early tasks focused on detecting the presence of a simple shape (such as a circle) from a static sensor [6], [53], [205], but soon progressed to the classification of more complex shapes, such as card pips [121] (Fig. 11b), block letters [15] and faces [93], [199]. A popular task throughout has been the classification of hand-written digits. Inspired by the role it has played in frame-based computer vision, a few event-based MNIST datasets have been generated from the original MNIST dataset [58], [208]. These datasets remain a good test for algorithm development, with many algorithms now achieving over 98 percent accuracy on the task [97], [126], [127], [209], [210], [211], but few would propose digit recognition as a strength of event-based vision. More difficult tasks involve either more difficult objects, such as the Caltech-101 and Caltech-256 datasets (both of which are still considered easy by computer vision) or more difficult scenarios, such as recognition from on-board a moving vehicle [97]. Very few works tackle these tasks so far, and those that do typically fall back on generating event frames and processing them using a traditional deep learning framework.
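Two of the tensor representations mentioned under Algorithms above, event histogram frames and time surfaces with exponential decay, can be sketched as follows. The function names, the decay constant `tau` and the choice of ignoring polarity are assumptions for illustration; they are not the exact definitions used in [93] or [97].

```python
import math


def event_histogram(events, width, height):
    """Count the events (x, y, t, polarity) that fall on each pixel."""
    frame = [[0] * width for _ in range(height)]
    for x, y, _t, _polarity in events:
        frame[y][x] += 1
    return frame


def time_surface(events, width, height, t_ref, tau):
    """Exponentially decayed time surface evaluated at time t_ref.

    Each pixel stores exp(-(t_ref - t_last) / tau), where t_last is the
    timestamp of the most recent event at that pixel; pixels without
    events are set to 0.
    """
    last = [[None] * width for _ in range(height)]
    for x, y, t, _polarity in events:
        if t <= t_ref and (last[y][x] is None or t > last[y][x]):
            last[y][x] = t
    return [[0.0 if ts is None else math.exp(-(t_ref - ts) / tau) for ts in row]
            for row in last]


# Toy data (hypothetical): three events on a 2x2 sensor.
events = [(0, 0, 0.00, +1), (1, 0, 0.05, -1), (1, 0, 0.10, +1)]
histogram = event_histogram(events, width=2, height=2)
surface = time_surface(events, width=2, height=2, t_ref=0.10, tau=0.05)
print(histogram, surface)
```

Either output is a dense 2D array that can be stacked (e.g., per polarity or per time slice) and passed to an image-based model such as a CNN.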
A key challenge for recognition is that event cameras respond to relative motion in the scene (Section 2.3), and thus require either the object or the camera to be moving. It is therefore unlikely that event cameras will be a strong choice for recognizing static or slow-moving objects, although little has been done to combine the advantages of frame- and event-based cameras for these applications. The event-based appearance of an object is highly dependent on the above-mentioned relative motion (Fig. 5), thus tight control of the camera motion could be used to aid recognition [208].
Since the camera responds to dynamic signals, obvious applications include recognizing objects by the way they move [212], or recognizing dynamic scenes such as gestures or actions [16], [17]. These tasks are typically more challenging than static object recognition because they include a time dimension, but this is exactly where event cameras excel.
Opportunities. Event cameras exhibit many alluring properties, but event-based recognition has a long way to go if it is to compete with modern frame-based approaches. While it is important to compare event- and frame-based methods, one must remember that each sensor has its own strengths. The ideal acquisition scenario for a frame-based sensor consists of both the sensor and object being static, which is the worst possible scenario for event cameras. For event-based recognition to find widespread adoption, it will need to find applications which play to its strengths. Such applications are unlikely to be similar to well-established computer vision recognition tasks which play to the frame-based sensor's strengths. Instead, such applications are likely to involve resource-constrained recognition of dynamic sequences, or recognition from on-board a moving platform. Finding and demonstrating the use of event-based sensors in such applications remains an open challenge.
Although event-based datasets have improved in quality in recent years, there is still room for improvement. Data collection and annotation is a tiresome and thankless task, but developing an easy-to-use pipeline for collecting and annotating event-based data would be a significant contribution to the field, especially if the tools can mature to the stage where the task can be outsourced to non-experts.

Neuromorphic Control
In living creatures, most information processing happens through spike-based representation: spikes encode the sensory data; spikes perform the computation; and spikes transmit actuator "commands". Therefore, biology shows that the event-based paradigm is, in principle, applicable not just to perception and inference, but also to control.
Neuromorphic-Vision-Driven Control Architecture. In this type of architecture (Fig. 12), there is a neuromorphic sensor, an event-based estimator, and a traditional controller. The estimator computes a state, and the controller computes the control based on the provided state. The controller is not aware of the asynchronicity of the architecture.
Neuromorphic-vision-driven control architectures have been demonstrated since the early days of neuromorphic cameras, and they have demonstrated the two advantages of low latency and computational efficiency. The earliest demonstrators were the spike-based convolutional target tracking demo in the CAVIAR project [53] and the "robot goalie" described in [6], [12]. Another early example was the pencil-balancing robot [54]. In that demonstrator, two DVSs observed a pencil acting as an inverted pendulum placed on a small movable cart. The pencil's state in 3D was estimated with a latency below 1 ms. A simple hand-tuned PID controller kept the pencil balanced upright. It was also demonstrated on an embedded system, thereby establishing the ability to run on severely constrained computing resources.
Event-Based Control Theory. Event-based techniques can be motivated from the perspective of control and decision theory. Using a biological metaphor, event-based control can be understood as a form of what economics calls rational inattention [213]: more information allows for better decisions, but if there are costs associated with obtaining or processing the information, it is rational to take decisions with only partial information available.
Fig. 12. Control architectures based on neuromorphic events. In a neuromorphic-vision-driven control architecture (a), a neuromorphic sensor produces events, an event-based perception system produces state estimates, and a traditional controller is called asynchronously to compute the control signal. In a native neuromorphic-based architecture (b), the events directly generate changes in control. Finally, (c) shows an architecture in which the task informs the events that are generated.
In event-based control, the control signal is changed asynchronously [214]. There are several variations of the concept depending on how the "control events" are generated. One important distinction is between event-triggered control and self-triggered control [215]. In event-triggered control, the events are generated "exogenously" based on a certain condition; for example, a "recompute control" request might be triggered when the trajectory's tracking error exceeds a threshold. In self-triggered control, the controller decides by itself when it should next be called, based on the situation. For example, a controller might decide to "sleep" for longer if the state is near the target, or to recompute the control signal sooner if required.
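To make the event-triggered idea concrete, the following toy simulation compares a controller that recomputes its output at every time step with one that recomputes only when the state has drifted from the value used at the last update by more than a threshold. The scalar plant, the gains, the threshold and all function names are assumptions chosen for illustration; this is a sketch of the general principle, not a method from the cited literature.

```python
def simulate(trigger, a=0.5, k=2.0, x0=1.0, dt=1e-3, duration=5.0):
    """Euler-integrate dx/dt = a*x + u with a held control u = -k*x_held.

    `trigger(x, x_held)` decides at each step whether the control is recomputed.
    Returns (final state, number of control computations).
    """
    x = x0
    x_held = x0          # state used for the most recent control computation
    u = -k * x_held
    updates = 1          # the initial computation counts as one update
    for _ in range(int(round(duration / dt))):
        if trigger(x, x_held):
            x_held = x
            u = -k * x_held
            updates += 1
        x += dt * (a * x + u)
    return x, updates


def periodic(x, x_held):
    """Recompute the control at every simulation step."""
    return True


def event_triggered(threshold):
    """Recompute only when the state has drifted more than `threshold`."""
    def trigger(x, x_held):
        return abs(x - x_held) > threshold
    return trigger


x_periodic, n_periodic = simulate(periodic)
x_event, n_event = simulate(event_triggered(0.05))
print(f"periodic:        |x| = {abs(x_periodic):.4f} after {n_periodic} updates")
print(f"event-triggered: |x| = {abs(x_event):.4f} after {n_event} updates")
```

In this toy setting the event-triggered controller keeps the state within a small neighborhood of the target while computing the control far less often. The size of that neighborhood grows with the threshold, which is one simple instance of the trade-off between computation and control performance.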
The advantages of event-based control are usually justified considering a trade-off between computation/communication cost and control performance. The basic consideration is that, while the best control performance is obtained by recomputing the control infinitely often (for an infinite cost), there are strongly diminishing returns. A solid principle of control theory is that the control frequency depends on the time constant of the plant and the sensor: it does not make sense to change the control much quicker than the new incoming information or the speed of the actuators. This motivates choosing control frequencies that are comparable with the plant dynamics and adapt to the situation. For example, one can show that an event-triggered controller achieves the same performance with a fraction of the computation; or, conversely, a better performance with the same amount of computation. In some cases (scalar linear Gaussian) these trade-offs can be obtained in closed form [216], [217]. (Analogously, certain trade-offs can be obtained in closed form for perception [218].)
Unfortunately, the large literature in event-based control is of restricted utility for the embodied neuromorphic setting. Beyond the superficial similarity of dealing with "events", the settings are quite different. For example, in network-based control, one deals with typically low-dimensional states and occasional events; the focus is on making the most of each single event. By contrast, for an autonomous vehicle equipped with event cameras, the problem is typically how to find useful signals in potentially millions of events per second. Particularizing the event-based control theory to the neuromorphic case is a relatively young avenue of research [219], [220], [221], [222]. The challenges lie in handling the non-linearities typical of the vision modality, which prevent clean closed-form results.
Open Questions in Neuromorphic Control. Finally, we describe some open problems in this topic.
Task-Driven Sensing. In animals, perception has value because it is followed by action, and the information collected is actionable information that helps with the task. A significant advance would be the ability for a controller to modulate the sensing process based on the task and the context. In current hardware there is limited software-modulated control of the sensing process, though it is possible to modulate some of the hardware biases. Integration with region-of-interest mechanisms, heterogeneous camera bias settings, etc. would provide additional flexibility and more computationally efficient control.
Thinking Fast and Slow. Existing research has focused on obtaining low-latency control, but there has been little work on how to integrate this sensorimotor level into the rest of an agent's cognitive architecture. Using again a bio-inspired metaphor, and following Kahneman [223], the fast/instinctive/"emotional" system must be integrated with the slower/deliberative system.

Neuromorphic Computing
Neuromorphic engineering tries to capture some of the unparalleled computational power and efficiency of the brain by mimicking its structure and function. Typically this results in a massively parallel hardware accelerator for SNNs (Section 3.3), which is how we will define a neuromorphic processor. Since the neuron spikes within such a processor are inherently asynchronous, a neuromorphic processor is the best computational partner for an event camera. Neuromorphic processors act on the events injected by the event camera directly, without conversion, and offer better data-processing locality (spatially and temporally) than standard architectures such as CPUs, yielding low-power and low-latency computer vision systems.
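The asynchronous, spike-driven computation described above can be illustrated with a leaky integrate-and-fire (LIF) neuron, a common simplified spiking neuron model. The sketch below is a software toy with hypothetical parameter values (time constant, threshold, input weight); it does not reproduce the neuron implementation of any of the processors discussed in this section.

```python
import math


class LIFNeuron:
    """Event-driven leaky integrate-and-fire neuron (illustrative toy model)."""

    def __init__(self, tau=0.02, threshold=1.0):
        self.tau = tau              # membrane time constant in seconds (assumed)
        self.threshold = threshold  # firing threshold (assumed)
        self.v = 0.0                # membrane potential
        self.t_last = 0.0           # time of the last processed input

    def receive(self, t, weight):
        """Process one input spike at time t; return True if the neuron fires."""
        # The state is only touched when an input arrives: decay since last input.
        self.v *= math.exp(-(t - self.t_last) / self.tau)
        self.t_last = t
        self.v += weight
        if self.v >= self.threshold:
            self.v = 0.0            # reset after an output spike
            return True
        return False


# Three inputs 1 ms apart accumulate and trigger an output spike ...
dense = LIFNeuron()
dense_out = [dense.receive(t, 0.4) for t in (0.000, 0.001, 0.002)]

# ... whereas the same inputs 100 ms apart leak away between arrivals.
sparse = LIFNeuron()
sparse_out = [sparse.receive(t, 0.4) for t in (0.0, 0.1, 0.2)]

print(dense_out, sparse_out)
```

No computation happens between inputs, so the cost scales with the number of events rather than with a fixed clock rate. This is the property that makes such models a natural match for the output of an event camera.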
Neuromorphic processors may be categorized by their neuron model implementations (Table 4), which are broadly divided between analog neurons (Neurogrid, BrainScaleS, ROLLS, DYNAP-se), digital neurons (TrueNorth, Loihi, ODIN) and software neurons (SpiNNaker). Some architectures also support on-chip learning (Loihi, ODIN, DYNAP-le). When evaluating a neuromorphic processor for an event-based vision system, the following criteria should be considered in addition to the processor's functionality and performance: (i) the software development ecosystem: a minimal toolchain includes an API to compose and train a network, a compiler to prepare the network for the hardware, and a runtime library to deploy the network in hardware, (ii) event-based vision systems typically require that a processor be available as a standalone system suitable for mobile applications, and not just hosted in a remote server, (iii) the availability of neuromorphic processors.
Several developments are necessary to enable a more widespread use of these processors, such as: (i) developing a more user-friendly ecosystem (an easier way to program the desired method for deployment in hardware), (ii) enabling more processing capabilities of the hardware platform, (iii) increasing the availability of devices beyond early access programs targeted at selected partners.
The following processors (Table 4) have the most mature developer workflows, combined with the widest availability of standalone systems. More details are given in [229], [230].
SpiNNaker (Spiking Neural Network Architecture) uses general-purpose ARM cores to simulate biologically realistic models of the human brain [231]. SpiNNaker implements neurons as software running on the cores, sacrificing hardware acceleration to maximize model flexibility. SpiNNaker has been coupled with event cameras for stereo depth estimation [149], [232], optic flow computation [232], [233], and for object tracking [234] and recognition [235].
TrueNorth uses digital neurons to perform real-time inference. Each chip simulates 1 M (million) neurons and 256 M synapses, distributed among 4,096 neurosynaptic cores. There is no on-chip learning, so networks are trained offline using a GPU or other processor [236]. TrueNorth has been paired with event cameras to produce end-to-end, low-power and low-latency event-based vision systems for gesture recognition [17], stereo reconstruction [190] and optical flow estimation [174].
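As a quick sanity check of the TrueNorth figures quoted above, the per-core resources follow from simple division. The sketch assumes that "M" is read as 2^20 (a binary mega), which makes the quotients come out as whole numbers; this reading is an assumption of the example and is not stated in the text.

```python
# Assumption: "1 M" and "256 M" are interpreted as multiples of 2**20.
NEURONS_PER_CHIP = 1 * 2**20
SYNAPSES_PER_CHIP = 256 * 2**20
CORES_PER_CHIP = 4096

neurons_per_core = NEURONS_PER_CHIP // CORES_PER_CHIP
synapses_per_core = SYNAPSES_PER_CHIP // CORES_PER_CHIP
synapses_per_neuron = SYNAPSES_PER_CHIP // NEURONS_PER_CHIP

print(neurons_per_core, synapses_per_core, synapses_per_neuron)
```

Under this reading, each core hosts 256 neurons and 65,536 synapses, i.e., 256 synapses per neuron, which is what a 256 x 256 grid of connections per core would provide.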
Loihi uses digital neurons to perform real-time inference and online learning. Each chip simulates up to 131 thousand spiking neurons and 130 M synapses. A learning engine in each neuromorphic core updates each synapse using rules that include STDP and reinforcement learning [226]. Non-spiking networks can be trained in TensorFlow and approximated by spiking networks for Loihi using the Nengo Deep Learning toolkit from Applied Brain Research [237].
DYNAP. The Dynamic Neuromorphic Asynchronous Processor has two variants, one optimized for scalable inference (Dynap-se), and another for online learning (Dynap-le).
Braindrop prototypes a single core of the 1 M-neuron Brainstorm system [228]. It is programmed using Nengo [238] and implements the Neural Engineering Framework [239].

Applications in Real-Time On-Board Robotics
As event-based vision sensors often produce significantly less data per time interval compared to traditional cameras, multiple applications can be envisioned where extracting relevant vision information can happen in real-time within a simple computing system directly connected to the sensor, avoiding a USB connection. Fig. 13 shows an example of such a system, where a dual-core ARM microcontroller running at 200 MHz with 136 kB on-board SRAM fetches and processes events in real-time. The combined embedded system of sensor and microcontroller operates a simple wheeled robot in tasks such as line following, active and passive object tracking, distance estimation, and simple mapping [240].
A different example of near-sensor processing ("edge computing") is the Speck SoC (footnote 9), which combines a DVS and the Dynap-se neuromorphic CNN processor. Its peak power consumption is less than 1 mW and its latency is less than 30 ms. Application domains are low-power, continuous object detection, surveillance, and automotive systems.
Event cameras have also been used on-board quadrotors with limited computational resources, both for autonomous landing [241] and for flight [27] (Fig. 13b), in challenging scenes.

DISCUSSION
Event-based vision is a topic that spans many fields, such as computer vision, robotics and neuromorphic engineering. Each community focuses on exploiting different advantages of the event-based paradigm. Some focus on the low power consumption for "always on" or embedded applications on resource-constrained platforms; others favor low latency to enable highly reactive systems, and others prefer the availability of information to better perceive the environment (high temporal resolution and HDR), with fewer constraints on computational resources.
Event-based vision is an emerging technology in the era of mature frame-based camera hardware and software. Comparisons are, in some respects, unfair since they are not carried out under the same maturity level. Nevertheless, event cameras show potential: they are able to overcome some of the limitations of frame-based cameras and to reach scenarios that were previously inaccessible. There is considerable room for improvement (research and development), as pointed out in numerous opportunities throughout the paper.
There is no agreement on what the best method (and representation) to process events is, notably because it depends on the application. There are different trade-offs involved, such as latency versus power consumption and accuracy, or sensitivity versus bandwidth and processing capacity. For example, reducing the contrast threshold and/or increasing the resolution produces more events, which will be processed by an algorithm and platform with finite capacity. A challenging research area is to quantify such trade-offs and to develop techniques to dynamically adjust the sensor and/or algorithm parameters for optimal performance.
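The dependence of the event count on the contrast threshold can be illustrated with an idealized single-pixel model, in which an event is emitted each time the log-brightness has moved by one threshold step from a stored reference level. The function name, the synthetic ramp signal and the threshold values below are assumptions chosen for this sketch; real sensors add noise, refractory effects and bandwidth limits that are ignored here.

```python
def generate_events(log_brightness, contrast_threshold):
    """Idealized event generation for one pixel.

    Emits (sample_index, polarity) whenever the log-brightness differs from
    the stored reference level by at least `contrast_threshold`.
    """
    events = []
    reference = log_brightness[0]
    for i, value in enumerate(log_brightness):
        # A large change between samples may produce several events at once.
        while abs(value - reference) >= contrast_threshold:
            polarity = 1 if value > reference else -1
            reference += polarity * contrast_threshold
            events.append((i, polarity))
    return events


# Synthetic signal: log-brightness rising linearly from 0 to 1.
ramp = [i / 1000 for i in range(1001)]
counts = {c: len(generate_events(ramp, c)) for c in (0.15, 0.3)}
print(counts)
```

For this signal, halving the threshold doubles the number of events produced by the pixel. Multiplied over all pixels of a higher-resolution sensor, this is the growth in data rate that the processing algorithm and platform must be able to absorb.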
Another big challenge is to develop bio-inspired systems that are natively event-based end-to-end (from perception to control and actuation) and that are also more efficient, long-term solutions than synchronous, frame-based systems. Event cameras pose the challenge of rethinking perception, control and actuation, and, in particular, the current mainstream of deep learning methods in computer vision: adapting them or transferring ideas to process events while remaining top-performing. Active vision (pairing perception and control) is especially relevant for event cameras because the events depend distinctly on motion, which may be due to the actuation of a robot.
Event cameras can be seen as an entry point for more efficient, near-sensor processing, such that only high-level, non-redundant information is transmitted, thus reducing bandwidth, latency and power consumption. This could be done by pairing an event camera with hardware on the same sensor device (Speck in Section 5.2), or by alternative bio-inspired imaging sensors, such as cellular processor arrays [242], in which every pixel has a processor that can perform several types of computations with the brightness of the pixel and its neighbors.
9. https://www.speck.ai/

CONCLUSION
Event cameras are revolutionary sensors that offer many advantages over traditional, frame-based cameras, such as low latency, low power, high speed and high dynamic range. Hence, they have a large potential for computer vision and robotic applications in challenging scenarios currently inaccessible to traditional cameras. We have provided an overview of the field of event-based vision, covering perception, computing and control, with a focus on the working principle of event cameras and the algorithms developed to unlock their outstanding properties in selected applications, from low-level vision to high-level vision. Neuromorphic perception and control are emerging topics, so there are plenty of opportunities, as we have pointed out throughout the text. Many challenges remain ahead, and we hope that this paper provides an introductory exposition of the topic, as a step in humanity's longstanding quest to build intelligent machines endowed with a more efficient, bio-inspired way of perceiving and interacting with the world.

Fig. 1. Summary of the DAVIS camera [4], comprising an event-based dynamic vision sensor (DVS [2]) and a frame-based active pixel sensor (APS) in the same pixel array, sharing the same photodiode in each pixel. (a) Simplified circuit diagram of the DAVIS pixel (DVS pixel in red, APS pixel in blue). (b) Schematic of the operation of a DVS pixel, converting light into events. (c)-(d) Pictures of the DAVIS chip and USB camera. (e) A white square on a rotating black disk viewed by the DAVIS produces grayscale frames and a spiral of events in space-time. Events in space-time are color-coded, from green (past) to red (present). (f) Frame and overlaid events of a natural scene; the frames lag behind the low-latency events (colored according to polarity). Images adapted from [4], [35]. A more in-depth comparison of the DVS, DAVIS, and ATIS pixel designs can be found in [36].

Fig. 2. "Event transfer function" from a single DVS pixel in response to sinusoidal LED stimulation. The background events cause additional ON events at very low frequencies. The 60 fps camera curve shows the transfer function including aliasing from frequencies above the Nyquist frequency. Figure adapted from [2].

Fig. 3. Several event representations (Section 3.1) of the slider_depth sequence [81]. From left to right: events in space-time, colored according to polarity (positive in blue, negative in red). Event frame (brightness increment image ΔL(x)). Time surface with last timestamp per pixel (darker pixels indicate recent time), only for negative events. Interpolated voxel grid (240 × 180 × 10 voxels), colored according to polarity, from dark (negative) to bright (positive). Motion-compensated event image [82] (sharp edges obtained by event accumulation are darker than pixels with no events, in white). Reconstructed intensity image by [8]. Grid-like representations are compatible with conventional computer vision methods [83].

Fig. 4. Events in a space-time volume are converted into an interpolated voxel grid (left) that is fed to a DNN to compute optical flow and egomotion in an unsupervised manner [103]. Thus, modern tensor-based DNN architectures are re-utilized using novel loss functions (e.g., motion compensation) adapted to event data.

Fig. 5. The challenge of data association. Panels (a) and (b) show events from a scene (c) under two different motion directions: (a) diagonal and (b) up-down. Intensity increment images (a) and (b) are obtained by accumulating event polarities over a short time interval: pixels that do not change intensity are represented in gray, whereas pixels that increased or decreased intensity are represented in bright and dark, respectively. Clearly, it is not easy to establish event correspondences between (a) and (b) due to the changing appearance of the edge patterns in (c) with respect to the motion. Image adapted from [64].

Fig. 6. Two optical flow estimation examples. (a) and (b): indoor flying scene [175]. In (a), events (polarity shown in red/blue) are overlaid on a grayscale frame from a DAVIS. (b) shows the sparse optical flow (colored according to magnitude and direction) computed using [166] on brightness increment images. (c) A different scene: dense optical flow of a fidget spinner spinning at 750°/s in a dark environment [103]. Events enable the estimation of optical flow in challenging scenarios.

Fig. 11. Recognition of moving objects. (a) A DAVIS240C sensor with FPGA attached tracks and sends regions of events to IBM's TrueNorth NS1e evaluation platform for classification. Results on a street scene show red boxes around tracked and classified cars. (b) In [121], very high-speed object recognition (browsing a full deck of 52 cards in just 0.65 s) was illustrated with event-driven convolutional neural networks.

Fig. 13. (a) Embedded DVS128 on Pushbot as standalone closed-loop perception-computation-action system, used in navigation and obstacle-avoidance tasks [240]. (b) Drone with a down-looking DAVIS, used for autonomous flight [27]. The high speed and dynamic range of events are leveraged to operate in difficult illumination conditions. The same visual-inertial odometry algorithm [27] is also demonstrated on high-speed scenarios, such as an event camera spinning tied to a rope.
Guillermo Gallego (Senior Member, IEEE) received the PhD degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, Georgia, in 2011. He is currently an associate professor at the Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Berlin, Germany. From 2011 to 2014, he was a Marie Curie researcher with the Universidad Politecnica de Madrid, Spain, and from 2014 to 2019 he was a postdoctoral researcher with the University of Zurich, Switzerland.
Tobi Delbrück (Fellow, IEEE) received the BSc degree in physics from UC San Diego, San Diego, California, in 1986, and the PhD degree from Caltech, Pasadena, California, in 1993. He is a professor of physics and electrical engineering with the Institute of Neuroinformatics, ETH Zurich, Zurich, Switzerland, where he has been since 1998. His group with S.-C. Liu focuses on neuromorphic sensory processing and efficient deep learning.
Garrick Orchard received the PhD degree in electrical and computer engineering from Johns Hopkins University, Baltimore, Maryland, in 2012. He is currently a researcher with the Neuromorphic Computing Laboratory, Intel Labs, Santa Clara, California. From 2012 to 2019, he was senior research scientist with Temasek Laboratories and Singapore Institute for Neurotechnology, National University of Singapore.
Chiara Bartolozzi (Member, IEEE) received the degree in engineering from the University of Genova, Genoa, Italy, and the PhD degree in neuroinformatics from ETH Zurich, Zurich, Switzerland, developing analog subthreshold circuits for emulating biophysical neuronal properties onto silicon and modelling selective attention on hierarchical multi-chip systems. She is currently a researcher with the Istituto Italiano di Tecnologia (IIT), Italy. She leads the Neuromorphic Systems and Interfaces Group, IIT, with the aim of applying neuromorphic engineering to design autonomous robotic machines.
Brian Taba received the BS degree in electrical engineering from the California Institute of Technology, Pasadena, California, in 1999, and the PhD degree in bioengineering from the University of Pennsylvania, Philadelphia, Pennsylvania. He is currently a researcher with IBM, within the SyNAPSE Project.
Andrea Censi received the PhD degree in control & dynamical systems from the California Institute of Technology, Pasadena, California, in 2012. He is currently a deputy director for the Chair of Dynamic Systems and Control (Prof. Frazzoli) at the Institute for Dynamic Systems and Control, Department of Mechanical and Process Engineering, ETH Zürich. From 2013 to 2017, he was a postdoctoral researcher with the Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Stefan Leutenegger received the PhD degree in mechanical engineering from Autonomous Systems Lab, ETH Zurich, Zurich, Switzerland, in 2014. He is currently a senior lecturer in robotics at the Department of Computing, Imperial College London, U.K. Since 2014, he has led the Smart Robotics Lab, Imperial College London, and co-leads research with the Dyson Robotics Lab together with Prof. A. Davison. He is a cofounder of the startup SLAMcore.
Andrew J. Davison is currently a professor of robot vision and director of the Dyson Robotics Laboratory, Imperial College London. His research focus is on SLAM and its evolution towards general "Spatial AI." He has also had strong involvement in taking this technology into real applications, in particular through his work with Dyson and as co-founder of SLAMcore. He was elected fellow of the Royal Academy of Engineering, in 2017.
Jörg Conradt (Senior Member, IEEE) received the PhD degree in physics/neuroscience from ETH Zurich, Zurich, Switzerland. He is currently an associate professor at the School of Electrical Engineering and Computer Science, KTH, Stockholm, Sweden. Before joining KTH, he was W1 professor with the Technische Universität München, Germany. He was the founding director of the Elite Master Program NeuroEngineering, Technische Universität München.
Kostas Daniilidis (Fellow, IEEE) received the PhD degree in computer science from the University of Karlsruhe, Karlsruhe, Germany, in 1992. He is currently the Ruth Yalom Stone professor of computer and information science with the University of Pennsylvania, where he has been faculty since 1998. He was the director of the interdisciplinary GRASP Laboratory from 2008 to 2013, associate dean for graduate education from 2012 to 2016, and director of online learning since 2016. His main interests include deep learning of 3D representations, data association, event-based cameras, semantic localization and mapping, and vision-based manipulation.
Davide Scaramuzza (Senior Member, IEEE) received the PhD degree in robotics and computer vision from ETH Zürich, Zürich, Switzerland, in 2008. He is currently an associate professor of robotics and perception at the University of Zürich, Switzerland, where he does research on autonomous, vision-based navigation of mini drones and event cameras. For his research contributions, he received a European Research Council (ERC) Grant, the IEEE Robotics and Automation Early Career Award, and several industry and paper awards.

TABLE 1
Comparison of Commercial or Prototype Event Cameras
Values are approximate since there is no standard measurement testbed.

TABLE 2
Classification of Several Optical Flow Methods According to Their Output and Design

TABLE 3
Event-Based Methods for Pose Tracking and/or Mapping With an Event Camera

TABLE 4
Comparison Between Selected Neuromorphic Processors, Ordered by Neuron Model Type