Recursive Contrast Maximization for Event-Based High-Frequency Motion Estimation

Achieving high-frequency motion estimation with a fast-moving camera is an important task in the field of computer vision. Contrast Maximization (CMax), a method of motion estimation using an event camera, is the de-facto standard. However, CMax requires processing a large number of events at each estimation, which is computationally expensive and makes high-frequency estimation difficult. Specifically, past events that have already been used once for estimation must be evaluated again. In this paper, we propose "Recursive Contrast Maximization (R-CMax)" to estimate motion at high frequencies. The proposed method approximates multiple events by two "compressed events" using estimated trajectories of events from the previous time step, which can be updated recursively. By using a small number of "compressed events," motion estimation can be updated efficiently. Comparing R-CMax with CMax and its extensions, we experimentally show that R-CMax can perform motion estimation with a fraction of the computational complexity while maintaining comparable accuracy.


I. INTRODUCTION
A. BACKGROUND
High-frequency motion estimation is necessary for the robust tracking of fast-moving camera poses; for example, in augmented reality and automated driving, the poses of a fast-moving camera must be robustly tracked. Since a fast-moving camera changes its pose in a short period of time, robust pose tracking requires high-frequency motion estimation. However, the limited frame rate of conventional cameras restricts high-frequency motion estimation.
One of the de-facto standard methods for motion estimation with event cameras is called Contrast Maximization (CMax), which was proposed by Gallego et al. [16], [17]. CMax creates an Image of Warped Events (IWE) by warping batches of events according to candidate model parameters and accumulating them on an image plane. Motion can then be estimated by maximizing the contrast of the IWE: when events are warped by the correct parameters, they are aligned along the edges, producing a high-contrast IWE.
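To make this idea concrete, the sketch below (our illustration, not the authors' code) generates events from a single moving edge and checks that only the true velocity aligns the warped events, which is precisely what produces a high-contrast IWE:

```python
import numpy as np

# Illustrative sketch: events from a single edge moving at v = 3 px per unit time.
# Warping with the true velocity collapses them onto the edge; a wrong velocity
# leaves them spread out, i.e. a blurred, low-contrast IWE.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, 1000)   # event timestamps
x = 20.0 + 3.0 * t                # event x-coordinates along the moving edge

def warped_spread(x, t, v):
    """Spread of warped x-coordinates x' = x - t * v; small spread = sharp IWE."""
    return np.std(x - t * v)

assert warped_spread(x, t, 3.0) < 1e-9   # correct velocity: perfectly aligned
assert warped_spread(x, t, 0.0) > 0.5    # wrong velocity: events smeared over ~3 px
```

The spread of warped coordinates is a stand-in here for the (inverse of the) IWE contrast that CMax actually maximizes.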

B. MOTIVATION
The problem is that high-frequency motion estimation by CMax is difficult due to its computational complexity. As shown in Fig. 1, many events overlap between successive sliding windows when CMax is used for high-frequency motion estimation. Despite these many overlapping events, CMax needs to recalculate the IWE from scratch. Since creating the IWE requires computing tens of thousands of events in a short time, the computational complexity increases further.

FIGURE 1. Overview of R-CMax in high-frequency motion estimation in comparison to CMax. The top row shows how events are warped for each method, while the blue box is a sliding window that indicates the events used by CMax. The bottom row shows how the IWE changes before and after optimization; each graph's color corresponds to the event's color, and red represents the IWE. Left: Overview of CMax, which calculates the IWE using all events in the sliding window. CMax needs to recalculate the IWE using all events in the window at the next estimation. This increases the computational complexity because many events overlap and need to be processed many times. Right: Overview of R-CMax. After creating compressed events, R-CMax calculates the IWE using only three events: two compressed events and one new event. In the next estimation, the compressed events are reused to compute the IWE. Thus, motion can be estimated recursively at high frequencies with reduced computational complexity.

C. CONTRIBUTIONS
To solve this problem, we propose recursive maximization of contrast, called ''Recursive Contrast Maximization (R-CMax).'' R-CMax enables high-frequency motion estimation with reduced computational complexity through recursive updates. To realize a recursive update, we propose summarizing a large number of past events using a small number of ''compressed events.'' As shown in Fig. 1, hundreds of spatiotemporal event trajectories are approximated using only two events. This approximation enables recursive maximization of contrast using only the compressed events and newly observed events, thereby reducing computational complexity.
To demonstrate the effectiveness of the proposed method in terms of accuracy and computational complexity, we compared it to CMax and its extensions on both simulated and real data. We demonstrated that R-CMax could perform motion estimation with only a fraction of the computational complexity of CMax while maintaining comparable accuracy.
In summary, our contributions are:
• Proposal of R-CMax, which enables high-frequency motion estimation with reduced computational complexity through recursive updates (Section III).
• Experiments comparing R-CMax with CMax and its extensions to demonstrate its effectiveness (Section IV).

II. RELATED WORKS
This section presents research on motion estimation using event cameras. The high temporal resolution and low latency of event cameras, which conventional cameras lack, make them suitable for motion estimation in situations that were previously impossible.
Many motion estimation studies have combined event cameras with external sensors. Censi and Scaramuzza [18] automatically calibrate the event camera and a conventional camera spatiotemporally and, each time an event is observed, update the relative pose of the event camera with respect to the last frame. Weikersdorfer et al. [19] use a frame-based RGB-D camera attached to an event camera to perform depth estimation and create voxel maps for motion estimation. Kueng et al. [20] use intensity images from conventional cameras to detect feature points and track them asynchronously, using events to perform motion estimation. Gallego et al. [21] perform event-by-event six-degrees-of-freedom (DoF) motion estimation from existing depth maps with intensity values. These methods allow high-frequency motion estimation for each observed event, but they require information from external sensors in addition to event data.
Motion estimation methods using only event cameras have also been studied. Weikersdorfer et al. [22], [23] use an upward-mounted event camera to capture events from the ceiling and use a particle filter to estimate the position of the camera moving parallel to the ceiling. Kim et al. [24] proposed a method for jointly estimating the camera's six degrees of freedom, depth to keyframes, and scene intensity gradients using a probabilistic filter. Rebecq et al. [25] perform motion estimation by tracking based on image-to-model alignment and 3D reconstruction using events in parallel. More recently, there have been successful methods for optical flow, ego-motion, and depth estimation by accumulating observed events into images and using them to train deep learning methods [26], [27], [28], [29]. Although these methods allow motion estimation using only event data, they cannot take advantage of the low latency of event cameras, as they cannot perform event-by-event, high-frequency motion estimation for each observed event.

VOLUME 10, 2022

FIGURE 2. System flow of R-CMax. As an initialization process, CMax is performed with the accumulated events to create a map and obtain initial ''compressed events'' from the map. The compressed events allow recursively maximizing the contrast of the IWE using only compressed and new events and estimating motion at a high frequency with a low computational cost. The map and compressed events are recursively updated based on the motion estimation results.
CMax by Gallego et al. [16], which is the most closely related to our method, is the de-facto standard for motion estimation using only event cameras. One characteristic of CMax is that it is a unified framework that can estimate various motions, such as translation and rotation, provided that the motion to be estimated is modeled. In addition, Gallego et al. [17] and Stoffregen et al. [30] have proposed and validated various loss functions other than contrast. Nunes and Demiris proposed a method based on CMax called Entropy Minimization (EMin) [31]. This method allows motion estimation directly from warped events by introducing an entropy-based measure, whereas CMax requires events to be converted into an image-based intermediate representation, the IWE. Nunes and Demiris also proposed Dispersion Minimization (DMin) [32] as a further extension of EMin. In addition to not relying on intermediate representations, this method allows incremental estimation of model parameters. However, since recalculating past events is still unavoidable, it does not reduce the computational complexity of high-frequency motion estimation.

III. RECURSIVE CONTRAST MAXIMIZATION

A. PRELIMINARY

1) EVENT DATA REPRESENTATION
This section describes the data obtained from an event camera. An event camera asynchronously outputs the intensity change of each pixel, called an event. Suppose there exists a set of events E = {e_k} in the time window [0, T], where e_k = (x_k, t_k, p_k). Here, x_k is the coordinate where the event is triggered, t_k is the timestamp at which the event occurs, with microsecond precision, and p_k ∈ {−1, +1} is the polarity of the event, which indicates a brighter or darker change when the logarithmic intensity changes beyond a threshold at a certain pixel x = (x, y).
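A minimal container for such events might look as follows; this class is purely illustrative and does not correspond to any particular camera SDK type:

```python
from typing import NamedTuple

class Event(NamedTuple):
    """One event e_k = (x_k, t_k, p_k): pixel coordinates, timestamp, polarity."""
    x: float  # pixel column of x_k
    y: float  # pixel row of x_k
    t: float  # timestamp in seconds (microsecond precision)
    p: int    # polarity: +1 for a brighter change, -1 for a darker change

e = Event(x=120.0, y=45.0, t=0.003214, p=+1)
assert e.p in (-1, +1)
```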

2) CONTRAST MAXIMIZATION
Since CMax is the basis for our proposed method, we briefly describe it below. Let E = {e_k}_{k=1}^{N_e} be a group of accumulated observed events, where N_e is the number of accumulated events, and let θ be the motion we want to estimate. We can define a function W that warps the events along their spatiotemporal trajectory as follows:

x'_k = W(x_k, t_k; θ).    (2)

In CMax, an intermediate representation called the IWE is introduced, which accumulates the events warped by (2) onto the image plane and is defined as follows:

I(x; θ) = Σ_{k=1}^{N_e} N(x; x'_k, σ²),    (3)

where N is a Gaussian. When the IWE is created with the optimal parameter θ*, its contrast is locally maximal. The contrast of the IWE can be calculated by taking the variance of the IWE:

Var(I(θ)) = (1/N_p) Σ_{m,n} (i_{mn} − μ_I)²,    (4)

where N_p is the number of pixels in the IWE, I = (i_{mn}) are the pixel values of the IWE, and μ_I = (1/N_p) Σ_{m,n} i_{mn} is the mean value of the IWE. Therefore, θ* is obtained by the following optimization:

θ* = argmax_θ Var(I(θ)).    (5)

The problem with CMax is that it is difficult to perform high-frequency motion estimation due to the increased computational complexity. As shown in Section I and Fig. 1, this problem occurs when attempting high-frequency motion estimation with CMax: despite the many overlapping events in the sliding window, CMax needs to recalculate the IWE. Therefore, avoiding the recalculation of past events reduces computational complexity.

FIGURE 3. Detail of IWE approximation by compressed events. Left: As an initialization process, CMax is performed with multiple events to obtain initial compressed events. Middle: Once CMax is performed, compressed events e_c with weights w = m can be created. Compressed events can then be reused for subsequent estimation, enabling recursive maximization of contrast. Right: Warping the compressed events with the previous estimation results allows the approximation of past events using two events. This approximation allows R-CMax to create the IWE using only three events: two compressed events and one new event.

Algorithm 1 R-CMax (fragment)
4: Warp compressed events using θ.
5: Perform CMax using warped compressed events and e_k to find θ*.
6: Update compressed events by updating the map M.
7: θ ← θ*
8: end for
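The optimization of the IWE variance over θ (warp the batch with a candidate velocity, accumulate the IWE, and take its variance as the contrast) can be sketched as a coarse grid search over candidate velocities. This is our illustration on synthetic data, not the authors' implementation; a practical version would use a continuous optimizer:

```python
import numpy as np

def contrast(xy, t, theta, shape=(48, 48)):
    """Variance of the IWE obtained by warping events with candidate velocity theta."""
    warped = xy - t[:, None] * np.asarray(theta, float)[None, :]
    iwe = np.zeros(shape)
    ix = np.round(warped).astype(int)
    ok = (ix[:, 0] >= 0) & (ix[:, 0] < shape[1]) & (ix[:, 1] >= 0) & (ix[:, 1] < shape[0])
    np.add.at(iwe, (ix[ok, 1], ix[ok, 0]), 1.0)  # scatter-add events into pixels
    return iwe.var()

# Synthetic batch: one vertical edge moving at (4, 0) px per unit time.
rng = np.random.default_rng(1)
t = rng.uniform(0.0, 1.0, 800)
y = rng.uniform(0.0, 47.0, 800)
xy = np.stack([10.0 + 4.0 * t, y], axis=1)

# theta* = argmax_theta Var(I(theta)) over a coarse candidate grid.
grid = [(vx, 0.0) for vx in range(0, 9)]
theta_star = max(grid, key=lambda th: contrast(xy, t, th))
assert theta_star[0] == 4  # the true velocity maximizes the contrast
```

Here events are accumulated at nearest pixels for brevity, rather than with the Gaussian kernel of the formal definition.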

B. PROPOSED METHOD
To solve this problem, we propose recursive maximization of contrast, called ''Recursive Contrast Maximization'' (R-CMax). The flow of our method is shown in Fig. 2 and Algorithm 1. To realize a recursive update, we propose ''compressed events,'' which approximate past events using previous motion estimates (Section III-B1, III-B2). Hundreds of spatiotemporal event trajectories can be represented using only two compressed events. With these compressed events, the IWE can be created from the compressed and newly observed events without recalculating past events. Additionally, our method updates the compressed events by maintaining a two-dimensional event map that is updated based on the estimation results (Section III-B3).
Note that the motion to be estimated in this paper is a two-dimensional motion parallel to a plane. Therefore, the parameter θ = (v_x, v_y)^T is common to all pixels of the event camera, and the motion model W is expressed as follows:

W(x_k, t_k; θ) = x_k − Δt_k θ,    (6)

where Δt_k = t_k − t_ref. The extensibility of this method is discussed in Section V.
1) RECURSIVE UPDATE

The recalculation for the past events E_old is the cause of the increase in the number of operations. Therefore, we propose to approximate (7) by using the compressed events E_c = {e_{r,c}}_{r=1}^{N_c}, which are created from the past events and the previous estimation results (8). Here, w_r denotes the weight of each compressed event, and x_{r,c} and x̄_{r,c} denote the coordinates of the compressed events that approximate the coordinates of past events. As shown in (7) and (8), we use two compressed events to approximate the past events. Since the number of compressed events is smaller than the number of past events (N_c < N_old), introducing compressed events significantly reduces computational complexity. By using (8), the IWE can be created using only compressed events and new events. The compressed events can then be reused to create the IWE for the following estimation, allowing for recursive maximization of contrast.
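The saving can be illustrated with a toy example (ours, simplified to a single compressed event per edge rather than the paper's two): accumulating one event with weight w_r reproduces the IWE contribution of all the past events it summarizes:

```python
import numpy as np

def iwe_from_events(coords, weights, shape=(32, 32)):
    """Accumulate (already warped) event coordinates into an IWE, one weight per event."""
    iwe = np.zeros(shape)
    ix = np.round(coords).astype(int)
    np.add.at(iwe, (ix[:, 1], ix[:, 0]), weights)
    return iwe

# 300 past events that all warp onto the edge pixel (12, 7) under the previous estimate.
past = np.tile([[12.0, 7.0]], (300, 1))
iwe_full = iwe_from_events(past, np.ones(300))

# One compressed event with weight w_r = 300 stands in for all of them.
compressed = np.array([[12.0, 7.0]])
iwe_compressed = iwe_from_events(compressed, np.array([300.0]))

assert np.allclose(iwe_full, iwe_compressed)  # same IWE from 1 event instead of 300
```

The per-estimation cost thus scales with N_c plus the number of new events, not with N_old.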

2) COMPRESSED EVENT
This section describes compressed events used for recursive updates. The concept of compressed events is shown in Fig. 3.
Performing CMax can be interpreted as associating events that originate from the same edge. Therefore, performing CMax integrates multiple events from the same edge into a single event. For events originating from the same edge r, a compressed event is created as follows:

e_{r,c} = (x_{r,c}, w_r),    (9)

where x_{r,c} represents the coordinates of the edge on the IWE, w_r = Σ_{i=1}^{N_r} p_i is the weight of the compressed event, and N_r is the number of events on r. In other words, a compressed event is an event whose weight equals the number of events on the same edge. The compressed events we introduce allow us to perform recursive optimization without recalculating past events. We assume that the motion can be approximated as constant-velocity linear motion for a short time and that the estimation results do not change significantly before and after a new event is observed. Under this assumption, the coordinates of past events can be approximated using compressed events as follows:

x_k ≈ W^{-1}(x_{r,c}, t_k; θ),    (10)

where x_{r,c} denotes the coordinates of the compressed event, θ is the θ* from the previous estimation, and W^{-1} is the inverse transform of the modeled motion (Fig. 3).
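A simplified sketch of both steps follows, with per-pixel grouping standing in for the paper's edge association and t_ref = 0 assumed in the constant-velocity warp (our illustration, not the authors' code):

```python
import numpy as np

def make_compressed_events(warped_xy, polarities):
    """Group warped events by pixel; each occupied pixel r yields a compressed event
    with coordinate x_r and weight w_r = sum of the polarities landing there."""
    pix = np.round(warped_xy).astype(int)
    keys, inv = np.unique(pix, axis=0, return_inverse=True)
    inv = inv.reshape(-1)
    weights = np.zeros(len(keys))
    np.add.at(weights, inv, polarities)
    return keys.astype(float), weights

# 100 events from one edge, all aligned near pixel (5, 9) after a correct warp.
warped = np.tile([[5.0, 9.0]], (100, 1)) + np.random.default_rng(2).normal(0, 0.05, (100, 2))
pol = np.ones(100)
coords, w = make_compressed_events(warped, pol)
assert len(coords) == 1 and w[0] == 100.0

# Past coordinates at time t_k can then be approximated by the inverse warp
# x_k ~ W^{-1}(x_r, t_k; theta) = x_r + t_k * theta  (constant velocity, t_ref = 0).
theta, t_k = np.array([3.0, 0.0]), 0.5
approx_past = coords[0] + t_k * theta
assert np.allclose(approx_past, [6.5, 9.0])
```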
Creating the IWE amounts to evaluating the overlap of warped events. To reduce the amount of computation, the formula for the IWE shown in (3) is in practice computed with the Gaussian evaluated only at the pixels surrounding each warped event, not at all IWE pixels. However, as shown in Fig. 1, the two compressed events can be so far apart that differentiation using a truncated Gaussian does not lead to the true value. On the other hand, increasing the Gaussian kernel size increases computational complexity. To mitigate this, we take advantage of the fact that the compressed events have already been associated. We do not need to compute all possible combinations of events, and the overlap of events can be evaluated analytically using the previous estimation results. Therefore, the computational complexity of IWE creation with our method is independent of the Gaussian kernel size. Since the number of target events is reduced and the per-event cost of IWE creation is comparable, the overall amount of computation is significantly reduced.
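Such an analytic evaluation rests on a standard identity: the overlap integral of two isotropic Gaussians centered at a and b is itself a Gaussian in the distance between a and b, with doubled variance, so no truncated kernel has to be rasterized. The numerical check below is our illustration of that identity, not the paper's derivation:

```python
import numpy as np

def analytic_overlap(a, b, sigma=1.0):
    """Closed form of the overlap integral of two 2-D isotropic Gaussians:
    integral N(x; a, s^2 I) N(x; b, s^2 I) dx = N(a; b, 2 s^2 I)."""
    d2 = np.sum((np.asarray(a) - np.asarray(b)) ** 2)
    return np.exp(-d2 / (4.0 * sigma ** 2)) / (4.0 * np.pi * sigma ** 2)

def numeric_overlap(a, b, sigma=1.0, half=8.0, n=400):
    """Brute-force the same integral on a dense grid, for verification only."""
    xs = np.linspace(-half, half, n)
    X, Y = np.meshgrid(xs, xs)
    def g(c):
        return np.exp(-((X - c[0]) ** 2 + (Y - c[1]) ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    dx = xs[1] - xs[0]
    return np.sum(g(a) * g(b)) * dx * dx

a, b = (0.0, 0.0), (1.5, -0.5)
assert abs(analytic_overlap(a, b) - numeric_overlap(a, b)) < 1e-4
```

Because the closed form depends only on the distance between event positions, its cost does not grow with the kernel's spatial support.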

3) EVENT MAPPING
As the event camera moves, new compressed events must be added when new objects are captured and deleted when they are no longer required. To update the compressed events, we maintain a two-dimensional event map M simultaneously with the motion estimation. Given the previous motion estimation results, the observed events can be warped to the map coordinates by (11), where I_n is the IWE made from the new events at the n-th estimation and s_n = Σ_{i=1}^{n} θ_i is the camera's travel distance calculated from the previous estimation results. The constant µ is determined by (12), using the number of events at initialization, N_0, and the number of new events, N_e. The map is updated so that recently used compressed events have larger weights and compressed events that have not been observed for a while disappear. This ensures that compressed events with newer observation times have a larger impact on the estimation, while compressed events that have not been observed for a while have a smaller impact. To achieve this, we apply the constant µ < 1 to the 2-D map at each motion estimation. M can thus be updated recursively by using (11) and (12), and the compressed events are updated from the map accordingly. To avoid including noise in the compressed events, a threshold σ is prepared, and only pixels on the map with M(x) > σ are used as compressed events.
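A minimal sketch of this update scheme (our illustration; the decay factor, threshold, and map size are arbitrary choices): the map is multiplied by µ < 1 at every estimation, new events are added, and only pixels above the threshold σ survive as compressed events:

```python
import numpy as np

def update_map(M, iwe_new, mu=0.9):
    """Recursive map update: decay old evidence by mu < 1, then add the IWE of new
    events. Recently observed structure keeps a large weight; stale structure fades."""
    return mu * M + iwe_new

def extract_compressed_events(M, sigma=1.0):
    """Pixels whose accumulated weight exceeds the threshold become compressed events."""
    ys, xs = np.nonzero(M > sigma)
    return np.stack([xs, ys], axis=1).astype(float), M[ys, xs]

M = np.zeros((16, 16))
edge = np.zeros_like(M); edge[4, 2:14] = 1.0    # an edge observed at every step
noise = np.zeros_like(M); noise[10, 10] = 1.0   # a pixel that fired only once

M = update_map(M, edge + noise)   # step 1: edge and noise both enter the map
for _ in range(20):
    M = update_map(M, edge)       # later steps: only the edge keeps being observed

coords, w = extract_compressed_events(M, sigma=1.0)
assert all(y == 4 for _, y in coords)   # the decayed noise pixel fell below threshold
```

The decay is what makes randomly firing noise pixels drop out of the compressed-event set while persistent edges accumulate weight.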

IV. EXPERIMENTAL EVALUATION
In this section, we present the experiments and discussions conducted to demonstrate the effectiveness of our method. The novelty of our method is that it recursively maximizes the contrast of the IWE and reduces computational complexity by approximating past events. We conducted the following three experiments on simulated and real data to demonstrate that our method reduces computational complexity while remaining comparable in accuracy to existing methods. In the first experiment, we applied R-CMax to motion estimation on simulation data created with ESIM [33]. The experimental results were compared by applying the same data to the original CMax, Entropy Minimization (EMin) [31], and Dispersion Minimization (DMin, on-line mode) [32]. In the second experiment, QR codes were captured by an actual event camera. We estimated the event camera's motion to show that this method can be applied to real event camera data containing noise. Experimental results were compared to DMin and to ARKit4 on an iPhone 13 Pro, which uses images, a depth sensor, and an IMU. In addition, the created map was visualized and compared with the original QR code to show that the method can be applied to object restoration. In the third experiment, we performed motion estimation for a fast-moving camera to demonstrate the effectiveness of high-frequency estimation with event cameras. The experimental results of R-CMax were compared to the original CMax to evaluate its accuracy.
All experiments were performed on an Intel Core i7-6800K CPU and 64 GB of system memory, using Python for the implementation and matplotlib [34] for visualization.

A. EXPERIMENTS WITH SYNTHETIC EVENT DATA
This section describes experiments showing that the proposed method reduces computational complexity while achieving estimation accuracy comparable to existing methods. To compare R-CMax with existing methods, we created a dataset using ESIM, which can simulate an event camera. We created ten event data sets by placing a virtual event camera perpendicular to the image shown in Fig. 4 and moving it randomly. We applied our method and the comparative methods to the data to estimate the motion of the event camera. For comparison, the same data was given to CMax, EMin, and DMin (on-line mode). Each method used a batch size of 50,000 events, and motion estimation was performed every time an event was acquired. The parameters to be estimated were the two-dimensional optical flows (v_x, v_y) of motion parallel to the plane.

Fig. 5 shows the number of events required by R-CMax and CMax for each estimation. As seen from the figure, the proposed method reduced the number of events that must be computed per estimation compared to CMax. The figure shows that CMax always processed a constant number of events for each motion estimation. This is because CMax accumulates a batch of events of a fixed size for each new event acquired, creates an IWE, and maximizes its contrast, so a large number of events must be processed repeatedly in a short period. In contrast, the proposed method performed motion estimation with less computation than the existing methods, because the compressed events approximate past events that have already been calculated. As a result, only compressed and newly observed events are needed for a single estimation, reducing the total amount of computation.
The results of the accuracy comparison between R-CMax, CMax, and its extensions are shown in Table 1. The accuracy comparison is made by averaging three metrics over the results of the ten data sets: the error with respect to the ground truth, the error variance, and the root mean squared error (RMSE). As seen from the table, the proposed method estimates motion with accuracy comparable to the existing methods. As seen in Fig. 5, even though the proposed method performed motion estimation with fewer events than the existing methods, it still maintained its accuracy.
A plot of one of the results of the accuracy comparison experiment is shown in Fig. 6. As seen from the figure, even for data involving changes in velocity, estimation by the proposed method was comparable to the existing method.

FIGURE 10. Trajectory of the dolly obtained by integrating the motion estimation results, comparing R-CMax, DMin, and ARKit4. The proposed method was able to track motion with comparable accuracy to ARKit4, which used intensity images and depth sensors.
Our method assumes that the velocity changes little before and after the observation of an event. Even data with changing velocity can be regarded as having constant velocity over a very short interval, such as before and after the observation of a single event, so the estimation can be performed without failure.

B. EXPERIMENTS WITH REAL EVENT DATA
This section describes the quantitative experiments conducted to demonstrate the effectiveness of this method with an actual event camera, showcasing its applicability to real-world applications. In Section IV-A, we experimented with simulated event data, where events occurred only at pixels with intensity changes. By conducting experiments with an actual event camera, we show that motion estimation with this method is possible even in more practical situations that feature noise. As shown in Fig. 7, the event camera was placed 20 cm from the floor, facing directly down at the ground, and motion estimation was performed while the camera moved parallel to the floor over a QR code. The event camera used for this experiment was a Prophesee Gen3 VGA (with its supporting evaluation kit, EVK). As a comparison, the same data set was given to DMin (on-line mode) for motion estimation. Each method used a batch size of 50,000 events, and motion estimation was performed for each event acquisition. As a reference, we performed motion estimation using ARKit4 (iOS) with an iPhone 13 Pro Max mounted on a dolly facing forward. Fig. 8 shows this method's estimation results along with the results of DMin and ARKit4. As can be seen from the figure, the proposed method could estimate motion even on actual event data. DMin could also estimate motion, but the proposed method was more efficient because it reduced the number of operations.

FIGURE 11. Left: A visualized image of the 2D map M that is created simultaneously with motion estimation. Right: Image of a QR code captured by an event camera during motion estimation. R-CMax gradually reconstructed the map as the event camera captured the QR code, and when the QR code was out of view, that area was removed from the map. By managing the map, it is possible to retain enough compressed events for the R-CMax calculation.
It could also be confirmed that the proposed method achieved motion estimation similar to that of ARKit4, which used an RGB camera and LiDAR. Fig. 9 shows the accumulated events before and after applying R-CMax. The alignment of the events qualitatively indicates that contrast was correctly maximized. Fig. 10 compares the dolly trajectory derived from the estimated motion with those of DMin and ARKit4. R-CMax estimated the trajectory with the same accuracy as ARKit4, which used cameras and depth sensors, indicating that this method can also be applied to visual odometry.
A visualization of the map M that was stored during the estimation process is shown in Fig. 11. The result shows that the edges of the QR code had been restored and that the noise generated from the event camera had been eliminated from the map. This result also shows that the motion estimation by the proposed method is correct for actual event data.
Although the actual event camera used in these experiments produces noise in the event data that is not present in the simulation data, the experimental results show that motion estimation can be performed even in the presence of such noise. This is because maintaining the map reduces the weight of randomly occurring events such as noise. Removing unnecessary compressed events caused by noise makes motion estimation more robust. These results demonstrate the potential of applications using an event camera for motion estimation parallel to a plane.

C. EXPERIMENTS WITH FAST-MOVING EVENT CAMERA
This section describes experiments conducted to demonstrate the effectiveness of R-CMax in tracking fast-moving cameras. High-frequency motion estimation is effective in tracking fast-moving cameras, so we compared the motion estimation results of R-CMax and CMax on a fast-moving event camera. In the experiment, we acquired data by moving the event camera at high speed by hand, 20 cm from the plane, and performed motion estimation with R-CMax and CMax. The event camera used for this experiment was a Prophesee Gen3 VGA (with its supporting evaluation kit, EVK). Each method performed estimation for each event acquisition. Fig. 12 shows the proposed method's estimation results and CMax's results. The figure shows that the proposed method could estimate motion with the same accuracy as CMax, even for a fast-moving event camera. Also, as the figure shows, the maximum speed of the event camera in this experiment was about 6,000 pixels/s. If we assume a conventional camera frame rate of 30 fps, the camera moved 200 pixels per frame, which would be difficult to track with a conventional camera. Thus, high-frequency motion estimation with event cameras using R-CMax is effective for tracking fast-moving cameras.
Note that when the event camera moves at high speed, events can be generated at pixels where they should not occur. In the above experiment, some estimations failed when there was too much noise of that kind. We believe this problem can be solved by appropriately designing noise filters and removing such noise before processing.

V. CONCLUSION
In this paper, we proposed R-CMax for high-frequency, two-dimensional motion estimation by recursively maximizing contrast for each event acquisition. CMax has the problem that the number of operations required increases when performing high-frequency motion estimation, because high-frequency motion estimation via CMax requires the recalculation of past events that have already been used for estimation. The proposed method approximates multiple events by ''compressed events'' using the estimated trajectories of events from the previous time step. By creating an IWE from only the compressed events and new events, we can maximize contrast while avoiding the recalculation of past events. This allows motion estimation to be performed each time an event is observed, producing estimation results at a high frequency with reduced computational complexity. To demonstrate the effectiveness of the proposed method, we showed that, on multiple sequences created using ESIM, the proposed method performed motion estimation without loss of accuracy while reducing the amount of computation required for a single estimation compared to existing methods. Motion estimation experiments using an actual event camera were also conducted, showing that the proposed method can robustly estimate motion even with an actual event camera that generates random noise events. We also demonstrated that high-frequency motion estimation by the proposed method is effective even when the event camera moves at high speed.
One limitation of our method is that the estimated motion is limited to two-dimensional motion parallel to a plane. One possible future direction is extending the method to more complicated motions, such as rotation and homography. However, if we extend this method to rotations and homographies, the event trajectories are no longer spatiotemporally linear, even though the motion is linear in parameter space. The approximation in the present method assumes that the event trajectories are spatiotemporally linear, so the current approximation cannot be applied directly. Therefore, to model higher-degree motions, we need to formulate the spatiotemporal trajectories of events as non-linear functions of time.