Fast online 3D reconstruction of dynamic scenes from individual single-photon detection events

In this paper, we present an algorithm for online 3D reconstruction of dynamic scenes using individual times of arrival (ToA) of photons recorded by single-photon detector arrays. One of the main challenges in 3D imaging using single-photon Lidar is the integration time required to build ToA histograms and reconstruct reliable 3D profiles in the presence of non-negligible ambient illumination. This long integration time also prevents the analysis of rapid dynamic scenes using existing techniques. We propose a new method which does not rely on the construction of ToA histograms but allows, for the first time, individual detection events to be processed online, in a parallel manner in different pixels, while accounting for the intrinsic spatiotemporal structure of dynamic scenes. Adopting a Bayesian approach, we construct a model capturing the dynamics of the 3D profile and propose an approximate inference scheme based on assumed density filtering, yielding a fast and robust reconstruction algorithm able to process efficiently thousands to millions of frames, as usually recorded using single-photon detectors. The performance of the proposed method, able to process hundreds of frames per second, is assessed using a series of experiments conducted with static and dynamic 3D scenes, and the results obtained pave the way to a new family of real-time 3D reconstruction solutions.


I. INTRODUCTION
Fast reconstruction of 3D scenes using single-photon light detection and ranging (Lidar) technology is an important challenge in applications such as autonomous driving [1], environmental monitoring [2]-[4] and defence [5]. A growing number of 3D imaging modalities is becoming increasingly popular [6], and single-photon Lidar offers appealing advantages, including low power consumption, a capability for long-range imaging [7] or imaging in complex media such as fog/smoke [8] and underwater [9], [10], and excellent range resolution (of the order of millimetres [11]). Recently, several algorithms have also been proposed to analyse distributed objects [12]-[17], i.e., scenes where multiple surfaces are visible within each pixel.
Despite pushing the boundaries of 3D reconstruction in extreme environments, single-photon Lidar still suffers from 1) relatively long integration times required to obtain sufficiently reliable data and 2) significant computational requirements to process the resulting large volume of data recorded by single-photon imaging systems. Recent advances in single-photon avalanche diode (SPAD) detector arrays [18], [19] have allowed significant reductions in acquisition times over raster scanning systems [11], [20]-[22], enabling acquisitions at video frame rates. Yet, robust, automated and scalable methods allowing for fast analysis of single-photon data are still required. One of the main bottlenecks of most state-of-the-art 3D reconstruction methods [15], [23]-[27] is that they rely on the construction of histograms of photon times of arrival (ToA) (or batches of detection events), which, when synchronised with a pulsed laser (time-correlated single-photon counting, TCSPC), correspond to photon times of flight (ToF), used to infer object ranges. One important exception is the so-called "first-photon" imaging approach [28], whereby the reflectivity and 3D profiles of the scene can be recovered using a single photon per pixel. However, the approach in [28] targets primarily raster scanning Lidar systems, allowing for variable per-pixel acquisition times, i.e., acquiring until the first photon is detected.
In this work, we consider Lidar data acquired using SPAD arrays and investigate a new 3D reconstruction algorithm that does not rely on ToF histograms, but on individual photon detection events. More precisely, we address the problem of 3D reconstruction after each time period (defined in Section II) during which each SPAD detector can record at most one detection event. This approach is particularly relevant for applications where the objects in the scene can move significantly faster than the integration period or the number of laser repetitions required to build sufficiently populated ToF histograms. In such cases, the relative movement of the scene with respect to the sensor can produce a 3D blur that results in broader peaks or even multiple returns in some Lidar waveforms, which can jeopardise the 3D reconstruction task.
Adopting a Bayesian approach, we consider a likelihood model based on the standard single-photon Lidar observation model in the low-flux regime. We then introduce a dynamic model for the spatiotemporal (ST) evolution of the 3D profile. Due to the complex nature of the likelihood (a mixture of two distributions) and the structure of the prior model, the standard online estimation methods based on (extended) Kalman filtering [29] cannot be used directly. As the complexity of the resulting model grows prohibitively with the number of detection events, i.e., over time, we adopt an approximate estimation strategy based on assumed density filtering (ADF) [30]-[32], whereby the posterior distribution of the 3D profile estimated for a given frame is projected onto a family of more tractable distributions (Gaussian distributions here), which significantly reduces the complexity of the sequential estimation procedure. Although particle filters [33] could also be considered for approximate inference, this would lead to an increased computational cost induced by the approximation of densities using a large number of particles. It is important to note that the resulting method, which uses each frame (during which at most one photon can be detected per pixel) only once, enables online 3D reconstruction of dynamic scenes with limited memory requirements. Indeed, the individual frames are processed sequentially, resulting in a fixed computational cost per frame, which is important for any real-time implementation. Moreover, thanks to its intrinsically parallel algorithmic architecture, the proposed method is extremely scalable to large arrays and long sequences of frames. Another important advantage is that it does not require the knowledge of the (potentially time-varying) ambient illumination level.
To summarise, the main contributions of this work are:
• A new Bayesian model for sequential 3D reconstruction using individual photon-detection events.
• An online estimation strategy, proposed to the best of our knowledge for the first time, for the reconstruction of dynamic 3D scenes from single-photon data. This method, based on assumed density filtering, is highly scalable and computationally attractive.
The remainder of the paper is organised as follows. Section II first recalls the classical observation model for 3D reconstruction using single-photon measurements in the photon-starved regime and describes the Bayesian model and inference strategy proposed for 3D reconstruction using a single frame. The generalisation of this method to online 3D reconstruction of dynamic scenes is detailed in Section III. Results of experiments conducted with simulated single-pixel data and sequences of frames are presented and discussed in Section IV, and conclusions are finally reported in Section V.

A. Observation model
In this work, we consider a sequence of N frames, where each frame of duration T consists of P pixels. This paper addresses the reconstruction of dynamic 3D scenes where each single-photon detector, associated with one pixel, is able to record at most one detection event per frame.
Let us first consider an active illumination scenario where the laser emits pulses of light with a repetition/illumination period T_r = T. As described in [26], assuming that a single surface is visible in each pixel, within each frame n, the average photon flux at the detector/pixel p can be modelled as

λ_{p,n}(t) = r_{p,n} s(t − 2d_{p,n}/c) + b_{p,n},    (1)

where d_{p,n} is the instantaneous distance of the object, c is the speed of light in the homogeneous medium between the imaging system and the object, and r_{p,n} is an amplitude parameter related to the reflectivity of the object. Moreover, b_{p,n} represents the ambient illumination and dark count level in the pth pixel, which can potentially vary among pixels. Note that r_{p,n} and b_{p,n} also account for the quantum efficiency of the detectors, which is not further detailed here for brevity (see [26] for details). Moreover, s(·) is the overall impulse response of the imaging system, which includes the shape of the pulse emitted by the laser and the temporal response of the single-photon detector. As in [26], we assume that s(·) is known, as it can be measured during the calibration of the Lidar system, and that it is well approximated by a Gaussian profile with variance s². As will be discussed in Section II, the proposed method can also be applied when the shape of this impulse response is not Gaussian and changes from one pixel to another due, for instance, to the inhomogeneity of the P detectors. Over the nth illumination period, the detection rate is thus given by

Λ_{p,n} = r_{p,n} S + B_{p,n},    (2)

where B_{p,n} = T_r b_{p,n} and where we assume that the object distance is not too close to the minimum (0) and maximum (T_r c/2) admissible ranges, such that the integral S = ∫_0^{T_r} s(t − 2d_{p,n}/c) dt remains constant, whatever the value of d_{p,n}. In the low-flux regime, we have r_{p,n} S + B_{p,n} ≪ 1, such that the probability of two photons reaching the same detector in a given interval T_r is small and the dead-time of the detector can be neglected. In that case, the probability of detection is given by π_{p,n} = 1 − exp[−Λ_{p,n}] ≈ Λ_{p,n} and the probability of a detected photon being associated with the original emitted pulse, denoted by w_{p,n}, is given by w_{p,n} = (r_{p,n} S)/Λ_{p,n}. Let z_{p,n} ∈ {0, 1} be a binary label indicating a detection event (i.e., when z_{p,n} = 1) in pixel p for the nth frame, such that

z_{p,n} | π_{p,n} ∼ Bernoulli(π_{p,n}).    (3)

When z_{p,n} = 1, the observation model for the measured time of arrival y_{p,n} ∈ [0; T_r) in pixel p and frame n can be expressed as

f(y_{p,n} | d_{p,n}, z_{p,n} = 1, w_{p,n}) = w_{p,n} f_s(y_{p,n} − 2d_{p,n}/c) + (1 − w_{p,n}) U_{[0;T_r)}(y_{p,n}),    (4)

where U_{[0;T_r)}(·) is the uniform distribution defined on [0; T_r), and with f_s(t − 2d_{p,n}/c) = S^{−1} s(t − 2d_{p,n}/c), ∀(p, n). Moreover, we use the notation y_{p,n} = ∅ when no detection was recorded in the pth pixel within the nth frame. Assume now that a frame lasts N_r laser repetition periods, i.e., T = N_r T_r, and that the detector is only able to record at most one detection event during that frame. If the observation conditions have not changed during the N_r repetitions, the probability of detection is given by π̃_{p,n} = 1 − exp[−N_r Λ_{p,n}]. However, in the low-flux regime, Eq. (4) still applies. Consequently, although each frame can result from more than one illumination period, the observation models (3) and (4) remain valid by replacing π_{p,n} by π̃_{p,n} in (3), provided that the observation conditions have not changed over the period T. This observation is useful for practical applications since, in the low-flux regime, imposing r_{p,n} S + B_{p,n} ≪ 1 (for each T_r interval) leads to extremely sparse detection events and large data volumes when T = T_r, while using T = N_r T_r (for a given T_r) allows both reduced data volumes and higher per-frame detection rates.
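To make the observation model (3)-(4) concrete, the following minimal Python sketch (ours, not the paper's code; all function and variable names are hypothetical) simulates one frame of per-pixel detection events. For simplicity it works directly in ToF units (so the 2d_{p,n}/c conversion is absorbed into d), assumes a Gaussian f_s with variance s², and neglects the truncation of the Gaussian at the interval boundaries.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def simulate_frame(d, w, pi, Tr=1500.0, s2=200.0):
    """Simulate one frame of single-photon detections under (3)-(4).

    d  : per-pixel object positions, expressed directly in ToF units,
    w  : per-pixel signal probabilities w_{p,n},
    pi : per-pixel detection probabilities pi_{p,n} (or the N_r-repetition
         counterparts, see the discussion below Eq. (4)).
    Returns the ToAs y_{p,n}; np.nan encodes 'no detection' (y_{p,n} = emptyset).
    """
    P = d.shape[0]
    y = np.full(P, np.nan)
    detected = rng.random(P) < pi            # z_{p,n} ~ Bernoulli(pi_{p,n}), Eq. (3)
    signal = rng.random(P) < w               # signal vs background photon, Eq. (4)
    y_sig = rng.normal(d, np.sqrt(s2))       # Gaussian f_s centred on the ToF
    y_bkg = rng.uniform(0.0, Tr, size=P)     # uniform background on [0, Tr)
    y[detected] = np.where(signal, y_sig, y_bkg)[detected]
    return np.mod(y, Tr)                     # wrap into [0, Tr); nan stays nan
```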
In this paper, we address the problem of estimating D = {d_{p,n}}_{p,n} from the set of observations Y = {y_{p,n}}_{p,n}. As discussed in the introduction of this paper, although it is possible to develop batch-based methods for recovering D given all the NP observations [20], [26], such approaches can become computationally prohibitive for large numbers of pixels, and more importantly for long temporal sequences. Moreover, these existing approaches do not specifically deal with time-varying scenes and do not use spatiotemporal models. While the method recently proposed by Halimi et al. [8], [16] can be used for varying scenes (sequences of batches), it is not designed to handle long temporal sequences either, as all the batches are processed simultaneously. Indeed, the ADMM-based method developed in [16] seems more suited to multispectral measurements. Thus, here we adopt a sequential approach where the N frames are processed one by one and only once, allowing for fast estimation and reduced memory requirements. In the remainder of the paper, we thus use (3)-(4) as our observation model.
The next paragraph introduces the Bayesian model and estimation strategy used to process a single frame, assuming that W = {w_{p,n}}_{p,n} is known. The generalisation of the proposed update to online 3D reconstruction, including the sequential estimation of W, will be discussed in Section III.

B. Estimation strategy
As mentioned above, we first investigate the estimation of d_n = {d_{p,n}}_p from a set of measurements y_n = {y_{p,n}}_p associated with the nth frame. Assuming that the detection events in different pixels are mutually independent (given the other parameters in (4)), the joint likelihood can be expressed as

f(y_n | d_n, z_n, w_n) = ∏_{p=1}^{P} f(y_{p,n} | d_{p,n}, z_{p,n}, w_{p,n}),    (5)

with z_n = {z_{p,n}}_p and w_n = {w_{p,n}}_p. To obtain a tractable and computationally efficient ADF-based estimation strategy, we propose to define independent prior distributions for the target ranges in a given frame, i.e., f(d_n | Θ_n) = ∏_p f(d_{p,n} | θ_{p,n}). Despite the apparent lack of prior correlation between the elements of d_n (given the set of parameters in Θ_n = {θ_{p,n}}_p), it is possible to enforce ST correlations by defining Θ_n using d_{n−1}, as will be discussed in Section III. For now, let us assume that each distance d_{p,n} is assigned a fully specified mixture of M Gaussian distributions as follows

f(d_{p,n} | θ_{p,n}) = Σ_{m=1}^{M} u_{m,p,n} N(d_{p,n}; µ_{m,p,n}, σ²_{m,p,n}),    (6)

where N(·; µ, σ²) is a Gaussian distribution with mean µ and variance σ², and where the mixture weights satisfy Σ_{m=1}^{M} u_{m,p,n} = 1, ∀(p, n); their value, as well as that of M, will be discussed in Section III. Since the joint likelihood (5) and the joint prior distribution (6) can be factorised over the P pixels, the resulting posterior distribution, given by

f(d_n | y_n, z_n, w_n, Θ_n) ∝ f(y_n | d_n, z_n, w_n) f(d_n | Θ_n),    (7)

can also be factorised over the P pixels, and the P range parameters in d_n can thus be estimated independently, in a parallel manner. Consequently, we simply summarise the update for one parameter d_{p,n}, i.e., for pixel p. If z_{p,n} = 0, d_{p,n} does not appear in the data likelihood; in that case, the posterior distribution of d_{p,n} reduces to its prior (6). If z_{p,n} = 1, the posterior distribution

f(d_{p,n} | y_{p,n}, z_{p,n} = 1, w_{p,n}, θ_{p,n}) ∝ f(y_{p,n} | d_{p,n}, z_{p,n} = 1, w_{p,n}) f(d_{p,n} | θ_{p,n})    (8)

is a finite mixture whose moments cannot always be computed analytically (e.g., when s(·) is not Gaussian), but it is possible to resort to numerical integration tools [35], [36], such as Gaussian quadrature or Laplace approximation, to approximate the integrals and in turn the moments of f(d_{p,n} | y_{p,n}, z_{p,n} = 1, w_{p,n}, θ_{p,n}).
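When f_s(·) is Gaussian, the posterior (8) is available in closed form as a mixture of 2M Gaussians (M terms from the product of the Gaussian likelihood with each prior component, M from the uniform background term). The following sketch (ours, with hypothetical names; it works in ToF units and neglects the wrap-around of the Gaussian at the interval boundaries) computes this update for one pixel:

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Gaussian density, used here for a Gaussian f_s."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def posterior_update(y, u, mu, var, w, Tr=1500.0, s2=200.0):
    """Per-pixel posterior of d_{p,n} given a detection at ToA y, cf. (8).

    (u, mu, var): weights/means/variances of the M-component prior (6).
    Returns the 2M posterior components and the posterior probability that
    the detection was a signal photon (mass of the f_s components),
    used later for the heuristic update of w in Section III.
    """
    # M 'signal' components: Gaussian likelihood x Gaussian prior -> Gaussian
    u_sig = w * u * gauss_pdf(y, mu, s2 + var)   # marginal evidence per component
    var_sig = 1.0 / (1.0 / var + 1.0 / s2)
    mu_sig = var_sig * (mu / var + y / s2)
    # M 'background' components: the uniform likelihood only rescales weights
    u_bkg = (1.0 - w) / Tr * u
    u_post = np.concatenate([u_sig, u_bkg])
    u_post /= u_post.sum()
    signal_prob = u_post[: len(u)].sum()
    return u_post, np.concatenate([mu_sig, mu]), np.concatenate([var_sig, var]), signal_prob
```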

III. ONLINE ESTIMATION

A. Approximation using Assumed Density Filtering
Estimating the posterior mean and variance of d_{p,n} presents a great advantage for online estimation, beyond simply providing summary statistics about the current range profile: by simply propagating the first and second-order moments of the current posterior distributions, it allows the use of a tractable adaptive estimation procedure. Indeed, if the prior distribution of d_{p,n} consists of M components (as in (6)), its posterior will contain 2M components (only M if z_{p,n} = 0), and if a classical Gaussian random walk is then used to model f(d_{p,n+1} | d_{p,n}), the posterior distribution of d_{p,n+1} will present 4M terms after marginalisation of d_{p,n}. That number thus increases prohibitively as n increases. The basic principle of assumed density filtering in this case is to approximate f(d_{p,n} | y_{p,n}, w_{p,n}, θ_{p,n}) by a more tractable distribution that can then be used to build a new prior distribution for d_{p,n+1}. While it is possible to construct complex approximations of (8) using a fixed (reduced) number of Gaussian components, here we simply use an approximation based on a single Gaussian q_{p,n}(d_{p,n}). In a similar fashion to classical assumed density filtering [30], [31] and expectation propagation [32], this approximation is found by minimising the following Kullback-Leibler (KL) divergence

KL(f(d_{p,n} | y_{p,n}, w_{p,n}, θ_{p,n}) ‖ q_{p,n}(d_{p,n}))    (10)

w.r.t. q_{p,n}(d_{p,n}), which belongs to the family of Gaussian distributions. This minimisation reduces to matching the mean and variance of f(d_{p,n} | y_{p,n}, w_{p,n}, θ_{p,n}) and q_{p,n}(d_{p,n}), hence the discussion about the estimation of the moments of f(d_{p,n} | y_{p,n}, w_{p,n}, θ_{p,n}) in Section II-B.
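For a Gaussian-mixture posterior, the moment-matching step takes only a few lines; the mixture moments follow from the law of total variance. A minimal sketch (ours) collapsing the posterior mixture onto the single Gaussian q_{p,n}:

```python
import numpy as np

def moment_match(u, mu, var):
    """Collapse a Gaussian mixture onto a single Gaussian N(m, v).
    Matching the first two moments minimises KL(mixture || Gaussian),
    i.e., it solves (10) over the Gaussian family."""
    m = np.sum(u * mu)                          # mixture mean
    v = np.sum(u * (var + mu ** 2)) - m ** 2    # law of total variance
    return m, v
```

When f_s(·) is not Gaussian, the same function applies once the component moments have been approximated by quadrature, as discussed in Section II-B.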

B. Spatiotemporal dynamic model for the range profile
A classical choice for modelling relatively slowly evolving parameters relies on (Gaussian) random walks. Whilst this approach is easy to implement, it does not allow, using simply f(d_{p,n+1} | d_{p,n}), ∀p, for rapid changes as might occur when the imaging system or the scene moves orthogonally to the direction of observation, whereby a foreground object can disappear from one pixel and appear in neighbouring pixels. To alleviate issues associated with such changes while keeping the estimation strategy tractable, we define, for each pixel, a local neighbourhood V_p of M neighbours (including the current pixel) and define the following prior model

f(d_{p,n+1} | θ_{p,n+1}) = Σ_{p′∈V_p} ν_{p′} ∫ f_{RW}(d_{p,n+1} | d_{p′,n}) q(d_{p′,n}) dd_{p′,n},    (11)

where {q(d_{p,n})}_{p,n} are the Gaussian approximating posterior distributions of {d_{p,n}}_{p,n} computed by minimising (10) and f_{RW}(d_{p,n+1} | d_{p′,n}) = N(d_{p,n+1}; d_{p′,n}, γ²) is a Gaussian random walk which models (through its variance γ²) the expected amount of movement of the objects of the scene along the direction of observation between two frames. In (11), the weights ν_{p′} are chosen such that ν_{p′=p} = ν and ν_{p′≠p} = (1 − ν)/(M − 1), i.e., ν represents the probability of a surface remaining in the same pixel in the next frame. While we choose in (11) a relatively simple ST model based on a random walk, more complex local dynamic models (e.g., using object velocity) could also be used. As long as the prior model reduces to a mixture of Gaussian distributions, the proposed estimation strategy can still be used.
In a similar fashion to (8), Eq. (11) is a finite mixture of M Gaussian distributions whose weights, and individual means and variances, gathered in θ_{p,n+1}, can be computed analytically by integration of the right-hand side of (11) w.r.t. {d_{p′,n}}_{p′∈V_p}. Using this strategy, the number of components of f(d_{p,n+1} | θ_{p,n+1}) remains the same as for f(d_{p,n} | θ_{p,n}), that is, M. The size of the neighbourhood can be adjusted using prior information about the scene of interest: small neighbourhoods (e.g., M = 5 using 4 neighbouring pixels, as used in this work) will be sufficient for slowly moving scenes, but larger sets of pixels might be needed if objects can move by several pixels in the image plane.
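Since each q(d_{p′,n}) is Gaussian, integrating the random walk against it simply inflates its variance by γ², so the prior propagation reduces to shifting means/variances across the neighbourhood. A vectorised sketch (ours; for brevity, np.roll wraps around at the image borders, whereas the paper instead adds a broad extra component at the edge pixels, see below):

```python
import numpy as np

def propagate_prior(m, v, nu=0.99, gamma2=10.0):
    """Build the next-frame mixture prior (11) from the per-pixel Gaussian
    approximations q(d_{p,n}) = N(m, v) of the current frame.

    m, v : (H, W) arrays of posterior means/variances at frame n.
    Returns (M, H, W) arrays of weights, means and variances of the M = 5
    components (pixel itself + 4-neighbourhood), after analytic integration
    of the right-hand side of (11).
    """
    shifts = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]      # V_p: self + 4 neighbours
    M = len(shifts)
    u = np.empty((M,) + m.shape)
    mu = np.empty_like(u)
    var = np.empty_like(u)
    for k, (dy, dx) in enumerate(shifts):
        mu[k] = np.roll(m, (dy, dx), axis=(0, 1))            # neighbour's posterior mean
        var[k] = np.roll(v, (dy, dx), axis=(0, 1)) + gamma2  # + random-walk variance
        u[k] = nu if k == 0 else (1.0 - nu) / (M - 1)        # nu_{p'=p} and nu_{p'!=p}
    return u, mu, var
```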

Algo. 1: O3DSP. For each frame n = 1, ..., N and each pixel p (in parallel): compute the prior model f(d_{p,n} | θ_{p,n}) from (11); compute q_{p,n}(d_{p,n}) using (10); update w̃_{n+1}; optionally (line 11), apply a smoothing operator to w̃_{n+1}.

Interestingly, the proposed 3D reconstruction method does not rely on the knowledge of the detection probabilities {π_n}_n, since they do not intervene in the estimation of d_n, which only relies on {y_n}_n. In particular, this method does not require knowledge of the number N_r of illumination periods during each frame, which is only used in the probabilities of detection {π_{p,n}}_{p,n} (see discussion below Eq. (4)). Thus, the only important and generally unknown parameters are the probabilities of signal detection events in W. In a similar fashion to the approach we proposed for D, the elements of W can be included in a Bayesian model and assigned iteratively prior distributions for online estimation, i.e., by computing the posterior distribution of (d_n, w_n) at each frame, and by approximating this distribution to build a tractable prior distribution f(d_{n+1}, w_{n+1} | d_n, w_n). However, this is not the approach we adopt here as it makes the estimation procedure more computationally demanding, in particular when computing the marginal moments, or more generally expectations w.r.t. the posterior distribution of (d_n, w_n), during the KL divergence minimisation. Instead, we use the following simple heuristic method, which provides satisfactory results in practice. Let w̃_n be an estimate of w_n obtained from the previously observed data {y_1, ..., y_{n−1}}. Our aim here is to propose an estimate w̃_{n+1} of w_{n+1}, which depends on w̃_n and the data y_n. We first define an instantaneous estimator ŵ_n = {ŵ_{p,n}}_p with ŵ_{p,n} = w̃_{p,n} if y_{p,n} = ∅. If y_{p,n} ≠ ∅, ŵ_{p,n} is obtained from (8), where w_n has been replaced by w̃_n. More precisely, ŵ_{p,n} is the posterior probability of the current detection event being a signal detection; thus ŵ_{p,n} can be obtained by summing the weights of the M components involving f_s(·) in (8). The updated vector of probabilities is obtained using w̃_{n+1} = (1 − α)w̃_n + αŵ_n, where α ∈ (0; 1) is an attenuation parameter to be tuned depending on the expected variations of w_n over time. Note that it is also possible to apply a smoothing post-processing step, e.g., standard Gaussian filtering, to w̃_{n+1} to further refine the estimate of w_{n+1}, since these parameters are often expected to be spatially correlated in each frame. As mentioned above, this strategy is simple and does not significantly degrade the performance of the 3D reconstruction method in most scenarios. The pseudo-code of the proposed method, referred to as O3DSP (for Online 3D reconstruction using Single-Photon data), is presented in Algo. 1.
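The heuristic update of W is a simple exponential forgetting scheme; a sketch (ours, reusing the signal_prob output of the posterior_update sketch of Section II, with hypothetical names) is given below. The optional spatial smoothing could, e.g., use scipy.ndimage.gaussian_filter.

```python
import numpy as np

def update_w(w_tilde, y, signal_prob, alpha=0.1):
    """Heuristic online update of the signal probabilities (Section III).

    w_tilde     : current per-pixel estimates of w_n,
    y           : ToAs of frame n (np.nan where no detection occurred),
    signal_prob : per-pixel posterior probability that the detection was a
                  signal photon (sum of the f_s-component weights in (8)).
    """
    w_hat = np.where(np.isnan(y), w_tilde, signal_prob)  # instantaneous estimate
    return (1.0 - alpha) * w_tilde + alpha * w_hat       # exponential forgetting
```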
Another important issue that might arise is the occurrence of a new object in the field of view. A particularly challenging scenario is the appearance of an object initially occluded by another object. In such cases, it is possible to add an extra component in (11), e.g., whose mean and variance can be related to the mean/median and dispersion of the {d_{p′,n}}_{p′}, respectively. This approach would be efficient at capturing new objects appearing between a foreground object and the background. However, as will be shown in Section IV, such an extra term does not seem necessary as the proposed ST model naturally enforces large variances around edges, which in turn allows initially occluded objects to be detected. Note that more complex and principled strategies should be developed to handle more challenging occlusion scenarios and situations where pixels do not contain any objects, which are out of the scope of this work. This point will be discussed in the conclusion of this study. New objects can also enter the field of view from any side. To address this problem, we include, for the pixels around the edges of the image, an additional Gaussian component (with a large variance) in the mixture (11), such that the resulting prior allows, at the same time, ranges similar to those in nearby pixels but also significantly different ranges induced by the presence of new objects.
Finally, the proposed algorithm can also be applied in the presence of faulty pixels, for which π_{p,n} = 0. For these pixels, the range information will be inferred using the inpainting capability of the model in (11).

IV. RESULTS
In this section, we discuss the performance of O3DSP through a series of experiments conducted with simulated data for which ground truth is available for comparison. We first investigate the main parameters influencing the reconstruction performance using individual pixels, i.e., without accounting for information provided by neighbouring pixels. Then, we investigate the reconstruction of static and dynamic scenes using photon-starved measurements.

A. Single-pixel experiments
In Sections II and III, we have assumed that the measured times of arrival follow continuous distributions, i.e., that they are either uniformly distributed over [0; T_r) or Gaussian distributed. However, SPAD detectors have a finite timing resolution, whereby the measured times of arrival follow discrete distributions defined on a finite support. Fortunately, state-of-the-art SPADs [18]-[20] present a timing resolution which is much smaller than the support of f_s(·) and thus than T_r. Consequently, assuming continuous measurements does not significantly bias the estimation. Should the temporal resolution of the SPADs be coarser, O3DSP can still be applied by adding dither to the discrete measured ToAs [37].
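The dithering step amounts to spreading each discrete ToA uniformly over its timing bin, as in this short sketch (ours):

```python
import numpy as np

def dither(y_discrete, bin_width, rng=None):
    """Spread discretised ToAs uniformly over their timing bin so that they
    can be treated as draws from a continuous distribution (cf. [37])."""
    if rng is None:
        rng = np.random.default_rng()
    return y_discrete + (rng.random(y_discrete.shape) - 0.5) * bin_width
```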
In all the simulation results presented in this paper, we use the arbitrary illumination period T_r = 1500 (unless stated otherwise) and f_s(·) is modelled by a Gaussian distribution with variance s² = 200 (see Fig. 1). To initialise the algorithm, we used w̃_{p,0} = 0.5, ∀p, and the Gaussian initial approximations q_{p,0}(·), ∀p, are set identically such that their mean is T_r/2 and their variance allows the entire interval (1, T_r) to be in the high-probability region. This leads to a weakly informative initialisation that we use to assess the convergence of the algorithm. As will be discussed below, more efficient initialisations can also be used.
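For reference, a possible implementation of this weakly informative initialisation (ours; the variance value (T_r/6)², putting the whole interval within ±3 standard deviations, is our choice and not specified in the text):

```python
import numpy as np

def initialise(P, Tr=1500.0):
    """Weakly informative initialisation: w_{p,0} = 0.5 and q_{p,0} centred
    at Tr/2 with the entire interval in its high-probability region."""
    w0 = np.full(P, 0.5)
    m0 = np.full(P, Tr / 2.0)
    v0 = np.full(P, (Tr / 6.0) ** 2)   # assumed: 3 standard deviations = Tr/2
    return w0, m0, v0
```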
First, we investigate the impact of w_{p,n} on the estimation of d_{p,n} for a given probability of detection π_{p,n}. Here, the number of frames is set to N = 500, with π_{p,n} = 0.5, d_{p,n} = 300 and α = 0.01. Figs. 2 and 3 compare the convergence of {d_{p,n}}_n and {w̃_{p,n}}_n for w_{p,n} = 0.8 (Fig. 2) and w_{p,n} = 0.3 (Fig. 3). The top subplots depict the frames during which background (in black) and signal (in red) detections are recorded. The middle subplots depict the mean (red lines) and ±3 standard deviation intervals (black dashed lines) obtained by minimising (10). The bottom subplots represent the online estimates {w̃_{p,n}}_n (red lines) of w_{p,n}. As can be seen from Figs. 2 and 3, the estimate of d_{p,n} converges faster with w_{p,n} = 0.8 than with w_{p,n} = 0.3 (faster convergence to the ground truth and smaller uncertainty). This phenomenon is to be expected as the number of signal detections increases with w_{p,n}, which in turn increases the amount of information about d_{p,n}. On the other hand, the convergence of w̃_{p,n} appears similar in both cases (around 200-300 frames).
Fig. 4 shows the estimation of d_{p,n} and w_{p,n} with π_{p,n} = 0.8, d_{p,n} = 300 and w_{p,n} = 0.3. As expected, the convergence of {d_{p,n}}_n is faster than in Fig. 3, since its estimation is directly related to the number of signal detections, which increases with π_{p,n} (for a fixed w_{p,n}).
Reducing the probability of detection has an impact on the estimation of d_{p,n} and w_{p,n}, as can be seen in Fig. 5, where π_{p,n} = 0.1 and w_{p,n} = 0.3. In this case, with an average of 50 detection events over the N = 500 frames (30% of which are signal detections), the convergence speed of {w̃_{p,n}}_n is reduced and the uncertainty about d_{p,n} increases due to the lack of information provided by the data. In such difficult scenarios, the proposed method might not converge toward the correct solution without using additional information, which is a well-known potential limitation of ADF [32]. However, as will be shown in Section IV-B, the proposed ST model using information contained in neighbouring pixels (see (11)) yields satisfactory results in the photon-starved regimes considered here.
We also evaluate the performance of O3DSP by analysing a single-pixel measurement where the object range describes a sine wave and where w_{p,n} experiences two sudden changes (see Fig. 6). This figure has been obtained with π_{p,n} = 0.5. As can be seen in the top and bottom subplots of Fig. 6, the probability of signal detection w_{p,n} is changed successively from w_{p,n} = 0.3 to w_{p,n} = 0.8 and back to w_{p,n} = 0.3. This figure shows that O3DSP is able to satisfactorily track the changes of d_{p,n} without noticeable delay and that about 200-300 frames are required for w̃_{p,n} to converge around the correct value. To highlight the benefits of our online approach over batch-based methods, we also consider the single-pixel measurements used in Fig. 6 and compare our approach to the classical cross-correlation method (see details in [20]). This approach is chosen as it is the fastest batch-based method which processes all the pixels independently. Although the comparison could have been performed using image sequences and more advanced methods, the competing methods would have led to significantly higher computational costs. To apply the cross-correlation, we first discretise the detection events uniformly over [0; T_r) with a stepsize of 1, which is much smaller than the width of f_s(·) (s² = 200), such that the discretisation bias can be neglected. For each batch of N_0 frames, the depth is then estimated by finding the delay that maximises the cross-correlation between the histogram of times of arrival within this batch and the discretised version of s(·). Fig. 7 compares the depth estimates obtained via cross-correlation for batches of N_0 = 10, N_0 = 50 and N_0 = 100 frames to those obtained using O3DSP. While small values of N_0 can lead to more accurate instantaneous estimates of the ranges, this figure shows that the results are also more sensitive to background detections due to the small number of detections within each batch of N_0 frames. Note that in extreme cases where π_{p,n} is small, there might even be no detection in some batches. Note also, in the top plot of Fig. 7, that the performance of the cross-correlation method is affected by the relative amount of background detections (larger errors for w_{p,n} = 0.3 than for w_{p,n} = 0.8, i.e., for n ∈ [600; 1100]). Here, we initialised the proposed method using weakly informative parameters, but it could be initialised using a batch-based method, such as cross-correlation, with the first few frames to improve the convergence speed.
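A sketch of this cross-correlation baseline (ours; it assumes a centred Gaussian template for s(·) and uses a circular cross-correlation via the FFT, which is one of several equivalent implementations of the method in [20]):

```python
import numpy as np

def xcorr_depth(toas, Tr=1500, s2=200.0):
    """Batch baseline: histogram the ToAs of a batch of frames and pick the
    delay maximising the circular cross-correlation with the discretised
    system response s(.), with a stepsize of 1 as in the text."""
    hist, _ = np.histogram(toas[~np.isnan(toas)], bins=np.arange(Tr + 1))
    t = np.arange(Tr)
    s = np.exp(-0.5 * (t - Tr / 2.0) ** 2 / s2)            # centred Gaussian IRF
    # circular cross-correlation computed in the Fourier domain
    xcorr = np.real(np.fft.ifft(np.fft.fft(hist) * np.conj(np.fft.fft(s))))
    return (np.argmax(xcorr) + Tr / 2.0) % Tr              # undo the centring shift
```

In the comparison of Fig. 7, this estimator would be applied once per batch of N_0 frames in each pixel.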

B. Analysis of static and dynamic 3D scenes
In this section, we first analyse the performance and convergence speed of O3DSP using simulated data based on real Lidar measurements conducted in [20], [21]. More precisely, we consider a series of N = 5000 frames composed of 129 × 95 pixels and associated with a static scene whose range profile, probabilities of signal detection {w_{p,n}}_p and probabilities of detection {π_{p,n}}_p are depicted in Fig. 8. Here, we used T_r = 2500. For most pixels, we have π_{p,n} < 5%, which corresponds to realistic observation conditions in the photon-starved regime.
First, we compare the performance of O3DSP processing all the pixels independently, i.e., without smoothing of w̃ and with ν = 1, to the version using the proposed ST model. In this case, we used M = 5 neighbours, ν = 0.99, and w̃ was smoothed using a Gaussian filter with standard deviation 0.5. In the two scenarios, we used (α, γ²) = (0.1, 10). Fig. 9 depicts the estimated means and variances of the range estimates after 100, 500 and 5000 frames (top to bottom), with (left columns) and without (right columns) the ST model. These results illustrate the benefits of the ST model, which improves the convergence speed of the algorithm and reduces the number of isolated pixels with poorly estimated range (see bottom row of Fig. 9). O3DSP with the ST model is able to clearly identify regions of high uncertainty, i.e., the boundaries of the head where the range is likely to change suddenly, should the head move. Moreover, the uncertainty increases with the range difference between close pixels: for instance, the uncertainty is larger at the boundary of the head than at the neck/chin boundary.
To assess quantitatively the convergence of the method, we use the range root mean square error (RMSE), defined as

RMSE(n) = √((1/P) Σ_{p=1}^{P} (d_{p,n} − d̂_{p,n})²),

where d_n and d̂_n are the actual and estimated range profiles in frame n, respectively. Fig. 10 confirms that the proposed ST model improves the convergence speed and estimation performance in terms of RMSE. To ease the visualisation of these results, the generated data associated with this static scene, as well as the estimated range profiles, are provided in a supplementary video associated with this paper (see Video 1, also available at https://youtu.be/yxcJypmc K4). For completeness, we also generated data with the same parameters as above but with probabilities of detection {π_{p,n}}_{p,n} multiplied by 10, when compared to those depicted in Fig. 8 (middle subplot), leading to an average probability of detection of 20% per pixel and per frame. Fig. 11 compares the convergence of the RMSEs for the original data (referred to as "low detection probability") and the new data set (referred to as "high detection probability"). As expected, increasing π_{p,n} yields faster convergence and lower RMSEs at convergence, due to the additional amount of (more frequent) detections available.

Fig. 11: Range RMSEs obtained for the static scene using the parameters defined in Fig. 8 (red lines) and by multiplying the probabilities of detection in Fig. 8 (middle subplot) by 10 (blue lines).

Although O3DSP is designed to process sequences of individual detection events, these results also illustrate that it can be used to process ToA histograms associated with static scenes. Indeed, given the number of frames/laser repetitions used to build standard ToA histograms, it is straightforward to simulate N frames of individual detection events. This particular case deserves a thorough comparison with existing batch-based methods, which is out of the scope of this paper, focusing on the analysis of individual detection events. This should be considered in future work.
Finally, we applied our algorithm to the 3D reconstruction of a synthetically generated dynamic scene which consists of flat homogeneous rectangles in front of a static backplane. For this experiment, we used N = 2400 frames of 100 × 100 pixels with π_{p,n} = 0.5, ∀(p, n), and T_r = 2500. During the first 800 frames, two objects are present. The first object is static while the second object describes a counterclockwise circular trajectory, centred at the centre of the image (rotation of 0.45° per frame). During this rotation, the second object completely occludes the first one, which then reappears. During the next 800 frames, the first object disappears suddenly and the second one describes the same trajectory as before (while its range remains unchanged), but its size varies. At frame 1600, a third object enters the field of view from the left and describes a horizontal movement (constant range), while the first object moves away from the backplane. Moreover, we set w_{p,n} = 0.5 for the pixels associated with the backplane and w_{p,n} = 0.7 for those associated with the two objects. This scenario is chosen to assess the robustness of the algorithm to occlusions and the appearance of new objects. The parameters of the algorithm have been set to M = 5, α = 0.1, ν = 0.5 and γ² = 100. The observed data, as well as the estimated range profiles, are provided in the second supplementary video associated with this paper (see Video 2, also available at https://youtu.be/NYeoZ99BdtM). As an example, Fig. 12 depicts estimated range profiles and associated uncertainties for three frames, namely before, during, and after the occlusion of one of the objects. Here, the range uncertainty is measured using the width of the confidence intervals (CIs), defined as 6 times (±3) the standard deviations of the approximating Gaussians. For the three frames, we observe, as expected, higher uncertainties at the boundaries of the small rectangles. Moreover, this figure illustrates that the proposed method is able to recover occluded objects when they become visible again.
As mentioned in Sections II and III, one important property of the method is that, for a given frame, all updates (except the smoothing step in line 11 of Algo. 1) can be performed in parallel, using only estimates from the previous frame. In this work, the method has been implemented using Matlab 2017b running on a MacBook Pro with 16 GB of RAM and a 2.9 GHz Intel Core i7 processor, leading to an average processing time of 4 ms per frame (with P = 10⁴ pixels).

V. CONCLUSION
In this work, we presented a first 3D reconstruction algorithm using individual photon detection events for the online analysis of dynamic scenes. Based on assumed density filtering, the proposed method is computationally efficient as the data are processed partly in a parallel fashion (pixels in a given frame) and sequentially (successive frames), with a fixed computational cost. The results presented in this paper have illustrated the flexibility of the method and its ability to handle static and slowly moving scenes (compared to the frame rate). Whilst the code has not been fully optimised, preliminary results obtained with a tailored implementation using a Titan Xp GPU indicate significant computational improvement (well below 1 ms per frame), paving the way to new and efficient solutions for streaming and processing data directly from actual SPAD detector arrays. While the proposed method is able to track relatively slow changes of the 3D profile, ongoing work includes the development of more sophisticated models, able to better predict the dynamics of the 3D profile and, in particular, sudden changes associated with the occlusion of objects or the appearance of new objects. This problem is also related to the potential presence of an unknown number of objects per pixel, as in [15] for instance, which should be addressed in future work, in particular for fast object detection. Although the range estimation does not seem to be significantly affected by the quality of the estimation of the probability of signal detection in the scenarios investigated, it would also be interesting to investigate in future studies whether the proposed methodology can be made more robust to extreme ambient illuminations where w_{p,n} ≪ 1.

Fig. 1: Probability density functions (p.d.f.s) of the time of arrival of a signal photon (red), for d = 750, and of a background photon (blue), for T_r = 1500 and s² = 200.

Fig. 6: Analysis of a dynamic scene (single pixel) with smooth changes of d_{p,n} and sudden changes of w_{p,n}. Top: examples of background (black) and signal (red) detection events for N = 2000 and π_{p,n} = 0.5. Middle: estimation of {d_{p,n}}_n for α = 0.01 and γ² = 100. Bottom: online estimates {w̃_{p,n}}_n (red lines) of w_{p,n}.

Fig. 7: Depth estimates obtained using the cross-correlation method for batches of N_0 = 10, N_0 = 50 and N_0 = 100 frames, compared to those obtained using O3DSP (see Section IV-A).

Fig. 8: Ground truth parameters used for assessing the performance of the proposed method for the reconstruction of a static scene.

Fig. 9: Online range estimation performance for a static scene: estimated instantaneous means (first and third columns) and ±3 standard deviation confidence intervals (CIs) (second and fourth columns) after N = 100 frames (top), N = 500 frames (middle) and N = 5000 frames (bottom). The two columns on the left-hand side (resp. right-hand side) have been obtained with (resp. without) the proposed ST model.

Fig. 10: Range RMSEs obtained with (red lines) and without (blue lines) the proposed spatiotemporal (ST) model for the static scene considered in Fig. 8.

Fig. 12: Example of range estimation for a dynamic scene with occlusion of one object. The full estimated range sequence can be seen in the supplementary Video 2.