Energy Efficient Approximate 3D Image Reconstruction

We demonstrate an efficient, accelerated, parallel sparse depth reconstruction framework using compressed sensing (CS) and approximate computing. Employing data parallelism for rapid image formation, the depth image is reconstructed from sparsely sampled scenes using convex optimization. Coupled with faster imaging, this sparse sampling significantly reduces the projected laser power in active systems such as light detection and ranging (LiDAR), allowing eye-safe operation at longer range. We also demonstrate how reduced precision can be leveraged to reduce the number of logic units in field-programmable gate array (FPGA) implementations of such sparse imaging systems, cutting logic units, memory requirements and power consumption by over 80% with minimal impact on the quality of reconstruction. To further accelerate processing, pre-computed components of the lower-upper (LU) decomposition and other linear algebraic quantities are used to solve the convex optimization problems. Our methodology is demonstrated by application of the alternating direction method of multipliers (ADMM) and proximal gradient descent (PGD) algorithms. For comparison, a fully discrete least squares reconstruction method (dSparse) is also presented. This demonstrates the feasibility of novel, high resolution, low power and high frame rate LiDAR depth imagers based on sparse illumination for use in applications where resources are strictly limited.


I. INTRODUCTION
The recovery of dense 3D images from LiDAR sensors is a challenging task, especially at high resolutions and frame rates, and with constraints on eye safety that limit the signal-to-noise ratio (SNR) and operating range, as both are highly dependent on the emitted power. Further, a large amount of raw data is produced by the sensor, requiring significant storage and processing power when expensive computational methods are employed. This prohibits use in resource constrained systems where limited power or physical space is available, for example in battery powered mobile devices, simultaneous localization and mapping (SLAM) using a drone [1], and augmented reality [2].
Depth reconstruction, the key to 3D information recovery, is often ill-posed and relies on the solution of inverse problems [3]. To enable real time perception at higher resolution in all three dimensions, significant computing power and memory are required to recover the desired information at scale [4]. To maintain high frame rates, massively parallel compressed sensing (CS) enables the capture and recovery of 3D images quickly enough for dynamic scenarios by reducing the number of measurements, and thus memory, making best use of the limited data bandwidth. Sparse illumination further enables eye-safe, long range, low power LiDAR with block based parallelism [5]. For resource efficient implementation, each processing unit needs as few logic and memory gates as possible to minimize both power consumption and area.
Approximate computing (AC) is a promising approach in resource constrained systems, enabling energy efficiency [6]. It has been widely adopted in areas such as signal processing [7], robotics [8], and machine learning [9]. Reduced precision (RP) computation, a common AC technique, represents arithmetic data with fewer bits throughout the computational stack [10] to reduce resources, specifically memory and logic units, and hence energy consumption. When computational units sit in close proximity to photon detectors, it is also crucial to decrease the processing chip power, reducing the direct heating of photon sensitive elements and thus minimizing device noise [11].
Using full precision, researchers have explored more efficient algorithms for compressive depth reconstruction via convex optimisation, such as the parallel alternating direction method of multipliers (ADMM) [5] and accelerated proximal gradient descent (PGD) [12]. Several authors have investigated the use of RP for convex optimization. For example, Rateb et al. [13] employed a fixed point precision quadratic programming (QP) ℓ1-solver, and a half precision floating point ADMM was investigated by Wills et al. [14]. An approximate stochastic gradient descent (SGD) method was introduced in [15] with fewer quantisation steps; this allowed a faster implementation with a slight reduction in performance. Although efficient hardware implementations of compressed sensing algorithms have been investigated, it is unclear how approximate depth reconstruction can be performed with massively parallel CS solutions using ADMM and PGD.
In this paper, we report the results of a study on approximate 3D image reconstruction using ℓ1 minimization based on RP for small scale least absolute shrinkage and selection operator (lasso) problems, with optimized lean ADMM and PGD algorithms on an FPGA platform. Both optimizers are suitable for compressed sensing of images, in our case depth images using time-of-flight sensing [5], [16]. A discrete solver for sparsely illuminated depth sampling, dSparse, is also investigated to assess the impact of additional measurements on resource usage under RP when compared to our CS solutions. By analyzing the effects of different RP decimal formats on logic and memory resources with respect to their qualitative and quantitative performance, the savings in power and data compression throughout the sensing and computing stack are evaluated.
Our contributions are summarized as follows:
- A lean, discrete ADMM algorithm for compressive depth imaging, ℓADMM.
- A lean, discrete PGD algorithm for compressive depth imaging, ℓPGD.
- A discrete least-squares method for sparsely illuminated depth recovery, dSparse.
- A comparative study of the effects of RP approximation on ℓADMM, ℓPGD and dSparse for 3D image reconstruction.
- A comparative study of energy efficient FPGA implementations for approximate, RP parallel depth reconstruction using our ℓADMM, ℓPGD and dSparse algorithms.

A. PARALLEL DEPTH RECONSTRUCTION
In a time-correlated single photon LiDAR imaging system, each pixel captures a set of returned photon events which are often stored in a histogram, h ∈ ℕ^l, of the type shown in Figure 1, where l is the number of bins or channels [18]-[20]. Therefore, for a vectorised array of N pixels, the raw measurements are H ∈ ℝ^(N×l), where l ranges from tens to thousands of bins depending on the system's timing and thus depth resolution. As larger and larger solid-state photon detectors are developed [11], [21], [22] and potential LiDAR resolution increases, the storage of raw measurements becomes challenging.
To address this, we earlier presented a depth reconstruction framework [5] using compressive sensing principles [23]-[25]. The original algorithm [16] was parallelised such that the pixel matrix of histograms was divided into smaller blocks, as illustrated in Figure 2, to enable fast processing times. The use of sparse structured illumination also reduces the total emitted power. This is beneficial as it makes LiDAR systems eye safe and, in the context of photon sensitive measurements, can lower the sensing device temperature. Noise is also reduced as temperature decreases [11].
As shown, the entire array of N sensing pixels is divided into blocks of size n < N, which independently sample the scene in a compressed strategy to form p measurements using predefined patterns stored in a sensing matrix A ∈ ℝ^(p×n). The depth image reconstruction problem is formulated as a set of parallel problems on each smaller pixel block. By exploiting sparsity in the signal, we can reconstruct two proxy quantities to recover depth. These are the so-called time-of-flight (ToF)-sum x_Q ∈ ℝ^n and intensity (photon count) x_I ∈ ℝ^n [5], [16]. By adopting our blocking scheme, information is recovered from smaller, independent measured feature vectors, y_Q, y_I ∈ ℝ^p (shown as Q and I in Figure 2), using the compressed sensing matrix, A, such that y_Q = A·x_Q and y_I = A·x_I, where p < n [5], [16]. Hence, y_Q and y_I are derived from sampled photon count histogram data such as that shown in Figure 1, measured by the LiDAR system. This leads to parallel reconstruction as a pair of optimization problems for each block:

x_Q = argmin_x (1/2)‖y_Q − Ax‖₂² + λ‖Φx‖₁,
x_I = argmin_x (1/2)‖y_I − Ax‖₂² + λ‖Φx‖₁,

where A ∈ ℝ^(p×n) is the compressed sensing matrix, ‖·‖₂² is the squared Euclidean norm, λ > 0 is a regularization parameter, ‖·‖₁ = Σᵢ|·ᵢ| is the pseudo ℓ1-norm and Φ is a linear transformation. This formulation is known as the lasso [27], which has been widely adopted in image reconstruction and de-noising [24], [27]-[29].
We assume sparsity in the discrete cosine transform (DCT) domain, i.e. Φ = DCT(·). Due to the small problem size of each parallel reconstruction, the computational cost per block is greatly reduced. Assuming a single primary reflective surface per pixel [5], [16], this enables good estimation of depth by computing x_D = x_Q ./ x_I (element-wise division) for each block. Our parallel CS depth reconstruction scheme increases frame rate and reduces pixel storage requirements. To further improve efficiency and reduce resource consumption, we now show how we can reduce the precision of the arithmetic operations when using ℓADMM, ℓPGD and dSparse.
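To make the blocking scheme concrete, the sketch below forms the compressed per-block measurements y = A·x in NumPy. The 64 × 64 toy image, the 0/1 random patterns and the random seed are illustrative assumptions, not the exact patterns of the cited framework:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 16, 8                       # a 4x4 pixel block vectorised to n = 16, p = 8

# One shared sensing matrix of pseudo-random binary patterns for every block
A = rng.integers(0, 2, size=(p, n)).astype(float)

# A toy 64x64 proxy image (e.g. the ToF-sum plane) split into 4x4 blocks
img = rng.random((64, 64))
blocks = [img[r:r + 4, c:c + 4].reshape(n)
          for r in range(0, 64, 4) for c in range(0, 64, 4)]

# Each block is sampled independently: y_Q = A . x_Q, fully parallelisable
measurements = [A @ x for x in blocks]

print(len(measurements))                   # 256 independent blocks
print(measurements[0].shape)               # (8,) measurements per block
print(f"compression ratio p/n = {p / n}")  # 0.5
```

Because every block shares the same A, the reconstruction of each block is an independent small problem, which is what enables the massively parallel mapping onto hardware.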

B. ARITHMETIC PRECISION AND APPROXIMATION
Reduced precision (RP) arithmetic is an approximation technique in which the bit width of a data structure is shortened, enabling low cost and energy-efficient hardware implementations for computation and memory logic [10]. By exploiting different arithmetic data types, we can scale the area and power relative to the bit width and increase the density of operational units within the same area and power allowance. Here, we use three main categories of representation, shown in Figure 3: floating point [30], [31], Q-format fixed point [32], and universal numbers (unum) [33]. Figure 3(a) shows the IEEE 754 floating point structure, which represents a number by its sign (1 bit), exponent (e bits) and mantissa (m bits). Figure 3(b) shows the binary structure of the Q-format fixed point, which instead uses a fixed number of bits to represent the integer and fractional parts of a number: its sign (1 bit), integer part (i bits), and fractional part (f bits). Figure 3(c) shows the binary structure of unum Type III (posit) as an alternative to IEEE 754. This represents numbers by their sign (1 bit), regime (p bits), exponent (e bits), and fraction (c bits), with a regime value k = −p if the first bit of the regime is 0 and k = p − 1 if it is 1.
Each of these arithmetic types has a different dynamic range of number representation. The floating point and unum representations achieve a wide dynamic range with fewer bits, while the fixed point representation allows simpler binary operations. This affects not only the performance of an algorithm, by limiting the type of operations that can be performed, but also its embedded implementation.
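The fixed point trade-off can be illustrated with a small Q-format quantisation model. This is a sketch only: real hardware stores two's-complement words with configurable rounding and saturation modes, and the function name and ranges here are our own:

```python
import numpy as np

def to_q_fixed(x, i_bits, f_bits):
    # Quantise to a signed Q(i.f) grid: round to the nearest multiple of
    # 2^-f, then saturate to the representable range (illustrative model)
    scale = 2.0 ** f_bits
    lo = -(2.0 ** i_bits)               # most negative representable value
    hi = 2.0 ** i_bits - 1.0 / scale    # largest representable value
    return np.clip(np.round(np.asarray(x) * scale) / scale, lo, hi)

x = np.array([0.1, -0.33, 0.77, 3.9])
xq = to_q_fixed(x, i_bits=2, f_bits=4)  # Q(2.4): steps of 1/16 on [-4, 3.9375]
err = np.abs(x[:3] - xq[:3])            # in-range values only
print(xq)
print(err.max() <= 2 ** -5)             # True: rounding error bounded by 2^-(f+1)
```

The bound 2^(−(f+1)) on the per-operation rounding error is what accumulates over loop iterations in the iterative solvers discussed below.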
In iterative algorithms, any approximation made by reducing the precision cascades through each loop iteration and errors accumulate. If the precision is insufficient, the accumulated error becomes unacceptable. For depth imaging with sparse signal assumptions, the iterative algorithm is performed twice, concurrently, for x_Q and x_I, which are combined in the final recovery step; the impact of accumulated approximation errors is therefore magnified. For example, with very low precision, significant signal components could be forced to zero too early, degrading subsequent computations.
We investigate the effects of approximation within the iterative approach of ℓ1 optimization using the alternating direction method of multipliers (ADMM) and proximal gradient descent (PGD) [29] algorithms, and within a non-iterative approach, the pseudo-inverse least-squares (dSparse) algorithm. This is applied to the parallel depth reconstruction problem, reducing resource usage on reconfigurable platforms such as an FPGA, minimising logic unit and memory use and increasing frame rate. We aim to find the optimal number representation and bit width for both the iterative and non-iterative frameworks by evaluating the effects of reduced precision on logic reduction, data compression, power, time and reconstruction quality.

III. ALGORITHMS FOR DEPTH RECONSTRUCTION

A. ℓEAN ADMM
The standard ADMM implementation, shown in Algorithm 1 [29], requires O(n) flops using a soft-thresholding operator S and a threshold λ_ADMM/ρ. The large matrix inversion term (Step 1 in Algorithm 1) remains constant throughout and can be pre-computed and stored in a look-up table. Further, one can empirically preset values for all the regularization weights.
By fixing the number of iterations, rather than check convergence at each iteration, the computation can also be reduced, depending on the complexity of the convergence check compared to an iteration. Moreover, using a fixed number of iterations has the considerable advantage of making our implementation on the FPGA deterministic and hence quantifiable in terms of resource usage.
We have modified the standard ADMM implementation [29] to give Algorithm 2, showing which pre-computations are feasible for the compressive sensing problem when we have full control over the sensing matrix, A. Careful analysis of the quadratic programming aspects of the ADMM implementation shows that the inversion involving A is often non-trivial. However, LU decomposition is used and these factors are inverted to compute the x-update. For a constant matrix A, these inversions can be pre-computed, {L⁻¹, U⁻¹} ∈ ℝ^(p×p). Again referring to Algorithm 2, Aᵀy ∈ ℝ^n can also be pre-computed, where λ_ADMM = τ·max(|Aᵀy|), with τ a scaling constant. This allows us to break down and roll out the algorithm into discrete steps involving only simple matrix arithmetic.
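The pre-computation idea behind ℓADMM can be sketched in NumPy as follows. This is an illustrative lasso ADMM, not the exact Algorithm 2: a dense p × p inverse stands in for the stored L⁻¹, U⁻¹ factors, and the parameter values (ρ = 1, τ = 0.1, 100 iterations) are our own assumptions:

```python
import numpy as np

def soft_threshold(v, k):
    # Element-wise soft-thresholding operator S_k
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def lean_admm_lasso(A, y, lam, rho=1.0, n_iter=100):
    # Lasso via ADMM with the constant terms computed once, outside the loop
    p, n = A.shape
    Aty = A.T @ y                                      # pre-computed A^T y
    Minv = np.linalg.inv(np.eye(p) + (A @ A.T) / rho)  # stored p x p inverse
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    for _ in range(n_iter):            # fixed iteration count: deterministic timing
        q = Aty + rho * (z - u)
        # x-update via the matrix inversion lemma: only p x p products needed
        x = q / rho - A.T @ (Minv @ (A @ q)) / rho ** 2
        z = soft_threshold(x + u, lam / rho)
        u = u + x - z
    return z

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16))                   # p = 8 measurements, n = 16
x_true = np.zeros(16); x_true[[2, 7, 11]] = [1.0, -0.5, 0.8]
y = A @ x_true
lam = 0.1 * np.max(np.abs(A.T @ y))                # lambda = tau * max|A^T y|
x_hat = lean_admm_lasso(A, y, lam)
```

With A fixed, everything outside the loop can live in a look-up table, which is the point of the lean formulation.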

B. ℓEAN PGD
With reference to the lasso depth reconstruction problem in Eq. (1), taking f(x) = ‖y − Ax‖₂² and g(x) = λ_PGD‖x‖₁, the proximal gradient method iterates the value of the reconstructed depth, x, by a gradient step on f followed by a proximal projection, where η_PGD is the step size. The standard procedure of PGD applied to (7) with proximal projection prox_λPGD(·) is described in Algorithm 3.

Algorithm 3. Standard Proximal Gradient Descent [34]
Input: A, y. Output: x. Here f̂_k(x, y) is the upper bound function of f based on majorization-minimization [35]:

f̂(x, y) = f(y) + ∇f(y)ᵀ(x − y) + (1/(2η_PGD))‖x − y‖₂².

To improve the efficiency, we again identify the constant components during iteration and pre-compute AᵀA and Aᵀy. In Algorithm 3, the gradient step η_PGD is changed during majorization-minimization, reducing η_PGD by a fractional factor β, which introduces uncertainty into the number of iterations. This causes unpredictable timing closure in circuit design. Therefore, we propose a lean PGD algorithm, shown in Algorithm 4, comprising the essential gradient descent and proximal projection, where W = AᵀA and V = Aᵀy are pre-computed. Instead of searching for the gradient step η_PGD dynamically, the majorization-minimization procedure is removed and an approximate method is adopted with a pre-defined gradient step and a fixed updating schedule. By carefully analyzing the ℓPGD algorithm, we introduce an additional fractional factor γ on β, which slows the decrement of η_PGD so that the optimal gradient descent step size is approached gradually. To further improve the timing closure, early termination by checking the objective difference ε between iterations is not allowed. With a fixed number of iterations, as in ℓADMM, an efficient implementation can be achieved due to its determinism.
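A minimal NumPy sketch of the ℓPGD idea follows: W = AᵀA and V = Aᵀy are pre-computed, and the step is decayed on a fixed schedule (β slowed by γ) instead of being searched by majorization-minimization. The initial step 1/‖W‖₂ and all parameter values are illustrative assumptions, not the values of Algorithm 4:

```python
import numpy as np

def soft_threshold(v, k):
    # Element-wise soft-thresholding operator (the proximal map of k*||.||_1)
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def lean_pgd_lasso(A, y, lam, n_iter=100, beta=0.6, gamma=4.0):
    # Lean PGD: fixed iteration count, fixed step schedule, no early exit
    n = A.shape[1]
    W = A.T @ A                          # pre-computed once for a fixed A
    V = A.T @ y
    eta = 1.0 / np.linalg.norm(W, 2)     # safe initial step 1/L, L = ||W||_2
    x = np.zeros(n)
    for _ in range(n_iter):
        # gradient of (1/2)||Ax - y||^2 is W x - V; then proximal projection
        x = soft_threshold(x - eta * (W @ x - V), lam * eta)
        eta *= beta ** (1.0 / gamma)     # gamma slows the geometric decay of eta
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16))
x_true = np.zeros(16); x_true[[2, 7, 11]] = [1.0, -0.5, 0.8]
y = A @ x_true
lam = 0.1 * np.max(np.abs(A.T @ y))
x_hat = lean_pgd_lasso(A, y, lam)
```

Because the step only shrinks from 1/L, every iteration is a descent step for the lasso objective, at the cost of slower convergence than a searched step; that is the approximation the fixed schedule buys determinism with.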

C. PSEUDO-INVERSE LEAST-SQUARES: dSPARSE
By distributing the sparse signal recovery across many small blocks of size n, the size of the sensing matrix, A, is small enough to consider increasing the number of measurements to p > n without a dramatic increase in size or sampling time for a constant block size. This problem is over-determined, but uses a sparse measurement scheme on each laser projection to reduce emission power. In a pattern sequence A = [A₁, …, Aᵢ, …, A_m], every pixel is activated at least once in the sequence to ensure that A has full column rank [36]. This allows depth recovery using an approximate least squares solution [37]. Let A_aux = (AᵀA)⁻¹Aᵀ ∈ ℝ^(n×p) with p > n; then an image for each signal is simply x = A_aux·y. Assuming a fixed sequence of patterns in A, A_aux can be pre-computed and stored. We call this approach dSparse [26]. This approach requires more pattern exposure cycles and thus has a slower sampling time than the full compressive formulations, but the sampling time is constant for each approach as defined on a per block basis, even as the number of blocks, and thus the full image resolution, increases.
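The dSparse recovery reduces to a single pre-computed matrix multiply per block, as this NumPy sketch shows. The random 0/1 patterns and sizes are illustrative; a real pattern sequence would be designed so that every pixel is activated at least once:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 16, 48                          # oversampled: p > n measurements per block

# Random sparse binary patterns; full column rank makes (A^T A) invertible
A = rng.integers(0, 2, size=(p, n)).astype(float)
assert np.linalg.matrix_rank(A) == n

# Pre-computed once and stored, as in dSparse
A_aux = np.linalg.inv(A.T @ A) @ A.T   # (A^T A)^-1 A^T, shape n x p

x_true = rng.random(n)                 # one block's ToF-sum vector
y = A @ x_true                         # p noiseless measurements
x_rec = A_aux @ y                      # recovery is a single matrix multiply
print(np.allclose(x_rec, x_true))      # True: exact for noiseless, full-rank A
```

With noisy measurements the same multiply gives the least-squares estimate rather than exact recovery, which is why oversampling (p > n) improves robustness.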

D. DEPTH RECONSTRUCTION
Solving the twin optimization problems in Section II using ℓADMM or ℓPGD, the inverse DCT is applied to both the recovered ToF-sum x_Q^cs and intensity x_I^cs, as shown in Eq. (9).
By dividing the outcomes of the twin optimization problems, the final depth pixel vector x D is recovered as shown in Eq. (10).
x_D = x_Q^cs / x_I^cs (for ℓADMM and ℓPGD), or x_D = x_Q^ds / x_I^ds (for dSparse), where x_Q^ds and x_I^ds represent the ToF-sum and intensity recovered by dSparse.
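The final recovery step, inverse DCT followed by element-wise division, can be sketched as follows. The DCT synthesis matrix is built explicitly for self-containedness, and the coefficient vectors are hypothetical stand-ins for solver outputs:

```python
import numpy as np

def idct_matrix(n):
    # Orthonormal DCT-II analysis matrix C; since C is orthonormal, the
    # inverse DCT is simply multiplication by C^T
    k = np.arange(n)[:, None]      # frequency index (rows)
    i = np.arange(n)[None, :]      # sample index (columns)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] /= np.sqrt(2.0)        # DC row scaling for orthonormality
    return C.T

n = 16
Dinv = idct_matrix(n)
c_Q = np.zeros(n); c_Q[0] = 4.0    # hypothetical recovered ToF-sum coefficients
c_I = np.zeros(n); c_I[0] = 8.0    # hypothetical recovered intensity coefficients
x_Q = Dinv @ c_Q                   # inverse DCT back to the pixel domain
x_I = Dinv @ c_I
x_D = x_Q / x_I                    # element-wise division yields the depth proxy
print(np.allclose(x_D, 0.5))       # True: constant DC-only example
```

In hardware, the pair of inverse DCTs maps to two MAT VEC MUL units and the division to a VEC DIV unit, mirroring this dataflow.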
Investigating the effects of reduced precision, we show how approximate computing (AC) enables a flexible hardware design for a parallel sparse reconstruction unit to enable efficient arrayed 3D imaging sensors for LiDAR applications. Using the small footprint enabled by reduced precision (RP), we next evaluate the balance between resource efficiency and accuracy of reconstruction for all three algorithms when implemented on re-configurable FPGA devices.

IV. AN EVALUATION OF THE PROPOSED APPROACH

A. COMPUTATIONAL COMPLEXITY
To analyse the computational complexity of ℓADMM, ℓPGD, and dSparse, the key parameters are the number of measurements, p, the signal size, n, and the maximum number of iterations, k_max. p and n determine the dimensions of the sparse sensing matrix A in Algorithms 2 and 4. The matrix A_aux in dSparse is derived from the sensing matrix, where p > n because of oversampling.
The arithmetic operations of addition/subtraction, multiplication, division, and memory addressing are counted separately. By considering the linear algebra computations for the three algorithms, the corresponding complexities of the three algorithms are compared in Table 1.
It is clear that dSparse has the lowest complexity as a function of n, p and k_max. ℓADMM and ℓPGD are of similar complexity, though ℓPGD is slightly lower. However, when comparing dSparse to the CS implementations we must take into account that its number of measurements, p, is much greater, so these figures do not imply that dSparse will be the most efficient.
Given the same n for depth image reconstruction using all three approaches, the number of parallel blocks is identical. Hence, the computational complexity within a single block is representative of the overall complexity. To further explore the parallel block based reconstruction, the arithmetic complexity for a single block with vector size n is illustrated in Figure 4, where a fixed number of measurements, p, and iterations, k_max, are considered. For ℓADMM and ℓPGD the number of measurements is p = 8, while for dSparse p = 48. For a fair comparison, five iterations are set for both the ℓADMM and ℓPGD algorithms for processing of all scenes. By carefully defining the algorithm parameters, such as ρ, λ, γ, and β, we can approach guaranteed convergence [38], [39] for the illustrative block size of n = 16.
It is clear from Table 1 and Figure 4 that the complexity is a linear function of the block vector size n, for fixed p and k_max. For all three cases, this shows the importance of minimising the block size for efficient implementation, provided accurate reconstruction is maintained. The other key observation is that the complexity of the ℓADMM approach is much higher than that of the alternative ℓPGD CS approach.
In contrast to solving (1) for p = 8 or fewer in parallel for each block (n = 16), a full frame image (n_F = 64 × 64 = 4096) would require p_F > 600 measurements, making reconstruction by ℓPGD or ℓADMM much more expensive and requiring four orders of magnitude more storage for A. Some CS implementations rely on implicit generation of the sensing matrix, A, but add more computational steps [40]. In general, CS techniques reduce the amount of sampled data but increase the computational complexity of the reconstruction. We try to reduce this impact on complexity, and thus resources, by applying reduced precision, where we assume a fairly small 4 × 4 sensing block with a block image vector size of n = 16.
Notice that, in Figure 4, although the comparison of complexity in arithmetic and memory operations suggests that ℓPGD is superior given the fixed pre-defined condition of A, the performance and cost are not exactly reflected by the complexity. Detailed analysis is given in the following sections.

B. ACCURACY OF RECONSTRUCTION
The central argument of this work is that we can achieve almost equivalent quality of depth reconstruction with significant savings in resource consumption using reduced precision. Hence, to justify that argument, we first assess the quality of approximately reconstructed depth images using a number of different scenarios.
We use two synthetic benchmarks: one indoors at short range, of a room with furniture [41]; the other outdoors at longer range, simulating an automotive environment [42]. The other histograms that we use are real data collected at our university using a photon counting LiDAR system as part of an earlier study into underwater object detection and classification [43]. We emphasise that the source data in all cases is a 2D matrix of histograms of the form shown in Figure 1.
For the real data, that is what we collect; for the synthetic data, we create the histograms from the known depth values using a simulation package that we have used successfully in work on surface detection using deep learning and compressive sensing [5], [44]. The final signal is simulated as a Poisson counting process, h = P(s + 2), where s ∈ ℕ^l contains the mean photon counts, with a simulated instrumental response similar to the system used in [43] at the respective surface return locations in the histogram [18], [44] and a mean background level of around 2 counts per bin.
We consider all approaches presented in Section III, i.e. the parallel CS ℓADMM and ℓPGD algorithms and the parallel, over-sampled least squares approach, dSparse, at various precisions using floating point (FP), fixed point (FXP) and unum posit (UP) arithmetic. In all cases, full double precision floating point computation using dSparse is the baseline for the quality-of-reconstruction metrics, i.e. PSNR and SSIM [45]. For the CS algorithms, the data are re-sampled using pseudo-random binary patterns. For the over-sampled dSparse algorithm, random patterns are also used but each pixel is sampled at least once. In all cases, if the precision is reduced below the values shown, the image quality is considered unacceptable.
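For reference, the PSNR metric can be computed in a few lines of NumPy. This helper and its data-range convention are our own; the reported results may use a library implementation with different conventions, and SSIM is omitted here for brevity:

```python
import numpy as np

def psnr(ref, img, data_range=None):
    # Peak signal-to-noise ratio in dB of img against the reference depth map
    ref = np.asarray(ref, dtype=float)
    img = np.asarray(img, dtype=float)
    if data_range is None:
        data_range = ref.max() - ref.min()   # assumed peak-value convention
    mse = np.mean((ref - img) ** 2)
    return np.inf if mse == 0.0 else 10.0 * np.log10(data_range ** 2 / mse)

ref = np.array([0.0, 1.0, 0.5, 0.25])
print(psnr(ref, ref))            # inf: identical images
print(psnr(ref, ref + 0.1))      # ~20 dB: mse = 0.01 with unit data range
```

PSNR measures per-pixel fidelity, while SSIM captures structural similarity; the evaluation below uses both because blocking artefacts can depress SSIM even at moderate PSNR.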
The first observation is that the results for dSparse are clearly better than those of the CS approaches. For example, SSIM is greater than or equal to 0.99 in all cases, compared to 0.75 or less for the CS cases, and the PSNR is also much higher, above 60 dB compared to a best case of 30.57 dB (UP20). Visually, the images are less noisy and show far less "blocking". The block artefacts in ℓADMM and ℓPGD are primarily due to the chosen basis and the use of an identical pattern sequence for all blocks. However, sharing the pattern sequence across all blocks reduces storage, since the sequence need only be stored once rather than separately for each block.
As shown, the precision of dSparse can be reduced to 18 bits for both floating point and unum posit arithmetic, and to 38 bits for fixed point arithmetic, without any noticeable effect; beyond these points the images degrade noticeably. Compared to full precision CS depth reconstruction, both ℓADMM and ℓPGD can reduce the arithmetic bit width to 20 bits for floating point. With fixed point arithmetic, ℓPGD can use 22 bits, 1 bit less than ℓADMM. With unum posit, ℓPGD can be reduced to 22 bits, 2 bits more than ℓADMM.
The corresponding results for the outdoor benchmark scenes [42] are illustrated in Figure 6 for ℓADMM (ρ = 1.2, α = 1.4). The PSNR values are again much higher for dSparse, but we observe a marked difference between the indoor and outdoor simulations. That is because we include a full optical laser propagation and reflection model in our simulations, so that the peak power at longer range is much reduced in comparison with the noise level at the receiver. What is important, however, is that the algorithms can still reconstruct the depth images at the lower PSNR levels.

FIGURE 5. A visual comparison of the final depth map reconstruction using the several data types and precisions, using simulated photon count data created from a real depth scene [41]. The image size is 64 × 64.

FIGURE 6. A visual comparison of the final depth map reconstruction using the several data types and precisions, using simulated photon count data created from a real depth scene [42]. The depth image size is 128 × 128.

2) REAL LIDAR DATA
These underwater images were all collected in a tank with a stand-off distance of 1.33 metres, slightly less than the total tank length, using a time-correlated single photon counting LiDAR system [17]. Our methodology was identical to the synthetic cases. For the first scene, which consists of a set of circular target blocks on a sandy base, the results are illustrated in Figure 7 for ℓADMM (ρ = 1.2, α = 1.4, τ = 10⁻⁹) in (c)-(e), ℓPGD (γ = 4, β = 0.6) in (h)-(j) and dSparse in (m)-(o) respectively. In the second example, of a tethered metal ball bearing (simulating an underwater mine) hidden behind foliage, the results are shown in Figure 8 for ℓADMM (ρ = 1.2, α = 1.4). In general, our observations on the real data confirm the synthetic cases, i.e. the dSparse algorithm produces better results than the CS methods, and the latter also show more pronounced blocking effects. The differences between the CS methods are slight, as indeed they should be, since they minimise the same objective function by different methods.
In terms of reduced precision, we again observe that significant reductions in bit width are possible, for example, referring to Figure 7. In summary, the parallel CS approaches, ℓADMM and ℓPGD, and the least-squares approach, dSparse, are all able to reconstruct the depth information for simulated and real photon count data at significantly reduced precision. From the algorithmic point of view, dSparse dominates depth reconstruction performance as measured by SSIM, since the data is over-sampled and the reconstruction is computed by a direct least-squares solution to the inverse problem. Yet both CS approaches retain relatively high reconstructed depth image quality, as measured by PSNR, when compared to dSparse. It is also important to re-affirm that a major justification for the use of CS in active LiDAR systems is to reduce the projected power, and thus allow eye-safe operation at longer range than would be possible using a full dSparse or other 3D reconstruction method. Given the comparative evaluation of depth reconstruction quality between full and reduced precision, it is clear that the approximate solutions from all three methods are only slightly degraded compared to the corresponding full precision solutions. This enables cost savings in hardware implementations, as illustrated in the next section.

V. IMPLEMENTATION ON AN FPGA PLATFORM
Our goals are to save power and memory on a dedicated sensor platform using an FPGA architecture, and to support designs using variable precision. In this section we describe how we design and synthesise the hardware for the three different algorithms, ℓADMM, ℓPGD and dSparse. To support these algorithms, we develop functions using a C++ template for arbitrary approximate algebra operators based on the Xilinx fixed-point [46] and unum SoftPosit [47] libraries. The customized floating point solutions are based on post-processing of the standard implementation from the Xilinx Vivado High Level Synthesis (HLS) suite. All arithmetic data types implemented in C++ are synthesized for the Xilinx FPGA UltraScale+ ZCU106 using Vivado 2019.1.

FIGURE 7. A visual comparison of the final depth map reconstruction using the several data types and precisions, using real underwater scene photon count data [17]. The image size is 192 × 192.

A. APPROXIMATE ALGEBRA MODULES
In Figure 9, the matrix/vector arithmetic modules, including vector addition/multiplication/division/comparison and matrix to vector multiplication, are listed.
The vectors and matrices are buffered through a FIFO data path while the scalar number is transmitted by a single register line data path. The approximate algebraic modules are able to perform scalar arithmetic operations using the different data types and precision. In Figure 9, only the module interfaces are shown, neglecting the external memory addressing of the matrix and array elements.
VEC ADD performs the addition of two arrays using the same specific data type with pre-defined precision.
Since the inputs are two vectors, two FIFO data paths are used to stream the data to the scalar arithmetic, where the computation is pipelined for each pair of inputs. The output is streamed through a single FIFO data path to successive modules. The same process applies to VEC SUB and VEC DIV. VEC SCALAR ADD has one array input and one scalar input, both using the same data type with pre-defined precision. Each time an element of the input array is added to the input scalar, the outcome is streamed through the output FIFO. This process is also applied to VEC SCALAR SUB, VEC SCALAR MUL, and VEC SCALAR CMP. VEC SCALAR CMP not only compares the elements of the input array but also selects the maximum as the output. The MAT VEC MUL module multiplies a matrix by an array using column-wise element multiplication and accumulation of the same specific data type with pre-defined precision. All the modules are designed with the Xilinx HLS stream pragma specifically for the AXIS bus interface. The arithmetic units are shown as generic basic blocks, while the different data types (FP, FXP, or UP) and precisions are tailored on demand for different depth reconstruction methods and scenes.

FIGURE 8. A visual comparison of the final depth map reconstruction using real underwater scene photon count data for a real depth scene from [17]. The image size is 192 × 192.

B. APPROXIMATE DEPTH RECONSTRUCTION ARCHITECTURE
First, we consider the CS approaches. By composing the approximate algebra modules of Section V in different combinations, the implemented iterative architectures for Algorithms 2 and 4 are illustrated in Figures 10 and 11 respectively as dataflow diagrams. External memory addressing is represented by the arrow links. The input data is read from external memory into the buffer of the corresponding modules while the data is streamed between modules. The output is only written to external memory once the iterations are concluded.
For 'ADMM and 'PGD, the CS depth reconstruction architecture is illustrated in Figure 12, where the outcomes from the twin solvers are processed within a parallel data path. Each data path contains a pair of MAT VEC MUL units to pre-process the algorithm input and post-process the output by inverse DCT. The outcomes of both data paths are combined in a VEC DIV unit to generate the final depth values.
The simpler dataflow architecture of dSparse is illustrated in Figure 13, where the scalar arithmetic operations are performed in two nested iterations. The inner iteration performs two parallel column-wise matrix-to-vector multiplications on the input measurement vectors y_i and y_q, while the outer iteration generates the depth pixels x by dividing x_p by x_i over the length of the sampling block. The output pixels are buffered for writing to external memory or other units.
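A software sketch of these two nested iterations, under stated assumptions: `C` is a hypothetical name for the sensing/reconstruction matrix applied to both measurement vectors, and the guard against division by zero is an addition for the sake of a runnable example, not part of the architecture in Figure 13.

```python
def dsparse_block(y_i, y_q, C):
    """dSparse nested iterations: the inner loops are two parallel
    column-wise matrix-to-vector products; the outer loop divides
    x_p by x_i element-wise to produce the block's depth pixels."""
    n = len(C)
    # inner iteration: two parallel matrix-to-vector multiplications
    x_p = [sum(C[k][j] * y_q[j] for j in range(len(y_q))) for k in range(n)]
    x_i = [sum(C[k][j] * y_i[j] for j in range(len(y_i))) for k in range(n)]
    # outer iteration: depth pixels x = x_p / x_i over the block length
    eps = 1e-12  # numerical guard, illustrative only
    return [p / (i if abs(i) > eps else eps) for p, i in zip(x_p, x_i)]
```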
Notice that throughout the hardware synthesis process, identical modules are folded and reused for resource optimization, so each module is counted once in the final implemented resource utilization.

C. RESOURCE UTILIZATION AND PERFORMANCE
The resource utilization of the depth reconstruction architectures shown in Figures 12 and 13 is given in Table 2 for a sensing block, using the several custom floating point, fixed point and unum posit representations across all scenarios. The metrics are compared to full double precision, with a data ratio given by data ratio = (p/n) · (bits/64), with constant p and n, where bits indicates the total bit width of the corresponding arithmetic type. Using the Vivado power estimation tools, the FPGA system has a consistent static power consumption baseline of roughly 0.6 W for all three depth reconstruction approaches. The savings in dynamic power consumption above this baseline are analyzed.
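The data ratio formula above can be stated directly in code. This is a transcription of the formula in the text, normalizing against the 64-bit double precision baseline; the example values in the comment are illustrative, not taken from Table 2.

```python
def data_ratio(p, n, bits):
    """data ratio = (p / n) * (bits / 64), relative to FP64.

    p/n is the (constant) compressive sampling ratio and `bits`
    is the total bit width of the chosen arithmetic type."""
    return (p / n) * (bits / 64.0)

# e.g. at the same sampling ratio p/n, a hypothetical 16-bit type
# reduces the data ratio to a quarter of the FP64 baseline
```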
For ℓADMM, there are dramatic savings of above 65% in LUT, 90% in DSP, and 100% in BRAM using customized floating point (FP), enabling more than 75% power saving and a 60% data compressive ratio reduction. The processing latency using customized floating point is reduced by a factor of almost 2 compared to full precision. With fixed point arithmetic (FXP), there are similar resource savings of greater than 60% in LUT, 100% in BRAM, and around 40% in DSP, enabling around 75% power saving and more than 40% data reduction. The processing latency using fixed point is further reduced by a factor of nearly 6. Using unum posit, there is over 70% more LUT usage than full precision, while DSP usage is reduced by 40% and BRAM usage by 100%. The power consumption is also reduced by over 80%, with more than 50% data reduction. It is not as fast as either customized floating point or fixed point, but still gains over 30% processing latency reduction compared to full precision.

FIGURE 11. Approximate ℓPGD iterative architecture.
For ℓPGD, there are also significant savings of above 75% in LUT, 85% in DSP, and 100% in BRAM for customized floating point, with over 80% power saving and a 65% data compressive ratio reduction. The processing latency is also reduced by around 45%. With fixed point arithmetic, the resource utilization is reduced by over 65% in LUT and 100% in BRAM; however, slightly more DSP usage, around 10%, is introduced. The power consumption is reduced by over 75% while the data ratio is reduced by over 40%. The latency of fixed point arithmetic is reduced by nearly 95%, which is over 10x faster than full precision. Using the unum posit for ℓPGD, there is about 10% more LUT usage and about 20% less DSP usage than full precision. The power consumption is reduced by over 85%, with above 25% less processing latency.
For dSparse, similar trends are observed. With customized floating point, resource usage is reduced by 85% in LUT and 80% in both DSP and BRAM, and an 87% power saving is achieved with over 67% data ratio and 75% processing latency reductions. Using fixed point arithmetic, resource usage is reduced by 90% in LUT, 60% in DSP, and 73% in BRAM. Power is reduced by over 92%, while the data ratio and processing latency are both reduced by around 50%. The unum posit data type for dSparse saves 45% in LUT, 50% in DSP, and 46% in BRAM, with savings of 89% in power and 65% in data ratio. The latency using unum posit is slightly higher than full precision, by over 12%.
Hence, in all cases, significant reductions in resources, data usage, power and latency are demonstrated. This enables great flexibility in system design for 3D imaging systems with control over resource use and power efficiency. Comparing the three algorithms, the two CS approaches have similar logic and memory resource utilization, but dSparse outperforms both in terms of logic resource use and dynamic power. Indeed, dSparse drops to less than 100 mW in the best case. This is due to the much simpler computational operations shown in Figure 13. However, using dSparse, the data addressing is more complex for matrix-to-vector multiplication, so more BRAM storage units are used at all levels of reduced precision below the full FP64 case. dSparse has a much higher data ratio, which is due to the lengthened data acquisition stage and the increase in the sensing matrix size.
In our FPGA implementations, we have considered processing of blocks of pixels corresponding to the size of the sensing pattern, which is 4 × 4. This allows a maximum frame rate of well above 3 kHz for ℓADMM, over 7.5 kHz for ℓPGD, and 11 kHz for dSparse if all blocks are processed in parallel. Given the image sizes, there are 256 blocks in Figure 5, 1024 blocks in Figure 6, and 2304 blocks in Figures 7 and 8. Depending on the number of parallel processing blocks, the frame rate can be adjusted by trading off the overall resources and power consumption. For example, if 32 blocks are processed in parallel for the scenario in Figure 5, this reduces the frame rate to 375 Hz for ℓADMM, over 937.5 Hz for ℓPGD, and 1.375 kHz for dSparse. This is comparable to the hypothetical x86 processing times presented in [5], but here we have been able to show synthesised FPGA results that clearly demonstrate the effectiveness of RP in saving resources for the depth reconstruction problem.
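The frame rate scaling described above follows directly from the number of sequential passes over the blocks. A minimal sketch of this relationship, with the function name and the ceiling of the pass count being our own framing rather than the paper's:

```python
import math

def frame_rate(block_rate_hz, total_blocks, parallel_blocks):
    """Achievable frame rate when only `parallel_blocks` of the
    `total_blocks` are processed concurrently: the blocks are
    handled in sequential passes, each pass dividing the rate."""
    passes = math.ceil(total_blocks / parallel_blocks)
    return block_rate_hz / passes

# For the 256-block scene with 32 parallel blocks, 8 passes are
# needed, so a 3 kHz per-block rate yields a 375 Hz frame rate.
```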

E. EFFICIENCY EVALUATION
The efficiency of RP approximated depth reconstruction is also given in Table 2. For the 4 × 4 sensing pattern of 16 depth pixels, two major metrics are considered: throughput, in reconstructed depth pixels per second, and energy consumption, in microjoules per reconstructed depth pixel. Considering the latency of processing a single sensing block (a 4 × 4 pixel block), the system throughput in pixels per second is

Throughput = 16 / Latency, (12)

while the energy of pixel processing in μJ per pixel is

Energy = (Latency · Dynamic Power) / 16.

As shown, when working with RP, the efficiency of ℓADMM is not comparable to the other two approaches, as it has the lowest throughput and the highest energy consumption for all the different scenes. On the other hand, RP approximated ℓPGD using fixed point has the highest throughput, but RP approximated dSparse using fixed point has the advantage of lower overall energy consumption, especially for the synthetic and real LiDAR scenarios in Figures 5(n), 6(n), and 7(n), with only 0.11 μJ per reconstructed pixel. However, the ℓPGD algorithm achieves both higher processing throughput, 1.6 million pixels per second, and lower energy consumption, 0.09 μJ per reconstructed pixel, for the synthetic scenario in Figure 4(i). The numbers highlighted in bold in Table 2 indicate the top selected design metrics across all three approaches for different scenes.
Notice that all the implementations use the HLS method for the purpose of prototyping comparisons, without heavy design optimization, and could potentially be improved with fine-grain RTL design. The prototyping outcome indicates that a simpler algorithm does gain advantages in resource utilization and power consumption with approximation, while throughput and energy efficiency may vary depending on the generated architecture. Nevertheless, one can use the metrics from the HLS outcomes as guidelines and trade them off to meet particular design requirements.

VI. CONCLUSION
We have presented the effects of applying reduced precision as an approximate computing technique to depth reconstruction using small scale ℓ1 (ℓADMM) and ℓ1 (ℓPGD) parallel blocked compressive sensing, and a discrete least-square solver (dSparse).
We have demonstrated significant savings in logic resources and much improved latency using reduced precision floating point, fixed point and unum posit data types, with minimal degradation of reconstruction, in comparison with the commonly used double precision floating point arithmetic. We achieved very fast per-block computation times, reducing latency from 0.48 ms to 0.01 ms, and significant power savings, by as much as 95%, in the best cases. In the compressed sensing cases, we achieved additional savings by further reducing the overall data bandwidth. This demonstrates a pathway to dedicated logic for depth computation in resource constrained devices. The approach also shows potential ways of mitigating the computational complexity of compressed sensing techniques, while benefiting from low power, eye-safe sparse illumination and lowering the data bandwidth throughout the computation stack.