Fast-Gated 16 × 16 SPAD Array With 16 on-Chip 6 ps Time-to-Digital Converters for Non-Line-of-Sight Imaging

We present the design and characterization of a fully integrated array of $16 \times 16$ Single-Photon Avalanche Diodes (SPADs) with fast-gating capabilities and 16 on-chip 6 ps time-to-digital converters, which has been embedded in a compact imaging module. The sensor has been developed for Non-Line-Of-Sight imaging applications, which require: i) a narrow instrument response function, for centimeter-accurate single-shot precision; ii) fast-gated SPADs, for time-filtering of directly reflected photons; iii) high photon detection probability, for acquiring faint signals that have undergone multiple scattering events. Thanks to a novel multiple differential SPAD-SPAD sensing approach, the SPAD detectors can be activated in less than 500 ps and the full-width at half maximum of the instrument response function is always less than 75 ps (60 ps on average). Temporal responses are uniform throughout the gate window, showing just a few picoseconds of time dispersion when 30 ns gate pulses are applied, while the differential non-linearity is as low as 250 fs. With a peak photon detection probability of 70% at 490 nm, a fill-factor of 9.6% and up to $1.6 \cdot 10^{8}$ photon time-tagging measurements per second, the sensor meets the demand for fully integrated imaging solutions optimized for non-line-of-sight imaging, cutting exposure times while also optimizing size, weight, power and cost, and paving the way for further scaled architectures.

retrieve its distance from the sensor [1]. In order to go beyond LiDAR, non-line-of-sight (NLOS) imaging technique relies on measuring the TOF of photons, although instead of collecting the light directly scattering from the target to the sensor, it exploits indirect diffuse reflections to gather information out of the line-of-sight [2]. With advanced computational methods for processing the photon TOF [3]- [7], it has been demonstrated that it is possible to reconstruct scenes that occluders hide from the direct view by collecting light that has scattered multiple times, thus leading to see around corners.
This capability of looking beyond the direct line-of-sight will make it possible to remotely survey areas that are either difficult or dangerous to access, making NLOS imaging compelling for several real-world applications, such as military intelligence, security and surveillance, search and rescue or even lunar subsurface exploration [8]. Being able to collect information about NLOS targets could even improve the spatial awareness of autonomous vehicles as well as industrial monitoring systems, allowing them to react proactively to hazards or hidden threats before they even enter the direct field of view.
Among all the different approaches to NLOS imaging that have been presented and demonstrated in the literature in the last decade [9]- [14], in [2] the authors proved the feasibility of retrieving the 3-D shape of hidden objects by using single-photon avalanche diodes (SPADs). SPADs are essentially p-n junctions reverse-biased above their breakdown voltage [15], [16], where absorbed photons can generate a self-sustaining carrier multiplication process (avalanche), resulting in a macroscopic current to be sensed as a digital pulse. Fig. 1 shows an NLOS setup where a pulsed laser illuminates one point of a surface (i.e., the relay wall), which is directly visible from both the hidden target and the observer. The laser light diffuses as a spherical wave towards the detector as well as the target to be imaged, and a few photons interacting with the target back-scatter towards the relay wall, finally reaching the detector after three (or more) bounces. Only after processing the TOF of the detected photons can a 3-D reconstruction of the target be retrieved. The returning signal is relatively weak but, more importantly, it is distributed over the entire visible surface and reaches the detector after the strong first bounce from the relay wall. Therefore, NLOS imaging systems require many pixels able to determine both the location and the precise arrival time of the few photons returning from a large area of the relay wall after such a strong first bounce.
As such, NLOS sensors require a list of features that have not been combined in any existing LiDAR system: i) swift activation of the SPADs (i.e., the fast-gating technique), so as to rearm the detectors within a few hundred picoseconds after the bright direct reflection from the relay wall and not miss the TOF information of subsequent photons back-scattering from the hidden object; ii) precise single-shot photon timestamping, for achieving reconstructions with high depth resolution; iii) single-photon sensitivity with high detection efficiency, to harvest as many signal photons as possible, as light becomes extremely weak after consecutive scattering events. The first proof-of-principle NLOS experiments were based on either a fast-gated single-point SPAD [2] or a small linear array of fast-gated SPADs, paired with external time-correlated single-photon counting (TCSPC) units [17]. In such works, spatial resolution is extended with a galvanometer mirror positioning system that scans many illumination points on the relay wall. As a consequence, state-of-the-art NLOS setups are currently bulky and expensive and, most importantly, require long acquisition times to achieve high-resolution reconstructions.
Here, we present a compact SPAD camera based on a novel monolithic 16 × 16 SPAD array with integrated 6-ps TDCs and fast-gating capabilities, i.e., each SPAD is switched ON in less than 500 ps. Time-of-flight is measured with an average timing jitter of 60 ps full-width at half maximum (FWHM), without requiring any external TCSPC unit. Our new NLOS SPAD camera leads to shorter exposure times, thus enabling video-rate reconstructions as well as offering room for lowering the active illumination optical power.

II. SENSOR ARCHITECTURE
A block diagram of the designed 16 × 16 SPAD sensor is reported in Fig. 2. Pixel organization follows a two-step hierarchical structure: the building block of the array is a macrocell, i.e., a group of 4 × 4 SPADs, in which 4 clusters of pixels are laid out as 2 × 2 sub-arrays, named microcells, each with its own gating and sensing circuitry and common logic. Each macrocell propagates the first timing event (i.e., an avalanche triggered inside one of its SPADs) and its coordinates to the corresponding TDC by means of a fixed priority arbiter (FPA) tree, which also asserts a validation signal to manage collisions. Given the photon-starved nature of NLOS signals, only 16 TDCs have been integrated, each one shared among the 16 SPADs of a macrocell, in order to save area and power. The converters have been placed outside the imaging area on one side of the array, as their complex and area-consuming multistage interpolation architecture is not suitable for in-pixel integration. Each TDC provides the pixel address and the photon TOF to an output dual-channel serializer, which transfers data out of the integrated circuit (IC) via 32 output pads running at 200 MHz, for a total bandwidth of 6.4 Gbit/s.
All the electronic circuits of the acquisition chain will be thoroughly described in the following sections.
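As a rough illustration of the two-step hierarchy described above, the following sketch maps a pixel position in the 16 × 16 array to a macrocell (and therefore a TDC), a microcell within that macrocell, and a SPAD within that microcell. The function name `locate` and the raster ordering used at every level are assumptions made for illustration only; the actual on-chip numbering is not specified here.

```python
def locate(row: int, col: int) -> tuple:
    """Map a pixel of the 16x16 array to (macrocell, microcell, spad) indices.

    Assumption: raster (row-major) ordering at every level of the hierarchy.
    """
    if not (0 <= row < 16 and 0 <= col < 16):
        raise ValueError("pixel coordinates must be in 0..15")
    # 4x4 grid of macrocells, each one served by its own TDC
    macro = (row // 4) * 4 + (col // 4)
    # position of the pixel inside its 4x4 macrocell
    r, c = row % 4, col % 4
    # 2x2 grid of microcells inside the macrocell
    micro = (r // 2) * 2 + (c // 2)
    # 2x2 SPADs inside the microcell
    spad = (r % 2) * 2 + (c % 2)
    return macro, micro, spad


if __name__ == "__main__":
    print(locate(5, 6))
```

The 2-bit microcell index and the 2-bit SPAD index together form a 4-bit in-macrocell address, matching the number of SPADs that share one converter.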

A. Multiple Differential SPAD-SPAD Sensing
Time-gated detectors are intended for all applications requiring time-filtering of the incoming light. Typical gating approaches, however, simply mask the avalanches generated within well-defined time intervals. Many SPAD-based imagers capable of masking the avalanche signal of the SPADs have been presented in the literature [18]- [22]. Although this is a simple and effective solution to avoid triggering the pre-processing electronics, the SPADs are not turned off and still operate in free-running mode, i.e., they are always light-sensitive, meaning that they can be triggered and blinded by undesired strong light pulses preceding the faint signals of interest. Therefore, such a gating approach is not suited for NLOS applications, as spurious first-bounce reflections from the relay wall always precede the back-scattered photons that have interacted with the hidden scene. In order to completely disable the SPADs during direct reflections, we opted for a different gating approach, which has been reported in the literature for a few SPAD arrays [23]- [26]: the SPAD is turned off by lowering its bias voltage below the breakdown level. When SPADs must be enabled right after the first spurious reflection, OFF-to-ON transitions need to be as fast as a few hundred picoseconds, hence the name "fast gating". However, rapidly raising the SPAD bias induces strong and undesired voltage fluctuations at the avalanche sensing node. Such disturbances can be effectively and robustly rejected with a comparator-based differential sensing approach, which is also ideal for minimizing the timing jitter by sensing the avalanche with a low threshold [27].
A typical sensing scheme couples each SPAD with a dummy device mimicking the SPAD parasitic capacitance, thus cancelling the unwanted feedthrough pulses, which appear as common-mode variations at the input of the comparator [28]. Nonetheless, this SPAD-dummy approach is not ideal for monolithic array integration, as the dummy occupies roughly the same area as the SPAD, thus halving the achievable fill-factor and doubling the power consumed by the gating operation. As shown in [23], a better approach for scaling to multi-pixel architectures is to compare the outputs of adjacent detectors by arranging a SPAD-SPAD differential sensing. Such a scheme is very effective in rejecting the feedthrough, but is mainly suited for applications requiring narrow gates: when two avalanches are triggered simultaneously in both SPADs of the same pair, the comparator discards both events, but when the gate is narrow (a few nanoseconds) its falling edge disables both SPADs and promptly quenches such avalanches, even though they were missed by the read-out circuit [15]. However, NLOS imaging typically relies on longer gate pulses (e.g., tens or even hundreds of nanoseconds), as the gate-ON duration must be selected according to the depth of the scene to be reconstructed. Hence, comparing the outputs of two adjacent SPADs may cause concurrent avalanches to be missed and not quenched rapidly.
In order to improve the SPAD-SPAD approach, in the SPAD array presented here we extended it to a cluster of 2 × 2 SPADs (named microcell) and arranged the circular differential scheme shown in Fig. 3. Assume that all 4 comparators C0-C3 are unbalanced so as to provide a grounded default output state. When the outputs of 4 adjacent SPADs are compared with this circular scheme, even if both SPADs driving a comparator fire simultaneously, one of the other comparators is able to spot the unbalance. As long as at least one SPAD does not fire synchronously with all the others, this scheme can identify the onset of an avalanche and quench it correctly, solving the issue of missed detections and the consequent lack of quenching in case of concurrent events. Such ability to unambiguously distinguish the correct event even when detections happen simultaneously (also in case of optical crosstalk events [29]) also eases the synthesis of an identification logic, which attributes each photon event to the first firing SPAD of the microcell or, when events happen within a few tens of picoseconds, follows a hardware-fixed priority list.
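The behaviour of the circular scheme can be explored with a small idealized model. The sketch below is not the circuit itself: it assumes that comparator Ci raises its output PHi when SPADi fires while the next SPAD in the ring does not, and that simultaneous avalanches cancel perfectly as a common-mode signal. The function name and this pairing are assumptions chosen to be consistent with the behaviour described for Fig. 3.

```python
from itertools import product


def comparator_outputs(fired):
    """Idealized PH0..PH3 for a microcell with circular differential sensing.

    fired: sequence of 4 booleans, True if SPADi fires in the same instant.
    Assumption: Ci compares SPADi against SPAD(i+1 mod 4) and only trips
    when SPADi fires alone within that pair.
    """
    return [bool(fired[i] and not fired[(i + 1) % 4]) for i in range(4)]


if __name__ == "__main__":
    # Enumerate every firing pattern and show which comparators trip
    for pattern in product([False, True], repeat=4):
        print(pattern, comparator_outputs(pattern))
```

In this model every pattern in which at least one SPAD fires and at least one stays quiet trips one or more comparators, which is the property exploited to quench concurrent avalanches; only the case of all four SPADs firing together goes undetected.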
When an avalanche is triggered in just one pixel, e.g., SPAD1, its sensing comparator C1 detects the event and PH1 rises, while PH0 stays grounded. If two (or even three) adjacent SPADs fire simultaneously, e.g., SPAD1 and SPAD2, only one comparator can change its output state, while other conditions are forbidden: PH0 remains grounded as well as PH1, since both inputs of C1 rise at the same time, whereas C2 senses the current pulse as a differential input unbalance and only PH2 rises. Only when two non-adjacent SPADs detect photons simultaneously, e.g., SPAD1 and SPAD3, can two different comparators change their output state at the same time (PH1 and PH3 both rise), so no fixed priority applies in such conditions. Nonetheless, PH1 and PH3 successfully trigger the quenching feedback and the dedicated identification logic manages the contention. A drawback of this architecture is that only the entire microcell, and not a single SPAD, can be disabled (by grounding the signal nHP in Fig. 3), for example in case of hot pixels (i.e., SPADs showing significantly higher noise than average). However, in NLOS imaging applications time-gating reduces the effective dark counts and, typically, the background illumination dominates over SPAD noise, so it is quite unlikely that a hot pixel needs to be disabled.
Fig. 4 shows the schematic of the fast-gated front-end circuit driving a single SPAD. Transistors M4 and M5 drive the SPAD ON and OFF, respectively, by modulating its anode voltage. The SPAD can be biased up to 5 V beyond its breakdown voltage (i.e., the maximum excess bias voltage $V_{EX}$ is 5 V), as transistors M1-M5 are thick-oxide MOSFETs, while transistor M3 grants electrical compatibility between the 5 V circuitry and the following 1.8 V one: the comparator and the following fast-switching logic are supplied by the 1.8 V rail to save power and make use of wider-bandwidth transistors.
Avalanches are sensed by means of transistors M1 and M2, which are used as a programmable degeneration resistor at the source of M4. Transistors M1 and M2 can be enabled or disabled by the user in order to select the sensing resistance, thus trading off faster activation against lower timing jitter: while a highly resistive path towards ground slows the anode discharge (i.e., the rising edge of the gate window), it makes the sensing node reach the comparator threshold at a lower avalanche current, thus mitigating the effect of the time uncertainties intrinsic to the carrier multiplication process [27]. The aspect ratio of M1 is twice that of M2, so that 3 possible combinations of gate transition vs. sensing threshold can be set. Since the aim is the sharpest possible Instrument Response Function (IRF), prompt triggering of the comparator is ensured by its low threshold. To this end, the input differential pair of the comparator has been slightly unbalanced to introduce a tailored offset voltage acting as the differential sensing threshold, whose mean value is $V_T = 26.7$ mV ($\sigma_T = 7.5$ mV), according to Monte Carlo simulations. This value turned out to be sufficiently high to grant a robust fabrication yield ($V_T > 3\sigma_T$), while also preventing the output of the comparators from chattering. Also, PMOS transistors have been used to extend the input common-mode range down to ground.
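The yield margin of the comparator threshold can be checked with a few lines of arithmetic. The sketch below assumes, purely for illustration, that the threshold spread is Gaussian, and estimates how often a comparator would end up with a non-positive threshold; the Gaussian model and the count of 256 comparators (one per SPAD, as implied by 4 comparators per microcell) are assumptions, not simulation results.

```python
import math

V_T_MV = 26.7      # mean sensing threshold from Monte Carlo simulations
SIGMA_T_MV = 7.5   # standard deviation of the threshold
N_COMPARATORS = 256  # assumption: one comparator per SPAD

# Margin expressed in standard deviations (26.7 / 7.5 = 3.56)
margin_sigma = V_T_MV / SIGMA_T_MV

# Gaussian tail probability P(threshold <= 0), valid only under the
# normal-distribution assumption stated above
p_nonpositive = 0.5 * math.erfc(margin_sigma / math.sqrt(2.0))

# Expected number of comparators per chip with a non-positive threshold
expected_bad = N_COMPARATORS * p_nonpositive

if __name__ == "__main__":
    print(f"margin: {margin_sigma:.2f} sigma")
    print(f"tail probability: {p_nonpositive:.2e}")
    print(f"expected per chip: {expected_bad:.3f}")
```

Under these assumptions the expected number of affected comparators per chip is well below one, which is in line with the robust yield claimed for $V_T > 3\sigma_T$.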

B. Fast-Gated Front-End Circuit
As soon as an avalanche is sensed, transistor M1 and M2 also start the quenching mechanism: the comparator response to a photon detection quickly disables both M1 and M2 in less than 1 ns in order to interrupt the avalanche current flowing towards ground, thus passively quenching the SPAD. The quenching process is then actively completed in less than 2 ns, when the HVGATE signal driving transistor M4 and M5 is grounded by the hold-off logic via the 1.8 V to 5 V level-shifter (shown in Fig. 3).
Besides completing the avalanche quenching, the hold-off logic limits the afterpulsing probability [30]: since avalanche carriers may get trapped in deep energy levels, SPADs are kept OFF for at least 30 ns after an avalanche is triggered in order for trapped carriers to be released before the SPAD is subsequently re-activated. For a proper fast-gated operation, the hold-off ends at the rising-edge of the first GATE after it, rearming the SPADs as soon as the next gate-window starts.
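The hold-off rule described above can be sketched as a small timing helper. It assumes, for illustration, that GATE rising edges occur at integer multiples of the gate period and works in integer picoseconds; the function name and this edge alignment are hypothetical.

```python
def rearm_time_ps(avalanche_ps: int, gate_period_ps: int,
                  holdoff_ps: int = 30_000) -> int:
    """Time of the GATE rising edge at which the SPAD is rearmed.

    The SPAD stays OFF for at least `holdoff_ps` after the avalanche and is
    re-enabled at the first GATE rising edge that follows the hold-off.
    Assumption: rising edges sit at k * gate_period_ps, k = 0, 1, 2, ...
    """
    if gate_period_ps <= 0:
        raise ValueError("gate period must be positive")
    earliest = avalanche_ps + holdoff_ps
    # Ceiling division: index of the first edge not earlier than `earliest`
    k = -(-earliest // gate_period_ps)
    return k * gate_period_ps


if __name__ == "__main__":
    # 25 MHz gate (40 ns period), avalanche 15 ns after a gate opening
    print(rearm_time_ps(15_000, 40_000))
```

With a 40 ns gate period, an avalanche 15 ns into the window keeps the SPAD off through the next gate edge, whereas with a 200 ns period (5 MHz) the hold-off always expires before the following gate opens.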

C. Identification Logic
The identification logic asynchronously discerns the spatial coordinates of the first firing SPAD, also in case of multiple events happening within the same gate window. Such logic is laid out as a 2-stage FPA tree (see Fig. 2, inset) and consists of a timing branch, propagating the timing information of the first avalanche ignition to the TDC, and an arbitration branch, encoding and propagating the detection coordinates to the TDC. A complete block diagram of the identification logic is outlined in Fig. 5 (left).
The timing branch is made by cascading 4-input OR gates. A group of 4 gates, each one being the logical OR of the outputs of the 4 comparators in a microcell (PH0-3), represents the first stage of the timing branch, while the second stage is just one additional OR that directly feeds the TDC. All OR gates, whose architecture is reported in Fig. 5 (right), have been designed as pseudo-NMOS open-drain NORs followed by an inverter to restore the correct logic levels. Pseudo-NMOS NORs have been chosen instead of standard NOR gates to make the propagation delays uniform for all the pixels. The pull-down of the NOR gates has been degenerated with an additional NMOS, i.e., transistor M6, sized to limit the discharge current toward ground. Adding such transistor makes the output transition independent from
Similarly, the arbitration branch, which encodes and propagates the event coordinates, is organized as a cascaded architecture. Its first stage consists of 4 2-bit encoders, one per microcell, which exploit the forbidden output codes of the comparators and the intrinsic priority list discussed in the previous sections to provide the X-Y binary coordinates of the first firing SPAD within the microcell. To prevent further detections from causing the encoder to glitch or produce faulty assignments, an input register samples the state of the comparators by means of the microcell's timing OR, right after the photon detection. Each encoder also includes a validation circuit to reject collisions, which could happen in case of coincident events occurring in two non-adjacent pixels of the same microcell, so that the TDC can be rearmed for further conversions when the VALID bit is not asserted. Each microcell feeds the second arbitration stage, which is a 4-way arbiter circuit [32] followed by a binary encoder driving a multiplexer.
Such combinational circuit handles requests (i.e., signals from all 4 microcell ORs) to propagate datasets (i.e., the SPAD coordinates and VALID bit of each microcell). Whenever a photon detection occurs, a request is asserted in one branch of the OR tree. After receiving such request, the 4-way arbiter at the second stage of the FPA identifies the microcell generating the propagation request. The following encoder produces a 2-bit MICRO ID indexing such microcell, thus enabling the multiplexer to convey its dataset (i.e., its 2-bit SPAD ID and VALID bit) and the MICRO ID to the output. However, in case further SPADs fire, more than one request could be asserted: if the time-delay between requests is more than about 40 ps, the first requesting dataset wins the contention and is propagated to the TDC, otherwise the arbiter forwards the data-channel with the highest priority.
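The contention rule of the second arbitration stage can be mimicked in software. The following sketch is a behavioural model only: it treats every request arriving within about 40 ps of the earliest one as a contender and resolves ties with a fixed priority, here taken (as a hypothetical choice) to favour the lowest microcell index. The packing of MICRO ID and SPAD ID into a 4-bit address is likewise an assumed bit order.

```python
def arbitrate(requests: dict, window_ps: float = 40.0) -> int:
    """Return the MICRO ID that wins the contention.

    requests: maps microcell index (0..3) to its request time in ps.
    Requests later than `window_ps` after the first one lose outright;
    closer ones are resolved by fixed priority (assumed: lowest index wins).
    """
    if not requests:
        raise ValueError("no pending request")
    first = min(requests.values())
    contenders = [m for m, t in requests.items() if t - first <= window_ps]
    return min(contenders)


def pack_address(micro_id: int, spad_id: int) -> int:
    """Combine 2-bit MICRO ID and 2-bit SPAD ID (assumed bit order)."""
    return ((micro_id & 0b11) << 2) | (spad_id & 0b11)


if __name__ == "__main__":
    print(arbitrate({2: 0.0, 0: 100.0}))  # well separated: first request wins
    print(arbitrate({2: 0.0, 0: 30.0}))   # within the window: priority decides
```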

D. Time-to-Digital Converters and Output Serializers
The design of a high-resolution TDC has been dictated by the strict timing jitter requirements of NLOS applications. In order to minimize the contribution of the converters to the overall timing jitter of the imager, we opted for integrating 16 TDCs with 6 ps of resolution. Each TDC is based on the multi-stage interpolation architecture [33] shown in Fig. 6. The converter employs a 10-bit counter operated at 416.67 MHz by an external reference-clock (T ck = 2.4 ns), granting a full-scale range (FSR) of about 2.45 μs. Improved conversion linearity is obtained by exploiting the sliding-scale technique [31]: two separate interpolators for the START signal (i.e., the avalanche signal) and for the STOP signal (i.e., the synchronism signal from the pulsed laser) are employed in order to measure their corresponding delay from the first rising-edge of the reference clock. Each interpolator is based on a two-stage coarse and fine interpolation approach [32], in order to reach very precise timing measurements while also granting brief conversion times: a 4-bit coarse multiphase interpolation stage keeps the maximum conversion time as short as few tens of nanoseconds, while a 6-bit fine interpolation stage, featuring a single-stage cyclic Vernier delay line, provides a nominal resolution that is better than the shortest intrinsic gate propagation-delay achievable with this technology node [34], [35]. The synchronization between coarse and fine stages is granted by an arbiter-based synchronizer circuit [32]. A coarse delay-locked loop (DLL) generates the 16 multiphase clocks feeding the coarse interpolation stage, granting all the 16 phases to be equally spaced in time and resistant to process, voltage, and temperature (PVT) variations. 
Since the final resolution of the converter is obtained as the difference between the propagation delays of two separate delay paths, $\tau_1$ and $\tau_2$, only an extremely accurate propagation delay within the fine interpolator can guarantee a stable and precise picosecond resolution. Therefore, the Vernier delay line is also biased by means of fine DLLs providing a PVT-resilient bias for the fine voltage-controlled delay cells. According to post-layout simulations, the nominal LSB of the converter is LSB = $\tau_1 - \tau_2$ = 177.8 ps − 171.4 ps ≈ 6.4 ps, while the employed delay cells show a linear and low-gain dependence of the delay on the control voltage $V_C$, with the desired delay values $\tau_1$ and $\tau_2$ finely locking over a wide temperature range between 0 °C and 90 °C.
Fig. 7 describes the operating principle of the TDC. The counter outputs a conversion $T_{CTR}$, which corresponds to the number of clock periods elapsing between the rising edges of the START and STOP signals. Coarse interpolators measure the time interval ($T_{c,START}$ and $T_{c,STOP}$, respectively) between the rising edge of the first clock phase following the START or STOP signal and the successive rising edge of the reference clock. Finally, the fine interpolation stages retrieve the time conversions ($T_{f,START}$ and $T_{f,STOP}$, respectively) indicating the time distance between START or STOP and the following clock phase. A complete time conversion $T_{MEAS}$ is obtained as follows:
$$T_{MEAS} = T_{CTR} + \left(T_{c,START} + T_{f,START}\right) - \left(T_{c,STOP} + T_{f,STOP}\right)$$
Converters are operated in a reversed START-STOP mode [36], hence the START signal is the first photon event, i.e., the TIM signal conveyed to the TDC by the FPA tree (see Fig. 5), while the STOP signal is the external laser synchronism signal. Since a START pulse is generated only when an avalanche occurs, adopting such a reversed START-STOP mode results in reduced power consumption, especially when operating the sensor in photon-starved conditions.
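The operating principle of Fig. 7 can be sketched numerically by combining the counter value with the START and STOP interpolations. The sketch below assumes an ideal, uncalibrated converter in which the 4-bit coarse code counts clock phases of $T_{ck}/16$ and the 6-bit fine code counts nominal 6.4 ps steps; the function names and this code-to-time mapping are illustrative assumptions, and a real device would rely on the TDC calibration performed in the FPGA.

```python
T_CK_PS = 2400.0                     # reference clock period (416.67 MHz)
N_PHASES = 16                        # multiphase clocks from the coarse DLL
COARSE_STEP_PS = T_CK_PS / N_PHASES  # spacing between adjacent phases
FINE_LSB_PS = 6.4                    # nominal Vernier resolution (tau1 - tau2)
COUNTER_BITS = 10


def interpolated_ps(coarse_code: int, fine_code: int) -> float:
    """Delay of an edge from the reference clock, ideal linear mapping."""
    return coarse_code * COARSE_STEP_PS + fine_code * FINE_LSB_PS


def t_meas_ps(counter: int, c_start: int, f_start: int,
              c_stop: int, f_stop: int) -> float:
    """Counter term plus START interpolation minus STOP interpolation."""
    return (counter * T_CK_PS
            + interpolated_ps(c_start, f_start)
            - interpolated_ps(c_stop, f_stop))


# Full-scale range covered by the 10-bit counter
FSR_PS = (2 ** COUNTER_BITS) * T_CK_PS

if __name__ == "__main__":
    print(COARSE_STEP_PS, FSR_PS)
    print(t_meas_ps(10, 3, 5, 1, 2))
```

Subtracting the two interpolations is what implements the sliding-scale idea in this model: any delay common to the START and STOP paths cancels out.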
The low light conditions of the final application also led to the implementation of first-photon TDCs rather than multihit ones, thus saving area, power consumption and overall circuit complexity.
The area occupation of one TDC is 355 × 130 μm², including both START and STOP interpolators, and its simulated power consumption can reach up to 12.9 mW in the saturation regime. An additional 24 mW of power consumption is due to the coarse and fine DLLs, which have been shared among multiple TDCs to help reduce the overall power consumed by the TDC bank.
Each TDC yields a 34-bit result, consisting of a 4-bit SPAD coordinate, the 10-bit value of the counter and the 10-bit START and 10-bit STOP time conversions. A dedicated dual-channel serializer for each TDC transfers its output to an external FPGA; thanks to the pipelined architecture, conversions can take place during data transfers, thus minimizing the dead-time of each TDC. The 16 serializers are operated at 200 MHz by the FPGA, exploiting a custom event-driven readout protocol, thus achieving a conversion rate of up to 10 Mevents/s each, for an overall IC throughput of $1.6 \cdot 10^{8}$ conversions per second. We chose to include just 16 TDCs as a trade-off between area/power and data throughput, in order not to saturate the USB 3.0 bandwidth.
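The throughput figures quoted above follow from simple arithmetic, which the sketch below reproduces. The only element not stated in the text is the assumption that each serializer channel shifts out one bit per clock cycle, which is used here to estimate a lower bound on the transfer time of one word.

```python
N_TDC = 16
EVENTS_PER_TDC = 10_000_000      # up to 10 Mevents/s per converter
F_SER_HZ = 200_000_000           # serializer clock
N_PADS = 32                      # two channels per TDC

# Word layout: SPAD coordinate + counter + START and STOP interpolations
WORD_BITS = 4 + 10 + 10 + 10

total_events_per_s = N_TDC * EVENTS_PER_TDC    # overall IC throughput
raw_bandwidth_bps = N_PADS * F_SER_HZ          # aggregate pad bandwidth

# Assumption: one bit per channel per clock cycle, two channels per word
cycles_per_word = WORD_BITS // 2
min_word_time_ns = cycles_per_word / F_SER_HZ * 1e9
max_words_per_s = F_SER_HZ / cycles_per_word

if __name__ == "__main__":
    print(total_events_per_s, raw_bandwidth_bps)
    print(cycles_per_word, min_word_time_ns, max_words_per_s)
```

Under this assumption the raw link could carry slightly more than 10 Mevents/s per TDC, leaving some headroom for protocol overhead.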

III. SPAD CAMERA INTEGRATION
We fabricated our 16 × 16 SPAD array in a 160 nm BCD (Bipolar-CMOS-DMOS) technology [37], which has been shown to deliver SPADs with state-of-the-art photon detection probability (PDP) and temporal response (up to 70% PDP at 490 nm, about 10% PDP for near-infrared wavelengths in the range 820 to 840 nm, and less than 30 ps of timing jitter, FWHM). Noise is low as well, as the dark count rate (DCR) is less than 1 kcps at $V_{EX}$ = 5 V (at room temperature) and the afterpulsing probability is well below 1% (even when the hold-off time is as short as 10 ns). More information on the SPAD detectors fabricated in this technology is reported in [37]. In addition, BCD SPADs can be integrated with transistors while being electrically isolated from the fast-switching circuitry, including the 1.8 V digital pre-processing circuitry and the 5 V fast-gating electronics, by means of deep trenches and triple-well isolation. Square SPADs with 32 μm side have been chosen as a trade-off between high fill-factor and low timing jitter: while the achievable fill-factor rises when increasing the size of the detectors, smaller SPADs provide improved jitter performance, along with lower DCR and a reduced number of hot pixels. Rounded corners (8 μm radius) avoid premature edge-breakdown effects due to peaks in the electric field profile. With a 100 μm pitch for accommodating the in-pixel circuitry and enabling the routing of signals towards the peripheral electronics, the overall fill-factor is 9.6% (which leads to a peak photon detection efficiency $PDE_{peak}$ = 6.72%), and it could theoretically be improved up to about 78% by mounting a microlens array (MLA) on top of the sensor [38]. On the other hand, an MLA might be less effective for small f-number lenses, such as those used in NLOS imaging setups [39]. As an alternative, the low fill-factor can be addressed in future designs with 3-D stacked technologies. A micrograph of the chip, whose overall size is 4.8 × 4.8 mm², is shown in Fig. 8, where its main sections have been highlighted.
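The peak photon detection efficiency quoted above is the product of the peak PDP and the fill-factor. The short sketch below reproduces that product and, as a side check, estimates the fill-factor from the drawn geometry alone (a 32 μm square with 8 μm rounded corners on a 100 μm pitch). The geometric estimate is an idealization introduced here; it lands close to, but not exactly on, the 9.6% stated in the text, and the sketch does not attempt to explain the difference.

```python
import math

PDP_PEAK = 0.70        # peak photon detection probability at 490 nm
FILL_FACTOR = 0.096    # fill-factor stated in the text

# Peak photon detection efficiency = PDP x fill-factor
pde_peak = PDP_PEAK * FILL_FACTOR

# Idealized geometric estimate (assumption: active area equals drawn shape)
SIDE_UM = 32.0
CORNER_RADIUS_UM = 8.0
PITCH_UM = 100.0
# A rounded square loses (4 - pi) * r^2 with respect to the full square
active_area_um2 = SIDE_UM ** 2 - (4.0 - math.pi) * CORNER_RADIUS_UM ** 2
geometric_fill_factor = active_area_um2 / PITCH_UM ** 2

if __name__ == "__main__":
    print(f"PDE peak: {pde_peak:.4f}")
    print(f"geometric fill-factor: {geometric_fill_factor:.4f}")
```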
In order to characterize and exploit the 16 × 16 SPAD array, we developed a compact 10 × 7 × 5 cm³ camera (see Fig. 9), which will be integrated in the NLOS imaging system. The module is based on a stack of three printed circuit boards (PCBs): i) a chip-carrier board hosting the SPAD sensor; ii) a power board to supply the sensor and handle the synchronization with the pulsed laser source; iii) an FPGA board to read out the IC, calibrate the TDCs [32], pre-process the TOFs, and manage the USB 3.0 data transfer to the PC.
The camera has been designed to receive three external signals as inputs for TCSPC measurements when embedded in the final NLOS setup: i) a GATE signal enabling the SPADs; ii) a SYNC signal, to manage the synchronism with the pulsed laser and stop the TDC conversion right after the backscattered laser light reaches the sensor; iii) a SUB-FRAME signal to synchronize the event-driven readout to the scanning galvo mirror, thus providing time-stamped TOF conversions to the PC through a custom USB protocol.

IV. CHARACTERIZATION RESULTS
Basic BCD SPAD performance metrics (DCR, PDP, afterpulsing) were fully characterized in [37]. In the following sections, the characterization of the 16 × 16 SPAD module is presented in terms of DCR distribution, gate uniformity, IRF and optical crosstalk.
For all the measurements presented in this work, the 256 SPADs have been operated in fast-gated mode to better represent the operating conditions of the imager in real-world usage scenarios, while the excess bias voltage has been set to 5 V as the optimal trade-off between detection performance metrics. In addition, the light flux has been controlled to keep the overall count rate of each TDC lower than 5% of the gate repetition rate, in order to avoid distortions due to pile-up effects [40].
Fig. 10 shows the percentage distribution of DCR among the 256 pixels of the array when the camera operates at room temperature and the GATE frequency is 5 MHz. While a change in the slope can be observed around the 60% mark, the median value of 1 kcps is equivalent to ∼0.98 cps/μm², as expected for this technology at a temperature of ∼320 K [37]. 44 pixels show a DCR higher than 10 times the median value, while just 6 exceed 100 times the median value.
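The DCR density quoted above can be reproduced with a one-line calculation. The sketch assumes that the figure is normalized to the full 32 × 32 μm² square (ignoring the rounded corners), which is the area that yields ∼0.98 cps/μm²; it also expresses the noisy-pixel counts as fractions of the array.

```python
MEDIAN_DCR_CPS = 1000.0     # median dark count rate
SIDE_UM = 32.0              # SPAD side
N_PIXELS = 256

# Assumption: density is referred to the full square area of the SPAD
dcr_density = MEDIAN_DCR_CPS / (SIDE_UM * SIDE_UM)   # cps per square micron

# Share of pixels above 10x and 100x the median DCR
frac_above_10x = 44 / N_PIXELS
frac_above_100x = 6 / N_PIXELS

if __name__ == "__main__":
    print(f"{dcr_density:.2f} cps/um^2")
    print(f"{frac_above_10x:.1%} above 10x, {frac_above_100x:.1%} above 100x")
```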

B. Gate Uniformity
The gate uniformity has been assessed by keeping the module in a dark environment and providing synchronous GATE and STOP pulses by means of an external pulse generator. The counts distribution of a single pixel with a GATE duration of 30 ns is reported in Fig. 11. Besides a mild oscillation, the counts distribution shows a much faster OFF-to-ON transition than the one foreseen by post-layout simulations, with a 10% to 90% rise time of ∼100 ps, and a noticeable distortion at the beginning of the gate window. To better understand this phenomenon and assess the response uniformity, we collected the time responses of the system to a pulsed laser at 850 nm (∼50 ps FWHM and 1 MHz repetition rate) scanned over the 30 ns gate window in 50 ps steps by sweeping the delay of the GATE signal. Fig. 12 shows the distribution of the total counts in such acquisitions (calculated by integrating the corresponding acquired waveform). The rise of the total counts is proportional to the PDP, which increases as the excess bias rises during the gate opening; it can therefore be related to the gate-ON transition when the SPAD instrument response function has a short tail, as confirmed in the next section. From such total count distribution, an activation time of ∼400 ps (10% to 90%) can be estimated. According to post-layout simulations, feedthrough pulses occurring at the gate opening swing up to 730 mV and last for ∼400 ps, in line with the measured activation time. Indeed, parametric analysis confirms that the comparator output piles up events occurring during this time interval, because transistors M1 and M2 in Fig. 4 do not provide enough current to sustain both the anode discharge and a superimposed avalanche pulse, which ends up being sensed right after such common-mode pulse peak.
In addition, avalanches ignited during the activation of the SPAD build up with a slower rising edge and trigger the sensing comparator late, causing early events to be registered later and making the accumulation peak in Fig. 11 even more pronounced. Fig. 13 shows the timing jitter (measured as FWHM) and the laser peak shift (i.e., the shift of one IRF peak position relative to the previous one) for all the histograms acquired in the scan, as a function of the GATE delay. While the SPAD is biased below its breakdown level before the gate opening, photogenerated carriers are not able to immediately trigger any avalanche. Nevertheless, they can diffuse and start the multiplication process as soon as the SPAD is enabled (these are the events giving rise to the tail in the instrument response function). In all such cases, the avalanche is registered right after the SPAD activation, regardless of the photon absorption time, so the curve in Fig. 13 (top) starts with a zero relative displacement (i.e., all such events are registered at the same time). Then, when the laser pulse arrives with the SPAD fully enabled (i.e., $V_{EX}$ = 5 V), the peak displacement is about the 50 ps step employed for scanning the gate, and the residual damped oscillation is due to the ringing of the front-end power supply rail triggered by the high peak current required for fast-gating operation. The same oscillation pattern affects the timing jitter, as shown in Fig. 13 (bottom). However, the time response shows just 3.4 ps of standard deviation, with negligible impact on practical measurements (about 1 mm uncertainty on the final NLOS reconstruction). Indeed, the overall IRF is only slightly affected by the narrow time response of BCD SPADs, as it is dominated by the jitter of the downstream circuits.
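The link between timing dispersion and spatial uncertainty mentioned above is a direct conversion through the speed of light. The sketch below converts a time spread into the distance travelled by light in that time, without any factor for round-trip paths; this convention is an assumption, chosen because it reproduces the ∼1 mm figure given for 3.4 ps.

```python
C_M_PER_S = 299_792_458.0   # speed of light in vacuum


def time_to_distance_mm(t_ps: float) -> float:
    """Distance travelled by light in `t_ps` picoseconds, in millimetres."""
    return t_ps * 1e-12 * C_M_PER_S * 1e3


if __name__ == "__main__":
    # 3.4 ps standard deviation of the time response along the gate
    print(f"{time_to_distance_mm(3.4):.2f} mm")
    # 60 ps average FWHM of the instrument response function
    print(f"{time_to_distance_mm(60.0):.1f} mm")
```

The same conversion applied to the 60 ps average FWHM gives a path-length spread of a little under 2 cm, which is the scale behind the centimeter-level single-shot precision targeted by the sensor.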
We computed the overall linearity along the 30 ns gate window, both in terms of differential non-linearity (DNL) and integral non-linearity (INL), by performing a statistical code density test (see Fig. 14). Root-mean-square (rms) values of DNL and INL are as low as 250 fs (∼ 0.042 LSB) and 21.56 ps (∼ 3.59 LSB), respectively, even when increasing the repetition rate of the GATE pulses up to 25 MHz.
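As an illustration of the statistical code density test mentioned above, the following sketch derives DNL and INL from a histogram of events that are uniformly distributed in time (e.g., dark counts uncorrelated with the STOP signal). It assumes the textbook definitions (DNL as the relative deviation of each bin from the mean bin content, INL as the running sum of the DNL) and a 6 ps LSB taken from the TDC resolution; the function names are hypothetical and the actual analysis may differ in detail.

```python
import numpy as np

LSB_PS = 6.0  # assumed TDC bin width, from the nominal 6 ps resolution


def code_density_linearity(hist):
    """Return (dnl, inl) in LSB from a code density histogram.

    hist : counts per TDC code, collected with events uniformly
           distributed in time over the analyzed window
    """
    hist = np.asarray(hist, dtype=float)
    # An ideal converter would collect the same number of counts in each bin.
    dnl = hist / hist.mean() - 1.0
    # The INL accumulates the bin-width errors along the transfer curve.
    inl = np.cumsum(dnl)
    return dnl, inl


def rms(x):
    """Root-mean-square value of an array."""
    return float(np.sqrt(np.mean(np.square(x))))


# Toy histogram (not measured data) with one wide and one narrow bin.
dnl, inl = code_density_linearity([100, 110, 90, 100])
print(rms(dnl) * LSB_PS, rms(inl) * LSB_PS)  # rms DNL and INL in ps
```

With this convention, the reported 0.042 LSB rms DNL corresponds to about 0.25 ps for a 6 ps LSB, in agreement with the 250 fs value quoted in the text.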

C. Instrument Response Function
We characterized the IRF of the module by means of a narrow-pulsed laser (15 ps FWHM) at 820 nm, operated at 100 kHz repetition rate with the same setup already described for the gate uniformity measurements. Fig. 15 represents the typical IRF of a single pixel, where the timing jitter is ∼ 58 ps (FWHM). Such IRF is consistent and uniform across all 256 pixels, showing narrow responses ranging from ∼ 50 ps to ∼ 75 ps (FWHM), with an average of 60 ps and a dispersion of ∼ 5 ps. Fig. 16 shows a colormap of the timing jitter of the entire SPAD array. Concerning the IRF, it is also worth noting that the temporal response of each pixel can experience a constant and deterministic offset: when comparing the 256 time responses, the spread of the peak positions is 31.3 ps rms. This is primarily due to an uneven routing of the timing signals towards the peripheral TDCs, as well as to a spread of the propagation delays of the sensing comparators, whose input threshold voltages are quite sensitive to process variations. Nonetheless, the temporal offset can be easily measured and compensated for in post-processing, thus not impairing the final performance of the SPAD camera.
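The post-processing compensation of the per-pixel offset can be sketched as follows. This is a minimal example under stated assumptions: offsets are estimated from the peak position of a calibration IRF histogram of each pixel and corrected with a whole-bin circular shift. The function names and the choice of the median peak as reference are hypothetical and do not describe the actual camera software.

```python
import numpy as np


def estimate_offsets(calib_hists, ref_bin=None):
    """Per-pixel offset (in TDC bins) from calibration IRF histograms.

    calib_hists : array of shape (n_pixels, n_bins)
    ref_bin     : reference peak position; defaults to the median peak
    """
    peaks = np.asarray(calib_hists).argmax(axis=1)
    if ref_bin is None:
        ref_bin = int(np.median(peaks))
    return peaks - ref_bin


def apply_offsets(hists, offsets):
    """Shift each pixel histogram so that all time axes are aligned.

    np.roll wraps around the edges, so this assumes that the signal of
    interest lies far from the histogram boundaries.
    """
    hists = np.asarray(hists)
    return np.stack([np.roll(h, -int(o)) for h, o in zip(hists, offsets)])


# Toy calibration data (not measured): two pixels, peaks 3 bins apart.
calib = np.zeros((2, 32), dtype=int)
calib[0, 10] = 100
calib[1, 13] = 100
offsets = estimate_offsets(calib, ref_bin=10)
aligned = apply_offsets(calib, offsets)
print(offsets, aligned.argmax(axis=1))
```

Since the offset is constant and deterministic, a single calibration acquisition would be enough in this scheme, and the same offsets could be applied to every later measurement. Sub-bin alignment would require interpolation and is not covered here.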

D. Optical Crosstalk
During each avalanche, electron-hole pairs flowing through a SPAD can undergo hot-carrier relaxation phenomena, leading to the emission of secondary photons. If such photons are absorbed inside the active region of nearby SPADs, they can trigger further avalanches with detrimental effects on the signal-to-noise ratio (SNR). This mechanism is known as optical crosstalk [29]. It is strongly correlated to the detected signal and depends on several factors, such as the pitch of the sensor, the presence of deep oxide trenches, the avalanche quenching time, and other device structural properties.
As crosstalk events can happen hundreds of picoseconds after the originating avalanche, we needed to verify whether the FPA tree is able to correctly identify the detection coordinates. Therefore, optical crosstalk has been measured by focusing an 850 nm pulsed laser (50 ps FWHM) into a 20 μm spot, which is smaller than the SPAD side, in order to selectively trigger avalanches mainly in one pixel, and by recording the temporal responses of all neighboring detectors. The total number of events recorded by each pixel has been normalized to the total number of laser counts, after subtracting each SPAD's DCR and spurious laser counts due to residual beam divergence in the optical setup. The latter have been measured by turning off the illuminated microcell, then performing the same acquisition again. Table I summarizes the crosstalk probability of the first neighbor pixels, both in the orthogonal and diagonal directions, with pixels either from the same macrocell (i.e., sharing the same TDC) or belonging to different macrocells. Thanks to the resource sharing architecture, avalanches due to optical crosstalk cannot be observed in SPADs belonging to the same microcell, as they are so close in time that just the coordinates of the first firing SPAD are propagated. Nevertheless, optical crosstalk can affect SPADs belonging to adjacent macrocells, with a triggering probability of less than 0.12% and 0.025% when considering orthogonal and diagonal neighbors, respectively. In such cases, the effect of optical crosstalk can be clearly observed from the temporal responses of single pixels in Fig. 17, where crosstalk lasts ∼ 1 ns, equal to the avalanche duration resulting from post-layout simulations.
Fig. 17. Temporal response of single pixels after DCR subtraction. Red curve shows the response of the "aggressor" pixel on which a pulsed laser is focused. Blue and green curves represent the temporal response of neighbor "victim" pixels in the orthogonal and diagonal directions, respectively, resulting from the effect of optical crosstalk. Cyan and light green curves are the temporal responses of such "victim" pixels when the aggressor is turned off, thus including only photons coming from the laser beam that fall outside the main spot. An offset on the time axis has been introduced on the blue/cyan and green/light green curves for the sake of clarity.
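The normalization used for the crosstalk probability can be written compactly as in the sketch below. It assumes that the spurious laser counts of the victim pixel (measured with the aggressor turned off) have already been cleared of dark counts, so that the two contributions can be subtracted separately. The function name and the numbers in the example are hypothetical and serve only to illustrate the arithmetic.

```python
def crosstalk_probability(victim_counts, victim_dark_counts,
                          victim_stray_counts, aggressor_laser_counts):
    """Crosstalk probability of a victim pixel with respect to an aggressor.

    victim_counts          : events recorded by the victim, aggressor ON
    victim_dark_counts     : dark counts expected over the same exposure
    victim_stray_counts    : stray laser counts (aggressor OFF, DCR removed)
    aggressor_laser_counts : laser counts recorded by the aggressor pixel
    """
    if aggressor_laser_counts <= 0:
        raise ValueError("aggressor_laser_counts must be positive")
    net = victim_counts - victim_dark_counts - victim_stray_counts
    # Statistical fluctuations can push the net value slightly below zero.
    return max(net, 0.0) / aggressor_laser_counts


# Hypothetical counts, chosen only to land on a probability of 0.12%.
p = crosstalk_probability(2200, 500, 500, 1_000_000)
print(f"{100 * p:.2f} %")
```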

V. NLOS MEASUREMENTS
To evaluate the performance of the system in an actual NLOS reconstruction, we added the array to the NLOS system that has previously been used with different SPAD detectors [39]. The system uses an NKT Katana HP laser with a pulse width of about 30 ps and adjustable repetition rate. The repetition rate used is 5 MHz and the average power is about 400 mW at 532 nm. We scan the laser across the entire relay surface for 30 seconds over a regular grid of about 24,000 laser positions (Fig. 18). Each SPAD pixel is treated as a single-pixel non-confocal SPAD system like the one introduced in [5], collecting a histogram of photon arrivals for each laser position. This results in 256 sets of 24,000 histograms. For reconstruction we treat this data as 256 separate single-pixel NLOS measurements and use the reconstruction algorithm presented in [5] to create 256 separate reconstructions of the scene using the histograms collected by each pixel. The resulting set of reconstructions is shown in Fig. 19. Most pixels have an SNR that is sufficient to reconstruct the scene and the reconstruction quality is similar to that of comparable single-pixel systems [5]. There is significant variability in dark count rate among the pixels that in some cases affects the reconstruction quality. In most pixels the dark counts, even though substantially higher than in previous single-pixel sensors, are comparable to the noise from ambient light (even in a dark room) and from higher-order reflections from the previous illumination pulse that persist in the scene. Therefore, the observed higher dark count rate does not significantly affect the reconstruction quality in most of the pixels.
Fig. 19. Reconstruction of a scene with the numbers 2 and 4 for all pixels of the array. Since the pixels focus on different spots on the relay surface, certain features are not visible in some reconstructions due to occlusion and missing cone artifacts.
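The scan parameters given above fix the exposure available at each laser position. The following short calculation is a sketch assuming a uniform dwell time and no dead time while the scanner moves between positions; neither assumption is stated in the text.

```python
scan_time_s = 30.0     # total scan duration
n_positions = 24_000   # approximate number of laser positions on the grid
rep_rate_hz = 5e6      # laser repetition rate
n_pixels = 16 * 16     # SPADs in the array

# Time spent on each laser position under the uniform-dwell assumption.
dwell_s = scan_time_s / n_positions

# Laser pulses contributing to each histogram, and time span between pulses.
pulses_per_position = dwell_s * rep_rate_hz
pulse_period_ns = 1e9 / rep_rate_hz

# One histogram per pixel and per laser position.
n_histograms = n_pixels * n_positions

print(f"dwell time: {dwell_s * 1e3:.2f} ms")
print(f"pulses per position: {pulses_per_position:.0f}")
print(f"pulse period: {pulse_period_ns:.0f} ns")
print(f"histograms: {n_histograms}")
```

Under these assumptions each of the 256 × 24,000 histograms is built from the returns of a few thousand laser pulses, with 200 ns between consecutive pulses.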
To obtain a high-quality reconstruction, one could in principle just sum over all the individual reconstructions. This is computationally inefficient and would make little sense for this slow scan, where each individual pixel already has sufficient SNR for a good reconstruction. It would also require more careful alignment of the capture system to make sure all the reconstructions overlap. A system with a faster scanner, better alignment, and a new algorithm that allows reconstruction from the entire array at higher speeds is the subject of future work.
TABLE II COMPARISON BETWEEN THE SPAD ARRAY PRESENTED IN THIS WORK AND OTHER TIME-GATED SPAD ARRAYS

VI. CONCLUSION
In this paper we presented the design and characterization of a 16 × 16 SPAD array with 16 integrated high-performance TDCs. Such sensor has been designed for high-throughput time-tagged TCSPC measurements, specifically for NLOS imaging, thanks to specific features such as: i) fast-gating capabilities, granting OFF-to-ON transitions faster than 300 ps; ii) an IRF as narrow as 60 ps (FWHM) for the whole acquisition chain (including the TDCs), with excellent uniformity throughout the entire gate window. Table II lists a comparison between this work and other time-gated SPAD-based imagers presented in the literature. Linear arrays and silicon photomultipliers (SiPMs) have not been included in this list, as the former suffer from scalability issues in most cases, while the latter lack information about spatial resolution. To the best of our knowledge, our imager is the first one to combine fast-gated SPADs with such a narrow IRF. Despite the limited number of integrated SPADs, the possibility to perform up to $1.6 \cdot 10^{8}$ TOF measurements per second and transfer them to a PC via USB 3.0, while operating synchronously to an external scanning galvo mirror, makes it possible to improve exposure times, range, accuracy and resolution of state-of-the-art NLOS imaging systems. 
Sharing hardware resources also proved to be an effective solution for the design of a more compact and scalable sensor, without compromising either the throughput or the performance of the array. Nonetheless, the 500 mW of dissipated power make this array one of the most power-hungry in Table II. Power consumption is dominated by the high-resolution TDCs (250 mW) and the 3.3 V LVCMOS output serializers (up to 130 mW). With about 100 mW dissipated by the comparators and digital electronics, hundreds of microwatts by the 256 SPADs in normal light conditions and less than 20 mW by the gating electronics when applying gate signals with a repetition rate of up to 50 MHz, this architecture still leaves room for scaling to wider arrays by sharing one TDC among a larger number of SPADs, or even by employing differential LVDS output serializers rather than single-ended ones, which could possibly enable higher throughputs as well. While the monolithic integration of SPADs and electronics on the same die constrained the fill factor to just 9.6%, the presented architecture can serve as a proof-of-concept for scaling towards denser imagers by exploiting 3D-stacked technologies, with the final goal of enabling video-rate reconstructions within eye-safety limits.
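As a consistency check on the figures above, the individual contributions can be summed as in the short sketch below. It takes the quoted upper bounds and approximate values at face value and neglects the sub-milliwatt contribution of the SPADs; the dictionary keys are just descriptive labels.

```python
# Power contributions in mW, as quoted in the text.
budget_mw = {
    "high-resolution TDCs": 250.0,
    "LVCMOS output serializers (upper bound)": 130.0,
    "comparators and digital electronics (approximate)": 100.0,
    "gating electronics (upper bound)": 20.0,
}

total_mw = sum(budget_mw.values())
tdc_share = budget_mw["high-resolution TDCs"] / total_mw

print(f"total: {total_mw:.0f} mW, TDC share: {100 * tdc_share:.0f} %")
```

The sum matches the 500 mW total, with the TDCs accounting for half of it, which is why sharing one TDC among more SPADs is the natural lever when scaling to larger arrays.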