A 500 × 500 Dual-Gate SPAD Imager With 100% Temporal Aperture and 1 ns Minimum Gate Length for FLIM and Phasor Imaging Applications

In this article, we report on SwissSPAD3 (SS3), a 500 × 500 pixel single-photon avalanche diode (SPAD) array, fabricated in 0.18-µm CMOS technology. In this sensor, we introduce a novel dual-gate architecture with two contiguous temporal windows, or gates, guaranteed by the circuit architecture to be nonoverlapping and covering the totality of the sensor’s exposure period. The gates can be adjusted with a temporal resolution of 17.9 ps, and the minimum measured gate width is 0.99 ns; to our knowledge, this is the shortest reported to date among large-format SPAD imagers. The burst frame rate is 49.8 kframes/s in the dual-channel mode and 97.7 kframes/s in the single-channel mode. A 2690-MB/s PCI express (PCIe) interface has been added to the data acquisition framework, enabling continuous operation at approximately 44 and 88 kframes/s, respectively. Due to optimizations of the gate-signal tree, we achieved a significant reduction in gate skew and gate width variation, which are negligible with respect to the SPAD temporal jitter. These improvements, along with a sub-10-cps dark count rate (DCR) per pixel and 50% maximum photon detection probability (PDP), result in a sensor particularly well suited for fast-acquisition fluorescence lifetime imaging microscopy (FLIM) experiments, for which we demonstrate reduced dispersion versus a single-gated sensor.


I. INTRODUCTION
Fluorescence lifetime imaging microscopy (FLIM) is a popular imaging method that measures the fluorescence decay time of molecules when excited by light. Largely insensitive to background illumination and environmental noise, FLIM has become widely used in many areas of life sciences, such as biophysics and biochemistry [1]-[3]. Phasor imaging is a projection of lifetimes over a sine-cosine basis and gives a convenient 2-D representation of multi-lifetime systems. Time-correlated single-photon counting (TCSPC) is a technique by which a fast-pulsed light source, generally a laser, excites the molecule and a single-photon detector, e.g., a single-photon avalanche diode (SPAD), captures the photons resulting from the fluorescence decay. A histogram is constructed, timed with respect to the laser pulse, and photon times of arrival can be computed directly using time-to-digital or time-to-amplitude converters [4]-[6] or indirectly by way of a global or rolling shutter, to achieve time gating [7], [8].
If each single-photon detector has access to a time converter, then direct methods can have a temporal aperture of 100%, and every photon is accounted for [4], [9]. Unfortunately, due to the electronics required by time converters, these methods cannot reach a large pixel count. In time-gated sensors, a smaller portion of the detection cycle is used (∼10%). While the resulting temporal aperture is reduced, less area is required per pixel, and larger pixel counts can be achieved. An example of this approach was introduced with the SwissSPAD family of time-gated, high-speed, and large-format image sensors.
Developed in 2011, SwissSPAD achieved the highest spatial resolution (512 × 128 pixels) in SPAD arrays at the time [10]. In 2019, SwissSPAD2 (SS2) was introduced with 512 × 512 pixels, and a revised architecture that resulted in significant improvements to fill factor, photon detection probability (PDP), dark count rate (DCR), and crosstalk [11]. SS2 is currently being employed in a variety of applications; however, for some applications, low temporal aperture can diminish performance. For example, the error in the measured decay lifetimes for a biological sample with a high photobleaching rate will increase with the required acquisition time.
In this article, we present SwissSPAD3 (SS3), a 500 × 500-pixel SPAD sensor, and the latest member of the SwissSPAD family. Along with improvements to the power distribution network and uniformity of the gate-signal distribution tree, SS3 comes with the addition of a second contiguous gate channel. This results in single-photon sensitivity over the entirety of the exposure time and a lower required acquisition time for a given exposure time, making it highly suitable for FLIM and phasor imaging applications. We present an overview of the sensor's implementation and architecture and provide a characterization of its major performance parameters. Finally, we compare the performance of phasor-based FLIM when using a single-gate versus dual-gate approach.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Fig. 1. PDP characterization of the p-i-n SPAD structure in [13].

II. IMAGE SENSOR ARCHITECTURE
A. Pixel Architecture
In SS3, the physical structure of the SPADs was kept identical to the circular version of those used in SS2, both to increase performance reliability and to facilitate the imprinting of existing microlens models. First reported in [12], each SPAD is a 2-D front-side illuminated (FSI) p-i-n junction fabricated in 0.18-μm CMOS technology. The pixel pitch is 16.38 μm and the active area has a diameter of 6 μm. The PDP profile was measured in [13] and is shown in Fig. 1. At 7-V excess bias, the PDP exceeds 50% at 520 nm and persists above 30% across the visible spectrum. While this structure's geometry results in a relatively low fill factor (10.5%), it exhibits a lower DCR, lower crosstalk probability between pixels, wider spectral range, and a more uniform detection probability profile than a traditional p-n junction approach. The fill factor can also be improved with microlenses [14], [15].
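As a quick consistency check on the quoted geometry, the native fill factor can be reproduced as the ratio of the circular active area to the square pixel cell. The sketch below assumes an ideal circle of 6-μm diameter on a 16.38-μm pitch; the function name is illustrative and not part of the original work.

```python
import math

def native_fill_factor(active_diameter_um: float, pixel_pitch_um: float) -> float:
    """Ratio of a circular active area to the area of the square pixel cell."""
    active_area = math.pi * (active_diameter_um / 2.0) ** 2
    return active_area / pixel_pitch_um ** 2

fill_factor = native_fill_factor(6.0, 16.38)
print(f"{100 * fill_factor:.1f}%")  # prints 10.5%
```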
As shown in Fig. 2, each SPAD is embedded in a digital pixel of 14 nMOS transistors, positioned around the periphery of the active area. A reverse bias (V_OP) is applied to the SPAD's cathode, biasing it above the breakdown voltage (V_BR) and enabling Geiger mode operation. The circuit layout does not allow for direct measurement of the SPAD's anode (and thus V_BR); however, we have estimated V_BR = 23.0 ± 0.1 V under typical operating conditions, in agreement with the analysis done in [13]. After a successful photon detection, the anode rises to the excess bias voltage V_EX = V_OP − V_BR. Normally, V_EX > 3.3 V would exceed the maximum permissible rating for this technology; however, the use of a cascode transistor (T1) allows the SPAD to be biased at V_EX = 6.0 ± 0.1 V (V_OP = 29 V), improving PDP and timing response [16]. Avalanche current is passively quenched through transistor T2, where the quenching time constant is determined by the gate voltage V_Q, which tunes the transistor's resistance.
Each pixel's exposure starts with the assertion of RESET on transistors T3 and T9, clearing the pixel's ungated (O1) and gated (O2) output nodes. These two intermediate outputs are then used to form two contiguous temporal windows, Gate 1 (G1) and Gate 2 (G2). Because O1 represents the entire intensity of the frame, the two outputs must be subtracted from one another to isolate counts falling only into G1; thus, G1 = O1 − O2 and G2 = O2. This step is done in postprocessing.
Before detection, the state of GATE_M is set by the GATE input. After detection, node O1 is pulled high and GATE_M is set to zero by the NOR gate implemented by T4-T7. If the previous state of GATE_M was high, O2 is also pulled high through T8 before it is disabled by the NOR gate; otherwise, O2 remains low. Thus, the valid output values of O1 and O2 represent the states "no photon" (00), "photon in Gate 1" (10), and "photon in Gate 2" (11). A 01 output is invalid and represents a fault if it occurs.
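The postprocessing step described above can be sketched as follows. This is a minimal illustration of the relations G1 = O1 − O2 and G2 = O2 for a single pixel, with the invalid 01 state treated as an error; the function names and the example bit sequence are hypothetical and do not come from the authors' software.

```python
def decode_pixel(o1: int, o2: int) -> tuple[int, int]:
    """Map the raw output bits (O1, O2) of one binary frame to gate counts (G1, G2)."""
    if o1 not in (0, 1) or o2 not in (0, 1):
        raise ValueError("O1 and O2 must be single bits")
    if o1 == 0 and o2 == 1:
        # O2 can only be set together with O1, so 01 indicates a fault.
        raise ValueError("invalid state 01: O2 set without O1")
    return o1 - o2, o2

def accumulate(frames):
    """Sum the decoded gate counts over an iterable of (O1, O2) pairs for one pixel."""
    g1_total = g2_total = 0
    for o1, o2 in frames:
        g1, g2 = decode_pixel(o1, o2)
        g1_total += g1
        g2_total += g2
    return g1_total, g2_total

# Hypothetical sequence of binary frames for a single pixel.
example = [(0, 0), (1, 0), (1, 1), (1, 1), (0, 0), (1, 0)]
print(accumulate(example))  # prints (2, 2)
```

Because every detected photon lands in exactly one of the two gates, G1 + G2 always equals the intensity count O1.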
The pixel readout is implemented through transistors T10-T14 and controlled through input signals Sel1 and Sel2, which are manipulated to read O1 and O2 through a shared data bus. Between readouts, the output bus is reset by a dedicated pull-up transistor located outside the pixel. After the frame is read out, RESET is asserted again to start a new exposure. A timing diagram of the readout sequence is shown in Fig. 3.
Unlike in SS2, there is no global active recharge signal to reset each pixel and no global gate to control the duration of the exposure window. The addition of these features significantly increased the number of required transistors, inactive pixel area, and timing constraints in simulation. Because the emphasis of this sensor was to maximize the duty cycle for FLIM, SS3 does not feature global shutter and can operate only in rolling shutter mode.

B. Sensor and Readout Architecture
As shown in Fig. 4, SS3 is approximately 1 cm² in area and split into two identical halves, each containing 250 × 500 pixels. Readout is done by row, with rows being selected by 8-bit decoders located on the left of the array. To meet both the area requirements of the chip and I/O capabilities of available field-programmable gate arrays (FPGAs), SS3 was limited to 250 output pins. Each half is assigned 125 pins at the top and bottom of the array. To enable column selection, each output pin is connected to a 4:1 multiplexer. Control signals for the row decoders and column multiplexers are provided externally by an FPGA. During readout, 1-bit D-registers store the binary photon count of each column, and a pull-up network resets its voltage after acquisition.
The gating signal is distributed from the input pad to the bottom pixel of each column via a balanced signal tree. In SS2, the global shutter and active recharge features required three balanced signal trees; however, SS3's architecture requires only one. The area freed in this way allows for more decoupling capacitors than in SS2, which in turn enables an order of magnitude reduction in minimum gate length and lower skew among pixels.
In SS2, a minimum exposure time of 10.24 μs and a binary frame rate of 97.7 × 10^3 frames/s (97.7 kframes/s) were reported. SS3 was made in the same technology; however, the readout of each pixel can contain 2 bits: intensity and gate. Therefore, when both channels are used, the maximum binary readout speed is reduced by a factor of 2.
The precise acquisition speed depends on the FPGA readout scheme. For example, it is often convenient to acquire 256 × 512 pixels (instead of 250 × 500) to match the size of internal memories. Adding a small fraction of "dummy pixels" slightly reduces the maximum achievable frame rate but greatly increases the ease of FPGA addressing. Due to these complexities, there can be a small (∼1%-2%) variation in achieved frame rate across implementations. For our current test systems, the observed maximum frame rates for intensity-only and both channels are 97.7 and 49.8 kframes/s, respectively.
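The reported 10.24-μs exposure time is consistent with one simple reading of the readout scheme: 256 addressed rows per half, four multiplexer phases per row, and one cycle of the 100-MHz data clock per phase. This decomposition is an assumption made here for illustration only; the text does not spell out the cycle budget, and the function name is hypothetical.

```python
def binary_frame_time_s(rows: int, mux_phases: int, clock_hz: float) -> float:
    """Readout time of one binary frame, assuming one clock cycle per row and mux phase."""
    return rows * mux_phases / clock_hz

frame_time = binary_frame_time_s(rows=256, mux_phases=4, clock_hz=100e6)
single_channel_fps = 1.0 / frame_time
dual_channel_fps = single_channel_fps / 2  # two bits per pixel, ideal halving

print(f"{frame_time * 1e6:.2f} us")                  # prints 10.24 us
print(f"{single_channel_fps / 1e3:.1f} kframes/s")   # prints 97.7 kframes/s
print(f"{dual_channel_fps / 1e3:.1f} kframes/s")     # prints 48.8 kframes/s
```

The ideal halving gives about 48.8 kframes/s, roughly 2% below the observed dual-channel rate of 49.8 kframes/s, which is of the same order as the implementation-dependent variation mentioned above.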

C. System Architecture
The complete system consists of several printed circuit boards (PCBs): a sensor board, a motherboard, and two FPGA boards. Each SS3 is mounted on a sensor board that routes all 538 pins to high-density connectors. These interface with the motherboard, which is populated with a microcontroller (μC) that configures linear regulators, providing stable supply voltages to various rails on SS3. A thermoelectric cooler and temperature sensor are mounted directly beneath SS3, making contact with the sensor's exposed ground pad. Under typical operating conditions, the sensor can be actively cooled to 26 °C. The current setup is shown in Fig. 5.
For data acquisition and postprocessing purposes, each sensor interfaces with two OpalKelly XEM7360 evaluation boards [17], which are populated with either a 160T or 410T Xilinx Kintex FPGA, 2 GB of DDR3 RAM, and a USB 3.0 interface. For most conventional applications, the 160T is sufficient; however, the 410T can be used for more demanding computations, such as real-time coincidence detection.
Each FPGA is responsible for half of a sensor (250 × 500 pixels), which is input through 125 data pins clocked at 100 MHz. If data compression is desired, the native USB 3.0 interface on each FPGA is sufficient. For example, at an exposure time of 10.24 μs, compressing into 8-bit images (255 exposures per pixel) results in an output bandwidth of 47.9 MB/s per FPGA (250 pixels × 500 pixels), well below the 330-MB/s maximum of the USB 3.0 interface.
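The bandwidth figure in this example follows directly from the numbers given: one byte per pixel is emitted once every 255 binary exposures. A minimal sketch of that arithmetic is shown below; the function and parameter names are illustrative.

```python
def compressed_bandwidth_bytes_per_s(pixels: int, exposures_per_image: int,
                                     exposure_s: float, bytes_per_pixel: int = 1) -> float:
    """Output rate when binary exposures are summed on the FPGA into multi-bit images."""
    image_period_s = exposures_per_image * exposure_s
    return pixels * bytes_per_pixel / image_period_s

rate = compressed_bandwidth_bytes_per_s(pixels=250 * 500, exposures_per_image=255,
                                        exposure_s=10.24e-6)
print(f"{rate / 1e6:.1f} MB/s")  # prints 47.9 MB/s
```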
For binary frame readout, the sensor can be operated either in burst mode, where frames are stored in the DDR3, or continuous mode at a lower frame rate. For continuous operation, a PCI express (PCIe) cable interface is supplied on the motherboard. If used in this configuration, data from one FPGA are sent to the other through 32 parallel data lines and stored in an intermediate buffer, and then, the entire sensor's data are output to a PC over an eight-lane PCIe v2.0 bus.
In theory, the PCIe interface [18] is capable of handling the sensor's maximum throughput of 2.98 GB/s. The current bottleneck, however, is the DDR3 RAM on the FPGA board. Although necessary as a buffer to account for intermittent stalls in the PC's operation, the pipeline limits the output bandwidth to 2.69 GB/s. Thus, when saving directly to the PC's memory, intensity-only frame rates of 88 kframes/s have been observed. When saving to a file, the solid-state drive (SSD) of our workstation is the current bottleneck and further limits the frame rate to approximately 62.5 kframes/s. Note that this performance is entirely dependent on the PC; if streaming directly to a GPU, for example, the DDR3 buffer may not be needed and the full 97.7-kframes/s frame rate can be achieved.
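The throughput and frame-rate figures above are mutually consistent under two assumptions that are made here and not stated explicitly in the text: frames are padded to 512 × 512 pixels (as in the FPGA addressing example of Section II-B), and "GB" denotes binary gigabytes (2^30 bytes). The following sketch reproduces the numbers under those assumptions; the helper names are hypothetical.

```python
GIB = 2 ** 30  # binary gigabyte (assumption: the quoted GB/s figures use this unit)

def binary_frame_bytes(rows: int, cols: int, bits_per_pixel: int = 1) -> float:
    """Size of one binary frame in bytes."""
    return rows * cols * bits_per_pixel / 8

def max_frame_rate(bandwidth_bytes_per_s: float, frame_bytes: float) -> float:
    """Highest sustainable frame rate for a given link bandwidth."""
    return bandwidth_bytes_per_s / frame_bytes

frame_bytes = binary_frame_bytes(512, 512)            # padded frame, intensity only
full_rate_throughput = frame_bytes / 10.24e-6         # one frame per 10.24 us
intensity_fps = max_frame_rate(2.69 * GIB, frame_bytes)
dual_fps = max_frame_rate(2.69 * GIB, binary_frame_bytes(512, 512, bits_per_pixel=2))

print(f"{full_rate_throughput / GIB:.2f} GiB/s")  # prints 2.98 GiB/s
print(f"{intensity_fps / 1e3:.1f} kframes/s")     # prints 88.1 kframes/s
print(f"{dual_fps / 1e3:.1f} kframes/s")          # prints 44.1 kframes/s
```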

III. PERFORMANCE CHARACTERIZATION
A. Dark Counts
Dark counts, or thermally generated detections in the absence of light, are one of the main sources of noise in SPAD image sensors. DCR determines the lower bound of the sensor's dynamic range, and the DCR uniformity across pixels can affect the spatial resolution. In addition, DCR can have an especially large impact on sensitive computational applications, such as multispeckle diffuse correlation spectroscopy [19].
In large SPAD arrays, dark counts are typically quantified with a combination of two parameters: average DCR and hot pixel percentage. Hot pixels, or pixels with a DCR one or two orders of magnitude higher than the median, depending on the authors' definitions, are usually discarded in postprocessing.
To measure the dark count characteristics, the sensor was placed in the dark at an excess bias voltage of 6 V and actively stabilized to 27.0 °C. A series of 1024 8-bit images at T_EX = 10.24 μs was captured, averaged, and normalized by the exposure time. As shown by the intensity image in Fig. 6, the hot pixels are randomly scattered across the array with no visual patterns. This apparent randomness was further verified by inputting the raw data into the National Institute of Standards and Technology (NIST) Statistical Test Suite [20]; no correlations or patterns were found. Fig. 7 shows the population distribution of the DCR when the camera is actively stabilized to various temperatures around ambient. The breakdown voltage dependency on temperature was characterized by the methods in [21] and measured to be ΔV_BR/ΔT = 0.04 ± 0.01 V/°C. At each temperature, the SPAD bias was adjusted such that V_EX = 6.0 ± 0.1 V. Fig. 8 shows the DCR dependence on excess bias voltage. At our typical operating conditions (V_EX = 6.0 ± 0.1 V and T = 27 °C), the median DCR is 9.20 ± 0.05 cps, and over 90% of the pixels are under 20 cps. At 0.33 cps/μm², this is well below the DCR of other reported large-format SPAD imagers [22]-[26]. The choice of a hot pixel threshold ultimately depends on the application. Conventional imaging applications may be significantly less sensitive to noise than computational applications. Here, we have chosen to classify hot pixels as those with a DCR two orders of magnitude above the median. Under this criterion, hot pixels account for 1.8% of the total.
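The dark-count statistics above can be sketched in a few lines. The example assumes that each 8-bit image is the sum of 255 binary exposures of 10.24 μs (as in the compression example of Section II-C), so that the averaged counts are normalized by 255 × T_EX; this normalization, the function names, and the sample values are illustrative assumptions, not the authors' processing code.

```python
import math
import statistics

EXPOSURES_PER_IMAGE = 255   # binary exposures summed into one 8-bit image (assumption)
T_EX_S = 10.24e-6           # exposure time of one binary frame

def dcr_cps(mean_counts_per_image: float) -> float:
    """Convert the mean dark count of an 8-bit image into counts per second."""
    return mean_counts_per_image / (EXPOSURES_PER_IMAGE * T_EX_S)

def hot_pixel_fraction(dcr_values, factor: float = 100.0) -> float:
    """Fraction of pixels whose DCR exceeds `factor` times the median DCR."""
    threshold = factor * statistics.median(dcr_values)
    return sum(d > threshold for d in dcr_values) / len(dcr_values)

# Hypothetical per-pixel DCR values in cps, for illustration only.
sample = [8.0, 9.0, 9.2, 9.5, 10.0, 12.0, 15.0, 18.0, 40.0, 2500.0]
print(statistics.median(sample))   # prints 11.0
print(hot_pixel_fraction(sample))  # prints 0.1

# DCR density for the reported median and a 6-um-diameter active area.
dcr_density = 9.20 / (math.pi * 3.0 ** 2)
print(f"{dcr_density:.2f} cps/um^2")  # prints 0.33 cps/um^2
```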

B. Pixel Crosstalk
Crosstalk is defined as false detection events triggered by an initial event in a neighboring pixel. As this is a source of correlated noise, it is an undesired effect in image sensors. The crosstalk probability can be estimated by comparing the DCR of hot pixels to that of their neighbors; however, several corrections need to be applied to account for the sensor's nonidealities [27].
First, a pile-up correction equation [28] was applied to all pixels to account for undetected photons, due to the binary nature of the camera. Second, bounds were placed on which hot pixels to consider. Exceptionally hot pixels were excluded from consideration, as even a pile-up correction cannot reasonably estimate their true count rate. The lower bound was set at 500 times the median DCR, so as not to include pixels whose DCR is low enough to be significantly affected by process variations and shot noise. For SS3, applying these criteria results in a subset of approximately 2000 pixels, or 0.8% of the total. Third, the median DCR was subtracted from all pixels to separate crosstalk from ordinary dark counts. Fourth, pixels adjacent to other hot pixels were excluded.
Finally, the crosstalk percentage was calculated by comparing the counts of hot pixels to those of their adjacent pixels. The average crosstalk values are shown in Fig. 9 and are below 0.06% and 0.03% for the nearest neighbors and nearest diagonals, respectively. These values are below those typically reported for SPAD arrays (0.17% [8], 0.71% [29], 3.5% [10], and 4.3% [27]) and, because SS3 is populated with the same SPAD type, almost identical to those reported for SS2 [11].
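The correction chain described in this subsection can be illustrated with a short sketch. The exact pile-up equation of [28] is not reproduced in this article; the logarithmic form used below is the one commonly applied to binary (1-bit) frames, where each frame can register at most one event, and it is an assumption here. The function names and the example counts are hypothetical.

```python
import math

def pileup_correct(counts: float, n_frames: int) -> float:
    """Estimate the true event count from binary frames that saturate at one count each."""
    if not 0 <= counts < n_frames:
        raise ValueError("counts must satisfy 0 <= counts < n_frames")
    return -n_frames * math.log(1.0 - counts / n_frames)

def crosstalk_probability(hot_counts: float, neighbor_counts: float,
                          median_counts: float, n_frames: int) -> float:
    """Excess neighbor counts divided by hot-pixel counts.

    Both pixels are pile-up corrected, and the (pile-up-corrected) median
    dark count is subtracted to remove ordinary dark counts.
    """
    hot = pileup_correct(hot_counts, n_frames) - median_counts
    neighbor = pileup_correct(neighbor_counts, n_frames) - median_counts
    return neighbor / hot

# Hypothetical counts accumulated over one million binary frames.
p = crosstalk_probability(hot_counts=200_000, neighbor_counts=110,
                          median_counts=10, n_frames=1_000_000)
print(f"{100 * p:.3f}%")
```

In a full analysis, this ratio would be averaged over all hot pixels that pass the selection bounds, separately for nearest neighbors and diagonal neighbors.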

C. Temporal Gating
The time gate profile of an image sensor can have a significant influence on its timing characteristics. Interpixel gate skew and gate width variation are commonly encountered issues in large SPAD arrays and can severely degrade performance. In SS3, the elimination of global shutter mode and active recharge removed two global signal trees versus SS2's design, and extra area was allocated for gate performance optimization. Along with modifications to the circuit architecture, these improvements result in a greatly improved gate profile.
Gate characterization was performed at room temperature and an excess bias of V_EX = 6.0 ± 0.1 V. The entire active area was illuminated by the collimated output of a 637-nm laser, ∼40-ps full-width at half-maximum (FWHM), and pulsed at a frequency of 40 MHz. Other than to ensure that the saturation regime was not entered, no special efforts were taken to achieve high spatial uniformity in the photon flux across the sensor. The laser and camera were synchronized by a common signal generated by the laser controller, and the camera was operated at a dual-channel frame rate of 49.8 kframes/s.
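As discussed earlier, SS3 has two gate channels, which are guaranteed by the circuit architecture to be contiguous and to cover the entirety of the exposure window. The position of the gate border can be adjusted prior to the experiment by mixed-mode clock managers (MMCMs) on the FPGAs with a resolution of 17.9 ps. Fig. 10 shows the performance of the gate channels for a random selection of pixels, at the shortest achievable gate length of 0.99 ± 0.07 ns. The complementary channel G2 is also shown with a length of 24.0 ± 0.1 ns. Errors indicate one standard deviation of the measured distribution across the sensor. The sum of these two windows encompasses the entirety of the 40-MHz laser repetition period. Compared to SS2 (rise time ≈ 0.38 ns and fall time ≈ 0.62 ns), the gate edges are both faster and more symmetrical. These fast edges result in a higher bandwidth for the instrument response function (IRF) characterization in FLIM. In SS3, special effort was taken to optimize the architecture for fast gating. As shown in Fig. 11, G1's rising and falling edges belong to a compact distribution, with an FWHM of 109.4 and 153.4 ps, respectively. The figure insets illustrate the gate skew's spatial dependence, which is due to the propagation delay difference between the top and bottom of the array. This effect can be corrected for with proper IRF characterization.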
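Because the two gates are contiguous, the length of the complementary gate follows from the laser period and the length of the first gate. The sketch below checks this for the reported values and also counts how many 17.9-ps border positions a sweep across one full laser period would contain, assuming the border can be moved over the whole period; the names are illustrative.

```python
def gate_border_steps(period_s: float, step_s: float) -> int:
    """Number of whole delay steps that fit into one laser period."""
    return int(period_s // step_s)

laser_period_s = 1.0 / 40e6       # 25 ns at a 40-MHz repetition rate
step_s = 17.9e-12                 # resolution of the gate border position
g1_s = 0.99e-9                    # shortest measured G1 length
g2_s = laser_period_s - g1_s      # contiguous complementary gate

print(gate_border_steps(laser_period_s, step_s))  # prints 1396
print(f"{g2_s * 1e9:.2f} ns")                     # prints 24.01 ns
```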

IV. PHASOR FLIM MEASUREMENT
As mentioned earlier, FLIM offers high background rejection and insensitivity to tissue thickness, photobleaching, and fluorophore concentration. Various methods exist for extracting the measured lifetimes; however, they are often very computationally intensive. Phasor FLIM overcomes this limitation by replacing each exponential decay by the coefficients of a single term of its Fourier series. Single-exponential terms can be represented by an intuitive semicircular phasor plot, with a simple correspondence existing between the phasor's location and its lifetime.
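The correspondence between lifetime and phasor location can be made concrete with the standard first-harmonic expressions for a single-exponential decay, g = 1/(1 + ω²τ²) and s = ωτ/(1 + ω²τ²), which place every such decay on the semicircle (g − 0.5)² + s² = 0.25. The sketch below is illustrative only; the 40-MHz frequency is borrowed from the gate characterization in Section III-C as an example value, and the function names are hypothetical.

```python
import math

def single_exponential_phasor(tau_s: float, freq_hz: float) -> tuple[float, float]:
    """Phasor coordinates (g, s) of a single-exponential decay at the first harmonic."""
    wt = 2.0 * math.pi * freq_hz * tau_s
    return 1.0 / (1.0 + wt ** 2), wt / (1.0 + wt ** 2)

def phase_lifetime(g: float, s: float, freq_hz: float) -> float:
    """Recover the lifetime from the phasor phase: tau = s / (g * omega)."""
    return s / (g * 2.0 * math.pi * freq_hz)

for tau_ns in (0.5, 2.0, 8.0):
    g, s = single_exponential_phasor(tau_ns * 1e-9, 40e6)
    # Every single-exponential phasor lies on the universal semicircle.
    assert abs((g - 0.5) ** 2 + s ** 2 - 0.25) < 1e-12
    print(f"tau = {tau_ns} ns -> g = {g:.3f}, s = {s:.3f}")
```

Short lifetimes map close to (1, 0) and long lifetimes close to (0, 0), so mixtures of two lifetimes fall on the chord between their single-exponential phasors.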
In SS2, we demonstrated the potential of SPAD arrays for use in FLIM with performance consistent with, and oftentimes exceeding, that of commercially available camera systems. During this analysis, however, we identified two drawbacks affecting FLIM performance, namely, a single-gate channel and the lack of rolling shutter operation in gated mode. Both of these limitations lower the camera's duty cycle; photons that arrive outside the gate or during the readout phase are lost. By adding gated rolling shutter mode and a second output channel, these issues have been directly addressed in SS3.
Following the procedure outlined in [30], a phasor-based FLIM analysis was performed on a sample of mammal colon, dyed with hematoxylin and eosin (H&E). The sample was illuminated by a 517-nm pulsed laser, operating at a [...] Subsequently, in the case of dual-gated analysis, the final phasor value was calculated by the average of the two gate channel phasors, weighted by their respective total photon counts. Fig. 12 shows the sampled FLIM image. In Fig. 13, we show the clear reduction in signal dispersion for a dual-channel versus single-channel architecture. When two gates of approximately equal length were used, the signal-to-noise ratio was increased by a factor of 1.35, approximately the expected √2 improvement.
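The count-weighted averaging of the two gate channel phasors can be sketched as follows. This is a simplified illustration, not the procedure of [30]: it computes an uncalibrated first-harmonic phasor from a sampled decay and assumes that both channel phasors have already been referenced to a common phase. The function names, the noiseless test decay, and the photon counts are hypothetical.

```python
import math

def phasor(times_s, counts, freq_hz):
    """First-harmonic phasor (g, s) of a sampled decay, normalized by total counts."""
    omega = 2.0 * math.pi * freq_hz
    total = sum(counts)
    g = sum(c * math.cos(omega * t) for t, c in zip(times_s, counts)) / total
    s = sum(c * math.sin(omega * t) for t, c in zip(times_s, counts)) / total
    return g, s

def combine_phasors(p1, n1, p2, n2):
    """Average two channel phasors, weighted by their total photon counts."""
    total = n1 + n2
    return ((n1 * p1[0] + n2 * p2[0]) / total,
            (n1 * p1[1] + n2 * p2[1]) / total)

# Hypothetical noiseless decay: 2-ns lifetime sampled over one 25-ns period.
freq_hz, tau_s, n_samples = 40e6, 2e-9, 2500
times = [i / (n_samples * freq_hz) for i in range(n_samples)]
decay = [math.exp(-t / tau_s) for t in times]

p_single = phasor(times, decay, freq_hz)
p_dual = combine_phasors(p_single, 1000, p_single, 3000)
print(p_single, p_dual)
```

For shot-noise-limited data, the signal-to-noise ratio scales with the square root of the number of detected photons, so doubling the collected photons with a second gate of equal length is expected to give a factor of √2 ≈ 1.41, close to the measured 1.35.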

V. CONCLUSION
In this article, we have reported on a 500 × 500 SPAD pixel array, fabricated in 0.18-μm CMOS technology. As shown in Table I, the performance of this sensor is comparable to, or exceeds, the current state of the art in many respects. The sensor has two contiguous time gates, a novel addition that allows for a temporal aperture of 100%. The gates are adjustable with a temporal resolution of 17.9 ps, and the minimum gate width was 0.99 ns, which, to our knowledge, is the shortest reported to date on a large-format SPAD image sensor. The sensor achieves 49.8 kframes/s in the dual-channel mode and 97.7 kframes/s in the single-channel mode, both in burst readout, while 44 and 88 kframes/s are achieved in continuous mode through a 2690-MB/s PCIe interface. The sensor was tested in a FLIM experiment, where the beneficial effects of timing/skew improvements, along with sub-10-cps DCR per pixel and 50% maximum PDP, are recognizable.

Fig. 2. Pixel circuit schematic in SS3. Nodes O1 and O2 are sampled to provide two outputs per pixel: intensity and gate.

Fig. 3. Timing diagram illustrating the two outputs of a single pixel. Due to the rolling shutter architecture, the position of each pixel's gate relative to the start of its exposure time may slightly vary. Channel O1 represents the total intensity and Channel O2 represents the second gate channel G2. Gate channel G1 is formed by subtracting O2 from O1. Green arrows represent detected photons, and red arrows are those that are missed due to a previous detection occurring in the same exposure window.

Fig. 5. Block diagram of SS3 and data acquisition hardware. Two FPGAs each receive the data from half of the array and output to a PC over either USB 3.0 or PCIe. The main motherboard includes a µC and programmable linear voltage regulators.

Fig. 6. Dark count map of 125 pixels × 125 pixels (top) and sample intensity image (bottom) from SS3. Hot pixels are randomly distributed and account for 1.8% of the total. Statistical analysis found no patterns in their spatial location.

Fig. 7. Population distribution of dark counts in SS3 at V_EX = 6.0 ± 0.1 V excess bias and at a temperature of 31 °C (yellow), 29 °C (red), and 27 °C (blue). Inset: Median DCR over a range of temperatures. At V_EX = 6 V and 27 °C, the median DCR is 9.20 ± 0.05 cps and over 90% of all pixels are below 20 cps.

Fig. 8. Population distribution of dark counts in SS3 at 27 °C and an excess bias voltage of V_EX = 6 (blue), 4 (red), and 2 V (yellow). Inset: Median DCR over a range of V_EX.

Fig. 9. SS3 average crosstalk probabilities. The nearest neighbor pixels are below 0.06%, and the nearest diagonal is below 0.03%.

Fig. 10. Temporal profile of a 1-ns gate (top) and its complementary channel (bottom) at a 40-MHz pulse repetition rate.

Fig. 11. Skew characterization for gate channel G1. The edge distributions have an FWHM of 109.4 and 153.4 ps for the rising and falling edges, respectively. The full range is approximately 500 ps for the rising edge and 650 ps for the falling edge. Behavior is consistent across both gate channels.

Fig. 12. FLIM lifetime image of a mammal colon taken with SS3. Lifetimes are relative and shown in arbitrary units.

Fig. 13. Phasor plot of mammal colon dyed in H&E when sampled for the same duration with one gate (top) versus two gates (bottom). The additional channel in SS3 allows for more photons per acquisition and reduces the dispersion in the lifetime measurement. In these plots, only the phasors of the bottom half of the sensor were displayed.

TABLE I
STATE-OF-THE-ART COMPARISON BETWEEN THIS WORK AND OTHER MEDIUM- AND LARGE-FORMAT SPAD IMAGERS. THE MAXIMUM PDP FOR THIS WORK WAS MEASURED SEPARATELY IN AN IDENTICAL SPAD. FOR BINARY GATED SENSORS, THE MAXIMUM COUNT RATE IS EQUIVALENT TO THE MAXIMUM FRAME RATE. THE NATIVE FILL FACTOR DOES NOT INCLUDE POSSIBLE IMPROVEMENTS FROM THE ADDITION OF MICROLENSES