Review of Quanta Image Sensors for Ultralow-Light Imaging

Abstract: The quanta image sensor (QIS) is a photon-counting image sensor that has been implemented using different electron devices, including impact-ionization-gain devices, such as single-photon avalanche detectors (SPADs), and low-capacitance, high-conversion-gain devices, such as modified CMOS image sensors (CIS) with deep subelectron read noise and/or low-noise readout signal chains. This article primarily focuses on CIS QIS, but recent progress of both types is addressed. Signal processing progress, such as denoising, critical to improving apparent signal-to-noise ratio, is also reviewed as an enabling coinnovation.


I. INTRODUCTION
Counting every photon is as sensitive as physics presently allows in measuring light. To count photons incident on the faceplate, optical losses must be minimized, detector quantum and collection efficiencies must be maximized, and detector dead times minimized. Measurement of ultralow quanta (light) flux using single photomultiplier tube (PMT) detector photon counting was suggested as early as the 1960s, e.g., [1]-[3]. A digital photon-counting image sensor using avalanche photodiodes (APDs) was suggested by Nippon Hōsō Kyōkai (NHK) [4]. In 1996, a hybridized photon-counting image sensor readout integrated circuit (ROIC) was investigated by Jet Propulsion Laboratory (JPL) [5] and the first solid-state single-photon avalanche detector (SPAD) was introduced [6]. In 2005, a new imaging paradigm based on photon counting was described by Fossum [7] that considered a future pixel pitch of 0.5 μm or less and very limited full-well capacity (FWC). A similar concept was proposed again in 2009 by École polytechnique fédérale de Lausanne (EPFL) [8]. Such a device is now often referred to as a quanta image sensor (QIS) [9]. Various photon-counting image sensors were reported in a special issue of Sensors [10]. Most photon-counting image sensors are actually photoelectron-counting devices, with reflection and quantum efficiency (QE) loss, carrier collection loss, and detector dead time presumed to be acceptable, but not perfect. The detection of single electrons with deep subelectron input-referred read noise (DSERN) has enabled the possibility of room-temperature megapixel photon-counting image sensors over the past ten years, with the assumption of high QE, or high photon-detection efficiency, which takes into account detector dead time. To achieve DSERN, two primary methods are used.
The first is carrier gain through the use of high electric field impact ionization either in avalanche diodes or through repeated high clock voltage charge transfer in an "impactron" [11] or electron multiplying (EM) charge-coupled device (CCD) [12]. The second method is the use of charge transfer devices such as a CCD or CMOS image sensor (CIS) with high conversion gain (CG) achieved through ultralow sense node capacitance and/or low-noise readout electronics. The required read noise was suggested by Teranishi in 2011 to be less than 0.3 e− rms [13], [14] and later reduced to 0.15 e− rms in 2013 [15]. SPAD pixels typically achieve DSERN with ease. The first successful CIS-type pixel to achieve DSERN and demonstrate electron quantization was reported in 2015 [16], [17]. Each approach has advantages and disadvantages.
The purpose of this review article is to provide a useful overview and digest of progress in QIS realization, and pointers to the literature that has developed in this field. The article contains three major sections. First is a general discussion of the QIS and its imaging performance. QIS devices have been implemented using CIS-type principles and technology (referred to as CIS QIS) and SPAD devices (referred to as SPAD QIS). A brief review of CIS QIS and SPAD QIS devices will be presented along with thoughts on where each technology may be going.
Section III discusses the recent advances in ultralow-noise imaging devices that can operate as CIS QIS but which also retain legacy advantages of CIS devices. Such devices have benefitted from the technology developed for CIS QIS.
Photon-counting image sensors like the QIS are often operated in low quanta flux environments where photon shot noise limits the signal-to-noise ratio (SNR) to the range 0 < SNR < 10. Computational imaging approaches have been developed to improve apparent image quality through algorithmic and machine-learning-based denoising, motion deblurring, and SNR enhancement of moving objects, and to make these devices useful for machine vision and consumer use in low quanta flux regimes. Progress in this area is reviewed in Section IV.
QIS devices will find applications where imaging in ultralow light is essential. These applications include security, night vision, space science, life sciences, biotech, quantum computing, aerospace, defense, and possibly automotive and consumer smartphones.

A. QIS Imaging Performance (Theoretical)
The QIS consists of an array of specialized pixels, referred to as jots, that are essentially binary in nature (indicating the arrival of at least one photoelectron, or not). The QIS was originally envisioned to consist of millions or billions of small-pitch, low-FWC jots read out at high frame rates, and thus very high bit rates. The concept originated when contemplating a future image sensor scaled to small pixel pitch and low FWC [7]. Image pixels are created from a local spatiotemporal ensemble of jot outputs (see Fig. 1) that are logically "zero" (no photoelectron) or "one" (at least one photoelectron). Bit density (D) is the number of logic "ones" divided by the total number of bits read out. It could be for a single jot read out many times (e.g., many frames) or a group of jots read out for one or more frames. The image sensor performance of QIS devices was analyzed by Fossum [15] for the expected value of D as a function of the average number of photons or photoelectrons that arrive at the jot during the exposure period, called the quanta exposure (H), the input-referred SNR (SNR_H), the dynamic range (DR), the bit error rate (BER) as a function of read noise, and other properties. In general, for H ≪ 1, the response is linear, but as H approaches 1, the response becomes sublinear with a substantial overexposure latitude. This nonlinearity is fundamental and due to the statistical arrival of photons that are well described by the Poisson distribution probability mass function, which is the underlying cause of photon shot noise in image capture.
Plotting D-log H yields an S-shaped curve as illustrated in Fig. 2. The S-shaped D-log H curve has been known since 1890 [18] where, in this case, D is grain density in developed photographic plates, and H is the light exposure. It was observed in a time before the quantum of light, the photon, was described by Planck and Einstein in the early 1900s. In fact, the same basic Poisson statistics are behind the D-log H characteristics of Hurter and Driffield, and those of the QIS.
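The S-shaped response follows directly from Poisson arrival statistics. As a minimal sketch, assuming an ideal 1-bit jot with no read noise or dark counts, the probability of at least one photoelectron gives D = 1 − exp(−H), and referring the binomial noise of an ensemble of M bits back to the input gives SNR_H = sqrt(M)·H / sqrt(exp(H) − 1). The function names and the ensemble size used below are illustrative choices, not values from the cited references.

```python
import math

def bit_density(h):
    """Expected bit density D of an ideal 1-bit jot at quanta exposure h."""
    # P[at least one photoelectron] for Poisson arrivals with mean h
    return -math.expm1(-h)

def snr_h(h, m=1):
    """Exposure-referred SNR for an ensemble of m ideal 1-bit reads."""
    # Binomial noise on D, referred to the input through dD/dH = exp(-h)
    return math.sqrt(m) * h / math.sqrt(math.expm1(h))

if __name__ == "__main__":
    # m=4096 is an assumed ensemble size, e.g. a 16 x 16 x 16 spatiotemporal cube
    for log_h in (-2, -1, 0, 1):
        h = 10.0 ** log_h
        print(f"H={h:6.2f}  D={bit_density(h):.4f}  SNR_H={snr_h(h, 4096):.2f}")
```

For small H the density tracks H linearly, while above H ≈ 1 the density saturates toward 1 and SNR_H rolls off, which is the overexposure latitude described above.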
The bit density, noise, and SNR predicted by the 2013 QIS model were first experimentally verified using a SPAD QIS in 2015 [19]. Measurement of the D-H characteristic can be used to estimate read noise and quantizer thresholds in CIS QIS devices [20], [21].
The binary QIS concept was expanded to include low bit-depth output, i.e., effective FWC greater than unity. The binary QIS is now referred to as a 1bQIS and the latter as a multibit QIS, or mbQIS. In the mbQIS, the low bit-depth digital value is equal to the number of electrons read out. Multibit quantizers can be programmable to trade power and readout speed with bit depth and concomitant nonlinearity, e.g., [16], [22], [23], [24]. This 1-7b photon number resolution capability differentiates mbQIS from higher read noise and higher bit resolution (∼10-14b) regular CIS devices. However, this differentiation has become blurred as regular CIS devices have emerged with DSERN, as described in Section II-B. Photon-counting error rates in 1bQIS and mbQIS were analyzed in 2016 [25].
It is noted that while the QIS is a binary-output image sensor, it differs from some binary sensors that have appeared in the literature over the years, wherein the threshold for triggering a change in output value typically represents a few or perhaps many photons, e.g., [8], [26].

B. Implementation: CIS QIS and SPAD QIS
In principle, any device that can detect photoelectrons with less than 0.15-0.30 e− rms read noise to achieve low BER (i.e., BER < 0.0005-0.005 bit errors/read) can be used as a QIS device. For example, a cooled EMCCD [12] can operate as a 1bQIS, albeit with a slower readout rate (but not so well as an mbQIS due to gain noise), and a cooled CCD with "skipper readout" (many nondestructive reads of a pixel) can also be used as a 1bQIS or mbQIS, albeit with an even lower frame rate [27].
Two major approaches seem promising at this time for room temperature (RT) application. The first is the CMOS image sensor (CIS) QIS and the second is the SPAD QIS (see Table I).

1) CIS QIS:
The CIS QIS approach requires a pixel with high CG and/or low input-referred read noise, and a quantizer circuit to convert the analog-sensed voltage signal to a digital value (one or more bits in depth, corresponding to the electron number). The first 1 kpix CIS QIS was reported in 2015 [17]. A 1 Mpix 3D-stacked backside-illumination (BSI) CIS QIS was reported in 2017 [28] with 1.1 μm pixel pitch, 1 kfps frame rate, 17.6 mW power dissipation, 0.21 e− rms average read noise, and 0.2 e−/s dark count rate (DCR). In fact, 20 different 1 Mpix QIS devices with varying designs were integrated on a single chip, so this might be considered a 20 Mpix QIS.
The advantages of the CIS QIS approach are small pixels (e.g., 1 μm pitch), high resolution (e.g., >100 Mpixels), very high photon detection efficiency (PDE), relatively low power, low electric field strengths, low DCR, photon number resolution (multibit QIS), and likely high manufacturing yield and lower cost for a given resolution. An indirect advantage is leverage from the advancement of regular CIS pixel technology and shrink, requiring less unique detector device engineering from generation to generation.
Drawbacks to the CIS QIS are primarily in control of the quantizer threshold voltage(s) across the sensor. Reduction in read noise and/or increased CG will ameliorate this drawback, as would self-calibration. Several techniques have been developed to characterize read noise and quantizer threshold [20], [29], [30].
QIS technology is being applied to achieve DSERN performance in CIS devices and enable ultralow-light image capture capability along with high-DR (HDR) and other features found in commercial and consumer CIS devices [31], [32].

2) SPAD QIS:
In 2020, the first 1 Mpixel SPAD QIS was reported (actually 2 × 0.5 Mpixel arrays) by a Canon/EPFL collaboration [41]. The SPAD QIS had a 9.4 μm pixel pitch and a 24 kfps frame rate, with power dissipation of up to 535 mW for 0.5 Mpixel readout. Canon further advanced the technology to achieve 3.2 Mpix with a 6.39 μm pixel pitch and a 60 fps frame rate, with DCR and PDE approaching CIS QIS levels, using a 3-D-stacked BSI process. This mbQIS has an 11b pixel-parallel digital counter in the bottom tier to allow photon number resolution and HDR. Power dissipation was not reported [42]. A SPAD QIS with a pixel-parallel digital counter (42.2 kpixels, 12.24 μm pixel pitch, and 60-250 fps) was reported by Sony at about the same time [43]. A novel 1-T SPAD QIS test array (200 pixels, 7 μm pitch) with a single access transistor to the pixel was presented by Fondazione Bruno Kessler (FBK) [44].
The primary advantage of SPAD QIS results from the nearly instantaneous and large carrier gain provided by the avalanche photodiode breakdown that is triggered by a photoelectron. The voltage pulse it creates can be used to time-stamp photon arrival, permitting time-of-flight measurement. The gain can be turned "off" to provide a gating function. Once triggered, the avalanche feedback process results in no apparent read noise. The lack of read noise is usually offset by a lower PDE, because not every photoelectron triggers the avalanche feedback process, so some photoelectrons are lost and uncounted.
The dual-mode operability of SPAD QIS to gate and record photon arrival times, as well as provide QIS-mode imaging, is a strong potential advantage of SPAD QIS compared to present-day CIS QIS but can result in a larger pixel pitch.
The use of high internal electric fields needed to trigger avalanche and high gain is a weakness of SPADs, resulting in the need to isolate pixels, in turn leading to larger pixel pitches. The higher electric fields can exacerbate DCRs and potentially impact device yield. Die cost is a function of pixel size, resolution, and yield, so at the current time, SPAD QIS is expected to be more costly to manufacture than CIS QIS.
Operation at higher photon count rates can cause large CV²f power dissipation in the SPAD array (e.g., 1-10 W), which can exceed that of the readout circuits, due to high bias voltages and avalanche currents [45] that must recharge the full pixel capacitance with each photon arrival.
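To give a feel for the CV²f scaling, the following back-of-the-envelope sketch multiplies the recharge energy per avalanche by the total count rate of the array. The array size, per-pixel count rate, pixel capacitance, and voltage swing are all hypothetical values chosen only for illustration. They are not measurements from [45] or from any device reviewed here.

```python
def spad_array_power(n_pixels, counts_per_s, cap_farads, swing_volts):
    """Dynamic C*V^2*f power (watts) for recharging a pixel after every avalanche."""
    energy_per_count = cap_farads * swing_volts ** 2  # joules per recharge event
    return n_pixels * counts_per_s * energy_per_count

if __name__ == "__main__":
    # Hypothetical operating point: 1 Mpixel, 10 Mcounts/s/pixel, 10 fF, 3 V swing
    p = spad_array_power(n_pixels=1_000_000, counts_per_s=1e7,
                         cap_farads=10e-15, swing_volts=3.0)
    print(f"{p:.2f} W")
```

Because the power is linear in count rate and quadratic in voltage, bright scenes and high bias voltages dominate the power budget under these assumptions.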
While the digital readout layer shrink will track digital circuit technology node improvement, pixel shrink at the SPAD layer may be more difficult to achieve and there may be little leverage from regular CIS technology improvements in terms of shrink aside from 3-D BSI stacking. However, earlier work in nano-sized APDs in 2007 may guide future SPAD shrink [46] and the minimum SPAD pixel size reported so far is 3 μm [47]. Scaling laws for SPADs were suggested in 2021 [48].

III. ACHIEVING DEEP-SUBELECTRON READ NOISE
In recent years, a significant amount of research effort has been spent on the reduction of read noise, for the development of QIS and the improvement of low-light imaging performance in CIS. Although a variety of approaches are being explored for reducing the read noise, they fall into two main categories: improving the CG of the pixel and reducing the voltage temporal noise of the in-pixel source follower (SF).
The improvement of pixel CG was realized in two ways: 1) reducing the floating diffusion (FD) capacitance and 2) replacing the in-pixel SF with high-gain amplifiers. Additionally, the reduction of the pixel SF temporal noise was demonstrated with buried-channel SFs and pMOS-based SFs. Correlated multiple sampling (CMS) is commonly combined with these techniques to further lower the read noise.
The advancement of the CMOS manufacturing process also contributes to the reduction of read noise. Subelectron read noise performance was reported in [49] and [50] with standard CIS devices fabricated in a 45 nm standard CIS process and a typical pixel CG of 110-120 μV/e−. The voltage read noise of these devices is reduced to about 100 μV rms without CMS and 70 μV rms with CMS.
The read noise performance of recently published low-noise CIS devices is summarized in Table II. Among the listed results, the lowest input-referred read noise was reported in [32] by Ma. Read noise of 0.19 e− rms was achieved in a 16.7 Mpix CIS QIS with 1.1 μm pixels. This record-low read noise was realized with a high CG of 340 μV/e−, enabled by the pump-gate pixel structure. As shown in Fig. 3, a photon-counting histogram (PCH) with 0.12 e− rms read noise is reported in this work. The discrete photoelectron peaks in the histogram are well aligned with the Poisson-Gaussian model, which demonstrates the reliable photon-counting capability of the sensor. A scatter plot of the read noise of these sensors vs. FD CG is shown in Fig. 4. The dashed reference curves show the input-referred read noise in voltage (μV rms). Without considering the difference in FD CG, the lowest voltage read noise (∼25 μV rms) was reported by Ge [51] and Lotto [52]. The reduction of voltage read noise was realized with in-pixel non-SF amplifiers with a significantly higher voltage gain. Subelectron read noise was also demonstrated with pMOS-based SF and buried-channel SF [53]-[57]. Both devices demonstrated effective noise reduction compared to the conventional nMOS-based surface-channel SF: ∼80 μV rms voltage read noise (pMOS) without CMS and 45 μV rms voltage read noise (buried-channel nMOS) with CMS.
These read noise reduction techniques are discussed in more detail in the sections below.

A. Small FD Capacitance
High pixel CG is demonstrated in multiple works with significantly reduced FD capacitance [17], [28], [31], [32], [55], [58]- [62]. The capacitance of the FD node in a standard CIS pixel includes a few components: 1) FD p-n junction capacitance; 2) FD to transfer gate (TG) overlap capacitance; 3) FD to reset gate (RG) overlap capacitance; 4) SF gate capacitance; and 5) intermetal capacitance. As the fabrication process advances, the gate oxide becomes thinner and the capacitance components 2)-4) increase proportionally. In the pixels with shared readout architecture [63], the FD node is coupled to multiple TGs, which proportionally increases the FD-TG overlap capacitance.
The total FD capacitance can be lowered by reducing one or more of these capacitance components. A pump-gate pixel structure was first reported by Ma [64] for the reduction or elimination of the FD-TG overlap capacitance with a distal FD. As shown in Fig. 5, a three-step electrostatic potential profile including a virtual-phase region is created in the pump-gate device to enable a complete charge transfer from the storage well (SW) to the distal FD node. This device was first fabricated in [17], and a CG of 426 μV/e− was demonstrated in 1.4 μm pixels, which is equivalent to a total FD capacitance of only 0.38 fF. In this work, DSERN (0.28 e− rms) was realized for the first time with CIS pixels due to the high CG, and its PCH demonstrated photon-counting capability. The pump-gate device was further improved [28], [31], [32] and recently implemented in commercial QIS products [65]. Despite the ultrasmall FD capacitance, good interpixel uniformity and low photon-response nonuniformity (PRNU) (∼1%) are realized in multimegapixel HDR QIS devices [32].
New pixel structures were also introduced to reduce other FD capacitance components. In [28], [58], [59], and [66], the reset transistor was replaced with a gateless reset diode, often termed "punchthrough reset (PTR)," to eliminate the FD-RG overlap capacitance. With the PTR diode, the FD node is reset by increasing the positive bias voltage of the reset drain (RD) node. As shown in Fig. 6, the higher bias increases the depletion width surrounding the RD node and lowers the potential barrier between the FD-RD junction, which allows the electron current to flow from the FD to the RD. With the PTR, a higher supply voltage is needed to achieve an equivalently high FD reset voltage to preserve the FD voltage swing and the DR. This requires an additional positive charge pump or other on-chip high-voltage generators and increases the complexity of the sensor. Hence, a bootstrapping operation was introduced in [59] to increase the FD reset voltage in the PTR by manipulating the FD capacitance before and after the reset operation, without increasing the bias voltage on the RD node.
The improvement of CG was also reported in the standard CIS pixels with mild implant modifications. In [60], optimized n + and lightly doped drain (LDD) implantation conditions were applied to the FD and the SF drain with lowered dose/energy to reduce the FD junction and the SF gate capacitance. A CG of 240 μV/e − was demonstrated with these modifications, which is equivalent to 0.67 fF FD capacitance.
Novel SF devices are also explored to reduce the SF gate capacitance. A JFET-based pixel SF was proposed in [67]. This is a p-channel JFET SF created in the pixels with implantations. The FD node functions as both the sense node and the gate of the JFET. The JFET is biased with a constant current source, and the output voltage follows the FD voltage when the JFET is biased in the saturation region. The characterization results of this device are reported in [68], and an extremely high CG of 540 μV/e− was measured from some pixels, which is equivalent to an FD capacitance of only 0.3 fF. However, a large across-device variation was also observed, likely due to the nonuniformity of the doping concentration of the JFET across the pixel array.
Fig. 6. Gateless reset diode reported in [28].
Fig. 7. Pixel-level common-source amplifier with a negative feedback and self-biased reset method, in reset configuration (left) and amplification configuration (right), reported in [52].
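The CG and capacitance figures quoted in this subsection are linked by CG = q/C_FD, where q is the elementary charge. The short sketch below (the helper name is an illustrative choice) converts the quoted conversion gains to the equivalent total sense-node capacitance, assuming the CG is referred to the FD node itself with no amplifier gain in between.

```python
Q_E = 1.602176634e-19  # elementary charge in coulombs

def fd_capacitance_fF(cg_uV_per_e):
    """Total sense-node capacitance (fF) implied by a conversion gain in uV/e-."""
    return Q_E / (cg_uV_per_e * 1e-6) * 1e15

# Conversion gains quoted in the text for [17], [60], and [68]
for cg in (426, 240, 540):
    print(f"CG = {cg} uV/e-  ->  C_FD = {fd_capacitance_fF(cg):.2f} fF")
```

The results round to 0.38 fF, 0.67 fF, and 0.30 fF, in agreement with the values stated above.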

B. Non-SF High CG Pixels
Another interesting approach to enable high CG in CIS-based pixels is to replace the pixel SF with other amplifiers with a higher voltage gain. In [52], the pixel SF is replaced with a pixel-level common-source amplifier with column-wise load resistors. A nominal voltage gain of 10 V/V and 300 μV/e− CG on the column output node were realized with this open-loop configuration. This yields a relatively low FD-referred CG of 30 μV/e−. The correlated double sampling (CDS) operation was used to cancel the pixel-to-pixel variations of the amplifier offset induced by the mismatch of the threshold voltage of the common-source transistors. A self-biased reset method with negative feedback (Fig. 7) was used to compensate for the variations of the pixels' linear output swing. A 2.5% PRNU was realized with these compensation schemes, which is still higher than the typical performance of SF-based CIS pixels but remarkably low for pixels with open-loop amplifiers. The sensor achieved 0.86 e− rms read noise. Considering the relatively low CG on the FD node, the input-referred voltage noise achieved with this approach is as low as 25.8 μV rms, which is significantly lower than the voltage noise of the SF-based pixels.
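The numbers quoted for [52] can be cross-checked with two simple relations: the FD-referred CG is the column-referred CG divided by the amplifier voltage gain, and the input-referred voltage noise is the read noise in electrons multiplied by the FD-referred CG. The sketch below uses illustrative helper names and assumes the nominal gain applies uniformly.

```python
def fd_referred_cg(column_cg_uV_per_e, voltage_gain):
    """Conversion gain at the FD node from the column-referred CG and amplifier gain."""
    return column_cg_uV_per_e / voltage_gain

def input_referred_voltage_noise(read_noise_e, fd_cg_uV_per_e):
    """Read noise in uV rms at the FD node from read noise in e- rms."""
    return read_noise_e * fd_cg_uV_per_e

cg_fd = fd_referred_cg(300.0, 10.0)               # 30 uV/e-
v_n = input_referred_voltage_noise(0.86, cg_fd)   # about 25.8 uV rms
print(cg_fd, round(v_n, 1))
```

This makes explicit why a sub-1 e− read noise at a modest FD-referred CG corresponds to a very low voltage-domain noise.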
A similar pixel-level voltage amplification architecture was also reported in [51] and [69] with an additional column-level sinc-type low-pass filter to further reduce the voltage noise. A minimum read noise of 0.31 e− rms and peak read noise of 0.42 e− rms were reported. However, the sensors suffer from large pixel-to-pixel CG variations (e.g., 240-2200 μV/e− in [69]), which may limit the implementation of this technique in applications that have strict requirements for PRNU.
Fig. 8. In-pixel differential common-source amplifier, reported in [70].
With a slightly different approach, an in-pixel differential common-source amplifier was proposed in [70]. As shown in Fig. 8, the differential common-source amplifier is formed with a readout pixel and a reference pixel, providing a nominal voltage gain of about 7.5 V/V and a column-referred CG of 560 μV/e−. The reference nodes, COM and VSL_REF, are connected in parallel among thousands of pixels that are read out simultaneously, which significantly increases the effective transistor size and reduces the temporal noise from the biasing transistors. This work realized 0.50 e− rms read noise and an improved PRNU of 2.5% compared to the single-ended configuration used in [51] and [69], which suggests better uniformity of the CG across the pixels.

C. SF Temporal Noise
In SF-based CIS pixels, the temporal noise from the SF is usually the dominant noise source. The temporal noise in an SF device consists of thermal noise, 1/f noise, and random telegraph noise (RTN). Thermal noise is present in all electrical circuits, and its cause is well understood to be the thermal fluctuation of the charge carriers inside the electrical conductor [71]. Similarly, 1/f noise is present in almost all electrical circuits. Its root cause, although extensively studied, is still debated [72]-[80]. The popular theories include the fluctuation of the number of charge carriers in the transistor channel and the fluctuation of the mobility of the charge carriers. However, none of the models explains all the experimental results. RTN is often present in a small portion of a large pixel array. The percentage of RTN pixels can be lower than 100 ppm in a modern CIS. However, because of their high noise magnitude and trimodal noise signature, RTN pixels usually appear in low-light images as "blinking" pixels and strongly degrade the image quality. The RTN in CIS is well known to be linked to trapping/emission events at defect-induced energy states inside the pixels, especially at the Si-gate oxide interface in the SF channel, e.g., [81]-[93]. Other RTN sources have also been observed in CIS [83], [84], [93], such as photodiode dark-current-induced RTN and gate-induced drain leakage (GIDL)-induced RTN.
The use of a "buried channel" was first introduced in buried-channel charge-coupled devices (BCCDs) to reduce the interaction between the charge carriers and interface traps, thus improving charge transfer efficiency [94]. This concept was later extended to in-pixel SF devices to reduce the RTN and 1/f noise [56], [57], [85], [95]. The buried-channel SF (BSF) reported in [95] consists of a thin n-type channel located near the Si-SiO2 interface and between the n+ doped source and drain. Because of the n-type buried-channel doping, this device has a negative threshold voltage. When the device is biased in the saturation region, the negative voltage across the gate and the channel creates a potential barrier near the Si-SiO2 interface with a barrier height of more than several kT/q, which protects the charge carriers in the channel from the interface traps. In [95], a 50% read noise reduction compared to surface-channel SFs of the same size and 205 μV rms input-referred read noise were reported. The effective noise reduction from the BSF was confirmed in [85], in which a 5× noise reduction at the 99.99th percentile and a 90× reduction of the RTN quantity compared to the surface-channel SFs were reported.
Additionally, reduction in 1/f noise and RTN was demonstrated with pMOS SFs in multiple works [53]-[55], [96]-[99]. The lower noise of pMOS can be explained by its lower active trap density, because the effective mass of a hole in the oxide is 10-20 times heavier than that of an electron and the potential barrier for a hole to tunnel into SiO2 is higher [75], [100]. The pMOS SF can be implemented in CIS pixels with a hole-based p-type process [97]-[99], or, more commonly in modern CIS, with an in-pixel n-well made with implantations to host the pMOS SF [53]-[55], [101]. However, the n-well will inevitably increase the pixel size and reduce the fill factor. In [53], a thin-oxide pMOS SF was implemented and 0.48 e− rms input-referred read noise was realized, which is equivalent to 76.8 μV rms read noise in the voltage domain. This work was expanded in [55], and the input-referred read noise was further improved to 0.32 e− rms with 250 μV/e− CG and CMS readout. In addition, in the pMOS SF reported in [101], a bulk-to-source connection was made to compensate for the body effect and improve the voltage gain of the SF.
As both 1/f noise and RTN are known to be inversely proportional to the gate size of the SF [79], [80], [91], [96], a larger SF size is desirable for the reduction of SF temporal noise. However, a larger SF also increases the capacitance on the FD node and reduces the CG. This tradeoff is discussed in [28] and [102]. Recently, a multigate SF was introduced as a possible solution to overcome this tradeoff with promising preliminary results [103].
Fig. 9. Example implementation of CMS operation in (a) digital domain [114] and (b) analog domain [112].

D. CMS and Noise Filtering
CDS readout is commonly used in modern CIS to eliminate the FD reset kTC noise and reduce the SF thermal noise and 1/f noise [104]. As an extension of CDS, CMS readout is often used to further reduce the read noise [17], [28], [31], [32], [49], [55], [57]-[59], [61], [70], [105]-[115]. With CMS, the pixel reset and signal voltage levels are sampled multiple times and the averages are subtracted. Hence, the pixel reset noise can be canceled through subtraction, just as in CDS, and the thermal noise and 1/f noise can be further reduced with averaging. CMS readout has been implemented in CIS in both digital and analog domains. Examples of the digital and analog implementation are shown in Fig. 9.
Compared to analog CMS, digital CMS requires a larger number of analog-to-digital converter (ADC) conversions, which results in a reduced frame rate and increased power consumption. The analog implementation is more time and power efficient; however, it is usually less efficient in noise reduction because of the additional kTC noise in the sample-and-hold circuitry. Novel circuit architectures are actively explored to overcome this tradeoff. For example, in [49] and [108], a selective digital CMS method was used to shorten the ADC conversion time needed for the multiple sampling. With this architecture, the pixel output is sampled simultaneously by a full-range ramp for large signal under strong illumination and a multiple sampling short ramp for small signal under dark conditions. This approach reduces the readout time needed for digital CMS while preserving the noise reduction efficiency, but it introduces additional complexity to the per-column ADC and the signal processing, as well as the chip area and power consumption.
The theoretical read noise reduction from CMS is $\sigma_{\mathrm{CMS}} = \sigma_{\mathrm{CDS}}/\sqrt{N}$, where $\sigma_{\mathrm{CMS}}$ and $\sigma_{\mathrm{CDS}}$ are the read noise with CMS and CDS, respectively, and N is the number of CMS cycles. However, the noise reduction observed in experimental results often shows lower efficiency than the theoretical model, especially with a large N (Fig. 10) [31], [32], [112], [114]. This phenomenon can be explained by lower frequency 1/f noise and the accumulation of the dark current on the FD node as the sampling time increases. As discussed in [115], a skipper-type CMS operation will be the most efficient for read noise reduction [116]-[118], as the effective sampling time can be kept short for each pair of reset and signal samples to cancel the low-frequency noise and the accumulation of FD dark current. However, this technique requires a floating gate or a similar type of readout architecture in the pixels, which reduces the CG on the FD node and increases the complexity of the pixel structure.
Fig. 10. Measured CMS noise reduction: (a) from [32] and (b) from [114].
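The ideal 1/√N scaling can be reproduced with a small Monte Carlo sketch. It assumes a dark pixel whose reset and signal samples share one kTC offset and carry independent white Gaussian noise only. The noise magnitudes, trial count, and function names are illustrative assumptions, and 1/f noise and FD dark current are deliberately left out, which is why measured devices fall short of this ideal.

```python
import math
import random
import statistics

def cms_read(n_samples, sigma_white, sigma_reset, rng):
    """One CMS readout of a dark pixel: mean(signal samples) - mean(reset samples)."""
    ktc = rng.gauss(0.0, sigma_reset)  # reset (kTC) offset, common to all samples
    reset = [ktc + rng.gauss(0.0, sigma_white) for _ in range(n_samples)]
    signal = [ktc + rng.gauss(0.0, sigma_white) for _ in range(n_samples)]
    return statistics.fmean(signal) - statistics.fmean(reset)

def cms_noise(n_samples, trials=5000, sigma_white=1.0, sigma_reset=5.0, seed=0):
    """Monte Carlo estimate of the rms read noise for a given number of CMS cycles."""
    rng = random.Random(seed)
    reads = [cms_read(n_samples, sigma_white, sigma_reset, rng) for _ in range(trials)]
    return statistics.pstdev(reads)

if __name__ == "__main__":
    sigma_cds = cms_noise(1)  # N = 1 is ordinary CDS
    for n in (1, 4, 16):
        print(n, round(cms_noise(n) / sigma_cds, 3), round(1 / math.sqrt(n), 3))
```

The large common reset offset cancels in the subtraction, and the remaining white noise averages down close to the 1/√N prediction.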
The reduction of read noise has also been demonstrated with other noise filtering methods that limit the noise bandwidth of the readout circuit. A faster CDS operation with a shorter time interval Δt between the two samples can effectively reduce the read noise [88], [119], and a similar reduction can be realized with a lower bias current of the pixel SF. However, both techniques have limitations with high-speed operation under high-light conditions, when a large signal swing and fast settling time are needed.

E. Superior Low-Light Imaging With DSERN
Reducing read noise from 1 e− rms to DSERN levels brings somewhat surprising improvements to the ultralow-light imaging performance with CIS-based multibit QIS. As shown in Fig. 11, a CIS QIS sensor is compared with two industry-leading CISs for security and cellphone applications under ultralow-light conditions (10 and 128 mlux) with the same exposure time and lens configurations. Despite the significantly smaller pixel size, the QIS provides remarkably better SNR and image quality, due to the ultralow read noise.
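A simple way to see why read noise matters most at the lowest signal levels is the textbook expression SNR = H / sqrt(H + σ²), which combines photon shot noise with Gaussian read noise. The sketch below evaluates it for a few read noise values. It is an idealized analog model that ignores quantization, dark current, and PRNU, and the signal levels are arbitrary illustrative choices rather than the conditions of Fig. 11.

```python
import math

def snr(signal_e, read_noise_e):
    """SNR for a mean signal (e-) with photon shot noise plus Gaussian read noise."""
    return signal_e / math.sqrt(signal_e + read_noise_e ** 2)

for sigma in (1.0, 0.5, 0.2):
    row = "  ".join(f"H={h}: {snr(h, sigma):.2f}" for h in (0.1, 1.0, 10.0))
    print(f"read noise {sigma} e- rms  ->  {row}")
```

At 10 e− per pixel the three cases are nearly indistinguishable, whereas at 0.1 e− per pixel the SNR changes by roughly a factor of three between 1 e− rms and 0.2 e− rms read noise.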

IV. SIGNAL PROCESSING FOR QIS
Data captured by a QIS is a three-dimensional space-time volume where each entry is a 1-bit or multibit digital number. Since in principle the jot size can be small and the temporal response can be fast, the binary outputs produced by the jots can be seen as repeated but independent measurements of the incident photon flux. A schematic of this image formation process is shown in Fig. 12. The process is a combination of color selection, photon arrival, noise injection, and quantization, among other sensor level modeling.
At the most basic level, the mathematical model of the measured jot value Y can be described by the following equation:

$$Y = \mathrm{ADC}\left\{ \mathrm{Poisson}(H + H_{\mathrm{dark}}) + \mathrm{Gaussian}(0, \sigma^2) \right\}$$

where $H$ is the quanta exposure, $H_{\mathrm{dark}}$ is the dark current, and $\sigma$ is the read noise standard deviation. The sum of the Poisson random variable and the additive Gaussian random variable accounts for the photon arrivals and the read noise, respectively.
A color filter array (CFA) is applied to the measurement to give color, and an ADC is used to convert the voltage to digital bits. Assuming that the underlying exposure H does not change rapidly over space and time, the random variable Y is sampled repeatedly to produce the observed data.
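As a minimal sketch of the image formation model described above (Poisson arrivals, additive Gaussian read noise, then quantization), the following Python snippet draws individual jot readings. The function names, the Knuth-style Poisson sampler, and the round-and-clip ADC are illustrative assumptions made here, not a description of any reported sensor; the CFA is omitted.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def simulate_jot(H, H_dark, sigma, bits, rng):
    """One reading: ADC(Poisson(H + H_dark) + Gaussian(0, sigma^2)).

    Assumption: the ADC rounds to the nearest integer electron count and
    clips to [0, 2**bits - 1]; a 1-bit jot therefore thresholds at 0.5 e-.
    """
    electrons = sample_poisson(H + H_dark, rng)
    voltage = electrons + rng.gauss(0.0, sigma)
    max_code = 2 ** bits - 1
    code = math.floor(voltage + 0.5)
    return min(max(code, 0), max_code)
```

Calling `simulate_jot` repeatedly with the same H mimics the repeated sampling of Y described in the text; with zero dark current and zero read noise, the fraction of ones in 1-bit mode approaches 1 − e^(−H).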
Vetterli and colleagues at EPFL [8], [121], [122] gave a precise abstraction of QIS, referring to it as an oversampling device because the information is embedded in the densely sampled measurements. The nonlinearity of the image formation makes the statistical properties of the data less straightforward than those of CIS [15], [123]-[125], and thus the signal extraction from the raw data to an actual image poses new challenges.
The rest of this section describes the signal processing aspects of QIS. The mathematical model presented here is one level above the device modeling. This means that the model is applicable whenever the image formation follows a Poisson-Gaussian distribution, subject to different parameters, e.g., CIS QIS has a lower dark current than SPAD QIS. Because of the identical mathematical formulation, the algorithms are valid for both CIS QIS and SPAD QIS. In fact, the reported algorithms seldom distinguish themselves based on the particular technology [142], [150].

A. Estimation for 1-Bit and Multibit QIS Signals
The basic building block of QIS signal processing is the model Poisson(H), obtained by ignoring the dark current and read noise. The ADC (or simply a threshold mechanism) turns the measured voltage into a quantized random variable Y whose range depends on the bit depth. For 1-bit signals, Y is binary with two states, Y = 1 and Y = 0, with $P[Y = 1] = 1 - e^{-H}$ and $P[Y = 0] = e^{-H}$. For multibit signals that saturate at a maximum level L, the probability of saturation is expressed through $\int_H^{\infty} t^{L-1} e^{-t}\,dt$, which is the upper incomplete Gamma function and is often used to derive theoretical results for QIS [123].
The statistical estimation of H based on Y can be carried out using maximum-likelihood estimation. In the case of 1-bit measurements with L = 1, the random variable Y follows a Bernoulli distribution. The maximum-likelihood estimate is therefore found by maximizing the likelihood function of a sequence of independent Bernoulli random variables. More generally, one can express the mean $\mu = E[Y]$ as a function of H and construct the estimate as the functional inverse of $\mu$ evaluated at the sample average. Estimators constructed in such a way satisfy the so-called mean invariance property [125].
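For the 1-bit case, the maximum-likelihood estimate has a well-known closed form under the idealized model P[Y = 1] = 1 − e^(−H): the estimate is −log(1 − average bit density). The sketch below simulates independent 1-bit readings and applies this formula. The function names and the cap applied when every reading equals one are assumptions made for illustration.

```python
import math
import random


def simulate_one_bit(H, n, rng):
    """n independent 1-bit readings with P[Y = 1] = 1 - exp(-H)."""
    p_one = 1.0 - math.exp(-H)
    return [1 if rng.random() < p_one else 0 for _ in range(n)]


def mle_exposure(bits):
    """Closed-form 1-bit MLE: H_hat = -log(1 - mean(Y))."""
    n = len(bits)
    ones = sum(bits)
    # Assumption: if every reading is 1 the MLE diverges, so cap the count
    # half a sample below n to keep the estimate finite.
    ones = min(ones, n - 0.5)
    return -math.log(1.0 - ones / n)
```

This is also an instance of inverting the mean: the bit density estimates 1 − e^(−H), and the estimate of H is obtained by applying the inverse of that function to the sample average.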

B. Feasibility and Performance Limit Analysis
The development of signal processing theory for QIS started around 2009 at EPFL [127], where the focus was to understand the oversampling nature of the problem and the corresponding statistical properties. A major report was published in 2012 [121], which derived the performance limit in terms of the Cramér-Rao lower bound by analyzing the 1-bit maximum-likelihood estimator. Then, between 2012 and 2014, a series of articles was published by Rambus [26], [128], [129] showing the feasibility of QIS for HDR imaging.
Three major theoretical questions are of particular interest. The first one is the noise statistics. At the sensor level, the noise analysis reported in [15] and [25] covered most of the essential concepts. Analysis of the multibit signals using the incomplete Gamma function was reported in [126].
The second question is the definition of the SNR. The output-referred SNR, i.e., the ratio between the mean $E[Y]$ and the standard deviation $(\mathrm{Var}[Y])^{1/2}$, is known to be unbounded for 1-bit signals because the bit density approaches the constant one as the exposure saturates. The more appropriate definition is the exposure-referred SNR [15], denoted as $\mathrm{SNR}_H$. Detailed mathematical analysis of $\mathrm{SNR}_H$ can be found in [125].
The third theoretical question is the analysis of the threshold. Yang et al. [121] provided a basic discussion, but a more comprehensive analysis was reported by Elgendy and Chan [123], who derived the theoretically optimal threshold and proposed an algorithm to automatically identify it. Along this line of analysis, there are earlier publications, such as [122], [130]. Sensor-level studies are reported in [21] and [30].
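The contrast between the output-referred and exposure-referred SNR discussed above can be checked numerically for the idealized 1-bit model P[Y = 1] = 1 − e^(−H). In the sketch below, the output-referred ratio simplifies to sqrt(e^H − 1), which grows without bound. The exposure-referred expression is a first-order (delta-method) approximation for an n-frame maximum-likelihood estimate. That approximation and the function names are assumptions of this sketch; the precise definitions are those given in [15] and [125].

```python
import math


def output_referred_snr(H):
    """E[Y] / sqrt(Var[Y]) for one 1-bit reading.

    With p = 1 - exp(-H): p / sqrt(p * (1 - p)) = sqrt(exp(H) - 1).
    """
    return math.sqrt(math.expm1(H))


def exposure_referred_snr(H, n):
    """H / std(H_hat) for the n-frame 1-bit MLE, to first order.

    Delta method: Var[H_hat] ~= Var[mean(Y)] / (dp/dH)^2 = (exp(H) - 1) / n.
    """
    return H * math.sqrt(n) / math.sqrt(math.expm1(H))
```

Under these assumptions the output-referred value keeps increasing with H, whereas the exposure-referred value rises, peaks at a moderate exposure, and then falls as the jots saturate, which is why the latter is the more informative figure of merit.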

C. From Iterative to Noniterative Algorithms
In early studies of QIS image reconstruction algorithms, a large amount of effort was spent on formulating the likelihood function of the 1-bit data and solving the associated maximum-likelihood [121], [123], [131], [132] or maximum a posteriori estimation [135], [136]. On one hand, the convexity of the 1-bit likelihood means that the optimization is solvable via an appropriately chosen convex optimization algorithm, e.g., the alternating direction method of multipliers (ADMM) [135] and its plug-and-play variant [136]. On the other hand, the iterative nature of these algorithms makes them unattractive in practice, especially when hardware constraints are considered.
The first noniterative 1-bit image reconstruction algorithm was based on the concept of variance stabilizing transform [137]. The idea is that if the binary measurement $Y_i$ follows the probability distribution where $P[Y_i = 1] = 1 - e^{-H}$, then the sum $S_n = \sum_{i=1}^{n} Y_i$ will follow the binomial distribution. Using a classical result by Anscombe [139], there exists a nonlinear transformation $T$ such that the transformed variable $T(S_n)$ will have a uniform variance. One can then apply any off-the-shelf image denoising algorithm for Gaussian noise to $T(S_n)$, and then apply the inverse $T^{-1}$ to recover the image [139], as shown in Fig. 13. Variance stabilizing transform is computationally inexpensive. With a lookup table and a built-in image denoising algorithm (such as those in mobile phones), the image can be recovered.
Fig. 13. First noniterative 1-bit QIS image reconstruction using the transform-denoise concept. Instead of running the maximum-likelihood estimate and then denoising, transform-denoise blends the Anscombe transform and denoising into the reconstruction process. Image courtesy: [137].
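The transform-denoise idea can be sketched on a single row of pixels as follows. The snippet uses the classical binomial form of the Anscombe transform (an arcsine of a square root, scaled so that the variance is approximately 1/4), a plain moving average as a stand-in for "any off-the-shelf denoiser", and the algebraic inverse rather than a bias-corrected one. These choices and all function names are assumptions made for illustration and are not claimed to match the implementation in [137].

```python
import math
import random


def anscombe_binomial(s, n):
    """Variance-stabilizing transform for S ~ Binomial(n, p); Var[T] ~= 1/4."""
    return math.sqrt(n + 0.5) * math.asin(math.sqrt((s + 0.375) / (n + 0.75)))


def inverse_anscombe_binomial(t, n):
    """Algebraic inverse of anscombe_binomial (not bias-corrected)."""
    return (n + 0.75) * math.sin(t / math.sqrt(n + 0.5)) ** 2 - 0.375


def box_denoise(values, radius):
    """Stand-in denoiser: moving average, window truncated at the edges."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def transform_denoise(counts, n, radius=4):
    """counts[i] = number of ones seen at pixel i over n 1-bit frames."""
    stabilized = [anscombe_binomial(s, n) for s in counts]
    smoothed = box_denoise(stabilized, radius)
    exposures = []
    for t in smoothed:
        s_hat = inverse_anscombe_binomial(t, n)
        # Clamp the bit density into [0, 1) so the logarithm stays finite.
        density = min(max(s_hat / n, 0.0), 1.0 - 0.5 / n)
        exposures.append(-math.log(1.0 - density))
    return exposures
```

Because both transforms depend only on the count and the number of frames, they can be tabulated in advance, which is consistent with the lookup-table implementation mentioned in the text.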
Moving scenes are more challenging. If the motion is moderate, averaging the pixels within a sliding space-time cubicle, followed by variance stabilizing transform and denoising, is often the most cost-effective solution [137]. There have been attempts to segment the moving parts so that the foreground and background are processed separately [140], [141]. However, object segmentation of 1-bit and few-bit data is as hard as solving the original reconstruction problem. A better and more reliable approach is to run image registration algorithms (known as optical flow in computer vision) [142]. In general, moving scenes remain an open challenge when the number of frames is small and the bit depth is low.
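The sliding-cubicle averaging mentioned above amounts to summing the 1-bit values inside a small space-time window around each pixel. The sketch below does this for one spatial dimension plus time. The function name `cubicle_counts`, the one-dimensional spatial layout, and the truncation of the window at the borders are hypothetical choices made to keep the example short.

```python
def cubicle_counts(frames, t_center, t_radius, x_radius):
    """Sum 1-bit values in a space-time window around each pixel of one row.

    frames[t][x] is the bit at frame t, pixel x. Returns (counts, sizes),
    where sizes[x] is the number of samples inside the window for pixel x.
    """
    t_lo = max(0, t_center - t_radius)
    t_hi = min(len(frames), t_center + t_radius + 1)
    width = len(frames[0])
    counts, sizes = [], []
    for x in range(width):
        x_lo, x_hi = max(0, x - x_radius), min(width, x + x_radius + 1)
        total = sum(frames[t][i]
                    for t in range(t_lo, t_hi)
                    for i in range(x_lo, x_hi))
        counts.append(total)
        sizes.append((t_hi - t_lo) * (x_hi - x_lo))
    return counts, sizes
```

Each count can then be treated as a binomial observation with its own number of trials and passed to a variance stabilizing transform and a denoiser. A larger window lowers noise but blurs moving content, which is the trade-off behind the "moderate motion" condition.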

D. Deep Learning: Designs and Challenges
As a powerful computational tool, deep learning is currently an active research area for QIS image reconstruction. In some sense, training a deep neural network for QIS is no different from training an image denoising network for a CIS. Especially for a multibit QIS where L is large, the forward image formation model is the same as that of a CIS. Perhaps the only noticeable difference is the read noise, which is significantly lower for QIS. Because of the (almost) identical procedure for synthesizing data, several deep neural network designs for QIS, e.g., the QIS reconstruction network (QISNet) [143], the U-Net [144], [145], and others [146], are all based on off-the-shelf networks but trained with a different data simulation process.
A more sophisticated deep neural network is the dual-encoder network by Chi et al. [147]. In this design, the network contains two teacher subnetworks, where one subnetwork encodes the motion and the other encodes the noise, as shown in Fig. 14. During training, the motion teacher network sees a noise-free dynamic sequence, whereas the denoising teacher sees a motion-free but noisy sequence. The features extracted by the two teachers are used as guidance for the student network, which is supposed to generate features similar to those of the teachers. By minimizing the appropriate training loss, one can decouple the motion and noise to allow the overall network to handle moving scenes with only a few frames. At the core of the dual-encoder network is the concept of knowledge distillation [148]. Theoretical analysis is an ongoing research topic [150]. Variants of the idea have been reported [150]. Fig. 15 shows a reconstructed image using the dual-encoder network.
Fig. 14. Conceptual diagram of the dual-encoder student-teacher network for reconstructing images from QIS data. Image courtesy: [147].
Despite the superb image reconstruction quality, deep learning approaches still have a long way to go before they can become an integral part of the sensor. Hardware constraints are certainly one obstacle, but even if one can run the computation on a graphics processing unit, the generalization of the neural network remains a question. With the huge variety of noise conditions, scene content, and camera configurations, it is nearly impossible to train one model that fits all scenarios. Some attempts have been made to maximize the consistency across noise levels [151], yet significantly more effort is needed to close the training-testing gap. Generative models, such as [152], will likely have an even bigger hurdle to overcome, as not every user would appreciate digital image hallucination.

E. Linear Inverse Problem Beyond Denoising
In some applications, QIS needs to overcome a variety of inverse problems such as deblurring, super-resolution, and so on. At the core of these inverse problems is the forward modeling where the exposure is modeled by the matrix-vector product Ax. Here, A is a linear operator capturing degradations, such as blur, and x is the underlying clean image to be estimated. (The bolded x symbol means a vector of pixel values instead of a single pixel.) In the simplest case that only considers the Poisson part (i.e., assumes zero dark current, no read noise, no CFA, and an ADC with a large bit depth), the estimation is a minimization

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \left\{ -\log p(\mathbf{y} \mid \mathbf{A}\mathbf{x}) + R(\mathbf{x}) \right\}, \qquad -\log p(\mathbf{y} \mid \mathbf{A}\mathbf{x}) = \sum_i \left( [\mathbf{A}\mathbf{x}]_i - y_i \log [\mathbf{A}\mathbf{x}]_i \right) + \text{const}$$

for some regularization function R(x), where y denotes the vector of measured counts. The equation says that the best estimate is found by minimizing the sum of the Poisson likelihood and the regularization function. The Poisson term captures the forward data fidelity, whereas the regularization encapsulates the prior knowledge of what a good image should look like. In signal processing, minimization of this type is known as maximum a posteriori estimation.
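To make the structure of this minimization concrete, the following sketch solves a small one-dimensional instance by projected gradient descent with backtracking. The three-tap circular blur, the quadratic smoothness regularizer weighted by `lam`, the positivity floor, and all function names are assumptions chosen for brevity. Practical algorithms use far richer regularizers and solvers such as ADMM or unrolled networks.

```python
import math

KERNEL = (0.25, 0.5, 0.25)  # hypothetical symmetric blur, circular boundary


def blur(x):
    """Apply A: circular convolution with KERNEL (symmetric, so A^T = A)."""
    n = len(x)
    return [KERNEL[0] * x[i - 1] + KERNEL[1] * x[i] + KERNEL[2] * x[(i + 1) % n]
            for i in range(n)]


def objective(x, y, lam):
    """Poisson negative log-likelihood (up to a constant) plus lam * R(x)."""
    n = len(x)
    fidelity = sum(a - yi * math.log(a) for a, yi in zip(blur(x), y))
    smooth = sum((x[(i + 1) % n] - x[i]) ** 2 for i in range(n))
    return fidelity + lam * smooth


def gradient(x, y, lam):
    """Gradient of objective: A^T (1 - y / Ax) + lam * dR/dx."""
    n = len(x)
    data_grad = blur([1.0 - yi / a for a, yi in zip(blur(x), y)])
    reg_grad = [2.0 * (2.0 * x[i] - x[i - 1] - x[(i + 1) % n]) for i in range(n)]
    return [d + lam * r for d, r in zip(data_grad, reg_grad)]


def map_estimate(y, lam=0.1, steps=200, floor=1e-3):
    """Projected gradient descent with backtracking; x is kept >= floor."""
    x = [max(yi, floor) for yi in y]
    step = 1.0
    for _ in range(steps):
        g = gradient(x, y, lam)
        current = objective(x, y, lam)
        while step > 1e-8:
            trial = [max(xi - step * gi, floor) for xi, gi in zip(x, g)]
            if objective(trial, y, lam) <= current:
                x = trial
                step *= 2.0  # let the step grow again after a success
                break
            step *= 0.5
    return x
```

The data term couples deblurring and shot-noise removal in a single objective, which is what makes the problem harder than Gaussian denoising: the gradient involves the ratio y / Ax, so the estimate must stay strictly positive.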
Solving this minimization is nontrivial. At the very least, the algorithm needs to invert the blur A while simultaneously removing the Poisson shot noise. This type of Poissonian problem has been known for a long time, but most algorithms can only handle Poisson noise to some extent [153]-[155]. One recent proposal is to integrate the classical maximum a posteriori estimation with deep learning via the so-called deep network unrolling [156], [157]. The idea is to consider a three-operator splitting strategy in the classical ADMM formulation [153], and then unfold the neural network to implement the iterative procedure, as shown in Fig. 16. The advantage of such an unfolded network is that it uses multiple iterations to progressively deblur the image, so that the solution trajectory follows a smooth path and hence yields a more stable solution. Fig. 17 shows a pair of real image reconstruction results.

F. New Considerations for Color
Color processing of QIS data requires some rethinking because the pixels are now below the diffraction limit. Designing new CFAs is one direction [158], [159], where one needs to find a good compromise between aliasing, crosstalk, and transmittance. The current solution for QIS is to use the so-called quad Bayer pattern, where a neighborhood of 2 × 2 pixels shares the same color filter. Quad Bayer is gaining popularity among major image sensor manufacturers [161]. However, since a quad Bayer pattern has a fundamentally different frequency response than the traditional Bayer pattern, one needs to either completely redesign the demosaicking algorithm (and hence the image and signal processing pipeline) or convert the quad Bayer to Bayer.
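The quad Bayer layout and the simplest way of converting it to a Bayer mosaic can be sketched as follows. The RGGB ordering, the function names, and the use of 2 × 2 summation (binning) are assumptions made for illustration. Binning halves the resolution, so it is only one possible conversion and is not a full-resolution remosaicking algorithm.

```python
BAYER = (("R", "G"), ("G", "B"))  # assumed RGGB ordering


def quad_bayer_color(row, col):
    """Color of the filter over pixel (row, col) in a quad Bayer CFA.

    Each 2 x 2 block of pixels shares one filter, and the blocks themselves
    are arranged in a Bayer pattern.
    """
    return BAYER[(row // 2) % 2][(col // 2) % 2]


def bin_quad_to_bayer(mosaic):
    """Sum each same-color 2 x 2 block, giving a half-resolution Bayer mosaic.

    Output pixel (i, j) carries the color BAYER[i % 2][j % 2].
    """
    rows, cols = len(mosaic), len(mosaic[0])
    return [[mosaic[r][c] + mosaic[r][c + 1]
             + mosaic[r + 1][c] + mosaic[r + 1][c + 1]
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]
```

For photon-counting data, summing the four same-color jots also adds their counts, so the binned mosaic has more signal per sample at the cost of spatial resolution.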
Another challenge of processing color for QIS is the intrinsic Poisson statistics at low light [161]. Traditional demosaicking algorithms are not designed to handle this level of noise. One solution is to demodulate the filter response of the CFA and decouple the luma channel from the two chroma channels. Since the luma channel has three times the SNR of the chroma channels, it can be used to guide the denoising process of the chroma channels [120].
Fig. 16. The idea is to rewrite the classical ADMM algorithm into a sequence of learnable blocks using neural networks. Image courtesy: [156].
[Figure caption fragment] ... [157]. Note the improved detection capability of the sensor-algorithm combination.

G. New Capabilities in Computer Vision
Computer vision applications such as detection, tracking, recognition, and classification can all be performed using QIS, as in the examples shown in [141]. The straightforward solution is to first run the image reconstruction algorithm to recover the image, and then run off-the-shelf recognition and detection algorithms.
In the case of deep learning, one can perform the so-called end-to-end training for both reconstruction and recognition modules [162]. However, if the end goal is recognition, the reconstruction can be skipped [163] because deep neural networks today often have large enough capacities to handle reconstruction and recognition together. For object classification, one solution is to use knowledge distillation to match the features of the noisy data with the features of the clean data. It was shown that, even without a reconstruction module, the recognition performance can be promising [157]. Fig. 18 shows an example of tracking an object in the dark. For more complex scenes involving motion, advanced techniques such as nonlocal feature pooling can be added to improve the quality of the features.

H. Bigger Signal Processing Landscape
The mathematical model and signal processing algorithms for QIS can be borrowed from, and applied to, its sister technologies. The closest one is SPAD-based image processing, where recent works have shown a variety of applications from light detection and ranging (LiDAR) to passive imaging [142], [164]-[167]. Another line of work is first-photon imaging, where the goal is to perform time-of-flight imaging using one or a few photons [168], [169]. On the algorithmic side, seeing in the dark has been a major research thrust in computer vision [170], [171]. The idea is that when the scene is not completely dark, the raw sensor data will contain enough information for image recovery.
As far as applications are concerned, QIS is a natural option for a variety of imaging applications in scientific imaging, medical imaging, space imaging, security and defense, and low-light photography. The choice of the application will determine the corresponding optics, sensors, and the image processing algorithms. One distinction that should be made is who is going to consume the image data. If it is for human consumption, e.g., photography, then image quality will be the highest priority. If it is for machine consumption, such as automated inspections in advanced manufacturing or autonomous vehicles, then features extracted from the data would matter more than the actual image. Given the flexibility and freedom in processing the QIS data in the space-time volume, the different applications should receive different treatments.

V. CONCLUSION
The QIS concept started as a method to address the impact of pixel shrink on CIS and to use photoelectron counting to create an image. SPAD QIS was an early obvious choice for implementation except for pixel pitch and photon number resolution. Implementation of CIS QIS revealed a path to DSERN for photon counting. SPAD QIS may become commercially viable based on recent progress with continued pixel shrink, including shrink of any in-pixel counter. Technologies developed for CIS QIS for deep subelectron read noise are finding their way into mainstream CIS devices for ultralow-light imaging with small pixels. Improvement in read noise to the 0.15 e− rms level for all pixels remains a future goal, either through an increase in CG or lower SF noise. Adoption of the early 1bQIS concept will depend on on-chip data compression and processing, as well as off-chip readout data rates, and will be application-dependent in implementation. QIS devices may also find use in future systems where photon analysis by wavelength, polarization, arrival time, and other properties reduces an otherwise sufficient number of total photons to very sparse photon numbers, making photon counting with high accuracy important. Photon counting may also be important in quantum information systems for the optical readout of quantum-computer qubit states and for quantum communications, where pixel spatial density and readout speed will become increasingly important.