Exploring Speed Maximization of Frequency-to-Digital Conversion for Ultra-Low-Voltage VCO-Based ADCs

A frequency-to-digital converter (FDC) performs precise frequency digitization within a voltage-controlled oscillator (VCO)-based ADC. To be compatible with energy-harvesting (EH) Internet-of-Things (IoT) devices, the development of ultra-low-voltage (ULV) FDCs is crucial, where the primary focus must be directed towards maximizing data throughput under the severe constraints on reliability and timing variability associated with deep-subthreshold operation. This article investigates the speed maximization of a 0.2 V full-custom ULV FDC design, consisting of an array of parallel XOR-based FDC units and the multi-rate decimation-filtering digital back-end. At the core of this exploration is a high-speed sense-amplify phase sampler (PS) featuring hardware redundancy, capable of sampling the phase of low-voltage-swing inputs. Particular focus is placed on the yield-based reliability-driven design methodology for the sense-amplify phase-sampling circuits running up to 40 MS/s and on practical variability-mitigation strategies. To overcome the speed bottleneck in the digital back-end, a fully parallel bitstream-processing architectural composition of the computations for summation and decimation is proposed. Experimental verification through measurements of the FDC integrated within a 10-bit 160 kHz bandwidth (BW) open-loop VCO-based ADC, across clock frequency and supply variations, demonstrates robust operation of the first 0.2 V multi-phase FDC in an advanced 28 nm CMOS process.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To enable the widespread adoption of EH for a broader range of IoT applications, the achievable data throughput should therefore be improved by several orders of magnitude. This means that conventional ULV design methodologies for circuit energy minimization must now be accompanied by speed maximization. The digital inverter gate in 28 nm LP CMOS technology experiences a dramatic 200× increase in propagation delay (t inv ) as V DD scales from 0.8 V down to 0.2 V, as shown in Fig. 1. Although this voltage scaling offers a 16× saving in switching energy ($CV_{DD}^2$), the t inv of around 4 ns indicates that ULV digital circuits can function efficiently only with a clock rate of a few MHz. Moreover, a caveat of exploiting the shorter transistor channel length afforded by a nanometer-scaled CMOS process [i.e., by migrating to the 28 nm node to increase the transistor's transit frequency ( f T )] is that local intra-die mismatches become more prominent and prohibitively difficult to manage, especially in weak inversion [17]. To cover the random delay variability (i.e., σ inv /μ inv ) of 20% at 0.2 V, compared to less than 5% at 0.8 V, extensive timing margins must be allocated, such that the circuit will rarely operate at its fastest intended speed.
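As a rough numerical illustration of the figures quoted above, the following sketch recomputes the switching-energy saving from the CV² relation and shows how a timing guardband inflates the usable inverter delay. The 3σ guardband and the function names are assumptions made for illustration only; the article does not state which margin is applied.

```python
def switching_energy_ratio(v_high, v_low):
    """Ratio of C*V^2 switching energy between two supplies (same capacitance)."""
    return (v_high / v_low) ** 2


def guardbanded_delay(t_mean, sigma_over_mu, n_sigma=3.0):
    """Mean delay inflated by an n-sigma margin (n_sigma = 3 is an assumption)."""
    return t_mean * (1.0 + n_sigma * sigma_over_mu)


# Values quoted in the text: 0.8 V -> 0.2 V, t_inv of about 4 ns, sigma/mu of 20%.
energy_saving = switching_energy_ratio(0.8, 0.2)
t_inv_margin = guardbanded_delay(4e-9, 0.20)

print(f"switching-energy saving: {energy_saving:.0f}x")
print(f"guardbanded t_inv at 0.2 V: {t_inv_margin * 1e9:.1f} ns")
```

Under this assumed margin, a single gate must be budgeted well above its mean delay, which is one way to read the statement that the circuit rarely runs at its fastest intended speed.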
The class of hardware used for frequency-to-digital conversion seems particularly affected by deep-subthreshold operation, yet it has remained relatively unexplored. Frequency-to-digital converters (FDCs) are best known for their use in the voltage-controlled-oscillator (VCO)-based analog-to-digital converter (ADC), where they provide a precise digitization of the frequency-modulated (FM) VCO output [18], [19], [20], [21], [22], [23], [24], [25], [26], [27]. FDCs have also found importance in all-digital phase-locked loops (ADPLLs) requiring high-speed phase accumulators [28], [29], [30]. Despite using the same foundational building blocks (i.e., latches and combinational logic gates) as digital processors and memories, their functionality and requirements are fundamentally different. A basic digital processor performs arithmetic computations synchronous to the system clock. The processing throughput can be enhanced by dividing operations into multiple pipeline stages, while budgeting timing margins for the critical delay paths of each pipeline stage prevents timing violations from occurring. Timing variations are addressed with techniques for dynamic error monitoring, detection and resilience [31], [32]. FDCs, on the contrary, operate asynchronously to the system clock to sample and synchronize the continuous-time frequency information, presented as VCO phase outputs with sluggish transition edges and highly amplitude-modulated voltage swings, resembling an analog rather than a strictly digital waveform. To digitize the frequency information with sufficient precision, FDCs employ parallelism, oversampling and noise-shaping principles, taking the form of a multitude of modulators operating in the multi-phase configuration.
In addition to the high-speed digital processing for summation and decimation, the need for a fixed and fast clock rate (i.e., a constant and short sampling period, which define the time-base and timing resolution of the frequency measurement, respectively) at the asynchronous phase-sampling interface raises serious concerns regarding reliability.
In this article, we introduce the deep-subthreshold multiphase FDC (consisting of parallel XOR-based FDC units with embedded sense-amplify phase-sampling) of the 0.2-V open-loop VCO-based ADC shown in Fig. 2, whose front-end (consisting of the VCO core and the analog phase-processing circuits) was described in [1]; we also cover the downstream decimation filtering (DEC block). We wish to emphasize that the common theme for the proposed solutions throughout this work is not to propose novel FDC architectures, nor to over-exploit circuit sizing optimization strategies. Rather, the aim is to provide an in-depth study of the FDC operation in the deep-subthreshold regime and, moreover, to smartly and efficiently re-engineer the FDC and DEC circuit blocks for high-speed PVT-tolerant operation at a V DD of 0.2 V.
Framed in a context more familiar to the designer, 40 MS/s at 0.2 V translates to upwards of 10 GS/s operation around 1 V for 28 nm LP CMOS technology (a ballpark approximation based on the inverter propagation delay characteristics in Fig. 1) but exhibiting more than 4× delay variability, impacting the yield dramatically. To address these critical concerns, the contributions of this article are as follows: 1) the 'bottom-up' exploration of the high-speed sense-amplify asynchronous phase-sampling interface; 2) the incorporation of hardware redundancy techniques for the mitigation of circuit variability; 3) the integration of the multi-phase FDC array for large BW frequency digitization; and 4) the architectural composition for the multi-rate decimation filtering and summation logic of the digital back-end. We begin in Section II by describing the main classes of open-loop FDCs and their challenges encountered at 0.2 V. Sections III and IV investigate the sense-amplifier phase-sampling FDC at, respectively, the circuit and system levels of abstraction. Section V describes the digital processing of the FDC bit-stream outputs, namely the digital filtering and downsampling necessary for output decimation to the Nyquist data-rate. Section VI offers experimental characterizations of the FDC and DEC blocks around 0.2 V, at different clock frequencies.

A. XOR-Based FDC
The simple FDC structure illustrated in Fig. 3(a) observes whether the VCO output, oscillating at frequency f VCO , transitions within a clock period t CLK of the synchronous sampling clock CLK (of frequency f CLK ) by sampling the phase state of the VCO and differentiating two consecutive samples using an XOR gate. With the multi-phase VCO outputs (e.g., φ 1 to φ 4 ), the expected digital output is

$$D_{out} = E\left\{ \frac{2 N_{FDC}\, f_{VCO}}{f_{CLK}} \right\}$$

where N FDC represents the number of parallel FDC slices working in tandem. In the N FDC = 4 configuration of Fig. 3(a), the toggling between counts 3 and 4 maps f VCO to an intermediate frequency between 3 f CLK /8 and 4 f CLK /8. When the clock oversampling rate (OSR = f CLK /2BW) is beyond 100, a single bit-stream can encode frequency information with an effective resolution of more than 10 bits after the decimation to the Nyquist rate [26]. The maximum detectable frequency is upper-bounded by f CLK /2 but, interestingly, sub-sampled operation [33] is also permitted [n f CLK /2 < f VCO < (n + 1) f CLK /2 for integer values of n > 0].
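The counting behavior described above can be reproduced with a short behavioral model. The sketch below is an idealized illustration, not the article's circuit: it assumes square-wave VCO phases spaced uniformly by π/N FDC, ideal sampling without metastability, and hypothetical function and parameter names. Averaging the per-cycle counts then approaches 2·N FDC·f VCO /f CLK, which is consistent with the N FDC = 4 example toggling between counts 3 and 4.

```python
import math


def xor_fdc_counts(f_vco, f_clk, n_fdc, n_samples, phase0=0.1):
    """Per-clock-cycle output counts of an ideal multi-phase XOR-based FDC.

    Assumptions (illustrative only): square-wave phases spaced by pi/n_fdc,
    ideal sampling, and f_vco < f_clk / 2 (no sub-sampling).
    """
    counts = []
    prev_bits = None
    for n in range(n_samples + 1):
        theta = phase0 + 2.0 * math.pi * f_vco * n / f_clk
        # Sample the logic state of every VCO phase at this clock edge.
        bits = [
            int(((theta + k * math.pi / n_fdc) % (2.0 * math.pi)) < math.pi)
            for k in range(n_fdc)
        ]
        if prev_bits is not None:
            # XOR of consecutive samples flags a transition in each slice;
            # summing over slices gives the multi-bit FDC output.
            counts.append(sum(b ^ p for b, p in zip(bits, prev_bits)))
        prev_bits = bits
    return counts


# f_VCO midway between 3*f_CLK/8 and 4*f_CLK/8, with N_FDC = 4.
counts = xor_fdc_counts(f_vco=17.5e6, f_clk=40e6, n_fdc=4, n_samples=4000)
mean_count = sum(counts) / len(counts)
print(sorted(set(counts)), mean_count)
```

In this idealized setting the output only ever takes the two adjacent counts, and the frequency information is carried by how often each one occurs.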

B. Counter-Based FDC
The structure in Fig. 3(b) accumulates the number of VCO periods in the consecutive cycles of the much slower CLK [thus allowing $f_{VCO} < (2^{N_{cnt}} - 1)\, f_{CLK}$], where N cnt is the word length of the digital incrementer (CNT). The digital subtractor differentiates the consecutive stored counter values to determine a measure of the frequency information in its digital format (D out = E{ f VCO / f CLK }) [28], [34].
The need to synchronize the asynchronously incremented binary values of CNT to the sampling clock leads to regular occurrences of metastability. Furthermore, mismatch-induced propagation delay skews across the incrementer bits can result in severe timing misalignment and cause unrecoverable hard-failures of the FDC even if the sampling process itself is free of metastability. Several solutions for high-voltage designs successfully combat this erroneous synchronization. The Gray coding scheme [35] limits metastability-induced sampling errors to within a single LSB, but sacrifices the speed of operation. Double sampling introduces redundancy to avoid timing windows where the count transitions, but at the cost of more than 2× power consumption. The valid sampling window is less than half of the VCO period, demanding the critical path delay of the counting circuitry (e.g., the frequency divider [36]) to be significantly shorter. From the perspective of maximizing the clock frequency to increase OSR, the counter-based topologies become nonviable at ULV due to the need to run both sequential and combinational logic of the timing-sensitive multi-bit digital incrementer at the VCO frequency, a much higher speed than the sampling registers already running at the outlined upper limit of f CLK .
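To illustrate why Gray coding bounds a mis-sampled count to a single LSB, the sketch below compares how many bits change between consecutive counter states in plain binary and in reflected Gray code. The 4-bit word length is an arbitrary choice for illustration and is not taken from the article.

```python
def binary_to_gray(n):
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


N_CNT = 4  # illustrative counter word length (assumption)
states = 2 ** N_CNT

gray = [binary_to_gray(n) for n in range(states)]
max_flips_gray = max(
    hamming_distance(gray[i], gray[(i + 1) % states]) for i in range(states)
)
max_flips_binary = max(
    hamming_distance(i, (i + 1) % states) for i in range(states)
)
print(max_flips_gray, max_flips_binary)
```

Because only one bit is in flight at any increment (including the wrap-around), a sample taken during a transition resolves to either the old or the new count, whereas a binary counter can have every bit changing at once.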
The coarse-fine architecture shown in Fig. 3(c) extends the counter-based structure (integer-count value) to detect the phase transitions in a power-efficient manner (fractional-count value). The power efficiency of this architecture is similar to the XOR-based FDC (with decimation, it actually becomes superior) [24]. However, all of the issues encountered in the counter-based FDC (and their remedies) are inherited here. In fact, the coarse-fine partition brings additional timing issues due to the delay skew between the coarse and fine quantizers, along with phase delay mismatches within the fine quantizer itself. Architectures utilizing phase reordering [37], scrambling [38] or linear-state feedback registers [39] can be employed, at the cost of an increased system complexity. It remains to be seen whether such solutions can be efficiently ported to the deep-subthreshold environment without functionally limiting its operation to a very slow, suboptimal clock speed, where constraints such as leakage power, mismatches, variability and exponential decrease in maximal speeds place significant boundaries on what can be practically implemented.

C. Proposed XOR-Based FDC Unit Implementation
The bit-stream processing nature of the XOR-based FDC allows us to simultaneously maximize the "raw" f CLK speed and increase the N FDC parallelism while eliminating potential sources of errors that can arise from the unpredictable timing variability. Therefore, the XOR-based FDC is utilized in our work as a base to explore high-speed, multi-phase frequency-to-digital conversion. For completeness, we briefly describe our circuit implementation of the single FDC unit presented in Fig. 4. The sense-amplifier flip-flop (SAFF) phase sampler (PS) reads the state of the asynchronous differential VCO waveforms (φ p and φ n ). The amplification of the FDC differential inputs φ p and φ n upon assertion of CLKi leads to a reduction in the FDC metastability window when sampling slow, commutating transition edges of the VCO phase outputs [40]. Negative-edge-triggered master-slave D-flip-flops (DFFs) store the regenerated signal Q0. The digital differentiator XOR uses Q1 and Q2 to isolate its transient dynamics from the input-dependent CLK-to-Q delay of the SAFF stage. The clock is routed in the opposite direction of data flow, and negative clock skew is introduced to prevent hold violations [41]. D out,d is obtained by passing the FDC unit bitstream D out through a 2nd-order, downsample-by-4 decimation filter.

III. SENSE-AMPLIFY PHASE SAMPLING
A. StrongARM Sense-Amplifier
An intriguing aspect yet to be explored in deep-subthreshold is the underlying structure of the high-speed analog phase-sampling SAFF circuit. For the design of this PS, we are concerned mostly with its speed and thus its metastability behavior. The FDC's first-order noise-shaping of phase-sampling non-idealities through its differentiation (1 − z^−1) means it is not additionally burdened by the strict constraints on the input-referred offset, noise and kickback encountered with high-speed voltage comparators. A baseline SAFF to first consider is the StrongARM-style sense-amplifier [40] shown in Fig. 5(a). The latch embeds the back-to-back 'inverters' M 4−7 , whereby outputs Qi and Q̄i are preset to V DD during the pre-charge phase (CLKi is low). The back-end SR-latch holding the state of Q, Q̄ must be of the NAND-type. When CLKi is high, M 4−7 form a positive feedback network. The regeneration time constant (τ) equals C latch /G m,latch , with the large-signal transconductance discharging the effective capacitance seen at one of Qi or Q̄i to zero, dependent on the polarity of the differential input voltage ΔV φ . A key factor in the latch regeneration, and thus the speed of the overall SAFF, is the method whereby the analog inputs (φ p , φ n ), through the input pair M 2,3 , unbalance the latch inverters.
[Fig. 4: Circuit implementation of the standard XOR-based FDC unit. The front-end asynchronous sampling interface and back-end decimation filter (DEC) are particularly difficult to design at 0.2 V for speed-maximized conditions.]
Observe that for a small ΔV φ , the input common-mode sets the gate-source voltage V GS of M 2,3 to be just V DD /2 (i.e., 0.1 V). The latch inverters become severely current-starved. The integration of a differential voltage across Qi, Q̄i is therefore sluggish, impeding the kick-start of the positive feedback action by M 4−7 . Even for rail-to-rail inputs (ΔV φ of ±200 mV), the finite on-resistance R on of M 2,3 within the series stacking of four transistors results in less voltage headroom allocated for M 4−7 , effectively degenerating G m,latch . The only solution that can remedy the detrimental effects of the M 2,3 input devices is to dramatically up-size their widths. To compensate for the subsequent increase in the capacitive loads seen at φ p and φ n , power-consuming input buffers would be required.

B. Modified Sense-Amplifier
An alternative approach taken in this work is to employ a modified SA structure, as shown in Fig. 5(b). Here, the input pair is connected in parallel with the latch to form a series stack of only three transistors. An obvious advantage of this configuration is that M 4−7 have the maximum available voltage headroom (largely independent of the sizing of M 2,3 ) to achieve fast regeneration. To ensure that only dynamic power is consumed, a clock-gating pMOS device M 8 is inserted so that both outputs Qi and Q̄i are pulled to ground when CLKi is high (during the pre-charge phase). The back-end SR-latch is modified to the NOR-type. To unbalance the latch inverters, signals φ p , φ n modulate the respective impedance seen looking into the drain of input devices M 2 /M 3 . This offsets the latch away from the bi-stable state during the pre-charge phase, unlike in the StrongARM SA, where the unbalancing of the latch only occurs at the start of the regeneration phase. Upon the activation of M 8 at the negative edge of CLKi, M 4−7 immediately initialize and direct the latch regeneration. Evidently, even with the maximum rail-to-rail input offset (ΔV φ = ±200 mV), the StrongARM SA requires an M 2,3 width of 16 μm to match the t latch of ∼6 ns exhibited by the modified SA with a corresponding M 2,3 width in the range of only 1 μm to 4 μm. The StrongARM SA necessitates a large input device sizing: for a ΔV φ of ±10 mV and ±200 mV, the respective t latch improves from 55 ns to 10 ns and from 9 ns to 5 ns as M 2,3 is swept from 1 μm to 16 μm. The speed of the modified SA actually deteriorates when its input pair is inappropriately oversized (e.g., for an M 2,3 width of 16 μm), as the direct coupling of the input branches to Qi and Q̄i adds unnecessary capacitance to C latch . In that case, for a ΔV φ of ±10 mV and ±200 mV, t latch slows down from 11 ns to 17 ns and from 6 ns to 8 ns, respectively, as M 2,3 of the modified SA is swept from 1 μm to 16 μm.

C. Discrete-Time Pre-Amplification
For low-voltage-swing signals, or when a small ΔV φ occurs, t latch is >12 ns, which may not be sufficient to fully regenerate and trigger the SR-latch in time. To further enhance the sampling capabilities of the SAFF, an upstream pre-amplification (pre-amp) stage [see Fig. 7(a)], identical to the modified SA of Fig. 5(b), is inserted to form the high-speed PS circuit shown in Fig. 7(b). The timing coordination between this latch stage and the modified SAFF is demonstrated in Fig. 7(c). When encountering a metastable state at L, L̄ [enclosed within the red circle in Fig. 7(c)] due to the sampling of a small ΔV φ , this on-going regenerating signal is further amplified to obtain a rail-to-rail waveform at Qi, Q̄i (SA output of the SAFF stage, enclosed within the green circle), thus improving the functionality of the SR-latch and the following flip-flop re-sampling stage (respectively the Q0 and Q1 outputs of Fig. 4). In terms of metastability, the effect of the pre-amp sampling latch is shown in Fig. 8, where t latch , defined as the time from the negative edge of CLKi to the fully regenerated Qi, Q̄i within the SAFF stage, now exhibits a much narrower metastability window and appears to remain almost constant across ΔV φ .
While this may give the impression that VCO phase outputs (φ p , φ n ) with minuscule voltage swing levels (ΔV φ,max ≪ V DD,FDC ) can be sampled correctly, the front-end pre-amp's internal mismatches must also be considered. As visualized with the 0-to-1 output transition threshold of the pre-amp sampling latch in Fig. 9 for 1000 Monte-Carlo runs, local mismatches shift the ideal output transition point away from ΔV φ = 0. If ΔV φ,max is too small, the SA output may remain 'stuck' at either 0 or 1 regardless of the VCO phase outputs' sampled differential voltage (|ΔV φ | < ΔV φ,max ). The histogram plot shows an input-referred voltage offset standard deviation σ of 26 mV, predominantly arising from the back-to-back regeneration inverters. This sets a 3σ lower bound of 80 mV on ΔV φ,max , such that ΔV φ,max /V DD,FDC > 0.4.
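The 3σ bound quoted above can be put into numbers with a short calculation. The sketch below assumes the input-referred offset is zero-mean and Gaussian, and treats the sampler as 'stuck' whenever the offset magnitude exceeds the maximum differential swing; both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
import math


def stuck_probability(dv_max, sigma_offset):
    """P(|offset| > dv_max) for a zero-mean Gaussian offset (assumed model)."""
    return math.erfc(dv_max / (sigma_offset * math.sqrt(2.0)))


SIGMA_OFFSET = 26e-3   # V, offset standard deviation quoted in the text
V_DD_FDC = 0.2         # V, nominal supply

three_sigma = 3 * SIGMA_OFFSET            # 78 mV, rounded to 80 mV in the text
ratio = 80e-3 / V_DD_FDC                  # swing-to-supply ratio of 0.4
p_stuck = stuck_probability(80e-3, SIGMA_OFFSET)

print(f"3-sigma offset: {three_sigma * 1e3:.0f} mV")
print(f"swing/supply ratio: {ratio:.1f}")
print(f"P(stuck) at 80 mV swing: {p_stuck:.4f}")
```

Under this model, an 80 mV swing leaves only a small fraction of samplers stuck, while a swing equal to one σ would leave roughly a third of them unusable.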

D. Phase-Sampling FDC Dynamic Performance
Consider the three PS configurations shown in Fig. 10, namely (#1) D-flip-flop (DFF), (#2) SAFF, consisting of the modified SA followed by the SR-latch, shown in Fig. 5(b), and (#3) the pre-amp stage followed by the SAFF, shown in Fig. 7(b). Note that the DFF is sized to have a similar power consumption to the modified SA. To gain a deeper insight into the advantages of these structures, the PS configurations are empirically investigated by computing the VCO-based ADC's signal-to-quantization-noise ratio (SNR Q ) with a full-scale 150 kHz sinusoidal input, covering around 70% of the XOR-based FDC quantization range, where SNR Q is dependent on the VCO tuning range, number of FDC readout phases (N FDC ), f CLK and signal BW [1]. For a BW of 160 kHz and N FDC = 1, the ideal SNR Q is approximately 60 dB with f CLK set to 40 MHz. A yield-based reliability-driven design methodology is imperative for deep-subthreshold operation, with the yield of the FDC embedding the PS disturbed by local mismatch-induced effects, defined as:

$$\text{Yield} = 1 - P(SNR_Q < SNR_{Threshold})$$

where P(SNR Q < SNR Threshold ) is the probability that the FDC digitizes the FM signal with an SNR Q below the defined pass/fail threshold (taking a lower bound of 58 dB here). The yield contour plots of Fig. 11 sweep V DD,FDC and ΔV φ,max /V DD,FDC . For a ΔV φ,max /V DD,FDC ratio of 1, the (#1) DFF PS outperforms both the (#2) SAFF and (#3) Pre-amp+SAFF configurations (at V DD,FDC of 210 mV, its yield is 100% instead of 96% for the SAFF-based configuration). While the equivalent t latch regeneration time for the DFF (its CLK-to-Q delay) is also around 6 ns, the SAFF must additionally incur the propagation delay through its back-end SR-latch, which may result in a setup-time violation of the following DFF in the FDC processing chain.
[Fig. 9: (a) Pre-amp's 0-to-1 output transition threshold and (b) simulated input-referred voltage offset characteristics with local mismatches.]
The advantages of the SA-based PS become evident for lower voltage-swing inputs. Even for the high V DD,FDC of 210 mV, as ΔV φ,max /V DD,FDC goes below 0.8, the yield quickly deteriorates from 95% to below 20% for the DFF PS. In contrast, the (#2) SAFF experiences a yield drop from around 95% to 85%, and the superior (#3) Pre-amp+SAFF configuration sees its yield drop to only 90%. Note also that in these simulations, the common-mode (CM) voltage of φ p , φ n is set to V DD,FDC /2. Due to the single-ended nature of the DFF, any slight shift in this CM voltage would prove further detrimental for small ΔV φ,max conditions.
In the context of the VCO-based ADC integration, these simulations imply that with a heavily modulated VCO output swing [26], or when the VCO core circuitry operates from a lower supply voltage relative to the FDC domain, the introduction of the sense-amplification phase-sampling mechanism helps to relax the requirements on any explicit level-shifters (LS) and/or VCO buffers that may be necessary (facing issues with static power consumption, loading on the VCO outputs, phase mismatches and harmonic distortion induced by group delay dispersion [1]) in order to functionally interface with the FDC domain.
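The yield metric used in this subsection, one minus the probability that SNR Q falls below the pass/fail threshold, can be estimated from a set of Monte-Carlo results as sketched below. The function name and the sample SNR values are made up for illustration; only the 58 dB threshold comes from the text.

```python
def estimate_yield(snr_samples_db, snr_threshold_db=58.0):
    """Fraction of Monte-Carlo runs whose SNR_Q is not below the threshold."""
    if not snr_samples_db:
        raise ValueError("at least one Monte-Carlo sample is required")
    failures = sum(1 for snr in snr_samples_db if snr < snr_threshold_db)
    return 1.0 - failures / len(snr_samples_db)


# Hypothetical SNR_Q results (dB) from five Monte-Carlo runs.
example_runs = [60.1, 59.4, 41.0, 58.0, 59.9]
example_yield = estimate_yield(example_runs)
print(f"estimated yield: {example_yield:.0%}")
```

A run exactly at the threshold counts as a pass here, matching the strict inequality in the definition.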

E. Second-Order Effects and PVT Variations
The following dynamic effects of the SAFF PS, with and without the pre-amp sampling latch, are further presented:
1) VCO output edge transition time: Reducing the VCO edge transition time (tt VCO , measured between 0.1V DD and 0.9V DD ) relative to its period (T VCO ) narrows the SAFF metastability window. Consequently, the yield improves from 50% to above 65% when tt VCO /T VCO reduces from 0.2 to 0.1 [see Fig. 12(a)]. The addition of the pre-amp sampling latch not only improves the equivalent PS yield to above 80%, but also makes it relatively independent of tt VCO .
2) Pre-amp sampling stage delay: The negative edge of CLKi must be asserted in advance of the positive edge of CLKii (by a delay t pre ) to ensure that L, L̄ are sampled before the front-end pre-amp sampling latch resets; otherwise a hold violation can occur. If t pre is below 1 ns, the pre-amp sampling latch actually worsens the yield of the overall PS. The optimal delay is 3 ns [see Fig. 12]. When t pre is longer than 5 ns, the yield reduces by 10% since less time is dedicated to the regeneration phase of the pre-amp sampling stage. A simple inverter element is inserted to realize the delay t pre , obviating the need for complicated non-overlapping clock generation. However, the sensitivity of the cascaded PS structure to t pre means that a coarsely programmable delay generator would be more desirable.
Fig. 13(b) demonstrates the extreme effects of global process corners. An SS process requires a minimum V DD,FDC of 300 mV, equivalent to a 10× reduction in speed. On the other hand, an FDC operating in the FF process corner can tolerate V DD,FDC as low as 150 mV.

A. Standby-Hardware Redundancy
Local variations may cause a fault in the operation of the phase-sampling mechanism, so that a sufficient yield (i.e., a failure rate in the order of parts per million) cannot be guaranteed. One solution could be to up-size every device, as mismatch is inversely proportional to the square root of its gate area, at the cost of large loading capacitance. Alternatively, digital calibration can be employed to tune the switching threshold (i.e., imbalances between pMOS and nMOS devices in both differential paths), but the source of imbalances, random in nature, can arise from anywhere in the SR-latch, the sense-amplifier within the SAFF or the pre-amp stage. The hardware overhead that would then be required to enable the re-configurability of nearly every MOS device in the PS makes such a solution highly impractical. Instead of relying on the absolute robustness of a single PS, consider an array of N red redundant PSs. Now, by selecting one of N red phase samplers that is verified as functional, the joint probability that all of the statistically uncorrelated phase samplers fail is the product of their individual failure rates. The composite yield therefore becomes:

$$\text{Yield}_{composite} = 1 - (1 - \text{Yield})^{N_{red}} \qquad (2)$$

Such "standby hardware redundancy" techniques are hallmarks of fault-tolerant system design, and have been exploited to improve mismatch resilience in the implementation of flash ADCs [42] and SRAM memories [43].
The improvement in PS yield is verified as illustrated in Fig. 14(a). Note that there are actually two benefits of employing the hardware redundancy. At a relatively high supply voltage of 210 mV, the yield of a single phase sampler is 94%, while the composite yield reaches 99.7%, 99.98% and 99.999%, respectively, for N red of 2, 3, and 4. Moreover, the large random variability associated with deep-subthreshold operation means that, in some cases, the phase sampler could in reality be much faster than its mean performance. Exploiting this phenomenon essentially reduces the minimum supply voltage, as the yield curve shifts to the left along the V DD,FDC axis. In this work, we implement the 2-choose-1 (2C1) configuration. This choice is governed by three main factors. First, leakage energy consumption grows linearly with the amount of hardware redundancy (from 4% for 2C1 to 8% for 4C1, see Section IV.B), although this additional leakage would still be rather small, with further circuit-level leakage-minimization optimizations possible. Second, yield maximization in conjunction with V DD minimization is not aggressively pursued. For example, at V DD,FDC of 190 mV, the original yield is improved from 80% to 96% (2C1) and > 99% (4C1). An industry-oriented product would certainly demand significantly tighter margins on yield, hence 4C1 (and beyond) might be preferred. Third, the fault tolerance mechanism of the asynchronous PS interface is not applicable to the downstream flip-flops, XOR gates, decimation and digital recombination blocks (designed instead to meet the timing margins of the synchronous clock domain). Without the protection of hardware redundancy, further reduction in V DD may lead these blocks to fail the timing slack requirements and become the new speed bottleneck. This could negate any yield/speed improvements reaped with a high N red -choose-1 PS configuration.
The implementation of the 2C1 hardware-redundant PS (i.e., N red = 2) is shown in Fig. 14(b). Multiplexers in the clock and output data paths incur minimal digital hardware overhead. For the PS that is not selected (i.e., in standby), its clock signal is replaced with a constant voltage bias of V STBY . With reference to Fig. 5(b), the clock device M 8 morphs into a sleep transistor to minimize the leakage current through the 'standby' PS. Furthermore, AND gates are placed at the output, so that in the very rare scenario where neither redundant phase sampler works, its output is disabled by the static control signal EN so as not to affect the multi-bit digital output of the entire FDC array, composed of several FDC units. Figure 15 shows the power consumption breakdown of the single FDC unit at V DD,FDC of 210 mV, consisting of the clock buffering (CLK), standard XOR-based FDC cell (2 DFFs and 1 XOR) and the phase sampler [standby leakage, pre-amp, sense-amplifier (SA) and SR-latch (SR)]. As shown in Fig. 15(a), the total power consumption is 490 nW when f CLK is 40 MHz and V STBY (the bias voltage applied to the M 8 clock devices of the modified SA circuit when it is in standby) is 210 mV. The phase sampler (158 nW) allocates 58 nW each for the pre-amp and the SAFF's sense-amplifier, along with 42 nW for its SR-latch. The local clock buffering for the FDC unit (partitioned roughly equally between driving the PS circuits and the 2DFF+1XOR FDC block) consumes a further 143 nW, with the downstream standard XOR-based FDC cell (2 DFFs and 1 XOR gate) consuming 150 nW.
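The redundancy argument, namely that all N red statistically uncorrelated phase samplers must fail for the composite to fail, can be checked numerically against the yields quoted above. The sketch below is a direct evaluation of that product-of-failure-rates relation; the function name is hypothetical.

```python
def composite_yield(single_yield, n_red):
    """Yield of an N_red-choose-1 arrangement of independent phase samplers."""
    return 1.0 - (1.0 - single_yield) ** n_red


# Single-PS yields quoted in the text: 94% at 210 mV and 80% at 190 mV.
for v_dd_mv, single in ((210, 0.94), (190, 0.80)):
    for n_red in (1, 2, 3, 4):
        y = composite_yield(single, n_red)
        print(f"V_DD,FDC = {v_dd_mv} mV, N_red = {n_red}: {100 * y:.4f}%")
```

Small differences from the quoted percentages are to be expected if the quoted single-PS yield is itself a rounded value.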

B. FDC Power-Leakage Trade-Off
The standby leakage of the redundant PS within the 2C1 arrangement dissipates 40 nW (10 nW each for the pre-amp and SA, 20 nW for the SR-latch). Despite the M 8 sleep transistor cutting off the power supplied to the standby-redundant PS blocks, the leakage power (10 nW per SA) is still significant. To mitigate the issue of leakage for standby circuits, an on-chip switched-capacitor voltage doubler [13] for high-impedance loads is implemented in our ADC prototype to straightforwardly boost the V STBY bias to 420 mV, with the consequent power breakdown shown in Fig. 15(b). The leakage current of both SAs combined falls to less than 1 nA (i.e., virtually non-existent). Since this technique was not applied to the SR-latches, their leakage power remains at 20 nW.
It is interesting to show the effect of leakage currents at f CLK of 10 MHz with f VCO scaled accordingly [see Fig. 15(c),(d)]. The standby leakage component is static and thus remains unchanged, dissipating a bigger portion of the total power budget. Leakage is also a concern for the actively switching components of the FDC unit. The total power reduces from 490 nW to 275 nW as f CLK is scaled from 40 MHz to 10 MHz. This sub-linear power-frequency scaling indicates that the deep-subthreshold energy efficiency is much degraded for slower speeds of operation. Therefore, it is recommended that ULV standby circuits in a large-scale system employ sleep transistors that are reverse-biased (for a pMOS device) through a shared, simple switched-capacitor voltage doubler, while actively switching circuits must operate at their fastest possible speeds to remain maximally energy-efficient. This insight implies that systems which promote lower parallelism coupled with high-speed processing slices (e.g., the XOR-based FDC) are preferred over low-clock-rate multi-bit processing (typical of counter-based/coarse-fine architectures) for efficient deep-subthreshold operation.
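The sub-linear power-frequency scaling can be expressed as energy per clock cycle, which makes the efficiency penalty at low clock rates explicit. The sketch below uses only the power figures quoted in this section; the energy-per-cycle metric and the variable names are choices made here for illustration.

```python
def energy_per_cycle(power_w, f_clk_hz):
    """Average energy dissipated per clock cycle, in joules."""
    return power_w / f_clk_hz


# Component powers quoted for f_CLK = 40 MHz and V_STBY = 210 mV (in nW).
components_nw = {
    "phase sampler": 158,
    "clock buffering": 143,
    "XOR-based FDC cell": 150,
    "standby leakage": 40,
}
total_nw = sum(components_nw.values())  # close to the quoted 490 nW total

e_40mhz = energy_per_cycle(490e-9, 40e6)
e_10mhz = energy_per_cycle(275e-9, 10e6)

print(f"sum of components: {total_nw} nW")
print(f"energy/cycle at 40 MHz: {e_40mhz * 1e15:.2f} fJ")
print(f"energy/cycle at 10 MHz: {e_10mhz * 1e15:.2f} fJ")
```

With these figures, each cycle at 10 MHz costs more than twice the energy of a cycle at 40 MHz, which is the quantitative basis for running the active circuits as fast as possible.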

C. In-Situ FDC Performance Monitoring
To determine the selection of one functional PS within an array of N red redundant PSs (and moreover evaluate the functionality of the overall FDC structures against severe PVT variations), we must first provide a method to characterize the individual FDC units. Figure 16(a) shows the SNR Q (BW = 160 kHz) of a single FDC unit for a frequency-modulated VCO waveform when a full-scale (FS) voltage sinusoidal input is applied, across 50 Monte-Carlo runs. Of course, observing the achieved SNR Q makes it trivial to determine whether the FDC functions or not. However, this computationally intensive fast Fourier transform (FFT)-based testing method is only feasible in a laboratory environment. It becomes apparent that only very simple methods should be used to facilitate a built-in self-test (BIST) of a microwatt-level ULV design.
Due to the asynchronous nature of the phase-sampling process, it is not strictly necessary to provide a dynamic input stimulus (e.g., a sinusoid). A VCO with constant frequency ( f VCO ), asynchronous to f CLK , generates an intrinsic periodic phase-ramp at the input of the PS, which exercises the full 0-to-2π phase of the clock period. This greatly simplifies the test stimulus, as the ADC input can now be tied to either the supply or ground power rails. Another factor which makes it difficult to diagnose faults in the single XOR-based FDC unit is that it outputs an oversampled bit-stream which toggles between just two levels (the high and low levels being V DD and ground, respectively) with a very high toggle density. In other words, the switching activity induced by high-frequency noise-shaped quantization components may make the FDC appear functional when, in fact, its faulty operation has corrupted the low-frequency signal of interest. To alleviate this issue, we pass the output of the individual FDC unit through a 2nd-order, decimate-by-4 digital filter to obtain a clearer time-domain view of the FDC output. The implementation of the high-speed decimation filtering stages is discussed in Section V.
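A 2nd-order, decimate-by-4 filter can be sketched in software as follows. The text does not specify the filter type at this point, so the sinc² (cascaded moving-average) response used here is an assumption, as are the function name and the unnormalized integer output; it is not presented as the article's implementation.

```python
def sinc2_decimate_by_4(bits):
    """Assumed 2nd-order sinc filter, (1 + z^-1 + z^-2 + z^-3)^2, then keep 1 in 4.

    The output is left unnormalized, so a constant input of 1 maps to 16.
    """
    taps = [1, 2, 3, 4, 3, 2, 1]  # convolution of two length-4 boxcars
    out = []
    # Start once the filter memory is full, then advance four inputs per output.
    for n in range(len(taps) - 1, len(bits), 4):
        out.append(sum(t * bits[n - i] for i, t in enumerate(taps)))
    return out


# A densely toggling bit-stream collapses to a steady mid-scale code.
toggling = [0, 1] * 10
decimated = sinc2_decimate_by_4(toggling)
print(decimated)
```

This illustrates the point made above: the high toggle density of the raw bit-stream is averaged out, so any remaining jumps in the decimated codes reflect the low-frequency content rather than quantization activity.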
From the transient waveforms of the decimated digital output (D out,d ) for both the sinusoid and DC input cases at Run 1 and Run 30 [see Fig. 16(c) and Fig. 16(d), respectively], we can now observe that the FDC is most prone to faulty frequency digitization when f VCO is close to its full-scale value of f CLK /2, where metastability events cause catastrophic glitches in D out,d . This glitching is quantified with ΔD out,d , the jump between consecutive digital output codes, otherwise known as the derivative of D out,d . For Run 1, the PS/FDC slice consistently fails to correctly sample the VCO phase information along its processing chain, leading to a largely unusable and corrupt digital output. For Run 30, the glitches, less frequent in time, occur when the PS/FDC samples metastable states, which result in a bitstream string of three or more consecutive 1's or 0's and lead to the incorrect overflow of the XOR-based FDC structure.
Consequently, a functioning PS causes the decimated digital output [see Fig. 16(e)] to toggle by 1 only [i.e., exhibiting a maximum derivative max( ΔD out,d ) of only 1]. Note that this assumes the VCO (i.e., FDC input) is noiseless, but in the realistic case of the VCO exhibiting phase noise (PN), being equivalent to a noisy DC input source, max( ΔD out,d ) is shown to be less than 4 for our design. This sets the threshold by which the FDC under test is designated as functional (≤Threshold=4) as opposed to faulty (>Threshold=4). Figure 16(b) demonstrates the direct one-to-one correspondence between max( ΔD out,d ) quantified for a DC input, which places f VCO near f CLK /2, and the SNR Q measurement for a full-scale input sinewave visualized in Fig. 16(a). The proposed FDC testing protocol is visualized with the flowchart in Fig. 17. For each available FDC unit, its internal hardware-redundant PSs ('A' and 'B' in this case) are individually tested with max( ΔD out,d ) computed. Only the first instance of a functional PS needs to be detected to determine the PS selection bits (SEL A or SEL B ). The calibration procedure ends after iterating through all FDC units under test.
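The pass/fail decision and PS selection described above can be sketched as follows. This is only an illustration of the protocol as stated in the text, not the on-chip BIST logic: the function names, the dictionary-based interface and the example code sequences are hypothetical, while the threshold of 4 and the "first functional PS wins" rule follow the description.

```python
def max_code_jump(codes):
    """Largest absolute jump between consecutive decimated output codes."""
    return max((abs(b - a) for a, b in zip(codes, codes[1:])), default=0)


def select_phase_sampler(codes_by_ps, threshold=4):
    """Return the label of the first PS whose maximum jump is within threshold.

    codes_by_ps maps a PS label (e.g., 'A', 'B') to the decimated codes
    captured with that PS selected while a DC input is applied. Returns
    None if every redundant PS of the FDC unit fails the test.
    """
    for label, codes in codes_by_ps.items():
        if max_code_jump(codes) <= threshold:
            return label
    return None


def calibrate(units):
    """Run the selection once per FDC unit; units maps unit name -> codes_by_ps."""
    return {name: select_phase_sampler(c) for name, c in units.items()}


if __name__ == "__main__":
    # Made-up captures: PS 'A' glitches, PS 'B' toggles by one code only.
    captured = {"A": [10, 11, 3, 15, 10, 11], "B": [10, 11, 10, 10, 11, 11]}
    print(select_phase_sampler(captured))
```

Only a subtraction, a comparison and a running maximum are needed per sample, which is the kind of low-complexity check the text argues for in place of an FFT-based measurement.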
V. DIGITAL DECIMATION AND FILTERING
The decimation filter, by virtue of being in the digital domain, is often neglected in the literature and in most prototypes not considered for on-chip implementation. Here, the need for a large OSR dictates the first stages of the digital filter to operate as fast as the speed-maximized phase sampling interface. Moreover, utilizing the multi-phase outputs of the ring-VCO results in a significant portion of the ADC's hardware dedicated to the output summation logic to perform digital recombination of the parallel FDC streams [24]. Aside from the necessary additional power consumption, the combination of high-speed operation and large mismatch-induced delay variability may compromise the functionality of these seemingly trivial digital blocks in deep-subthreshold.

A. High-Speed Design Challenges
We first describe the timing characterizations for D-flip-flop (DFF) and full-adder (FA) circuits (implemented with extremely-low V t "elvt" devices in 28 nm LP CMOS), being the backbone of all digital processing hardware, to demonstrate the high-speed design difficulties. Pertaining to the clocked storage elements, the non-ratioed master-slave DFF topology has been proven to provide relatively robust functionality in subthreshold [4]. Suppose at 0.2 V, we may operate the FDCs at a realistic clock rate of 40 MS/s (i.e., a t CLK of 25 ns). We estimate the setup time to be around 4 ns for a CLK-to-Q delay t CQ of 5 ns. In other words, the 'delay' of the flip-flop within a pipeline stage accounts for nearly 40% of the timing budget. With local mismatch-induced delay variations, a setup time of 4 ns translates statistically into t CQ below 5 ns for only 20% of DFFs. To guarantee the reliability of every flip-flop in the design, 13 ns (more than 50% of the cycle timing budget) is sacrificed just to allow for pipelining, since this precious time and associated energy costs are not expended for useful computations.
Consequently, the remaining allocation of 12 ns is permitted for the inter-stage combinational propagation delay (t comb ) in the best-case scenario. Our evaluation of an FA cell sees a simulated compute time of around 11 ns (from input transition to carry output generation) and a standard deviation (σ comb ) of 1 ns for a worst-case delay approaching 15 ns. Therefore, a single pipeline stage can accommodate at most 1 or 2 FA stages. The registers required for storing the intermediate values along the pipeline stages end up consuming significant power and, furthermore, take up half of the timing budget. Latch-based super-pipelines [45], [46] mitigate this to an extent through time borrowing, but are affected by hold violations, necessitating strict coordination between all clock and data paths of the digital signal processing chain.
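The full-rate timing budget quoted above can be tallied with a few lines of arithmetic. The sketch below uses only the figures given in the text (25 ns clock period, 4 ns setup plus 5 ns CLK-to-Q, 13 ns guaranteed flip-flop overhead, 11 ns nominal and roughly 15 ns worst-case FA delay); the function and constant names are hypothetical, and counting whole FA cells by integer division is a simplifying assumption.

```python
def fa_cells_per_stage(t_clk_ns, dff_overhead_ns, fa_delay_ns):
    """Whole FA cells that fit in the time left after the flip-flop overhead."""
    t_comb_ns = t_clk_ns - dff_overhead_ns
    return max(0, t_comb_ns // fa_delay_ns)


T_CLK_NS = 25           # 40 MS/s clock period
DFF_NOMINAL_NS = 4 + 5  # setup time + CLK-to-Q delay
DFF_GUARANTEED_NS = 13  # overhead including mismatch margin
FA_NOMINAL_NS = 11
FA_WORST_NS = 15        # "approaching 15 ns" in the text

if __name__ == "__main__":
    print("DFF share, nominal:   ", DFF_NOMINAL_NS / T_CLK_NS)
    print("DFF share, guaranteed:", DFF_GUARANTEED_NS / T_CLK_NS)
    print("FA cells, nominal FA: ",
          fa_cells_per_stage(T_CLK_NS, DFF_GUARANTEED_NS, FA_NOMINAL_NS))
    print("FA cells, worst FA:   ",
          fa_cells_per_stage(T_CLK_NS, DFF_GUARANTEED_NS, FA_WORST_NS))
```

With the guaranteed flip-flop overhead, a single nominal FA fits in the remaining 12 ns, whereas a worst-case FA does not, which illustrates why multi-bit adders at the full-rate clock are difficult to close timing on.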
A simple circuit optimization technique may be to aggressively up-size the width of every transistor and use alternative flip-flop variants [44]. This improves the timing situation, but the performance gains remain rather incremental as the decreased gate delays cannot sufficiently cover slower PVT conditions. The bulky transistors are hampered by their increased capacitive loading and leakage currents, leading to a diminishing throughput return, all the while incurring a decrease in both static and dynamic energy efficiency.

B. Bit-Stream Processing
To minimize the dependence on optimization strategies such as circuit sizing as a means to overcome the encountered timing bottleneck, we must first revisit, at the architecture level, how the digital summation and decimation processes are carried out. In the conventional processing chain of Fig. 18, the parallel outputs of the FDCs are digitally recombined at the full-rate clock (CLK) with the output summation logic (e.g., implemented as a Wallace adder tree). The multi-bit output is then processed by the classical cascaded-integrator-comb (CIC) Hogenauer filter [48] to provide 2 nd -order low-pass filtering (necessary to filter out the 1 st -order noise-shaped FDC output) before the alias-free downsampling by a factor of M = 4. This CIC topology moves the combing differentiators to the quarter-rate clock (CLK/4), but high-speed integrators, built from multi-bit digital adders, are still required. The computations this digital back-end entails make such digital processing impractical under the aforementioned timing constraints.
Interestingly, it is not necessary to perform either the digital summation or the filtering operations at the full-rate clock. We suggest two modifications to the traditional back-end implementation in Fig. 19 in order to restructure the digital signal processing. Using the Noble identities [49], the order of downsampling and anti-aliasing filtering can be commuted, so as to move the filter coefficient multiplication and summation operations to the low-rate clock domain. Furthermore, applying linear superposition, it is possible to perform digital recombination after the downsampling, rather than immediately following the FDC digitization.
The polyphase composition, through the expansion of the CIC's filter response, reveals in its non-recursive FIR form a triangular-window sinc² filter impulse response. The implementation of the polyphase 2 nd -order, decimate-by-4 filter in Fig. 19 thus allows us to process the individual bitstreams, such that all combinational adders are conveniently relocated to the timing-relaxed quarter-rate clock domain. It is important to note that the modular nature of the proposed parallel decimation filter paths naturally allows the built-in testing protocol and characterization of the individual FDC units discussed earlier in Section IV. This limits the digital output glitching behavior, quantified by ΔD out,d , to be caused by the single PS unit under test, and not the combined faults of the entire FDC array within the VCO-based ADC.
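The two restructuring steps can be checked numerically with a short sketch. It expands the 2 nd -order, decimate-by-4 CIC response into its FIR taps, compares a conventional filter-then-downsample reference against a polyphase evaluation in which every addition is indexed at the quarter rate, and verifies that recombination may follow decimation. All function names are hypothetical, samples before the start of the record are assumed to be zero, and the code models only the arithmetic, not the hardware of Fig. 19.

```python
M = 4  # decimation factor

# (1 + z^-1 + z^-2 + z^-3)^2 expanded: a triangular window, [1, 2, 3, 4, 3, 2, 1].
TAPS = [sum(1 for i in range(M) for j in range(M) if i + j == k)
        for k in range(2 * M - 1)]


def filter_then_downsample(x):
    """Reference: full-rate FIR, then keep every M-th output sample."""
    out = []
    for n in range(0, len(x), M):
        out.append(sum(h * x[n - k] for k, h in enumerate(TAPS) if n - k >= 0))
    return out


def polyphase_decimate(x):
    """Polyphase form: sub-filter p holds taps h[p], h[p+M], ... and sees
    only the input samples x[M*m - p], so all sums run once per output."""
    sub = [TAPS[p::M] for p in range(M)]
    out = []
    for m in range((len(x) + M - 1) // M):
        acc = 0
        for p in range(M):
            for j, h in enumerate(sub[p]):
                idx = M * (m - j) - p
                if idx >= 0:
                    acc += h * x[idx]
        out.append(acc)
    return out


if __name__ == "__main__":
    a = [(3 * n + n // 5) % 2 for n in range(64)]  # arbitrary test bitstreams
    b = [(n * n + n // 3) % 2 for n in range(64)]
    summed = [u + v for u, v in zip(a, b)]
    print(TAPS)
    print(polyphase_decimate(a) == filter_then_downsample(a))
    print([u + v for u, v in zip(polyphase_decimate(a), polyphase_decimate(b))]
          == filter_then_downsample(summed))
```

Because the taps sum to 16, an all-ones bitstream settles at a code of 16, and a bitstream of density d maps to a mean code of about 16d.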

C. Back-End Implementation
The circuit implementation of the decimation filtering back-end for our pseudo-differential VCO-based ADC prototype is shown in Fig. 20(a). Four FDC streams (N FDC = 4) in both positive and negative complementary halves (a total of 8 FDC units outputting D outA± to D outD± ) are individually fed to their own polyphase decimation filters to produce the respective 4-bit outputs D out,dA± to D out,dD± , organized as in Fig. 20(b). Only the shift registers are clocked at CLK1 (40 MHz). The flip-flops operating at CLK4 (10 MHz) perform the downsample-by-4 operation. As the input values are bit-streams, the filter coefficient multiplication simply uses logical bit shift operations (thus, costless and delay-free). Three pipeline stages, for a total processing time of approximately 300 ns, are used. The first pipeline stage adds the filter taps with 2-bit to 4-bit carry-ripple adders [see Fig. 20(c)], with an arithmetic depth of 9 FA cells. The subsequent pipeline stages perform the digital recombination of the decimated FDC outputs within the single-ended ADC halves (producing D tot,d± ) and the 2's complement subtraction between the pseudo-differential ADC halves (an arithmetic depth of 9 and 8 FA cells, respectively) to obtain the final ADC output (D tot,d ). This partitioning of digital processing at the CLK4 rate makes the critical-path delays of the pipeline stages fairly similar. The respective circuit implementations of the constituent DFF and FA blocks are shown in Fig. 20(d) and (e). The 4× longer clock period (100 ns) allows for smaller transistor sizes. The large arithmetic depth per pipeline stage averages out the accumulating propagation delay mismatches of a long cascade of FAs [47], resulting in a more robust operation. Pipelining is now straightforward, as it contributes to only around 10% of the available timing cycle budget.
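The averaging effect of a long FA cascade can be illustrated under a simple statistical assumption that is not stated in the text: if the cell delays are independent and identically distributed, the mean of the cascade delay grows with the number of cells n while its standard deviation grows only with the square root of n. The function name and the normalized delay values below are hypothetical.

```python
import math


def cascade_delay_stats(n_cells, mu, sigma):
    """Mean, standard deviation and relative variability of n cascaded cells,
    assuming independent, identically distributed cell delays."""
    mean = n_cells * mu
    std = math.sqrt(n_cells) * sigma
    return mean, std, std / mean


if __name__ == "__main__":
    # Normalized example: unit cell delay with 10% random variability.
    for n in (1, 2, 9):
        mean, std, rel = cascade_delay_stats(n, mu=1.0, sigma=0.1)
        print(n, mean, round(std, 3), round(rel, 4))
```

Under this assumption, the relative variability of a 9-cell path is one third of that of a single cell, so less margin per unit of useful computation is needed than in a shallow full-rate pipeline stage.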
It is important to note that the data movement interface between the multi-rate clock domains (i.e., the re-sampling of data clocked at CLK1 by CLK4 within the polyphase decimation filter) represents a potential point-of-failure. It is imperative that CLK4 leads CLK1 to prevent any hold violations.
Such fully parallel signal composition in both space and time (for recombination and decimation, respectively) allows for the greatest achievable throughput as the operational speed is limited by only the delay of a single DFF. Of course, the complete removal of any computational speed bottleneck does come at a cost to power and area consumption, which must be carefully balanced with increased parallelism beyond N FDC = 4 implemented in this work. The cost of decimation increases linearly with N FDC . Moreover, as decimation increases the word length (from 1-bit to 4-bit), the benefit of the quarter-rate processing may be outweighed by more multi-bit computations required in general for the digital adders. Another consideration is that while this decimation filtering hardware demonstrates a down-sampling factor of 4, further decimation stages down to the Nyquist data-rate are still required since the OSR of our VCO-based ADC design is >100. These decimation stages face much more benign timing considerations (e.g., an additional 4× decimator stage would operate with a clock rate of f CLK /16), and are thus omitted from the prototype implementation.
Figure 21 shows the chip micrograph and core structure of FDC and DEC digital back-end blocks integrated within the 0.2-V VCO-based ADC prototype, whose front-end was described in [1]. The IC is fabricated in TSMC 28-nm LP CMOS and the ADC occupies an active area of 0.12 mm 2 , of which the FDC and DEC array take up 0.03 mm 2 . The serial-to-parallel interface (SPI) and design-for-test (DFT) digital read-out enable the outputs of FDC units and DEC processing slices to be multiplexed out for individual characterization.

VI. EXPERIMENTAL RESULTS
The spectrum at the output of the individual FDC+DEC processing slice (e.g., D out,dA+ ), sampled at 45 MS/s (digital data stream at the decimated 11.25 MS/s rate) with the VCO core modulated by a 20 kHz, 0.2 V pp single-ended input sinewave, is shown in Fig. 22(a). For this intermediate single FDC unit, the SNRs for the 'A' (blue) and 'B' (red) PS configurations are 40.7 dB and 52.9 dB, respectively. Clearly, PS 'A' malfunctions, destroying the noise-shaping property of phase quantization errors, thus leading to an elevated noise floor. PS 'B' operates as intended, so the dominant in-band noise contribution originates from the VCO phase noise. This measured 52.9 dB value is therefore less than the ideal SNR Q > 60 dB with only the quantization noise of the FDC unit taken into account. The distinction in functionality between both phase samplers can be easily demonstrated with the digitization of a DC input as shown in Fig. 22(b). The decimated digital output D out,d toggling between codes 10 and 11 places f VCO from 14 MHz to 15.5 MHz. However, the FDC output with the PS 'A' selected experiences large glitches in the output code. For the digitization of a DC input, this may only occur if the FDC's operation is faulty. Figure 23 extends the SNR characterization to all available FDC units (8 in total), labeled from FDC 1 to FDC 8 , across FDC supply voltage V DD,FDC at sampling rates of 30 MHz, 40 MHz and 50 MHz, summarizing the results of a total of 192 unique data measurement spectra. The SNR of the individual FDC units is computed by multiplexing the internal 4-bit signals D out,dA± to D out,dD± off-chip one-by-one through the DFT circuitry. In order for the ADC to function correctly, the entire FDC array (4 FDC units per positive and negative complementary halves) must not fail; in this case, the complete SNR measured at node D tot,d reaches 62 dB [1]. By introducing the 2C1 hardware-redundant PS, only one of two PSs within an FDC unit needs to work.
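The benefit of requiring only one working PS per FDC unit can be illustrated with a simple yield estimate. The sketch assumes, purely for illustration, that every PS passes independently with the same probability; neither this independence nor the example probability comes from the measurements, and the function name is hypothetical.

```python
def array_yield(p_ps, n_units=8, n_redundant=2):
    """Probability that all FDC units have at least one functional PS,
    assuming each PS passes independently with probability p_ps."""
    unit_yield = 1.0 - (1.0 - p_ps) ** n_redundant
    return unit_yield ** n_units


if __name__ == "__main__":
    p = 0.9  # hypothetical per-PS pass probability
    print("no redundancy: ", array_yield(p, n_redundant=1))
    print("2C1 redundancy:", array_yield(p, n_redundant=2))
```

Since all 8 FDC units must work for the ADC to function, the per-unit failure probability is raised to a high power in the array yield, which is why reducing it through redundancy has a pronounced effect in this model.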
Take for example Fig. 23  is visualized with the contour plot of Fig. 24(b). At V DD,FDC and f CLK of 215 mV and 40 MHz, the FDCs consume 4.4 μW (i.e., 550 nW per FDC unit).
The functionality and power consumption of the decimation-filtering digital back-end are characterized respectively in Fig. 24(c) and (d). This digital back-end consists of 8 polyphase decimator slices, the quarter-rate digital recombination logic, an 8-bit subtractor and local clock drivers. At 30, 40 and 50 MHz, V DD,DEC , the supply voltage of this power domain, must be respectively above 205, 225 and 235 mV to meet the necessary timing requirements. At V DD,DEC and f CLK of 225 mV and 40 MHz, the total power consumed by the digital back-end is 8.6 μW.
To the best of our knowledge, this work represents the first full demonstration of a deep-subthreshold 0.2 V multi-phase FDC architecture running at ∼ 40 MHz in advanced CMOS, such as a 28-nm node. Additionally, it is worth mentioning that further experimental verification of the entire 0.2 V open-loop VCO-based ADC, such as SNR performance across several ICs, dynamic measurements versus input amplitude and input frequency (i.e., including the VCO analog front-end core) and its comparison to state-of-the-art VCO-based ADC implementations are available in [1].

VII. CONCLUSION
Energy harvesting IoT network solutions call for the development of ultra-low-voltage (ULV) circuits operating deep into the subthreshold regime of the CMOS devices. The expositions in this article explore the speed maximization of the multi-phase frequency-to-digital converter (FDC) architecture integrated within an open-loop VCO-based ADC prototype, operating with the supply voltage approaching 0.2 V at clock frequencies between 30 MHz and 50 MHz. Such a feat is made possible by the high-speed sense-amplify asynchronous phase-sampling interface, capable of sampling small-voltage-swing signals. Hardware redundancy is incorporated to mitigate large circuit variability associated with deep-subthreshold operation for significantly improved fault tolerance and yield. The outputs of the FDC array, consisting of several parallel XOR-based FDC units, are decimated and combined with polyphase decimation filters and quarter-rate digital logic. Consequent speed bottlenecks due to the operation with timing margins of merely 10's of nanoseconds imposed at the clock-sampling limit are overcome. Experimental characterizations verify that at 40 MS/s, the FDCs and decimation-filtering back-end are functional at the minimum supply voltage of 215 mV and 225 mV, consuming 4.4 μW and 8.6 μW respectively.