A Fast-Lock All-Digital Clock Generator for Energy Efficient Chiplet-Based Systems

An all-digital clock frequency multiplier that achieves excellent locking time for an energy-efficient chiplet-based system-on-chip (SoC) design is presented. The proposed architecture is based on an all-digital multiplying delay-locked loop (MDLL) to provide fast locking time and multiplied output clock frequency. The proposed MDLL has two operation modes: TDC tracking and sequential tracking. At the beginning of the operation, the MDLL utilizes a cyclic Vernier time-to-digital converter (TDC) to detect the initial phase error between the reference clock and the output clock. Then the TDC generates a digital code word (DCW) for controlling the digitally controlled oscillator (DCO) to achieve a fast lock time. The gains of the TDC and DCO are designed to match each other, enabling phase and frequency locking in only two searches in the TDC tracking mode. After locking, the TDC is turned off, and the MDLL enters the sequential tracking mode and minimizes jitter by using a delta-sigma modulator (DSM)-based dithering jitter reduction scheme. The prototype all-digital MDLL is fabricated in a 40-nm CMOS process and achieves a fast lock time of less than six reference clock cycles at 1.6 GHz from a 100 MHz reference clock. Even when the 100 MHz reference clock has a relatively high RMS jitter of 2.19 ps (peak-to-peak jitter = 15.74 ps), the measured RMS and peak-to-peak jitter values of the 1.6 GHz MDLL output clock are only 2.75 ps and 23.01 ps, respectively. The proposed all-digital MDLL occupies an active area of only 0.024 mm² and dissipates 3.56 mW at 1.6 GHz.

I/O interface minimizes the number of required lanes by using simple extra-short reach/ultra-short reach (XSR/USR) SerDes physical layer (PHY) devices with a high data rate per lane of up to 112 Gbps [19], [20], [21]. On the other hand, the parallel bus-based I/O interface provides the required bandwidth using a huge number of ultra-fine-pitch, low-speed (up to 16 Gbps/line) single-ended lines [22], [23], [24], [25], [26], [27].
Furthermore, die-to-die I/O links generally have a low-loss channel with a shorter distance and lower latency compared to off-chip or off-package I/O links, which can further reduce the complexity and power consumption of the die-to-die I/O PHY. Fig. 1 shows two typical die-to-die I/O interface architectures and their clocking schemes used for inter-chiplet communication. Die-to-die I/O interfaces are similar to general chip-to-chip I/O structures [28] on conventional multi-chip module (MCM) substrates or printed circuit boards (PCBs). However, in recent advanced 2.5D or 3D integration, the flip-chip bare dies are unpackaged and directly bonded on the die-to-die interconnect substrate. Therefore, the inter-die wires are much shorter (usually below 10 mm), and the wire inductance is negligible [29]. Also, the required electrostatic discharge (ESD) protection circuitry overhead is much smaller than that of off-package I/Os.
The first, shown in Fig. 1(a), is a parallel bus die-to-die I/O interface that uses a forwarded clocking (or source synchronous clocking) scheme [22], [23], [26], [27]. Usually, one clock lane and multiple data lanes are placed in parallel.
Chiplet A's PLL receives the low-frequency reference clock from an external clock source (e.g., a crystal oscillator), multiplies it to a high frequency, and transmits it to Chiplet B through the clock lane at full or reduced rate. Data lanes are usually single-ended for area efficiency, and the clock lane uses a differential line for signal integrity. In Chiplet B, a delay-locked loop (DLL) can be used to de-skew or maintain proper phase alignment between data and clock, regardless of process, voltage, and temperature (PVT) variation [14]. In some cases, quadrature clock-to-data timing is implemented between the forwarded clock and data lanes using a matched-delay clock forwarding scheme without using a DLL [26].
The second, shown in Fig. 1(b), is a serial link die-to-die I/O interface that uses an embedded clocking scheme [19], [20], [21]. These structures are particularly effective when ultra-high shoreline (or die edge) bandwidth density (in Gb/s/mm) is required. Chiplet A's PLL generates a high-frequency clock for driving the serializer and Tx driver. This structure does not use a separate clock lane for data transmission. In Chiplet B, a clock and data recovery (CDR) circuit provides the clock that drives the Rx. That is, the clock information is embedded in the transmitted data, and the Rx clock is recovered directly from the incoming data. This CDR-based serial link structure is often simply called SerDes because it contains a serializer and a deserializer. The serial link can realize a very high data rate with a small number of data lanes and pins. Still, its power consumption is high due to power-hungry high-speed building blocks such as equalizers (EQs), PLLs, and CDRs.
As the performance and power consumption of chiplet-based SoCs increase, the energy efficiency of the I/O interface and clocking becomes crucial. Dynamic voltage and frequency scaling (DVFS) and burst-mode communication can be considered as aggressive power management methods for reducing the energy consumption of the chiplet I/O interface and clocking. However, these methods inevitably require a clock generator with a fast lock time or rapid power-on capability. It was shown that if a fast-lock DLL is used to eliminate idle-mode power in a mobile memory interface, energy efficiency can be improved by about 30% (in the case of 40% CPU utilization) [30]. Furthermore, another mobile memory interface architecture utilizing a global synchronous PLL clock pause technique was introduced to enable rapid idle-to-active power state transitions and achieve power-efficient bandwidth scaling [31].
On-chip clock generators typically use phase-locked loops (PLLs) to provide the necessary frequency multiplication and phase alignment functions. The output clocks of the PLL are transmitted to computation processor cores and communication I/O blocks through clock distribution networks. Reducing the lock time or power-on time of PLLs enables aggressive power management using DVFS, thereby reducing system power consumption [32], and even makes per-core power management of multiprocessor SoCs possible [33]. A fast-lock PLL also plays an essential role in reducing I/O power in wireline applications composed of high-speed serial links [34].
Although many all-digital ring-oscillator-based PLLs targeting fast lock have been reported, most have a long lock time of several tens to hundreds of reference clock cycles [32], [35], [36]. Fast-lock MDLLs have also been proposed to further improve the locking time of clock generators [37], [38], [39], [40]. For example, [38] claims a fast lock time of two reference clock cycles by using an analog voltage-controlled oscillator (VCO) and a bang-bang phase detector (BBPD). However, this VCO-based MDLL attempts to lock by reusing the frequency code stored during the previous turn-off period, so a fast lock cannot be guaranteed when the PVT condition changes. [39] has a lock time of 16 cycles using a modified SAR-based binary search, but it is not easy to reduce the lock time further. Similarly, [40] achieved a lock time of 40 reference cycles using a SAR-based binary search. As described above, conventional fast-lock PLLs and MDLLs generally have a lock time of at least several tens of reference clock cycles. Clock generators claiming a lock time of less than ten reference clock cycles usually reuse pre-recorded frequency codes [38], which is not a suitable solution under PVT variations.
This paper presents a new all-digital MDLL-based fast-lock clock generator for low-power chiplet-based SoC design. The proposed all-digital MDLL measures the initial phase error using a wide-range, fine-resolution time-to-digital converter (TDC) and applies this information to the control of the digitally controlled oscillator (DCO) to achieve a fast lock time of less than six reference clock cycles. This is the first all-digital MDLL-based clock frequency multiplier with measurement results showing how a TDC can be applied to a digital MDLL to achieve fast lock capability. The proposed chip was implemented with a conventional analog design flow.
The rest of the paper is organized as follows: Section II introduces the proposed all-digital MDLL architecture and circuit design. Section III presents the measured results. Finally, Section IV concludes the paper.
binary-to-thermometer decoders (5-to-31, 4-to-15, and 2-to-3), a second-order delta-sigma modulator (DSM), a lock detector, a DSM controller, and a start controller. Fig. 3(a) is a flowchart illustrating the operation modes of the proposed all-digital MDLL. The proposed MDLL has two operation modes: TDC tracking and sequential tracking [41]. The locking process is described in more detail in the time domain in Fig. 3(b). When the MDLL is turned on, it starts at the maximum operating frequency, and the TDC tracking mode is performed first. The TDC measures the initial phase error (Δt) between the MDLL output clock (CLK_OUT) and the reference input clock (CLK_REF) through a coarse/fine TDC search. The generated TDC output code, TDC[8:0], is then applied to the DCO control through DLF#1. The 9-bit DLF#1 filters the 9-bit TDC code and generates DCW[8:0], the input code of the 5-to-31 and 4-to-15 binary-to-thermometer decoders. In the TDC tracking mode, the gain of DLF#1 is 1. One coarse/fine TDC search and its application to the DCO take three reference clock cycles. In this design, the TDC search is performed at most twice, corresponding to six reference clock cycles; for simplicity, Fig. 3(b) illustrates the case in which it is performed only once.
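The mapping from DCW[8:0] to the thermometer-coded DCO controls can be illustrated with a minimal Python sketch. It is only a behavioral illustration: the split of DCW[8:0] into an upper 5-bit field for the 5-to-31 decoder and a lower 4-bit field for the 4-to-15 decoder is an assumption made here, and the function names are hypothetical.

```python
from __future__ import annotations


def binary_to_thermometer(value: int, n_bits: int) -> list[int]:
    """Return a thermometer code of length 2**n_bits - 1 (index 0 first).

    Exactly `value` elements are 1, starting from index 0.
    """
    if not 0 <= value < (1 << n_bits):
        raise ValueError("value does not fit in n_bits")
    width = (1 << n_bits) - 1
    return [1 if i < value else 0 for i in range(width)]


def decode_dcw(dcw: int) -> tuple[list[int], list[int]]:
    """Split a 9-bit DCW into coarse and fine thermometer codes.

    Assumption (not stated in the text): DCW[8:4] feeds the 5-to-31
    decoder and DCW[3:0] feeds the 4-to-15 decoder.
    """
    if not 0 <= dcw < 512:
        raise ValueError("DCW must be a 9-bit value")
    coarse = binary_to_thermometer(dcw >> 4, 5)   # 31 coarse control lines
    fine = binary_to_thermometer(dcw & 0xF, 4)    # 15 fine control lines
    return coarse, fine


if __name__ == "__main__":
    coarse, fine = decode_dcw(0b101010011)
    print(sum(coarse), "of", len(coarse), "coarse lines set")
    print(sum(fine), "of", len(fine), "fine lines set")
```

Thermometer coding changes only one control line per code step, which is why it is a common choice for driving delay cells monotonically.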
In Fig. 3, the phase lock condition is met when the residual phase error Δt falls below 10 ps in this design. If the phase lock condition is met after the TDC search, the MDLL turns off the TDC and starts sequential tracking using the BBPD as follows. First, with the DSM turned off, the MDLL continues sequential tracking to further reduce the residual phase error. Second, when the COMP signal (the output of the BBPD) changes, the MDLL turns on the DSM and continues sequential tracking to maintain a closed loop and achieve dithering jitter reduction. The 4-bit DLF#2 shown in Fig. 2, acting as an accumulator of the COMP signal, then generates the DLF[3:0] signal. The second-order DSM provides the high-frequency DSM[1:0] signal, operating at a frequency 16 times higher than the reference clock, and the 2-to-3 decoder provides the D[2:0] signal that controls the dithering cell of the DCO. Fig. 4 shows simulation results in which this DSM-based dithering jitter reduction scheme effectively improves the jitter performance. Without the DSM, F[14:0], which controls the fine cells of the DCO, toggles by at least one bit at every reference clock injection, which causes a large reference spur and increases the deterministic jitter [43].
Fig. 6(a) shows the structure of the select logic (SEL), which is similar to that used in [42]. Fig. 6(b) shows the initial operation timing diagram of the select logic before phase locking. When the output SEL signal is low, the DCO forms a closed loop and performs ring oscillation. When the SEL signal becomes high, the reference clock is injected, and the accumulated jitter is removed.
Fig. 7(a) shows the simplified architecture of the proposed 9-bit cyclic Vernier TDC. It comprises two oscillators (slow and fast), an edge lock detector, a TDC_OFF detector, a slow enable block, a fast enable block, a reset generator, two 5-bit counters (coarse and fine), and a divide-by-2 divider.
To minimize the mismatch problem and achieve enhanced gain matching between the TDC and DCO, the structure of the delay elements inside the two fast/slow oscillators is identical to that used in the DCO.
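The DSM-based dithering used in the sequential tracking mode can be sketched behaviorally as follows. The text specifies only a second-order DSM with a 4-bit input (DLF[3:0]), a 2-bit output (DSM[1:0]), and a 2-to-3 decoder; the MASH 1-1 topology, the output offset, and all function names below are assumptions chosen for illustration, not the circuit used on the chip.

```python
from __future__ import annotations


def mash11(x: int, n_cycles: int, bits: int = 4) -> list[int]:
    """Second-order MASH 1-1 modulator with a constant input x.

    Each output sample lies in {-1, 0, 1, 2}; the long-run average of
    the output approximates x / 2**bits.
    """
    mod = 1 << bits
    if not 0 <= x < mod:
        raise ValueError("input out of range")
    acc1 = acc2 = 0
    c2_prev = 0
    out = []
    for _ in range(n_cycles):
        s1 = acc1 + x
        c1, acc1 = divmod(s1, mod)       # first-stage carry and residue
        s2 = acc2 + acc1
        c2, acc2 = divmod(s2, mod)       # second stage integrates the residue
        out.append(c1 + c2 - c2_prev)    # noise-shaped combination
        c2_prev = c2
    return out


def dither_to_thermometer(y: int) -> list[int]:
    """Map a modulator sample to a 3-bit thermometer code (2-to-3 decoding).

    Assumption: the sample is offset by +1 so that {-1..2} maps to {0..3}.
    """
    level = y + 1
    if not 0 <= level <= 3:
        raise ValueError("sample out of range")
    return [1 if i < level else 0 for i in range(3)]


if __name__ == "__main__":
    # 16 modulator samples per reference period, as in the 16x DSM clock
    samples = mash11(x=5, n_cycles=16)
    print(samples)
    print([dither_to_thermometer(y) for y in samples[:4]])
```

Because the dithering cell is toggled many times per reference period, the fractional part of the control word is averaged inside the period instead of forcing a fine-cell bit to toggle at every reference injection.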

B. GAIN MATCHED TDC ARCHITECTURE
The period of the slow oscillator is T_SOSC, and the period of the fast oscillator is T_FOSC. The resolution of the TDC is determined by T_SOSC − T_FOSC. The TDC gain (K_TDC) is defined as the ratio between the TDC[8:0] code value and the TDC input time difference. To achieve high linearity, the maximum value of the 5-bit fine counter, F[4:0], should correspond to one least-significant bit (LSB) of the 5-bit coarse counter. Since the proposed cyclic Vernier TDC uses a ring oscillator structure, increasing the number of bits K of the coarse counter extends the input detection range in proportion to 2^K. In this design, the 5-bit C[4:0] is used for coarse counting, so the input time detection range of the TDC is about 8.5 ns. Fig. 7(b) shows the initial operation of the proposed TDC. When the MDLL is turned on, the TDC first asserts C_TDC_EN on the (n + 1)th rising edge of CLK_OUT to start the slow oscillator. Then, on the next rising edge of CLK_REF, F_TDC_EN is asserted to start the fast oscillator. The coarse counter counts the number of SOSC pulses between C_TDC_EN and F_TDC_EN to generate C[4:0]. The fine counter counts the number of FOSC pulses between the rising edge of F_TDC_EN and the rising edge of Detect to generate F[4:0].
Fig. 8(a) shows a simplified feedback loop between the TDC and DCO in this design. When designing a TDC-based MDLL, it is essential to reduce the TDC latency and to match K_TDC (the TDC gain) to K_DCO (the DCO gain) in order to achieve a fast lock time. The TDC latency is the time it takes to generate an output code (TDC[8:0]) by comparing the two TDC inputs (CLK_REF and CLK_OUT). Fig. 8(b) shows the post-layout simulation results for the gain characteristics of the TDC and DCO used in this design. Here, the output code (TDC[8:0]) versus the input time difference of the TDC is plotted as a red line, and its slope corresponds to the TDC gain, K_TDC.
Also, the input code (DCW[8:0]) on the y-axis versus the output delay shift of the DCO is plotted as a blue line, and its slope corresponds to the DCO gain, K_DCO. In the TDC tracking mode, the gain of the loop filter (DLF#1) is one, so the gains of the TDC and DCO can be compared directly. As shown in Fig. 8(b), the TDC and DCO are designed so that their gain characteristics match each other. It can be confirmed that K_TDC = 1/K_DCO (or K_TDC · K_DCO = 1) over almost the entire range. If the gain matching relationship between K_TDC and K_DCO changes, the MDLL lock time may increase.
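The effect of gain matching on lock time can be illustrated with a simple behavioral model of the TDC tracking mode. This is a sketch under stated assumptions rather than the implemented loop: the 8 ps LSB is an illustrative value, the TDC is modeled as an ideal rounding quantizer, K_TDC is expressed in codes per picosecond and K_DCO in picoseconds per code, and the function name is hypothetical. The 10 ps lock condition, the limit of two searches, and the three reference cycles per search follow the text.

```python
from __future__ import annotations


def tdc_tracking(dt_ps: float,
                 lsb_ps: float = 8.0,        # illustrative TDC resolution (assumption)
                 gain_product: float = 1.0,  # K_TDC * K_DCO; 1.0 means matched gains
                 lock_ps: float = 10.0,      # phase lock condition from the text
                 max_searches: int = 2,
                 cycles_per_search: int = 3) -> tuple[float, int, bool]:
    """Return (residual phase error in ps, reference cycles used, locked?)."""
    k_tdc = 1.0 / lsb_ps              # codes per ps
    k_dco = gain_product * lsb_ps     # ps of delay shift per code
    searches = 0
    while abs(dt_ps) >= lock_ps and searches < max_searches:
        code = round(dt_ps * k_tdc)   # coarse/fine TDC search (ideal quantizer)
        dt_ps -= code * k_dco         # DLF#1 gain is 1: code goes straight to the DCO
        searches += 1
    return dt_ps, searches * cycles_per_search, abs(dt_ps) < lock_ps


if __name__ == "__main__":
    for product in (1.0, 0.98, 0.8):
        residual, cycles, locked = tdc_tracking(3030.0, gain_product=product)
        print(f"K_TDC*K_DCO={product}: residual={residual:.2f} ps, "
              f"cycles={cycles}, locked={locked}")
```

In this model, matched gains remove the 3.03 ns initial error in a single search, a 2% mismatch needs the second search, and a 20% mismatch leaves the loop unlocked after two searches, which is consistent with the statement that a change in the gain matching may increase the lock time.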
The simulated differential nonlinearity (DNL) and integral nonlinearity (INL) of the TDC are shown in Fig. 9. The TDC achieves a maximum DNL of ±0.368 LSB and a maximum INL of ±0.461 LSB.
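The coarse/fine conversion of the cyclic Vernier TDC described in this section can be summarized with an idealized behavioral model. The oscillator periods below (256 ps and 248 ps) are illustrative assumptions chosen so that 32 fine steps span one coarse LSB; they are not the values of the fabricated chip, and the function name is hypothetical. Nonidealities such as those captured by the DNL and INL are not modeled.

```python
from __future__ import annotations


def vernier_tdc(dt_ps: int,
                t_sosc_ps: int = 256,   # slow oscillator period (assumed value)
                t_fosc_ps: int = 248,   # fast oscillator period (assumed value)
                coarse_bits: int = 5) -> tuple[int, int, int]:
    """Return (coarse count, fine count, estimated time difference in ps)."""
    resolution = t_sosc_ps - t_fosc_ps          # TDC resolution
    if resolution <= 0:
        raise ValueError("fast oscillator must have the shorter period")
    full_range = (1 << coarse_bits) * t_sosc_ps  # range scales with 2**K
    if not 0 <= dt_ps < full_range:
        raise ValueError("input outside the detection range")

    # Coarse: slow-oscillator periods elapsed before the fast oscillator starts.
    coarse, lag = divmod(dt_ps, t_sosc_ps)

    # Fine: the fast oscillator gains `resolution` on the slow one each cycle;
    # count cycles until its edge catches up (edge coincidence detected).
    fine = 0
    while lag > 0:
        lag -= resolution
        fine += 1

    return coarse, fine, coarse * t_sosc_ps + fine * resolution


if __name__ == "__main__":
    print(vernier_tdc(3030))   # input comparable to the measured initial phase error
```

With these assumed periods, the detection range is 32 × 256 ps, about 8.2 ns, and the estimate never deviates from the input by more than one resolution step. In this ideal model the fine count reaches 32 only when the residue is within one step of a full slow period, where it equals one coarse LSB.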

III. MEASUREMENT RESULTS
The prototype all-digital MDLL chip was fabricated in a 40-nm CMOS process with an active area of 0.024 mm². Fig. 10(a) shows the die, test board, and chip layout of the implemented MDLL. The chip is packaged in a quad flat no-lead (QFN) package. Fig. 10(b) shows the measurement setup used to probe the prototype IC. The input reference clock (CLK_REF, 100 and 200 MHz) is obtained from the TI LMK62XX PLL IC, which is mounted on the test board. A digital oscilloscope (Tektronix DPO71254C) is used for the time-domain jitter measurements, and a spectrum analyzer (Agilent E4440A) is used to measure the reference spurs. The measurements are performed by on-chip probing on the test board. The proposed MDLL operates over a frequency range of 1.6 to 3.2 GHz from a 1.1 V supply. The MDLL consumes 3.56 mW at 1.6 GHz (N = 16) with a 100 MHz reference clock.
Fig. 11 shows the measured locking process of the prototype all-digital MDLL operating at 1.6 GHz with a frequency multiplication factor N = 16. As shown in Fig. 11(a), a pre-run of three reference cycles was intentionally allocated for the test before starting the MDLL. During this pre-run, the DCO inside the MDLL operates at the maximum frequency, and the initial phase error Δt is kept constant. Fig. 11(a) also shows that the TDC search was performed twice and that phase and frequency locking of the MDLL was obtained within six reference clock cycles. Fig. 11(b) shows that the measured initial phase error (Δt) of the MDLL is about 3.03 ns. Finally, Fig. 11(c) shows that the phases of the input and output clocks are well aligned, with almost zero phase difference after locking, and that the 1.6 GHz output clock, multiplied by N = 16, is properly generated. Assuming a residual phase error of 10 ps after two TDC searches, the calculated maximum frequency error at the locking point is approximately 0.1%.
Fig. 12 shows the jitter measurement results. As shown in Fig. 12(a), the root-mean-square (RMS) and peak-to-peak (p-p) jitter of the 100 MHz input reference clock (CLK_REF) provided by the PLL IC on the PCB are 2.19 ps and 15.74 ps, respectively. The main reason for the high jitter of the measured input CLK_REF is that the channel termination is not perfect in the on-board interconnect between the PLL IC and the MDLL chip, as shown in Fig. 10(b). Although this low-quality clock was used as the reference input, the prototype MDLL achieved excellent jitter characteristics. As shown in Fig. 12(a), the measured RMS and p-p jitter of the 1.6 GHz output clock (CLK_OUT) are 2.75 ps and 23.01 ps, respectively. Fig. 12(b) shows the p-p (28.1 ps) and RMS (3.09 ps) jitter values of the output clock at 3.2 GHz. If the input jitter is subtracted from the output jitter, the effective RMS and p-p jitter of the proposed MDLL at 3.2 GHz are only 1.05 ps and 8.96 ps, respectively.
Fig. 13 shows the power consumption breakdown of the proposed MDLL at 1.6 GHz. Table 1 compares the performance of state-of-the-art all-digital fast-lock integer-N frequency multipliers employing ring-based PLLs and MDLLs. [34] claims a lock time of four reference clock cycles, but its frequency error is as high as 5%, so the actual lock time is considerably longer, more than 1 µs. Therefore, to the best of our knowledge, the proposed all-digital MDLL achieves the shortest lock time, less than six reference clock cycles or 60 ns. Two figures of merit (FOMs) [36] were used to compare the performance of the clock frequency multipliers in Table 1. The proposed all-digital MDLL achieves the best FOM_2 despite using an input clock source with significant jitter.
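The lock-time and frequency-error figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 0.1% frequency error is obtained by relating the 10 ps residual phase error to one 10 ns reference period; this derivation is one plausible reading and is not spelled out in the text. The function names are hypothetical.

```python
def output_frequency_hz(n: int, f_ref_hz: float) -> float:
    """Output frequency of an integer-N multiplier."""
    return n * f_ref_hz


def lock_time_ns(ref_cycles: int, f_ref_hz: float) -> float:
    """Lock time for a given number of reference clock cycles."""
    return ref_cycles / f_ref_hz * 1e9


def frequency_error_percent(residual_s: float, f_ref_hz: float) -> float:
    """Residual phase error expressed as a fraction of one reference period.

    Assumption: the error accumulates over a single reference period.
    """
    return residual_s * f_ref_hz * 100.0


if __name__ == "__main__":
    f_ref = 100e6                                   # 100 MHz reference clock
    print(output_frequency_hz(16, f_ref) / 1e9)     # output frequency in GHz
    print(lock_time_ns(6, f_ref))                   # six reference cycles in ns
    print(frequency_error_percent(10e-12, f_ref))   # 10 ps residual, in percent
```

With a 100 MHz reference, six reference cycles correspond to the 60 ns bound stated above, and a 10 ps residual over a 10 ns period corresponds to the quoted 0.1%.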

IV. CONCLUSION
Fast-lock clock generators are essential for energy-efficient chiplet-based SoCs requiring dynamic frequency scaling. This paper presents a new all-digital MDLL utilizing a cyclic Vernier TDC that achieves excellent locking time. The proposed all-digital MDLL is fabricated in 40-nm CMOS technology and achieves a fast lock time of less than six reference clock cycles at 1.6 GHz from a 100 MHz reference clock. The measured FOM_2 is -370.14 dB at 1.6 GHz, which shows the best lock time performance compared to state-of-the-art all-digital integer-N clock generators.