Low Power Clock Generator Design With CMOS Signaling

The demand for higher energy efficiency in the datacenter and for longer battery life in laptop computers, cell phones, and other IoT devices, together with the push for higher performance through higher frequency and more cores, drives the need for more clock generators with better performance (frequency and jitter) and lower power budgets. Traditional current mode low swing clock generators were widely used in industry about 10 years ago. Although they had the advantage of higher supply noise rejection due to the differential nature of the architectures, they had the disadvantages of high power consumption, large layout area, and poor compatibility with process scaling. In contrast, clock generator architectures with CMOS large swing signaling, which offer low power consumption, small area, and circuits friendly to process scaling, have been widely adopted for clock generation in the industry since 2009. In this paper, phase locked loops, delay locked loops, phase interpolators, high resolution digital to time converters, and clock distribution techniques with CMOS large swing signaling are discussed and reviewed.


I. INTRODUCTION
DUE TO the increasing bandwidth demands of data centers, multi-core computers, cell phones, and other IoT devices, energy-efficient high-speed clock generators are necessary in today's computing systems. Cost concerns dictate that these clock generator circuits, such as phase locked loops (PLLs), delay locked loops (DLLs), phase interpolators (PIs), and clock or clock-phase distribution networks, must meet all performance specifications in standard CMOS technology without additional process cost. Since analog and mixed signal circuits, including high speed clock generators, typically constitute ∼15% of the die area of a computing system-on-chip CPU, their power and area scaling directly impact performance and cost. Despite its great economic benefits, Moore's Law logic scaling poses design challenges for analog circuits. High-performance on-chip clock generators are essential for high speed IOs, whose data rates double every 3∼4 years, driven by the exponential need for higher data bandwidth to support multi-core processing. Fig. 1 shows a typical block diagram of a high-speed serial link that employs a forwarded-clock architecture with a DLL and two phase interpolators [1]. The input high speed clock is first amplified by the forwarded-clock amplifier (FCA), followed by a duty cycle correction (DCC) circuit. After duty cycle correction, the received clock is fed into a DLL where 8 phases with equal spacing of 45° are generated. Based on these phases, the phase interpolator generates a sampling clock with a fine step size of T/(8*N), where T is the clock period and N is the number of clock phases produced by the fine-tuning phase interpolator; N equals 8 when the phase interpolator uses a 3-bit digital control code. The bit error rate tester (BERT) was designed to perform on-die testing in loop-back mode. For a low power transceiver design, the DLL and the phase interpolators must consume as little power as possible.
Phase interpolators, or digital to time converters, are also commonly used in the clock data recovery (CDR) circuit of high speed SerDes where, unlike the forwarded-clock architecture, the receiver sampling clock is derived from the incoming data stream. Fig. 2 shows the block diagram of the DLL used in the receiver clock path of the high-speed IOs in the Intel Core™ micro-architecture (Nehalem) fabricated in 45-nm high-k metal-gate process technology [2]. The delay cells are built with the current mode differential circuit stage shown in Fig. 3. The phase interpolator (not shown in the figure) was also designed with the differential delay cell in Fig. 3.
These traditional current mode low swing clock circuits were widely used in phase locked loops [3]-[4], delay locked loops [2], [4]-[7], clock-data-recovery (CDR) circuits, and phase interpolators [8]-[11] until about 10 years ago. Although they had the advantage of higher power supply noise rejection due to the differential nature of the design, they had the disadvantages of high power consumption, large layout area, and poor compatibility with process scaling. Their high power consumption arises because the differential delay cell is basically a differential pair with an NMOS current source and PMOS loads, biased with a DC current. Also, when the low-swing differential signals in the delay chains are converted back to full swing CMOS signals for clock distribution, another differential amplification stage is required for the clock output. The poor jitter performance of a self-biased PLL is the result of noise (flicker and thermal) contributions from the current sources. VCOs with differential delay cells also produce duty cycle error due to differential pair offset and output level shifter distortion. To satisfy the requirement for a 50% clock duty cycle, either a duty cycle correction (DCC) block is required, or a divide-by-2 in frequency must be performed. In the latter case the differential VCO must run at 2x frequency at all process, voltage, and temperature (PVT) conditions, which further increases power consumption and is difficult to design due to transistor high frequency performance limitations. Fig. 3 shows the differential delay stage commonly used in self-biased PLLs and DLLs since the early 1990s [2]-[7], which consists of an NMOS differential pair, an NMOS current source, and linearized PMOS load devices. In addition to high power consumption due to DC bias current, it has larger jitter amplification due to lower signal swing and intersymbol interference.
As CMOS process scaling continues, the output impedance of the transistor in saturation decreases due to the short channel effect, and noise coupled to node nx (as shown in Fig. 3) causes current to change in short channel devices, modulating the delay of the stage and causing deterministic jitter. Although FinFET process technology improves device output impedance significantly [12], the device output impedance still degrades from generation to generation as the scaling trend continues.
To address the disadvantages of PLL, DLL and PI circuits that use low swing current mode logic (CML), PLL and DLL designs using locally regulated power supply CMOS buffers were published [13]. The matched current DLL (MC-DLL) and matched current PI (MC-PI) with CMOS large signal swing were first proposed and designed in 45 nm and 32 nm CMOS process technology [1]. To facilitate technology area scaling of the clocking circuits in high volume microprocessor products in standard CMOS process technology, all critical analog DLL and PI blocks utilize only minimum channel length transistors, leveraging their speed and the process technology scaling. Since then, several other phase interpolators employing CMOS large supply swing signaling were published [14]-[18] for use in high speed IOs, multiplying delay locked loops (MDLL), and digital to time converter (DTC) circuits.
In this paper, the clocking circuit designs with CMOS large swing signaling will be discussed and reviewed. Section II discusses the low power delay locked loops and frequency locked loops using CMOS large signal swing circuits. Their area and power scaling are compared with corresponding designs with small swing CML signaling. Section III describes the low power phase interpolator design using CMOS large signal swing circuits; Section IV reviews the high-resolution digital to time converter with improved CMOS phase interpolator; Section V reviews the recent development in CMOS clocking for 224Gbps SerDes applications. The conclusions are given in Section VI.

A. DELAY LOCKED LOOP
In a "forwarded" clock based serial IO link, the typical clocking circuitry for receiver (RX) block is mainly implemented with a DLL and two PIs. The DLL generates 4 or 8 uniformly spaced phases across a full clock period and sends them to PIs. The PI interpolates the adjacent delays/phases further to time the sampling clock phase precisely. In the past, many DLL and PI designs have been proposed to aim at high-speed data transmissions. However, these designs either consumed high power or occupied a large silicon area. Further, some designs are not suitable for process scaling because they are vulnerable to device threshold voltage variation or low output impedance due to short channel effect. For example, a low swing CML-based delay cell is vulnerable to the V t variation in the input differential pair because the V t variations cause an undesired input offset in the differential MOSFET pairs.
To achieve high speed, low power, low jitter and small area, a full-swing matched-current delay locked loop (MC-DLL) and a full-swing matched-current phase interpolator (MC-PI) for the receiver clocking circuitry were proposed and designed in both a 45 nm and a 32 nm CMOS process technology. The block diagram of the MC-DLL is shown in Fig. 4. It consists of a startup circuit (not shown in the figure), a phase detector, a charge pump, a loop filter and 9 delay cells. The startup circuit initializes the loop after the proper timing sequence to avoid harmonic locking. Each delay cell generates a 45° phase shift. The first delay cell output (0° phase) and the final delay cell output (360° phase) are routed to the phase detector to extract the phase difference and then to adjust the charging (up) and discharging (dn) currents from the charge pump to the loop filter so that the loop filter can generate the proper control voltage, nbias. The control voltage nbias is then sent to a bias generator, as shown in Fig. 5, to generate a corresponding control voltage pbias for the PMOS voltage to current converter devices. This pbias works together with nbias to match the pull-up and pull-down currents in the replica of a half delay cell. Finally, the same nbias and pbias voltages used for the replica are routed to control all the delay cells in the delay cell block. Theoretically, the pull-up and pull-down currents for all the delay cells will be equal as well.
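The locking behavior described above can be sketched with a first-order behavioral model. This is only an illustration under stated assumptions, not the MC-DLL circuit: the phase detector, charge pump, and loop filter are lumped into a single integrator, the names `simulate_dll`, `initial_cell_delay_ps`, and `loop_gain` are hypothetical, and the starting delay is assumed close enough to the target that harmonic locking (which the startup circuit prevents) does not arise.

```python
def simulate_dll(period_ps=200.0, n_cells=8, initial_cell_delay_ps=40.0,
                 loop_gain=0.05, steps=400):
    """Behavioral DLL sketch: adjust the per-cell delay until the
    delay between the 0-degree and 360-degree taps equals one period."""
    cell_delay = initial_cell_delay_ps  # assumed starting point, not from the paper
    for _ in range(steps):
        total_delay = n_cells * cell_delay   # delay between the two compared taps
        error = period_ps - total_delay      # idealized phase detector output (ps)
        cell_delay += loop_gain * error      # charge pump + loop filter as an integrator
    return cell_delay


if __name__ == "__main__":
    locked = simulate_dll()
    print(f"per-cell delay at lock: {locked:.3f} ps")
    print(f"per-cell phase shift:   {locked / 200.0 * 360.0:.2f} degrees")
```

With a 5 GHz clock (200 ps period) and eight cell delays between the compared taps, the model settles at 25 ps per cell, which corresponds to the 45° spacing.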
The unit delay cell in Fig. 4 is shown in Fig. 5. A matched current buffer, which consists of two matched current inverters, is used as the delay cell. This matched-current delay cell maintains the CMOS supply voltage full-swing signal. This full-swing CMOS level signaling enables the delay cell to consume low power even when operating at 5 GHz and it is less sensitive to the V t variations in the delay cells. In addition, the fast rising/falling edges of the full swing clock edges make the signal less sensitive to the MOSFET transistor's noise and other environmental interference as the signal propagates. Moreover, the pbias of all delay cells is AC coupled to the supply voltage to achieve good PSRR (power supply rejection ratio). Besides, the stacked-transistor current control devices in the delay cells can improve the output impedance which helps the delay cell to be less sensitive to the power supply noise. Depending on the noise in the power supply, and the deterministic jitter specification, a linear voltage regulator may be needed in some applications where the power supply noise is high.
As mentioned, the pull-up and pull-down currents of the delay cells are matched. This current match makes the slew rates for the rising edges and falling edges identical. Therefore, no duty cycle error is introduced in the delay cell circuit block. Also, the number of transistors in the current control devices is controlled by a 2-bit binary decoder (not shown in Fig. 5), which can increase the driving current level of the delay cells so that the MC-DLL frequency range can be extended. In the 32 nm design, the MC-DLL can operate from 1.25 GHz to 5 GHz across process, temperature, and supply voltage corners. In this MC-DLL design, a structure similar to the phase-only detector [13] is employed as the phase detector. The self-biased concept [4] is also adopted to make the charge pump current proportional to the driving current of the delay cells so that the DLL bandwidth tracks the operating frequency.
The MC-DLL was first implemented in 45 nm and 32 nm high-k metal-gate CMOS technology and demonstrated to work at 5 GHz with a power consumption of only 6 mW, a 5x reduction relative to a reference design using a current mode low voltage swing DLL. Fig. 6 shows the measured output clock waveforms at 5 GHz operation from the 32 nm test chip, where only 4 phases are shown. The waveforms are not square due to the large loading from the measurement setup. The measured maximum phase error from the ideal phase positions for the MC-DLL is only 2.59 ps. The worst measured RMS jitter from the MC-DLL outputs is 1.29 ps. It should be mentioned that the RMS jitter of the forwarded clock amplifier (FCA) which drives the MC-DLL is about 1.24 ps. Therefore, the jitter amplification (added jitter) for the MC-DLL is about 4%. The layout area for the MC-DLL is 65 μm x 120 μm in the 32 nm CMOS process technology, which is ∼50% smaller than the reference design with current mode low voltage swing circuits.
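The jitter figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the rounded values reported in the text; the root-sum-square estimate of the added jitter is an additional assumption (it presumes the DLL's own jitter is uncorrelated with the FCA jitter) and is not a number from the measurement.

```python
import math

FCA_RMS_JITTER_PS = 1.24   # RMS jitter at the MC-DLL input (from the text)
DLL_RMS_JITTER_PS = 1.29   # worst RMS jitter at the MC-DLL outputs (from the text)
PERIOD_PS = 200.0          # 5 GHz clock period
MAX_PHASE_ERROR_PS = 2.59  # maximum measured phase error (from the text)

# Ratio of output to input RMS jitter, expressed as a percentage increase.
jitter_ratio = DLL_RMS_JITTER_PS / FCA_RMS_JITTER_PS
added_percent = (jitter_ratio - 1.0) * 100.0

# Assumption: uncorrelated contributions, so the DLL's own jitter adds in quadrature.
added_rss_ps = math.sqrt(DLL_RMS_JITTER_PS**2 - FCA_RMS_JITTER_PS**2)

# Phase error expressed in degrees of the 5 GHz period.
phase_error_deg = MAX_PHASE_ERROR_PS / PERIOD_PS * 360.0

if __name__ == "__main__":
    print(f"jitter ratio:       {jitter_ratio:.3f}")
    print(f"increase:           {added_percent:.1f} %")
    print(f"RSS added jitter:   {added_rss_ps:.2f} ps (assumes uncorrelated noise)")
    print(f"max phase error:    {phase_error_deg:.2f} degrees")
```

With the rounded inputs the ratio comes out near 1.04, in line with the roughly 4% jitter amplification stated in the text; rounding in the reported values accounts for any small difference.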

B. PHASE LOCKED LOOP
As industry has been moving toward high performance and low power chip design, it has become necessary to use PLLs that operate at multi-GHz frequencies with a precisely 50% duty cycle while consuming the minimum amount of power. The self-biased PLL [4] cannot meet the product requirements such as a precisely 50% duty cycle, low power consumption, low jitter performance and small layout area. Higher power consumption from the self-biased differential PLL is caused by the fact that the delay cells require a DC bias current. It had become increasingly difficult to design self-biased PLLs and DLLs with sub-1V supply voltage as process technology scaling continued, driven by digital circuit performance improvement, area scaling and power consumption reduction.
A low power phase locked loop was designed in 45nm CMOS process technology [19] by adopting the matched current delay cells described in Section II-A. One of the most important advantages is that the output clock duty cycle is 50%, so duty cycle correction is not needed compared with other types of PLL design. Fig. 7 shows the schematic of the matched current VCO (MC-VCO). Five matched current inverters are connected into a ring oscillator. The delay from matched current inverter is controlled by nbias and pbiaso, where nbias is generated from phase frequency detector, charge pump, and low pass filter. Pbiaso  is generated from the feedback loop with a replica delay cell. Since the reference voltage to the differential amplifier is set at Vcc/2, the feedback loop adjusts the pbiaso to make sure the pull-up current matches the pull-down current across process, supply voltage, and temperature variations. This enables the clock rising time to be the same as its falling time, and therefore 50% duty cycle is obtained. Pbiaso is coupled to power supply Vcc to achieve noise canceling (Vgs = Vg-Vcc), which is required to achieve high power supply noise rejection (PSRR). The number of delay cells in the VCO ring can also be 3 or 7, depending on the frequency requirements. Depending on the process technology and operating frequency range, the current control device sizes used in the VCO can be digitally tuned. To achieve high power supply noise rejection and low power consumption, the current control devices should use long channel length transistors while the switching devices should use minimum channel length transistors. In the process technology where long channel transistors are not available, stacked transistors can be used to approximate the performance of long channel transistors. 
The differential amplifier DC gain can range from 10 to 20. Its compensation capacitance C1, which is also used to reject supply noise, is chosen to provide a phase margin of more than 45 degrees, which requires a value of 1∼2 pF. Fig. 8 shows a simplified block diagram of the low power PLL based on the proposed matched current VCO. The phase frequency detector (PFD) and charge pump (CP) are similar to the design used in [20], [21]. The low pass filter is a simple RC filter with a shunt capacitor. The input reference clock (Refclk) comes from the external reference clock's on-die receiver. The rising edge of the feedback clock (Fbclk) from the VCO (after being divided by N) and the rising edge of Refclk are compared at the PFD. When Refclk is ahead of Fbclk, UP goes high and nbias is raised; the delay of each delay stage decreases and the VCO frequency therefore increases. Conversely, when Fbclk is ahead of Refclk, DN goes high and nbias is lowered; the delay of each delay stage increases and the VCO frequency therefore decreases. Finally, the system reaches a steady state where the rising edge of Refclk and the rising edge of Fbclk are aligned, and the PLL is locked. Fig. 9 shows the simulated lock acquisition of the proposed low power PLL. The lock time is less than 100 ns for a VCO frequency of 3.2 GHz (reference frequency of 200 MHz), which is 5 to 10x faster than the self-biased PLL under the same operating conditions. Fig. 10 shows the output clock waveforms of the VCO. When simulated at the typical process corner, 110 °C and 1.1 V supply, the duty cycle is 50.05%. At the fast process corner, −10 °C and 1.15 V, the simulated duty cycle is 50.11%. At the slow process corner, 110 °C and 1.0 V, the simulated duty cycle is 49.82%. The simulated duty cycle error is within +/−0.18%, demonstrating the effectiveness of the proposed architecture.
The simulated average supply current of the matched current phase locked loop is 3 mA (or 3.3 mW), ∼7x lower than a reference design with self-bias PLL architecture.
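The PLL numbers reported in this subsection are mutually consistent, as the short check below illustrates. It is a sketch based only on values in the text; the one assumption is that the 3 mA supply current is quoted at the 1.1 V typical supply, which the text does not state explicitly.

```python
VCO_HZ = 3.2e9            # VCO frequency (from the text)
REF_HZ = 200e6            # reference frequency (from the text)
LOCK_TIME_S = 100e-9      # upper bound on the simulated lock time (from the text)
SUPPLY_CURRENT_A = 3e-3   # simulated average supply current (from the text)
SUPPLY_V = 1.1            # assumption: power is quoted at the typical 1.1 V supply

# Simulated duty cycles (%) at the three corners listed in the text.
DUTY_CYCLE_PCT = {"typical": 50.05, "fast": 50.11, "slow": 49.82}

divider_n = round(VCO_HZ / REF_HZ)                 # feedback divider ratio N
lock_ref_cycles = LOCK_TIME_S * REF_HZ             # lock time in reference cycles
power_mw = SUPPLY_CURRENT_A * SUPPLY_V * 1e3       # supply power in mW
worst_duty_error = max(abs(d - 50.0) for d in DUTY_CYCLE_PCT.values())

if __name__ == "__main__":
    print(f"divider N:          {divider_n}")
    print(f"lock time:          under {lock_ref_cycles:.0f} reference cycles")
    print(f"supply power:       {power_mw:.1f} mW")
    print(f"worst duty error:   {worst_duty_error:.2f} %")
```

The divider ratio of 16 and the bound of about 20 reference cycles for lock follow directly from the quoted frequencies, and the slow corner sets the ±0.18% duty cycle error bound.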

III. PHASE INTERPOLATOR WITH CMOS SIGNALING
The phase interpolator generates a precise clock waveform between two input clock waveforms. For example, while the DLL generates 8 clock phases, 0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°, the PI generates 8 phases between any two adjacent phases, such as 0° and 45°, as shown in Fig. 11. The delay of the output waveform relative to the input clock waveforms is determined by a digital control code, which normally is either 2 or 3 bits but can be up to 7 bits in applications where finer delay steps are required [15]. Fig. 12 shows the phase interpolator architecture. The phase interpolator consists of a 3-bit thermometer decoder, a 2-bit decoder (not shown in Fig. 12), two 4-to-1 MUXs and a PI mixer. The 2-bit decoder controls the two 4-to-1 MUXs to pick two adjacent phases from the 8 phases generated by the DLL. The selected adjacent phases are sent to the mixer. The 3-bit thermometer decoder converts a 3-bit binary code into an 8-bit thermometer code, which controls each of the eight 2-to-1 MUXs inside the PI mixer. Consequently, the output clock phase is an interpolated result of the two input phases based on the digital input to the thermometer decoder. For a mixer with a 3-bit control code, the phase interpolator interpolates the two input phases to generate 8 discrete phases between them. When the phase interpolator works together with the MC-DLL, a total of 64 discrete phase steps across a clock cycle are generated. At a clock frequency of 5 GHz, the phase step from the phase interpolator is 200/64 = 3.125 ps or 5.625°. At 6 GHz operation frequency, the corresponding phase step is 2.60 ps or 5.625°.
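The step-size arithmetic above can be reproduced with a small helper. This is a minimal sketch; the function name `pi_phase_step` and its parameters are illustrative and not taken from the paper.

```python
def pi_phase_step(clock_hz, dll_phases=8, pi_bits=3):
    """Return (step in ps, step in degrees) for a DLL plus phase interpolator.

    The clock period is divided into dll_phases coarse phases, each of
    which is split into 2**pi_bits interpolated steps.
    """
    period_ps = 1e12 / clock_hz
    total_steps = dll_phases * 2 ** pi_bits
    return period_ps / total_steps, 360.0 / total_steps


if __name__ == "__main__":
    for f_hz in (5e9, 6e9):
        step_ps, step_deg = pi_phase_step(f_hz)
        print(f"{f_hz / 1e9:.0f} GHz: {step_ps:.3f} ps per step, {step_deg:.3f} degrees")
```

The step in degrees depends only on the number of steps per period (64 here), which is why it stays at 5.625° for both frequencies while the step in picoseconds shrinks with the clock period.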
For the phase interpolator design, the mixer is the most important block for making all the interpolated clock phases linear. To improve the phase linearity, the design uses a weighted thermometer coded architecture. In the mixer block, eight 2-to-1 MUXs together with 8 matched current inverter cells (MC-cells) generate a current-mixing output from the two input phases. The resultant clock phase at the mixer output depends on how much weighting is given to each of the two input phases. The matched current inverter cell is also shown in Fig. 12; it is half of the matched current delay cell used in the MC-DLL. Each matched current inverter in the mixer is properly sized to optimize the overall linearity. To explain the operation of the phase interpolator, take the example of interpolation between phase 0° and phase 45°. When thm<7:0> are all 0 (or 1), α = 1 (or 0) and β = 0 (or 1) in the left plot of Fig. 11, the eight 2-to-1 MUXs all select the 0° phase (or 45° phase), and the resulting output of the phase interpolator corresponds to input phase 0° (or 45°). When thm<7:0> = 00000001, α = 7/8 and β = 1/8: the first 2-to-1 MUX selects input phase 45° while the other seven 2-to-1 MUXs still select input phase 0°. As a result, the phase interpolator output is delayed by one LSB (which is 3.125 ps for a 5 GHz clock) relative to the output phase with thm<7:0> = 00000000. As thm<7:0> increases to 00000011, the first two 2-to-1 MUXs select input phase 45° while the other six 2-to-1 MUXs still select input phase 0°, and the resulting output phase is delayed by two LSBs relative to thm<7:0> = 00000000. This process continues until thm<7:0> = 11111111. The nbias and pbias voltages are generated by the MC-DLL shown in Fig. 4, which compensates for PVT (process, voltage, and temperature) skew variations and achieves good linearity across PVT corners.
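The weighting scheme in this example can be illustrated with an idealized model in which the output phase is the weighted average α·φ_A + β·φ_B, with β equal to the fraction of 2-to-1 MUXs that select the later phase. This linear model is an assumption made for illustration; the actual mixer relies on individually sized cells to approach it. The function names are hypothetical.

```python
def thermometer(ones, width=8):
    """Thermometer code as a list, index 0 = thm<0>; `ones` low bits are set."""
    if not 0 <= ones <= width:
        raise ValueError("number of set bits must be between 0 and width")
    return [1 if i < ones else 0 for i in range(width)]


def interpolate_phase(phase_a_deg, phase_b_deg, thm):
    """Idealized mixer: each set bit steers one MUX to phase B."""
    beta = sum(thm) / len(thm)    # weight of the later phase
    alpha = 1.0 - beta            # weight of the earlier phase
    return alpha * phase_a_deg + beta * phase_b_deg


if __name__ == "__main__":
    for ones in range(9):
        thm = thermometer(ones)
        word = "".join(str(b) for b in reversed(thm))   # printed as thm<7:0>
        print(f"thm<7:0>={word}  phase={interpolate_phase(0.0, 45.0, thm):.3f} deg")
```

In this model thm<7:0> = 00000001 gives α = 7/8, β = 1/8 and an output one LSB (5.625°) after the 0° phase, matching the example in the text.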
In the applications where the DLL is not present, pbias and nbias inputs can be connected to ground and Vcc respectively, or the PMOS and NMOS current control devices can be removed. Fig. 13 shows the output phase vs. code for the matched current DLL-PI design; the worst-case DNL is less than 1.2 ps (0.46 LSB) at 6 GHz operation. This matched current DLL and PI circuit enabled the first 12 Gb/s transceiver design in 32 nm CMOS process technology [1].
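The conversion of the DNL figure from picoseconds to LSBs, and the way DNL and INL are typically extracted from a phase-versus-code curve, can be sketched as follows. The endpoint-fit definition used here is one common convention and is an assumption, since the text does not state which definition was applied; the sample data in the usage lines is hypothetical.

```python
def dnl_ps_to_lsb(dnl_ps, clock_hz, steps_per_period=64):
    """Express a DNL value in ps as a fraction of the ideal step (LSB)."""
    lsb_ps = 1e12 / clock_hz / steps_per_period
    return dnl_ps / lsb_ps


def dnl_inl(delays_ps):
    """Endpoint-fit DNL and INL (in LSB) from delay measured at each code."""
    steps = [b - a for a, b in zip(delays_ps, delays_ps[1:])]
    lsb = (delays_ps[-1] - delays_ps[0]) / len(steps)   # average step size
    dnl = [s / lsb - 1.0 for s in steps]
    inl, running = [], 0.0
    for d in dnl:
        running += d            # INL is the running sum of DNL
        inl.append(running)
    return lsb, dnl, inl


if __name__ == "__main__":
    print(f"1.2 ps at 6 GHz = {dnl_ps_to_lsb(1.2, 6e9):.2f} LSB")
    # Hypothetical delay-versus-code samples, for illustration only.
    lsb, dnl, inl = dnl_inl([0.0, 1.5, 2.0, 3.0])
    print(lsb, dnl, inl)
```

At 6 GHz the ideal step is about 2.60 ps, so a 1.2 ps DNL corresponds to roughly 0.46 LSB, as quoted.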

IV. HIGH PRECISION DIGITAL TO TIME CONVERTER
Digital to time converters (DTCs) are widely used in clock-and-data recovery (CDR) circuits, fractional-N PLLs, and MDLLs [17]. They consist of coarse tuning and fine tuning controls: the coarse tuning is accomplished by a delay locked loop, a multi-phase VCO, or a divider, while the fine tuning is done by a phase interpolator controlled by a 2 to 7-bit binary digital code. An 11-bit DTC with a resolution of 244 fs was reported in [15]; the circuit architecture is shown in Fig. 14. The PLL generates differential clock phases (VCOp and VCOn) at 8 GHz. The multimodulus divider (MMD) divides the 8 GHz clock down to 2 GHz and outputs two 2 GHz clock phases with a phase difference of 62.5 ps (MMD_out1 and MMD_out2). This phase difference is further halved to 31.25 ps at the MUX+DEL stage, resulting in a phase difference of 31.25 ps at the input of the phase interpolator. With the 4-bit digital control, the MMD and MUX are able to cover the full 2π range of the 2 GHz clock at MMD_out1/out2 over the code, with 16 pairs of clock waveforms with a phase difference of 31.25 ps.
The phase interpolator architecture is shown in Fig. 15, where 244 fs resolution is achieved with 7 binary digital control bits. To achieve monotonicity, thermometer coding of 128 interpolator cells is used. These 128 interpolator unit cells are arranged in an 8x16 array; all of them receive the input signals (MUX_out and DEL_out), and their outputs are connected to the common node V_int. The detailed schematic of the interpolator cell is shown in Fig. 16(a). It consists of two tri-state inverters, one for the first input clock phase (In_1) and one for the second input clock phase (In_2). One of the modifications from [1] is that, in this design, only the pull-up branch can be turned on when the input clock goes from V_DD to GND, and only the pull-down branch can be turned on when the input clock goes from GND to V_DD. In this way, any conducting path from V_DD to GND is avoided, which helps to improve the linearity of the phase interpolator. When all the bits in the 7-bit code are 0, the tri-state inverter with In_1 as the input is turned on in all 128 interpolator cells and the tri-state inverter with In_2 as the input is turned off in all 128 interpolator cells, so the interpolator output clock corresponds to the In_1 phase. When the tri-state inverter with input In_1 is turned off and the tri-state inverter with input In_2 is turned on for interpolator cells 1 to i, the output clock corresponding to the i-th interpolator phase is generated. Since the output of the phase interpolator V_int is floating between the pull-up and pull-down times, 16 retention cells are inserted to maintain the voltage level. The retention cell schematic is shown in Fig. 16(b).
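The split of the 11-bit code into coarse and fine parts can be sketched as below. The bit ordering (upper 4 bits select the coarse phase pair, lower 7 bits drive the interpolator) and the ideal linear delay are assumptions for illustration; the text gives the bit counts and step sizes but not the exact code mapping. The function names are hypothetical.

```python
def dtc_decode(code, coarse_bits=4, fine_bits=7):
    """Split a DTC code into (coarse index, fine index)."""
    total_codes = 1 << (coarse_bits + fine_bits)
    if not 0 <= code < total_codes:
        raise ValueError("code out of range")
    coarse = code >> fine_bits                 # assumed: upper bits pick the phase pair
    fine = code & ((1 << fine_bits) - 1)       # assumed: lower bits set the interpolation
    return coarse, fine


def dtc_ideal_delay_ps(code, clock_hz=2e9, coarse_bits=4, fine_bits=7):
    """Ideal delay for a code, assuming perfectly linear coarse and fine steps."""
    coarse, fine = dtc_decode(code, coarse_bits, fine_bits)
    period_ps = 1e12 / clock_hz
    coarse_step_ps = period_ps / (1 << coarse_bits)     # 31.25 ps at 2 GHz
    fine_step_ps = coarse_step_ps / (1 << fine_bits)    # about 0.244 ps
    return coarse * coarse_step_ps + fine * fine_step_ps


def cells_selecting(fine, n_cells=128):
    """Number of interpolator cells driven by (In_1, In_2) for a fine index."""
    return n_cells - fine, fine


if __name__ == "__main__":
    for code in (0, 1, 127, 128, 2047):
        print(code, dtc_decode(code), f"{dtc_ideal_delay_ps(code):.4f} ps")
```

Under these assumptions one fine step is 31.25 ps / 128 ≈ 244 fs, and the 2048 codes span the 500 ps period of the 2 GHz clock.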
The proposed 11-bit DTC was implemented in standard 28 nm CMOS process technology with an area of 0.009 mm² and a measured power consumption of 19.8 mW. The design covers the 2π range of a 2 GHz clock with a resolution of 244 fs. Linearity measurement data show a peak INL of 1.2 ps, in agreement with simulation. A similar phase interpolator architecture was adopted in the fractional-N MDLL design [17] in 22 nm FinFET CMOS process technology.

V. HIGH SPEED CMOS CLOCK CIRCUITS FOR SERDES
CMOS clock buffers have been used extensively for high-speed clock distribution in microprocessors and continue to be used for high speed SerDes PHYs with the help of inductor peaking, such as the recently reported 224 Gbps SerDes PHY in a 10 nm FinFET process [22]. Although CML-like clock distribution was also used in some recent publications [23]-[24], it needs to be converted to CMOS level swing to drive TX circuits. The amplification of low-swing clocks at very high frequency to full-swing CMOS is costly in terms of power, and it increases jitter and design complexity. CMOS clocking circuits have the advantages of low jitter amplification and low power consumption. Fig. 17 shows the clock distribution network, which supports two distinct modes: a full-rate 28 GHz clock frequency mode and a divided clock frequency (div-2/4/8/16) mode for lower speed operations. In the divided clock frequency mode, quadrature clocks are generated with programmable I/Q dividers. Having separate clock modes avoids the need for large tuning ranges in the 28 GHz quadrature delay line and duty cycle correction.
In the full-rate 28 GHz clock frequency mode, a digitally tuned LC-based delay line generates quadrature clocks from the 28 GHz differential clock generated by a digital LC-PLL [25]. An AC-coupled inverter with resistive feedback and a voltage DAC performs the duty cycle correction. To distribute a low jitter 28 GHz clock, four inductively peaked CMOS buffer stages, S1∼S4 as shown in Fig. 17, are designed. Shunt-series peaking in S1, S2 and S3 provides jitter filtering and reduces power supply noise. To support a wide frequency range and reduce area, compact low-Q inductors are used. The last stage (S4) uses series-shunt peaking because it consumes 20% less power at the expense of a larger inductor. Fig. 18 shows the simulated jitter amplification and power comparison for i) a CMOS buffer, ii) a CMOS buffer with series-shunt inductors, and iii) a CMOS buffer with shunt-series inductors [22]. The CMOS buffers with shunt-series and series-shunt inductors have a jitter amplification of 0.8, while plain CMOS buffers have a jitter amplification of 1.45 at 28 GHz clock frequency. The lower jitter amplification (jitter attenuation) of CMOS buffers with inductor peaking can be explained as follows. The presence of a zero in the transfer functions of the CMOS buffers with inductor peaking creates a bandpass characteristic, which attenuates high-frequency random jitter caused by device noise in the clock buffers and reduces the integrated voltage noise at the output. At the same time, a sharper slope at the transition point reduces the conversion of intrinsic buffer voltage noise into timing jitter. This is a key advantage over conventional CMOS and CML clock buffers, which tend to amplify high frequency jitter.
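The statement that a sharper edge reduces the conversion of voltage noise into timing jitter follows from the standard first-order relation σ_t ≈ σ_v / (dV/dt), evaluated at the switching threshold. The sketch below illustrates that relation only; the noise and slew-rate values are hypothetical and are not taken from [22].

```python
def timing_jitter_fs(v_noise_rms_uv, slew_v_per_ns):
    """First-order RMS timing jitter from voltage noise and edge slew rate.

    sigma_t = sigma_v / (dV/dt), with sigma_v in microvolts and the
    slew rate in volts per nanosecond; the result is in femtoseconds.
    """
    if slew_v_per_ns <= 0:
        raise ValueError("slew rate must be positive")
    sigma_v = v_noise_rms_uv * 1e-6      # volts
    slew = slew_v_per_ns * 1e9           # volts per second
    return sigma_v / slew * 1e15         # femtoseconds


if __name__ == "__main__":
    # Hypothetical values chosen only to show the trend.
    for slew in (10.0, 20.0):
        jitter = timing_jitter_fs(v_noise_rms_uv=1000.0, slew_v_per_ns=slew)
        print(f"slew {slew:.0f} V/ns -> {jitter:.0f} fs rms")
```

For a fixed noise voltage, doubling the slew rate at the crossing point halves the resulting jitter in this model, which is the mechanism the paragraph above attributes to inductor peaking.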

VI. CONCLUSION
The clock generator circuits, such as phase locked loop, delay locked loop, phase interpolator, digital to time converter, and high-speed clock distribution employing CMOS full supply swing signaling are reviewed in this paper. Compared with current mode low swing designs, CMOS full-swing level signaling circuits consume much lower supply current, achieve smaller layout area and are more friendly toward continued process scaling. These techniques have been used in low power PLLs, DLLs, PIs, DTC design with 244 fs resolution, and in a 28 GHz clock distribution scheme that enabled a 224 Gb/s Serdes PHY in 10 nm FinFET process technology.