Design Techniques for CMOS Wireline NRZ Receivers Up To 56 Gb/s

Wireline receivers continue to target higher data rates, posing great challenges at circuit and architecture levels. Governed by tradeoffs among speed, power consumption, and channel loss (CL), receiver designs can benefit from new methods that push the performance envelope. This paper presents a number of techniques that allow non-return-to-zero data rates as high as 40 and 56 Gb/s in 45-nm and 28-nm CMOS technologies, respectively. The prototypes operate with a CL of 19-25 dB and a bit error rate of less than 10^−12.


I. INTRODUCTION
THE GROWING demand for greater throughput rates in data centers and edge computing presents significant challenges to physical layer designers. Wireline transceivers have been under intense development [1], [2], [3], [4], [5], [6], [7], [8], [9], targeting speeds as high as 224 Gb/s. This trend is also accompanied by issues regarding the power consumption, both in absolute value (which dictates packaging and heat removal costs) and as the amount of energy per bit (which determines the efficiency of serialization and hence the number of lanes).
This paper serves as a companion to [10] and describes receiver (RX) design techniques that can improve the achievable data rate while saving power. The ideas are presented in the context of 40-Gb/s [11] and 56-Gb/s [12] receivers operating with non-return-to-zero (NRZ) data. Realized in 45-nm and 28-nm technologies, respectively, the designs demonstrate concepts that can lead to higher speeds in more advanced process nodes.

II. GENERAL CONSIDERATIONS
A. CHANNEL CHARACTERIZATION
The design of a wireline receiver is dictated by the properties of the channel that precedes it. Imperfections, such as loss and impedance discontinuities, "distort" the data as it travels through the channel, requiring that the RX provide sufficient compensation for successful data recovery.¹ We must therefore employ a reasonably realistic channel model in our RX design efforts.
A given channel can be modeled by an electromagnetic field simulator or a network analyzer, with the results typically expressed as S-parameters. In transceiver design, however, we prefer a scalable model so that the link behavior can be assessed for different amounts of loss. The scalability proves especially critical to the design of RX building blocks as it reveals the limits of their performance.
Copper media, such as printed-circuit-board traces, suffer from three nonidealities: 1) loss due to skin effect; 2) loss due to the dielectric underneath or surrounding the signal line; and 3) impedance discontinuities arising from connectors and line cards. The first two require a frequency-dependent model, as exemplified by the section shown in Fig. 1(a) [13]. Obtained empirically from simulations of 50-Ω traces on FR4 boards, this scalable representation accounts for skin effect by R_2 and L_2 (at high frequencies, the series resistance rises from R_1||R_2 to R_1) and for dielectric loss by R_3 and R_4. As an example, [13] reports the following values for a section corresponding to a 1-in trace: L_1 = 77.25 nH, C_1 = 30.9 pF (such that the characteristic impedance, Z_0 = √(L_1/C_1) = 50 Ω), R_1 = 5.55 Ω, R_2 = 150 mΩ, L_2 = 468.9 pH, R_3 = 2 kΩ, C_3 = 200 fF, R_4 = 100 Ω, and C_4 = 80 fF. The trace simulations in [13] suggest a reasonable agreement with this model. Additional RL and RC branches can be included so as to refine the model. Fig. 1(b) plots the magnitude response of a channel consisting of 12 such sections, displaying a loss of 21 dB at 28 GHz. In this paper, the term "loss" will refer to that at the Nyquist frequency.
1. The transmitter also offers a modest amount of compensation for the channel.
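The element values quoted above can be checked numerically. The following is a minimal sketch, assuming that the skin-effect branch of Fig. 1(a) places R_1 in parallel with the series combination of R_2 and L_2 (a topology inferred from the stated low- and high-frequency limits, not taken from the figure). The function names are ours.

```python
import math

# Per-section element values quoted in the text for a 1-in trace [13].
L1 = 77.25e-9    # series inductance, H
C1 = 30.9e-12    # shunt capacitance, F
R1 = 5.55        # ohm
R2 = 0.150       # ohm
L2 = 468.9e-12   # H


def characteristic_impedance(l_series, c_shunt):
    """Lossless approximation Z0 = sqrt(L/C)."""
    return math.sqrt(l_series / c_shunt)


def series_resistance(freq_hz):
    """Real part of R1 || (R2 + j*2*pi*f*L2), the assumed skin-effect branch."""
    z_branch = complex(R2, 2 * math.pi * freq_hz * L2)
    z_parallel = R1 * z_branch / (R1 + z_branch)
    return z_parallel.real


print(f"Z0 = {characteristic_impedance(L1, C1):.2f} ohm")
for f in (0.0, 1e8, 1e9, 28e9):
    print(f"f = {f:9.2e} Hz  series resistance = {series_resistance(f):.3f} ohm")
```

Under this assumption, the series resistance starts at R_1||R_2 at dc and approaches R_1 once the reactance of L_2 dominates, which is the behavior described in the text.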
We also wish to study the effect of impedance discontinuities on the link performance. We observe that such a nonideality can lead to deep notches in the channel frequency response. As an example, consider the scenario depicted in Fig. 2(a), where Z_p denotes a parasitic impedance at some point along the channel, e.g., at a connector, but the link is otherwise ideal. Since the impedance seen to the right of node X is equal to Z_0, we note that Z_p||Z_0 is transformed by the transmission line on the left to create Z_in. The impedance rotation by a length of L_1 can move Z_p||Z_0 to a high Z_in, thus lowering the power delivered by V_in to the line and causing a notch in the frequency response [Fig. 2(b)]. In the time domain, the data experiences reflection at node X, a benign effect if R_S = Z_0. In other words, even though the reflection is absorbed on the TX side, the removal of the signal energy by the discontinuity still demands compensation.
The frequency-domain view of the channel proves useful for the design of circuits such as continuous-time linear equalizers (CTLEs). For discrete-time structures, on the other hand, a time-domain perspective becomes necessary. For example, decision-feedback equalizers (DFEs) are designed according to the impulse response of the channel. Plotted in Fig. 3 is such a response, where T_B denotes the unit interval (UI), i.e., the bit or symbol period. The precursor at −T_B and the postcursors at T_B, 2T_B, etc., introduce intersymbol interference (ISI).

B. RECEIVER ARCHITECTURES
In the past decade, two general RX architectures have become common [1], [2], [3], [4], [5], [6], [7], [8], [9]. In "analog" receivers, equalization and clock and data recovery (CDR) occur in the analog domain. Fig. 4(a) illustrates this approach, which is better suited to NRZ data. A CTLE provides some high-frequency boost so as to partially compensate for the channel, and the result is applied to a DFE for further equalization. In addition, a CDR circuit senses the data and generates a clock with proper frequency and phase values for driving the DFE and the data demultiplexer (DMUX). Even though this architecture incorporates latches in the DFE, the CDR, and the DMUX, it is still considered an analog solution as most of its building blocks are crafted by analog designers.
The second architecture employs an analog-to-digital converter (ADC) and delegates some of the functions to the digital domain [Fig. 4(b)]. Called "ADC-based" receivers, such systems are suited to PAM4 data, especially for channel losses (CLs) greater than 20 dB. They do incorporate a CTLE in the front end so as to provide a boost of 10-20 dB, thus relaxing the ADC resolution to some extent. The ADC output drives a digital processor performing equalization and data detection. The result also drives a CDR loop containing a phase detector (PD), a digitally controlled oscillator (DCO), and a phase interpolator (PI), which delivers the ADC's sampling clock(s). This RX architecture consumes substantial power in the ADC, the digital processor, and the clock generation and distribution network.
This paper focuses on analog NRZ receivers. For extra-short-reach or medium-reach links (with a CL of less than 20 dB), the analog architecture draws markedly less power, an important advantage because a given system contains many more such links than long-reach channels.

C. CHOICE OF CIRCUIT TOPOLOGIES
The analog and mixed-signal processing required in high-speed receivers can be realized by means of current-mode differential and regenerative pairs, but at the cost of significant static power consumption. For most of the operations beyond the CTLE, it is possible to employ "charge steering" [14].
Depicted in Fig. 5(a), a basic charge-steering differential stage replaces the tail current source with a "charge source" consisting of C_T, S_1, and S_2, and also the load resistors with precharge switches S_3 and S_4. The output nodes are first tied to V_DD while C_T is discharged. Next, X and Y are released, and C_T switches into the tail node. The charge then flows from M_1, M_2, and their drain capacitances, amplifying the input and ceasing when V_P reaches about one threshold below the input common-mode (CM) level. The circuit can serve as an amplifier and/or a latch. A key difference between charge-steering and integrating stages, e.g., that shown in Fig. 5(b), is that, by design, the former does not allow V_X and V_Y, and hence V_X − V_Y, to collapse to zero whereas the latter does. Thus, the timing margins are more relaxed for charge steering. Moreover, this style can operate across a much wider speed range with no adjustment. Charge steering has been used in a multitude of RX and TX designs to save power [11], [12], [13], [14], [15].

D. LINEARITY REQUIREMENTS
The generation of NRZ data in transmitters does not dictate any linearity for their front end unless feedforward equalization is used. In NRZ receivers, on the other hand, some linearity is necessary before the data is sliced by the DFE because channel properties manifest themselves in the received signal amplitude. This issue proves important because we wish to amplify the input so as to maximize the eye height but must also be mindful of nonlinearity.
We investigate this point by considering the simple model shown in Fig. 6(a), where the RX front end is represented by a constant gain, k, and a static nonlinear stage [16]. Let us examine the impulse response of the entire chain, noting that h_in(t) is that of the channel, which is then amplified by a factor of k. The result is subjected to compressive nonlinearity and exhibits a main cursor equal to h_0 and a first postcursor equal to h_1. In other words, nonlinearity equivalently raises the normalized postcursor level.
The nonlinearity is modeled by y = α_1 x + α_3 x^3 and thus exhibits an input 1-dB compression point of A_1dB = √(0.145|α_1/α_3|). The response in Fig. 6(b) contains a main cursor equal to β_m A_1dB and a first postcursor given by β_1 A_1dB. It can be shown [16] that the output ratio h_1/h_0 obeys (1). As the front-end gain and hence β_m increases, h_1/h_0 exceeds the input ratio, β_1/β_m. According to the findings in [16], this effect manifests itself if β_m reaches 1.5A_1dB.
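The trend described above can be illustrated with a short numerical sketch. It assumes that the memoryless cubic acts on each cursor sample independently and uses the standard relation A_1dB = √(0.145|α_1/α_3|) to set α_3. This is a simplification of our own and not necessarily the expression derived in [16]. The function names and the cursor values are hypothetical.

```python
def alpha3_for_a1db(alpha1, a_1db):
    """Compressive cubic coefficient that yields an input 1-dB point of a_1db."""
    return -0.145 * alpha1 / a_1db ** 2


def compress(x, alpha1, alpha3):
    """Static nonlinearity y = alpha1*x + alpha3*x^3."""
    return alpha1 * x + alpha3 * x ** 3


def postcursor_ratio(beta_m, beta_1, a_1db=1.0, alpha1=1.0):
    """h1/h0 when the cursors beta_m*A_1dB and beta_1*A_1dB pass through the cubic."""
    alpha3 = alpha3_for_a1db(alpha1, a_1db)
    h0 = compress(beta_m * a_1db, alpha1, alpha3)
    h1 = compress(beta_1 * a_1db, alpha1, alpha3)
    return h1 / h0


# Keep the input ratio beta_1/beta_m fixed at 0.2 and raise the main cursor.
for beta_m in (0.1, 0.5, 1.0, 1.5):
    beta_1 = 0.2 * beta_m
    print(f"beta_m = {beta_m:.1f}  h1/h0 = {postcursor_ratio(beta_m, beta_1):.3f}")
```

In this simplified picture, the main cursor is compressed more than the postcursor, so the normalized postcursor grows with the front-end gain even though the input ratio is unchanged.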

E. CHOICE OF CLOCK RATE
The simplest, most compact receivers operate with a full-rate clock, i.e., one whose frequency is equal to the input data rate. However, the generation and distribution of clocks at high speeds present formidable challenges. For this reason, we opt for half-rate or quarter-rate architectures, at the cost of doubling or quadrupling the hardware, respectively. An immediate consequence is that the CTLE in Fig. 4(a) now sees a greater load capacitance. As a compromise, we select half-rate clocking in the front end.
Half-rate clocking also becomes a natural choice in transceivers where the TX employs such a clock for its last multiplexer stage and the RX utilizes this clock along with phase interpolation to implement the CDR loop.

III. CTLE DESIGN
The CTLE in Fig. 4(a) must provide a high boost factor so as to 1) increase the eye opening at the DFE summing junction and 2) deliver a sufficient swing to the CDR, thus ensuring an adequate PD gain, lock range, and loop bandwidth (BW). We begin with the basic stage shown in Fig. 7(a), and note that the output pole, ω_0, should preferably lie above ω_p [Fig. 7(b)], allowing the circuit to provide its maximum boost factor, A_2/A_1 = 1 + g_m R_S/2. In fact, ω_0 must exceed approximately 2.5ω_p [17], a daunting challenge at high speeds that dictates the use of inductive peaking.
The design of the basic CTLE stage entails a tradeoff between the low-frequency gain, g_m R_D/(1 + g_m R_S/2) (also called the "dc" gain), and the boost factor. For the output eye depicted in Fig. 7(c), a greater R_S reduces the outer height, H_1, while raising the inner height, H_2. An optimum can therefore be achieved for the latter as dictated by the channel. We typically target a low-frequency gain of around 0 dB, thereby facing a boost factor bound of about 6 dB per stage due to the limited voltage headroom.
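The tradeoff between the "dc" gain and the boost factor can be seen in a small sketch. It assumes the usual first-order response of an RC-degenerated differential pair, H(s) = g_m R_D (1 + R_S C_S s)/(1 + g_m R_S/2 + R_S C_S s), with the output pole neglected. The component values are hypothetical and chosen only so that g_m R_S/2 = 1 and g_m R_D = 2.

```python
import math


def ctle_gain(freq_hz, gm, r_d, r_s, c_s):
    """|H(j*2*pi*f)| of an RC-degenerated pair with the output pole ignored."""
    s = 1j * 2 * math.pi * freq_hz
    return abs(gm * r_d * (1 + s * r_s * c_s) / (1 + gm * r_s / 2 + s * r_s * c_s))


def to_db(x):
    return 20 * math.log10(x)


# Hypothetical values: gm*R_S/2 = 1 and gm*R_D = 2.
GM, R_D, R_S, C_S = 20e-3, 100.0, 100.0, 100e-15

dc_gain_db = to_db(ctle_gain(0.0, GM, R_D, R_S, C_S))
hf_gain_db = to_db(ctle_gain(1e15, GM, R_D, R_S, C_S))  # far above the degeneration pole
print(f"low-frequency gain = {dc_gain_db:.2f} dB")
print(f"boost factor       = {hf_gain_db - dc_gain_db:.2f} dB")
print(f"1 + gm*R_S/2       = {to_db(1 + GM * R_S / 2):.2f} dB")
```

With these values, the stage has a low-frequency gain of 0 dB and a boost factor equal to 1 + g_m R_S/2, i.e., about 6 dB, matching the per-stage bound mentioned above.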
For higher boost factors, we cascade multiple CTLE stages, bearing in mind the proportional rise in the power consumption and the reduction in the bandwidth. For n identical stages, we have [18]

BW_{tot} = BW_0 (2^{1/n} − 1)^{1/m}    (2)

where BW_0 denotes the bandwidth of one stage and m = 4 for second-order stages. A cascade of two thus suffers from a 20% bandwidth shrinkage, i.e., ω_0 in Fig. 7(b) falls by this amount. For these reasons, typical front-end designs, comprising a CTLE and possibly a variable-gain amplifier, contain no more than three stages.
The boost factor limitations outlined above call for additional high-frequency equalization techniques. We propose the concept of "feedforward" in this regard [12]. Illustrated in Fig. 8(a), the idea is to create a high-pass branch that contributes boost with negligible voltage headroom consumption. Transistors M_3 and M_4 and inductors L_1 and L_2 play such a role. The overall response is quantified as (3), where L_1 = L_2 = L_D and the capacitances at the drains are neglected for now. The second term on the right-hand side represents the zero created by feedforward. At high frequencies, source degeneration in Fig. 8(a) vanishes and the right-hand side of (3) approaches (g_m1,2 + g_m3,4)L_D s. This implies that feedforward raises the apparent value of L_D and could be simply avoided by making L_D larger. The key point, however, is that C_L constrains the value of L_D if the output pole must lie above the Nyquist frequency. Thus, feedforward provides greater flexibility in shaping the frequency response.
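Returning to the bandwidth of cascaded stages, the 20% shrinkage quoted above can be reproduced with a few lines. The sketch assumes the common cascade relation BW_tot = BW_0 (2^(1/n) − 1)^(1/m), with m = 4 for second-order stages. The function name is ours.

```python
def cascade_bandwidth_ratio(n, m):
    """BW_tot / BW_0 for n identical stages; m = 4 for second-order stages."""
    return (2.0 ** (1.0 / n) - 1.0) ** (1.0 / m)


for n in (1, 2, 3):
    ratio = cascade_bandwidth_ratio(n, m=4)
    print(f"n = {n}: BW_tot/BW_0 = {ratio:.3f} ({(1 - ratio) * 100:.0f}% shrinkage)")
```

For n = 2 the ratio is about 0.80, consistent with the 20% figure, and a third stage reduces the bandwidth further, which is one reason the stage count is kept small.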
We now consider the capacitances at the drains in Fig. 8(a) and sketch the responses created by the two paths. As shown in Fig. 8(b), the feedforward path is designed such that it dominates as the main path's response reaches a plateau at ω_p1. The feedforward path should take over for ω > ω_p1 = (1 + g_m1,2 R_S/2)/(R_S C_S); i.e., we must have g_m3,4 L_D ω_p1 < g_m1,2 R_D and hence

g_{m3,4} < g_{m1,2} R_D / (L_D ω_{p1})    (4)

The advantages of feedforward become more pronounced if it is applied to both stages of a CTLE. As illustrated in Fig. 9, we exploit all three possible feedforward paths. The stage consisting of G_m1, G_mf1, and its RL load is identical to the circuit shown in Fig. 8(a), and so is the stage formed by G_m2, G_mf2, and its RL load. The values of G_mf1 and G_mf2 follow (4). The path consisting of G_mf3 and L_D2 manifests itself as the rest of the circuit approaches a flat response.
The performance of CTLEs must be studied in both frequency and time domains. Owing to the significant effect of layout parasitics, we report simulation results only for extracted circuits. The inductors are modeled by RLC networks obtained from Cadence's EMX tool. We also include the input capacitances of the stages fed by the CTLE, namely, the CDR and the DFE. In the frequency domain, we perform two tests and study 1) the stand-alone CTLE and 2) the channel-CTLE cascade. Fig. 10(a) plots the proposed CTLE response as feedforward paths are added to the circuit. We observe that feedforward increases the boost factor by about 7 dB but it also lowers the corresponding frequency. Whether or not this result is acceptable is determined by additional tests. As depicted in Fig. 10(b), we cascade the channel profile of Fig. 1(b) with the CTLE. Notably, the overall response becomes flatter as feedforward branches are inserted, but the 3-dB bandwidth decreases to some extent. The ultimate test examines the eye diagram at the summing junction of the DFE with and without these branches. As explained in Section V-C, the three paths increase the eye height from 55 to 160 mV and the eye width from 18.5 to 20.5 ps.

IV. DISCRETE-TIME LINEAR EQUALIZATION
The notion of boosting high-frequency components can be pursued in the time domain as well. As illustrated in Fig. 11(a), a pulse experiencing the channel's loss is broadened and introduces ISI at t = T_B = 1 UI. If a copy of this pulse is shifted by 1 UI, scaled by a factor of α, negated, and added to the original, it partially cancels the tail of the broadened pulse. Thus, p(t) − αp(t − T_B) produces less ISI. Implementing the operation as shown in Fig. 11(b), we write Y = (1 − αz^−1)X and recognize that this "feedforward equalizer" (FFE) yields

Y = (1 − α)X + α(1 − z^{−1})X    (5)

where 0 < α < 1. The input is therefore subjected to two effects.
1) It is scaled by a factor of 1 − α, suffering from attenuation and displaying smaller low-frequency swings. This can be seen by applying a long sequence of ONEs and noting that they settle to a smaller amplitude [Fig. 11(c)].
2) The input is differentiated and scaled by a factor of α, thereby benefiting from high-frequency amplification. The boost factor is equal to (1 + α)/(1 − α).
A greater α translates to both a higher "dc" loss and a larger boost factor. In TX design, the unit delays necessary for FFE are readily realized by flipflops as the NRZ data can be processed nonlinearly before the final summation point. FFE can also be formed in the analog domain in receivers. We call such a circuit a "discrete-time linear equalizer" (DTLE) [11]. Unlike TX FFEs, however, RX DTLEs process dispersed data and must provide some linearity so as to preserve the channel profile information. That is, they cannot rely on flipflops. Depicted in Fig. 12(a) is a DTLE example where the 1-UI delay is formed by a two-stage passive sampler. If C_A ≫ C_B, the circuit delays x(t) and scales it by a factor of α, but α itself can be realized by ratioing C_B with respect to C_A.
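The two effects of the FFE relation Y = (1 − αz^−1)X, namely the "dc" loss of 1 − α and the boost factor of (1 + α)/(1 − α), can be verified with a minimal sketch. The value α = 0.25 and the function names are chosen only for illustration.

```python
import cmath
import math


def ffe_response(alpha, f_norm):
    """H = 1 - alpha*z^-1 at z = exp(j*2*pi*f_norm), where f_norm = f*T_B."""
    z_inv = cmath.exp(-2j * math.pi * f_norm)
    return 1 - alpha * z_inv


def boost_factor(alpha):
    """Ratio of the gain at the Nyquist frequency (f*T_B = 0.5) to the gain at dc."""
    return abs(ffe_response(alpha, 0.5)) / abs(ffe_response(alpha, 0.0))


def ffe_filter(samples, alpha):
    """Time-domain form y[n] = x[n] - alpha*x[n-1], with x[-1] = 0."""
    out = []
    prev = 0.0
    for x in samples:
        out.append(x - alpha * prev)
        prev = x
    return out


ALPHA = 0.25  # illustrative tap weight
print("dc gain      :", abs(ffe_response(ALPHA, 0.0)))
print("Nyquist gain :", abs(ffe_response(ALPHA, 0.5)))
print("boost factor :", boost_factor(ALPHA))
print("run of ONEs  :", ffe_filter([1.0, 1.0, 1.0, 1.0], ALPHA))
```

The run of ONEs settles to 1 − α after the first bit, which is the smaller low-frequency swing noted in item 1), while the first bit retains its full amplitude.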
As explained in Section II-E, we prefer half-rate operation so as to ease the generation and distribution of clocks. This points to the topology shown in Fig. 12(b) [11], where both DMUX_1 and the DTLE are driven by a half-rate clock, CK_1/2. The odd and even data produced by DMUX_1 are delayed by 1 UI, scaled, and injected into the DFE's summing junctions.
We make two remarks. First, DMUX_1 in Fig. 12(b) must perform sampling and can thus be merged with the first stage of the DTLE. This leads to the implementation shown in Fig. 12(c), where two-stage sampling is performed in the odd path by S_3, S_4, and the charge-steering stage, M_1-M_2, which injects the result into the DFE summing junction. In addition, the charge-steering regenerative pair consisting of M_3 and M_4 provides a gain of 6 dB. The nonlinearity introduced by this pair is studied in [11].

Second, the DTLE transfer function emerges as (6). That is, if C_B is not much less than C_A, then the circuit also displays an infinite impulse response (IIR) tap equal to C_B/(C_A + C_B).
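The origin of the IIR tap can be illustrated with an idealized charge-sharing model. The sketch below assumes ideal switches, a first capacitor C_A that samples the input, and a second capacitor C_B that retains its previous voltage before sharing charge with C_A one cycle later. This model and the function name are our own assumptions and are not taken from Fig. 12.

```python
def two_stage_sampler(x, c_a, c_b):
    """Idealized two-stage passive sampler with charge sharing between C_A and C_B."""
    total = c_a + c_b
    v_a = 0.0  # voltage held on C_A from the previous sampling phase
    v_b = 0.0  # voltage retained on C_B
    y = []
    for sample in x:
        # Second stage: C_A (holding the previous input) shares charge with C_B.
        v_b = (c_a * v_a + c_b * v_b) / total
        y.append(v_b)
        # First stage: C_A samples the current input for use in the next cycle.
        v_a = sample
    return y


# Impulse response for C_A = 3*C_B (arbitrary units).
impulse = [1.0, 0.0, 0.0, 0.0]
h = two_stage_sampler(impulse, c_a=3.0, c_b=1.0)
print(h)
print("tail ratio:", h[2] / h[1])
```

In this model, the delayed sample is scaled by C_A/(C_A + C_B) and each subsequent sample of the tail decays by C_B/(C_A + C_B), i.e., the IIR tap mentioned above. The tail becomes negligible only if C_B is much smaller than C_A.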

V. DFE DESIGN
DFE architectures have been studied extensively. For most, the loop around the first tap must "close" in 1 UI regardless of the clock rate/data rate ratio. (In "unrolled" or "speculative" topologies, a loop consisting of a multiplexer still dictates a 1-UI timing budget [19].)

A. EYE OPENING CONSIDERATIONS
The eye height observed at the DFE summing junction must be large enough to satisfy the target bit error rate (BER), e.g., 10^−12. As shown in Fig. 13, five imperfections must be discounted from this height. These include 1) V_OS1: the CTLE and summer dc offsets; 2) V_OS2: the flipflop (FF) input-referred offset; 3) V_n1: the CTLE and summer noise; 4) V_n2: the FF input-referred noise; and 5) V_sen: the FF sensitivity. We define V_sen as the input difference that allows the FF output to reach roughly 90% of its full swing in 1 UI [19] so that the first tap, h_1, completely switches.² If an eye monitor is available in the system, then V_OS1 and V_OS2 can be canceled. The BER is expressed as

BER = Q\left( \frac{V_{pp}/2 − V_{OS} − V_{sen}}{\sqrt{\overline{V_n^2}}} \right)    (7)

where Q denotes the Gaussian tail (Q) function, V_pp denotes the differential eye opening, V_OS denotes the total offset (with or without cancellation), and \overline{V_n^2} denotes the total mean-square noise referred to the summing junction. An error rate of 10^−12 demands that the argument of the Q function exceed 7.
In the absence of an eye monitor, V_OS in (7) must remain sufficiently small by proper design. For example, with an eye opening of 200 mV_pp and a total noise of 5 mV_rms, the offset must not exceed 65 mV (if V_sen is neglected). In practice, we would confine the 3σ offset to about 30 mV to leave a margin for the sensitivity and other imperfections.
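The 65-mV figure can be reproduced with a short calculation. The sketch assumes a BER of the form Q((V_pp/2 − V_OS − V_sen)/V_n,rms), which is consistent with the numbers quoted above, and evaluates the Gaussian tail with the complementary error function. The function names are ours.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = 0.5*erfc(x/sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def max_offset(v_pp, v_n_rms, q_arg=7.0, v_sen=0.0):
    """Largest offset (V) that keeps the Q-function argument at q_arg."""
    return v_pp / 2.0 - v_sen - q_arg * v_n_rms


print(f"Q(7) = {q_function(7.0):.2e}")
print(f"maximum offset = {max_offset(0.200, 0.005) * 1e3:.1f} mV")
```

With a 200-mV_pp eye and 5 mV_rms of noise, the offset budget is 100 mV − 7 × 5 mV = 65 mV. Any nonzero V_sen reduces this budget directly, which is why a smaller offset target is adopted in practice.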
The horizontal eye opening determines how much clock jitter and phase offset the equalizer can tolerate. The acceptable eye width depends, to some extent, upon the height: the greater the latter, the more the clock phase can depart from the center. This relationship is formulated in [13].

B. PROPOSED DFE TOPOLOGIES
A number of circuit and architecture techniques can improve the performance of high-speed DFEs. We begin by applying the concept of charge steering to summation and latching in a half-rate/quarter-rate environment. Consider the topology shown in Fig. 14(a), where half-rate data streams D_odd and D_even drive the summers and 2-to-1 DMUXs. The quarter-rate outputs of each DMUX are then multiplexed, scaled, and subtracted from the input data in the other path. The DMUX and MUX stages utilize the quadrature phases of the quarter-rate clock, generated by a ÷2 circuit that receives the half-rate clock. Illustrated in Fig. 14(b), the circuit implementation employs charge-steering differential pairs for the summer, the latch, and the MUX/tap 1 combination [16]. Moreover, the summer exploits RC degeneration so as to provide a few dB of boost.
2. Additionally, the FF kickback noise and hysteresis become problematic in some implementations.
A remarkable attribute of this architecture is its relaxed first-tap timing budget. In a conventional loop, we must have t_CK−Q + t_MUX + t_sum + t_setup < 1 UI, where the four terms, respectively, denote the flipflop clock-to-Q delay, the MUX delay, the summing node delay, and the FF setup time. In the charge-steering realization, on the other hand, we have t_CK−Q < 1 UI, where t_CK−Q is the delay from CK_1 to the output of the latch [16]. This constraint does not include a setup time because, in contrast to continuous-time current-mode latches, here the input data need not propagate to the precharged drain nodes of the MUX before this stage is clocked.
It is possible to reach a similar timing budget by injecting the feedback signal into the output of the first latch in the FF [22]. But this is not possible in the half-rate architecture of Fig. 14(a).
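The two first-tap timing constraints can be compared numerically. In the sketch below, the delay values are hypothetical placeholders (they are not measured or simulated values from this work) and serve only to show how the 1-UI budget at 56 Gb/s is consumed in each case.

```python
def unit_interval_ps(data_rate_gbps):
    """Bit period in picoseconds for a data rate given in Gb/s."""
    return 1e3 / data_rate_gbps


def conventional_loop_closes(t_ck_q, t_mux, t_sum, t_setup, ui):
    """Conventional first-tap loop: t_CK-Q + t_MUX + t_sum + t_setup < 1 UI."""
    return t_ck_q + t_mux + t_sum + t_setup < ui


def charge_steering_loop_closes(t_ck_q, ui):
    """Charge-steering first-tap loop: t_CK-Q < 1 UI."""
    return t_ck_q < ui


ui = unit_interval_ps(56.0)
# Hypothetical delays in ps, for illustration only.
delays = {"t_ck_q": 9.0, "t_mux": 4.0, "t_sum": 4.0, "t_setup": 3.0}

print(f"1 UI at 56 Gb/s = {ui:.2f} ps")
print("conventional loop closes   :", conventional_loop_closes(ui=ui, **delays))
print("charge-steering loop closes:", charge_steering_loop_closes(delays["t_ck_q"], ui))
```

With these placeholder delays, the conventional loop needs 20 ps and fails the 17.86-ps budget, whereas the charge-steering constraint involves only the clock-to-output delay and is met.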
In addition to charge steering, we investigate greater interactions between the CTLE and the DFE to open the eye further. In contrast to conventional cascades, wherein the CTLE drives the DFE unilaterally and only at one port, we can envision some feedforward and feedback paths between the two [12]. Depicted in Fig. 15(a) is a full-rate example: we allow a high-pass feedforward branch, G(s), to inject the CTLE output into the summing junction. Furthermore, we create a high-pass feedback branch, H(s), that returns the slicer output to D_sum (Loop 2). If G(s) = αs and H(s) = βs, we have (8). The high-frequency boost thus imparted to D_in and D_out improves the performance, a point that can be verified in the time domain as well. From the waveforms shown in Fig. 15(b), we observe that α dD_in/dt and β dD_out/dt pulsate only on the data edges. Upon adding these derivatives to the summer output, we note that the rise and fall times are shortened. If two consecutive bits are the same, D_sum exhibits a kink due to β dD_out/dt (e.g., at t = t_3), a benign effect as the kink occurs at bit boundaries.
The proposed feedforward and feedback techniques readily lend themselves to circuit implementation. As shown in Fig. 16(a), dD_in/dt is available at node P within the CTLE and travels through G_m stages to reach the summing junctions. For dD_out/dt, we first multiplex the quarter-rate outputs of the latches so as to obtain full-rate data [Fig. 16(b)]. This topology can be viewed as a direct 4-to-1 MUX, except that it is driven by overlapping quadrature phases. It is shown that charge steering still delivers nonoverlapping charge packets to this output. We then inject the result into node P, granting L_D the task of differentiation. The strength of the injection, i.e., β, is defined by the amount of charge that each MUX branch draws.
VOLUME 3, 2023
The second DFE tap is accommodated by adding secondary latches to each quarter-rate arm, multiplexing their outputs, and injecting the results into each summing node and node P [Fig. 16(c)].
One may wonder how precisely one must control the timing alignment of the data that returns to node P in Fig. 16(c). In this work, no adjustment has been included as simulations reveal that this timing is no more critical than that of the main tap. If an eye monitor is present, one can adjust this path's delay for optimum performance.
In contrast to IIR DFEs [20], [21], the proposed method returns the shaped signal to the DFE input rather than to its summing junction. According to the foregoing analysis and simulations, this approach yields a greater eye opening.

C. REFINEMENTS
We incorporate additional circuit techniques to further improve the DFE's performance, striving to maximize the NRZ eye opening at its summing junctions. First, we modify the basic charge-steering latch of Fig. 5 as shown in Fig. 17(a), where a cascode pair, M_5-M_6, and two cross-coupled pairs, M_3-M_4 and M_7-M_8, boost the output voltage swings [11]. These transistors play the following roles: the first pair isolates X and Y from the large capacitance at P and Q, raising the voltage gain from V_in to these nodes; the second pair also increases this gain by means of regeneration; the third pair restores the high level at P or Q to V_DD, avoiding the CM drop observed in Fig. 5(a).
The second method relates to the DFE summing node itself. As illustrated in Fig. 17(b), we attach two cross-coupled pairs to this interface, thus increasing the eye height by 50% [12]. The continuous-time CM drop caused by I_1 at A and B is less than 20 mV in the 18-ps evaluation mode of the 56-Gb/s RX.
We quantify the improvements afforded by some of our proposed techniques for the 56-Gb/s RX in the presence of a CL of 25 dB. Fig. 18 illustrates the incremental improvements due to each concept. The eye width increases from 18.5 to 25 ps, and the eye height from 55 to 200 mV.

VI. 40-Gb/s AND 56-Gb/s RECEIVERS
The 40-Gb/s and 56-Gb/s NRZ RX examples reported here operate with a CL of 19-25 dB at the Nyquist frequency. The former's architecture is shown in Fig. 19 [11]. A single CTLE stage drives DMUX_1, the DTLE, and the DFE, which consists of two summers, latches L_1-L_8, and MUX_1-MUX_4. The retimed and demultiplexed return-to-zero (RZ) data is converted to NRZ as described in [14].
The CDR utilizes the signals processed by DMUX_1 and the DFE to reduce the number of latches that it requires [11]. Specifically, XOR_3 measures the phase difference between D_odd and D_even, while XOR_1 and XOR_2 generate a constant-width pulse on V_ref for each data transition. The resulting difference, V_err − V_ref, uniquely represents the phase error regardless of the data pattern.
The 56-Gb/s RX is depicted in Fig. 20 [12]. (For simplicity, the second DFE tap is not shown.) In this case, the higher speed is accommodated by driving the CDR from node Q in the CTLE so that C_CDR negligibly affects the signal path's bandwidth. The data presented to the CDR thus displays a high-pass spectrum, but it still allows locking [12].
The half-rate PD requires quadrature clocks at 28 GHz, a condition fulfilled by simply delaying the output of a differential LC oscillator by a self-biased inverter. It is shown that this stage's delay variability does not affect the PD gain significantly [12].

VII. EXPERIMENTAL RESULTS
This section presents the measured results for the 40-Gb/s and 56-Gb/s NRZ receivers. The prototypes have been mounted directly on printed-circuit boards and tested on a high-speed probe station. Unless otherwise stated, all measurements are carried out with a 1-V supply at the full data rate and with a pseudo-random bit sequence (PRBS) pattern of 2^7 − 1. Fig. 21 depicts a test setup example for characterizing receivers. A BER tester (BERT) generates NRZ data, which is then subjected to a lossy channel such as M8049A. The result drives the device under test (DUT) and the output is captured by an oscilloscope. The recovered clock is also monitored on a spectrum analyzer.

A. 40-Gb/s RX
Realized in TSMC's 45-nm technology, the 40-Gb/s RX die is shown in Fig. 22 and occupies an active area of about 110 μm × 175 μm. Another version accepting an external clock has also been fabricated to permit the characterization of the equalizer. We first employ a channel having the black profile shown in Fig. 23(a) and producing the eye in Fig. 23(b). We begin with the RX path measurements using an external clock. The output data at 10 Gb/s is depicted in Fig. 24(a) and the equalizer bathtub curve in Fig. 24(b). The horizontal eye opening is 0.28 UI. Part of the eye closure arises from the PRBS generator's 8-ps rms jitter. Also shown is the bathtub curve for an input data rate of 20 Gb/s and the gray loss profile in Fig. 23(a), demonstrating that charge-steering circuits can accommodate a wide range of frequencies.
The complete RX is characterized for jitter generation, transfer, and tolerance while it equalizes the dispersed data. The CDR bandwidth is set to 20 MHz unless otherwise stated. Fig. 25(a) and (b) plots the recovered clock spectrum and waveform, respectively. For phase noise measurements, the 20-GHz clock is divided by 2 off-chip, yielding the profile illustrated in Fig. 26. The integrated jitter amounts to 515 fs_rms from 100 Hz to 1 GHz. Fig. 27 shows the measured jitter transfer and tolerance for different CDR bandwidths. The latter improves as the BW increases, reaching 0.45 UI_pp at 5 MHz with 19 dB of CL. (The maximum jitter amplitude of 20 UI is dictated by the equipment.) Table 1 summarizes and compares the performance.

B. 56-Gb/s RX
This RX has been fabricated in TSMC's 28-nm technology. Fig. 29. To these losses at 28 GHz, we add 1.7 dB to account for the probes and the interconnects. We first report the RX performance while the CDR is disabled and an external 28-GHz clock is used. In this measurement, Keysight's M8040A BERT has the capability to emulate a 2-tap TX FFE in the data applied to the channel. Fig. 30 plots the bathtub curves for two cases: 1) for channel A, which has a loss of 25 dB, and no FFE and 2) for channel B, which has a loss of 30 dB, while the BERT implements an FFE function of the form −0.2 + 0.8z^−1. The horizontal eye openings are 0.4 and 0.33 UI, respectively.
We next present results with the CDR enabled. Shown in Fig. 31 are the outputs of channel A and the RX. The BER is less than 10^−12. Fig. 32 plots the recovered clock waveform and spectrum for a CDR noise-shaping bandwidth of 50 MHz. The phase noise profile of Fig. 33 reaches a 100-MHz offset, at which it is equal to −124.4 dBc/Hz. For greater offsets, we measure the phase noise directly from the spectrum, which falls to −128 dBc/Hz at 14-GHz offset. The integrated jitter from 100 Hz to 14 GHz amounts to 100 fs_rms. Fig. 32(b). The high-pass nature of the CDR input data leads to some peaking for low loss values, but it enables the CDR to achieve bandwidths as high as 25 MHz for a CL of 30 dB. Fig. 35 plots the measured CDR jitter tolerance for a loss of 25 dB, yielding a value of 1.1 UI_pp at 5 MHz and exceeding the CEI-56G-VSR mask. Table 2 summarizes and compares the performance.

VIII. CONCLUSION
High-speed wireline receivers present a multitude of challenges, especially for greater CLs. This paper describes methods that improve the performance of CTLEs and DFEs and proposes concepts such as discrete-time linear equalization and charge steering. Collectively, these techniques lead to 40-Gb/s and 56-Gb/s receivers with low power consumption.