Fast-Locking Burst-Mode Clock and Data Recovery for Parallel VCSEL-Based Optical Link Receivers

A burst-mode clock-and-data-recovery (CDR) system is presented for the quarter-rate receivers of a multi-channel vertical-cavity surface-emitting laser (VCSEL)-based non-return-to-zero (NRZ) optical link. It uses proxy timing recovery to achieve a fast turn-on time. The proxy timing recovery scheme takes advantage of the correlated data jitter across the parallel optical lanes typically deployed in a data-center. Rapid timing recovery of the burst-mode channel is enabled by incrementing/decrementing its phase rotator (PR) control code during idle periods using phase updates from an always-active channel in the link. This work also presents circuit-design techniques that reduce power dissipation during idle times while still enabling a fast turn-on time. Simulated in 65 nm CMOS technology, the proposed CDR consumes only 19.5 mW per channel while operating at 10 Gbps/ch and 0.58 mW in its idle-state. Simulation results are presented for the turn-on time with the proposed technique and compared against the turn-on time of a conventional receiver. The proposed technique allows the CDR to lock within 26 unit intervals (UIs) of power-on, even with a 1000 ppm frequency offset between the incoming data and the CDR's reference clock. The complete CDR of each channel occupies an area of 0.045 mm². The proposed scheme introduces only 1.3 % area overhead and 2.6 % on-state power overhead while reducing idle-time power dissipation by 97 %.


I. INTRODUCTION
The ever-increasing demand for data rate in data-centers drives the need for high-speed interconnects [1]. Consequently, electrical interconnects are being replaced by optical-fiber interconnects. The main advantage of an optical cable over its electrical counterpart is its reduced frequency-dependent loss. This leads to simpler channel equalization and reduced power dissipation.
Data-centers are using parallel optical links with progressively increasing per-lane data rates to meet overall throughput demands. However, the peak data rate is not required by all links all the time [2]. Indeed, links in data-centers are idle up to 90% of the time [3], but non-useful data packets are still sent to maintain synchronization. Useful data packets are transmitted intermittently in the form of bursts, yet links dissipate full power even when no useful data are being transmitted. This is shown graphically in Fig. 1 (a), where the flat grey area represents a constant power dissipation level and the hatched area represents the intervals in which useful data are transmitted.
Present-day data-centers contain thousands of servers and their interconnection network consumes around 26 % of the total data-center power [4]. As a result, there is a need to reduce the power dissipation of these interconnects when they are idle and reduce the associated cost of cooling data-centers. A promising way to reduce power dissipation is to employ a power-proportional burst-mode optical link, where the link is powered off during idle times and powered on with each burst of data [5]. The ideal scenario for the power dissipation of a power-proportional burst-mode link is shown in Fig. 1 (b) where power is dissipated only when useful data are being transmitted. The main challenges to make such a link attractive for use in data-centers are the reduction of off-state power and the turn-on time.
If a power-proportional burst-mode optical link is put into a sleep mode between data bursts, the receiver clock loses synchronization with the data. When the link is turned on for a burst of data, its clock starts at an arbitrary phase with respect to the data and takes considerable time, on the order of hundreds of unit intervals (UIs) [6], to lock to the appropriate phase for low bit-error ratio (BER) sampling.
A conceptual transient at the output of a burst-mode receiver is shown in Fig. 2 (a). Turning on a burst-mode receiver involves two tasks. First, the average value of the input signal must be acquired for use as a reference level for the receiver's decision circuit(s). This process is referred to as dc-level recovery. Second, the receiver's sampling clock must be aligned to capture the data through clock-and-data-recovery (CDR), or in this context, timing recovery. Both of these processes are lengthened if the receiver is turned off or put into an idle-state when no input data are received. Lower power during idle states is usually associated with longer turn-on time. Hence, there is a trade-off between the off-state power and the turn-on time, as illustrated graphically in Fig. 2 (b). This work is focused only on the timing recovery and its rapid activation from an idle-state.
Recent works on burst-mode solutions for optical networks, wireline electrical, and vertical-cavity surface-emitting laser (VCSEL)-based optical transceivers have been presented in [5]-[11]. In [6], power-hungry limiting amplifiers (LAs) are replaced with a variable-gain amplifier (VGA) to improve the energy efficiency during the active state. That work uses complicated control loops to correct the dc-offset current at the input of the TIA, and an exhaustive CDR search algorithm to reduce the overall turn-on time of the receiver, which remains on the order of hundreds of UIs. The work in [5] reduced the duration of timing recovery by using complex CDR functionality along with a link protocol. In [8], the settling of the dc-level and the CDR locking take place simultaneously to reduce the overall turn-on time. The CDR of [11] uses an area-intensive architecture that sweeps the oscillator phase after each idle period until it matches the phase of the data signal, resulting in a long recovery time. However, all of the solutions discussed above address the turn-on time of a single-channel burst-mode receiver. These approaches forgo opportunities arising from the parallel nature of many VCSEL-based links. This work leverages the correlation in timing among parallel channels, along with a commonly used CDR topology, to achieve the fastest CDR locking. Because it requires no additional complex circuitry, it has the minimum area and power overhead compared to the other published works. Since the design is mostly digital, its area and power are also attractive, and will improve with technology scaling.
The proposed work presents an energy-efficient burst-mode CDR for multi-channel parallel links for fast CDR lock time, as initially laid out in [12]. A typical phase interpolator-based CDR [13], as shown in Fig. 3, is used. It consists of a clock synthesizing block and a peripheral loop. The main circuit blocks of the peripheral loop are a phase detector, a loop filter, phase-rotator (PR) control logic, and a PR.
The work has two features: 1) low-power operation during the idle state is enabled by turning off all circuits in the CDR except the PR control logic block that increments/decrements the PR's control code; 2) a fast CDR lock time is achieved by updating the PR control code (in the PR control logic) during idle times using a phase update signal from an active parallel channel.
The remainder of the paper is organized as follows: Section II briefly highlights the background of the parallel link's properties, and Section III describes the architecture, implementation details, analysis of jitter tolerance and tolerable frequency offset of the proposed CDR. Simulation results for the fast turn-on time of the proposed technique are presented in Section IV. Section V concludes the work and includes a comparison with other published work.

II. BACKGROUND
To meet high aggregate throughput requirements in data-centers, parallel links are widely used [14], [15]. Each parallel link has a near-identical channel and transceiver. Transmitted data streams over parallel links are synchronous from lane to lane as they share the same transmitter-side clocking circuitry and a common transmitter-side reference clock. A possible mismatch exists in the path length among these channels. However, in serial links, the mismatch in length is not a problem in clock recovery since each lane has its own CDR. Fig. 4 shows the phase offset between the data at the receivers of two parallel channels relative to the receiver's reference clock in the presence of a parts per million (ppm) frequency offset between the data streams and the receiver's reference clock. The data streams have no ppm offset from one lane to another. One complete cycle around the circle represents $2\pi$, or one reference clock cycle. M denotes the number of data UIs needed to cover one complete cycle, which is inversely proportional to the ppm frequency offset. Due to the ppm offset, the phase offset for data on channels 1 and 2 increases between t = 0 and t = N UI as well as during successive steps of N UI. However, the phase offset between channels 1 and 2 remains small due to their shared Tx clock. In this figure, N is chosen arbitrarily to be M/5, giving 5 N-UI steps for the phase offset of the data channels to advance around the circle. The small phase offset between channels 1 and 2 will be exploited for fast recovery of the clock of a burst-mode receiver when it turns on.
The phase relationship of the data streams and their recovered clocks of an always-on channel and a burst-mode channel with respect to the receiver's reference clock is shown in Fig. 5 (a). $\phi_1$ is the phase difference between the receiver's reference clock and the recovered clock of the always-on channel at t = 0. $\phi_2$ is the phase difference between the receiver's reference clock and the recovered clock of the burst-mode channel at t = 0. In this case, the PR control code of the burst-mode channel is not updated while the channel is idle. The right-hand side of Fig. 5 (a) shows the phase relationships after an idle period of M/2 UI. The phase of the always-on channel's data and recovered clock has advanced by $\pi$, reaching $\phi_1 + \pi$. Since the burst-mode channel's PR control code was not updated, its phase remains at $\phi_2$, but is misaligned to the burst-mode channel's data, which has also advanced by $\pi$ during the idle period. Using the burst-mode channel's clock to sample the next burst will lead to a severe degradation in bit-error ratio. Allowing the CDR to lock using conventional loop dynamics will incur a time delay of possibly hundreds of UIs. The phase relationship is repeated in Fig. 5 (b), where the PR code of the burst-mode channel is updated between bursts using the proposed timing recovery scheme. Even though most of the burst-mode channel's CDR has been powered down between t = 0 and t = M/2 UI, its PR code has been updated using duplicate PR updates from the always-on channel. Therefore, it powers up with a phase offset suitable to capture data from the burst-mode channel.
The advantage of the proposed scheme is that the clock of the burst-mode receiver remains aligned with the data stream from the very first stable clock cycle. The phase detector output of the always-on channel is used as a proxy for phase updates during idle intervals. Hence, this method of timing recovery is referred to as "proxy" timing recovery. The "collaborative" timing recovery scheme used in [16] recovers a global clock from the simultaneous participation of all the channels in the link and is different from that proposed here.
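The benefit of duplicating the PR updates can be illustrated with a small behavioral model. The sketch below is not the authors' simulation: it assumes an idealized bang-bang phase decision on the always-on lane, identical data phase on both lanes (no skew or jitter), a PR resolution of 1/32 UI, and one PR update every 24 UIs (six cycles of a quarter-rate clock). The function name and parameters are hypothetical.

```python
STEP_UI = 1 / 32     # PR resolution: 32 phase steps per UI (from the text)
UPDATE_UI = 24       # assumption: one PR update every 6 quarter-rate cycles (6 x 4 UI)


def wake_up_phase_error(idle_ui, ppm, proxy):
    """Clock-to-data phase error (in UI) of the burst-mode lane after an idle period."""
    drift_per_ui = ppm * 1e-6
    data_phase = 0.0     # common to both lanes: shared Tx clock, skew not modelled
    code_always_on = 0   # PR code of the always-on lane, in phase steps
    code_burst = 0       # PR code of the burst-mode lane, in phase steps
    for ui in range(1, idle_ui + 1):
        data_phase += drift_per_ui
        if ui % UPDATE_UI == 0:
            error = data_phase - code_always_on * STEP_UI
            delta = 1 if error > 0 else -1   # idealized bang-bang UP/DN decision
            code_always_on += delta
            if proxy:
                code_burst += delta          # duplicate update sent to the idle lane
    return data_phase - code_burst * STEP_UI


frozen = wake_up_phase_error(idle_ui=400, ppm=1000, proxy=False)
proxied = wake_up_phase_error(idle_ui=400, ppm=1000, proxy=True)
print(f"frozen PR code: {frozen:.3f} UI, proxy-updated PR code: {proxied:.3f} UI")
```

With a 400 UI (40 ns at 10 Gbps) idle period and a 1000 ppm offset, the frozen code wakes up with the full accumulated drift, while the proxy-updated code stays within a couple of phase steps of the data.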

III. PROPOSED ARCHITECTURE AND DESIGN
Twelve or more parallel optical-fiber channels are frequently deployed in data-centers [17], where each channel has an independent CDR at the receiver side. A conceptual block diagram of the proposed twelve-channel burst-mode link is shown in Fig. 6, which, without the connection among CDRs, is similar to a conventional architecture presently deployed in data-centers. The proposed architecture is compatible with the approach in [17] so long as it is assumed that even during periods of low link activity, at least one channel is active, maintaining synchronization. The remaining channels are turned on or off independently depending on the per-channel workload. Each channel has two power states: i) idle-state and ii) on-state. An external half-rate reference clock (5 GHz) is provided for all the CDRs on the receiver side. During the idle-state, each burst-mode receiver receives a 1/4th-rate clock from the always-on channel to detect the beginning of a burst of data, and its PR control code is incremented/decremented using phase updates from the always-on channel. Since the preamble of a burst uses 1/5th-rate data, the 1/4th-rate sampling clock can have arbitrary phase. Although the always-on channel can be placed at any location, it is located in the middle to maximize the jitter correlation between the recovered clock on the always-on channel and the data on the burst-mode channels [16].
To investigate opportunities to reduce the turn-on time in parallel optical links, a three-channel receiver prototype is designed, as shown in Fig. 7. It consists of one always-on channel and two burst-mode channels; the latter operate with rapid on/off functionality to improve energy efficiency during idle times. All channels can operate at 10 Gbps. In the idle-state, most of the circuitry of the burst-mode channel's CDR is turned off to reduce power dissipation. Also, the envisioned front-end of the receiver is configured to have reduced power dissipation and bandwidth but equal gain during the idle-state. The proposed link protocol and the circuit details of the building blocks for the proxy timing-recovery CDR are described in the following subsections.

A. PROPOSED BURST-MODE LINK PROTOCOL
The proposed link protocol is shown in Fig. 8 (a). When configured in the idle-state, the front-end is a fully operational receiver, but with reduced bandwidth and power dissipation compared to when it operates in the on-state. The sensing of a data burst is marked by the detection of a few positive edges in an incoming 1/5th-rate bit stream. This is achieved by a sensing circuit that continuously samples the input with a 1/4th-rate clock available from the always-on channel. To power on a receiver from its idle-state, an alternating sequence of five "1s" and five "0s" (ON-BITS) starts the preamble sequence. These ON-BITS, captured by a 1/4th-rate decision circuit, produce positive edges that switch the receiver from idle-state to on-state. In this prototype, three positive edges confirm a data burst, thus avoiding false triggering by glitches arising from transients that might appear when the receiver is switched from its on-state to idle-state. Runs of five bits for each "1" and "0" guarantee three positive edges in the data sampled with the 1/4th-rate clock available from the always-on channel. The 1/4th-rate sampling clock from the always-on channel is aligned with the midpoint of the data on the always-on channel, but the incoming data on the burst-mode channel has an arbitrary phase skew with respect to the sampling clock of the always-on channel. This phase skew is due to the non-phase-matched clocking architecture used for the distribution of the recovered clock from the always-on channel (Fig. 7), and due to a mismatch in the path lengths of the channels, shown in Fig. 8 (b). The figure shows possible incoming data phases (i-iii) with respect to the clock from the always-on channel. This clock is used only to detect the incoming burst and is not used for sampling the rest of the burst of data on the burst-mode channel once it is activated. The corresponding sampled bits are shown in Fig. 8 (c), and the curved arrows denote positive transitions. The first "0" in each row of sampled data represents the last sampled bit during the idle-state.
Even if the first clock edge falls on the metastable position of the data, the circuit with the help of the proposed sequence is able to detect three positive edges. All possible combinations for this case are also presented (square box).
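The edge-count guarantee can be checked numerically. The following sketch samples the ON-BITS pattern every fourth UI and counts the positive edges for each sampling phase. It assumes that the ON-BITS last for three periods of the five-ones/five-zeros pattern (30 UIs), a length the text does not state, and it considers only integer-UI sampling phases, so the metastable case is not modelled. Function names are hypothetical.

```python
def on_bits(periods=3, run_length=5):
    """ON-BITS pattern: alternating runs of five 1s and five 0s (1/5th-rate data)."""
    return ([1] * run_length + [0] * run_length) * periods


def sampled_rising_edges(bits, phase, decimation=4, last_idle_sample=0):
    """Count 0->1 transitions seen by a 1/4th-rate sampler starting at `phase` UIs."""
    previous = last_idle_sample      # the link idles at 0 before the burst
    edges = 0
    for sample in bits[phase::decimation]:
        if sample == 1 and previous == 0:
            edges += 1
        previous = sample
    return edges


for phase in range(4):
    samples = on_bits()[phase::4]
    print(f"phase {phase} UI: samples {samples}, "
          f"rising edges {sampled_rising_edges(on_bits(), phase)}")
```

Because the data pattern repeats every 10 UIs while the sampler advances by 4 UIs, the sampled stream differs for each phase, yet under these assumptions every integer phase yields three positive edges.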
The rest of the preamble sequence (DC-PREAMBLE) is a repeated "1100" pattern of 24 UIs during which the output dc-level is recovered. The useful data follow the preamble sequence, and the link is fully operational. Since the PR control code was updated using phase updates from the always-on channel, the CDR in the burst-mode channel starts from the correct phase with respect to its incoming data burst.
At the end of a data burst, receiving 20 consecutive 0s brings the receiver into the idle-state.

B. SENSE CIRCUIT DESCRIPTION
The sense circuit (SENSE CKT), shown in Fig. 9, senses the incoming burst of data in 20 UIs and generates the ON-OFF signal to toggle the receiver between the idle-state and the on-state. When ON-OFF = 0, the receiver is in the idle-state. The recovered clock from the always-on channel (CLK1) passes to the latch L1, which samples the output of the front-end in anticipation of an incoming burst of data, ultimately producing three rising edges. The DFFs are clocked by the output of the latch. With the three rising edges, the outputs of the three DFFs are 1, resulting in ON-OFF = 1, and thus the clock to latch L1 is gated. Following this, the receiver enters the on-state. When the receiver needs to be placed in the idle-state, the "Off Detect" block (clocked by its own recovered clock, CLK2) generates a reset pulse (RST) after sensing twenty consecutive "0s", which resets the three DFFs inside the "Burst Detect" block and sets ON-OFF = 0.
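The ON-OFF behavior described above can be summarized as a two-state machine. The sketch below is a behavioral model only, not the gate-level circuit of Fig. 9: it counts rising edges while idle and consecutive zeros while on. The class and attribute names are hypothetical, and the difference between the 1/4th-rate idle sampling and the full-rate on-state sampling is abstracted into a single `sample` call.

```python
class SenseCircuit:
    """Behavioral model of ON-OFF generation (Burst Detect + Off Detect)."""

    def __init__(self, edges_to_wake=3, zeros_to_sleep=20):
        self.edges_to_wake = edges_to_wake
        self.zeros_to_sleep = zeros_to_sleep
        self.on = False       # ON-OFF signal: False = idle-state, True = on-state
        self._previous = 0
        self._edges = 0
        self._zeros = 0

    def sample(self, bit):
        """Feed one sampled bit and return ON-OFF after this sample."""
        if not self.on:
            if bit == 1 and self._previous == 0:
                self._edges += 1              # one more DFF output goes high
            if self._edges >= self.edges_to_wake:
                self.on = True
                self._zeros = 0
        else:
            self._zeros = self._zeros + 1 if bit == 0 else 0
            if self._zeros >= self.zeros_to_sleep:
                self.on = False               # RST pulse clears the edge count
                self._edges = 0
        self._previous = bit
        return self.on


sense = SenseCircuit()
print([sense.sample(b) for b in [1, 1, 0, 1, 0, 1]])   # wake-up on the third edge
print([sense.sample(0) for _ in range(20)][-2:])       # sleep on the 20th zero
```

In this model the receiver wakes on the sample that produces the third rising edge and returns to idle on the twentieth consecutive zero.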
The generation of the RST pulse is governed by the following equation where signals A, B, and Y are shown in Fig. 9: The "Off Detect" circuit in the proposed work has been optimized based on the available four data sampling latches for a quarter rate receiver.

C. CDR ARCHITECTURE AND DESCRIPTION
A phase interpolator (PI)-based CDR [13] is used in this work. The CDRs of the always-on channel and one burst-mode channel are shown in Fig. 10. The only difference in the CDR of the burst-mode channel is the selection circuits for the UP/DN and TRIG signals, denoted by the shaded region. CLK1 and CLK2 are the clocks generated by the CDRs of the always-on channel and the burst-mode channel, respectively. Early and Late signals are generated from data (D1-D4) and edge (E1-E4) samples by the Alexander phase detector (PD) and majority voting circuits. They are then processed by the loop filter to create an UP/DN signal (1=UP, 0=DN) and a triggering pulse (TRIG) for the PR control logic. The PR control logic produces a 64-bit code for the PR. The PR generates a differential clock signal, whose rotation covers the whole clock period, and drives the "8-phase generator" block. The outputs of the "8-phase generator" then serve as the clock for the data and edge sampling latches.
The two channels of the proposed architecture interact with each other in such a way that the UP/DN and TRIG signals from the always-on channel feed the PR control logic of the burst-mode channel through the selection multiplexers when the ON-OFF signal is low (idle-state). Also, the generated clock (CLK1) from the CDR of the always-on channel is supplied to the SENSE CKT of the burst-mode channel through a gating circuit. In the idle-state, the SENSE CKT continuously checks for the start of a data burst with the help of CLK1.
The circuit details and the functionality of the loop filter, consisting of a 4-bit bidirectional shift-register and a finite state machine (FSM), are shown in Fig. 11. The minimum output update rate of the loop filter is 6 clock cycles. The purpose of using a 4-bit shift register is to have a balance between jitter tolerance and false phase steps due to the overall delay of the CDR loop. CLK2 passes to the loop filter only when Early and Late signals are unequal, giving a gated clock, GCK [18]. The advantage of clock gating the loop filter is to reduce the dynamic power of the CDR. The TRIG pulse is generated only when both Q1 and Q4 are high.
The 2.5 GHz differential in-phase and quadrature clock signals, generated from a single 5 GHz reference clock (REF_CK) with the help of an I/Q generator, drive a 128-step CML PR. The 5 GHz REF_CK would be generated by a PLL shared across all twelve lanes and then distributed, as is commonly done in multi-lane interfaces [16]. The proposed scheme does not rely on precise matching of the REF_CK phase at each CDR since each lane has its own PR. The output of the PR goes to an 8-phase generator. The 45° spaced output clock phases then drive the high-speed (data sampling) latches.
The implemented CML PR is shown in Fig. 12 (a) with circuit-level details of one cell [19] and a total of 128 phase steps in one clock period (32 phase steps per UI). The low-complexity architecture has a conventional round constellation and interpolates between four quadrature clock phases (I+, I-, Q+, Q-) whose weights are determined by two current-steering DACs, controlled by the I-BITS and Q-BITS of the PR control logic. These bits, similar to [18], are shown in Fig. 12 (b). The PR control logic consists of two cascaded DACs with an inversion at the midpoint of the cascade. Each of the DACs consists of 32-bit bidirectional shift registers. The functionality of the PR control logic is graphically illustrated in Fig. 12 (b). The PR control code is zero at power on (or while reset), and the increment/decrement of the code by the UP/DN and TRIG signals from the loop filter is shown in one direction only, for UP/DN=1. Similarly, for UP/DN=0, the DACs will be updated in the opposite direction.
The 8-phase generator consists of three cascaded injection-locked oscillators (ILOs), shown in Fig. 12 (c). Each ILO is a four-stage, cross-coupled, pseudo-differential ring oscillator with four differential injection points. For the first ILO, the differential output of the PR is received at one differential injection point, and the other three injection points are disabled. The other two ILOs receive the four differential outputs of the preceding ILO as the injection signals for their four differential injection points. The four differential outputs of the third ILO form eight clock phases, for data and edge sampling, with 45° phase separation. The use of three cascaded ILOs provides better phase separation (each with 45°) compared to only two cascaded ILOs.
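The phase resolution implied by these numbers can be worked out directly. The short sketch below uses only values quoted in the text (a 2.5 GHz PR clock, 128 phase steps per clock period, 10 Gbps data); the function names are hypothetical, and the delay returned is the ideal, perfectly linear value, not the simulated PR characteristic.

```python
PR_CLOCK_HZ = 2.5e9   # I/Q clock frequency driving the PR (from the text)
PR_STEPS = 128        # phase steps per PR clock period (from the text)
BIT_RATE_BPS = 10e9   # per-channel data rate (from the text)


def pr_step_ps():
    """Ideal delay of one PR phase step, in picoseconds."""
    return 1e12 / PR_CLOCK_HZ / PR_STEPS


def steps_per_ui():
    """Number of ideal PR phase steps that span one unit interval."""
    ui_ps = 1e12 / BIT_RATE_BPS
    return ui_ps / pr_step_ps()


def ideal_delay_ps(code):
    """Ideal PR delay for a given step number; the code wraps after one period."""
    return (code % PR_STEPS) * pr_step_ps()


print(pr_step_ps(), steps_per_ui(), ideal_delay_ps(16))
```

One step corresponds to 3.125 ps, 32 steps span one 100 ps UI, and a 16-step change moves the clock by half a UI.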
The CDR of the always-on channel is in operation at all times with full power dissipation. However, during its idle state, the CDR of the burst-mode channel consumes a small fraction of its on-state power dissipation due to activity in the PR control logic, necessary to keep the PR code updated by the always-on channel. All other digital blocks in the CDR remain inactive and dissipate no power. The I/Q generator, the PR, and the 8-phase generator are turned off through the ON-OFF signal and dissipate only leakage power in the idle-state. All the switching in the proposed circuit is done simultaneously with the generated ON-OFF signal.

D. CDR LOOP DYNAMICS, STABILITY AND INTER-LANE TRACKING
The CDR used in this design consists of a clock synthesis circuit and a peripheral loop. The clock synthesis circuit includes the external reference clock and the I/Q generator, whereas the peripheral loop comprises the PD, loop filter, PR control logic, PR, and the ILOs. The CDR has a first-order loop (Fig. 13) and hence it is unconditionally stable. In Fig. 13, the value of Kp is 1. The phase resolution achieved by the CDR follows the expression in [20], in which the factor 6 represents the minimum number of clock cycles between UP/DN updates out of the loop filter. The jitter tolerance (JTOL), defined as the maximum sinusoidal jitter amplitude on the incoming data that the designed CDR can tolerate at a given jitter frequency, is given by (4). Clearly, the higher the jitter frequency, the lower the jitter amplitude that can be tolerated. Both (3) and (4) give theoretical values exceeding what is seen in a real implementation. A reasonable choice of the design parameters was made in this work, and the JTOL curve resulting from (4) is comparable with published work [21].
The use of phase detection from one channel in the updating of the PR code in another channel raises the question of whether the correlation of data jitter seen on adjacent channels leads to detrimental differential jitter, because of the difference in data path length and REF_CK path length between channels [22]. A delay mismatch due to combined data path and REF_CK path length differences between channels could exceed a few UIs over a 100 m interconnect. Considering clock jitter on the first channel given by [22]

$$J_1 = A \sin(2\pi f_j t), \tag{5}$$

where $A$ is the jitter amplitude and $f_j$ is the jitter frequency, the clock jitter on the second channel in the presence of a data and REF_CK path length mismatch is then given by

$$J_2 = A \sin\bigl(2\pi f_j (t + m T_b)\bigr), \tag{6}$$

where $m$ is the combined delay mismatch in number of UIs and $T_b$ is the bit period.
Therefore, the differential jitter between the two clocks is given by

$$\Delta J = J_2 - J_1 = 2A \sin(\pi f_j m T_b)\,\cos\bigl(2\pi f_j t + \pi f_j m T_b\bigr). \tag{7}$$

This differential jitter is irrelevant when both lanes operate with separate CDRs. However, when the PR control code on the burst-mode channel is updated based on the always-on channel, $J_1$ is imposed on the burst-mode channel's clock whereas its data has $J_2$. If $\Delta J$ exceeds the width of the bathtub curve, errors will occur when the burst-mode channel is activated using the always-on channel's PR updates. For a given jitter amplitude and jitter frequency, as the delay mismatch increases, the differential jitter increases and becomes out of phase, resulting in an increased BER.
The maximum jitter amplitude tolerated by a single lane, given by (4), introduces a timing error between the clocks of the always-on and burst-mode channels of less than 0.1 UI when the combined data and REF_CK path difference is up to 78 UIs. This path length mismatch is unlikely to occur in a link of 100 m or less. Therefore, updating the PR code of the burst-mode channel during its idle-state will provide a clock accurate to within 0.1 UI at the onset of a data burst when the jitter is correlated. The effect of uncorrelated jitter is considered in Section IV.
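The dependence of the inter-lane timing error on the path mismatch can be explored numerically. The sketch below assumes that both lanes carry the same sinusoidal jitter, delayed on the second lane by the mismatch, so that the peak of the difference follows from the sum-to-product identity. The function names and the example values (0.25 UI peak, i.e. 0.5 UIp-p, at 4.14 MHz) are illustrative choices, not the amplitude given by (4).

```python
import math

BIT_PERIOD_S = 100e-12   # 10 Gbps per channel (from the text)


def differential_jitter_peak_ui(amplitude_ui, jitter_hz, mismatch_ui):
    """Peak of J2 - J1 for equal sinusoidal jitter delayed by `mismatch_ui` UIs."""
    half_phase = math.pi * jitter_hz * mismatch_ui * BIT_PERIOD_S
    return 2 * amplitude_ui * abs(math.sin(half_phase))


def brute_force_peak_ui(amplitude_ui, jitter_hz, mismatch_ui, points=20000):
    """Cross-check: scan one jitter period and take max |J2 - J1| directly."""
    period = 1 / jitter_hz
    delay = mismatch_ui * BIT_PERIOD_S
    peak = 0.0
    for k in range(points):
        t = k * period / points
        j1 = amplitude_ui * math.sin(2 * math.pi * jitter_hz * t)
        j2 = amplitude_ui * math.sin(2 * math.pi * jitter_hz * (t + delay))
        peak = max(peak, abs(j2 - j1))
    return peak


for m in (10, 78, 200):
    print(m, round(differential_jitter_peak_ui(0.25, 4.14e6, m), 4))
```

For small mismatches the peak grows almost linearly with the mismatch, which is why short links keep the proxy clock close to the burst-mode data.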

Fig. 15. Theoretical and simulated JTOL of the proposed CDR.
Fig. 18. Process variation at phase rotator code with (a) minimum phase delay error, (b) maximum phase delay error.

IV. SIMULATION RESULTS
The functionality of the prototype of the proposed burst-mode receiver (Fig. 7) is presented in this section. The proposed CDR is designed in a 65 nm, 1-V CMOS process. The total power dissipation of the CDR of each channel operating at 10 Gbps is 19.5 mW (1.95 pJ/bit) during the on-state and reduces to 0.58 mW in the idle-state. The power breakdown of the CDR is presented in Table I. The layout of the entire CDR of one channel is shown in Fig. 14.
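The impact of these two power figures on a lightly loaded link can be estimated with a duty-cycle average. The sketch below uses the quoted on-state and idle-state powers; the 10 % activity factor corresponds to the "idle up to 90 % of the time" figure cited in the introduction, and the model ignores the energy spent during the turn-on and turn-off transitions. Function names are hypothetical.

```python
P_ON_MW = 19.5        # on-state CDR power per channel (from the text)
P_IDLE_MW = 0.58      # idle-state CDR power per channel (from the text)
BIT_RATE_BPS = 10e9   # per-channel data rate (from the text)


def average_power_mw(active_fraction):
    """Time-averaged CDR power for a channel active `active_fraction` of the time."""
    return active_fraction * P_ON_MW + (1 - active_fraction) * P_IDLE_MW


def idle_power_reduction_percent():
    """Idle-state saving relative to leaving the CDR fully on."""
    return 100 * (1 - P_IDLE_MW / P_ON_MW)


def energy_per_bit_pj():
    """On-state energy efficiency in pJ/bit."""
    return P_ON_MW * 1e-3 / BIT_RATE_BPS * 1e12


print(f"{average_power_mw(0.1):.3f} mW average at 10 % activity")
print(f"{idle_power_reduction_percent():.1f} % idle-state reduction")
print(f"{energy_per_bit_pj():.2f} pJ/bit in the on-state")
```

At 10 % activity the average CDR power is about 2.5 mW per channel, compared with 19.5 mW for a CDR that is never put into the idle-state.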
The theoretical value of the JTOL of the CDR used in this work is plotted in Fig. 15. Two different jitter frequencies are selected (4.14 MHz and 8.28 MHz) for simulating the performance of the CDR with added sinusoidal jitter at these frequencies. The maximum amplitude of jitter tolerated by the CDR was noted (limiting the phase offset between the clock and the data to within 0.3 UI peak-peak). The noted values of the jitter amplitude are presented in Fig. 15 and are comparable to (4).
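A rough feel for the trend of the JTOL curve can be obtained from a slew-rate argument. The sketch below is not a reproduction of (3) or (4): it assumes that the loop can move the clock by at most one PR step (1/32 UI) per loop-filter update, that updates occur every six cycles of a quarter-rate clock (24 UIs), and that loop latency is negligible, so it gives an optimistic bound. All names are hypothetical.

```python
import math

STEP_UI = 1 / 32         # PR resolution: 32 phase steps per UI (from the text)
CYCLES_PER_UPDATE = 6    # minimum loop-filter update interval (from the text)
UI_PER_CYCLE = 4         # assumption: the loop filter runs on the quarter-rate clock
BIT_PERIOD_S = 100e-12   # 10 Gbps per channel (from the text)


def max_tracking_slew():
    """Largest phase change the loop can follow, in UI per UI."""
    return STEP_UI / (CYCLES_PER_UPDATE * UI_PER_CYCLE)


def max_trackable_offset_ppm():
    """Frequency offset whose phase ramp equals the maximum tracking slew."""
    return max_tracking_slew() * 1e6


def slew_limited_jtol_peak_ui(jitter_hz):
    """Peak sinusoidal jitter whose steepest slope equals the tracking slew."""
    return max_tracking_slew() / (2 * math.pi * jitter_hz * BIT_PERIOD_S)


print(f"{max_trackable_offset_ppm():.0f} ppm")
for f in (4.14e6, 8.28e6):
    print(f"{f / 1e6:.2f} MHz: {slew_limited_jtol_peak_ui(f):.2f} UI peak")
```

Under these assumptions the bound halves when the jitter frequency doubles, and the trackable offset of roughly 1300 ppm leaves margin over the 1000 ppm offset used in the simulations.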
There are several non-idealities that stress the performance of the CDR, such as the nonlinearity in the PR, process variation and mismatch effects on the phase interpolator, correlated and uncorrelated jitter. The PR delay with each phase step is plotted in Fig. 16 and the maximum phase offset of 6 ps from an ideal value is obtained for a given PR code.

A. EFFECTS OF PROCESS VARIATION
Local transistor mismatch will lead the phase interpolators used in the CDRs of the always-on channel and the burst-mode channels to behave differently. The effect of mismatch on the phase delay for the PR codes that result in the minimum and the maximum phase delay error (Fig. 16) is shown in Fig. 17 for 100 Monte-Carlo iterations. The phase delays are with respect to the ideal values of delay corresponding to those codes. The means and the standard deviations in the two cases are -2.3 ps and 5.68 ps, and 1.3 ps and 5.4 ps, respectively. The effect of global process variation is shown in Fig. 18 for these PR codes and the corresponding means and standard deviations are 1.4 ps and 6.8 ps, and 2.1 ps and 8.7 ps, respectively. The deviation, for a given PR code, from the average value due to global process variation and local transistor mismatch will be handled by the proxy timing-recovery as a static phase offset. However, the effect of local transistor mismatch on the linearity of the PR is of concern. In Fig. 19 (a), the PR delay is plotted against the run number for phase step number 0 and phase step number 16 (50 ps delay between the two different PR codes). The phase variation between these two codes is highly correlated over the iteration number. In Fig. 19 (b) the difference between the delay for PR codes 0 and 16 is plotted, giving a variation in delay of ~3 ps. Therefore, the PRs are robust against local mismatch and the proxy timing recovery scheme achieves lock when the CDR turns on.

B. EFFECTS OF JITTER
Without any data jitter there is a maximum allowable frequency offset between the sampling clock and the data for phase tracking. However, in the presence of correlated jitter, only a lower frequency offset can be tolerated, which depends on the amplitude and frequency of the jitter. For an ideal data eye, the maximum allowable phase offset between the sampling clock and the data, resulting from uncorrelated jitter, is 1 UIp-p for a specified BER. In the case of completely uncorrelated jitter, a differential jitter of 1 UIp-p (and hence 1 UIp-p clock and data phase offset) is due to an uncorrelated jitter amplitude of 0.5 UIp-p at any given jitter frequency. This uncorrelated jitter between the always-on channel and the burst-mode channel has an adverse effect only when the burst-mode CDR turns on from the idle-state. During the active state, each CDR can track this jitter and remains locked.
If the idle time is longer than the period of the jitter, then the maximum tolerable amplitude of differential jitter (due to uncorrelated jitter) is 1 UIp-p. However, if the period of the jitter is longer than four times the idle period, the CDR can sustain uncorrelated jitter with an amplitude that results in a differential jitter of more than 1 UIp-p. Under this condition of uncorrelated jitter, the phase offset between the sampling clock and the data remains within 1 UIp-p (neglecting other non-idealities). The tolerance for uncorrelated jitter at jitter periods of 1000 ns, 242 ns, and 121 ns against the idle period is plotted in Fig. 20. The longer the period of the jitter compared to the idle period, the higher the amplitude of differential jitter that can be tolerated by the CDR. For example, if the idle period is 40 ns, a 1 MHz uncorrelated jitter can be tolerated by the proposed proxy timing-recovery with a differential jitter amplitude (considering anticorrelation) as high as 4.02 UIp-p.
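The 40 ns example can be reproduced with a simple model. The sketch below is an inferred reading of the argument, not necessarily the expression behind Fig. 20: it assumes that the differential jitter is a sinusoid starting from zero at the beginning of the idle period, so that only the portion accumulated during the idle period must stay within 1 UIp-p, and that once the jitter period is no longer than four times the idle period the full amplitude can be reached. The function name is hypothetical.

```python
import math


def tolerable_differential_jitter_uipp(idle_ns, jitter_period_ns, limit_uipp=1.0):
    """Largest differential jitter (UIp-p) keeping the wake-up offset within the limit."""
    if jitter_period_ns <= 4 * idle_ns:
        # The sinusoid can reach its peak within the idle period.
        return limit_uipp
    # Only a fraction sin(2*pi*T_idle/T_jitter) of the amplitude builds up.
    return limit_uipp / math.sin(2 * math.pi * idle_ns / jitter_period_ns)


for period_ns in (1000, 242, 121):
    tolerance = tolerable_differential_jitter_uipp(40, period_ns)
    print(f"jitter period {period_ns} ns: {tolerance:.2f} UIp-p")
```

For a 40 ns idle period this model returns about 4.02 UIp-p at a 1000 ns jitter period, matching the example in the text, and falls back to 1 UIp-p for the 121 ns jitter period.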

C. LOCKING BEHAVIOR
In the proposed design, the overall lock time of the CDR depends mostly on the injection locking of the ILO to the injected clock signal. The extracted simulation (C+CC) of the overall locking behavior of the CDR is presented in Fig. 21, and it takes 1.9 ns to phase lock the third ILO to the reference clock (shown in grey) within a phase error of 15 ps (0.15 UI).
The usefulness of proxy timing recovery is most apparent when timing must be recovered quickly after a long period without data and in the presence of a ppm frequency offset. The transient behavior of the receiver is presented in Fig. 22 and Fig. 23, with a continuous pseudorandom binary sequence (PRBS) 2^7-1 pattern applied to the always-on channel and burst-mode patterns (also PRBS 2^7-1) applied to the burst-mode channels. To account for the nonlinearity of the PR in the simulation, the input data pattern of the second burst-mode channel is phase shifted by ½ UI, forcing its PR code to be offset from that of the always-on channel. This offset is around 16 phase steps ahead of that of the always-on channel, such that when the burst-mode channel turns on, its PR has near-maximum phase delay error (because the phase update signal from the always-on channel increments the code by 16 phase steps during the 40 ns idle period). The plots are shown after an initial period over which the sampling clocks of the always-on channel and the burst-mode channels have separately aligned to their optimal sampling positions.

Fig. 22 and Fig. 23 show the cases without and with proxy timing recovery, respectively. Without phase updates from the always-on channel to the PR control code of the burst-mode channels, the 1000 ppm frequency offset causes the data phase to drift away from the reference clock phase between bursts. Thus, when the burst-mode channels turn on following 40 ns without data, their clocks do not lock within the preamble period, and a phase difference of ~40 ps remains between the sampling clock and the data (Fig. 22). As expected, the PR nonlinearity does not contribute significantly to the phase difference at the end of the idle period. With the phase updates from the always-on channel, however, the PR code is kept up to date, and the sampling clocks of the burst-mode channels have the correct phase at the end of the 49 UI preamble period (Fig. 23).

Fig. 22. Analog output data and sampling clock of the burst-mode and always-on channels with the ON-OFF signal. Enlarged views of the areas before and after the idle state (40 ns) are also presented in this figure and subsequent figures. The enlarged views show the sampling behavior in the presence of a 1000 ppm frequency offset between the transmit and receive clock frequencies and without the proxy timing recovery scheme. Burst-mode channel-2 also includes the nonlinear effect of the PR. Without proxy timing recovery, the clocks of the burst-mode channels are not locked at the end of the preamble bits.
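The contrast between the two cases can be illustrated with a minimal behavioral sketch. It is not the simulated circuit: the 2.5 ps PR step (inferred here from 16 steps covering roughly 40 ps of drift), the update interval of 8 UIs, the half-step dead zone in the early/late decision, and all function and parameter names are assumptions made purely for illustration.

```python
def idle_phase_error_ps(off_ns, ppm, proxy, ui_ps=100.0, step_ps=2.5,
                        update_every_ui=8):
    """Phase error (ps) of an idle burst-mode channel at the end of the off period.

    Assumed model: the data phase drifts linearly at the ppm offset; the
    always-on channel makes an early/late decision (with a half-step dead
    zone) every `update_every_ui` UIs and moves its PR code by one step.
    With proxy=True the same +1/-1 update is also applied to the idle channel.
    """
    n_ui = round(off_ns * 1000.0 / ui_ps)       # idle period expressed in UIs
    drift_per_ui_ps = ui_ps * ppm * 1e-6        # drift accumulated per UI
    data_phase_ps = 0.0
    always_on_code = 0                          # PR code of the always-on channel
    burst_code = 0                              # PR code of the idle burst-mode channel
    for k in range(1, n_ui + 1):
        data_phase_ps += drift_per_ui_ps
        if k % update_every_ui:
            continue                            # no phase decision on this UI
        err = data_phase_ps - always_on_code * step_ps
        update = (err > step_ps / 2) - (err < -step_ps / 2)   # +1, 0 or -1
        always_on_code += update
        if proxy:
            burst_code += update                # proxy update of the idle channel
    return data_phase_ps - burst_code * step_ps


if __name__ == "__main__":
    print("without proxy:", idle_phase_error_ps(40, 1000, proxy=False), "ps")
    print("with proxy:   ", idle_phase_error_ps(40, 1000, proxy=True), "ps")
```

Under these assumptions the uncorrected channel ends the 40 ns idle period about 40 ps away from the data, while the proxy-updated channel stays within one PR step. Real behavior also depends on jitter, PR nonlinearity and loop latency, none of which are modeled here.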
In Fig. 24, the simulation includes correlated jitter in addition to the ppm frequency offset. However, due to the presence of 4.14 MHz correlated jitter with amplitude 0.5 UIp-p, the offset frequency has been reduced to 300 ppm. At the end of the idle period, the clocks are aligned within the preamble period.

Fig. 24. Simulation with proxy timing recovery in the presence of a 300 ppm frequency offset (burst-mode channel-2 includes the PR nonlinearity) and 4.14 MHz correlated jitter of amplitude 0.5 UIp-p. The clocks of the burst-mode channels lock successfully within the preamble period.

The presented ~40 ps difference between the sampling clock and the data is due to an inactive period of only 40 ns chosen for simulation feasibility. A graphical representation of the calculated phase difference (in time, Δt) between the sampling clock and the data, resulting from different off periods and 1000 ppm frequency offset, is shown in Fig. 25.

[Fig. 25: Δt (ps) versus off period (ns).]
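The relationship between off period and phase difference can be written out directly. The sketch below assumes the phase drifts uncorrected at a constant frequency offset, so that Δt equals the offset multiplied by the off period; the function names and the list of off periods are illustrative and are not taken from the figure.

```python
def drift_ps(off_period_ns, offset_ppm):
    """Uncorrected phase drift (ps) accumulated over an off period."""
    return off_period_ns * 1e3 * offset_ppm * 1e-6


def off_period_for_drift_ns(drift_limit_ps, offset_ppm):
    """Off period (ns) after which the uncorrected drift reaches drift_limit_ps."""
    return drift_limit_ps / (offset_ppm * 1e-6) / 1e3


UI_PS = 100.0  # one UI at 10 Gbps

for t_off_ns in (10, 40, 100, 1000):          # illustrative off periods
    dt = drift_ps(t_off_ns, 1000)
    print(f"{t_off_ns:5d} ns off -> {dt:7.1f} ps ({dt / UI_PS:.2f} UI)")
```

With a 1000 ppm offset this gives about 40 ps for the 40 ns off period used in the simulation, and the drift reaches half a UI (50 ps at 10 Gbps) after roughly 50 ns, which is why longer idle periods cannot be left uncorrected.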
Using the proposed technique, irrespective of the length of the off period, the clock remains almost aligned with the data when the CDRs of the burst-mode channels turn on. However, when there is a ppm frequency offset, consecutive identical digits (CIDs) on the always-on channel will result in a phase offset between the sampling clock and the data of the always-on channel and, in turn, of the burst-mode channel. For a typical ppm frequency offset and a differential jitter of 0.25 UI, the designed CDR can tolerate at most 83 CIDs. In this design, the timing recovery depends mostly on the locking time of the ILO. Therefore, the technique presented here proves to be effective for fast timing recovery in a burst-mode multi-channel application.

V. COMPARISON AND CONCLUSION
This work demonstrates a new approach for fast timing recovery, based on a proxy timing-recovery scheme, and for power reduction during the idle period in burst-mode parallel optical links. The proposed architecture is similar to a conventional architecture presently implemented in datacenters. In the proposed technique, when the burst-mode channel is idle, its PR control code is updated by the always-on channel. Simulation results demonstrate a fast CDR lock time of 26 UIs when the CDR goes from its idle state to its on-state, even in the presence of a ppm frequency offset between the clock and the data.
This work is compared with the most relevant burst-mode works in Table II. The proposed CDR, together with the presented sense circuit, takes less than 49 UIs to lock, and the CDR itself takes only 26 UIs from the time it is turned on by the ON-OFF signal. In contrast, the clock recovery times reported in [5] and [6] are 352 UIs and 463 UIs, respectively, from the time the CDR is powered on. The work of [8] addresses timing and output DC-level recovery simultaneously and takes 58 UIs to lock. This work achieves a faster lock time than [11], which is implemented in the same technology, at a comparable power dissipation. The proposed work incorporates ILOs, whose lock time is the most significant factor in determining the overall turn-on time of the CDR. Reducing the lock time of the ILO will correspondingly reduce the lock time of the CDR.
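For scale, the lock times quoted in UIs can be converted to absolute time at this work's 10 Gbps per channel. The sketch below is only unit conversion. Reading the 1.9 ns ILO lock of Fig. 21 as part of the 26 UI turn-on is an assumption, and the other works in Table II may operate at different data rates, so their UI counts are deliberately not converted.

```python
UI_PS = 100.0  # one UI at 10 Gbps


def ui_to_ns(n_ui, ui_ps=UI_PS):
    """Convert a duration in unit intervals to nanoseconds."""
    return n_ui * ui_ps / 1000.0


def ns_to_ui(t_ns, ui_ps=UI_PS):
    """Convert a duration in nanoseconds to unit intervals."""
    return t_ns * 1000.0 / ui_ps


cdr_lock_ns = ui_to_ns(26)        # CDR lock from the ON-OFF signal
total_lock_ns = ui_to_ns(49)      # upper bound including the sense circuit
ilo_lock_ui = ns_to_ui(1.9)       # ILO lock time from Fig. 21, in UIs
ilo_share = ilo_lock_ui / 26      # assumed share of the CDR turn-on time

print(f"CDR lock: {cdr_lock_ns:.1f} ns, with sense circuit: < {total_lock_ns:.1f} ns")
print(f"ILO lock: {ilo_lock_ui:.0f} UI, about {ilo_share:.0%} of the 26 UI turn-on")
```

On this reading, the ILO accounts for roughly three quarters of the CDR turn-on time, which is consistent with the statement that the ILO lock time is the dominant factor.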
The off-state power does not include the power of the always-on channel and the REF_CK, both of which can be amortized over the lanes. In a scenario with one always-on lane and 11 burst-mode lanes, the worst case off-state power (assuming no useful data over the always-on channel) of each burst-mode CDR becomes 2.35 mW as compared to 0.58 mW for a burst-mode CDR alone. This is still significantly smaller than the 19.5 mW dissipated by an always-on CDR.
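The amortized figure can be reproduced with a short calculation. The sketch assumes the shared cost is the full 19.5 mW of the always-on CDR split evenly across the burst-mode lanes; any separate REF_CK contribution is not itemized in the text and is therefore not modeled. The function and parameter names are illustrative.

```python
def off_state_power_per_lane_mw(idle_mw=0.58, always_on_mw=19.5, burst_lanes=11):
    """Worst-case off-state power per burst-mode CDR, including its share
    of one always-on CDR amortized over the burst-mode lanes (assumed model)."""
    return idle_mw + always_on_mw / burst_lanes


p_off = off_state_power_per_lane_mw()
saving = 1.0 - p_off / 19.5       # relative to a CDR that is always on

print(f"amortized off-state power: {p_off:.2f} mW per burst-mode lane")
print(f"reduction versus an always-on CDR: {saving:.0%}")
```

With 11 burst-mode lanes this yields about 2.35 mW per lane, matching the value quoted above. The amortized share shrinks as more burst-mode lanes share one always-on lane.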
Finally, the power overhead of the proposed design (2.6 %) is lower than that of [8] (22 %), which operates at the same data rate. Therefore, the presented work, which has an area overhead of 597 µm² (1.3 % of the non-burst-mode area) coming from the "Off Detect" and the "Burst Detect" circuits, and an on-state power overhead of 0.5 mW, is an attractive solution for energy-efficient parallel optical links.

[Table II column headings: [5] ISSCC'15; [6] JOCN'18; [8] JSSC'20; [11]; This work.]