A Cryo-CMOS DAC-Based 40-Gb/s PAM4 Wireline Transmitter for Quantum Computing

Addressing the advancement toward large-scale quantum computers, this article presents the first four-level pulse amplitude modulation (PAM4) wireline transmitter (TX) operating at cryogenic temperatures (CTs). Quantum computers are scaling up toward thousands of quantum bits (qubits), but their fidelity is too limited for robust operation, so continuous rounds of quantum error correction (QEC) are necessary. However, QEC requires a large amount of data to be transferred from a cryogenic controller at 4 K to a classical processor at room temperature (RT). To bridge the gap, a high-speed data link between the quantum processor at CT and the classical counterpart at RT is needed. The proposed PAM4 TX architecture integrates a low-power 64:4 serializer structure, a high-speed 4:1 current-mode logic (CML) multiplexer, and a linear 6-bit digital-to-analog converter (DAC). Considering the challenges and benefits of CMOS operating at CTs, the TX architecture and circuitry are designed to exploit the maximum speed, while maintaining sufficient linearity. The fabricated 40-nm CMOS chip achieves a data rate of 40 Gb/s (36 Gb/s), an energy efficiency of 2.46 pJ/b (2.47 pJ/b), and a 97.8% (96.6%) ratio of level mismatch (RLM) at CT (RT). While demonstrating an energy efficiency comparable to prior-art TXs in more advanced CMOS nodes at RT, the broad operating temperature of the proposed TX enables the required high-speed wireline link for large-scale quantum computers.

With small-scale processors comprising up to 100 quantum bits (qubits), simple algorithms can already outperform a classical computer [1], [2]. In the next intermediate stage, with processors achieving up to 10k qubits, specific chemistry or physics algorithms, such as computational catalysis, can be solved within weeks [3]. In a far future with >$10^6$ qubits, large-scale quantum computer systems can practically revolutionize computing, cracking modern-day encryption within a day [4].
Looking at the intermediate-scale stage for the coming years, qubits are hard to scale up and too noisy for robust computations [5]. Hence, different quantum error correction (QEC) strategies [6], [7], [8], such as surface code (SC), were proposed to realize a large-scale quantum computer with error rates low enough for useful calculations. In general, QEC encodes information across multiple physical qubits to form a logical qubit, thereby suppressing the errors. On the other hand, to reach good inherent physical qubit fidelity, thermal noise should also be minimized, and consequently, the qubits must be located at sufficiently low temperatures, typically at the mK stage of a dilution refrigerator.
As the classical controller and readout are currently located at room temperature (RT), a large number of cables are needed to transfer the extensive amount of analog information from/to the qubits. Fitting all these cables in a dilution refrigerator not only leads to an integration challenge, but also imposes a 1-mW/cable thermal load on the fridge [9]. To tackle this imminent bottleneck, CMOS circuits operating at cryogenic temperatures (cryo-CMOS) have been developed to move the control/readout equipment from RT to the cryogenic environment [10]. Characterization efforts to date have shown multiple CMOS technologies to be functional at cryogenic temperatures (CTs), accompanied by some challenges, including increased threshold voltage and device mismatch [11], [12], [13].
So far, cryo-CMOS circuits have been placed at the 4-K stage to control and read out the qubits [14], [15], [16], [17], [18], [19], while (de)multiplexers are designed for operating at the mK stage to reduce the amount of interconnect between qubit and controller [20], [21]. Moreover, hot qubits operating with high gate fidelities at temperatures above 1 K are being developed to completely close the temperature gap between the electronics and qubits, thus improving the scalability of future quantum computers [22], [23]. However, in any scenario, QEC is still needed to protect the quantum information from errors [6], [7], [8]. After each periodic readout cycle, the combined state of the physical qubits must be decoded to extract any eventual error. Considering the computational complexity of QEC decoding algorithms and their required short execution time (well below 1 µs for typical solid-state qubits [24], [25]), a low-latency, high-performance classical processor is needed for the decoder implementation. Hence, it is challenging to integrate the required processor inside the refrigerator due to the limited available cooling power [26].

Fig. 1. Simplified block diagram of a scalable quantum computing system incorporating a cryo-CMOS high-speed wireline link.
Assuming that the readout data from the qubits are digitized at the 4-K plate of the dilution refrigerator, these data then need to be serialized from different readout blocks at CT and transferred to the classical processor at RT, as illustrated in Fig. 1. Furthermore, for the cryo-CMOS control electronics to then apply the correction to the qubits, a fast downlink is necessary to send the gate instructions down. Hence, to accomplish those tasks and to significantly reduce the number of cables between the dilution refrigerator and RT equipment, there is a need for a high-speed wireline transmitter (TX) operating at CT/RT, as presented in [27]. Even if, in the future, QEC can be implemented within the cryogenic environment, a high-speed TX is still necessary for chip-to-chip communication.
Compared with [27], this article aims to quantify the system's requirements, extend the design procedure, and give a more detailed evaluation of the results. This article is organized as follows. Section II defines the system specifications. Section III discusses the architecture and gives a detailed analysis of the circuit design. Section IV evaluates the measurement results, and this article is concluded in Section V.

II. WIRELINE TX SYSTEM SPECIFICATIONS
Based on QEC requirements and channel loss, this section gives a guideline for determining the TX specifications, such as data rate, power consumption, modulation format, output swing, jitter, and linearity. First, we estimate the required uplink and downlink data rates based on QEC requirements. Fig. 2(a) shows a simplified system diagram of QEC using a repeated SC with a distance of $d$. In this scheme, each logical qubit contains $d^2$ data qubits (D) and $d^2 - 1$ ancilla measurement qubits (X/Z). In each SC cycle, the measurement data of the X/Z ancilla qubits are serialized by the uplink wireline TX at 4.2 K and transmitted to a set of decoders located at RT. The decoder then finds the estimated error on the data qubits. Based on this outcome, the Pauli frame unit (PFU) tracks the error and passes the correction instruction to the downlink TX. The downlink receiver (RX) then de-serializes the instruction data for the cryogenic controller to apply the corresponding correction gates to the data qubits. Each SC cycle includes two Hadamard/Identity gates, four controlled-NOT (CNOT) gates, and the measurement of the X/Z ancilla qubits. The corrections can be applied after measuring the data qubits. The frequency of data qubit measurements depends on the type of quantum algorithm. In the worst case, a correction should be applied between each cycle, but in practice, this measurement rate is always lower [28], [29], [30]. By considering Fig. 2(b) and assuming a Hadamard/Identity gate time ($T_{H/I}$) of 20 ns, a CNOT gate time ($T_{\mathrm{CNOT}}$) of 40 ns, and a measurement time ($T_M$) of 200 ns, the total SC cycle time ($T_{\mathrm{SC}}$) is 440 ns for state-of-the-art Transmons [31]. As can be gathered from Fig. 2(b) and (c), to prevent a backlog or increase in the execution time of quantum algorithms, the total data round-trip time, including the transfer time for the uplink ($T_{\mathrm{uplink}}$), decoder ($T_{\mathrm{decoder}}$), and downlink ($T_{\mathrm{downlink}}$), should be shorter than the period at which X/Z measurement data are produced [32]. Since the measurement of each ancilla qubit produces a 1-bit data stream ("0" or "1"),¹ the uplink data rate can be estimated by

$$\mathrm{DR}_{\mathrm{uplink}} \approx \frac{N\,(d^2 - 1)}{T_{\mathrm{uplink}}} \tag{1}$$

where $N$ is the number of logical qubits. Considering that the QEC decoder takes $T_{\mathrm{decoder}} \approx 90$ ns [31], $(T_{\mathrm{SC}} - T_{\mathrm{decoder}})/2 = 175$ ns of transfer time will be available for either $T_{\mathrm{uplink}}$ or $T_{\mathrm{downlink}}$. On the other hand, with the current inherent infidelity of ∼$10^{-3}$ for state-of-the-art physical qubit technologies [24], [25], at least a distance of 23 is needed to reach a logical error below $10^{-12}$ for a practical fault-tolerant quantum computer [34]. Considering $T_{\mathrm{uplink}} = 175$ ns and one logical qubit with $d = 23$, an uplink data rate of 3 Gb/s is required.

¹ Some decoding algorithms need the entire measured data stream of in-phase and quadrature analog-to-digital converters (ADCs) of the readout chain [33], thus requiring a substantially higher uplink data rate.
On the other hand, for the worst-case scenario where a physical data qubit is measured in each SC cycle, the PFU should decide which of the $m$ possible gate instructions needs to be applied. This instruction code should be transmitted down to CT at the same rate. Consequently, the speed required in the downlink TX may be estimated by

$$\mathrm{DR}_{\mathrm{downlink}} \approx \frac{N\,d^2 \log_2 m}{T_{\mathrm{downlink}}} \tag{2}$$

Considering the same number of qubits and code distance and assuming that a universal instruction set of eight quantum gates is sufficient to control the qubit system [35], [36], the maximum downlink data rate for one logical qubit is 9 Gb/s. In this work, we target a data rate of 40 Gb/s, thus providing the required throughput for four logical qubits. With current qubit technologies and QEC architectures, moving toward larger numbers of logical qubits in the far future would require Tb/s throughput and extremely challenging energy efficiency for the data link [37]. Consequently, both the relaxation time ($T_2^*$) and fidelity of the qubits need to improve to relax the requirements on the cycle time and distance of the SC. Another approach to reducing the throughput bottleneck would be to implement a power-efficient decoder within the cryogenic environment to detect local errors, thus reducing the data rate going to the more complex decoder at RT [38].
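The uplink and downlink estimates above can be checked numerically. The sketch below assumes one bit per ancilla qubit per transfer window for the uplink and log2(m) instruction bits per data qubit for the downlink; these forms are inferred from the quoted 3-Gb/s and 9-Gb/s figures, and the function names are illustrative only.

```python
import math

def uplink_rate(n_logical, d, t_uplink):
    """Bits/s: one bit per ancilla qubit (d^2 - 1 per logical qubit) per window."""
    return n_logical * (d**2 - 1) / t_uplink

def downlink_rate(n_logical, d, m_gates, t_downlink):
    """Bits/s: log2(m) instruction bits per data qubit (d^2 per logical qubit)."""
    return n_logical * d**2 * math.log2(m_gates) / t_downlink

T_SC = 440e-9       # surface-code cycle time quoted in the text
T_DECODER = 90e-9   # decoder latency quoted in the text
t_transfer = (T_SC - T_DECODER) / 2  # 175 ns left for each direction

up = uplink_rate(1, 23, t_transfer)
down = downlink_rate(1, 23, 8, t_transfer)
print(f"uplink   ~ {up / 1e9:.2f} Gb/s per logical qubit")
print(f"downlink ~ {down / 1e9:.2f} Gb/s per logical qubit")
print(f"four logical qubits need ~ {4 * down / 1e9:.1f} Gb/s downlink")
```

Under these assumptions, the worst-case downlink for four logical qubits stays below the 40-Gb/s target.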
To provide both uplink and downlink, the wireline TX must operate at CT and RT. The cooling power available in a typical dilution refrigerator is limited to roughly 1 W at the 4-K plate [9], of which the largest portion is available for the electronics, while the other part is reserved for thermal loading. Considering that the control and readout circuits take a significant part of the power budget for electronics, we target an active power consumption of 100 mW for the wireline TX, including data retiming, serializing, and driving of the wireline. Note that, based on the timing diagram in Fig. 2(c), the uplink TX at the 4-K plate is turned on for less than half the SC cycle, so the expected thermal loading will be even lower.
In the next step, the data modulation format should be chosen between nonreturn-to-zero (NRZ) signaling and four-level pulse amplitude modulation (PAM4). Essentially, at the same bit rate and TX maximum output swing, PAM4 increases the number of voltage levels from two to four. Thus, the required circuit speed and bandwidth are reduced by half, while the noise tolerance is degraded, as its eye height is reduced by 3×. Fig. 3 shows the estimated loss of the system between the cryogenic TX and room-temperature RX, considering the cable insertion loss and the TX's output bandwidth limitations due to the parasitic capacitance of the output pads and electrostatic discharge (ESD) diodes. At half of the baud rate (i.e., 20 GHz for NRZ and 10 GHz for PAM4), the total loss is ∼6 dB higher for NRZ signaling. Consequently, the maximum eye height normalized to the TX output swing will be ∼3 dB larger for NRZ signaling at the RX input. However, since the delay and rise/fall time of simple gates is ∼12 ps in the used technology (i.e., 40 nm), the timing skews can easily be comparable with the 25-ps symbol period of an NRZ TX, thus degrading the eye width significantly. Therefore, PAM4 signaling is chosen in this design.
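The NRZ-versus-PAM4 trade-off can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (3× eye-height reduction, ∼6 dB extra NRZ loss, ∼12-ps gate delay); the function names are illustrative.

```python
import math

def pam4_eye_penalty_db():
    """Eye-height penalty of PAM4 vs NRZ at equal swing: three stacked eyes."""
    return 20 * math.log10(3)

def nrz_eye_advantage_db(extra_nrz_loss_db):
    """Net NRZ eye-height advantage after its higher channel loss is subtracted."""
    return pam4_eye_penalty_db() - extra_nrz_loss_db

def ui_ps(bit_rate, bits_per_symbol):
    """Unit interval in picoseconds."""
    return 1e12 * bits_per_symbol / bit_rate

advantage = nrz_eye_advantage_db(6.0)
ui_nrz = ui_ps(40e9, 1)
ui_pam4 = ui_ps(40e9, 2)
gate_delay_ps = 12.0

print(f"PAM4 eye penalty   : {pam4_eye_penalty_db():.2f} dB")
print(f"net NRZ advantage  : {advantage:.2f} dB")
print(f"gate delay / UI    : NRZ {gate_delay_ps / ui_nrz:.2f}, PAM4 {gate_delay_ps / ui_pam4:.2f}")
```

The residual NRZ advantage is only a few dB, whereas a 12-ps gate delay consumes roughly half of the NRZ unit interval but about a quarter of the PAM4 one, which supports choosing PAM4 on timing grounds.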
Considering the targeted baud rate (i.e., 20 Gbaud) and estimated channel loss of 8-to-14 dB at 20 GHz, the long-range wireline standard is used as a guideline to determine the required differential output swing ($V_{o,pp}$) and jitter of the TX [39]. The standard demands an 800-mV$_{pp}$ steady-state swing without pre-emphasis. Furthermore, the time interval that includes 99.99% of the jitter distribution (J4u) should be below 0.118 unit interval (UI). Hence, the total jitter should not exceed 5.9 ps in our 40-Gb/s PAM4 TX. Note that both deterministic jitter caused by clock skew and random jitter due to noise should be considered in J4u calculations.
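As a quick check of the jitter budget, the sketch below converts the 0.118-UI J4u limit into picoseconds for a given bit rate and modulation order. The helper name is illustrative.

```python
def jitter_budget_ps(bit_rate, bits_per_symbol, j4u_ui=0.118):
    """Convert a J4u limit in UI into picoseconds for the given line rate."""
    baud = bit_rate / bits_per_symbol
    ui_ps = 1e12 / baud
    return j4u_ui * ui_ps

pam4_budget = jitter_budget_ps(40e9, 2)  # 20 Gbaud, 50-ps UI
nrz_budget = jitter_budget_ps(40e9, 1)   # 40 Gbaud, 25-ps UI
print(f"PAM4 budget: {pam4_budget:.2f} ps, NRZ budget: {nrz_budget:.2f} ps")
```

At the same bit rate, the same UI-relative limit leaves PAM4 twice the absolute jitter budget of NRZ.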
Finally, the PAM4 TX should be sufficiently linear to realize the same height for all three eyes and achieve a similar bit error rate (BER) for all symbols. For PAM4 signals, the ratio of level mismatch (RLM) determines the linearity performance, as illustrated in Fig. 4. The RLM is defined as the smallest separation between adjacent levels divided by one third of the full swing

$$\mathrm{RLM} = \frac{3\,\min\left(V_3 - V_2,\; V_2 - V_1,\; V_1 - V_0\right)}{V_3 - V_0} \tag{3}$$

where $V_{0-3}$ are the mean values of the PAM4 signal levels.
Based on the long-range wireline standard [39], an RLM of at least 95% is required. Consequently, the maximum allowed quantization error must be less than 1.6%, which results in at least a 5-bit amplitude resolution for the TX output stage.
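The linearity requirement can be illustrated with a short sketch. It assumes the common simplified RLM form (three times the smallest adjacent-level separation over the full swing) and interprets the 1.6% quantization limit as half an LSB relative to full scale; both are assumptions made for illustration, and the example levels are hypothetical.

```python
def rlm(levels):
    """Ratio of level mismatch for four mean PAM4 levels (assumed simplified form)."""
    v = sorted(levels)
    steps = [b - a for a, b in zip(v, v[1:])]
    return 3 * min(steps) / (v[-1] - v[0])

def half_lsb_fraction(n_bits):
    """Half-LSB quantization error as a fraction of full scale."""
    return 0.5 / 2**n_bits

def min_resolution(max_error_fraction):
    """Smallest bit count whose half-LSB error is below the given limit."""
    n = 1
    while half_lsb_fraction(n) >= max_error_fraction:
        n += 1
    return n

ideal = rlm([0.0, 1.0, 2.0, 3.0])
compressed = rlm([0.0, 1.0, 2.0, 2.95])  # hypothetical compressed top level
bits = min_resolution(0.016)
print(f"ideal RLM = {ideal:.3f}, compressed RLM = {compressed:.3f}, bits = {bits}")
```

With these assumptions, a 5-bit DAC is the first resolution whose half-LSB error (about 1.56%) falls under the 1.6% limit.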

III. WIRELINE TX ARCHITECTURE AND CIRCUIT DESIGN
To overcome bandwidth limitations and compensate for the low-pass channel response, the wireline TX architecture should implement a feed-forward-equalization (FFE) filter in the analog or digital domain. The first option is to realize the pre-emphasis filter in the final stage by weighted combining of the delayed taps of the full-rate signal, as shown in Fig. 5(a). This requires the delay taps to run at the baud rate, increasing design complexity and limiting the flexibility to the fixed number of retimers installed in the system. Alternatively, the pre-emphasis can be calculated in the digital domain and then transmitted by a high-speed digital-to-analog converter (DAC), as can be gathered from Fig. 5(b). This allows the delay taps to run at the baud rate divided by the serializer steps. However, this requires a multi-bit DAC to run at the baud rate. Using on-chip memory for FFE implementation, the DAC-based architecture is chosen, as it adds flexibility to generate multiple equalization options in the digital domain, as well as the programmability to configure different modulation sequences (i.e., NRZ and PAM4).
The block diagram of the DAC-based wireline TX architecture is shown in Fig. 6. In total, 6×, 512-word, 64-bit programmable SRAM modules, with a total size of ∼197 kb, are synthesized to allow for exploring different bit patterns, data formats, and equalization techniques in measurement. The 6× 64-UI parallel SRAM modules are decoded to feed the 10× DAC serializer slices (3-b binary, 3-b unary coded). In each serializer slice, the data are multiplexed to drive a single DAC bit. First, the SRAM data are retimed by a 64-UI retimer with a selectable clock phase (i.e., ψ i ∈ ψ 0−7 ) to align all input data. Next, a 64:4 multiplexer (MUX) serializes this data stream to a 4-UI output stream, utilizing the available clock phases in an efficient manner. This 5-Gb/s signal is then retimed by a quarter-rate retimer and converted to 2 × 4 complementary streams, each having a 25% duty-cycle pulsewidth. A current-mode logic (CML)-based 4:1 MUX then interleaves the full-swing data pulses to a 20-Gb/s differential output, driving a DAC element. Finally, the DAC element drives the output network, consisting of two 50-Ω termination resistors, a peaking coil, ESD, and differential output pads, with sufficient swing. The termination resistors are implemented by unsilicided polysilicon resistors, as their sheet resistance is fairly constant over temperature [10], [40]. The peaking inductor is designed for maximally flat envelope delay, increasing the bandwidth by a few GHz. The values of the output capacitance and peaking inductance stay constant down to CT, while their quality factor increases by 2×-3× [41]. Hence, due to lower passive loss, the output bandwidth is expected to slightly improve at 4.2 K. A clock generation circuit provides the necessary clock phases for all serializers and retimers in the architecture, from an external 10-GHz clock input.
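The bookkeeping behind this architecture can be sketched as follows. The memory size, slice count, and per-stage rates follow from the figures in the text; the derived pattern length assumes one 64-UI word per SRAM row, which is an inference rather than something the text states.

```python
N_MODULES, WORDS, WORD_BITS = 6, 512, 64   # SRAM organization from the text
BAUD = 20e9                                # target symbol rate
N_BINARY, N_UNARY = 3, 3                   # DAC segmentation from the text

sram_bits = N_MODULES * WORDS * WORD_BITS
slices = N_BINARY + (2**N_UNARY - 1)       # one serializer slice per DAC element
word_rate_hz = BAUD / WORD_BITS            # rate at the 64-UI retimer
quarter_rate_bps = BAUD / 4                # rate per line after the 64:4 MUX

# Assumption: each SRAM row holds 64 consecutive UIs, so the memory stores
# WORDS * WORD_BITS symbols before the pattern repeats.
pattern_symbols = WORDS * WORD_BITS
pattern_time_s = pattern_symbols / BAUD

print(f"SRAM size        : {sram_bits / 1e3:.1f} kb")
print(f"serializer slices: {slices}")
print(f"64-UI word rate  : {word_rate_hz / 1e6:.1f} MHz")
print(f"4-UI line rate   : {quarter_rate_bps / 1e9:.1f} Gb/s")
print(f"pattern length   : {pattern_symbols} symbols ({pattern_time_s * 1e6:.2f} us)")
```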

A. 6-bit Current-Steering DAC
The DAC should drive the wireline with sufficient swing, linearity, and bandwidth. The final stage can be implemented in two ways: utilizing a current-mode driver or employing a source-series-terminated (SST) driver, both of which are commonly seen in wireline TXs [42], [43]. The SST or voltage-mode DAC creates the output levels based on a voltage division between the effective output resistance of the DAC cells and the load. Because of its class-D operation, which does not draw a constant current from the supply, an SST driver typically consumes ∼4× lower power than its current-mode counterpart when generating the same output swing. Despite this advantage, a current-steering DAC is adopted in this design for a number of reasons.
First, the switches in the SST cells must be wide enough to exhibit an ON resistance ($r_{ON}$) well below their corresponding series resistors. These wide transistors come with a high input capacitance and need a rail-to-rail input swing to switch fast, thus demanding power-hungry pre-drivers. However, current-mode drivers only need a moderate input voltage swing and a small switching pair to steer current, thus requiring less power consumption in the pre-driver stage and achieving higher speeds due to their lower input and output parasitic capacitance. Second, $r_{ON}$ significantly varies with process, voltage, and temperature (PVT) variations. Since the output resistance and voltage of the SST drivers are dependent on $r_{ON}$, additional calibration or trimming circuitry is needed to satisfy output matching and linearity of the SST driver both at RT and CT. Third, the SST structure draws a significant transient current during data transitions, thus requiring large decoupling capacitors and experiencing some data-dependent supply voltage ($V_{DD}$) variations. This can heavily affect the TX linearity, as the output voltage is proportional to $V_{DD}$. In contrast, current-steering drivers draw a relatively constant current from $V_{DD}$ for different data transitions and voltage levels. Moreover, the linearity and output swing of current-steering DACs are mainly determined by the current source accuracy, which can be stabilized more conveniently over PVT variations. Consequently, considering the possible supply voltage drop due to the long wires between the cryogenic wireline TX and the room-temperature voltage source, and potential supply/ground interferences, the current-steering driver is a more promising candidate for this design.
As discussed at the end of Section II, at least a 5-bit DAC is needed to satisfy the RLM requirement. Taking a margin for the other errors that could be introduced, a 6-bit DAC is chosen in this design. Fig. 7 shows the schematic and layout of the DAC's least significant bit (LSB) cell. To achieve the linearity performance, the integral nonlinearity (INL) caused by current source mismatches should be safely below 0.5 LSB at both CT and RT. According to [44], the maximum INL (in LSB) of an n-bit DAC with $N = 2^n$ voltage levels may be approximated by

$$\mathrm{INL}_{\max} \approx \frac{\sqrt{N}}{2}\,\frac{\sigma_i}{I_0} \tag{4}$$

where $I_0$ is the LSB current and $\sigma_i$ is its standard deviation.
On the other hand, by using the Croon model [45], the drain-current mismatch can be predicted by

$$\left(\frac{\sigma_i}{I_0}\right)^2 \approx \frac{1}{W_4 L_4}\left[A_\beta^2 + \left(\frac{g_m}{I_0}\right)^2 A_{V_{TH}}^2\right] \tag{5}$$

where $W_4$, $L_4$, and $g_m$ represent the width, length, and transconductance of the current source transistor ($M_4$), respectively. Moreover, $A_{V_{TH}}$ and $A_\beta$ are, respectively, the threshold-voltage and current-factor proportionality parameters [46]. According to [11], for long-channel NMOS transistors in 40-nm CMOS, the measured $A_{V_{TH}}$ and $A_\beta$, respectively, increase 1.5× and 3× from RT to 4.2 K, indicating a more severe device mismatch at CT. As can be gathered from (5), to achieve a lower drain-current mismatch, $g_m/I_0$ should be minimized. Fig. 8(a) shows $g_m/I_0$ versus overdrive voltage for a 2-µm/240-nm transistor at RT and CT. It can be gathered that, in weak inversion, $g_m/I_0$ increases by 3× at CT, while in strong inversion, it stays constant [47], [48]. Hence, the tail sources should be biased in strong inversion with sufficient margin to avoid the increase in current mismatch.
The active area of the $M_4$ transistor is then calculated by substituting (5) into (4)

$$W_4 L_4 \geq \frac{N}{4\,\mathrm{INL}_{\max}^2}\left[A_\beta^2 + \left(\frac{g_m}{I_0}\right)^2 A_{V_{TH}}^2\right] \tag{6}$$

Now, assuming that the maximum allowed INL error should be within LSB/8, and considering $A_\beta \approx 3\%\cdot\mu$m, $A_{V_{TH}} \approx 10$ mV$\cdot\mu$m, and $(g_m/I_0) \approx 10$ S/A, the transistor core area should be at least 11 µm². Increasing the size further would create a large parasitic capacitance, reducing the output impedance of the current source at high frequencies.
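The 11-µm² figure can be reproduced with a short calculation. The sketch below assumes the INL bound scales as (√N/2)·σ/I0 and that the relative current mismatch follows a Pelgrom/Croon-style area law; these forms are assumptions chosen because they reproduce the quoted result, and the function name is illustrative.

```python
def min_core_area_um2(n_bits, inl_max_lsb, a_beta_um, a_vth_v_um, gm_over_i):
    """Minimum W*L (um^2) so that the INL bound stays below inl_max_lsb.

    Assumed model: INL_max = sqrt(N)/2 * sigma/I0 and
    (sigma/I0)^2 = (A_beta^2 + (gm/I0)^2 * A_VTH^2) / (W*L).
    """
    n_levels = 2**n_bits
    sigma_max = 2 * inl_max_lsb / n_levels**0.5          # allowed sigma/I0
    mismatch = a_beta_um**2 + (gm_over_i * a_vth_v_um)**2
    return mismatch / sigma_max**2

area = min_core_area_um2(
    n_bits=6,
    inl_max_lsb=1 / 8,
    a_beta_um=0.03,     # 3 % * um
    a_vth_v_um=0.010,   # 10 mV * um
    gm_over_i=10.0,     # S/A
)
chosen = 30.0 * 0.4     # W4 = 30 um, L4 = 400 nm as chosen later in the text
print(f"minimum area ~ {area:.2f} um^2, chosen area = {chosen:.1f} um^2")
```

With these inputs the threshold-voltage term dominates the mismatch, which is why keeping g_m/I_0 low (strong inversion) matters more than the current-factor term.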
Another contributor to nonlinearity is the finite output impedance of the current sources [49]. Depending on the DAC input code, the impedance seen at the output node varies, thus distorting the output voltage and degrading the uniformity of eye heights. Assuming $m$ current switches ($M_{1,2}$) are active on the left and $N - m$ on the right branch [50], the DAC output voltage may be estimated by (7), where $R_L$ is the 50-Ω load, and $Z_o$ is the output impedance of the LSB current source. Due to the finite output impedance of the current sources, the voltage difference between the maximum and minimum levels is reduced, as given by (8). On the other hand, the minimum eye height can be approximated by (9). Replacing (8) and (9) in (3) yields the RLM expression (10). To obtain RLM ≥ 95%, $|Z_o|$ should be at least 5 kΩ. Moreover, by using (8), $I_0$ should be larger than 220 µA to achieve an 800-mV$_{pp}$ differential output swing. To achieve the required output resistance, a cascode transistor $M_3$ is added to the current source. $M_3$ uses a minimum-length transistor, since its drain parasitic capacitance limits the output impedance of the current source and, thus, DAC linearity at high frequencies. Consequently, by considering the required current, output impedance, and device matching, $L_4 = 400$ nm and $W_4 = 30$ µm are chosen. Note that the transistor's Early voltage and output resistance drop at CT [47]. Hence, the output resistance is overdesigned at RT. Fig. 8(b) shows the simulated output impedance of the DAC LSB cell at RT, offering an impedance over 5 kΩ for frequencies below ∼4.7 GHz.
The DAC unary and binary segmentation is decided based on the trade-off between the DAC differential nonlinearity (DNL) and the TX area and power consumption. The total number of TX slices is determined by

$$N_{\mathrm{slices}} = n_b + 2^{n_u} - 1 \tag{11}$$

where $n_b$ and $n_u$ are the number of binary and unary bits, respectively. Note that each slice requires a complete serializer path. Therefore, it is preferred to employ a fully binary DAC to optimize chip area and power consumption. However, this approach increases DNL significantly [51], as the DNL error due to the major transition, approximated by (12), grows with the number of binary bits. By considering this trade-off, a 3-bit binary, 3-bit unary DAC is chosen. In this way, the number of slices reduces from 63 for a fully unary structure, to 10, and the DNL improves from 0.5 LSB, for a fully binary structure, to 0.18 LSB. Further increasing the number of unary bits, e.g., a 4-bit unary DAC, could achieve 0.125-LSB DNL but would require 17 TX slices, thus increasing the digital power consumption of the serializing paths by 70%.
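The segmentation trade-off can be tabulated with a small sketch. The slice count follows directly from one slice per binary bit plus one per unary element. The DNL model is an assumption: it scales the quoted 0.5-LSB fully binary value by 2^(n_b/2), which happens to reproduce the 0.18-LSB and 0.125-LSB figures in the text, and is not taken from the cited reference.

```python
def num_slices(n_binary, n_unary):
    """One serializer slice per binary bit plus one per unary element."""
    return n_binary + 2**n_unary - 1

def dnl_estimate_lsb(n_binary, n_binary_ref=6, dnl_ref_lsb=0.5):
    """Assumed scaling: major-transition DNL grows as sqrt(2^n_binary)."""
    return dnl_ref_lsb * 2 ** ((n_binary - n_binary_ref) / 2)

options = {(n_b, 6 - n_b): (num_slices(n_b, 6 - n_b), dnl_estimate_lsb(n_b))
           for n_b in range(0, 7)}

for (n_b, n_u), (slices, dnl) in sorted(options.items(), reverse=True):
    print(f"{n_b}b binary + {n_u}b unary: {slices:2d} slices, DNL ~ {dnl:.3f} LSB")

extra_power = num_slices(2, 4) / num_slices(3, 3) - 1
print(f"going from 3b to 4b unary adds {extra_power:.0%} serializer paths")
```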
The biasing circuit shown in Fig. 7(a) mirrors a reference current by a factor of 5 to generate the LSB current of 0.25 mA. The reference current is generated internally by an unsilicided resistor, which is fairly constant over temperature [10], [40]. To allow for wide temperature operation with constant output swing, the bias current can be adjusted digitally by trimming the coarse and fine current mirror DAC. In layout, the big current source transistors, $M_{3,4}$, are split symmetrically in two, such that the height of DAC unit cells can be reduced to 3.3 µm, as illustrated in Fig. 7(b). This way, the length of the wires connecting 4:1 MUX outputs to DAC cells is minimized, thus reducing parasitics and delay mismatches of the input signals.
During the input data transition, the voltage across the current source is affected. This data-dependent disturbance couples to the bias voltage of the current source through the $M_{3,4}$ parasitic capacitances, degrading the DAC settling and linearity [52]. To reduce this coupling, large and distributed decoupling capacitors are placed at the $M_{3,4}$ gate bias voltages, and special attention is given to reducing the $M_3$ drain-source and $M_4$ gate-drain parasitic capacitances during layout. Besides, the distribution of the bias voltages is done perpendicular to the data lines. Moreover, since the substrate sheet resistance increases by five orders of magnitude at CT [41], substrate contacts are placed around and close to DAC unit cells to avoid possible distortions due to the floating body of transistors. The resulting DAC structure and layout are shown in Fig. 9. The unit-cell slices have been laid out in symmetric rows and connected by H-trees to minimize delay mismatch between the different branches.

B. 4:1 Multiplexer
As shown in Fig. 10(a), a CML-based 4:1 MUX is chosen, because it can reach higher speeds than conventional logic, as it only needs to steer a small current rather than switch a full-swing voltage. The circuit combines the 25% duty-cycle, rail-to-rail interleaved input pulses in the current domain to generate a differential output data stream, directly driving the associated DAC element. The MUX is designed without a tail transistor to counteract the reduced voltage headroom due to the higher threshold voltage at CT [12]. Fig. 10(b) displays the functioning of the 4:1 CML MUX with a transient simulation of a "1011" sequence. All interleaved data pulses pull down the left branch (i.e., $X_n$) except for the second data pulse (i.e., $D_{1,n}$), which pulls down the right branch (i.e., $X_p$). This results in a differential "1011" data stream at the output nodes.
Note that the MSB cells have a larger input capacitance than the LSB cells in the DAC structure. Consequently, if the same component values are utilized in the 4:1 MUX of all slices, the data reach the DAC input ports with different delays, thus increasing the system's deterministic jitter. To mitigate this issue, the load conductance and pseudo-differential pairs are proportionally scaled with the size of their corresponding DAC slice to achieve a similar bandwidth and voltage amplitude at the output of all 4:1 multiplexers, thus preventing any systematic delay mismatch.

C. Quarter-Rate Retimer
The quarter-rate retimer, complementary data generator, and pulse generator prepare the data for the 4:1 MUX. The circuit schematic and its timing diagram are shown in Fig. 11. The last stage of the 64:4 MUX uses a different clock tree (i.e., ψ 0 ) than the pulse generator running at the 4-UI domain (i.e., φ 0−3 ). This results in an unknown time offset, which can hardly be predicted by simulating the extracted layout. Therefore, to prevent any data violation, a flip-flop retimer is added whose clock phase (i.e., φ i ∈ {φ 0 , . . ., φ 3 }) can be selected with a multiplexer-based phase rotator. Next, the data are converted to complementary form and again retimed to ensure any delay mismatch caused by the additional inverter in one of the complementary paths is removed.
The pulse generator then generates the required 25% duty-cycle pulses (i.e., 1-UI) by combining two 50% overlapping 4-UI clock phases and the complementary data using two AND gates in cascade. Cascaded AND gates have been chosen over a single three-input AND gate, as the latter would have multiple stacked devices, resulting in a much slower rise/fall time at CT due to the limited voltage overdrive. In this design, the pulse generation is done locally, since alternatively distributing complementary 25% non-overlapping clocks would require ∼2× more power-hungry buffers in the clock path and set tighter constraints on the clock distribution. Similar to the output network, the differential phases of the quadrature clocks are distributed through H-trees and upper level metals to minimize power consumption and clock delay mismatches.
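The pulse-generation and interleaving scheme can be illustrated with a purely behavioral model. In the sketch below, each quarter-rate clock phase is high for 2 of every 4 UIs, a first AND of two adjacent phases opens a 1-UI window, and a second AND gates that window with the data bit. The mapping of phase pairs to data slots and all function names are assumptions for illustration, not the circuit's actual wiring.

```python
def phi(k, t):
    """50% duty-cycle quarter-rate clock phase k at UI index t (period = 4 UI)."""
    return (t - k) % 4 < 2

def window(k, t):
    """First AND gate: two overlapping phases give a 1-UI (25% duty-cycle) window."""
    return phi((k - 1) % 4, t) and phi(k, t)

def serialize(words):
    """Second AND gate plus 4:1 interleaving of the four gated data pulses."""
    out = []
    for w, word in enumerate(words):
        for t in range(4 * w, 4 * w + 4):
            pulses = [bool(word[k]) and window(k, t) for k in range(4)]
            out.append(int(any(pulses)))  # only one window is open per UI
    return out

stream = serialize([[1, 0, 1, 1], [0, 0, 1, 0]])
print(stream)
```

Because exactly one window is open in every UI, the four gated pulses never overlap, which is what allows the CML MUX to sum them in the current domain.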
Considering the worst-case scenario and the timing constraint diagram shown in Fig. 11, the delay between the triggering edge of the second retimer (i.e., φ 0 ) and the first sampling edge (i.e., φ i = φ 3 ) must be long enough to accommodate the inverter propagation delay ($T_{\mathrm{inv}}$), the edge-to-edge jitter between the φ 0 and φ 3 clock phases, the flip-flop clock-to-q delay ($T_{cq}$), and the setup time ($T_{su}$). The total edge-to-edge jitter is a combination of peak-to-peak deterministic jitter $\mathrm{DJ}_\phi$ caused by clock skew and device mismatches, as well as random jitter $\mathrm{RJ}_\phi$ due to noise. Targeting a BER of $10^{-12}$, we aim for a 14-sigma (i.e., ±7σ) design for the random jitter. Consequently, the maximum baud rate of the system may be estimated by

$$f_{\mathrm{baud,max}} \approx \frac{1}{T_{\mathrm{inv}} + T_{cq} + T_{su} + \mathrm{DJ}_\phi + 14\,\mathrm{RJ}_\phi} \tag{13}$$

Considering $T_{\mathrm{inv}} = 12$ ps, a simulated $T_{cq} = T_{su} = 22$ ps for a high-speed flip-flop at RT, a simulated $\mathrm{DJ}_\phi = 934$ fs from the extracted layout, and a combined random jitter of 48 fs$_{rms}$ from the divider and potential PLL clock source [53], a maximum speed of ∼17.4 Gbaud can be achieved.
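The ∼17.4-Gbaud estimate can be checked numerically. The sketch below assumes that the listed delays and jitter terms must fit within one UI, a form chosen because it reproduces the quoted result with the quoted inputs; the function name is illustrative.

```python
def retimer_max_baud(t_inv, t_cq, t_su, dj_pp, rj_rms, n_sigma=14):
    """Maximum baud rate if inverter delay, clock-to-q, setup time, and
    jitter (DJ peak-to-peak plus n_sigma * RJ rms) must fit in one UI."""
    return 1.0 / (t_inv + t_cq + t_su + dj_pp + n_sigma * rj_rms)

rt_baud = retimer_max_baud(
    t_inv=12e-12, t_cq=22e-12, t_su=22e-12, dj_pp=934e-15, rj_rms=48e-15
)
print(f"RT estimate: {rt_baud / 1e9:.2f} Gbaud")

# Sensitivity: the two flip-flop terms dominate the budget.
budget_ps = {"T_inv": 12.0, "T_cq": 22.0, "T_su": 22.0,
             "DJ": 0.934, "14*RJ": 14 * 0.048}
for name, value in budget_ps.items():
    print(f"{name:6s} {value:6.2f} ps ({value / sum(budget_ps.values()):.0%})")
```

Jitter accounts for only a few percent of the budget, so the speed limit is set almost entirely by gate and flip-flop delays.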
To ensure that the flip-flop retimers do not slow down at CT, the use of transmission gates or multiple stacked devices should be avoided due to the increased threshold voltage. Therefore, in this design, a true single-phase clock (TSPC) dynamic flip-flop is adopted, limiting the number of stacked devices and maximizing the system's speed. The drawback of TSPC logic is its susceptibility to transistor leakages originating from sub-threshold conduction, gate oxide, and source and drain junctions [54]. Since the states are stored on the parasitic capacitance of small-sized transistors, if the clock period is excessively long (i.e., ≫10 ns), the leakage current could discharge this capacitor, thereby corrupting the logic state. This sets a constraint on the minimum operating frequency of TSPC flip-flops. Since the TSPC flip-flop is only used in the quarter-rate retimer, it is intended to only work at frequencies higher than a few GHz. Moreover, at CT, due to the 3× steeper sub-threshold slope and increased threshold voltage, the sub-threshold conduction and leakage of source/drain junctions become significantly lower [55], [56]. The gate leakage is also expected to decrease by ∼2× [57]. Hence, the minimum operating frequency of the TSPC flip-flop becomes significantly lower at CT. Due to the transistors' higher mobility at CT [13], $T_{cq}$ and $T_{\mathrm{inv}}$ are expected to reduce by ∼10%, thus increasing the maximum baud rate to ∼20 Gbaud. Besides, the TX speed can be further improved in a future implementation by removing the inverter from the critical timing path and realizing complementary data streams earlier in the serializer chain. However, this comes at the cost of power consumption and complexity due to the extra serializing paths needed for the additional inverted signals.

D. 64:4 Multiplexer
For serializing at lower data rates, a conventional 2:1 MUX cell can be used. Each conventional 2:1 MUX cell has one selector and three latches, two of which block glitches from previous stages [58]. A D:Q (e.g., 64:4) binary-tree CMOS MUX requires $D - Q$ (e.g., 60) 2:1 MUX cells and $3 \times (D - Q)$ (e.g., 180) latches operating at frequencies ranging from $f_{baud}/D$ (e.g., 312.5 MHz) to $f_{baud}/(2Q)$ (e.g., 2.5 GHz), thus increasing the TX power consumption. As shown in Fig. 12, to reduce power, the input and output latches of the MUX tree are maintained, but the intermediate latches are removed. If all selectors in each MUX rank are clocked with the same clock phase, the delay of the selectors ($T_D$) and frequency dividers ($T_{div}$) will eventually limit the baud rate at the output of the D:Q MUX to

Considering $T_D \approx 25$ ps, $T_{div} \approx 25$ ps, and $T_{su} \approx 22$ ps, the maximum baud rate of the 64:4 MUX cannot be higher than 4.5 Gbaud, thus limiting the system speed. Alternatively, the selectors can be clocked using the available quadrature phases of the clocks, as proposed in [59]. As shown in Fig. 13, the lower rank selectors use the quadrature clocks generated by the divided clock of their corresponding higher rank selector. In this way, the required delay between each selector's inputs is guaranteed by the appropriate selection of available clock edges in successive stages. Therefore, this structure is scalable as long as the sum of the selector and divider delays does not exceed half the period of the highest rank clock. Since the intermediate-rank selectors mitigate the effect of the data and divider delays of their previous stages, the maximum baud rate of this structure is mainly determined by its output retimer. Hence

where $T_{div,\varphi\to}$ is the divider delay between the $\varphi$ and clock domains, and $DJ_{\varphi,}$ accounts for the additional delay between these two clocks due to the routing. Considering $T_{div,\varphi\to} = 22$ ps, $T_{su} = 22$ ps, $DJ_{\varphi,} = 1.38$ ps, and RJ = 51 fs rms, the maximum baud rate of the 64:4 MUX is 21.7 Gbaud, thus not limiting the performance.
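The counts and limits quoted in this subsection can be checked with a few lines of arithmetic. The sketch below is not the authors' derivation. It assumes that, with a single clock phase per rank, the selector and divider delays accumulate over all $\log_2(D/Q)$ ranks before the setup time of the output latch. With quadrature clocking, it assumes that only one divider delay, the setup time, the deterministic skew, and a peak-to-peak random-jitter allowance of 14.069 times the rms value (a BER of $10^{-12}$) remain. Both forms are assumptions chosen because they reproduce the quoted 4.5 and 21.7 Gbaud; the function names are hypothetical.

```python
import math


def mux_tree_cost(d, q, f_baud_hz):
    """Cell count, latch count, and clock range of a D:Q binary-tree MUX."""
    cells = d - q                   # one 2:1 MUX cell per merged pair of lanes
    latches = 3 * cells             # three latches per conventional 2:1 cell
    f_low = f_baud_hz / d           # slowest rank
    f_high = f_baud_hz / (2 * q)    # fastest rank
    return cells, latches, f_low, f_high


def single_phase_limit_gbaud(d, q, t_d_ps, t_div_ps, t_su_ps):
    """Assumed bound: delays add up over every rank (same phase per rank)."""
    ranks = int(math.log2(d // q))
    return 1e3 / (ranks * (t_d_ps + t_div_ps) + t_su_ps)


def quadrature_limit_gbaud(t_div_ps, t_su_ps, dj_ps, rj_rms_ps, rj_factor=14.069):
    """Assumed bound: only the output retimer's timing budget remains."""
    return 1e3 / (t_div_ps + t_su_ps + dj_ps + rj_factor * rj_rms_ps)


cells, latches, f_low, f_high = mux_tree_cost(64, 4, 20e9)
print(cells, latches, f_low / 1e6, f_high / 1e9)
print(round(single_phase_limit_gbaud(64, 4, 25, 25, 22), 1))
print(round(quadrature_limit_gbaud(22, 22, 1.38, 0.051), 1))
```

With the values from the text, the sketch gives 60 cells, 180 latches, a clock range of 312.5 MHz to 2.5 GHz, and limits that round to 4.5 and 21.7 Gbaud.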

E. Clock Generation
As described in the architecture in Fig. 6, both the quarter-rate retimer and the 64:4 multiplexer make extensive use of quadrature clock phases. An external differential clock input is divided using a CML-based quadrature divide-by-2 circuit to generate the four overlapping 5-GHz clock phases ($\varphi_{0-3}$), as shown in Fig. 14(a) and (b). The subsequent divide-by-2 circuits for generating the lower frequency clock phases ( , $\delta$, $\theta$, and $\phi$) utilize cascaded C$^2$MOS latches shown in Fig. 14(c) and (d). To ensure the correct clock phase relation, the bottom two C$^2$MOS latches in Fig. 14 are reset at the start; thereby, the subsequent divided clock phases will come in the correct order. The clock distribution is illustrated in Fig. 15. The highest frequency quadrature clock (i.e., $\varphi_{0-3}$) is distributed directly to the quarter-rate retimers with an H-based clock tree similar to that of the output combiner to minimize delay mismatches. The lower frequency clocks (i.e., $\phi$, $\theta$, $\delta$, and ) are routed to the center of the 64:4 MUX and then distributed over the digital circuitry to keep the clock skew low. The synthesized SRAM is clocked with $\phi_0$, while the 64-UI retimers are clocked with a selectable $\phi_i \in \{\phi_0, \ldots, \phi_7\}$ to compensate for the skew in the data path. The data distribution is routed perpendicular to the clock lines to minimize crosstalk. Similar to the DAC, dummy retimer and multiplexer rows are added on the left and right, since it is crucial for the timing to keep device mismatches low.
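As a quick check on the clock plan, the divide-by-2 chain can be tabulated. The sketch assumes the 10-GHz external clock used in the measurements and five cascaded divide-by-2 stages (one CML stage followed by four C$^2$MOS stages, one per lower-frequency clock). The stage count and the helper name are assumptions made for illustration.

```python
def divider_chain(f_in_hz, stages):
    """Return the output frequency of each cascaded divide-by-2 stage."""
    freqs = []
    f = f_in_hz
    for _ in range(stages):
        f /= 2
        freqs.append(f)
    return freqs


# 10-GHz input: the first stage gives the 5-GHz quarter-rate phases,
# the remaining four stages give the lower-frequency serializer clocks.
chain = divider_chain(10e9, 5)
for f in chain:
    # A quadrature divider yields four phases spaced by a quarter period.
    print(f"{f / 1e9:.4f} GHz, phase spacing {1e12 / (4 * f):.0f} ps")
```

The last entry, 312.5 MHz, matches the $f_{baud}/D$ rate quoted for the 64:4 MUX at 20 Gbaud.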

IV. MEASUREMENT RESULTS
The wireline TX has been fabricated in a standard 40-nm bulk CMOS process. Fig. 16(a) shows the die micrograph. The total active area, excluding the SRAM, is 0.146 mm$^2$. Due to the self-heating effect at CT [60], as a precaution, the digital 64:4 serializer, CML 4:1 MUX, and DAC are separated from the SRAM by at least 100 µm. In addition, since the heat dissipation through the silicon substrate with a small contact area (i.e., ∼1 mm$^2$) is very low [61], and there is no convection possible in the vacuum chamber, most heat needs to be dissipated by the metal connections. Therefore, it is especially important that the on-chip power lines are designed with thick upper metals to dissipate the heat.
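The chip has been characterized at both 4.2 and 300 K to test the functionality over the required operating temperature range. The chip was measured inside a Lakeshore CPX cryogenic probe station. The measurement setup is shown in Fig. 17. The die is glued onto a sample PCB, which is clamped on top of the 4.2-K sample stage of the probe station. No solder mask is used on the bottom of the PCB, and extra vias are added to the ground plane to thermally anchor the chip to the cold plate. A 10-GHz single-ended external clock with 18.5-fs rms jitter is generated using the R&S SMA-100B and connected to the chip with an on-PCB balun. Then, the differential clock signals are terminated by two series on-chip 50-Ω resistors and amplified by an ac-coupled buffer to drive the frequency dividers. The center tap of the termination resistors is connected to ground to provide a return path for common-mode signals. The output pads are probed using a Cascade 100-µm pitch GSGSG Z-probe rated for CTs. The SRAM test sequences are loaded with a serial peripheral interface (SPI) module, connected through the dc wire connections of the probe station.

The channel loss ($S_{dd21}$) of the test setup, including the probe, a 30-cm cable inside the probe arm, and a 10-cm cable outside the probe station, was measured with a vector network analyzer (VNA). First, the error terms up to the output ports of the VNA were measured using the Anritsu 3652K calibration kit. Second, the complete channel, including cables and probe, was calibrated using the short-open-load-reciprocal (SOLR) standard of a CSR-30 calibration substrate at both CT and RT. The total channel loss of the setup could then be extracted by taking the difference in the error terms of the VNA calibration and the complete channel calibration. Fig. 18(a) shows that the measured channel loss is ∼8 dB at 20 GHz for both temperatures.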
The output reflection coefficient ($S_{11}$) of the chip was measured and de-embedded using the same CSR-30 calibration substrate. As can be gathered from Fig. 18(b), without any trimming, the chip is completely matched at low frequencies at 4.2 and 300 K, indicating that the sheet resistance of the unsilicided polysilicon termination resistors is fairly constant over temperature [10], [40]. Moreover, the bandwidth in which $S_{11}$ remains below −10 dB is ∼10% larger at CT, mainly due to the reduction of the parasitic capacitance to ground as the silicon substrate becomes highly resistive [62].
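To see why a smaller parasitic capacitance widens the matched bandwidth, consider a first-order model in which the output looks like a termination resistor in parallel with a shunt capacitance. This is an illustrative sketch, not the measured network. The 50-Ω single-ended reference, the ∼200-fF pad and ESD capacitance (the value quoted for the chip-bandwidth simulation in Fig. 3), and the 9% capacitance reduction at CT are all assumptions chosen for the example.

```python
import math

Z0 = 50.0  # reference impedance in ohms (assumed single-ended)


def s11_db(f_hz, r_ohm, c_farad):
    """|S11| in dB of a resistor in parallel with a shunt capacitor (f_hz > 0)."""
    z_load = 1 / (1 / r_ohm + 1j * 2 * math.pi * f_hz * c_farad)
    gamma = (z_load - Z0) / (z_load + Z0)
    return 20 * math.log10(abs(gamma))


def matched_bandwidth_hz(c_farad, limit_db=-10.0):
    """Frequency where |S11| reaches limit_db, assuming a perfect dc match (R = Z0)."""
    g = 10 ** (limit_db / 20)           # |S11| as a linear magnitude
    x = 2 * g / math.sqrt(1 - g ** 2)   # solves x / sqrt(4 + x^2) = g, x = w*R*C
    return x / (2 * math.pi * Z0 * c_farad)


bw_rt = matched_bandwidth_hz(200e-15)          # assumed RT capacitance
bw_ct = matched_bandwidth_hz(0.91 * 200e-15)   # assumed 9% lower at CT
print(f"RT: {bw_rt / 1e9:.1f} GHz, CT: {bw_ct / 1e9:.1f} GHz, "
      f"ratio {bw_ct / bw_rt:.2f}")
```

In this model the −10-dB bandwidth is inversely proportional to the capacitance, so a bandwidth increase of about 10% corresponds to a capacitance reduction of roughly 9%.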
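The INL and DNL are extracted from a ramp signal generated with the SRAM, and the results are shown in Fig. 19. The largest jumps in DNL happen at every eighth code and are attributed to the 3-bit unary-to-binary transitions. The peak-to-peak INL is 0.8 LSB at RT and increases to 1.2 LSB at CT. This is expected as the device matching degrades, and the DAC shows more third-order nonlinearity, since the output impedance of its current sources reduces, as discussed in detail in Section III-A. To evaluate the DAC's performance without being limited by the clocks' deterministic and random jitter, a relatively low-frequency (i.e., ∼20 MHz) sine wave was loaded into the on-chip SRAM, and the output of the 20-GS/s wireline TX was captured by a sampling scope. At RT and CT, a signal-to-noise-and-distortion ratio (SINAD) of 35.8 and 36.2 dB is achieved, thus satisfying the linearity requirement and showing the reduction of the DAC's output resistance at CT.

The clock's deterministic and random jitter limits the DAC performance at high frequencies. To evaluate this, a 20-Gb/s 1010 NRZ sequence (i.e., a 10-GHz periodic signal) is generated, and the resulting waveform and spectrum are measured at the TX output at 4.2 and 300 K, as shown in Fig. 20. Note that the output of the CML 4:1 MUX is not retimed while traveling to the TX output. Consequently, any systematic duty-cycle mismatch among the four non-overlapping clocks results in deterministic jitter at the output. Due to the duty-cycle mismatches in the 4:1 MUX, spurious tones with a level of approximately −26 dBc at ±5 GHz from the 10-GHz carrier are observed. This results in a deterministic rms (peak-to-peak) jitter of 1.12 ps (3.17 ps) at CT and 1.14 ps (3.22 ps) at RT [59], [63]. This amount of deterministic jitter is somewhat higher than the simulated value of 1 ps but still low enough to satisfy the J4u requirement for our targeted baud rate. To alleviate this issue in a future version, a duty-cycle error correction circuit similar to [64] can be implemented.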
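Two of the figures of merit discussed here follow from short calculations, sketched below under stated assumptions. The first function derives DNL and INL from the output levels of a code ramp using an endpoint-fit LSB; the source does not say which fit was used, so this is one common convention. The second converts a spur level into deterministic jitter, assuming a single sinusoidal timing error whose sidebands carry the quoted level. With −26 dBc on a 10-GHz carrier, it gives values close to those reported. Function names and the example ramp are hypothetical.

```python
import math


def dnl_inl(levels):
    """DNL and INL in LSB from measured ramp levels (endpoint fit)."""
    lsb = (levels[-1] - levels[0]) / (len(levels) - 1)
    dnl = [(b - a) / lsb - 1 for a, b in zip(levels, levels[1:])]
    inl = [(v - levels[0]) / lsb - k for k, v in enumerate(levels)]
    return dnl, inl


def spur_to_jitter_ps(spur_dbc, f_carrier_hz):
    """RMS and peak-to-peak jitter (ps) implied by one sinusoidal spur."""
    rms_rad = math.sqrt(2) * 10 ** (spur_dbc / 20)   # rms phase deviation
    rms_s = rms_rad / (2 * math.pi * f_carrier_hz)
    return rms_s * 1e12, 2 * math.sqrt(2) * rms_s * 1e12


# Hypothetical 4-code ramp with one wide step, only to exercise the function.
dnl, inl = dnl_inl([0.0, 1.0, 2.2, 3.0])
print([round(x, 2) for x in dnl], [round(x, 2) for x in inl])

rms_ps, pp_ps = spur_to_jitter_ps(-26.0, 10e9)
print(f"{rms_ps:.2f} ps rms, {pp_ps:.2f} ps peak-to-peak")
```

The jitter estimate lands between the CT and RT values quoted in the text, which suggests that the reported jitter was derived from the spur level in a similar way, although the source does not state this explicitly.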
The eye diagrams shown in Fig. 21 were measured using a Keysight N1094B sampling scope without using additional equalization or de-embedding. The SRAM is programmed with a $2^{15}$-length pseudorandom binary sequence (PRBS-15) for NRZ and a quaternary QPRBS-15 for PAM4. The maximum speeds with sufficient eye opening at a BER of $<10^{-15}$ are explored for different modulation schemes and operating temperatures. At RT, a maximum speed of 20-Gb/s NRZ and 36-Gb/s PAM4 is achieved. For the highest baud rates at RT, the measured eye heights (widths) of NRZ and PAM4 are 231 mV (0.65 UI) and >24.7 mV (0.28 UI), respectively, with 96.5% RLM. At CT, due to the mobility improvement, the baud rate could be increased, and therefore, 25-Gb/s NRZ and 20- and 40-Gb/s PAM4 are measured. At CT, the measured eye heights (widths) of NRZ and PAM4 are 216 mV (0.73 UI) and >38.5 mV (0.47 UI), respectively, with 97.8% RLM. Consequently, at the maximum data rate, the TX achieves 2.46-pJ/b (2.47-pJ/b) energy efficiency at CT (RT), while consuming 98.6 mW (88.8 mW) from a 1.1-V power supply. The power breakdown charts, excluding the SPI controller and SRAM, are shown in Fig. 16(b). The dynamic power remains almost constant at RT and CT, since the power consumptions of the digital part of the serializer and DAC are mainly determined by $C V_{DD}^2 f$ and the required voltage swing, respectively. The static power consumption of the digital circuits was measured when the chip was idle. Due to the larger threshold voltage, higher subthreshold slope, and lower gate leakage, the static power consumption of the multiplexers and SRAM reduces from 1.55 mW at RT to 0.17 mW at CT.
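The headline numbers in this paragraph can be cross-checked with two short helpers. The energy per bit is simply the power divided by the data rate. For the RLM, the sketch uses the level-separation-mismatch definition commonly applied to PAM4 transmitters (the minimum of 3·ES1, 3·ES2, 2−3·ES1, and 2−3·ES2). The source does not spell out its definition, so treating it as this one is an assumption, and the example levels are hypothetical.

```python
def energy_per_bit_pj(power_w, data_rate_bps):
    """Energy efficiency in pJ/b."""
    return power_w / data_rate_bps * 1e12


def rlm(v0, v1, v2, v3):
    """Ratio of level mismatch for four PAM4 levels (v0 lowest, v3 highest)."""
    v_mid = (v0 + v3) / 2
    es1 = (v1 - v_mid) / (v0 - v_mid)   # inner-to-outer ratio, lower half
    es2 = (v2 - v_mid) / (v3 - v_mid)   # inner-to-outer ratio, upper half
    return min(3 * es1, 3 * es2, 2 - 3 * es1, 2 - 3 * es2)


print(f"CT: {energy_per_bit_pj(98.6e-3, 40e9):.3f} pJ/b")
print(f"RT: {energy_per_bit_pj(88.8e-3, 36e9):.3f} pJ/b")

# Hypothetical levels in volts: ideal spacing, then a compressed top eye.
print(rlm(-0.3, -0.1, 0.1, 0.3))
print(rlm(-0.3, -0.1, 0.1, 0.29))
```

Both energy results fall within 0.01 pJ/b of the 2.46 and 2.47 pJ/b quoted above, and equally spaced levels give an RLM of 1 (100%).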
Table I summarizes the performance of the proposed TX and compares it with relevant prior DAC-based and multi-tap PAM4 wireline TXs. This work achieves a data rate and energy efficiency similar to prior art, while it stands out by maintaining its linearity and RLM down to CT. The drawbacks associated with cryo-CMOS devices are mitigated, while their speed enhancement is harnessed to achieve full performance. Moreover, by demonstrating, for the first time, both full functionality and high efficiency over the wide temperature range, this work addresses the required high-speed wireline link for quantum computing applications.

V. CONCLUSION
This article presented the first PAM4 cryo-CMOS wireline TX for quantum computing applications. Based on QEC requirements and the channel loss, the specifications for the data link between the control electronics at CT and a classical processor at RT have been quantified. Guidelines were also developed to efficiently design different building blocks of the proposed DAC-based TX at CT. As summarized in Table II, by circumventing the drawbacks of cryo-CMOS devices (i.e., higher threshold, larger mismatch) and exploiting their higher speed, the TX maintains high power efficiency, linearity, and data rate down to CT. At CT (RT), the prototype achieves 40-Gb/s (36-Gb/s) PAM4 transmission with 2.46-pJ/b (2.47-pJ/b) efficiency and 97.8% (96.5%) RLM. Therefore, this work satisfies the requirements of a high-speed data link between classical and quantum processors, paving the way toward realizing large-scale quantum computers.

Fig. 2. (a) System diagram of quantum processor with repeated SC. The X/Z ancilla qubit measurement data are transferred at $D_{uplink}$ to a real-time decoder that updates the PFU. The PFU stores the estimated errors of the physical data qubits and transmits the correction instructions at $D_{downlink}$. (b) Timing diagram illustrating the SC cycle, including single-/two-qubit gate time and measurement time. (c) Timing diagram of the data round trip, including uplink, decoder, and downlink. To prevent backlog, the total data round-trip time should be lower than the SC cycle time.

Fig. 3. Estimated loss versus frequency. (a) Extracted simulation of signal attenuation due to limited chip bandwidth caused by the parasitic capacitance of the ESD and pads (∼200 fF). (b) Measured insertion loss of a typical coaxial cable connecting the fridge 4-K stage to its output connector at RT. (c) Total expected loss.

Fig. 7. (a) Schematic of the bias circuit and DAC LSB cell. (b) Layout of the DAC LSB cell.

Fig. 8. (a) RT simulation and CT measurement of $g_m/I_D$ curves for a 2-µm/240-nm transistor. (b) Simulated output impedance of the DAC LSB cell versus frequency.

Fig. 9. (a) DAC structure, representing the binary bits $B_{0-2}$ and unary bits $B_{3-5}$ shown in the gray-coded unit cells; the light blue cells symbolize dummies; the red and blue lines represent the output H-tree. (b) DAC layout, including the 4:1 MUX (left) and output matching network (right).

Fig. 10. (a) Schematic of the 4:1 CML MUX. The pull-up resistors ($R_P$) are inversely proportional to the size of their corresponding DAC slice with an input parasitic capacitance of $N \times C_{DAC}$. (b) Simulated input and output waveforms of the 4:1 CML MUX for a "1011" sequence.

Fig. 11. Quarter-rate retimer, complementary data generator, and 25% pulse generator. (a) Schematic illustrating the worst case critical path in red. (b) Timing diagram showing the total delay.

Fig. 12. Simplified schematic of a binary-tree CMOS MUX using a single clock phase in each rank: (a) circuit and (b) timing diagram.

Fig. 13. Simplified schematic of a binary-tree CMOS MUX using quadrature clock phases in each rank: (a) circuit and (b) timing diagram.

Fig. 15. Illustration of the clock distribution layout. The data distribution is done vertically, indicated in gray; the clock lines are distributed horizontally, indicated with the different colors.

Fig. 17. (a) Complete overview of measurement setup. (b) Test board inside the probe station.

Fig. 18. (a) Measured loss of the differential probe and the cable connecting the chip to the measurement instrument at RT. (b) Measured reflection coefficient ($S_{11}$) at the output of the chip at 4.2 and 300 K.

Fig. 20. Measured output of the TX delivering a 10-GHz periodic 1010 NRZ sequence: (a) differential output waveform and (b) its spectrum at 300 K, and (c) single-ended waveform and (d) its spectrum at 4.2 K. Note that the spectrum was measured with relative power.

TABLE I
COMPARISON TABLE WITH PRIOR DAC-BASED AND MULTI-TAP PAM4 WIRELINE TXS

TABLE II
SUMMARY OF THE CRYO-CMOS CHALLENGES, DIVIDED INTO DIFFERENT BEHAVIORS, THEIR CONSEQUENCE, AND THE IMPLEMENTED TECHNIQUE