A Benchmark of Cryo-CMOS Embedded SRAM/DRAMs in 40-nm CMOS

The interface electronics needed for quantum processors require cryogenic CMOS (cryo-CMOS) embedded digital memories covering a wide range of specifications. To identify the optimum architecture for each specific application, this article presents a benchmark from room temperature (RT) down to 4.2 K of custom SRAMs/DRAMs in the same 40-nm CMOS process. To deal with the significant variations in device parameters at cryogenic temperatures, such as the increased threshold voltage, lower subthreshold leakage, and increased variability, the feasibility of different memories at cryogenic temperature is assessed and specific guidelines for cryogenic memory design are drafted. Unlike at RT, the 2T low-threshold-voltage (LVT) DRAM at 4.2 K is up to $2\times$ more power efficient than both SRAMs for any access rate above 75 kHz since the lower leakage increases the retention time by $40\,000\times$, thus sharply cutting the refresh power and showing the potential of cryo-CMOS DRAMs in cryogenic applications.


I. INTRODUCTION
Quantum computers (QCs) can deliver an exponential speed-up for several computational problems [1], [2], [3], [4], [5], [6]. However, scaling up the number of quantum bits (qubits) to the thousands or millions necessary for useful computations requires an impractical number of wires connecting the cryogenic qubits to the room-temperature (RT) control electronics. To overcome this interconnect bottleneck, electronics integrated in commercial CMOS technology but operating at cryogenic temperature, i.e., cryogenic CMOS (cryo-CMOS), have been proposed [7], [8]. As the power consumption of the cryo-CMOS control electronics must be kept below the cooling power of the cryogenic refrigerators adopted in QC applications, designing power-efficient cryo-CMOS circuits is crucial.
The control electronics consist of analog/RF circuits directly interfacing with the qubits to perform operations and measurements, in combination with the digital system-on-chip (SoC) for scheduling the quantum-algorithm execution [9] and processing a large amount of measurement results, e.g., as required for quantum error correction [10], [11], [12], [13], [14]. In modern digital systems, significant fractions of the area and power are consumed by the memory, thus making the optimization of cryo-CMOS embedded memories essential. However, accurately estimating the power consumption of a memory at cryogenic temperatures is challenging due to the lack of reliable cryogenic device models.
Furthermore, the cryo-CMOS controllers will require memories for several distinct functions covering a wide range of access rates (read and write operations per second) and write/read (W/R) ratios, ranging from high-speed lookup tables for generating the waveforms for qubit control (multi-GHz, W/R = 0) [15], [16], [17] to low-speed buffer queues for the quantum-algorithm instructions (sub-MHz, W/R = 1) [9]. Static memories (SRAMs) are well-suited for high access-rate applications but they suffer from excessive operation energy and limited density. The density issue can be alleviated by dynamic memories (DRAMs), which store data as the charge on a (parasitic) capacitor and require fewer transistors per cell. Unfortunately, frequent refreshes are required to counteract charge leakage, resulting in a large power consumption independent of the access rate. While the charge leakage is strongly mitigated by the significant decrease in subthreshold leakage at cryogenic temperatures [18], [19], it is unclear whether a cryo-CMOS DRAM can outperform a cryo-CMOS SRAM, due to both the shortcomings of existing device models and the absence of comprehensive studies in the literature.
To overcome this issue, this work compares eight different dynamic and static memory cell designs, embedded in identical memory architectures in a nanometer CMOS process (TSMC 40-nm) typically adopted for QC cryo-CMOS interfaces, based on their experimental characterization at both RT and 4.2 K. Due to the limited cooling power available in dilution refrigerators, the main focus is on minimizing the memory power consumption. Since the power consumption of the dynamic memories is limited by their refresh power for medium-to-high frequency applications, a detailed characterization of the data-retention time is required for these cells. This article, an extension of our work in [55], is structured as follows. Section II offers a brief overview of the cryogenic effects in CMOS devices. Section III describes the circuit designs of the adopted memories, for which the experimental characterization is presented in Section IV. Section V discusses the results and Section VI concludes this article. The data shown in this article are also available here [56].
For full-swing digital circuits, the mobility increase compensates the effects of the larger $V_{th}$ and, together with the reduced resistance and capacitance, results in a speed-up of 10% to 20% for 40-nm bulk CMOS [65], [66], [67]. For more advanced technology nodes, the speed-up from RT to 4.2 K is reduced due to the increased relative importance of interconnect capacitance and lower supplies, enhancing the relative $V_{th}$ increase [65]. However, the speed-up could be recovered for FinFET technologies by scaling $V_{th}$ [40]. The increased $V_{th}$ and the steeper subthreshold slope lead to severely reduced subthreshold leakage, while gate leakage stays approximately constant (<2× smaller) [68]. For these digital circuits, this will result in greatly reduced leakage power, while keeping the dynamic power consumption similar.

III. CIRCUIT DESIGN
The memory cells in this work have been mainly optimized for maximum density, and, where possible, for optimum (expected) performance at cryogenic temperature. All memory cells are implemented in two versions, using either standard-threshold-voltage (SVT) or low-threshold-voltage (LVT) devices. LVT cells are expected to perform worse at RT since their higher subthreshold leakage reduces the retention time of dynamic memories and increases the static power consumption. At cryogenic temperatures, however, the $V_{th}$ increase may cause SVT designs to fail due to the insufficient overdrive voltage limiting the readout currents. Although forward-biasing the bulk-source voltage [66] could help circumvent the cryogenic $V_{th}$ increase, no individual bulk contacts have been employed to avoid an excessive increase in the design effort and the area of the memory cells. The memory peripherals always use LVT devices, unless otherwise noted, to ensure functionality at cryogenic temperatures and minimize their effect on memory performance, while the synthesized digital circuits, e.g., the controllers, adopt SVT devices with extra hold margin to anticipate the cryogenic logic speed-up.

A. 6T Static Cell
As the most commonly used embedded-memory cell, the conventional six-transistor static cell [6T, Fig. 1(a)] represents a good reference for comparison with alternative designs. It consists of a latch formed by two inverters ($M_{3\text{-}6}$) and two access transistors ($M_{1,2}$) that connect the latch nodes to the differential bitlines (BLs) (BL and $\overline{\mathrm{BL}}$). The latch state is written by differentially driving the BLs and pulling the wordline (WL) high. To read the state, both BLs are first precharged to $V_{DD}$ before enabling the WL. Then, the BL connected to the low side of the latch will be discharged by one of the pull-down transistors ($M_{5,6}$).
To minimize the cell area, most transistors have minimum size (W/L = 120 nm/40 nm). Since the cell design is ratioed, the pull-down transistors ($M_{5,6}$) are sized 1.5× larger (W/L = 180 nm/40 nm) to ensure writing and reading under device mismatch. For a fair comparison with the other cells, the static cell is manually implemented using the logic design rule check (DRC) rule set and occupies 0.435 µm$^2$ using a lithographically symmetrical layout [69]. This is 80% larger than the foundry-offered cells (0.242 µm$^2$ [70]) that violate several logic DRC rules.

B. 2T NW-PR Dynamic Cell
A higher density can be reached by dynamic memory cells, as they require fewer transistors. Since the popular one-transistor-one-capacitor (1T1C) dynamic cell [71] is only advantageous with a high-density-capacitor technology option [72], gain-cell dynamic memories are preferred here to achieve low area in standard CMOS. In the simplest gain cell with two transistors [2T, Fig. 1(b)] [46], the data, stored as charge on the storage node (SN), are written from the write bitline (WBL) through a write pass-transistor ($M_1$) when the write wordline (WWL) is enabled. For reading, the read bitline (RBL) is precharged to ground and charged by the readout current of $M_2$ when the read wordline (RWL) is enabled, depending on the voltage of the SN. The output data are obtained by comparing the RBL voltage to a reference.
Both transistors could be implemented with the same device type, allowing for a high cell density due to the lack of N-well transitions. However, different device types are preferred for the following reasons. To keep the design simple and reliable, all voltages are kept within the supply rails. This means that WL boosting, i.e., pulling the WWL beyond the supply rails to counter the $V_{th}$ drop across $M_1$, cannot be used. The resulting $V_{th}$ drop will limit the voltage range on SN, reduce $M_2$'s overdrive, and, therefore, limit the readout current. This will be worse at cryogenic temperatures due to the $V_{th}$ increase. The charge on SN leaks away through $M_1$'s subthreshold leakage and $M_2$'s gate leakage. Since the gate leakage is expected to dominate at cryogenic temperatures and the PMOS gate leakage is smaller in the target technology (according to the RT model), an NMOS is used for writing (NW) and a PMOS for reading (PR).
Although a wider $M_1$ would speed up the writing, its width is minimized to reduce the area and the subthreshold leakage since the minimum-size write speed is still very high (10-100 ps). For $M_2$, a larger width requires more area but also increases the SN capacitance and the readout current, and therefore the retention time. At −40 °C, i.e., the lowest valid temperature for the standard models, W = 300 nm results in a good tradeoff between area and retention time by minimizing the area-refresh-power product for a fixed read duration of 1 ns and a fixed margin (>300 mV) between the RBL voltage levels for the different stored bits. The resulting cell area is 0.184 µm$^2$ (58% smaller than the custom 6T cell, and 24% smaller than the foundry 6T cell).
Unfortunately, the retention time and readout speed of the 2T cell are limited due to capacitive coupling between the RWL and the SN. Due to the $M_2$ gate-source coupling, the SN voltage is pulled up at the start of a read operation. While this ensures that $M_2$ is off for high SN voltages, it limits $M_2$'s overdrive for low SN voltages, thus limiting the readout current and increasing the read time. Since such a gate-source coupling is stronger when $M_2$ is in inversion, the increase in SN voltage will be larger for lower SN voltages. As a result, the SN voltage levels for the two states move closer during readout, making them harder to distinguish. Additionally, the RBL voltage is limited by the other cells connected to the same RBL and with low SN voltages, as they will also conduct when the RBL voltage approaches $V_{th}$ of the readout transistors. Although this effect is mitigated by the cryogenic $V_{th}$ increase, the RBL voltage swing is usually kept well below this limit by limiting the duration of the RWL pulse, so as to minimize the read energy and stay within the functional range of the sense amplifiers.

C. 3T NW-PR Dynamic Cell
A three-transistor cell [73] [3T, Fig. 1(c)] circumvents the readout limitations of the 2T cell. The source of the readout transistor ($M_2$) is connected to a fixed voltage ($V_{DD}$) and a read pass-transistor ($M_3$) is added to select the row. This results in a faster readout due to larger $M_2$ overdrive, no shrinking of the SN voltage margin during readout, and no leakage through the readout transistors of other cells when RBL gets charged higher.
The 3T-cell sizing follows the principles adopted for the 2T cell for $M_{1,2}$, resulting in the same sizes for these transistors. Within the layout, with the sizes of $M_{1,2}$ now fixed, $M_3$ is sized as wide as possible (W = 190 nm) to minimize its ON-resistance without significantly increasing the area, which is 0.242 µm$^2$ (only 32% larger than the 2T cell and equal to the foundry 6T cell).
The largest SN voltage that can be written is $V_{DD} - V_{th,n}$, while $M_2$'s gate voltage must be larger than $V_{DD} - |V_{th,p}|$ to turn $M_2$ off. To ensure that $M_2$ is off, $M_{2,3}$ are implemented as transistors with higher threshold voltages (SVT for the LVT cell version, and high threshold voltage (HVT) for the SVT cell version). Since the $V_{th}$ increase is larger for PMOS than for NMOS at 4.2 K, the margin $|V_{th,p}| - V_{th,n}$ will be larger, making it easier to turn $M_2$ off. However, this will also lead to a reduced overdrive and slightly lower readout currents.

D. 3T PW-PR Dynamic Cell
Instead of avoiding the SN-RWL coupling, it can be exploited to increase the SN voltage margin during readout by using preferential boosting [74]. In such a cell [Fig. 1(d)], the RWL is connected to both the gate of the readout pass-transistor ($M_3$) and the drain of the readout transistor ($M_2$). Since the RWL is now pulled down, the RBL will be discharged from $V_{DD}$ through the PMOS stack. As the RWL pull-down coupling to the SN is larger for low SN voltages, the SN voltage margin between the two logic levels now improves due to the coupling. Since the SN voltage is now pulled down by the preferential boosting, the write transistor ($M_1$) has to be a PMOS (PW) to ensure that a high enough SN voltage can be written to turn off $M_2$. Consequently, the SN voltage cannot be set lower than the $|V_{th}|$ of $M_1$. The overdrive of $M_2$ is then significantly reduced due to the cryogenic $V_{th}$ increase for both $M_1$ and $M_2$, pointing to a high chance of failure that should be experimentally studied to assess the feasibility of the cell design at 4.2 K.
Fig. 2. Row decoder schematic, including transistor (W/L) in nm and the 2-bit predecoder truth table. The optional inverter in gray is only included for the row decoders where the active WL is low, i.e., overlined WWL and RWLs in Fig. 1.
With the same sizing as for the 3T NW-PR cell, the area is 0.254 µm$^2$, 38% larger than the 2T cell and slightly larger than the 3T NW-PR cell since the RWL connection of $M_2$ cannot be shared with neighboring cells as effectively as for the 3T NW-PR.
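The relative area figures quoted for the cells in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the cell areas stated in the text; the dictionary keys and the helper name `percent_diff` are introduced here for illustration.

```python
# Cell areas in square micrometers, as quoted in Section III.
AREAS_UM2 = {
    "6T custom": 0.435,
    "6T foundry": 0.242,
    "2T NW-PR": 0.184,
    "3T NW-PR": 0.242,
    "3T PW-PR": 0.254,
}


def percent_diff(cell, reference):
    """Signed area difference of `cell` relative to `reference`, in percent."""
    area = AREAS_UM2[cell]
    ref = AREAS_UM2[reference]
    return 100.0 * (area - ref) / ref


if __name__ == "__main__":
    comparisons = [
        ("6T custom", "6T foundry"),
        ("2T NW-PR", "6T custom"),
        ("2T NW-PR", "6T foundry"),
        ("3T NW-PR", "2T NW-PR"),
        ("3T PW-PR", "2T NW-PR"),
    ]
    for cell, ref in comparisons:
        print(f"{cell} vs {ref}: {percent_diff(cell, ref):+.1f}%")
```

Rounded to whole percent, the results agree with the 80%, 58%, 24%, 32%, and 38% figures given in the text.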

E. Memory Peripherals
To focus on the differences in performance due to different cell designs, the simplest memory architecture is adopted with a single bank with 1024 cells (32 rows and 32 columns) without peripheral sharing.The peripherals are nearly identical among different memories, with only small adaptations for different cell pitch and signal polarities, to minimize their effect on performance.
1) Row Decoders: Row decoders decode the 5-bit address (0-31) into a one-hot signal on one of the 32 WLs. The dynamic memories have two decoders, one for the RWLs and one for the WWLs. For low-latency and regular-layout design, the dynamic decoder in Fig. 2 is adopted with two 2-bit predecoders for the address' four most-significant bits (MSBs). The lower two NMOS transistors in the pull-down stack (left gray block) are shared between neighboring addresses, as they only have a different LSB. A large output inverter is used to minimize the WL rise/fall time and, only for 2T NW-PR and 3T PW-PR, to supply the readout current without excessive voltage drop.
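The logical function of this decoder can be illustrated with a purely behavioral model. The sketch below is not the dynamic-logic circuit of Fig. 2; it only mirrors the partitioning described above (two 2-bit predecoders on the four MSBs, with the LSB selecting between neighboring rows). The exact assignment of address bits to each predecoder and the function names are assumptions made for illustration.

```python
def predecode_2bit(bits):
    """One-hot decode a 2-bit value into four predecoder lines."""
    return [1 if bits == i else 0 for i in range(4)]


def decode_row(address):
    """Behavioral 5-to-32 one-hot decoder built from two 2-bit predecoders.

    Assumed bit assignment: address[4:3] and address[2:1] feed the two
    predecoders, address[0] (the LSB) selects between neighboring rows.
    """
    if not 0 <= address < 32:
        raise ValueError("address must be in the range 0-31")
    hi = predecode_2bit((address >> 3) & 0b11)
    lo = predecode_2bit((address >> 1) & 0b11)
    lsb = address & 1

    wordlines = []
    for row in range(32):
        # A row is selected when both predecoder lines of its pull-down
        # stack are active and its LSB matches the address LSB.
        selected = (
            hi[(row >> 3) & 0b11]
            and lo[(row >> 1) & 0b11]
            and (row & 1) == lsb
        )
        wordlines.append(1 if selected else 0)
    return wordlines
```

For every address, exactly one of the 32 returned entries is 1, and rows 2k and 2k+1 share the same pair of predecoder lines, which is the property that allows the lower part of the pull-down stack to be shared.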
2) Sense Amplifiers: The voltage-latched sense amplifiers (VLSAs) shown in Fig. 3 [75] determine whether the RBL voltage at the end of the readout phase is above or below an external reference voltage. When $M_{1,2}$ sample the reference voltage and the RBL voltage, the power-gated latch formed by two inverters is turned off. The latch is then disconnected from the inputs and turned on to amplify the input difference. The VLSA is sized to fit within the pitch of a single memory cell, and for offset and noise not to limit the cell performance. Two variations of the VLSA are used. Because $M_{3,4}$ can start to conduct during sampling, the headswitch NMOS access (HSNA) VLSA [Fig. 3(a)] is used for the 2T and 3T NW-PR cells as it works well for inputs below $V_{th,n}$, while the footswitch PMOS access (FSPA) VLSA is used for the 3T PW-PR and static cells as it works well for inputs above $V_{DD} - |V_{th,p}|$ [75]. In both designs, $M_{5,6}$ perform the comparison and dominate the input-referred offset. Thus, they are sized larger to lower the offset and laid out in a regular grid with surrounding dummies to improve the matching. Transistor $M_7$ is also wider to supply sufficient current to the latches and not to limit the decision speed. All other transistors ($M_{1\text{-}4}$) are minimum-sized and $M_{3,4}$ are implemented using HVT devices to increase the functional range of the SAs. In this case, the high $V_{th}$ is not a problem since $M_{3,4}$ must only ensure the (dis)charge to the supply rails.
The input-referred offset standard deviation of the HSNA-VLSA is expected to be around 12.6 mV based on RT Monte Carlo simulations. This is significantly less than the expected RBL voltage variation due to cell mismatch (on the order of σ = 50 mV). The input-referred rms noise is expected to be around 3.5 mV with a decision time of around 200-250 ps at RT. At 4.2 K, the mismatch is expected to increase roughly 10%-15% [62] while the rms noise is expected to decrease by at least 50% [76].
For the dynamic memories, the reference input of the SA is always connected to an external reference voltage pad. A minimum-sized NMOS/PMOS pass-transistor (the same type as the access transistor) is added to the BLs so they can be connected to a second external pad. This allows for the characterization of the offset and noise of the SAs by controlling both input voltages. The SAs are followed by transmission-gate-based latches implemented with minimum-sized devices. During a read operation, these prevent glitching at the output and isolate the SAs to prevent interference.
3) BL Driver: In the BL driver for the dynamic memories [Fig. 4 The BL driver for the static memories [Fig. 4(b)] implements a different functionality: when idle (W and R low), the BL is pulled up and precharged to $V_{DD}$; when reading (R high), the BL is left floating to be discharged by the cell being read; when writing (W high), $D_{IN}$ is written to the BL. For each differential BL pair, two of these drivers are used with an extra inverter (minimum-size NMOS and double-width PMOS) to generate $\overline{D_{IN}}$ for the BL driver.
4) Timing Control: Since the exact cell behavior at 4.2 K was unknown at design time, designing a fixed timing circuit was not possible. To also allow detailed cell characterization or debugging, the timing of the control signals is derived asynchronously using programmable delay chains (Fig. 5). The lengths of two inverter chains running in opposite directions are set by transmission-gate-based multiplexers. The delay is determined by the first non-zero element in D[1:n]. A 3.6-fF metal-oxide-metal (MOM) capacitor $C_{half\text{-}step}$ can be added to the final stage to increase its delay by approximately 50%, resulting in a delay resolution of about 20 ps.
For reading, a 192-step delay chain with a maximum delay of 3.84 ns determines the total duration of the SA's sampling phase. A 16-step (320-ps) delay chain determines how much of the sampling time is spent on precharging the internal SA node on the RBL side to fully reset the SA. The write duration is derived from a single 64-step (1.28-ns) delay chain.
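The quoted maximum delays follow directly from the chain lengths and the approximately 20-ps resolution. The sketch below makes this arithmetic explicit and shows how a target duration could be mapped to a setting. It assumes an ideal chain whose delay is proportional to the setting with no fixed offset, which neglects the multiplexer and gate delays of the real circuit; the function names are hypothetical.

```python
STEP_PS = 20  # approximate delay resolution quoted in the text

# Chain lengths (in steps) quoted in the text.
CHAIN_STEPS = {
    "sa_sampling": 192,
    "sa_precharge": 16,
    "write": 64,
}


def max_delay_ps(steps, step_ps=STEP_PS):
    """Nominal maximum delay of an ideal chain with `steps` settings."""
    return steps * step_ps


def setting_for_delay(target_ps, steps, step_ps=STEP_PS):
    """Smallest setting whose nominal delay is at least `target_ps`."""
    setting = -(-target_ps // step_ps)  # ceiling division on integers
    if setting > steps:
        raise ValueError("target delay exceeds the range of this chain")
    return setting


if __name__ == "__main__":
    for name, steps in CHAIN_STEPS.items():
        print(f"{name}: {steps} steps, up to {max_delay_ps(steps)} ps")
```

In practice, the actual step delay depends on temperature and would be taken from a measurement rather than from the nominal 20 ps.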
The control-signal generation circuits for both write and read operations are identical for all memories, such that the same settings on different memories result in similar delays. Since the delay chains consume a constant but large amount of energy, their supplies are separated and not included in the reported power budget. For an actual memory, such fine programmability is not needed, allowing for a low-power design. To estimate the delay, especially at cryogenic temperature, a 256-step delay chain is configured as a ring oscillator (RO) by selectively shorting the output to the input through a NAND gate. The frequency of its buffered output can be measured through a pad for several delay settings to estimate the stage delay.
Fig. 6. Latency-measurement setup. If the memory delay is larger than $T_{256}$, the outputs will be delayed by a clock cycle.

F. Testing Infrastructure
To measure the total read-access latency, the setup shown in Fig. 6 is used. A 256-step (5.12-ns) delay chain generates a programmable delay between the launch and capture registers. The total latency is estimated as the lowest delay setting for which the outputs of the synchronize register are correct. This will include the clock-to-Q and setup time of the launch and capture registers, respectively, which are not removed since they are small and these registers would be needed in any real application to synchronize to the clock to prevent race conditions.
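The latency estimate described above amounts to a search for the lowest passing delay setting. A minimal sketch of that search is given below. The callback `read_is_correct` is a hypothetical stand-in for programming the delay chain and checking the captured outputs, and the setting range 0 to 256 and the default 20-ps step are assumptions. A linear scan is used instead of a binary search so that no monotonic pass/fail behavior has to be assumed.

```python
def estimate_latency_ps(read_is_correct, n_steps=256, step_ps=20.0):
    """Estimate the read latency from the lowest passing delay setting.

    read_is_correct: callable taking a delay setting (int) and returning
        True when the captured outputs match the expected data.
    Returns the latency in picoseconds, or None if no setting passes.
    """
    for setting in range(n_steps + 1):
        if read_is_correct(setting):
            return setting * step_ps
    return None


if __name__ == "__main__":
    # Toy stand-in for the hardware: a memory with 1234 ps latency.
    latency = estimate_latency_ps(lambda s: s * 20.0 >= 1234.0)
    print(latency)
```

The result is quantized to the step delay, so the estimate overshoots the true latency by up to one step.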
A local controller is connected to each individual memory to execute read, write, and refresh operations. Additionally, it stores and decodes all memory settings, such as the delay chain settings and special test mode flags. A programmable, global controller is connected to local controllers through a shared bus. The global controller is a custom 32-bit, single-cycle microprocessor with 16 different instructions, 32 registers, and a 32-word instruction memory. Additional hardware compares memory read instruction results with the expected values and accumulates the error count for the various tests. The registers, instruction memory, and read-error accumulators are written and read through a shift register (SR). All (automatically synthesized) controllers are clocked at 100 MHz. To account for the cryogenic logic speed-up, a 50-ps margin is added to the hold time in the synthesis flow. Due to the bus communication overhead, the maximum memory operation frequency is lower (six cycles per write and eight cycles per read).

G. Additional Test Structures
An often-used metric in static cell design is the static-noise margin (SNM) [69], which indicates the read and write stability of the cell design. The SNM can be estimated by plotting the butterfly curves, which are created by overlaying the dc voltage transfers of both SRAM cell sides. The distance between the curves gives an indication of the noise amplitude needed to flip the state of the cell. A larger SNM, therefore, indicates a more stable cell.
To experimentally characterize the SNM, an array with 256 SVT and 256 LVT half-cells (organized in 32 rows by 16 columns) is included, which matches the actual 6T-cell array layout as accurately as possible up to 1 µm around the half-cells (Fig. 7). All $V_{in}$ are shorted and driven by a pad. The BLs are driven by tri-state buffers, allowing them to be floating (hold SNM curve), pulled up (read SNM curve), or pulled down (write SNM curve), as shown in Section IV-C. The WLs select one cell from each column to be connected to each $V_{out,column}$ through a transmission gate (all W/L = 120 nm/40 nm). A column select (CS) signal selects one $V_{out,column}$ to be connected to $V_{out,array}$ through a transmission gate (all W/L = 300 nm/40 nm). A thick-oxide source follower buffers $V_{out}$ to a pad. To characterize the buffer's dc shift, $V_{in}$ can also be connected directly to $V_{out,array}$ using the SF signal.

A. Measurement Setup
Fabricated in a TSMC 40-nm bulk CMOS process (Fig. 8), the test chip has been bonded to a dual in-line (DIL) package and mounted on a printed circuit board (PCB) at the end of a dipstick for testing at RT and, by submerging it into liquid helium, at 4.2 K.
The SA reference voltages and SNM input voltage are set by a programmable R&S HMC8043 dc power supply, while the supply for the digital controllers (1.1 V) and pad ring supply (2.5 V) are set by manually tuned low-noise adjustable RT low-dropout (LDO) regulators, which operate far from their rated limits to ensure a stable output voltage. The memory macro supplies (1.1 V for all reported measurements) are divided across three pins: one for the dynamic memories, one for the SVT static memory, and one for the LVT static memory. These pins are connected to relays on the PCB to select between the 1.1-V LDO supply or a Keithley 2636B source measure unit (SMU) channel for current measurements. A second SMU channel is used to drive a Lakeshore DT-670 cryogenic temperature sensor, located slightly above the test chip to monitor the approximate environmental temperature. The test-chip digital interface is connected to an RT field-programmable gate array (FPGA), through an optocoupler board for noise isolation, for reprogramming the global test controller and manually sending messages on the shared controller bus.
The average delay of a single delay-chain setting step is determined by measuring the RO frequency with an oscilloscope and fitting the resulting oscillation period for various settings to a linear equation, as shown in Fig. 9. The resulting step delay, later used for latency estimation, shows a cryogenic speed-up of 11%.
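The linear fit described above can be sketched with an ordinary least-squares line through the measured oscillation periods. The code below is illustrative only. It assumes that one oscillation period corresponds to two traversals of the loop (equal rising and falling delays), so that the step delay is half of the fitted slope; this factor and the function names are assumptions, not taken from the text.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def step_delay_ps(settings, freqs_hz):
    """Estimate the delay per setting step from ring-oscillator frequencies.

    Assumes period = 2 * (fixed loop delay + setting * step delay).
    """
    periods_ps = [1e12 / f for f in freqs_hz]
    slope, _intercept = fit_line(settings, periods_ps)
    return slope / 2.0


if __name__ == "__main__":
    # Synthetic data: 20 ps per step plus 100 ps of fixed loop delay.
    settings = [32, 64, 128, 256]
    freqs = [1e12 / (2.0 * (100.0 + 20.0 * n)) for n in settings]
    print(step_delay_ps(settings, freqs))
```

The intercept of the fit absorbs the fixed delay of the NAND gate and output buffer, so only the slope enters the step-delay estimate.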

B. Dynamic Memories
The retention time of all dynamic memory cells shown in Fig. 10 is measured by writing data to the cell, waiting for a varying hold time with the opposite voltage on the WBLs for worst-case leakage, and reading back the data. The retention time is defined as the maximum hold time for which the read data match the written data for both data polarities. A data mismatch for the shortest possible hold time (80 ns) is considered a failure of the cell. To contain the total characterization time, the longest measurable retention time is limited to 20 ms (100 ms for the 2T LVT cell at 4.2 K). The SA $V_{ref}$ is optimized for each memory type to give the best retention time performance. The read duration is chosen to be as short as possible without increasing the fail rate. A log-normal cumulative distribution function (cdf) is least-squares fit to the cumulative histogram of the nonfailing cells for which a retention time can be determined within the measurement limit. As shown in the following, the good fit is compatible with an exponential distribution of the leakage currents at both temperatures, which is expected for both subthreshold leakage and gate leakage.
large leakage currents, the SVT failures at 4.2 K are related to insufficient SN margins.
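The retention-time definition and the two special cases above (cell failure and measurement limit) can be summarized in a short classification routine. This is a sketch under stated assumptions: the function name and status labels are hypothetical, the hold times are assumed to be tested in ascending order, and the retention time is taken as the last hold time before the first mismatch in either polarity.

```python
def retention_time(hold_times_s, ok_data0, ok_data1):
    """Classify one cell from pass/fail read results per hold time.

    hold_times_s: tested hold times in ascending order, starting at the
        shortest possible hold time.
    ok_data0, ok_data1: read-back success per hold time for each polarity.
    Returns (retention_time_or_None, status), where status is "failed",
    "measured", or "exceeds_limit".
    """
    if not hold_times_s:
        raise ValueError("at least one hold time is required")

    last_pass = None
    for t, ok0, ok1 in zip(hold_times_s, ok_data0, ok_data1):
        if not (ok0 and ok1):
            break  # first mismatch in either polarity ends the search
        last_pass = t
    else:
        # No mismatch up to the longest tested hold time.
        return last_pass, "exceeds_limit"

    if last_pass is None:
        return None, "failed"  # mismatch at the shortest hold time
    return last_pass, "measured"
```

Only cells classified as "measured" would enter a distribution fit, since failing cells have no retention time and cells beyond the limit provide only a lower bound.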
For both the SVT and LVT designs, several cells exceed the retention time limit, resulting in an unreliable log-normal fit for the SVT memory (Fig. 10(a-4), fit to only 139 out of 1013 functional cells). For the LVT cells, the fit is more reliable and shows an increase in both the average retention time ($4\times10^{4}\times$) and spread of the retention time. Note that the increased spread cannot be directly attributed to the cryogenic increase in transistor mismatch, as the retention time is limited by different physical effects at the two temperatures, as explained in the following. Both cell designs show no significant correlation between the retention times at RT and 4.2 K [Fig. 10(a-5) and (b-5)].
The 3T NW-PR designs show a smaller improvement in retention time from RT to 4.2 K. Their RT retention time is similar to that of the 2T NW-PR cells, but their 4.2 K retention time is much lower and has a lower spread. The lower spread may be due to the lower relative impact of the mismatch on a larger leakage. The LVT implementation shows a significant (weak) negative correlation of the retention time [Fig. 10(d-5)], which can be explained by the fact that, while a low $V_{th}$ of the write transistor causes a large RT leakage, it also provides better SN voltage margins that improve the retention time at 4.2 K.
The retention time of the 3T PW-PR designs is superior to the other cell flavors at RT, thanks to the preferential boosting technique. However, 889 SVT cells [Fig. 10(e-4)] and 253 LVT cells [Fig. 10(f-4)] always fail at 4.2 K (dashed lines). This is attributed to the $V_{th}$ increase of all transistors, which limits both the SN voltages that can be written and the readout current. This is explained in more detail in the following. The LVT implementation also shows a significant weak negative retention-time correlation [Fig. 10(f-5)], similar to what happens for the 3T NW-PR cells and attributed to the same effects.
Overall, at RT, the LVT cells show higher error rates and shorter retention time than the SVT cells due to their larger subthreshold leakage. At 4.2 K, however, the LVT cells show fewer failures than the SVT cells due to the compensation for the cryogenic $V_{th}$ increase and better SN voltage margins. The differences in retention times between LVT and SVT for functional cells are also smaller, indicating that their leakage is much more similar.
The transition between subthreshold leakage and gate leakage can be observed by continuously sweeping the ambient temperature. This can be accomplished by slowly raising/lowering the chip's vertical position in the helium vapors above the liquid helium surface. Two regions clearly appear, as shown for the 2T NW-PR LVT cell in Fig. 11. For high temperatures (>160 K), the subthreshold leakage dominates while for low temperatures (<160 K), the gate leakage dominates. As is usually done in the literature, the temperature dependence of each leakage process can be determined by fitting to a sum of two Arrhenius equations, where $k_b$ is Boltzmann's constant, $T$ is the absolute temperature, $A_{high}$ and $A_{low}$ are the proportionality constants, and $E_{a,high}$ and $E_{a,low}$ are the activation energies. The high-temperature activation energy $E_{a,high} = 0.328$ eV matches the expected value for subthreshold leakage $E_{a,subth} = \ln(10)\,k_b\,(V_{th}(0)/s_0)$, with $V_{th}(0)$ the extrapolated $V_{th}$ at 0 K and $s_0$ the linearized subthreshold slope temperature dependence [30]. The low-temperature activation energy $E_{a,low}$ indicates very little temperature dependence, which is in line with the expected very small temperature dependence of the gate leakage. However, since gate leakage does not actually follow an Arrhenius equation, the fit fails below 50 K. This shows that simply fitting an Arrhenius equation for temperatures above 50 K is not sufficient to predict the cell's retention time at 4.2 K.
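The fitted expression itself is not reproduced in the text above. A sum of two Arrhenius terms that is consistent with the symbols defined there would take the following form, under the assumption (not confirmed by the source) that the fitted quantity is the inverse retention time $1/t_{ret}$, taken as proportional to the total leakage:

```latex
\frac{1}{t_{ret}} = A_{high}\,\exp\!\left(-\frac{E_{a,high}}{k_b T}\right)
                  + A_{low}\,\exp\!\left(-\frac{E_{a,low}}{k_b T}\right)
```

With this form, the boundary between the two regions is the temperature at which both terms are equal, $T_x = (E_{a,high} - E_{a,low}) / \left(k_b \ln(A_{high}/A_{low})\right)$, which would correspond to the transition observed around 160 K.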
The retention time over temperature for the 3T cell designs is shown in Fig. 12 for both data polarities separately. These also show the subthreshold leakage limitation at high temperatures, mainly for the high SN voltages. Since the readout transistor gate leakage pulls up the SN, the high-SN retention time becomes infinite when the subthreshold leakage becomes smaller than the gate leakage.
For temperatures below 200 K, the retention time becomes limited by the state with a low SN voltage due to the gate leakage. Especially for the 3T NW-PR cells, the readout transistor is then in strong inversion ($|V_{gs}| = V_{DD}$), resulting in a much larger gate leakage than for the 2T NW-PR cells where the readout transistor is never in inversion during the hold time. Although the gate leakage is assumed to be roughly constant over temperature, the retention time decreases toward lower temperatures due to the $V_{th}$ shift of the readout transistors. This results in smaller readout currents and readout margins at lower temperatures and thus earlier cell failures.
For the 3T PW-PR cells, the $V_{th}$ shift has a double effect, also increasing the lowest voltage that can be written through the PMOS. This means that the readout transistor overdrive $V_{gs} - |V_{th}|$ during readout decreases due to the decrease in $V_{gs}$ and the increase in $|V_{th}|$, resulting in a reduced retention time for lower temperatures and even in failing cells. The limited write SN voltage issue could be overcome using WL-boosting, which is not used here since it would increase the design complexity and could impact the reliability of the devices.
Table I lists the input-referred offset and noise of the dynamic-memory SAs. These are determined by accumulating the average SA output over 1023 comparisons for a fixed V_ref and variable RBL voltage by connecting the RBLs to a pad. A binary search is performed to find the RBL voltage for which the average SA output equals 0.5. The input-referred offset and rms noise are found by fitting a normal cdf to the average output for various differential input voltages, similar to the method used in [76]. The measurement is performed for all 6 × 32 dynamic-memory SAs. Since the SAs in the SVT and LVT cell memories are identical, there are three unique designs with 64 samples each.
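The offset and noise extraction described above can be illustrated with a short sketch. It is not the authors' fitting code: instead of a direct least-squares fit of the normal cdf, it uses a probit linearization (inverse normal cdf followed by a straight-line fit), which recovers the same two parameters when the model holds. The function name, argument names, and the synthetic data are all hypothetical.

```python
from statistics import NormalDist


def fit_offset_and_noise(v_diff, avg_out, eps=1e-6):
    """Estimate input-referred offset (mu) and rms noise (sigma) of an SA.

    v_diff  : differential input voltages in volts
    avg_out : average SA output (fraction of '1' decisions) per voltage
    Assumed model: avg_out = Phi((v_diff - mu) / sigma), Phi = normal cdf.
    """
    std = NormalDist()
    xs, zs = [], []
    for v, p in zip(v_diff, avg_out):
        # Skip saturated points: the inverse cdf diverges at 0 and 1.
        if eps < p < 1.0 - eps:
            xs.append(v)
            zs.append(std.inv_cdf(p))
    if len(xs) < 2:
        raise ValueError("need at least two unsaturated points")

    # Ordinary least-squares line z = slope * v + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    slope = sxz / sxx                      # equals 1 / sigma
    intercept = mean_z - slope * mean_x    # equals -mu / sigma
    return -intercept / slope, 1.0 / slope


# Synthetic check with made-up values (4 mV offset, 1.5 mV rms noise).
true_sa = NormalDist(mu=0.004, sigma=0.0015)
volts = [i * 0.0005 for i in range(17)]    # 0 to 8 mV in 0.5 mV steps
probs = [true_sa.cdf(v) for v in volts]
mu_hat, sigma_hat = fit_offset_and_noise(volts, probs)
```

With a finite number of comparisons per point, averages of exactly 0 or 1 carry no slope information, which is why the sketch discards them before fitting.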
The 2T NW-PR and 3T NW-PR memories use identical SA designs, resulting in no significant difference in systematic offset (µ). The offset spread (σ) and RT noise are significantly different and attributed to differences in the layout needed to fit different column pitches. Their 4.2-K noise is similar.
All designs show a significant mean offset due to the unequal loading of the two output nodes, which is positive for the HSNA-VLSAs and negative for the FSPA-VLSAs. Although the SA design difference for the 3T PW-PR memories results in a different mean offset, the offset spread and noise performance for all SAs are similar.
The absolute systematic offset of all SA designs increases by 22%-25% when cooling down from RT to 4.2 K. This is attributed to a reduction in parasitic capacitance due to the reduction of the source/drain junction capacitance, which could increase the effects of, e.g., charge injection. The offset spread increases by 5%-9% (although with very limited statistical confidence) due to the increase in mismatch, while the rms noise decreases by ∼70% due to the reduced thermal noise. While the SA designs are different, the changes in input-referred offset and noise of the NMOS and PMOS versions do not show significant differences.
The operation energy, leakage power, and full-memory latency of the dynamic memories are shown in the first six rows of Table II. These are given in ranges since various timing settings are possible, resulting in different tradeoffs. In general, shorter timing settings result in lower latency, lower operation energy due to reduced BL swing, and lower retention time. The lower retention time will, however, give a higher static power consumption, resulting in a tradeoff between static and dynamic power.
The leakage power (P_leakage) is determined by measuring the average power consumption without any memory operations. Note that the reported DRAM P_leakage is the average leakage per memory bank, as all the dynamic memories share the same supply. For the SRAM, the leakage per individual bank (SVT or LVT) is measured and reported. The average write power is measured by writing random data to random addresses of the selected memory, generated using a linear-feedback-shift-register pseudorandom number generator. By subtracting P_leakage from the average write power and dividing the result by the write operation frequency f_write, the write energy per operation E_write is obtained. Next, a combination of random writes and reads is performed to obtain the average combined write and read power, from which P_leakage and the write power (E_write × f_write) are subtracted; the result is divided by the read operation frequency f_read to obtain the read energy per operation E_read. The full-memory refresh energy E_refresh is determined by dividing the average power when refreshing the entire memory by the refresh frequency f_refresh. Note that E_read (E_write) is the energy required to read (write) a single 32-bit word, while E_refresh is the energy required to refresh a full memory bank (32 words). The latency is determined by reading alternating data polarities while reducing the latency-measurement delay chain setting until the read is unsuccessful.
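The energy extraction steps above amount to a few subtractions and divisions, sketched below. The function and argument names are hypothetical, the numbers in the example are illustrative and not taken from Table II, and the write rate during the mixed run is passed separately on the assumption that it may differ from the write-only run.

```python
def operation_energies(p_leak, p_write_avg, f_write,
                       p_mixed_avg, f_write_mixed, f_read,
                       p_refresh_avg, f_refresh):
    """Return (E_write, E_read, E_refresh) in joules.

    Powers are in watts and operation frequencies in hertz.
    """
    # Write energy: write-only run with the leakage removed.
    e_write = (p_write_avg - p_leak) / f_write
    # Read energy: mixed run minus leakage and minus the write contribution.
    e_read = (p_mixed_avg - p_leak - e_write * f_write_mixed) / f_read
    # Refresh energy: the text divides the average refresh power directly
    # by the refresh frequency, so no leakage is subtracted here.
    e_refresh = p_refresh_avg / f_refresh
    return e_write, e_read, e_refresh


# Illustrative numbers only (not measured values).
e_w, e_r, e_ref = operation_energies(
    p_leak=1e-6,
    p_write_avg=11e-6, f_write=10e6,
    p_mixed_avg=16e-6, f_write_mixed=5e6, f_read=5e6,
    p_refresh_avg=4e-6, f_refresh=1e6,
)
```

In this example the write-only run yields 1 pJ per write, the mixed run then yields 2 pJ per read, and a full refresh costs 4 pJ.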
In general, there is a decrease in operation energy from RT to 4.2 K, which is mainly attributed to the decrease in source/drain junction capacitance. Furthermore, the leakage reduces to inappreciable levels and the latency decreases due to the improved digital speed.

C. Static Memories
The SNMs of the 6T cells are measured using the special test structures in Fig. 7 by sweeping the input voltage while looping over different cells. The cell's output voltage is determined from the source follower's output voltage after compensating for the source-follower transfer. Fig. 13 shows the measured SNM curves of the SVT and LVT 6T half-cells at RT (red) and 4.2 K (blue) overlaid on mirrored versions of the curves to show the SNM gaps. At RT, all 256 SVT and 256 LVT half-cells are measured. At 4.2 K, only eight SVT and eight LVT half-cells have been measured because much longer measurement times are required at 4.2 K due to the lower currents and larger transmission-gate impedance around the digital-level transitions [66]. At 4.2 K, the hold curves [Fig. 13(a-1) and (b-1)] show sharper corners due to the steeper subthreshold slope. Furthermore, the V_out versus V_in curves slightly shift to the right, thus moving toward the middle of the voltage range and marginally increasing the hold SNM. Such a shift is due to the cryogenic threshold increase of the NMOS dominating over the PMOS threshold increase due to NMOS being stronger, thanks to its larger mobility.

TABLE II
CELL AREA, OPERATION ENERGY, LEAKAGE POWER, AND LATENCY OF ALL MEMORY DESIGNS AT RT AND 4.2 K (V_DD = 1.1 V)

Fig. 13. SNM curves at RT (red) and 4.2 K (blue) of the SVT 6T cells with BL floating (hold curves, a-1), BL pulled up (read curves, a-2), and BL pulled down (write curves, a-3) and the LVT 6T cells with BL floating (hold curves, b-1), BL pulled up (read curves, b-2), and BL pulled down (write curves, b-3).

The left half of the read curves [Fig. 13(a-2) and (b-2)] follows the hold curves, including the sharper corners and the shift to the right. In the right half of the plot, the inverter NMOS pulls down while the access NMOS pulls up. Since both transistors are equally affected by the temperature change, their effects partially cancel out. Thanks to the curve shift in the left half, the read SNM increases slightly at 4.2 K.
For the write curves [Fig. 13(a-3) and (b-3)], the right half follows the hold curves. Since in the left part, the inverter PMOS pulls up and the access NMOS pulls down, the 4.2-K curves are pulled down more due to the NMOS cryogenic shift dominating over the PMOS. This also results in an increase of the write SNM.
The operation energy, leakage power, and latency of the static memory designs are shown in the two bottom rows of Table II. These metrics are determined using the same method as for the dynamic memories. For the static memories, a slight decrease by ∼13% and ∼5% in read and write energy is observed, respectively. Since voltage swings stay approximately constant, this is expected to be caused by the reduced node capacitance due to the reduction in source/drain junction capacitance. While there is a significant static leakage at RT, especially for the LVT cells, it becomes inappreciable at 4.2 K. Furthermore, the latency decreases by about 14% due to the increased readout current.

V. DISCUSSION
For nearly all memories, only a marginal improvement in operation energy and latency from RT to 4.2 K has been observed, in combination with very significant improvements in leakage power and in DRAM retention time, which lowers the refresh power. Since subthreshold leakage becomes negligible, LVT devices become a natural choice to improve performance with their lower V_th, resulting in faster operation for the static cells and longer retention times for the dynamic cells, thanks to the larger SN margins.
Furthermore, some relevant guidelines for cryogenic design can be inferred. For the dynamic cells, RT techniques to improve the retention time may fail at 4.2 K, as shown for the 3T PW-PR cell. In that case, the V_th shift of the readout and write transistors severely reduces the readout currents and the retention time. For the readout transistor, the overdrive for the current-generating state must be large enough to mitigate the V_th shift. For the write transistor, the current-generating state must be written strongly. As a result, the readout and write transistors should be of a different type (PMOS/NMOS). Additionally, since gate leakage is the dominant leakage source at 4.2 K, it should be minimized by avoiding readout devices in strong inversion and selecting the device type with the lowest gate leakage.
For the static memories, a slight increase in SNMs is expected. Despite being small at RT, the read SNM for the LVT cells is apparently larger than that of SVT cells at 4.2 K, allowing the use of LVT cells with similar leakage power and lower latency than the SVT cells, although definitive conclusions cannot be drawn due to the limited sample size. Given the improved write SNM, a different sizing in favor of the read stability (larger pull-down transistors) may allow for even better cells under mismatch, at the cost of a slight increase in area.
Using the values reported in Table II, a quantitative comparison of the expected power consumption for various applications can be drafted. Each application is defined by its access rate (limited by the memory's latency) and the W/R ratio, ranging from 0 to 1 assuming that we do not write data that is never read. For each memory, Fig. 14(a) shows a flat refresh/leakage-dominated region and an operation-power-dominated region, with little dependence on W/R. Since the write energy is lower than the read energy, there will be a minor decrease in power consumption for all memories as W/R increases. At RT, all LVT memories consume much more power than the SVT versions. However, at 4.2 K, some LVT memories perform better (2T) or roughly equal (6T). The gap decreases for the 3T memories. Only the 2T NW-PR (LVT) and 3T NW-PR designs improve from RT to 4.2 K since their worst cell retention time improves, resulting in lower refresh rates.
In Fig. 14(b) at RT, the 6T SVT memory is most efficient below 25 MHz, thanks to the leakage power being lower than the DRAMs' refresh power. Above 25 MHz, the 3T PW-PR SVT memory consumes the lowest power, thanks to the low operation energy and the highest retention time. At 4.2 K, the LVT 2T NW-PR outperforms the SRAM already beyond 75 kHz, also being the smallest and even 24% smaller than the foundry SRAM cell, since its retention time at 4.2 K is much longer than that of the SVT 3T PW-PR at RT, resulting in a significantly lower refresh power. Finally, although the higher latency of the 2T NW-PR and 3T PW-PR memories may limit their maximum access rate, multi-bank architectures could be adopted to reach a much higher throughput using slower banks, so this does not constitute a fundamental issue.
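The comparison above combines a static term (leakage and refresh) with an operation term that scales with the access rate. The sketch below shows one way to set up such a model. It rests on stated assumptions rather than on the exact procedure behind Fig. 14: the access rate is taken as the sum of read and write rates, W/R as the ratio of write rate to read rate, and the refresh power as one full-memory refresh per worst-cell retention time. All names and numbers are hypothetical and not taken from Table II.

```python
def memory_power(access_rate, wr_ratio, e_read, e_write,
                 p_leak, e_refresh=0.0, t_ret=None):
    """Expected power (W) of one memory bank at a given access rate (Hz)."""
    if not 0.0 <= wr_ratio <= 1.0:
        raise ValueError("wr_ratio must lie between 0 and 1")
    # Assumption: access_rate = f_read + f_write and wr_ratio = f_write / f_read.
    f_read = access_rate / (1.0 + wr_ratio)
    f_write = access_rate - f_read
    # Assumption: one full refresh (E_refresh) every retention time t_ret.
    p_refresh = 0.0
    if t_ret is not None:
        if t_ret <= 0.0:
            raise ValueError("t_ret must be positive")
        p_refresh = e_refresh / t_ret
    return p_leak + p_refresh + f_read * e_read + f_write * e_write


def crossover_rate(p_static_a, e_op_a, p_static_b, e_op_b):
    """Access rate (Hz) at which two linear power models intersect.

    Returns None when the lines are parallel or cross at a non-positive rate.
    """
    if e_op_a == e_op_b:
        return None
    rate = (p_static_b - p_static_a) / (e_op_a - e_op_b)
    return rate if rate > 0.0 else None


# Illustrative numbers only: memory A has no static power but costs
# 3 pJ per access; memory B needs 60 nW of refresh but costs 1 pJ per access.
f_cross = crossover_rate(0.0, 3e-12, 6e-8, 1e-12)
p_read_only = memory_power(1e6, 0.0, e_read=2e-12, e_write=1e-12, p_leak=1e-6)
p_balanced = memory_power(1e6, 1.0, e_read=2e-12, e_write=1e-12, p_leak=1e-6)
```

In this toy case memory B becomes the lower-power option above roughly 30 kHz, and the balanced workload consumes slightly less than the read-only one because the write energy is the smaller of the two.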
Based on the proposed guidelines, more advanced cell designs could be considered beyond this work (see [77]). Additionally, the presented tradeoffs do not capture the full range of considerations for memory selection. For instance, refresh operations will reduce the DRAM availability and noise-limited bit-error rates must be acceptable for the application. Reliability and security aspects may also be relevant, such as retention time limitations due to row-hammer attacks [24].

VI. CONCLUSION
By comparing single-bank static and dynamic memories at cryogenic temperature, this article shows that well-designed dynamic memories can outperform static memories for middle-to-high frequency applications in terms of area and power. While the subthreshold leakage reduces substantially from RT to 4.2 K, gate leakage stays approximately constant, thus still limiting the retention time. Still, adopting dynamic cells with enhanced resistance to gate leakage and cryogenic V_th shifts can significantly increase retention time, thus lowering the refresh power. The increased variability in both cells and peripherals may increase the number of outlier cells, while the lower noise reduces the read error rate. Embracing the design guidelines outlined here for cryogenic embedded memories will facilitate the adoption of dynamic-memory cells for high-density low-power cryogenic memories, thereby enabling the complex cryo-CMOS SoCs needed in future QCs.

Fig. 1. Schematics of the four cell designs: (a) 6T static cell; (b) 2T NW-PR dynamic gain cell; (c) 3T NW-PR dynamic gain cell; and (d) 3T PW-PR preferentially boosted dynamic gain cell. The readout current of the dynamic cells always flows from top to bottom.

Fig. 3. Schematics of the two SA designs with sizing in nm: (a) HSNA-VLSA for low RBL voltages and (b) FSPA-VLSA for high RBL voltages.
(a)], a multiplexer selects the external data input (D IN ) or the data from the last read operation for a refresh (D REF ).

Fig. 4. BL driver for: (a) dynamic memories and (b) static memories with transistor (W/L) sizes in nm.In (a), unannotated transistors are minimum-sized (W/L = 120/40) and inverter PMOS/NMOS sizes are shown above/below the inverters, respectively; the driving inverters in the gray boxes are alternatively used for the respective memory.In (b), unannotated NMOS transistors are minimum-sized (W/L = 120/40) while unannotated PMOS transistors are double-width, minimum-length (W/L = 240/40).

Fig. 5. Delay chain used to generate the timing for the memory control signals. The (W/L) of all transistors is (300 nm/40 nm) and (500 nm/40 nm) for NMOS and PMOS, respectively, except for the transistors in the inverters driving the transmission gates, which are minimum-sized (120 nm/40 nm).

Fig. 7. Structure for static-cell SNM characterization, including a single half-cell, the cell selection hierarchy, and the output buffer with transistor (W/L) in nm.

Fig. 8. Annotated micrograph of the chip (left) and a block diagram of the architecture for all memories (right).

Fig. 9. RO loop delay as a function of the delay chain settings with a least-squares linear fit (V_DD = 1.1 V).
At RT, the SVT implementation of the 2T NW-PR cell outperforms the LVT version due to the lower subthreshold leakage through the write transistor. While all SVT cells are functional, nine LVT cells always fail [Fig. 10(b-2)]. Both versions show a clear improvement in retention time from RT to 4.2 K, thanks to the reduced subthreshold leakage. At 4.2 K, both types show a similar retention time since they are both limited by the gate leakage of the readout transistor. At this

Fig. 10. Retention time measurements for all dynamic cell designs (V_DD = 1.1 V) shown in subplots (y-x). The first part of the label (y) indicates the cell design: a = 2T NW-PR SVT; b = 2T NW-PR LVT; c = 3T NW-PR SVT; d = 3T NW-PR LVT; e = 3T PW-PR SVT; f = 3T PW-PR LVT. The second part of the label (x) indicates the type of plot: 1 = RT retention time heatmap per cell; 2 = cumulative distribution of the RT retention time per cell with a log-normal cdf least-squares fit to the non-failing cells with its µ and σ (base-e) and the total number of failing cells (retention time less than 80 ns); 3 = same as 1, but at 4.2 K, where blue cells indicate a retention time longer than the limit imposed by the measurement setup; 4 = same as 2, but at 4.2 K, including the total number of cells with a retention time larger than the measurement setup limit; 5 = scatterplot where each point corresponds to the retention time of a single cell at both temperatures, indicating little to no significant (P < 0.05) log-log Pearson correlation (r) between RT and 4.2-K retention times.

Fig. 11. Retention time over temperature for a single 2T NW-PR LVT cell with a two-term Arrhenius fit, and retention-time distribution of all cells at RT and 4.2 K (V_DD = 1.1 V). The RT and 4.2-K retention-time distributions use different optimized settings and reference voltages. The single-cell temperature sweep uses the 4.2-K settings.

Fig. 12. Retention time over temperature for a single cell of each 3T cell design for the low-SN-voltage and high-SN-voltage states separately (V_DD = 1.1 V).

Fig. 14. Expected full memory power consumption (V_DD = 1.1 V, excluding control signal generation) over the full application space, showing: (a) power consumption for W/R = 0 and W/R = 1 (solid SVT and dashed LVT) and (b) memory with the lowest power consumption. The refresh rate is determined by the worst cell retention time measured over multiple runs. Using the setting configurations with the minimum power results in discontinuities for high access rates where some configurations are too slow for the required speed.

TABLE I
INPUT-REFERRED OFFSET AND NOISE OF 64 SAS (