Time-Based Compute-in-Memory for Cryogenic Neural Network With Successive Approximation Register Time-to-Digital Converter

This article explores a new application of the compute-in-memory (CIM) paradigm for cryogenic neural networks. Using a 28-nm cryogenic transistor model calibrated at 4 K, a time-based CIM macro is proposed that comprises the following: 1) an area-efficient unit delay cell design for cryogenic operation and 2) an area- and power-efficient successive approximation register (SAR) time-to-digital converter (TDC) capable of high resolution. The benchmark simulation first shows that the proposed macro achieves better latency than its current-based CIM counterpart. It further shows better scalability for a larger decoder size and for process technology optimization.


I. INTRODUCTION
Cryogenic integrated circuits (ICs) have been studied aggressively in academia and industry, especially in recent years, driven by needs in various fields such as aerospace, high-performance cryogenic computing, and quantum computing. However, since high-performance cryogenic computing and quantum computing systems are likely to keep other parts of the system at room temperature, the interface macro connecting cryogenic processors to room-temperature systems will be problematic in terms of power and latency. Especially in the calculation of neural networks, which necessitates immense data communication, such drawbacks may overshadow the application's advantages.
Compute-in-memory (CIM) can be a good alternative for neural network computation in cryogenics. By implementing the CIM macro at cryogenic temperature and reducing communication with outer systems, both power and latency can be improved. However, current-based CIM, which is the most widely used topology so far, faces several hurdles in cryogenics due to its inherent limits. First, the current-based approach supports only a limited maximum number of multiply-and-accumulate (MAC) sets within the limited voltage headroom when converting the summed current to a voltage (e.g., via a series transistor connected to the column). In other words, the number of rows that can be turned on simultaneously is limited. Second, the large area of the analog-to-digital converter (ADC) located at the CIM array periphery increases computation latency if multiplexed.
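As a rough illustration of the first limit, the number of rows that can be summed into one column at once is bounded by how many current-to-voltage LSBs fit in the available headroom. The numbers below (500 mV of usable headroom, a 4-mV LSB) are hypothetical placeholders chosen only to sketch the scaling, not design values from this article.

```python
def max_parallel_rows(headroom_mv: int, lsb_mv: int) -> int:
    """Upper bound on simultaneously active MAC sets in one column of a
    current-based CIM: each accumulated LSB consumes lsb_mv of the
    usable headroom at the current-to-voltage conversion node."""
    return headroom_mv // lsb_mv

# With 500 mV of usable headroom and a 4-mV LSB (assumed values),
# at most 125 rows can be summed in one shot; halving the headroom
# halves that bound, which is the scaling problem described above.
rows_full = max_parallel_rows(500, 4)
rows_half = max_parallel_rows(250, 4)
```

This is why reduced supply voltages at cryogenic temperature directly shrink the array size a current-based column can serve.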
The above problems imply that circuit-level optimization for cryogenic CIM is required. In this article, we suggest exploiting time-based CIM so that these problems can be significantly relieved. Specifically, we propose: 1) a unit delay cell structure that facilitates large-array computation under low voltage headroom and 2) a successive approximation register time-to-digital converter (SAR-TDC) structure that reduces column multiplexing and improves latency. Both facilitate error correction of a large qubit lattice within a given time constraint. Even though the neural network simulated in this work is based on the quantum error correction (QEC) decoder of [1] for benchmarking, the time-domain circuit design ideas in this article can potentially be applied to other neural network implementations for cryogenic applications.
In this article, we benchmarked the performance of the proposed time-based CIM macro using the surface code QEC decoder as the benchmarking model [1]. The surface code QEC decoder is suggested to cope with the high error rates of quantum devices, i.e., qubits, and ultimately to realize large-scale quantum computing. The surface coding scheme encodes logical qubit information in a 2-D lattice of physical qubits comprised of data qubits and parity qubits [Fig. 1(a)]. By reading out the parity qubits over multiple rounds [Fig. 1(b)], errors of the data qubits, and even measurement errors, are detected. As the surface code is a degenerate code, meaning that different sets of errors can create the same error syndrome, neural network-based decoding (NNbD) is gaining support as a decoder candidate due to its pattern recognition nature.
If the QEC decoder is still located at room temperature, an interface macro connecting the cryogenic control/readout circuitry and the room-temperature QEC decoder will be required. Inevitably, it will consume non-negligible latency due to serialization/deserialization of the data, and this problem will become more severe in future large-scale quantum computing systems, where larger data volumes must be processed. Therefore, in addition to the compactness and cooling budget problems [2], the QEC decoder gains substantial benefits from being located at 4 K. Recent research [3] discussed the possibility of implementing NNbD in the form of current-based CIM near the qubit temperature. However, the aforementioned limitations of the current-based CIM approach will hinder applying CIM to a larger surface code decoder. Note that even though the benchmark simulation in this article is based on the QEC decoder application, the proposed macro targets any CIM macro that needs to be implemented at cryogenic temperatures.
The rest of this article is organized as follows. Section II describes in more detail the issues in applying current-based CIM to the surface code decoder, particularly in cryogenics, and briefly introduces the proposed time-based CIM macro. Section III presents the two main suggestions for implementing an area-efficient, high-resolution CIM macro: the unit delay cell design and the SAR-TDC architecture. Section IV presents the benchmark simulation results, showing the advantages of the proposed time-based CIM in latency and scalability compared with current-based CIM counterparts, and finally, Section V draws the conclusion.

II. CURRENT-BASED CIM ISSUES AND THE PROPOSED TIME-BASED CIM MACRO

A. ISSUES: ACCURACY AND TIMING
The scalability issue that exists in large-array CIM also arises in the QEC application when the surface code distance increases. The code distance d is the size of the lattice, i.e., the number of data qubits on one lateral side, and determines the length of the shortest error chain that cannot be corrected. Thus, the fidelity of logical qubits can be controlled by trading off qubit redundancy (see Table 1). In this case, the size of the neural network, and hence of the CIM array, inevitably increases. Reflecting the neural network architecture of [1], the number of activations in one column doubles when the code distance increases from 3 to 7. Because the number of MAC sets in one column doubles, quantization noise and analog non-idealities create a resolution issue. Column multiplexing cannot be the solution because of the timing constraints described in the following paragraph.
Beyond the decoding process's goal of accurately identifying the type of error, it also needs to meet a time constraint set by the qubit decoherence time. If decoding fails to meet this target, errors will accumulate, and the quantum operation will lose its fidelity. According to [3], the current-based CIM's average cycle time for a one-time measurement is around 1 µs, similar to the qubit decoherence time. It is almost at the limit at d = 3; moreover, it will fail to meet the timing constraint as the code distance increases. Besides the limited number of MAC sets that can be accumulated, the large ADC area is a hurdle to reducing latency. As the ADC occupies the area of four or more columns, the CIM macro must operate in a multiplexed manner, which inevitably increases overall computation time. Such column multiplexing is required even more as the code distance increases. Overall, a scalable solution is needed for larger code distances.

B. TIME-BASED CIM MACRO
The time-based CIM effectively counters these limits of the current-based CIM. It multiplies and accumulates sets of input activations and weights using controllable delay stages. This computing strategy of accumulating delay has two main advantages. First, the maximum number of MAC sets in one column [data delay line (DDL)] does not depend on the voltage headroom. Instead of finely dividing the limited voltage headroom as in current-based CIM, it only requires additional latency to wait for the additional MAC sets' delay to be accumulated. Second, while current-based CIM can show offsets between accumulated results for the same MAC values due to the parasitic resistance and capacitance of the column wire, time-based CIM is not affected by them, because the delay accumulation of each set is separated by inverter chains. This characteristic also preserves good linearity regardless of the array size. Collectively, the time-based approach is advantageous for accurate computation in a larger CIM array. In addition, the SAR-TDC proposed in this article effectively reduces area while achieving high resolution.
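The delay-accumulation MAC can be sketched behaviorally: every unit cell whose stored weight bit and activation are both 1 appends one unit delay to the edge traveling down the column, so the total delay is affine in the MAC value regardless of which cells contributed. The t0/t_unit figures below are placeholders, not simulated values.

```python
def ddl_delay(activations, weight_bits, t0_ps=20.0, t_unit_ps=5.0):
    """Behavioral model of one data delay line (DDL): the rising edge
    launched at the line's input is delayed by one t_unit per cell in
    which the activation AND the stored weight bit are both 1."""
    mac = sum(a & w for a, w in zip(activations, weight_bits))
    return t0_ps + mac * t_unit_ps

# Two different input/weight patterns with the same MAC value land on
# exactly the same accumulated delay (ideal linearity):
d1 = ddl_delay([1, 1, 0, 1], [1, 0, 1, 1])  # MAC = 2
d2 = ddl_delay([0, 1, 1, 0], [1, 1, 1, 0])  # MAC = 2
```

In the real column, each contribution is buffered by its own inverter stage, which is what decouples the result from the column-wire parasitics described above.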
The proposed time-based CIM macro architecture is illustrated in Fig. 2. According to the analysis in [3], 4-bit quantization of the weights is sufficient to achieve high QEC fidelity, even compared with floating-point weights. The macro is composed of the following: 1) 256 DDLs (storing 64 4-bit weights), each processing a 128-input binary MAC operation; 2) a 5-bit SAR-TDC; and 3) one additional delay line for generating the reference delay, which is t_0 + 64 × t_unit_delay. The MAC operation is performed by delaying a rising edge that is shot at the beginning of each delay line. The accumulated delay is finally converted into 5-bit digital values via the SAR-TDC. At each column, additional unit cells are located between the delay line and the TDC to compensate for variation and mismatch coming from the RC delay from the reference delay line (RDL) to the TDCs.

III. HARDWARE IMPLEMENTATION OF THE TIME-BASED CIM
This section proposes the two main ideas of the time-based CIM implementation for the cryogenic surface code decoder. We first choose the best topology among previously reported time-based MAC operation schemes and suggest a new MAC unit cell to improve cell area. Next, we propose the SAR-TDC, which, as mentioned above, achieves high resolution and increases throughput.

A. DELAY CELL
Time-based computing MAC unit cells can be categorized by how they generate delay. Largely, there are two categories: using a current-starving inverter [Fig. 3(a)] and using capacitance [Fig. 3(b)]. The current-starving type changes the cell delay by controlling the current source of the inverter. The capacitance type changes the cell delay by controlling the output load capacitance of the inverter chain. So far, the current-starving type has been used more frequently due to its better energy efficiency [4]. However, in cryogenics, where the voltage headroom is limited by the increasing V_th and the potential reduction of the supply voltage in a future cryo-optimized MOSFET process for low-power operation [3], the effect of variation will increase and will detrimentally affect computation accuracy. Fig. 3(c) shows a Monte Carlo simulation demonstrating that the capacitance type is more tolerant to variation under voltage-headroom reduction, as it falls into the subthreshold region later. Especially when the current-starving inverter's bias voltage (V_B) is increased to reduce latency, the discrepancy becomes severe. Thus, in this article, the capacitance type is employed. The conventional capacitance-type delay generation unit cell is composed of an SRAM cell, a NAND gate, a switch transistor, and a capacitor [5]. The multiplication of weight and activation is performed in the NAND gate, whose output turns on the switch transistor and connects the capacitor to the delay line. To improve area efficiency, the modified unit cell structure of Fig. 4(a) is suggested. It comprises only an SRAM cell, two transistors, and a metal-oxide-metal capacitor (MOMCAP), which can be overlaid on the transistors. The operation of the proposed unit cell is as follows: when the activation and weight values are both one, the bottom plate of the capacitor is connected to ground, while the top plate is connected to the delay line. Otherwise, when the activation or the weight becomes zero, the capacitor is disconnected from the delay line or left floating, respectively, and does not act as an output load of the delay inverter [Fig. 4(b)]. Fig. 4(c) shows the unit cell layout for estimating the area under the 28-nm design rule, yielding a 1.68 × 1.17 µm unit cell area.
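A first-order RC sketch captures the cell's behavior: the MOMCAP loads the delay inverter only when the stored weight AND the activation are both 1; otherwise only the parasitic load remains. All R and C values below are illustrative assumptions, not extracted layout parasitics.

```python
def stage_delay_ps(weight: int, activation: int,
                   c_par_ff: float = 2.0, c_mom_ff: float = 3.0,
                   r_on_kohm: float = 5.0) -> float:
    """0.69*R*C estimate of one inverter stage's delay.  The MOMCAP
    (c_mom_ff) is switched onto the delay line only when both the
    stored weight and the activation are 1; otherwise only the
    parasitic load (c_par_ff) remains.  kOhm x fF gives ps."""
    c_load_ff = c_par_ff + (weight & activation) * c_mom_ff
    return 0.69 * r_on_kohm * c_load_ff

fast = stage_delay_ps(1, 0)   # capacitor floated / disconnected
slow = stage_delay_ps(1, 1)   # capacitor loads the line
```

The binary multiply thus costs no extra gate: it is absorbed into whether the capacitor participates in the load.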
To estimate the unit cell operation at cryogenic temperature, we ran a SPICE simulation using a 28-nm bulk CMOS BSIM4 model calibrated to the measured dc behavior of the device at 4 K using the BSIMProPlus tool [6]. For the 300-K simulation, the off-the-shelf process design kit (PDK) of the same device was employed. Fig. 5(a) plots the variation in the accumulated delay under spatial variation, i.e., different input-weight combinations with the same total MAC value. At 300 K, only about 1.1% of t_unit_delay variation occurred due to spatial variation (≈0.2 ps). The variation slightly increased to 3.7% at 4 K (≈1.36 ps), which is nevertheless insignificant. Fig. 5(b) shows the Monte Carlo simulation results of the accumulated delay at 300 K. Even at the largest deviation, the standard deviation over the average t_unit_delay (σ/µ) is around 1.5%, much smaller than in current-based CIM. Due to the lack of cryogenic PDK support, we were not able to run a Monte Carlo simulation at 4 K. According to recent papers, how mismatch variation changes with temperature depends on the process technology: some processes showed similar or only negligibly increased variation at cryogenic temperature compared with room temperature [7], [8], while others showed the opposite trend [9], [10]. Hence, a process that shows less variation at lower temperature will be preferred for cryogenic IC design, and we assume the process technology used here follows this preference. Assuming σ/µ to be similar, the classification accuracy is expected to be maintained at cryogenic temperature.
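Why the accumulated delay is mismatch-tolerant can be seen with a toy Monte Carlo: if each unit delay carries independent Gaussian mismatch, the σ/µ of a column's total shrinks roughly as 1/√N as contributions average out. The 10% per-cell mismatch assumed below is an arbitrary illustration, not a fitted device number.

```python
import random
import statistics

def sigma_over_mu(n_cells: int, unit_mu_ps: float = 5.0,
                  unit_sigma_ps: float = 0.5, trials: int = 2000) -> float:
    """sigma/mu of a column's total delay when each of n_cells unit
    delays has independent Gaussian mismatch (assumed 10% per cell)."""
    rng = random.Random(0)  # fixed seed for a repeatable sketch
    totals = [sum(rng.gauss(unit_mu_ps, unit_sigma_ps)
                  for _ in range(n_cells)) for _ in range(trials)]
    return statistics.stdev(totals) / statistics.fmean(totals)

r1 = sigma_over_mu(1)     # ~0.10: a single cell sees the full mismatch
r64 = sigma_over_mu(64)   # averaged down by roughly sqrt(64) = 8
```

This averaging is one reason a long DDL can keep its effective resolution even when individual cells vary.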

B. SUCCESSIVE APPROXIMATION REGISTER TIME-TO-DIGITAL CONVERTER
Time-domain CIM always meets a trade-off at its time-to-digital conversion stage. The time-domain CIM macros of [4] and [5] employ a flash-type TDC to convert the accumulated delay to 2-bit digital data. This scheme requires 2^N − 1 samplers to achieve N-bit resolution and is therefore inefficient for targeting high resolution, as the TDC area will occupy multiple columns and will be susceptible to variations. The research in [11] used a gated clock to measure the MAC value represented in the pulsewidth. This scheme can be advantageous in resolution and area but requires large power for generating and buffering a high-frequency clock. If the clock frequency is lowered to reduce power, the computing latency increases.
The SAR-TDC proposed in this article can be an efficient solution for time-based CIM to achieve high resolution. By using the successive approximation scheme, the number of samplings is reduced to N to attain N-bit resolution. Moreover, as it employs only a single RDL rather than a clock, it does not suffer from the power-latency trade-off. Fig. 6(a) shows the proposed SAR-TDC block diagram. Each sampling stage comprises a single set-reset (SR) latch, an OR gate, an AND gate, and delay cells. In the first stage, the DDL and the RDL, whose delay equals that of a MAC value of half the maximum, are input to the SR latch, OR gate, and AND gate in parallel. The OR gate's output rises when either the DDL's or the RDL's rising edge arrives; thus, its output represents the earlier path. Conversely, the AND gate's output represents the later path. The SR latch detects which rising edge comes earlier. Between each pair of stages, namely, the kth and (k + 1)th stages, the earlier path goes through delay cells whose delay is programmed to be 1/2^(k+1) of the maximum delay that can be accumulated in the DDL. The later path goes through the same number of delay cells as the earlier path, but without additional delay accumulation.
This selective delay accumulation plays the same role as reference voltage modulation in a SAR-ADC. In a SAR-ADC, the successive approximation is done by narrowing the gap between the data voltage and the reference voltage by increasing or decreasing the reference voltage, with the direction decided by the previous sampling result. In the SAR-TDC, increasing or decreasing the reference voltage of a SAR-ADC corresponds to accumulating delay on the RDL or the DDL, respectively. The decision step of the SAR-ADC is automatically included in the SAR-TDC by adding delay only on the earlier path, which reduces the delay gap between the two rising edges. Specifically, when the DDL's rising edge (or the MID path's rising edge following the DDL) is earlier than the RDL's rising edge (or the MID path's rising edge following the RDL), the OR gate takes the DDL's rising edge and adds delay. In the opposite case, the delay is added to the RDL's rising edge. However, even though the successive approximation is done internally and automatically in the AND/OR gates, because the MID_E and MID_L paths do not always represent only one of the DDL or RDL and change at each stage, this output does not directly give the final converted output. The SR latch is located there for this reason: it first detects whether MID_E's rising edge is earlier than MID_L's rising edge (LATE<2:0>). By looking at the previous stage's SR latch result, whether MID_E is the DDL's extension or the RDL's extension can be determined. Accordingly, the LATE signal of each stage is output either inverted or as is. The operation timing diagram and output table are shown in Fig. 6(b) and (c), respectively.
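The successive approximation loop described above can be modeled in a few lines if the MID_E/MID_L bookkeeping is abstracted away and the two edge arrival times are tracked directly: at stage k, the latch records which edge is later, and the earlier edge picks up an extra t_max/2^(k+1) of programmed delay, halving the remaining gap. Times below are in unit delays; a full scale of 128 units with the reference at 64 mirrors the 128-input, 5-bit column, but this is a behavioral sketch, not the gate-level circuit.

```python
def sar_tdc(t_data: float, t_ref: float, t_max: float, n_bits: int = 5) -> int:
    """Behavioral SAR-TDC: each stage compares the data and reference
    edges (the SR latch decision) and then delays whichever edge arrived
    earlier by t_max / 2**(k+1), exactly like reference modulation in a
    SAR-ADC.  Returns the MSB-first digital code."""
    code = 0
    for k in range(1, n_bits + 1):
        data_late = t_data > t_ref            # latch: DDL edge later?
        code = (code << 1) | int(data_late)   # MSB first
        step = t_max / 2 ** (k + 1)
        if data_late:
            t_ref += step                     # delay the earlier (reference) edge
        else:
            t_data += step                    # delay the earlier (data) edge
    return code

# A MAC value of 70 (out of 128 unit delays, reference at 64) quantizes
# to 70 // 4 = 17 at 5-bit resolution:
code_70 = sar_tdc(70.0, 64.0, 128.0)
```

Note that the model sidesteps the per-stage LATE inversion: because it tracks which physical edge is which at every stage, the raw comparison bits already form the output code.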

IV. BENCHMARK SIMULATION RESULTS

A. BENCHMARK MODELS AND d = 3 SIMULATION RESULT
The benchmark simulation was performed to estimate the performance improvement over a current-based CIM surface code decoder. To estimate the latency of the delay lines and the TDC (collectively referred to as the accumulation-and-sample stage), SPICE simulation with the cryogenic CMOS model was employed. For estimating the subsequent peripheral circuitry, including shift-and-add, NeuroSim [12] incorporating the cryogenic CMOS model was used. Fig. 7 shows the NNbD architecture proposed in [1], which was also used as the simulation model of this article. The network is composed of three stages, built from long short-term memory (LSTM) networks and a feedforward (FF) network. For successful error correction, the X/Z-type ancilla qubits are read and input to the NNbD multiple times. The cascaded LSTM layers can work in a pipelined manner and finally calculate the logical qubit error probability at each measurement. Fig. 8 shows the internal structure of the LSTM layer. The hidden state, which is the output of the previous LSTM layer, is concatenated with the input vector and goes through vector-matrix multiplication (VMM) and activation functions. The activation function outputs are pointwise added/multiplied with the cell state and output to the next layer. In this research, the latency of the LSTM layer was estimated by mapping the VMM onto a CIM array. The remaining activation functions and pointwise operations were assumed to be implemented as in software-based decoder counterparts, i.e., in the lookup table and gate array of an FPGA, respectively. Fig. 9 shows the latency comparison of the largest LSTM stage (the second stage), which determines the pipeline latency of Fig. 7 and is the decisive element in whether the QEC decoder timing constraint is met. The latency is compared when the LSTM network is composed of conventional current-based CIM versus time-based CIM.
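For reference, a 128 × 256 array for one LSTM stage is consistent with a single fused gate matrix at equal input and hidden widths of 64, as the quick sketch below shows. The 64/64 split is our inference from the array size, not a figure taken from [1].

```python
def lstm_vmm_shape(n_input: int, n_hidden: int) -> tuple:
    """Shape of the LSTM weight matrix mapped onto one CIM array: the
    concatenated [x_t, h_{t-1}] vector (rows) drives the four fused
    gate pre-activations i, f, g, o (columns) in a single VMM."""
    return (n_input + n_hidden, 4 * n_hidden)

shape = lstm_vmm_shape(64, 64)   # the 128 x 256 array discussed later
```

Mapping all four gates into one array is what lets the accumulation-and-sample stage serve the whole VMM in one pass.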
Despite the latency increase at the accumulation-and-sample stage, the SAR-TDC's small area occupancy allows the column multiplexing to be reduced (from 8-to-1 to 2-to-1), thereby reducing the dominant digital-side latency (to about 0.6×) and increasing overall throughput.
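The throughput effect of the multiplexing reduction is simple sequential arithmetic: columns sharing one converter are digitized one after another, so whole-array conversion time scales with the mux ratio. The 50-ns per-conversion figure below is an arbitrary placeholder, not a simulated number.

```python
def array_conversion_time_ns(mux_ratio: int, t_conv_ns: float) -> float:
    """Time to digitize every column when mux_ratio columns share one
    TDC/ADC and must therefore be converted sequentially."""
    return mux_ratio * t_conv_ns

t_8to1 = array_conversion_time_ns(8, 50.0)  # 8-to-1 muxed flash-TDC case
t_2to1 = array_conversion_time_ns(2, 50.0)  # 2-to-1 muxed SAR-TDC case
```

Going from 8-to-1 to 2-to-1 multiplexing thus cuts the sequential conversion passes by 4×, which is where the digital-side latency saving comes from.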

B. SUPPLY VOLTAGE REDUCTION AND NETWORK SIZE SCALABILITY
In [3], V_th and V_DD modulation for a cryogenic CMOS process was suggested in order to achieve faster operation speed or lower power consumption. However, the decreased voltage headroom reduces the dynamic range of the accumulated output current in current-based CIM, thereby hindering high resolution in a large network. On the other hand, the computing method of time-based CIM, using a variable capacitor and inverter chain, allows more flexible voltage and network size scaling.
In the following simulations, we show that the proposed time-based CIM macro achieves better scalability when using a CMOS process optimized for cryogenic application. Following the methodology of [3], to optimize the MOSFET for cryogenic application, V_th and V_DD were first reduced by exploiting the steeper subthreshold slope and higher on-current transconductance at cryogenic temperature (Fig. 10). Specifically, V_th was reduced from 0.45 to 0.15 V, and V_DD from 1 to 0.7 V, assuming metal work function engineering is employed. Then, the time-based CIM macro was rebuilt with the modified cryo-optimized MOSFET model. On the circuit side, the length of the DDL was doubled to handle the doubled LSTM network size at code distance d = 7 (256 activations × 128 weights). Likewise, a Monte Carlo simulation [Fig. 11(a)] with the 300-K off-the-shelf PDK confirmed that the target resolution (5 bit) can still be achieved under variations. Meanwhile, assuming that current-based CIM cannot tolerate 256 MAC units at 5-bit resolution, we distributed the 256 MAC operations into two columns (128 MAC units per column). Fig. 11(b) shows the latency increase of the 256 × 512 LSTM network relative to the previous 128 × 256 LSTM network when using current-based or time-based CIM, respectively. While the current-based CIM showed about a 3× latency increase as both the number of columns and rows doubled, the time-based CIM showed a latency increase of less than 2×.

V. CONCLUSION
A time-based CIM architecture for neural networks operating at cryogenic temperatures has been proposed. First, among previously proposed delay modulation schemes, the most suitable scheme was selected by analyzing the effect of cryogenic temperature on the CMOS process and, in turn, on the time-based CIM. An advanced unit cell design was then proposed to improve area efficiency. Next, the SAR-TDC was proposed to solve the trade-off among area, power, and resolution, which was not achieved in previously reported time-based CIM. Unlike a flash-type TDC, which cannot achieve high resolution, or a TDC that requires a fast free-running clock, the proposed SAR-TDC is constructed using only logic gates and capacitors and can achieve high resolution within an area corresponding to a few columns. In the first benchmark simulation, the latency of the proposed time-based CIM was compared with current-based CIM: about a 40% improvement was shown, which comes from the reduction in column multiplexing enabled by the small SAR-TDC area. The benchmark simulation was further expanded to a larger array with a V_th-engineered transistor model. The simulation result shows that the voltage scalability of time-based CIM allows a larger computation size per column and, eventually, 1.56× better array-size scalability under the simulated benchmark model.