Memristive Devices for Time Domain Compute-in-Memory

Analog compute schemes and compute-in-memory (CIM) have emerged in an effort to reduce the increasing power hunger of convolutional neural networks (CNNs), which exceeds the constraints of edge devices. Memristive device types are a relatively new offering with interesting opportunities for unexplored circuit concepts. In this work, the use of memristive devices in cascaded time-domain CIM (TDCIM) is introduced with the primary goal of reducing the size of fully unrolled architectures. The different effects influencing the determinism of memristive devices are outlined together with reliability concerns. Architectures for binary as well as multibit multiply and accumulate (MAC) cells are presented and evaluated. As more involved circuits offer more accurate compute results, a tradeoff between design effort and accuracy comes into the picture. To further evaluate this tradeoff, the impact of variations on overall compute accuracy is discussed. The presented cells reach an energy/OP of 0.23 fJ at a size of <inline-formula> <tex-math notation="LaTeX">$1.2~\mu \text{m}^{2}$ </tex-math></inline-formula> for binary and 6.04 fJ at <inline-formula> <tex-math notation="LaTeX">$3.2~\mu \text{m}^{2}$ </tex-math></inline-formula> for <inline-formula> <tex-math notation="LaTeX">$4\times 4$ </tex-math></inline-formula> bit MAC operations.


I. INTRODUCTION
Surpassing the standing record in the ImageNet challenge by far, AlexNet started a continued surge in the use of convolutional neural networks (CNNs). A trend of ever-increasing network complexity is observed, improving accuracy at the cost of a growing memory footprint and power consumption. To tackle both of these challenges, new schemes for computation have emerged that take inspiration from the human brain, i.e., the domain of neuromorphic computing. Thereby, one key principle is compute-in-memory (CIM), an approach that co-locates data and computation to address the von Neumann bottleneck [1].
In this domain, analog computing is often considered to decrease power consumption further. The resilience of deep neural networks to a certain degree of imprecision is exploited with a moderate impact on the classification accuracy [2]. While current and charge domain computing have enjoyed high popularity for CIM [1], [3], [4], [5], they require expensive analog-to-digital conversion and offer little room for voltage scaling. A different compute scheme is time-domain CIM (TDCIM). In time-domain (TD) computing, values are encoded as discrete arrival times of signal edges. While signaling is sample-discrete, the arrival time itself is fundamentally continuous. Similar to charge and current, time is inherently additive, allowing for efficient accumulation operations.
TD implementations vary, one example being the integration of currents on a capacitor and observing the reached voltage level [6], [7]. The total time is given by

t = (C · U) / (N · I_MAC)    (1)

with N as the number of multiply and accumulate (MAC) operations and I_MAC as the current component of a single MAC operation. It becomes evident that for rising N, the time difference between adjacent MAC results diminishes, making time-to-digital conversion increasingly harder. Compensating this effect by increasing the voltage, U, or the output capacitance, C, comes at added cost. In this tradeoff, N is kept small, requiring more time-to-digital conversions and accumulation of partial sums in the digital domain, thereby increasing power consumption. Cascaded TDCIM chains digitally adjustable delay elements to introduce a delay as a function of the multiplication result (see Fig. 1). The full-swing operation offers margin for voltage scaling. Cascaded TDCIM was first introduced in [8], where registers are implemented by standard cell latches, allowing the complete kernel to be unrolled in a weight-stationary dataflow. However, the adopted standard cell memory suffers from a large area. For a binary network implementation with an energy efficiency of 1.05 POPS/W, [9] devotes 22% of the total MAC area to memory. In [10], a cascaded TDCIM for ternary operation at 716 TOPS/W is shown using custom current-starved inverters combined with static random-access memory (SRAM). The custom SRAM co-integrates the delay cell, leading to a memory footprint of 50% of the total MAC area. For multibit operation, the increased memory size limits the realizable chain length.
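The resolution problem in (1) can be illustrated with a short sketch. The parameter values below (capacitance, threshold voltage, per-MAC current) are illustrative assumptions, not figures from the text:

```python
# Sketch of the current-integration tradeoff in (1): the time for N parallel
# MAC currents to charge a capacitor C up to a threshold voltage U.
# All parameter values are illustrative assumptions.

def integration_time(n_mac: int, c: float = 10e-15, u: float = 0.8,
                     i_mac: float = 1e-6) -> float:
    """Time until the integrated voltage reaches U with n_mac active currents."""
    return c * u / (n_mac * i_mac)

# The separation between adjacent MAC results shrinks roughly as 1/N^2,
# so time-to-digital conversion gets much harder for long chains:
d_small = integration_time(7) - integration_time(8)
d_large = integration_time(127) - integration_time(128)
assert d_small > 100 * d_large
```

This is why such designs keep N small and accumulate partial sums digitally, as described above.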
The designs in [11], [12], and [13] all use commercial SRAM, separating memory and MAC operation. Due to the row-sequential access of SRAMs, the chain size is limited by the number of SRAM columns. Therefore, these designs have short chain lengths of 128, 121, and 64, respectively. Two-terminal memristive switching devices can be used to replace SRAM to reduce the area. Among the different physical realizations, filamentary switching devices based on the valence change mechanism (VCM) are one of the most studied variants [14]. First memory macros have been introduced for embedded memory applications [15], [16]. Using such memory arrays, further application areas such as neuromorphic engineering have been demonstrated [17], [18]. The best area savings are achieved in monolithic 3-D structures, placing the memristive devices in the back end of line (BEOL). Further area reduction potential results from their potentially high resistance, leading to smaller capacitors in the delay elements. At the same time, their nonvolatile storage provides inherent leakage power savings.
On the downside, today's devices still feature high variability. As TDCIM is susceptible to process, voltage, and temperature (PVT) variations, a prudent assessment of this is key. This work aims to assess these nonidealities and their impact on important design metrics to better understand the tradeoffs and limits that memristive devices entail in cascaded TDCIM. Section II gives an introduction to the general concept of memristive cascaded TDCIM. Section III introduces basic concerns of VCM reliability and mechanisms of variability. Section IV presents a binary TDMAC cell based on VCM and discusses the implications of those nonidealities. In Section V, this concept is extended to the multibit case. Finally, we conclude in Section VI.

II. CASCADED TDCIM ARCHITECTURE
The operation of a typical convolution layer is shown in

y = f( Σ_i x_i · w_i )    (2)

where x is the input activation and w is the weight vector. f is the activation function, usually the binarize function for binary neural networks (BNNs) and the rectified linear unit (ReLU) function for CNNs. In TD computing, cascaded variable delay elements can implement an accumulation. Each element realizes the delay to encode one multiplication result, thus realizing the MAC function. Unlike in traditional digital circuits, the convolution result is therefore presented as an accumulated delay. For the TDCIM architecture, the weights of one kernel are stored in the memory of one computing chain. For a kernel size of N with M computing chains in parallel, the total area, A_tot, is given by M · N · A_cell + A_TDC, with the cell area, A_cell, and the area for the time-to-digital converter (TDC), A_TDC. After computing a convolution, the activation signals are changed, whereas the weights can remain in memory. The weights are only updated after the complete output feature map is computed, thus reducing the data movement from the main memory. The input activation can be shared by all the computing chains, further reducing the data movement.
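The accumulation of per-cell delays described above can be sketched as a minimal behavioral model. The unit delay per result LSB, t_step, is an assumed value used only for illustration:

```python
# Minimal behavioral model of a cascaded TD compute chain: each cell adds a
# delay proportional to one multiplication result, so the edge arrival time
# at the end of the chain encodes the MAC value.

T_STEP = 25e-12  # assumed unit delay per result LSB

def chain_delay(x, w, t_offset=0.0):
    """Arrival time of the edge after traversing all cells of one kernel."""
    t = t_offset
    for xi, wi in zip(x, w):
        t += (xi * wi) * T_STEP  # one variable delay element per product
    return t

x = [1, 0, 1, 1]
w = [1, 1, 0, 1]
mac = sum(xi * wi for xi, wi in zip(x, w))  # = 2
assert abs(chain_delay(x, w) - mac * T_STEP) < 1e-18
```

Because the weights stay resident in the chain, only the activations change between convolutions, matching the weight-stationary dataflow described above.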
The MAC computation is processed in the TD and is converted into the digital domain by TDCs. For BNNs, the TDC reduces to a sampling of the output at a specific time point using a standard flip-flop, as shown in Fig. 2(a) [9]. Based on the arrival time of the computing delay, t_cmpt, and the threshold delay, t_th, the binarized output, out, is generated. For multibit CNNs, sampling can also be performed using a threshold chain, producing a temperature-coded output [see Fig. 2(b)] [19]. To save area, sampling can alternatively be performed by an oscillator combined with a counter, resulting in a tradeoff between area and sampling noise, as variations get amplified by the number of oscillations. Using the same delay elements in the TDC as in the compute chain ensures the best attenuation of global chip variations. While it is not possible to subtract time, there are multiple ways to implement negative numbers in TD computing. By adding an offset to the delay, negative numbers can be represented by delays shorter than said offset, as done in [12] and [20]. In [8] and [21], a negative and a positive path are used, with the numeric value being represented by the difference in delay.
By representing positive numbers with delays of a certain edge type and negative numbers with delays of the complementary edge type, the duty cycle serves as the output, with 50% marking the sign swap [22]. A basic technique to realize multibit numbers in TD lies in the bit-serial approach. Here, a multiplication can be split up into multiple multiplications with reduced word length, down to binary. This way, negative values can be implemented by means of sign-magnitude representation or one's/two's complement, as done in [13].
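The bit-serial decomposition can be sketched as follows for the plain unsigned case (sign handling via sign-magnitude or two's complement, as mentioned above, is omitted for clarity):

```python
# Bit-serial decomposition used for multibit TD computing: an X-bit activation
# times a weight is split into binary (1-bit) multiplications whose partial
# results are combined by shift-and-add. Unsigned case for clarity.

def bit_serial_mac(x_vals, w_vals, n_bits=4):
    """Accumulate sum(x*w) by processing one activation bit plane at a time."""
    total = 0
    for b in range(n_bits):                      # LSB to MSB bit plane
        plane = [(x >> b) & 1 for x in x_vals]   # 1-bit activations
        partial = sum(p * w for p, w in zip(plane, w_vals))
        total += partial << b                    # shift-and-add
    return total

xs, ws = [3, 7, 2], [5, 1, 6]
assert bit_serial_mac(xs, ws) == sum(x * w for x, w in zip(xs, ws))  # 34
```

Each bit plane maps to one pass through a binary-style compute chain, which is why the multibit MAC cell in Section V only needs a 1-bit by X-bit multiplier.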

A. MEMRISTIVE TDCIM
Memristive devices are used as the resistive component of the RC element in TD computing. In a cascaded TD design, C is typically realized by the input capacitance of the inverter stage that follows the memristor. The resulting difference in cell delay as a function of resistive states R_1 and R_2 is approximated by

Δt ≈ ln(2) · C · (R_1 − R_2).    (3)

As the numeric result is a function of the programmed resistance values, any device-to-device or cycle-to-cycle variations in the memristive elements impact this delay value. As there are no write operations during the multiplication of a kernel with the input feature map, these two can be lumped into a single variation in resistance. An equivalent of the classical signal-to-noise ratio (SNR) can be found for TD computing as µ_t/σ_t, with µ_t being the average delay step and σ_t its standard deviation. Therefore, not only the variation in resistance but also the total resistance is important. A roughly constant variation in current can be assumed for the low resistance state (LRS). For higher resistances within this state, this leads to an increase in the relative variation, as the nominal current, I_nom = U/R, goes down:

σ_R/R ≈ σ_I/I_nom = σ_I · R/U.    (4)

This observation is also made in [23] for cycle-to-cycle variations and in [24] for chip-to-chip variations. For this reason, small delays can be realized with lower variations than long delays. In the binary case, the high resistance state (HRS) is programmed to realize long delays. As long delays introduce a higher total error for the same relative error, HRS variability bounds binary accuracy. Resistances for multibit operation, on the other hand, lie within the LRS to allow for multiple resistance steps. Thus, multibit accuracy is bound by LRS variability. Besides the effects influencing variability, reliability aspects have to be considered. These include read disturb, writability, and forming, and will be addressed in the next section.
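Under the RC approximation in (3) and the constant-current-spread assumption behind (4), both effects can be sketched numerically. The capacitance, read voltage, and current spread below are assumed values:

```python
import math

# Sketch of (3) and (4): the cell delay step from two resistance states
# driving an inverter input capacitance C, and the growth of the relative
# LRS variation when the current spread is roughly constant.
# All parameter values are illustrative assumptions.

C = 0.5e-15          # assumed inverter input capacitance
U_READ = 0.4         # assumed read voltage across the memristor
SIGMA_I = 2e-6       # assumed (roughly constant) current standard deviation

def delay_step(r1: float, r2: float) -> float:
    """Delay difference of an RC stage switching near U/2: ln(2)*C*(R1-R2)."""
    return math.log(2) * C * (r1 - r2)

def rel_resistance_sigma(r: float) -> float:
    """Relative resistance variation grows with R because I_nom = U/R shrinks."""
    i_nom = U_READ / r
    return SIGMA_I / i_nom

# Longer delays come with a larger relative spread, as (4) states:
assert delay_step(150e3, 1.5e3) > delay_step(15e3, 1.5e3)
assert rel_resistance_sigma(15e3) > rel_resistance_sigma(1.5e3)
```

The second assertion is the circuit-level reason why HRS variability bounds binary accuracy while LRS variability bounds the multibit case.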

III. VARIATIONS AND RELIABILITY CONCERNS OF VCM
The VCM cells consist, in their simplest form, of a metal oxide (e.g., TiO2, ZrO2, HfO2, Ta2O5, or SrTiO3) sandwiched between two different metal electrodes [25]. The switching mechanism in VCM cells is based on the movement of oxygen vacancy defects within the oxide region [26], [27]. Prior to the repetitive switching, a so-called electroforming process is required, during which the oxygen vacancies are introduced into the system via an oxygen exchange process at the metal/oxide interfaces [28]. While it only happens once, the variability of this process may influence the later switching performance and can already lead to device-to-device variability [29]. Due to the switching variability, the programmed states also show some variability, as illustrated in Fig. 3(a). Whereas the programmed high resistive state shows a log-normal distribution of the resistance states, the programmed low resistive state shows a normal distribution. This behavior is typical for filamentary VCM cells [30], [31], [32]. It was observed that the states relax after programming, which leads to a widening of the distribution [33], [34], [35]. Moreover, it was shown that reprogramming the tail bits (i.e., shaping the distribution) by applying an additional programming pulse has no effect, as the stable distribution is restored after some time [30], [34], [35]. Another variability aspect of concern is the state retention, i.e., the change in the resistance over time. It has, however, been shown that the states are stable over a few hundred hours at temperatures above 150 °C [36], [37], [38]. Read disturb describes a directed change in a resistance distribution under repeated application of read pulses. It therefore represents not only a variability concern but a reliability concern as well. So far, read disturb properties have been investigated mostly for binary devices and under the assumption of a constant voltage during each read pulse.
Under these assumptions and for the use case of an embedded memory, recent results have shown that read disturb is not a critical issue in advanced integrated technologies [36]. In [39], we have demonstrated read disturb stability for binary devices under constant read voltage pulses for up to 5 × 10^10 VMM operations, suggesting that far larger numbers of read operations are possible than those conventionally investigated for memory applications, under the right conditions. For this stability, the SET voltage has to be kept below −0.3 V and the RESET voltage has to be kept below 0.5 V. While short read pulses can be expected for cascaded TDCIM, further investigations of read disturb are needed, as the shape and duration of the read pulse heavily depend on the circuit implementation.
Besides read disturb, other reliability aspects have to be investigated for the use of memristors. Due to the ionic nature of the switching mechanism, the switching process is inherently stochastic. Owing to the small dimensions of the filament, Joule heating occurs in the device, accelerating the switching process further. On the one hand, Joule heating is the key enabler solving the voltage-time dilemma, i.e., high stability at small read voltages while enabling fast switching times at high write voltages [40]. On the other hand, Joule heating introduces a positive feedback during switching [41], leading to a strong state dependence of the switching time. It has been shown that the switching time can vary over orders of magnitude for a given voltage, from cycle to cycle and device to device [42], [43]. In consequence, one can expect to have slow switching devices and fast switching devices in an array, as illustrated in Fig. 3(b). Tuning for good writability therefore also increases the susceptibility to read disturb. The spread of switching times has further implications for the writing process. In principle, the programming pulsewidth could be chosen long enough for successful one-shot programming. However, most devices will then switch early, and a high amount of energy is dissipated. Moreover, the fast switching devices may be overprogrammed, leading to failed rewrite attempts. Thus, in most cases, a so-called write-verify process is used to program the desired resistance states [33], [44]. Le et al. [33] demonstrate such programming of eight different resistance states on a large array. Resistance relaxation, or retention, describes changes in the programmed resistance distributions and therefore also introduces reliability concerns. In contrast to read disturb, it is neither directed toward a certain resistance state nor directly associated with repeated reading, as exhibited, e.g., by the relaxed HRS distribution in Fig. 3(a).
In [33], the retention properties of 3-bit VCM cells were investigated, in which the bit error rate (BER) increased from 0% to 0.6% over the course of the experiment. As the relaxation is stronger at higher resistances, the resistances were all kept below 35 kΩ [33]. The maximum voltages required to operate the devices are important, as they have to be supported by the transistors. As the initial forming step requires the highest voltages applied for the longest time, it is the most critical in that regard. Different proposals have been made to tackle this problem, such as implanting the oxide of the VCM devices [45] or adapted pulsing schemes [46]. In [47], the use of an additional deep n-well allowed all the applied voltages to be kept within the limitations of the core devices (1 V), while still allowing high enough voltages for the forming process (1.65 V). In addition, the gap between the forming voltage and the technology node has been shrinking, from around 2 V at 130 nm to about 1 V at 14 nm [47]. Compared with bulk devices, fdSOI offers an elevated drain-source breakdown voltage (BVDS), with [48] reporting more than 2 V for soft breakdowns in a 22-nm process. Therefore, elevated voltages for the forming step can be tolerated, as this step is done only once, and time-dependent degradation is therefore negligible.
To investigate the variability and reliability of filamentary VCM cells as shown in Fig. 3(a), the respective devices with a 30-nm Pt/5-nm ZrO2/20-nm Ta/30-nm Pt stack were fabricated into a 7 × 7 µm crossbar architecture. The Pt bottom electrode is connected to ground, and all the voltages are applied to the Ta/Pt top electrode. For the sake of comparability, all the voltages in this article are given with respect to 0 V at the Ta/Pt top electrode. Further details on the device fabrication and the measurement setup can be found in [39].

IV. PROPOSED MAC CELL FOR BNNs

Fig. 4 shows implementations of binary TDCIM MAC cells in classical CMOS (a) compared with a memristor-based solution (b) and the corresponding stick diagram (c). Both designs allow for computation on rising and falling edges, increasing the energy efficiency. For BNNs, weights are typically defined as −1 and 1, as presented in [49]. The values −1 and 1 can be mapped to 0 and 1, thus translating an XNOR operation into a multiplication. In the CMOS case, the XNOR gate therefore implements the multiplication. The result of the multiplication is then connected to the variable delay element consisting of a multiplexer and a delay cell. In the memristive implementation, memory, multiplication, and variable delay are not as clearly separable. A construct of two complementarily controlled transmission gates with two complementarily programmed memristors acts as a resistive XNOR gate. For W = 1, the lower memristor is in HRS and the upper memristor is in LRS. Thus, X = 1 activates R_HRS in the delay path and X = 0 activates R_LRS. For W = 0, the memristors are swapped, leading to an inverted operation with respect to X. The resistance of this gate holds the weight but also realizes the delay, as it forms an RC element together with the gate capacitance of the output inverter. The NMOS before the output inverter is used for programming. For the memristive implementation, an inverting design is shown, whereas the CMOS example is noninverting. This only leads to a swap of the trigger direction in the TDC in the case of an odd number of delay elements for the inverting design.

A. VARIABILITY ANALYSIS
The delay of individual MAC cells is sensitive to noise and PVT variations. Here, process variations can be separated into global or chip-to-chip variations as well as local variations, which are present on a single chip. For TDCs built from the same delay cells as used for the computation, global variations have the same impact on all the delay paths. Thus, the accuracy of the computation is only susceptible to local variations [50], considering a matched TDC circuit [9].
The compute chain SNR is directly related to the SNR of a single cell. As the delays of the individual stages accumulate over the course of the compute chain, σ_chain can be obtained by

σ_chain = sqrt( Σ_{i=1}^{N} σ_i² )    (5)

with N being the length of the compute chain and µ_i and σ_i the mean and standard deviation of the ith cell, respectively. Thus, the SNR of the computation is given by

SNR = ( Σ_{i=1}^{N} µ_i ) / σ_chain.    (6)

Due to σ_chain growing only with the square root of the chain length while the mean grows linearly, longer chains generally provide a better SNR. For the binary case, assuming a uniform MAC cell delay step size Δt with standard deviation σ_t, this simplifies to

SNR = √N · Δt / σ_t.    (7)

In [51], we use this relationship to obtain the mean square error (mse); the central limit theorem allows Gaussian variations to be assumed for the compute chain. Plotting mse/N in Fig. 5 reveals a regime where the mse is zero. Here, the compute chain is sufficiently short that the error stays below the threshold to the next value, i.e., σ_chain remains well below half the delay step [51]. In this regime, the TD computation can be assumed to be purely deterministic. Usually, operation does not take place in this regime, as it only allows for short compute chains or large delay increments, and therefore low energy efficiency. Due to the HRS providing a lower current, classical crossbar vector multipliers are especially sensitive to variations within the LRS. For binary TD computing, this relationship is reversed, as the HRS has the higher time constant. To model the influence of these variations compared with the process variations seen in classical transistors, the PDK of a commercial 22-nm fdSOI technology was used to obtain transistor device models including back-annotated variability. Memristors are modeled as a resistance with configurable process variability.
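The square-root scaling of the chain SNR can be checked with a quick Monte Carlo sketch. The per-cell delay step and spread are assumed values:

```python
import math
import random
import statistics

# Monte Carlo check of (5)-(7): per-cell delays with mean MU and std SIGMA
# accumulate along the chain, so the chain std grows with sqrt(N) while the
# mean grows with N, improving the chain SNR by sqrt(N).
# MU and SIGMA are illustrative assumptions.

random.seed(7)
MU, SIGMA = 30e-12, 1.5e-12   # assumed per-cell delay step and spread

def chain_stats(n: int, trials: int = 4000):
    """Empirical mean and std of the accumulated delay of an n-cell chain."""
    sums = [sum(random.gauss(MU, SIGMA) for _ in range(n)) for _ in range(trials)]
    return statistics.mean(sums), statistics.stdev(sums)

m64, s64 = chain_stats(64)
m256, s256 = chain_stats(256)
snr64, snr256 = m64 / s64, m256 / s256

# Quadrupling the chain length should improve SNR by roughly sqrt(4) = 2:
assert 1.6 < snr256 / snr64 < 2.4
```

This is the reason long unrolled chains are attractive here, as opposed to the current-integration scheme of (1), where longer chains degrade the resolution.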
In contrast to other compute schemes, cascaded TD computing offers good voltage scaling capabilities. By scaling the voltage, efficiency is traded off against a lower SNR due to the higher impact of transistor threshold voltage variations. In Fig. 6, we provide the cell-level SNR for different voltage levels and different combinations of relative memristor variance and HRS resistance. For low memristor variations, transistor variations contribute significantly to the cell variations, leading to more than 2% cell variance for σ_RHRS = 1%. For scaled voltages, this effect is amplified, leading to cell variations of more than 4%. For a higher noise assumption, the memristor variance dominates and this effect diminishes. Here, we see that a higher R_HRS delivers a better SNR. This can be explained by an increase in µ, while the memristor variation remains constant over voltage. The red line indicates the HRS from the tuned case in Fig. 3(b). The assumed LRS resistance is 1.5 kΩ, corresponding to the measurements above. A lower mean resistance leads to a similar SNR as the example with higher memristor variations but higher mean.
In [51], we show that (σ/µ) below 6% has negligible impact on network accuracy for the MNIST dataset. While this threshold is surpassed for the tuned HRS case, for σ R HRS = 5%, the requirement is fulfilled even for voltage scaling to 0.6 V.

B. WRITE SCHEME AND READ DISTURB
Besides variations, writability and forming of the memristors are concerns when combined with core devices, as high voltages are typically required. For forming, elevated voltages are assumed to be tolerable, as this step is performed only once. Tuning for easier writability trades off with susceptibility to read disturb. Therefore, these concerns are closely related. Fig. 7 shows the write scheme for the proposed cell. The inverting property is used so that setting the chain-wide programming enable, P_en, pulls all the inverter stage inputs to P_in, resulting in a differential voltage over the resistive XNOR gate. Altering X allows a selection of the memristors to be written. To reduce the voltage drop over the transistors, fdSOI back-gate biasing is used, and all the transistors involved in programming are super-low-threshold devices. In addition, the programming signals and the control signals for X can be chosen higher than V_dd, as the voltage drop over the transistors ensures safe operating margins for all the devices. This way, the differential programming voltage can be increased. After the chain-wide write procedure, a write-verify process can be implemented for the HRS by controlling the X input cell by cell.
To test for writability, the Verilog-A model presented in [30] (JART VCM v1b) was used with the maximum oxygen vacancy density reduced to N_plug = N_disk,max = 4 × 10^26 m^−3. In this configuration, the HRS resistance, R_HRS, was determined as 150 kΩ, therefore generating an even longer read voltage pulse than previously. To find corners for fast and slow switching devices, the model parameters l_disk, r_disk, N_min, and N_max were altered, which represent the length of the disk region, the radius of the filament, and the minimum and maximum oxygen vacancy densities, respectively. To analyze writability, all the parameters were altered by ±10%. Writability was confirmed in all the corners.
To ensure read stability, 5000 pulses, equivalent to 10 000 computations, were applied to the TD MAC cell. Here, the fast switching corner was set up in an effort to create a realistic edge case. The relative cell delay shows no systematic drift and only small noise, indicating no read disturb issue even for a supply voltage of 0.8 V [see Fig. 8(a)]. This can be attributed to the very short differential voltage pulse over the memristor [see Fig. 8(b)], which represents another benefit of cascaded TDCIM. Table 1 shows a comparison of binary MAC cells for cascaded TDCIM. All the cells can implement unrolled kernels to allow for minimal memory movement. While [10] provides a small footprint when scaled, its delay element only consists of a single current-starved inverter, leading to a low SNR. For the sake of comparability, a version of the design was considered that queues multiple delay stages, scaling the SNR by √N_scale. For N_scale = 4, a σ_t/µ_t similar to this work is reached.

C. CELL COMPARISON
The design presented in [9] is optimized for SNR at the cost of area. Therefore, the area-sensitive DLY40 version evaluated in simulation was used for comparison. Area estimation is performed using the stick diagram from Fig. 4(c), assuming an eight-track cell height corresponding to the used technology and a width of 18 contacted poly pitches of 100 nm [52]. Accounting for technology scaling, the memristive implementation provides a 2× smaller footprint than [9] and a 3× smaller footprint than [10] adapted for comparable SNR. Comparing the SNR at scaled voltages reveals another major advantage of implementing cascaded TDCIM using memristors. When V_dd approaches the transistor threshold voltage, the variations in threshold voltage get amplified, decreasing the SNR. In contrast to the transistor, the memristor read behavior remains linear for lower voltages, leading to less deterioration of the SNR. In this comparison, the setup corresponding to the red curve in Fig. 6 is assumed.
Given the same die area, the throughput of the designs is mainly dominated by two parameters: the cell area and the delay step size, Δt. Minimizing the former allows for greater parallelism and therefore better throughput. Using wave pipelining as shown in [9], computations can be overlapped, speeding up the time per computation in a chain from t_min + Δt to Δt. Considering 1/(A · Δt) as a figure of merit for the throughput, the presented design also proves advantageous.

V. MEMRISTIVE MULTIBIT TDCIM

A. MAC CELL IMPLEMENTATION
Multibit TDCIM is mostly implemented in a bit-serial fashion. Here, MAC cells implement a 1-bit by X-bit multiplication, and intermediate results are combined by a shift-and-add operation. Slight changes to the MAC cell presented in Section IV realize such an implementation (see Fig. 9). For X = 0, the memristor is bridged by the upper TX gate and blocked by the lower one. For X = 1, the TX gates change roles and the memristor is connected in series with the output inverter. Here, the delay depends on the memristor state. By combining two compute chains, a sign-magnitude implementation is realized. To realize positive numbers, the memristors are programmed with (|W| + 1) · R_step in the positive chain and R_step in the negative chain. Negative numbers are implemented by swapping this assignment, resulting in a time difference of W · t_step between the two chains. For TDC + ReLU, the negative path can be used as the threshold chain input in Fig. 2.
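The sign-magnitude mapping above can be sketched behaviorally: the delay difference between the two chains encodes W · t_step regardless of sign. The unit delay below uses the 27.5-ps value quoted later in this section:

```python
# Sketch of the sign-magnitude mapping: the positive and negative chains are
# programmed with complementary resistance multiples so that the delay
# difference between them encodes W * t_step.

T_STEP = 27.5e-12  # unit delay per R_step (value taken from the text)

def chain_delays(w: int):
    """Per-cell delays of the positive and negative chain for weight w."""
    mag = abs(w)
    pos, neg = (mag + 1, 1) if w >= 0 else (1, mag + 1)
    return pos * T_STEP, neg * T_STEP

for w in (-3, 0, 5):
    t_pos, t_neg = chain_delays(w)
    assert abs((t_pos - t_neg) - w * T_STEP) < 1e-18
```

The offset of one R_step in both chains keeps every cell's delay nonzero while leaving the encoded difference unchanged.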
The write scheme for the multibit version can be directly adopted from the binary case due to the similarity of the designs. To prevent stress on the upper TX gate, however, P_in cannot be globally raised above 0.88 V and may only be increased for cells that are to be written in the SET direction. The findings on reliability in Section IV apply to the multibit MAC cell as, except for the case X = 0, the same devices are involved in writing and reading the circuit.
Due to the additional logic controlling the upper transmission gate, the area increases by the size of a NOR gate and an inverter, equivalent to 0.4 µm². Together with the negative chain, the area of the multibit design is 2 × (1.2 µm² + 0.4 µm²) = 3.2 µm², i.e., about 2.67 · A_cell. Energy consumption increases linearly with the word length of the activation, resulting in 10.8 fJ for the complete 4 × 4 bit MAC operation. Within this section, R_LRS is assumed as 15 kΩ, resulting in a unit delay step of 27.5 ps. Throughput is limited by the maximum cell delay, t_max, which is 257 ps for W = 7. At 0.7-V supply voltage, the energy/OP reduces to 6.04 fJ and t_max increases to 265 ps.
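The area figures above can be cross-checked with a quick back-of-the-envelope calculation (binary cell 1.2 µm², extra logic 0.4 µm², two chains for sign-magnitude):

```python
# Back-of-the-envelope check of the multibit area: two chains (positive and
# negative), each consisting of the binary cell plus the added NOR + inverter.

A_BIN = 1.2            # binary MAC cell area in um^2 (from Section IV)
A_EXTRA = 0.4          # NOR gate + inverter for the upper TX-gate control
a_multibit = 2 * (A_BIN + A_EXTRA)

assert abs(a_multibit - 3.2) < 1e-9           # matches the 3.2 um^2 figure
assert abs(a_multibit / A_BIN - 2.67) < 0.01  # roughly 2.67x the binary cell
```
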

B. NONLINEARITY AND VARIATIONS
In contrast to the binary case, linearity is a concern in the multibit case, adding to variations and noise. Besides nonlinearities in the transmission gates and the Miller effect, another effect influences linearity in proportion to the memristor variance. As the delay is reciprocal to the conductance of the memristor, negative deviations in conductance influence the delay to a higher degree than deviations of the same amplitude in the other direction, hence shifting the mean value. Due to the relationship in (4), higher delay values have a higher variance and are therefore influenced by this effect to a higher degree. Comparing the integral nonlinearity (INL) for σ_LRS = 1% and σ_LRS = 2% makes this obvious, as the INL increases for σ_LRS = 2% (see Fig. 10). Unwanted deviations in the delays for X = 0 can also be noted. To ensure writability, the memristor is only isolated from the input direction, leading to current flowing into the parasitic capacitance of the closed transmission gate. Thus, a spread in delay values is observed for X = 0.
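The mean shift caused by the reciprocal delay-conductance relationship is an instance of Jensen's inequality and can be illustrated with a Monte Carlo sketch (not the paper's INL extraction; the nominal conductance and spreads are arbitrary):

```python
import random
import statistics

# The delay goes as 1/G, so symmetric conductance variations shift the mean
# delay upward (Jensen's inequality), and the shift grows with the variance.
# Purely illustrative Monte Carlo with arbitrary normalized values.

random.seed(1)

def mean_delay_shift(rel_sigma: float, trials: int = 200_000) -> float:
    """Mean of 1/G minus the nominal delay 1/g_nom for Gaussian G."""
    g_nom = 1.0
    delays = [1.0 / random.gauss(g_nom, rel_sigma * g_nom) for _ in range(trials)]
    return statistics.mean(delays) - 1.0 / g_nom

shift_1pct = mean_delay_shift(0.01)
shift_2pct = mean_delay_shift(0.02)
assert 0 < shift_1pct < shift_2pct  # larger spread -> larger mean shift
```

To first order the shift scales with the variance of the conductance, which matches the observation that the INL grows from σ_LRS = 1% to σ_LRS = 2%.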

C. NETWORK ACCURACY
Different combinations of X and W have different standard deviations, leading to a complex relationship between error and chain length. The error of a single computation, e, can be estimated using (10), and the resulting error after the shift-and-add operation can then be modeled with (11). To estimate the resulting effect on network accuracy, the error model of (11) was applied to ResNet-20 on the CIFAR-10 dataset. For the quantized networks, the first and last layers were kept at 8 bit without added noise, and the TDC is assumed to be sufficiently accurate. The results for the INL and σ_cell were obtained by averaging falling and rising edge results. The achieved accuracies after training are shown in Fig. 11. For V_dd = 0.8 V, the memristor variation dominates, leading to a drop in accuracy from σ_LRS = 1% to σ_LRS = 2%. For scaled voltages, the transistor variance increases and acts as the new bottleneck, leaving only a small difference between the two variation levels. Voltage scaling below 0.7 V was not attainable, as the overall noise increases too much. A method to allow further voltage scaling could lie in increasing R_LRS, hence sacrificing throughput for improved SNR.
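The general shape of such an error model can be sketched as follows. This is only an illustration of the shift-and-add combination step under an assumed per-bit-plane Gaussian error, not a reproduction of the paper's equations (10) and (11):

```python
import random
import statistics

# Sketch of how per-bit-plane timing errors combine under shift-and-add:
# the plane-b error is weighted by 2^b, so sigma_total^2 = sum_b 4^b * sigma_b^2.
# SIGMA_PLANE and N_BITS are assumed illustrative values.

random.seed(3)
SIGMA_PLANE = 0.5      # assumed error std of one bit-plane result (in LSBs)
N_BITS = 4

def noisy_shift_add_sigma(trials: int = 50_000) -> float:
    """Empirical std of the combined error after shift-and-add."""
    errs = []
    for _ in range(trials):
        e = sum(random.gauss(0.0, SIGMA_PLANE) * (1 << b) for b in range(N_BITS))
        errs.append(e)
    return statistics.stdev(errs)

# Analytic combination: sigma * sqrt(1 + 4 + 16 + 64) for 4 bit planes.
analytic = SIGMA_PLANE * sum(4 ** b for b in range(N_BITS)) ** 0.5
assert abs(noisy_shift_add_sigma() - analytic) / analytic < 0.05
```

The 2^b weighting is why errors in the most significant bit plane dominate the combined error and, ultimately, the network accuracy.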

VI. CONCLUSION
In this work, the use of memristive devices for TDCIM is evaluated. The benefits of the cascaded approach for TD computing based on these devices offer a promising alternative to classical memory, especially with regard to area reduction. Variability and reliability aspects of memristive devices were discussed in the context of TDCIM applications. An implementation of a binary TDCIM MAC cell is presented, and a rigorous analysis of the impact of variations and reliability was performed. While the reached SNR is not yet fully competitive with pure CMOS implementations at regular supply voltages, all other design goals could be met or surpassed by the memristive implementation. For reduced supply voltages, the memristive implementation outperforms even in terms of SNR. We expect that improvements in manufacturing quality will soon close this gap, enabling highly competitive memristive TD implementations. The limits on the tolerated variations to achieve this goal were derived for the binary case.
In addition to the binary case, the TDCIM MAC cell was extended to support multibit operation. The proposed cell introduces minimal overhead in size using shift-and-add operations and offers comparable reliability. An error model is presented to obtain network accuracy estimates for designs subject to nonlinearities and variations and is used to evaluate network performance. For σ_LRS = 1%, these nonidealities could be mitigated in training, almost reaching the classification accuracy of the purely quantized network. While the presented design shows less headroom for voltage scaling, it is well-suited for increasing throughput and reducing area.