AM4: MRAM Crossbar Based CAM/TCAM/ACAM/AP for In-Memory Computing

In-memory computing seeks to minimize data movement and alleviate the memory wall by computing in-situ, in the same place that the data is located. One of the key emerging technologies that promises to enable such computing-in-memory is the spin-transfer torque magnetic tunnel junction (STT-MTJ). This paper proposes AM4, a combined STT-MTJ-based Content Addressable Memory (CAM), Ternary CAM (TCAM), approximate matching (similarity search) CAM (ACAM), and in-memory Associative Processor (AP) design, inspired by the recently announced Samsung MRAM crossbar. We demonstrate and evaluate the performance and energy-efficiency of the AM4-based AP using a variety of data-intensive workloads. We show that an AM4-based AP outperforms state-of-the-art solutions both in performance (with an average speedup of about 10×) and energy-efficiency (by about 60× on average).


I. INTRODUCTION
Computing is increasingly dominated by the transfer of large data volumes through bandwidth-limited interfaces to the locations where computations are performed, which hampers the performance and energy-efficiency of conventional computer architectures [1]. This is leading to a change in computing paradigms: instead of moving the data to the computation, the computation is moved closer to the data. The straightforward approach, known as "near-memory computing", places processing units close to memory arrays [2], [3], [4], [5]. An alternative way to overcome the so-called "memory wall" is to compute in-data, i.e., directly within the memory arrays. This paradigm, which we refer to throughout this work as "in-memory computing", employs the same memory cells for both data storage and data processing. This work is inspired by the 2-transistor 2-resistor (2T2R) magnetoresistive (MRAM) crossbar design, recently unveiled by Samsung [6]. We keep the original topology of Samsung's MRAM crossbar to develop a massively parallel general-purpose in-memory computer architecture. We accomplish this goal by: 1) converting the crossbar into an associative memory (AM), and 2) transforming the associative memory into a massively parallel in-memory associative computer.
Specifically, we propose AM4, a novel associative in-memory computing architecture, investigate its design trade-offs, explore its design space, and evaluate its performance and energy-efficiency. Additional use cases for AM4 include Content Addressable Memory (CAM), Ternary CAM (TCAM), and approximate match (similarity search) associative memory (ACAM).
To our knowledge, AM4 is the first magnetoresistive NAND CAM based on an MRAM crossbar. In a conventional NOR-type CAM [7], [8], [9], [10], the matchline discharges on a mismatch. In a typical CAM and TCAM application, only one memory row matches, while the rest mismatch. Since mismatches are much more frequent than matches, all matchlines need to be precharged prior to every search/lookup, resulting in very significant energy consumption. In contrast, in a NAND-type CAM, only the matching row(s) discharge, reducing the energy consumption of search/lookup by orders of magnitude.
Our research methodology spans from high-level software simulation through to accurate transistor-level circuit simulation. Our circuit design utilizes a 28 nm FDSOI technology node along with a Verilog-A based compact model for the double-barrier magnetic-tunnel junction (DMTJ) device [11]. Evaluations under exhaustive Monte Carlo simulations show that AM4 offers a compare time of about 1.4 ns and consumes 1.73 fJ of energy. These results are achieved with a bit cell footprint of just 0.138 µm². The Smith-Waterman optimal sequence alignment algorithm [12] is used to evaluate AM4 and compare it to state-of-the-art computing in-memory and conventional solutions. Results show that AM4 can provide orders-of-magnitude performance and energy-efficiency improvement versus the traditional von Neumann architecture.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The contributions of this work are summarized as follows:
• AM4 is the first NAND-type CAM based on a random-access magnetoresistive crossbar.
• AM4 is the first solution that converts an MRAM crossbar into a massively parallel, general-purpose digital in-memory computer.
• The proposed AM4 architecture enables multiple applications including CAM, TCAM, approximate CAM, and in-memory associative processor within the same basic design.
• We thoroughly evaluate AM4 under both software and circuit simulation, and conduct a rigorous design space exploration, including susceptibility to process variations.
The rest of this work is organized as follows: Section II overviews the background of this work; Section III introduces the proposed MRAM crossbar based associative memory/processor design; Section IV presents the functional verification of AM4; Section V presents the simulation results, while Section VI and Section VII show application results and related work, respectively. Finally, Section VIII summarizes the main conclusions of this work.

A. Samsung's MRAM Crossbar Array
Samsung recently unveiled an MRAM crossbar design for in-memory computing [6], which was used to implement a binary neural network. The crossbar array and a schematic representation of a 2T2R MRAM cell are illustrated in Fig. 1(a) and (b). Each crossbar cell comprises two magnetic-tunnel junction (MTJ) devices, which store the data bit and its complement, and two selector transistors (INx and INy), which are driven by the signal IN_j and its complement, both shared by all cells in row j. The bottom node of each cell is connected to the top node of the cell immediately below to enable writing to the crossbar. To complete the write connectivity, an additional switch (WEN_i) connects two vertically adjacent cells to the vertically routed W_DATA signals, as shown in Fig. 1(c). These connections are interleaved, such that even rows are connected to W_DATA[0] and odd rows are connected to W_DATA[1]. To write a value into the 2T2R bitcell, the WEN switches that are adjacent to the target cell (above and below) are enabled, and a two-cycle operation is applied. During the first cycle, the INx switch is turned on and the W_DATA signals are driven to write the parallel or anti-parallel state into the left MTJ, denoted as MTJ1 in Fig. 1(c). The same procedure is applied during the second cycle, in which the INy switch is enabled to write the complementary state into the right MTJ (MTJ2).

B. Associative Processor
An associative processor (AP) is a non-von Neumann computer [13]. A 9T static CMOS CAM-based AP is illustrated in Fig. 2. The main component of an AP is an associative memory array (CAM), which allows (1) comparing the entire dataset to a search data pattern (refer to COMPARE KEY in Fig. 2), (2) tagging the matching rows (TAG circuitry is shown in Fig. 2(b)), and (3) writing another data pattern (WRITE KEY in Fig. 2) to all tagged rows. An AP performs no computations in a conventional sense, as no dedicated arithmetic logic units (ALUs) are provided [14]. Instead, arithmetic operations are broken down into a series of Boolean logic equations, which are evaluated by the AP in-memory, in a perfect induction-like fashion, as follows. The dataset is stored in the associative memory, in the input field, typically one data element per CAM row (comprising a virtual Processing Unit, PU, as shown in Fig. 2(a)). The AP matches all possible input combinations of a Boolean function against the input field (for the entire dataset in parallel). During each iteration, the CAM rows containing the matching data elements are tagged, and the corresponding function values (precalculated and embedded in the AP microcode), are written into the designated output fields of the tagged rows. During compare and parallel write cycles, the input and output fields are selected by the MASK register of Fig. 2.
For an m-bit argument x (x ∈ dataset), any Boolean function b(x) has at most 2^m input combinations. Therefore, a perfect induction-like evaluation of any Boolean function with an m-bit argument would incur up to O(2^m) cycles on an AP, regardless of the dataset size. This can lead to significant performance gains when applied to large datasets.
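The perfect-induction evaluation described above can be illustrated with a small behavioral model. The following Python sketch is not the AM4 hardware or its microcode: the list-of-bits row layout and the helper names (`ap_compare`, `ap_write`, `ap_evaluate`) are assumptions made here for illustration, and the output column is assumed to be disjoint from the input columns.

```python
from itertools import product

def ap_compare(rows, cols, pattern):
    """Return one tag per row: True where the bits in `cols` equal `pattern`."""
    return [all(row[c] == p for c, p in zip(cols, pattern)) for row in rows]

def ap_write(rows, tags, cols, pattern):
    """Write `pattern` into `cols` of every tagged row (conceptually in parallel)."""
    for row, tagged in zip(rows, tags):
        if tagged:
            for c, p in zip(cols, pattern):
                row[c] = p

def ap_evaluate(rows, in_cols, out_col, fn):
    """Evaluate `fn` on every row by enumerating its truth table.

    Returns the number of compare/write passes, which depends only on
    len(in_cols) and not on len(rows).
    """
    passes = 0
    for pattern in product((0, 1), repeat=len(in_cols)):
        tags = ap_compare(rows, in_cols, pattern)
        ap_write(rows, tags, [out_col], [fn(*pattern)])
        passes += 1
    return passes

# Example: 3-input majority; columns 0-2 are inputs, column 3 is the output.
majority = lambda a, b, c: int(a + b + c >= 2)
small = [[a, b, c, 0] for a, b, c in product((0, 1), repeat=3)]
large = [list(r) for r in small for _ in range(1000)]  # 8000 independent rows
passes_small = ap_evaluate(small, [0, 1, 2], 3, majority)
passes_large = ap_evaluate(large, [0, 1, 2], 3, majority)
```

In this model both datasets need the same 2^3 = 8 passes; only the (simulated) per-pass parallelism grows with the number of rows.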
The main associative instructions (primitives) are:
1) Compare: Compares the query (key) x_i (1 ≤ i ≤ m) to the field y_i in all rows of the AP array, in parallel. The rows where x_i equals y_i are tagged;
2) Write (y_1 ← x_1, y_2 ← x_2, ..., y_k ← x_k): Writes the value x_i (1 ≤ i ≤ k) into the position y_i in all tagged AP rows in parallel.
Arithmetic operations can be performed on an AP in a word-parallel, bit-serial manner, reducing compute time from O(2^m) to O(m). For instance, vector addition may be performed as shown in Fig. 3: Let AP bit-columns 0-3 and 4-7 hold four-bit vectors A and B, respectively. Columns 8-11 are reserved for the lower 4 bits of the sum vector S, while bit column 12 is used for storing and updating the carry bit c (after the addition is complete, it holds the MSB of the vector S). The addition is carried out in four single-bit iterations (refer to Algorithm 1), in parallel for all vector elements in the AP, where (2) and (3) are performed in parallel, i is the bit index, and c and s are, respectively, the carry and sum bits. A single-bit addition iteration is carried out in eight steps, where in each step, one entry of the truth table (a three-bit input pattern, A, B, C_IN in Fig. 3(a)) is compared against the contents of the a_i, b_i, and c bit columns and the matching rows are tagged; the logic result (two-bit output (C_OUT, S) of the truth table as listed in Fig. 3(a)) is written into the s_i and c bits of all tagged rows.
A snapshot of the second step (processing the second entry of the truth table) of the first iteration (processing LSBs of all vector elements in parallel) is presented in Fig. 3(b)-(c). Fig. 3(a) shows the truth table with the second entry delineated. Fig. 3(b) and (c) show compare and write operations, respectively. During compare (Fig. 3(b)), the input pattern '001' is compared against bit columns a_0, b_0 and c for all vector elements in parallel. The matching rows (two in this example) are tagged. During write (Fig. 3(c)), the output pattern '01' (C_OUT, S) is written into bit columns c and s_0, respectively. Only the tagged rows are written. Each compare and write affects the entire dataset (vectors A, B and S).
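The bit-serial addition described above can be modeled in a few lines. The Python sketch below follows the column layout given in the text (A in columns 0-3, B in columns 4-7, S in columns 8-11, carry in column 12), assuming that the lowest column of each field is the LSB. Because the carry column is both read and overwritten within an iteration, this sketch orders the truth-table entries so that a row rewritten by one step cannot match a later step; that ordering is an assumption of the sketch, not necessarily the order used in Fig. 3.

```python
M = 4              # word length in bits
CARRY = 3 * M      # column 12 holds the carry bit c

# (a, b, c_in) -> (c_out, s), ordered to avoid re-matching rewritten rows
TRUTH_TABLE = [
    ((0, 0, 0), (0, 0)),
    ((1, 1, 1), (1, 1)),
    ((0, 1, 0), (0, 1)),
    ((0, 1, 1), (1, 0)),
    ((1, 0, 0), (0, 1)),
    ((1, 0, 1), (1, 0)),
    ((0, 0, 1), (0, 1)),   # carry 1 -> 0; entry 000 was already processed
    ((1, 1, 0), (1, 0)),   # carry 0 -> 1; entry 111 was already processed
]

def encode(a, b):
    """Place a and b into one AP row; sum and carry columns start at 0."""
    row = [0] * (3 * M + 1)
    for i in range(M):
        row[i] = (a >> i) & 1
        row[M + i] = (b >> i) & 1
    return row

def decode_sum(row):
    """Read S from columns 8-11 and its MSB from the carry column."""
    value = sum(row[2 * M + i] << i for i in range(M))
    return value | (row[CARRY] << M)

def ap_add(rows):
    """Add A and B in every row; returns the number of compare/write steps."""
    steps = 0
    for i in range(M):                       # one iteration per bit position
        in_cols = (i, M + i, CARRY)
        out_cols = (CARRY, 2 * M + i)
        for pattern, result in TRUTH_TABLE:  # eight steps per iteration
            tags = [all(row[c] == p for c, p in zip(in_cols, pattern))
                    for row in rows]         # compare: tag matching rows
            for row, tagged in zip(rows, tags):
                if tagged:                   # write: only tagged rows change
                    for c, v in zip(out_cols, result):
                        row[c] = v
            steps += 1
    return steps

pairs = [(a, b) for a in range(16) for b in range(16)]
rows = [encode(a, b) for a, b in pairs]
steps = ap_add(rows)
sums = [decode_sum(r) for r in rows]
```

The step count (4 iterations of 8 steps) is fixed by the word length and the truth-table size, independent of how many rows are processed.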
In a straightforward implementation of a Boolean function evaluation, every compare is typically followed by a write operation. This could adversely affect the overall performance and energy-efficiency, since a write may be more time-and energy-consuming than a compare operation. Additionally, this reduces the memory lifetime of write endurance-limited memories. We mitigate the parallel write overhead by using the observation that any Boolean function has only two values ('0' and '1'). Therefore, regardless of the truth table size, we can aggregate all compares that are followed by write '0' and write '1' into separate groups, and thus perform a single write per multiple compares [15].
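The write-aggregation idea can be sketched as follows. This is a minimal behavioral model under stated assumptions: the function name `evaluate_grouped` is hypothetical, rows are plain Python lists of bits, the function has a single output bit, and the output column is disjoint from the input columns so that writes cannot disturb later compares.

```python
from itertools import product

def evaluate_grouped(rows, in_cols, out_col, fn):
    """Evaluate a single-output Boolean `fn` with one write per output value.

    Returns (number of compares, number of writes).
    """
    compares = writes = 0
    for value in (0, 1):
        tags = [False] * len(rows)
        for pattern in product((0, 1), repeat=len(in_cols)):
            if fn(*pattern) != value:
                continue
            # Accumulate (OR) the tags of every compare in this group.
            for k, row in enumerate(rows):
                if all(row[c] == p for c, p in zip(in_cols, pattern)):
                    tags[k] = True
            compares += 1
        # A single parallel write serves the whole group.
        for row, tagged in zip(rows, tags):
            if tagged:
                row[out_col] = value
        writes += 1
    return compares, writes

# Example: the sum bit of a full adder (3-input XOR) written to column 3.
xor3 = lambda a, b, c: a ^ b ^ c
rows = [[a, b, c, 0] for a, b, c in product((0, 1), repeat=3)]
compares, writes = evaluate_grouped(rows, [0, 1, 2], 3, xor3)
```

For this three-input function the model issues eight compares but only two writes, instead of the eight writes of the compare-then-write scheme.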
To summarize the complexity of performing typical ALU operations in an AP, fixed-point addition and subtraction take O(m) cycles, whereas fixed-point multiplication and division require O(m^2) cycles, where m is the wordlength. Applying a single-precision floating point multiplication on an entire dataset takes 4,400 cycles, regardless of the dataset size [16].
The scaling of CMOS associative processors [16], [17] is limited due to the CMOS CAM density. However, an MRAM AP cell is at least an order-of-magnitude smaller, thus paving the way for associative in-memory computing at scale.

C. Application Space
An AP is a general-purpose computer, whose efficiency strongly depends on the workloads and datasets. As is typical for a single instruction, multiple data (SIMD) architecture, the efficiency of an AP is limited in control-flow workloads, but grows quickly for regular iterative workloads with fine-grained parallelism. Since in many algorithms, AP execution time does not depend on the dataset size (as in the above example of vector addition), the efficiency of an AP typically improves with the dataset size. Since an AP is an in-memory computer, it is more efficient when running data-intensive applications. Due to the fact that an AP typically implements bit-serial (but word-parallel) arithmetic, it intrinsically supports flexible and user-configurable data wordlengths and formats, including fixed and floating point with flexible mantissa and exponent sizes.
Fig. 5. k-means runtime (lower is better) and energy-efficiency results [18] versus different reference solutions: Intel i7-3770, Xilinx ZC706, Altera Stratix V, NVIDIA K20M, and a 10 node GPU cluster.
Data in associative memory is accessed by its contents rather than its address. Data elements of the same dataset are normally identified by a unique index, or a member ID. Unlike random access memories (SRAM or DRAM), APs do not require dense and structured data allocation to operate efficiently. Individual data elements do not have to be placed in any specific order but can rather be scattered across random locations (rows) within the CAM arrays. Since modern datasets become increasingly sparse, the ability of computers to properly process sparse data (for example, not wasting time and energy on fetching and multiplying zero-data elements) becomes a critical requirement. An AP holds a significant intrinsic advantage in sparse data processing: sparse data in one of the compressed formats (such as compressed sparse row or compressed sparse column) can be processed almost as efficiently as dense data [17].
A variety of applications have been suggested for APs, including sparse and dense matrix multiplication [17], graph processing [5], deep learning [14], financial and scientific [13], genomics [15] and others [5]. APs exhibit a performance improvement of up to 300×, and an energy-efficiency gain of up to 150×, compared to CPU, GPU, FPGA, and ASIC implementations. Some selected results are presented in Fig. 4 for different sparse square matrices [5]. Fig. 4(a) and Fig. 4(b) present the performance and power results, respectively, as compared to two reference baselines, CPU and GPU. Fig. 5 presents the k-means runtime (lower is better) and energy-efficiency results versus a variety of CPU [19], FPGA [20], [21] and GPU [22], [23] implementations. The associative processor solution outperforms state-of-the-art alternatives in terms of both performance and energy-efficiency (in all but one case). Another application is shown in Fig. 6, comparing the Convolutional Neural Network AP implementation with state-of-the-art solutions (CPU and a dedicated in-memory accelerator NeuralCache [24]). Again, the benefit of using an AP for such an application is dramatic.
Perpendicular MTJs have been widely adopted since 2016, paving their way into stand-alone and embedded memory designs. While the standard single-barrier MTJ is the most mature option, it suffers from high writing currents. Using a double-barrier MTJ (DMTJ) with two reference layers is an appealing alternative [11].
The structure of a DMTJ device is shown in Fig. 7. The free layer (FL) is sandwiched between two tunnel oxide barriers (t_OX,T and t_OX,B) with two polarizing reference layers (RL_T and RL_B). Depending on the relative magnetization orientations between the FL and the RLs, the DMTJ features two stable states: the parallel or low-resistance state (LRS), and the antiparallel or high-resistance state (HRS). Thanks to the inherently stochastic nature of the spin-transfer torque (STT) switching, the DMTJ can switch between the stable states when a current pulse that is above the critical switching current (I_c0) is sent between the top and bottom terminals of the device.
In order to ensure low-energy operation, the AM4 cell design, introduced in the next subsection, utilizes DMTJ devices as storage elements. We consider an advanced (20 nm diameter) DMTJ structure, whose main bottleneck is the reduced thermal stability factor (Δ), i.e., shorter data retention time [25]. To mitigate this, we target a double FL structure [26]. Therefore, by using a DMTJ with two reference layers and a double free layer, we ensure low-energy operation, while maintaining sufficient data retention times.
To simulate DMTJ devices with double free layers, we extended the DMTJ Verilog-A compact model reported in [11]. The model was calibrated with the experimental data reported in [25], [27] by considering the DMTJ parameters shown in Table I.

B. AM4 Cell
The fundamental idea of AM4 is transforming the Samsung MRAM crossbar into an associative memory, which can then be operated as CAM, TCAM, ACAM or in-memory AP. The proposed AM4, therefore, is based on a silicon-proven technology, where the primary novelty is the way that it is operated and utilized. This approach is inspired by CMOS-based CAM built upon a modified 6T CMOS SRAM [28].
The concept of AM4 is to virtually rotate the 2T2R MRAM bitcell by 90° and repurpose the control signals, as illustrated in Fig. 8.

C. Working Principle
Fig. 8. Rotating and relabeling the 2T2R MRAM cell to transform it into an AM4 cell. Note that this conceptual rotation and signal repurposing does not require any circuit design or process changes to the silicon-proven crossbar, fabricated by Samsung [6].
The first of two primary operations of AM4 is the compare operation. Its implementation in the 2T2R MRAM crossbar-based AM4 is presented in Fig. 9. Compare is achieved by first precharging the ML and then driving the search pattern onto the SLs (SLright is the complement of SLleft), connecting the leftmost bitcell in a row to ground, and disabling both the WEN_i and W_DATA signals (refer to Fig. 1(c)). If SLleft = 1 (SLright = 0) and the cell is in the '0' state (LRS, HRS), or if SLleft = 0 (SLright = 1) and the cell is in the '1' state (HRS, LRS), the output resistance of the cell is low, representing a match between the query bit of that column and the bit stored in the bitcell. If all cells in a row match, a low resistance is displayed by the row, enabling a fast discharge of the ML to ground. A mismatch occurs when at least one bit of the query pattern does not match the bit stored in the corresponding AM4 cell. In such a case, the conductance path through the row goes through HRS DMTJ(s), preventing or slowing down the ML discharge. By differentiating between these two options, a per-row compare operation can be achieved.
The other main primitive of AM4 is a parallel write operation. It is achieved using the W_DATA[0] and W_DATA[1] signals, which we label TAG and its complement, respectively, since they are driven by the AM4 tags (refer to Fig. 2(b)). As presented previously, writing into the MRAM cell requires two cycles to separately bias each DMTJ for applying the required magnetization orientation. However, due to the sharing of the WEN switches between adjacent columns (which creates a potential sneak path along the bitcell row), a parallel write operation requires four single-cycle phases, as illustrated in Fig. 10 (phases 1 to 4). During the entire write operation, all WEN switches are turned on, and the selection of individual MTJs is done using the SLleft and SLright signals. In phases 1 and 2, the even columns are written. In phases 3 and 4, the odd columns are written. Multiple rows can be written at the same time.
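To give a feel for why the NAND-style discharge path distinguishes a match from a mismatch, the following Python sketch adds up the resistance of the selected DMTJs along one row. It is a deliberately crude model: the resistance values `R_LRS` and `R_HRS` are hypothetical placeholders rather than the calibrated device parameters, and the selector transistors and matchline capacitance are ignored.

```python
R_LRS = 5e3    # hypothetical low-resistance state, in ohms
R_HRS = 10e3   # hypothetical high-resistance state, in ohms

def selected_resistance(stored_bit, query_bit):
    """Resistance presented by one cell on the discharge path.

    A matching query bit selects the LRS device of the complementary pair,
    a mismatching query bit selects the HRS device.
    """
    return R_LRS if stored_bit == query_bit else R_HRS

def row_resistance(stored, query):
    """Series resistance of all cells in a row."""
    return sum(selected_resistance(s, q) for s, q in zip(stored, query))

stored = [1, 0, 1, 1, 0, 0, 1, 0]
match = row_resistance(stored, stored)                # all cells select LRS
one_off = row_resistance(stored, [0] + stored[1:])    # single-bit mismatch
```

With these placeholder values a single-bit mismatch raises the row resistance by only 12.5% in an 8-bit row, and the relative change shrinks as the row gets wider, which is one way to see why a limited HRS/LRS ratio makes single-bit mismatches hard to sense.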
During phases 1 and 3, '1' is written to the top DMTJ for bitcells where '1' is supposed to be stored and to the bottom DMTJ for bitcells where '0' is supposed to be stored. During phases 2 and 4, '0' is written to the top DMTJ for bitcells where '0' is supposed to be stored and to the bottom DMTJ for bitcells where '1' is supposed to be stored.
During phases 1 and 2, SLright is the complement of SLleft in all even columns. The SLleft and SLright signals of the odd columns are both driven to 0, thereby isolating these columns and blocking a potential sneak path through the bitcell rows.
In phase 1, the TAG line and its complement are driven to V_write and GND, respectively. The SLleft bit of the even columns, where '1' is supposed to be stored, is asserted to '1' to enable writing '1' to the top DMTJ. The SLleft bit of the even columns, where '0' is supposed to be stored, is asserted to '0' (i.e., SLright is asserted '1') to enable writing '1' to the bottom DMTJ. In phase 2, the TAG line and its complement are swapped. The SLleft bit of the even columns, where '0' is supposed to be stored, is asserted to '1' to enable writing '0' to the top DMTJ. The SLleft bit of the even columns, where '1' is supposed to be stored, is asserted to '0' (i.e., SLright is asserted '1') to enable writing '0' to the bottom DMTJ.
After finishing the write operation to the even columns, the same procedure is followed in phases 3 and 4 to write to the odd columns (refer to Fig. 10).
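The four-phase schedule can be summarized with a short behavioral sketch. The code below only encodes the signal assignments stated in the text for one tagged row; the function names, the dictionary layout, and the use of `None` to mark a masked column (SLleft = SLright = 0) are assumptions introduced here.

```python
V_WRITE, GND = "V_write", "GND"

def write_phases(bits):
    """Return the TAG and SL settings for write phases 1 to 4.

    `bits[col]` is the bit to store in that column (0 or 1), or None if the
    column is masked off.
    """
    phases = []
    for phase in (1, 2, 3, 4):
        parity = 0 if phase <= 2 else 1      # even columns first, then odd
        writes_one = phase in (1, 3)         # phases 1/3 write '1', 2/4 write '0'
        tag, tag_bar = (V_WRITE, GND) if writes_one else (GND, V_WRITE)
        sl = []
        for col, bit in enumerate(bits):
            if bit is None or col % 2 != parity:
                sl.append((0, 0))            # masked or isolated column
            else:
                # SLleft selects the top DMTJ, SLright the bottom DMTJ.
                top_selected = (bit == 1) == writes_one
                sl.append((1, 0) if top_selected else (0, 1))
        phases.append({"TAG": tag, "TAG_bar": tag_bar, "SL": sl})
    return phases

def apply_phases(phases, n_cols):
    """Replay the phases on one tagged row; returns [top, bottom] per cell."""
    cells = [[None, None] for _ in range(n_cols)]
    for p in phases:
        value = 1 if p["TAG"] == V_WRITE else 0
        for col, (sl_left, sl_right) in enumerate(p["SL"]):
            if sl_left:
                cells[col][0] = value
            if sl_right:
                cells[col][1] = value
    return cells

bits = [1, 0, 0, 1, None, 1]
phases = write_phases(bits)
cells = apply_phases(phases, len(bits))
```

After the four phases every unmasked cell holds the bit in its top DMTJ and the complement in its bottom DMTJ, while the masked column is left untouched.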
Using the compare and parallel write operations, described above, the Samsung MRAM crossbar-inspired topology can be used "as is", i.e., with no alterations, to enable CAM, TCAM, ACAM and AP functionalities.

D. AM4-Based Associative Processor
The AM4-based associative processor is presented in Fig. 11. Its core is the 2T2R AM4 crossbar. The columns are supplemented with write/compare key and mask registers above the array. Masking-off (i.e., rendering certain bit columns unaffected by either compare or write) is achieved by setting SLleft = SLright = '1' for compare and SLleft = SLright = '0' for write. The Matchline Sensing (MLS) column is built with row-connected sense amplifiers, which drive the TAGs, as follows:
1) Sense Amplifier: The matchline sensing (MLS) topology is based on the single-ended self-reference sense amplifier proposed in [29]. The MLS comprises two transistors and two inverters as shown in Fig. 12(a). The precharge transistor (MPC) enables the ML precharge (PC = 0) and ML evaluation stages (PC = 1). The MX transistor, along with inverter I1, serves to limit the ML precharge to the voltage level of the tipping point of I1. Inverter I2 evaluates the final response of the ML.
Different from the standard operation of the MLS [29], the SL patterns are driven into the AM4 cell during the evaluation stage. The sensing stage starts by driving the PC signal to GND (PC = 0), precharging the ML. The V_X line is charged up to V_DD, and ML_out is discharged to GND. Similarly, the ML is charged up to a voltage level that depends on the I1 threshold, whose value, when exceeded, drives the gate of MX. This turns off MX, thereby halting the ML precharge. Subsequently, the SL signals are driven to the AM4 cells, and PC is asserted ('1') to start the evaluation stage. The ML starts to discharge quickly or slowly depending on the output resistance of the AM4 cells. A match is detected when all cells are in LRS (ML_LRS), presenting the lowest resistance path in the NAND-based AP word, and therefore, raising ML_out to V_DD. Conversely, a mismatch occurs when an HRS state is present, slowing down the ML discharge. During the mismatch sensing (ML_HRS), the evaluation time-frame is short enough to avoid compare errors, maintaining the ML_out signal close to '0'. This is shown in Fig. 12(b).
The worst case ML sensing scenario differentiates between a match (all cells are in LRS) and a single-bit mismatch (all cells but one are in LRS). There is an overlap between the match and a single-bit mismatch, mainly due to the poor HRS/LRS ratio of MRAM. We mitigate this through redundant data coding, specifically by ensuring a certain minimum Hamming Distance (HD) between any two data words. Since contemporary memories typically use Error Correction Codes (ECC) [30], there is a "built-in" minimum Hamming distance which we utilize for the purpose of match vs. single-bit mismatch differentiation. Based on this insight, ECC-protected AM4 provides a safe margin between match and mismatch cases for a limited HRS/LRS ratio memory. This method is more relevant for exact searching. On the contrary, using ECC might be inapplicable in approximate search scenarios, which fortunately are likely to be less sensitive to inexact matching results.
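Whether a set of stored words satisfies a required minimum Hamming distance can be checked directly. The Python sketch below is a generic check, not part of the AM4 design flow; the function names and the example codewords are made up for illustration.

```python
from itertools import combinations

def hamming(a, b):
    """Number of bit positions in which integers a and b differ."""
    return bin(a ^ b).count("1")

def min_hamming_distance(words):
    """Smallest pairwise Hamming distance within a set of data words."""
    return min(hamming(a, b) for a, b in combinations(words, 2))

def safe_for_compare(words, required_hd):
    """True if any two distinct words differ in at least `required_hd` bits."""
    return min_hamming_distance(words) >= required_hd

# Hypothetical 7-bit words, chosen only to exercise the check.
words = [0b0000000, 0b1110000, 0b0001111, 0b1111111]
d_min = min_hamming_distance(words)
```

If every stored word is a codeword with pairwise distance of at least h, a query that equals one stored word mismatches every other stored word in at least h bits, so the sensing scheme never has to resolve a single-bit mismatch.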
2) TAG Circuitry: The bottom part of Fig. 11 details the circuitry of the TAG register cells. Its main component is a flip-flop (FF) that holds the compare result according to the control logic signals, i.e., compare and write.
To compare the query (key) data word against the data stored in the AM4 array (the entire row, a number of bits or a single bit), the ML is precharged, and the key is driven onto the SLs. In order to mask a column (i.e., ignore it during compare), the SLleft and SLright lines are set to '1'. If every unmasked bit in a row matches the corresponding query bit, the ML is discharged and a '1' is accumulated in the TAG FF. If even a single unmasked bit mismatches the corresponding query bit, the ML remains high and the TAG FF remains unchanged.
As detailed in Section II-B, a compare (or several compare) operation(s) in an AP are typically followed by a parallel write into the unmasked bits of all tagged AP rows. To write data from the write key register into AM4, each TAG FF (set earlier by compare operation(s)) is connected to its corresponding WEN. We accumulate the compare results in each AP row to reduce the number of writes, which improves the performance of the AP. If the result of aggregated compare operations is '1', the write key data set on SL lines is written into the AP row in accordance with the MASK pattern. Otherwise, the write does not affect the row.

E. Content Addressable Memory (CAM) and Ternary CAM (TCAM)
In addition to its functionality as an AP, AM4 can be operated in CAM/TCAM mode. Operation in CAM mode is straightforward, according to the search and write operations described previously. To support TCAM mode, either the search (query) pattern bits or the bits of the data patterns stored in the MRAM crossbar can be "don't care" in addition to conventional '1' and '0' values. To store a "don't care", a '0' is written to both top and bottom MTJs of an AM4 cell, presenting an LRS through both top and bottom discharge paths to the ML. During a compare, the "don't care"-written cell will not affect the result of the operation (match or mismatch). To create a "don't care" search pattern bit, we simply mask it off, as presented above.
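The ternary behavior can be captured in a purely logical model of the cell. The Python sketch below encodes the stored states and search-line settings given in the text, assuming that the (LRS, HRS) pairs are ordered (top, bottom) and that SLleft selects the top DMTJ; the dictionary and function names are introduced here for illustration, and the circuit-level sensing is not modeled.

```python
LRS, HRS = "LRS", "HRS"

# (top, bottom) DMTJ states for each stored symbol
STORED = {0: (LRS, HRS), 1: (HRS, LRS), "X": (LRS, LRS)}
# (SLleft, SLright) for each query symbol; "X" masks the column
QUERY = {0: (1, 0), 1: (0, 1), "X": (1, 1)}

def cell_matches(stored, query):
    """True if at least one selected DMTJ offers an LRS discharge path."""
    top, bottom = STORED[stored]
    sl_left, sl_right = QUERY[query]
    return bool((sl_left and top == LRS) or (sl_right and bottom == LRS))

def tcam_search(table, query):
    """Indices of rows in which every cell presents a low-resistance path."""
    return [i for i, row in enumerate(table)
            if all(cell_matches(s, q) for s, q in zip(row, query))]

table = [
    [1, 0, "X", 1],   # stored don't care in the third column
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
exact_hits = tcam_search(table, [1, 0, 1, 1])
masked_hits = tcam_search(table, [1, "X", 0, 1])
```

In this example the stored "X" lets row 0 match the first query, and masking the second query column lets rows 0 and 1 both match the second query.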

F. Approximate Search CAM (ACAM)
Multiple applications, including text processing (e.g., text retrieval), signal processing, computational biology [31], [32], [33], and genome analysis [9], [15], require approximate rather than exact search, for example to tolerate errors, or find similarities among erroneous or ambiguous data patterns. In approximate search, if the difference between a stored pattern and the query pattern is below a certain predefined threshold, the compare result should still be considered a "match". AM4 can support approximate search by adjusting the MLS sampling time, using the speed of the matchline discharge as a measure of Hamming distance.
To make the operation mode robust, an ML can be amended by an NMOS discharge transistor with a configurable gate voltage. In such a configuration, the approximate search utilizes the matchline charge redistribution rather than its rise or fall time. By tuning this gate voltage (possibly automatically), we can set a desired level of Hamming distance without adjusting the ML sampling time [34], [35].
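Functionally, approximate search amounts to thresholding the Hamming distance between the query and each stored pattern. The Python sketch below models only that functional behavior; the function names are hypothetical, a distance less than or equal to the threshold is assumed to count as a match, and the analog mechanisms described above (sampling time or discharge-transistor gate voltage) are not modeled.

```python
def hamming_distance(a, b):
    """Number of positions in which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))

def approximate_search(rows, query, threshold):
    """Indices of stored patterns within `threshold` mismatches of the query."""
    return [i for i, row in enumerate(rows)
            if hamming_distance(row, query) <= threshold]

rows = ["10110010", "10110011", "01001101"]
query = "10110010"
exact = approximate_search(rows, query, 0)
near = approximate_search(rows, query, 1)
```

A threshold of 0 degenerates to exact CAM search, while larger thresholds admit progressively more dissimilar rows.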

IV. COMPARE AND WRITE FUNCTIONAL VERIFICATION
To validate the write and compare operations, we considered a 32 × 32 array, operating with a supply voltage (V_DD) of 1 V at nominal conditions. Fig. 13 shows the simulation waveforms of write '0', '1', and 'X' operations. As an initial condition, we set all DMTJ devices to HRS. The write key (pattern) is (000XXFFF)_Hex, i.e., top and bottom MTJs are written to be (00000FFF)_Hex and (FFF00000)_Hex, respectively.

A. Write Operation
WEN signals are enabled during four phases of the write operation (refer to Section III-C). An extra phase is added at the end of phases 2 and 4 to write the "don't care" value (as required in TCAM). The write key bus drives the write

B. Compare Operation
Fig. 14 shows the AM4 compare operation involving all possible stored values, i.e., '0', '1', and 'X'. We consider a compare time of about 2 ns (see Section V). In the precharge stage, the ML is precharged to about half V_DD (see Section III-D.1), and ML_out is discharged to '0'. Then, during the evaluation stage, SL signals are assigned. When comparing with '0', '1', and masking out the entire AM4 row (which is equivalent to comparing with an 'X'), the SLleft/SLright pattern is (FFFFFFFF)_Hex/(00000000)_Hex, (00000000)_Hex/(FFFFFFFF)_Hex, and (FFFFFFFF)_Hex/(FFFFFFFF)_Hex, respectively. In the match (mismatch) case, the LRS (HRS) path discharges the ML (maintains the ML at approximately half V_DD), and ML_out outputs a '1' ('0'). If a certain AM4 cell stores an 'X', or a certain bit column is masked out (by setting SLleft = SLright = '1'), such a cell keeps both top and bottom paths at LRS (open) and hence does not affect the outcome of compare.

V. CIRCUIT-LEVEL RESULTS
Results provided in this section rely on extensive Monte Carlo simulations (1000 samples) at the 3σ corner probability distribution from the statistical models given by the 28 nm FDSOI commercial PDK. For the DMTJ devices, the effect of process variability follows Gaussian-distributed variations with a variability (σ/µ) of 5% for the DMTJ cross-section areas, and 1% for t OX,T , t OX,B , and t FL [36]. Fig. 15 shows the compare timing for a 32-bit AM 4 row operating at a supply voltage of 1 V at the Typical-Typical corner. In particular, ML out responses to match and several mismatch cases are shown. ML out is simulated over the evaluation time-frame t comp ranging from 1 ns to 3 ns. The wider the evaluation time-frame t comp , the longer the ML has to discharge through the NAND-style resistive path, eventually driving ML out to '1' (which results in a compare error). For example, if the compare evaluation stage extends to 3 ns, the 4-bit mismatch will register as a "match" rather than a "mismatch" (thus creating a compare error). This happens because of the poor DMTJ HRS/LRS ratio.
As presented above, we mitigate the effects of the limited HRS/LRS ratio by data coding which ensures a certain minimum HD between datawords. A minimum HD, h, guarantees that the lowest number of mismatching bits in the worst case mismatch equals h rather than 1. Since ECC schemes are regularly included in contemporary memories, especially in nonvolatile ones [30], a certain HD is typically maintained. Hence, we do not necessarily have to extend the memory redundancy to introduce minimum HD. To identify the minimum HD required for safe AM 4 operation, the compare times of the match and several mismatch cases are analyzed through Monte Carlo simulations. Fig. 15(b) shows the statistical distribution of t comp for the highlighted cases of Fig. 15(a), i.e., match and 12-bit mismatch. We define the t comp difference between match and mismatch at nominal and 3σ conditions, as δ µ and δ 3σ , respectively. δ µ gives a difference of about 870 ps, and at the 3σ corner this difference is reduced by 2.5 ×. Nevertheless, these Monte Carlo results show that 3σ values do not overlap, suggesting that AM 4 performs correctly at a t comp of 1.4 ns.
In this example, we show that t comp results remain within a safe time-frame region, assuming the minimum HD between any two data words is 12-bit. To identify the effective minimum HD allowed by a 32-bit AM 4 word, we repeat the same Monte Carlo simulations for different match and mismatch cases. Fig. 16 shows the compare time analysis through Monte Carlo simulation at the 3σ corner. For a t comp time of 1.44 ns, AM 4 operates correctly during compare operations, properly differentiating between match and a 5-bit and above mismatch. This result is obtained while also ensuring a δ 3σ of about 100 ps. As shown in Fig. 16, the overlap region (refer to red time-frame at the left) between the match and 4-bit mismatch suggests that a minimum HD distance of 4 may be always required to avoid compare errors. To ensure the minimum HD of at least 4, we may use one of the error correcting codes, such as BCH. For example, BCH (31,21,2) code guarantees the minimum HD of 2 × 2.1=5. While a 32-bit wide AM 4 suffices for a number of applications [5], it is reasonable to assume that for certain other applications, a wider AM 4 array will be required. Due to the low HRS/LRS ratio of the MTJs, a wider AM 4 row leads to a higher compare error probability. In such a case, the minimum HD should be increased accordingly.
Lastly, we evaluate the impact of local variations on t comp at 3σ, based on the match and 6-bit mismatch from the above analysis (refer to Fig. 16). The local variations are around even corners, i.e., Typical-Typical (TT), Fast-Fast (FF), and Slow-Slow (SS), and skewed corners, i.e., Fast-Slow (FS) and Slow-Fast (SF), as shown in Fig. 17. The δ 3σ results demonstrate robustness to local variations, mainly because of the adopted single-ended sensing scheme. The same behavior was observed for the other mismatch cases.
to reduce the number of writes, the effect of write latency on AM 4 performance is minimal. Table II also reports the area of the AM 4 cell, whose layout, along with an illustration of the 3D hybrid CMOS/DMTJ process, is shown in Fig. 18(a)-(b). The inclusion of the access transistor for the write operation (refer to Fig. 10) would incur an overhead of about 48% of the cell area. This access transistor is also present in the Samsung crossbar cell [6].

VI. APPLICATION OF AM 4 AP TO SMITH-WATERMAN DNA SEQUENCE ALIGNMENT
Smith-Waterman is a dynamic programming algorithm [12] that identifies the optimal alignment of two sequences. It is widely used in bioinformatics and computational biology for DNA (genome) sequence alignment. The Smith-Waterman algorithm has two steps. The first step is scoring, which builds a two-dimensional scoring matrix to find the maximum similarity score between two sequences. The second step is traceback, which reconstructs the optimal alignment path. Scoring is the most computationally demanding step [15], while traceback requires significantly less computing power and can therefore be performed by an external host CPU. In the following, we focus on the AM 4 -based AP implementation of the scoring step.
The sequential time complexity of the Smith-Waterman score matrix calculation is O(nm), where n and m are the lengths of the two sequences. The upper bound of the Smith-Waterman scoring complexity on a parallel von Neumann machine with p parallel processing units is O(nm/p). An AM 4 -based AP can achieve a linear time complexity of O(max(n,m)). Smith-Waterman is used in genome analysis to find the optimal local alignment (of two or more DNA sequences). Global alignment, which is also a frequently used genome analysis tool, can be implemented on an AM 4 -based AP with only a few modifications to the local alignment implementation. Similarly, an AP can efficiently implement multiple sequence alignment [15].
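The gap between the O(nm) sequential bound and the linear parallel bound comes from the data dependencies of the scoring matrix: every cell on the same anti-diagonal depends only on cells of the two preceding anti-diagonals. The following is a minimal software sketch of this wavefront order, assuming a linear gap penalty and illustrative scoring parameters (match, mismatch, gap) that are not taken from the paper. It does not model the AM 4 -based AP implementation; it only shows that the number of dependent steps is n + m - 1, which is within a factor of two of max(n,m).

```python
def sw_score_sequential(a, b, match=2, mismatch=-1, gap=-1):
    """Row-by-row Smith-Waterman scoring: O(n*m) dependent cell updates."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


def sw_score_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Anti-diagonal (wavefront) order.

    All cells with i + j == d are mutually independent, so a machine with
    enough parallel units could update one anti-diagonal per step.
    Returns (best score, number of wavefront steps).
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    steps = 0
    for d in range(2, n + m + 1):
        steps += 1
        # The inner loop is the part that could run in parallel.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best, steps


score, steps = sw_score_wavefront("ACACACTA", "AGCACACA")
```

Both functions fill the same matrix and therefore return the same score; only the traversal order, and hence the number of dependent steps, differs.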
Our comparison includes the resistive device-based NOR-type AP (RAP) presented in [42]. The NOR configuration is the reason RAP outperforms AM 4 . However, for the same reason, RAP achieves significantly lower energy efficiency (-69%) compared to AM 4 .
For the above examined workload, AM 4 performance is limited by the density of the datasets and the parallelism of the task. Smith-Waterman DNA sequence alignment scoring matrix calculation is an example of a highly parallelizable task where a large number of data elements are processed simultaneously, allowing AM 4 to apply its parallel processing abilities. In a Smith-Waterman workload, AM 4 scales to the dataset size.
A variety of approximate search CAM designs use timing (i.e., score signal delay or the speed of the matchline discharge) as a measure of Hamming distance. Bui and Shibata [61] exploit the delay of the score signal for a Hamming distance search CAM. An approximate CAM with a small Hamming distance tolerance (≤ 2 bits) is proposed in [62], [63]. Garzón et al. [34] and Hanhan et al. [35] use the combination of the voltage that controls the speed of the matchline discharge and the sense amplifier reference voltage to define the Hamming distance threshold. These designs are capable of tolerating very large Hamming distances. AM 4 differs from conventional and approximate CAM solutions in that, to our knowledge, it is the first NAND-type CAM based on a random-access magnetoresistive crossbar. Typical state-of-the-art emerging memory based CAMs are designed as NOR CAMs, where the matchline discharges on a mismatch. Since mismatches are much more frequent than matches (in typical CAM/TCAM applications, only one memory row matches, while the rest mismatch), all matchlines need to be pre-charged before every search/lookup, which wastes a significant amount of energy. In contrast, in a NAND CAM, only the matching row(s) discharge, reducing the energy consumption of search/lookup by orders of magnitude.
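The difference between the two matchline styles can be illustrated with a purely behavioral sketch. The following assumes that the number of discharging matchlines per lookup is a rough proxy for search energy; it is not a circuit or energy model, and the function names and sample contents are hypothetical. The threshold parameter mimics an approximate (Hamming distance tolerant) search, with threshold 0 corresponding to an exact match.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def search(rows, key, threshold=0):
    """Indices of rows whose Hamming distance to the key is <= threshold."""
    return [i for i, word in enumerate(rows)
            if hamming_distance(word, key) <= threshold]


def discharged_matchlines(rows, key, cam_type, threshold=0):
    """Count matchlines that discharge during one lookup.

    NOR-type: a matchline discharges when its row mismatches.
    NAND-type: a matchline discharges only when its row matches.
    """
    matches = len(search(rows, key, threshold))
    if cam_type == "NOR":
        return len(rows) - matches
    if cam_type == "NAND":
        return matches
    raise ValueError("cam_type must be 'NOR' or 'NAND'")


# Hypothetical 4-bit CAM contents and search key (illustration only).
rows = [0b1010, 0b1011, 0b0101, 0b1111]
key = 0b1010

exact_hits = search(rows, key)                  # exact-match lookup
approx_hits = search(rows, key, threshold=1)    # tolerate a 1-bit mismatch
nor_count = discharged_matchlines(rows, key, "NOR")
nand_count = discharged_matchlines(rows, key, "NAND")
```

With one matching row out of many, the NOR count grows with the array size while the NAND count stays at the number of matches, which is the intuition behind the energy argument above.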

B. Associative Processor
The use of STT-MRAM and Resistive Ternary CAM (TCAM) for data-intensive computing was proposed by Guo et al. [64]. ReAP, a resistive memory based, massively parallel in-memory associative processor was first introduced by Yavits et al. [13]. Yantir et al. [65] introduced a two-dimensional model of an in-memory associative processor. Hout et al. [66] extended the associative processor model to support multi-valued logic. Imani and Rosing [67] proposed another design of an associative processor for near-memory processing. Caminal et al. [68] applied in-memory associative processing to database analytics acceleration, while Yavits et al. [17] and Neggaz et al. [69] implemented in-memory matrix multiplication on an associative processor. Garzón et al. [14] and Yantir et al. [70] separately proposed a convolutional neural network design using an in-memory associative processor. Complete system designs of in-memory associative processors have been separately proposed by Zha and Li [71] and Caminal et al. [72]. Yantir [73] studied CMOS and resistive NOR CAM based associative processors and their applications.
To our knowledge, AM 4 is the first solution that converts an MRAM crossbar designed for random access storage into an associative processor. We achieve this without altering the MRAM core, only by manipulating data and amending the peripheral circuitry.

VIII. CONCLUSION
In this work, we presented AM 4 , a multi-purpose (i.e., CAM, TCAM, approximate CAM, and in-memory associative processor) NAND-type architecture based on the silicon-proven MTJ-based Samsung crossbar array. AM 4 enables a wide range of in-memory computing applications. We validated the basic AM 4 functionality and evaluated its timing and energy consumption by circuit-level simulations using Cadence EDA tools. AM 4 was designed using a commercial 28 nm FDSOI technology node. A Verilog-A based compact model for the DMTJ device was amended. Simulation results show that AM 4 may require data coding (such as ECC) to operate reliably due to the very limited high resistance / low resistance ratio. We conducted an exhaustive design space exploration, showing that AM 4 exhibits very low susceptibility to process variations around even and skewed corners. AM 4 was applied to Smith-Waterman DNA sequence alignment, a frequent bioinformatics workload. AM 4 was shown to significantly outperform state-of-the-art conventional as well as other in-memory computing alternatives in terms of performance and energy-efficiency.