DIPER: Detection and Identification of Pathogens Using Edit Distance-Tolerant Resistive CAM

We propose a novel resistive edit distance-tolerant content addressable memory for computational genomics applications, particularly for detection and identification of pathogens of pandemic importance. Unlike state-of-the-art approximate search solutions that tolerate a small number of replacements between the query pattern and the stored data, DIPER tolerates insertions and deletions, which are ubiquitous in genomics. DIPER achieves up to 1.7$\times$ higher $F_1$ score for high-quality DNA reads and up to 6.2$\times$ higher $F_1$ score for DNA reads with 15% error rate, compared to the state-of-the-art DNA classification tool Kraken2. Simulated at 500 MHz, DIPER provides 910$\times$ average speedup over Kraken2.


Itay Merlin, Alexander Fish, and Leonid Yavits are with the En-ICS Labs, Faculty of Engineering, Bar-Ilan University, Ramat-Gan 5290002, Israel (e-mail: itaymerlin1@gmail.com; alexander.fish@biu.ac.il; leonid.yavits@biu.ac.il).
Esteban Garzón is with the Department of Computer Engineering, Modeling, Electronics and Systems, University of Calabria, 87036 Rende, Italy (e-mail: esteban.garzon@unical.it).
Digital Object Identifier 10.1109/TC.2023.3315829

I. INTRODUCTION
IMAGINE that in a few short years, the world is struck by a viral pandemic. The virus evolves quickly, creating thousands of variants. An accurate and timely detection and identification of the virus mutation is critical for controlling the pandemic. Pathogen identification tools available today are: 1. PCR tests, which allow detecting some variants of concern but are likely to become prohibitively slow, expensive, and inaccurate as the number of variants grows beyond several distinct ones; 2. Genome sequencing followed by detection and identification using computational genomics tools. Unfortunately, the latter quickly becomes impractical while using state-of-the-art solutions: with the progress of sequencing technologies, it is not implausible to assume that hundreds of thousands to millions of samples will be sequenced worldwide daily during the next pandemic. Thus, identification of pathogens by computational genomic tools becomes a problem of a very large scale. Analyzing a single genome may take a day on a high-performance server [1]. As a result, the computer infrastructure will be overwhelmed, and valuable data potentially lost or misinterpreted. New dangerous mutations will be overlooked or discovered late, leading to catastrophic loss of lives and tremendous economic impact.
Future viral pandemics are not the only looming healthcare crisis. Antimicrobial resistance is a global health threat, declared by the WHO as one of the top 10 global public health threats facing humanity. Misuse and abuse of antimicrobials are among the main drivers in the development of drug-resistant pathogens. To prevent such misuse and abuse, an accurate and fast identification of microbial pathogens is critical.
The computational barriers arising from the algorithmic overhead involved in the interpretation of sequencing results, the overwhelming volumes of sequenced data, as well as mounting inefficiencies of conventional von Neumann computers, render large-scale real-time and accurate genome analysis infeasible. In this work, we develop a resistive edit distance-tolerant approximate search content addressable memory which can be used for hardware acceleration of detection and identification of pathogens.
Content Addressable Memories (CAMs) are widely used in microprocessors (fully associative cache memories, translation lookaside buffers, branch prediction buffers, superscalar renaming circuitry and so on), as well as in network routers and switches, and other domain-specific hardware. While CAMs are typically designed to find exact matches, approximate or similarity search is increasingly required by many popular contemporary data-intensive applications, such as data analytics, machine learning, deep learning, and computational genomics [2], [3], [4]. The latter has gained keen interest of the research community due to the exponential growth of the sequenced DNA data volume in recent years. It is used in a wide variety of applications well beyond pathogen surveillance, from personalized healthcare, through monitoring environmental ecosystems, to sustainable agriculture.
In approximate search, if the difference between a stored pattern and the query pattern is below a certain predefined threshold, such stored pattern is considered a "match". While in some cases the difference is limited to replacements (i.e., where a data element is replaced by another), in applications involving text and sequenced DNA data there are two additional types of difference: insertions (where a data element is inserted into an otherwise identical sequence of data elements), and deletions (where a data element is deleted from a sequence), collectively called indels. Replacements, insertions and deletions are typically referred to as edits.

A variety of CAM designs support approximate search. These solutions target Hamming distance tolerance, typically limited to only a few bits [5]. However, a single insertion or deletion (indel) can result in a very large Hamming distance, as demonstrated by Fig. 2(a). Such a single edit may comprise tens of bits in Hamming distance, which would almost certainly be above the tolerance threshold of state-of-the-art approximate search CAMs.
To alleviate this fundamental limitation, we propose a resistive edit distance-tolerant content addressable memory (DIPER), which is capable of tolerating a user-configurable edit (rather than Hamming) distance. DIPER targets high-speed approximate search in applications such as computational genomics, particularly in detection and identification of pathogens, where indels are ubiquitous.
Traditionally, pathogen detection relies upon the identification of pre-established markers of a particular disease [7]. However, pathogen detection increasingly deals with metagenomic samples, for example sourced from wastewater. In particular, genome identification is performed as follows (Fig. 1): (1) a sample potentially containing DNA of multiple organisms is obtained and prepared; (2) the sample is sequenced; the sequencer outputs DNA reads (potentially sourced from different organisms); DNA sequencers are prone to sequencing errors, such as indels and replacements; (3) DNA reads (soiled with sequencing errors) are processed by a metagenomic classification application that potentially associates each DNA read with a certain species in its database. A variety of DNA classifiers have been proposed, including Kraken2 [8], which classifies DNA by exact matching of DNA read fragments (called k-mers) against an existing DNA database, e.g., a collection of pathogen DNA. Since DNA reads typically contain sequencing errors, a certain fraction of query k-mers would not hit in the database, thus limiting the sensitivity of conventional DNA classifiers.
The operating principle of DIPER is based on the observation that an insertion or deletion shifts a part of the data pattern right or left, as shown in Fig. 2. Hence, by matching not only the co-located data elements but also their left and right neighbors, it is possible to tolerate insertions and deletions, as shown in Fig. 2(b). If none of the three candidate data elements (co-located, left, and right neighbors) matches, a single element mismatch occurs. If the number of such mismatches does not exceed a certain configurable Edit Distance (ED) threshold, we consider it a match. Conversely, if the number of single mismatches exceeds the ED threshold in every DIPER row, the query pattern misses in DIPER.
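The neighbor-tolerant comparison described above can be illustrated with a short behavioral sketch in Python. It is a software abstraction only, not the circuit: the function names `tolerant_mismatches` and `row_matches`, the example strings, and the treatment of the row edges (a missing neighbor is simply ignored) are assumptions made for illustration.

```python
def tolerant_mismatches(stored: str, query: str) -> int:
    """Count query positions that match neither the co-located stored base
    nor its left or right neighbor (behavioral model, not the circuit)."""
    if len(stored) != len(query):
        raise ValueError("stored and query patterns must have equal length")
    count = 0
    for i, base in enumerate(query):
        # Candidates: left neighbor, co-located base, right neighbor.
        # At the row edges the missing neighbor is ignored (assumption).
        candidates = stored[max(i - 1, 0): i + 2]
        if base not in candidates:
            count += 1
    return count


def row_matches(stored: str, query: str, ed_threshold: int) -> bool:
    """A row matches if its mismatch count does not exceed the ED threshold."""
    return tolerant_mismatches(stored, query) <= ed_threshold


stored = "GATTACAGCT"
query = "GATCTACAGC"  # 'C' inserted after the third base, truncated to 10 bases
print(tolerant_mismatches(stored, query))          # 1
print(row_matches(stored, query, ed_threshold=1))  # True
```

In this example every base after the inserted 'C' is shifted by one position and still matches as a neighbor, so only the inserted base itself is counted as a mismatch.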
Our paper makes the following main contributions:
• To our knowledge, DIPER is the first resistive approximate search CAM that can tolerate user-programmable edit distance.
• DIPER employs ML discharge speed as a measure of edit distance, and does not require data transformation, such as error correction codes or locality-sensitive hashing.
• DIPER stores and compares to several shifted copies of the reference genomes to efficiently support insertion/deletion tolerance.
• DIPER is evaluated as a part of a pathogen detection and identification platform, to be used for example for pathogen transmission and mutation tracking during viral pandemics.

A. DNA Classification
DNA is composed of four nucleotides: Adenine (A), Guanine (G), Cytosine (C), and Thymine (T), which are frequently referred to as DNA basepairs, bases or bps. Accordingly, a DNA data element is a DNA base that can have one of four values (A, G, C and T). DNA sequencing is the process of determining the bases of a DNA chain. Contemporary high-throughput DNA sequencers can sequence multiple DNA samples in parallel.
The goal of a DNA classifier is to find what organism a DNA sequence (or a single DNA read) belongs to, taking into account that it possibly originates from a metagenomic sample (i.e., a sample that contains DNA of many different species). Several probabilistic classifiers have been proposed, such as the interpolated Markov model based Phymm and PhymmBL [9], BLAST-based models [10], the naive Bayesian classifier NBC [11], and others. These classification tools are sensitive but relatively slow. For example, DNA classification using Smith-Waterman-like dynamic programming would have a complexity ranging from $O(m \cdot n^2)$ (the best case) to $O(m^2 \cdot n^2)$ (the worst case), where $m$ is the number of DNA reads and $n$ is the read length.
Recently, PACIFIC, a convolutional neural network based SARS-CoV-2 classification solution, has been proposed [12]. However, PACIFIC supports only several very distinct classes of organisms.
To speed up DNA classification, exact pattern matching classifiers were developed, including CLARK [13] and Kraken2 [8]. These classifiers exhibit significantly higher speed at the cost of limited sensitivity. One reason for the reduction in sensitivity is sequencing errors that are inherently present in DNA reads. These sequencing errors manifest in replacing bases in DNA reads with incorrect ones, deleting bases, or inserting redundant bases. As a result, DNA read fragments (k-mers) that otherwise should have matched in the classification database end up being unclassified and discarded.
Approximate search CAMs alleviate this fundamental flaw by tolerating the sequencing errors to some extent. When comparing a sequence $s_1$ against a set of sequences $S$ contained in an approximate search CAM, every $s_2 \in S$ such that $\text{HammingDistance}(s_1, s_2) \le t$ for some threshold $t$ will match. However, as discussed above, DNA reads contain indels. A single indel may result in a very significant Hamming distance, as shown in Fig. 2(a). To endure such sequencing errors, the tolerance threshold $t$ needs to be relatively high. This increases the sensitivity of classification, but at the same time significantly reduces its precision (by allowing a high false positive rate).
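To make the effect of a single indel on Hamming distance concrete, the following minimal Python sketch compares a replacement with an insertion. The sequences and the helper name `hamming_distance` are illustrative assumptions; distances are counted in bases rather than bits.

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s1, s2))


reference = "GATTACAGCT"
read_with_replacement = "GATCACAGCT"  # one base replaced (T by C at index 3)
read_with_insertion = "GATCTACAGC"    # one base inserted at index 3, then truncated

print(hamming_distance(reference, read_with_replacement))  # 1
print(hamming_distance(reference, read_with_insertion))    # 7
```

Both reads contain a single edit, yet the insertion shifts all downstream bases and is seen as seven base mismatches by a Hamming distance comparison.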
State-of-the-art classifiers include hardware-accelerated solutions as well. SquiggleFilter [14] is a virus detection framework that analyzes the raw output (raw squiggles) of the ONT MinION sequencer and filters out all but the target virus DNA reads. GenSLMs [15] applies large language models to identification and classification of viral variants using supercomputers such as Polaris at the Argonne Leadership Computing Facility and Selene at NVIDIA, as well as the Cerebras CS-2 wafer-scale cluster. In this work, we present a fast, highly sensitive and precise approximate matching-based DNA detection and identification solution, implemented by DIPER.

B. Content-Addressable Memory
Fig. 3 shows the architecture of a conventional CMOS content-addressable memory (CAM) array comprised of $n$ columns and $m$ rows [16]. The CAM performs a comparison between the query data pattern stored in the search data register and the information contained within the six-transistor static random access memory (6T-SRAM) bitcells. A matchline (ML) is shared between the bitcells of an $n$-bit word and is also fed into a sense amplifier (SA). The search lines (SLs), denoted SL and $\overline{\text{SL}}$, are shared across all rows of the CAM array. Read and write operations within the CAM array are executed in a similar manner as in conventional 6T-SRAM. Access to a bitcell is facilitated by enabling the word line (WL) for the corresponding row and precharging or asserting (complementary values) the SLs for read or write operations, respectively. A search (compare) operation is performed simultaneously across the entire array during a single clock cycle by asserting the query data on the SLs. The matchline sense amplifiers (MLSAs) evaluate the state of the matchlines at the end of the comparison cycle and signal a match or mismatch.
Several resistive CAM based solutions for genome analysis have been developed. PARC [4], RAPID [17] and RADAR [18] provide resistive RAM and CAM based optimal sequence alignment acceleration solutions. ALIGNS [19] proposes a SOT-MRAM based DNA read alignment accelerator. Aligner [20] is a resistive short DNA read alignment accelerator. RASSA [2] is a resistive CAM based prealignment accelerator for long DNA reads. Resistive memory based pattern matching solutions have also been suggested [21].

C. Approximate Search Content-Addressable Memory
In recent years, many ternary and binary NOR- and NAND-based CAM bitcell designs have been proposed, including both CMOS-based [22] and emerging-memory-based [23] solutions. Several CAM designs offer soft-error tolerance using error correction coding (which requires memory redundancy) and replacing the matchline (ML) sense amplifier with an analog comparator [24], [25]. Such designs typically only tolerate a limited Hamming distance (1-4 bits). Another class of approximate search CAMs uses locality sensitive hashing of stored data and query patterns [26], [27]. While such schemes potentially tolerate large Hamming distances, they require hashing of data prior to storage and search. Additionally, large Hamming distance does not always result in low similarity of hashed data sketches [28], which leads to false positive results and hence limited precision.
Approximate CAM based solutions for DNA read alignment have been proposed [29]. Several emerging memory (such as ReRAM or FeFET) based designs for Hamming distance approximation have also been developed [30], [31], [32].
Some of the approximate search CAM designs use timing (i.e., the matching score signal delay, or the speed of the ML discharge) as a measure of Hamming distance. HD-CAM [33], a Hamming distance-tolerant CAM, uses the combination of the voltage controlling the speed of the ML discharge and the sense amplifier reference voltage to define the Hamming distance threshold. HD-CAM is capable of tolerating very large Hamming distances. However, it does not directly support edit distance tolerance. A Hamming distance search CAM, where the matching score signal is delayed every time a bit mismatch occurs, is proposed in [34]. In the approximate search enabled CAM for energy efficient GPUs, proposed in [5], a small Hamming distance ($\le 2$ bits) is tolerated through meticulous timing of the ML discharge. In [35], a Hamming distance of up to 4 bits is tolerated by using delay lines at the clock inputs of four separate sense amplifiers on each ML. Tunable sampling time techniques require very precise device and circuit sizing, while achieving limited sensitivity and precision (due to false mismatches and multiple false matches [5]). These issues are exacerbated by process variations.
The recently proposed EDAM [36] is a CMOS edit distance-tolerant content addressable memory for approximate search. EDAM has several disadvantages, such as:
1. The CMOS EDAM cell is very large (42 transistors), and hence scaling EDAM to support a large genome database is challenging.
2. The CMOS EDAM design requires cross-connectivity among neighboring memory columns (to enable comparison with the left and right neighbors), which may render it wire-bound, adversely affecting density and timing.

III. RESISTIVE EDIT DISTANCE-TOLERANT CAM DESIGN
We solve the CMOS EDAM inefficiencies by introducing DIPER, whose design is presented in Fig. 4. To significantly increase the memory density, we replace CMOS SRAM bitcells by resistive memory elements (memristors). DIPER stores the DNA bases in one-hot encoding format, such that two SRAM bitcells are replaced by four memristors. To relieve the cross-connectivity problem, we utilize the ability of memristors to form much denser memory structures compared to CMOS. Every DIPER row comprises three subrows, where in addition to the dataword, we store two of its "copies": the first one is shifted left by one DNA base, and the second one is shifted right by one DNA base. As a result, the query DNA base can be compared with the co-located stored DNA base and its left and right neighbors without a need to fetch the neighbor values from the neighboring storage cells. This redundant approach triples the memristor count per cell (12R) but removes the need to cross-connect memory bit-columns for the left and right neighbor comparison.
DIPER supports two operations: write and compare. Write is applied relatively rarely, to build or to update the reference genome database. Therefore, we consider write an offline operation and do not optimize its latency. DNA bases are encoded using one-hot encoding, for example A is encoded as '0001', G as '0010', C as '0100' and T as '1000'. The datawords are written row by row, as in a typical single-port CAM. To enable the write, one of the three wordline signals ($WE_1$, $WE_2$ and $WE_3$) is asserted and the respective dataword (or one of its left-shifted or right-shifted copies) is asserted on the bitlines. However, since writing a memristor requires a high threshold voltage, writing 4 memristors (a single DNA base) may create an excessive current through the wordline transistors. Therefore, since write is a non-time-critical offline operation, we perform it in a bit-serial manner: only one of the 4 memristors in each subrow is written at a time. It thus takes $4 \times 3$ write cycles to store a dataword (along with its left-shifted and right-shifted copies) into a single DIPER row.
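The row organization and the bit-serial write described above can be sketched in Python as follows. This is a data-layout illustration under stated assumptions: the names `build_row` and `write_schedule` are hypothetical, and the vacated end column of each shifted copy is filled by repeating the edge base, a choice the text does not specify.

```python
# One-hot encoding of DNA bases, as given in the text.
ONE_HOT = {"A": "0001", "G": "0010", "C": "0100", "T": "1000"}


def build_row(dataword: str) -> dict:
    """Return the three subrows of a row: the dataword and its two shifted copies.

    Assumption: the vacated end column of each shifted copy repeats the edge
    base, so it adds no extra tolerance at the row boundary.
    """
    codes = [ONE_HOT[base] for base in dataword]
    return {
        "left_shifted": codes[1:] + codes[-1:],   # column i holds base i + 1
        "dataword": codes,                        # column i holds base i
        "right_shifted": codes[:1] + codes[:-1],  # column i holds base i - 1
    }


def write_schedule(row: dict) -> list:
    """Bit-serial write: one cycle per (subrow, one-hot bit position) pair."""
    return [(subrow, bit) for subrow in row for bit in range(4)]


row = build_row("ACGT")
print(row["left_shifted"])        # ['0100', '0010', '1000', '1000']
print(len(write_schedule(row)))   # 12 write cycles, i.e. 4 x 3
```

Each column of the row then holds 12 memristor values (three one-hot codes), which corresponds to the 12R cell described above.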

A. DIPER Array Design and Operation
The compare operation is illustrated in Fig. 5. During compare, all wordlines are negated. The inverted query pattern is asserted on the bitlines. If a query base in a certain DIPER column matches the stored base or one of its neighbors stored in the same cell (as illustrated in Fig. 5(a)), logic '0' is asserted to the gate of the respective NMOS transistor ($T_L$, $T_C$, or $T_R$). Therefore, it is sufficient for the query base to match either the co-located base or one of its neighbors to disable the ML discharge path. If, however, the query base does not match any of the bases stored in the cell (Fig. 5(b)), the gates of all three NMOS transistors ($T_L$, $T_C$, and $T_R$) are asserted '1', thus enabling an ML discharge path.
One-hot encoding ensures that regardless of what bases are compared (A vs. T, or G vs. C, or T vs. G and so on), the result is always the same: for a base match, only one memristor conducts, passing '0' to the gate of one of the transistors ($T_L$, $T_C$, or $T_R$); for a base mismatch, one memristor in each subrow (left, co-located and right) conducts, passing '1' to the gates of all NMOS transistors ($T_L$, $T_C$, and $T_R$).
In some cases, individual DNA bases or DNA fragments of either the query pattern or the stored datawords should not affect the result of the compare (i.e., be set as "don't cares"). To mask off query bases, rendering them "don't care", we encode them as '0000'. In such case, '0' is always asserted at the gates of $T_L$, $T_C$, and $T_R$. A "don't care" value can be stored by writing '0000' in the 4 memristors of the target subrow of the target cell, and asserting the respective wordline ($WE_1$, $WE_2$, or $WE_3$), which will ensure '0' at the gate of the respective transistor ($T_L$, $T_C$, or $T_R$).
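The compare logic of a single cell, including the "don't care" cases, can be summarized by the following logic-level Python sketch. It abstracts away all analog behavior; the function names are hypothetical, and the rule that a subrow storing '0000' always yields '0' at its gate follows the description above rather than a circuit simulation.

```python
ONE_HOT = {"A": "0001", "G": "0010", "C": "0100", "T": "1000"}
DONT_CARE = "0000"


def bitlines_for(query_base):
    """Inverted one-hot query pattern; None marks a masked (don't care) base."""
    if query_base is None:
        return DONT_CARE
    return "".join("0" if bit == "1" else "1" for bit in ONE_HOT[query_base])


def gate_value(stored_code: str, bitlines: str) -> int:
    """A memristor storing '1' conducts and passes its bitline value to the gate."""
    return int(any(s == "1" and b == "1" for s, b in zip(stored_code, bitlines)))


def discharge_path_enabled(left: str, center: str, right: str, query_base) -> bool:
    """The ML discharge path is enabled only if all three gates are '1'."""
    bitlines = bitlines_for(query_base)
    return all(gate_value(code, bitlines) == 1 for code in (left, center, right))


cell = (ONE_HOT["A"], ONE_HOT["G"], ONE_HOT["C"])   # left, co-located, right
print(discharge_path_enabled(*cell, "T"))    # True: T matches none of A, G, C
print(discharge_path_enabled(*cell, "G"))    # False: matches the co-located base
print(discharge_path_enabled(*cell, None))   # False: masked query base
```

A stored "don't care" behaves the same way in this model: replacing any one of the three stored codes with '0000' forces its gate to '0' and disables the discharge path.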
The discharge speed depends on the number of mismatching bases, which defines the number of discharging paths. The higher the number of mismatching bases, the higher the ML discharge speed. The Edit Distance (ED) tolerance threshold can be dynamically configured by tuning the evaluation voltage $V_{eval}$ (Fig. 4(b)), as demonstrated in the next section.

B. Timing
Fig. 6 shows the timing diagram of five compare operations (the DIPER width is 32 bases) under the nominal voltage ($V_{DD} = 1.2$ V). A compare operation comprises two steps, precharge and evaluation. In the precharge step, the SLs are discharged and then the ML is precharged to $V_{DD}$ by opening the $M_{PC}$ transistor (PC-ML = '0'). In the evaluation step, the $M_{PC}$ transistor is disabled and the query data is driven onto the SLs. The ED threshold is set by the evaluation voltage $V_{eval}$. The ML levels at the end of each cycle signal the compare result: if the ML voltage is above the reference voltage of a sense amplifier (not shown in the figure), the match is signalled by '1' at the output of the sense amplifier; otherwise, a mismatch is signalled by '0' at the output of the sense amplifier.
To enable exact search operations, $M_{eval}$ is fully open ($V_{eval} = V_{DD}$). The approximate matching operation depends on the conductivity of the $M_{eval}$ transistor. During the approximate search, $V_{eval}$ is set below the nominal voltage ($V_{eval} < V_{DD}$), limiting the conductivity of $M_{eval}$.
The first three compares (the 1-3, 3-5 and 5-7 ns time frames in Fig. 6) are exact searches (where $V_{eval} = V_{DD}$), with the first one resulting in a match (the edit distance between the query and the stored pattern is zero). The second and the third compares result in mismatches (the edit distance is 32 and 1, respectively). The fourth and fifth compare operations (the 7-9 and 9-11 ns time frames in Fig. 6) are approximate searches (where $V_{eval} = 0.5$ V, which corresponds to the ED threshold of 1). The fourth compare (where the edit distance between the query and the stored dataword is 1) results in a match. The fifth compare (where the edit distance is 2, i.e., above the ED threshold of 1) results in a mismatch.
To evaluate the impact of global and local variations on DIPER classification efficiency, we focus on the most challenging scenario, where the slowest ML discharge (i.e., with one mismatching base), which is supposed to result in a mismatch, needs to be distinguished from a match. This is the scenario where the variation-induced change in the ML discharge speed has the highest probability of creating a false result. Fig. 7(a) and 7(b) show the effect of process and local variations on the ML discharge speed for this worst case. We have also taken into account the variability of memristor devices, by drawing $R_{on}$ and $R_{off}$ from normal distributions with mean values of 1 kΩ and 1 MΩ, respectively, and σ of 15%. The results of the variation analysis show that the effect of variability on the ML voltage at the sampling time is negligible. DIPER can confidently distinguish between the worst case mismatch and a match by applying the sense amplifier threshold of $V_{DD}/2$.

C. Many-Base Indel Mitigation
In the exact search mode, DIPER efficiently tolerates insertions and deletions of one or multiple single data elements (DNA bases). However, if an indel comprises several consecutive data elements, such an indel may cause a mismatch (a false negative result) and in turn limit the sensitivity. One way to mitigate this drawback is to perform several consecutive queries (searches) with the same query pattern shifted left and right. For example, if there is a two-data-element (two DNA base) insertion in the query, there is a two-element shift between the query and a part of the stored pattern, leading to a mismatch in the exact search mode. However, if we shift the query pattern left by one data element and repeat the search, it will return a match. This is because the query data elements left of the insertion become one position apart from the stored pattern and match as the left neighbors. The data elements right of the insertion also become one position apart from the stored pattern and will match as the right neighbors. By performing multiple queries with a shifted query pattern, we can improve the sensitivity at the cost of a longer search time.
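The shifted-query mitigation can be illustrated with the following Python sketch, which reuses a simple behavioral model of neighbor-tolerant matching. The function names, the example sequences and the order in which shifts are tried are assumptions for illustration; a shift of +1 means the query window starts one base later in the read, i.e., the query pattern is shifted left by one element.

```python
def tolerant_mismatches(stored: str, query: str) -> int:
    """Positions where the query base matches neither the co-located stored
    base nor its left or right neighbor (behavioral model)."""
    return sum(
        base not in stored[max(i - 1, 0): i + 2]
        for i, base in enumerate(query)
    )


def search_with_shifts(stored: str, read: str, start: int, shifts=(0, 1, -1)):
    """Return the first shift giving zero mismatches (exact search mode), else None."""
    k = len(stored)
    for shift in shifts:
        lo = start + shift
        if lo < 0 or lo + k > len(read):
            continue  # the shifted window would fall outside the read
        if tolerant_mismatches(stored, read[lo:lo + k]) == 0:
            return shift
    return None


stored = "GATTACAGCT"
read = "GATTAACCAGCT"   # "AC" inserted after the fifth base of the stored pattern

print(tolerant_mismatches(stored, read[0:10]))  # 2: unshifted query mismatches
print(search_with_shifts(stored, read, 0))      # 1: matches after a one-base shift
```

In this example the two inserted bases happen to equal neighboring stored bases. In this simplified model, inserted bases that differ from all three candidate stored bases would still be counted as mismatches after the shift.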

IV. DIPER AS PATHOGEN DETECTION ACCELERATOR
An accurate and fast pathogen detection and identification platform, capable of overcoming the shortcomings of existing testing techniques [37], can strongly benefit from the edit distance-tolerant matching capabilities of the proposed DIPER, as well as its relatively high density and scaling ability.
Fig. 8 presents a DIPER-based accelerator for pathogen detection that employs the concept of edit distance-tolerant approximate matching as part of a pathogen DNA identification-by-sequencing pipeline. The DNA of the target pathogen, referred to as the reference DNA, is pre-determined and composed of a set of short DNA fragments, known as k-mers, each having a length of $k$ basepairs [38]. The process of extracting the reference DNA k-mers is illustrated in Fig. 8(b), where the first k-mer is extracted from position 0 to position $k - 1$, the second k-mer is extracted from position 1 to position $k$, and so forth. The reference DNA database is generated by storing k-mers in the DIPER offline, as shown in Fig. 8(b), where each unique k-mer is stored in a separate DIPER row. The number of k-mers in the DIPER is limited to $N - k + 1$, where $N$ represents the length of the pathogen DNA. In the case of the Severe Acute Respiratory Syndrome coronavirus-2 (SARS-CoV-2) virus, $N$ is approximately 29 thousand, where some k-mers may appear more than once. Therefore, the actual number of k-mers in the DIPER could be lower. In our evaluation, we use $k = 64$ (a 64-mer).
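The k-mer extraction and the resulting database size can be sketched in a few lines of Python. The function names and the toy reference are illustrative assumptions; only the sliding-window extraction and the $N - k + 1$ bound come from the text.

```python
def extract_kmers(reference: str, k: int) -> list:
    """All k-mers of the reference: positions 0..k-1, 1..k, and so forth."""
    return [reference[i:i + k] for i in range(len(reference) - k + 1)]


def build_database(reference: str, k: int) -> list:
    """Unique k-mers in order of first appearance, one per row."""
    return list(dict.fromkeys(extract_kmers(reference, k)))


reference = "ACGTACGT"   # toy reference, N = 8
k = 4
print(len(extract_kmers(reference, k)))   # 5, i.e. N - k + 1
print(build_database(reference, k))       # ['ACGT', 'CGTA', 'GTAC', 'TACG']
```

The repeated k-mer ACGT is stored only once, which is why the number of occupied rows can be lower than $N - k + 1$.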
A sequenced sample typically consists of a large set of DNA reads, sourced from the DNA of different organisms (e.g., bacteria and viruses) present in the sample. The pathogen detector (Fig. 8(a)) retrieves the DNA reads from memory and transfers them to a read buffer that feeds the shift register. The memory bandwidth required to support the peak DIPER throughput is 10.8 GB/s.
The detection and identification operation begins with loading each new DNA read from the read buffer into the shift register. The DIPER array is fed by a 64-base wide segment of the shift register. The read is then shifted one base-pair to the right in a sliding window manner in every clock cycle, allowing the detection of a single 64-mer in every clock cycle. The process is controlled by a microcontroller implemented as a state machine. Its control registers are memory-mapped for accessibility by the host.
In an ideal scenario, the k-mers from the target pathogen DNA should match exactly in the DIPER. However, sequenced DNA reads often contain errors in the form of replacements, insertions and deletions [39]. Another source of difference between the query DNA reads and the reference DNA is genetic variation, which arises from mutations, as in the alpha or omicron variants of SARS-CoV-2. Such variations result in a nonzero edit distance between a read and a reference fragment (k-mer) that would otherwise match exactly.
The ability of DIPER to tolerate edit distance as well as large Hamming distance enables accurate genome detection, even when the target DNA reads have multiple sequencing errors or genetic variations. Furthermore, the programmable ED threshold feature in DIPER supports a wide variety of sequencing error profiles.
The reads sourced from the DNA of other organisms present in the sample are expected to differ significantly from the target pathogen DNA, such that the edit distance between those reads and the reference k-mers should typically be higher than the ED threshold, which in turn allows classifying those reads as not SARS-CoV-2.
If a 64-mer approximately matches anywhere in DIPER, the hit counter Hits (whose input is the OR of all MLSA outputs, see Fig. 8(a)) is incremented. After the DNA read is processed, the hit counter is compared with a user-defined hit threshold; if it exceeds the hit threshold value, the DNA read is classified as SARS-CoV-2.
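The sliding-window search with a hit counter can be modeled in software as below. This is a behavioral sketch under stated assumptions, not the accelerator: the names `classify_read` and `tolerant_mismatches`, the toy reference, and the small values of $k$ and of the thresholds are chosen only to keep the example short.

```python
def tolerant_mismatches(stored: str, query: str) -> int:
    """Positions where the query base matches neither the co-located stored
    base nor its left or right neighbor (behavioral model)."""
    return sum(
        base not in stored[max(i - 1, 0): i + 2]
        for i, base in enumerate(query)
    )


def classify_read(read, database, k, ed_threshold, hit_threshold):
    """Slide a k-base window over the read and count windows that match any row."""
    hits = 0
    for start in range(len(read) - k + 1):
        window = read[start:start + k]
        # Models the OR of all matchline sense amplifier outputs.
        if any(tolerant_mismatches(row, window) <= ed_threshold for row in database):
            hits += 1
    # The read is classified as the target if the counter exceeds the threshold.
    return hits > hit_threshold, hits


reference = "ACGGCAAGCCAG"                       # toy reference (hypothetical)
k = 6
database = [reference[i:i + k] for i in range(len(reference) - k + 1)]

print(classify_read(reference[2:10], database, k, 0, 1))  # (True, 3)
print(classify_read("TTTTTTTT", database, k, 0, 1))       # (False, 0)
```

In hardware all rows are compared in parallel in one cycle; the inner `any(...)` loop stands in for that parallel compare.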
The ED threshold of the DIPER can be optimized using a validation set, which consists of either simulated DNA reads or DNA reads of known origin (with known detection results). The optimal ED threshold that maximizes a desired criterion, such as sensitivity subject to a precision constraint or the $F_1$ score, can be determined by periodically processing the validation set and varying $V_{eval}$. The optimal ED threshold can then be updated accordingly.
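The threshold selection procedure can be sketched as a simple sweep that maximizes the $F_1$ score over a labeled validation set. Everything concrete here is an assumption for illustration: the names `f1_score` and `best_ed_threshold`, the toy validation set (each read is represented by a hypothetical mismatch count), and the stand-in classifier passed as `classify`.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of sensitivity and precision; 0.0 if there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return 2 * precision * sensitivity / (precision + sensitivity)


def best_ed_threshold(validation, thresholds, classify):
    """Return (threshold, F1) with the highest F1 over the validation set."""
    best = None
    for threshold in thresholds:
        tp = fp = fn = 0
        for read, is_target in validation:
            predicted = classify(read, threshold)
            if predicted and is_target:
                tp += 1
            elif predicted and not is_target:
                fp += 1
            elif not predicted and is_target:
                fn += 1
        score = f1_score(tp, fp, fn)
        if best is None or score > best[1]:
            best = (threshold, score)
    return best


# Toy validation set: (hypothetical mismatch count of the read, known label).
validation = [(0, True), (1, True), (2, True), (3, False), (5, False), (1, False)]
classify = lambda mismatches, threshold: mismatches <= threshold  # stand-in classifier

print(best_ed_threshold(validation, range(6), classify))  # threshold 2 is selected
```

A criterion such as sensitivity subject to a precision constraint could be substituted for `f1_score` without changing the sweep itself.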
While DIPER does not employ the optimal sequence alignment, its capability to tolerate edit distance makes it useful as a prealignment filter or an index accelerator in DNA read alignment.

V. APPLICATION & RESULTS
Pathogen surveillance and tracking are mandatory for health organizations and governments to control pandemics [40]. A rapid, accurate, cost-effective and easy-to-use solution for pathogen detection and identification is critical to enabling worldwide genomic surveillance [41].

A. Figures of Merit for DIPER DNA Classification Efficiency
Table I shows examples of true positive, true negative, false negative and false positive (TP, TN, FN and FP, respectively) DNA classification results generated by DIPER. While true results are trivial, a false negative result is received due to three consecutive insertions (such that the edit distance is less than the ED threshold of 4), which DIPER is unable to tolerate in a single search. A false positive result is received in an unlikely case when a large number of replacements (10 in this example) accidentally match the left or right neighbors of the replaced bases in the stored pattern.
While trivial examples showing false positive results in DIPER are abundant, in reality the probability of occurrence of 64-mers that accidentally falsely match in DIPER is very low, at least for the viral genomes examined in this work. We demonstrate it with the following experiment. We extract all possible 64-mers from all genomes used in the DIPER evaluation other than SARS-CoV-2 (refer to Section V-B). We then match all those 64-mers against the SARS-CoV-2 reference genome in all possible positions. Any positive match would be a false positive since the query genomes are different from the reference genome. The number of accidental false positive results in this experiment is 254 out of approximately 98,000 64-mers, i.e., 0.26%. While this phenomenon is not to be ignored, its impact on DIPER's classification efficiency is almost negligible. This fact is reflected in the DIPER 99.74% precision value when the exact search (i.e., with the ED threshold set at 0) is applied to low error rate reads, as presented in Fig. 9(a).
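The first two figures of merit, sensitivity and precision, are calculated as follows:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

Additionally, we calculate the $F_1$ score, which is the harmonic mean of sensitivity and precision, as follows:

$$F_1 = \frac{2 \cdot \text{Sensitivity} \cdot \text{Precision}}{\text{Sensitivity} + \text{Precision}}$$

We use Kraken2 [8], arguably the most popular state-of-the-art classification tool, as a baseline reference for the virus DNA detection and classification efficiency comparison. Additionally, we compare DIPER with HD-CAM, a Hamming distance-tolerant CAM for DNA classification [33]. We use the normalized $F_1$ score, i.e., the ratio of the DIPER $F_1$ score to the Kraken2 $F_1$ score, as an additional figure of merit in the DIPER comparative evaluation:

$$\text{Normalized } F_1 = \frac{F_1^{\text{DIPER}}}{F_1^{\text{Kraken2}}}$$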
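For reference, these figures of merit can be computed from the TP, FN and FP counts as in the following sketch. It uses the standard definitions of sensitivity, precision and the harmonic-mean $F_1$ score; the function names and the choice of returning 0.0 for an empty denominator are assumptions made for illustration.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of target reads that are detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Fraction of reported matches that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f1_score(sens: float, prec: float) -> float:
    """Harmonic mean of sensitivity and precision."""
    return 2 * sens * prec / (sens + prec) if (sens + prec) else 0.0


def normalized_f1(f1_diper: float, f1_baseline: float) -> float:
    """Ratio of the DIPER F1 score to the baseline (e.g. Kraken2) F1 score."""
    return f1_diper / f1_baseline


# Toy counts, chosen only to exercise the functions (not measured results).
s = sensitivity(tp=8, fn=2)
p = precision(tp=8, fp=8)
print(s, p, f1_score(s, p))
```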

B. Evaluation Methodology and Results
All DNA sequences in our evaluation are downloaded from NCBI online data sets [42]. We evaluate the DIPER pathogen detection efficiency by conducting two different experiments, one using simulated virus DNA reads and the other applying pathogen detection to raw DNA reads produced by an industrial sequencer.
In the first experiment, we attempt to detect the coronavirus SARS-CoV-2 and its variants (alpha, B.1.1.7; beta, B.1.351; and gamma, P.1) in a simulated metagenomic sample containing DNA reads of the above listed organisms, as well as the DNA of several other organisms: SARS-CoV-1, MERS-CoV, Coronavirus HKU1 and Human papillomavirus (HPV) 14. The 64-base-long DNA reads were extracted from random positions in the DNA sequences of each of these organisms. Subsequently, sequencing errors (insertions, deletions and replacements) were randomly injected. Using these reads, a metagenomic dataset was created with the following two error rates:
1. Low error rate reads: replacement = 3.6%, insertion = 0.2%, deletion = 0.2%, simulating the low error rate of second generation DNA sequencers [43].
2. High error rate reads: replacement = 1%, insertion = 7%, deletion = 7%, simulating the high error rate of third generation DNA sequencers [43].
Each DNA read (a 64-mer) in the metagenomic sample is searched in DIPER, which stores the SARS-CoV-2 DNA as a reference database. The error-injected reads of SARS-CoV-2 and its variants that match and mismatch in DIPER constitute true positive and false negative results, respectively. DNA reads of the other organisms that match and mismatch in DIPER constitute false positive and true negative results, respectively.
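The read simulation described above can be sketched as follows. This is an illustrative model rather than the simulator used in this work: the function names are hypothetical, errors are drawn independently per base, and the read length is not re-normalized after insertions and deletions, which is an assumption.

```python
import random

BASES = "ACGT"


def sample_read(genome: str, k: int, rng: random.Random) -> str:
    """Extract a k-base read from a random position in the genome."""
    start = rng.randrange(len(genome) - k + 1)
    return genome[start:start + k]


def inject_errors(read: str, p_repl: float, p_ins: float, p_del: float,
                  rng: random.Random) -> str:
    """Apply per-base replacement, insertion and deletion errors to a read."""
    out = []
    for base in read:
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop the base
        if r < p_del + p_repl:
            # replacement: substitute one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        if rng.random() < p_ins:
            out.append(rng.choice(BASES))  # insertion after the current base
    return "".join(out)


rng = random.Random(0)
genome = "".join(rng.choice(BASES) for _ in range(1000))  # placeholder genome
read = sample_read(genome, 64, rng)
low_error = inject_errors(read, p_repl=0.036, p_ins=0.002, p_del=0.002, rng=rng)
high_error = inject_errors(read, p_repl=0.01, p_ins=0.07, p_del=0.07, rng=rng)
print(len(read), len(low_error), len(high_error))
```

The two calls use the low and high error rate settings listed above, which sum to 4% and 15% total error rate, respectively.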
To compare DIPER with HD-CAM [33], SARS-CoV-2 detection in HD-CAM is implemented in a similar fashion.
To compare DIPER efficiency and performance with those of Kraken2, a Kraken2 database containing only SARS-CoV-2 DNA was created. In other words, Kraken2 was operated in a detection mode rather than a full classification mode. Kraken2 was applied to our simulated metagenomic dataset using 64-mers.
The sensitivity, precision and $F_1$ score results of the first experiment (as a function of the ED threshold) are presented in Fig. 9(a), 9(b), 9(d), and 9(e). The classification efficiency of Kraken2 does not depend on the DIPER ED threshold; hence, its sensitivity, precision and $F_1$ score are represented by horizontal lines. We observe that DIPER sensitivity grows with the increasing ED threshold for both low and high error rate read detection runs. At the same time, DIPER precision diminishes (due to the growing number of false positive matches) for both runs. There is a point where the drop in precision begins to outpace the growth of sensitivity. This point marks the optimal $F_1$ score, which occurs at different ED threshold values (3 for low error rate and 10 for high error rate reads, respectively). Since DIPER tolerates edit distance while HD-CAM tolerates only Hamming distance, DIPER sensitivity, and consequently its $F_1$ score, are typically higher than those of HD-CAM.
The normalized $F_1$ score results of the first experiment are presented in Fig. 9(d) and 9(e) (right Y axis). While DIPER outperforms Kraken2 and HD-CAM for both low and high error rate DNA reads, the normalized $F_1$ score is much higher for high error rate reads. For low error rate reads, DIPER outperforms Kraken2 by approximately 1.4× to 1.6×. This figure grows to 1.3× to 6.2× for high error rate reads. Kraken2 fails to correctly classify certain SARS-CoV-2 reads due to sequencing errors, since it performs exact rather than approximate search. In contrast, DIPER correctly classifies more reads due to its ability to tolerate edits caused by sequencing errors. Since indels are likely to result in very large Hamming distances, DIPER outperforms HD-CAM, especially for higher error rates.
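The effect of indels on Hamming distance can be seen in a small worked example. The sketch below uses the standard Hamming distance and a textbook dynamic-programming Levenshtein distance; the 16-base strings are hypothetical and chosen only for illustration. A single inserted base shifts the rest of the read, so most downstream positions disagree under position-wise comparison, while the edit distance remains small.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (replacements, insertions, deletions), row by row."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # replacement or match
        previous = current
    return previous[-1]


stored = "ACGTTGCAAGCTTAGC"
# One base inserted after position 3; the last base falls off the 16-base window.
query = stored[:4] + "A" + stored[4:15]

print(hamming_distance(stored, query), edit_distance(stored, query))
```

For this pair, the position-wise comparison reports ten differing bases, whereas two edits (one insertion and one deletion) are sufficient to transform one string into the other.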
In the second experiment, we perform detection of SARS-CoV-2 in raw DNA reads. Specifically, we analyze high quality long reads produced by PacBio sequencers RS, RS II and Sequel II (from Sequence Read Archive at NCBI [42]).
Similar to the first experiment, a metagenomic DNA sample was created, comprising raw reads from the following organisms: Human Alphaherpesvirus 2, Pandoravirus Salinus and SARS-CoV-2. In this experiment, the Human Alphaherpesvirus 2 DNA (NCBI [42]) was used as the reference (target) virus DNA. Hence, the raw reads of Human Alphaherpesvirus 2 are expected to match, i.e., yield positive results, while the rest of the reads should ideally mismatch (i.e., produce negative results). The reference database was produced by extracting 64-mers of the clean Human Alphaherpesvirus 2 DNA and storing them in DIPER and HD-CAM.
DIPER sensitivity, precision and $F_1$ score results of the second experiment are presented in Fig. 9(c) and 9(f). Similar to the first experiment outcome, DIPER sensitivity grows with the increasing ED threshold, while its precision decreases. The DIPER $F_1$ score reaches its maximum at the ED threshold of 6. As the ED threshold continues to grow, the sensitivity rise is increasingly outpaced by the precision drop, leading to the $F_1$ score reduction.
The normalized $F_1$ score results of the second experiment are shown in Fig. 9(f) (right Y axis). The overall behavior is similar to that of the first experiment with low error rate reads. DIPER outperforms Kraken2 by 1.3× to 1.7×. DIPER correctly classifies more reads than Kraken2. However, starting at the ED threshold of 6, DIPER incorrectly classifies a growing number of Pandoravirus Salinus and SARS-CoV-2 reads as Human Alphaherpesvirus 2 (generating false positive results).
Based on the results of the two experiments, we conclude as follows: 1. DIPER strongly outperforms Kraken2, and the factor grows with the DNA read error rate, reaching 6.2× for reads with a 15% error rate.

C. Speedup
We compare DIPER performance with that of the 3T1R TCAM accelerator [44] (a hardware solution) and Kraken2 [8] (a software-only solution).
Assuming that the 3T1R TCAM accelerator is capable of processing one k-mer per cycle, and since its search timing is almost identical to that of DIPER (Table III), we may conclude that DIPER performance is on par with that of the 3T1R TCAM of [44].
DIPER processes one k-mer per cycle; hence, its classification throughput is $f_{op} \times k$, where $f_{op}$ is the DIPER operating frequency and $k$ is the k-mer size. For $f_{op}$ = 500 MHz and $k$ = 64, the DIPER classification throughput is 1,920 Giga basepairs/minute, achieving an average speedup of 910× over Kraken2. The individual speedup figures are presented in Table II.
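The throughput arithmetic can be checked with a short calculation. The sketch below only restates the relation between operating frequency, k-mer size and throughput given above; the function names are hypothetical, and the baseline throughput is left as a parameter rather than taken from Table II.

```python
def throughput_gbp_per_minute(f_op_hz: float, k: int) -> float:
    """Classification throughput for one k-mer per cycle, in Giga basepairs/minute."""
    basepairs_per_second = f_op_hz * k
    return basepairs_per_second * 60 / 1e9


def speedup(throughput: float, baseline_throughput: float) -> float:
    """Ratio of two throughput figures expressed in the same unit."""
    return throughput / baseline_throughput


# 500 MHz x 64 bases = 32 Giga basepairs/second = 1,920 Giga basepairs/minute.
print(throughput_gbp_per_minute(500e6, 64))
```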

D. Area and Power Evaluation
The DIPER array was designed using a commercial 65 nm CMOS process, simulated and manually laid out using Cadence Spectre and Virtuoso. Memristors were simulated using the TEAM model [45]. The DIPER 12R6T cell area is 3 μm². The DIPER average search energy is 0.3 fJ per cell.
Table III compares DIPER with a ReRAM-based 90 nm 3T1R TCAM [44]. If such a TCAM is used for classification of genome material, it will achieve a classification efficiency similar to that of Kraken2. That is, DIPER achieves up to 1.7× higher $F_1$ score for high-quality DNA reads and up to 6.2× higher $F_1$ score for DNA reads with 15% error rate compared to a classifier based on such a resistive TCAM [44].
Such classification efficiency comes at the cost of 40% area and 45% search energy overhead, assuming that the resistive TCAM [44] scales to 65 nm. DIPER implements the same edit distance tolerance technique as EDAM [36], using a different architecture, device technology and design. Therefore, DIPER achieves the same classification efficiency as EDAM, while having approximately 10.1× smaller area and 5× lower search energy.

VI. CONCLUSION
In this paper, we present DIPER, a novel resistive edit distance-tolerant content addressable memory, designed for computational genomics applications, particularly for detection and identification of pathogens of pandemic importance. DIPER enables a parallel search with a user-configurable edit distance tolerance. Edits supported by DIPER are replacements, insertions and deletions, which are ubiquitous in sequenced DNA. The DIPER cell is based on 12 resistive devices (memristors) and six NMOS transistors. DIPER uses one-hot encoding to code DNA bases (4 memristors per base) and, in addition to a dataword (a DNA fragment), stores two copies of it (shifted left and shifted right by one base position). Edit distance tolerance is defined by the speed of matchline discharge during the approximate search. A special per-row transistor is used to set a user-configurable edit distance threshold.

Manuscript received 3 July 2023; accepted 10 September 2023. Date of publication 22 September 2023; date of current version 10 September 2024. This work was supported in part by the European Union's Horizon Europe programme for research and innovation under Grant 101047160, and in part by the Israeli Ministry of Science and Technology under Lise Meitner Grant 1001569396 for Israeli-Swedish research collaboration. The work of Esteban Garzón was supported by the Italian MUR under the call "Horizon Europe 2021-2027 programme - H25F21001420001." The work of Itay Merlin was supported by the Israeli Ministry of Science and Technology. Recommended for acceptance by A. R. Alameldeen. (Corresponding author: Esteban Garzón.)

Fig. 2. (a) Hamming distance; (b) in DIPER, in addition to the co-located data element (DNA base), its left and right neighbors are also evaluated.

Fig. 4(a) presents the DIPER cell design. The DIPER 12R6T cell comprises 12 memristors (12R) and six NMOS transistors (6T). A DIPER row is shown in Fig. 4(b). It comprises several cells (for example, 32), an ML sense amplifier and precharge circuitry. DIPER supports two operations: write and compare. Write is applied relatively rarely, to build or to update the reference genome database. Therefore, we consider write an offline operation and do not optimize its latency. DNA bases are encoded using one-hot encoding; for example, A is encoded as '0001', G as '0010', C as '0100' and T as '1000'. The datawords are

Fig. 7. Matchline (ML) voltage at the sampling time, for the worst case discharge (one mismatching base), under the effect of: (a) process variations (PV), and (b) local variations around TT, FF, and SS corners.

Fig. 8. (a) DIPER as a part of a pathogen detection platform. (b) The offline construction of the reference DNA database in DIPER (top) and the online SARS-CoV-2 detection operation (bottom).

Fig. 9. DIPER vs. HD-CAM and Kraken2: Sensitivity and precision for (a) simulated low error rate reads, (b) simulated high error rate reads and (c) raw PacBio reads; $F_1$ score and normalized $F_1$ score for (d) simulated low error rate reads, (e) simulated high error rate reads and (f) raw PacBio reads.
Note: In addition to $V_{eval}$, ED also depends on the sensing threshold of the sense amplifier and its sensing time.

2. DIPER also outperforms an earlier DNA classification solution, HD-CAM, a Hamming distance-tolerant CAM. The reason is that edits of indel type frequently result in very large Hamming distances, rendering DNA detection by HD-CAM less efficient.
3. The $F_1$ score has an optimum point, whose position depends on the error rate. For low ED threshold values, the growing DIPER sensitivity outpaces the decreasing precision, leading to $F_1$ score growth. However, as the ED threshold increases, the drop in precision surpasses the rise in sensitivity, causing the $F_1$ score to diminish. The best DIPER normalized $F_1$ score results are summarized in Table II.

TABLE I EXAMPLES OF TRUE POSITIVE, TRUE NEGATIVE, FALSE NEGATIVE AND FALSE POSITIVE RESULTS GENERATED BY DIPER ('REPL' IS REPLACEMENT, 'INS' IS INSERTION, 'DEL' IS DELETION). EDIT DISTANCE THRESHOLD IS 4

TABLE II DIPER VS. KRAKEN2 COMPARISON SUMMARY
Normalized $F_1$ score is the DIPER $F_1$ score to Kraken2 $F_1$ score ratio. Kraken2 is run on an i9-10900X CPU operating at 3.70 GHz with 32 GB of 2,133 MHz DDR4 main memory

TABLE III DIPER AREA, SEARCH ENERGY AND TIMING COMPARED TO A 3T1R RERAM BASED TCAM [44] AND CMOS EDAM [36]
To compare with Kraken2, we run the software on an Intel Core i9-10900X CPU operating at 3.70 GHz with 32 GB of 2,133 MHz DDR4 main memory. Kraken2 classification throughput figures for high error rate reads, low error rate reads and raw PacBio reads are 2.37 Giga basepairs