Demeter: A Fast and Energy-Efficient Food Profiler using Hyperdimensional Computing in Memory

Food profiling is an essential step in any food monitoring system needed to prevent health risks and potential fraud in the food industry. Significant improvements in sequencing technologies are pushing food profiling to become the main computational bottleneck. State-of-the-art profilers are unfortunately too costly for food profiling. Our goal is to design a food profiler that solves the main limitations of existing profilers, namely (1) working on massive data structures and (2) incurring considerable data movement for a real-time monitoring system. To this end, we propose Demeter, the first platform-independent framework for food profiling. Demeter overcomes the first limitation through the use of hyperdimensional computing (HDC) and efficiently performs the accurate few-species classification required in food profiling. We overcome the second limitation by using an in-memory hardware accelerator for Demeter (named Acc-Demeter) based on memristor devices. Acc-Demeter implements several domain-specific optimizations and exploits the inherent characteristics of memristors to improve its overall performance and energy consumption. We compare Demeter's accuracy with other industrial food profilers using detailed software modeling. We synthesize Acc-Demeter's required hardware using UMC's 65nm library by considering an accurate PCM model based on silicon-based prototypes. Our evaluations demonstrate that Acc-Demeter achieves (1) a throughput improvement of 192x and 724x and (2) a memory reduction of 36x and 33x compared to Kraken2 and MetaCache (two state-of-the-art profilers), respectively, on typical food-related databases. Demeter maintains an acceptable profiling accuracy (within 2% of existing tools) and incurs a very low area overhead.


Introduction
The urgent need for a real-time, efficient, and accurate food monitoring system is apparent when one considers the economic impact and health risks caused by human error and/or intentional fraud regarding everyday food. For example, fraud in the fish industry alone is estimated to cause a worldwide annual loss of $10 to $24 billion [1]. The Halal meat scandal [2] and the black fish scandal [3] are just a few other examples that could have been quickly prevented if we had accurately and efficiently monitored all food production in real time.
Food profiling is the first step and the only computationally expensive task in a food monitoring system. The food profiling task entails determining the species present in a food sample and their relative abundance [4,5]. Today's food profilers work with sequenced data because the sequences of a food sample yield a more accurate profile. The rapid drop in the cost of DNA sequencing in the past decades, together with the expectation that this trend will continue [6,7], is expected to make profiling the main bottleneck of this pipeline.
Currently, the industry utilizes state-of-the-art (SOTA) taxonomic profilers from metagenomic studies for food profiling due to the similarity of the problem statements in food profiling and metagenomic profiling. However, as alluded to, such profilers are developed as the first step of metagenomic studies [9][10][11]: a new yet different line of research that allows us to study many species taken directly from their environment together, as opposed to studying them individually. Unfortunately, these profilers are overkill for simply profiling a given food sample and are therefore costly, since they have been designed for different, more complex goals such as (1) capturing complex operations between organisms or (2) finding insights on species that cannot be clonally cultured in labs. Such profilers are also designed for working on larger, more complex, and randomly mixed genome sequences and demand a significant amount of resources, which impedes real-time monitoring of all food samples after production, shipment, or distribution, the ultimate goal of a food monitoring system. Therefore, a new solution must be sought for food profiling that is cheaper, faster, more energy-efficient, and yet accurate.
In particular, we pinpoint two critical sources of inefficiency in SOTA profilers currently used for food monitoring, collectively called food profilers or profilers hereafter. First, all current (food) profilers work with very large working data structures, e.g., huge hash tables or sorted lists, that require high-end servers with extensive storage and memory capabilities. This fundamentally prevents performance from scaling on par with sequencing technologies. Second, current profiling techniques incur a significant number of random accesses to large working datasets and, as a result, unnecessary data movement between storage and memory and between memory and compute units. The computation cannot be performed where the data resides due to (1) the size of the final data structures and (2) the operations required for the tasks at hand. This directly translates to massive energy consumption and latency. For example, as shown in our evaluations (Sections 4 and 7), a widely used SOTA profiler takes ∼1 minute to profile one high-coverage sequenced food sample. However, it requires a high-end machine or cluster with at least 300 GB of memory and proportionally scaled-up compute power. These costs add up to an unbearable amount of time and equipment for real-time monitoring of all existing and newly produced food samples. Therefore, a healthy food-industry economy cannot keep using these profilers and demands cheaper, faster, more energy-efficient, and accurate food profilers for the years to come.
Our goal in this work is to solve both limitations of previous profilers, namely (1) reliance on high-end servers and scaling problems due to the required massive data structures and (2) unnecessary data movement. To this end, we propose Demeter, an end-to-end, hardware/software co-designed food profiling framework that efficiently profiles the species of a food sample.
The key idea of Demeter is to reduce the food profiling problem to a multi-object (multi-species) classification problem using hyperdimensional (HD) computing (HDC), followed by an abundance estimation step. Demeter is a platform-independent framework and produces accurate results on any hardware platform, such as a central processing unit (CPU), graphics processing unit (GPU), or application-specific integrated circuit (ASIC).
Our experiments show that although the accuracy of Demeter is comparable with existing SOTA profilers, typical processing units (CPUs) do not exploit the full parallelism offered by our HDC-based approach, which prohibits those platforms from outperforming SOTA profilers. Moreover, we find two more optimization opportunities that can be achieved with a wisely chosen platform: (1) eliminating the cost of existing shift operations and (2) mitigating the significant amount of data movement involved in our HDC-based solution. Therefore, we propose an in-memory hardware accelerator for Demeter, Acc-Demeter, to mitigate the costs mentioned above and simultaneously solve the second problem of profilers as well. Acc-Demeter achieves these by exploiting (1) the physical attributes of nanoscale memristive devices, (2) Processing-In-Memory (PIM), i.e., processing where the data resides, and (3) a zero-overhead shift operation in hardware. It is worth noting that, with the advent of portable sequencing machines, a move from cloud computing with sophisticated infrastructure towards a built-in profiler (or other genomics-related kernels) inside the sequencer is finally in the foreseeable future.
Our paper makes the following main contributions:
• To our knowledge, Demeter is the first framework that enables food profiling via HDC. Demeter provides a five-step approach to determine the relative abundance of a set of food read sequences at the species level. We design Demeter to (1) address the key problems of food profiling rather than accelerating regular metagenomic profilers and (2) be platform-independent (Section 3).
• We propose a PIM-enabled hardware accelerator for Demeter using memristor devices (Acc-Demeter) to extract Demeter's full potential and solve the data movement problem in Demeter and previous profilers. We propose several optimization techniques for Acc-Demeter based on domain-specific knowledge of food profiling and our background in PCM cell characteristics and HDC operations. To our knowledge, Acc-Demeter is the first (in-memory) hardware accelerator for a food profiler (Section 5).
• We rigorously compare Demeter and Acc-Demeter to four SOTA food profilers. We show that Demeter provides an accuracy level comparable with previous food profilers and within the accepted level of food monitoring systems. The default setting of Acc-Demeter enables (1) a throughput improvement of ∼192× and 724× and (2) a reduction in the required memory of ∼36× and 33× compared to Kraken2 [12] and MetaCache [4], respectively, when querying a typical food-related reference genome database, i.e., AFS20 [4]. Our design requires only ∼8.9 mm² of die area and can process ∼9.45 Mbp per joule for our largest food-related database, AFS31 [4] (Section 7).

Background and Motivation
This section provides the necessary background on (1) the current taxonomic profilers and their shortcomings when used for food profiling, (2) HDC, and (3) the PIM paradigm. We focus mainly on material closely related to or used by Demeter. For more detailed background information, we refer the reader to comprehensive reviews on these topics [13][14][15][16][17][18][19].

Metagenomic Profilers
The constantly increasing performance of sequencing technologies and the fast drop in the cost of DNA sequencing [6,7] catalyzed metagenomic studies [9][10][11]. These studies enable us to capture the big picture of an environment without isolating or cultivating individual organisms. For this purpose, one needs to perform taxonomic profiling: determining the relative abundances of species in a sample taken directly from the environment. Due to the high cost associated with alignment and assembly for large reference datasets, heuristic, statistics-based profilers are to this date still preferred to assembly- or alignment-based ones. However, even these profilers are not yet cheap and prevent large-scale, real-time studies. Their cost is mainly related to the memory required for the profilers' data structures and algorithms. Such large data structures and sophisticated algorithms force us to use high-end servers; they are needed to fulfill the complex goals of subsequent metagenomic analysis, namely capturing complex operations between organisms and discovering insights on species that cannot be clonally cultured in labs. This high cost of profiling in metagenomic profilers prevents us from efficiently profiling food samples in real time, the end goal of a food monitoring system.

Problems of Food Profilers
We use VTune [20] to profile three SOTA profilers that are currently also used for food samples, namely Kraken2, CLARK, and MetaCache, using their default datasets and parameters on the original platforms for which they have been designed. We make two main observations, which follow a trend similar to that reported in previous genomics studies [12,21,22].
Observation 1. All these profilers have large memory requirements for their data structures. For example, Kraken2 requires a minimum of 300 GB of memory for its reference data structure. Even for smaller and less complex reference databases such as those in the food industry, Kraken2 still requires more than 50 GB of memory (Section 4.4).
Observation 2. All profilers incur high miss rates in the L2 and L3 caches (∼68 to 90%). The nature of their underlying algorithms causes this inefficiency because they always query a small fraction of keys in a large hash table and/or sorted list, leading to random memory access patterns. In other words, the arithmetic intensity of the profilers is so low that even increasing the number of threads does not help resolve the CPU stall cycles caused by the memory accesses that these misses require.
Overall, current food profilers' large working data structures and low arithmetic intensity lead to high storage cost, low performance, and high energy consumption. They also demand high-end servers. This motivates designs (such as Demeter and Acc-Demeter) that provide reduced working data structures, eliminate unnecessary data movement, and liberate us from dependency on clusters.

Hyperdimensional Computing
Hyperdimensional computing (HDC) [23,24] is a brain-inspired computing paradigm that has been demonstrated to be effective in reference-based learning domains, such as text classification [25][26][27], gesture recognition [28], and latent semantic analysis [29]. Elements of HDC are represented using high-dimensional vectors, hereafter called HD vectors. HD vectors can be composed of real [30][31][32], binary [23,33], bipolar [28,34], or complex numbers [35]. Previous works show that binary representations of HD vectors are more practical and efficient for classification problems or one-shot reinforcement learning. This representation is also more hardware-friendly. Therefore, we proceed with binary HD representations. HD vectors also come with other powerful features such as robustness to random errors, holistic representation, and randomness. We refer interested readers to previous works for more details on these features [17,23].
Like other reference-based classifiers, an HDC-based system takes two steps: (1) training and (2) classification. An encoding mechanism is used in both steps. One well-known example is the N-gram encoding mechanism, which follows a two-step approach for encoding a string of size L into an HD vector of size D.
Step 1: It combines N consecutive characters of the string and builds an HD vector that is orthogonal to them all and preserves their relative order. This operation is called binding and is represented in Equation 1, where $\rho^{i}(X)$ represents the $i$-th permutation of vector $X$ and the $B_i$ are randomly generated (once) representative HD vectors (also referred to as atomic or basis HD vectors) for the $i$-th character $C_i$ of the string. The string is a DNA sequence in Demeter.
Step 2: The encoder performs an element-wise addition over all HD vectors corresponding to consecutive N-grams, called bundling, to represent the entire input sequence. To binarize the final HD vector, the encoder applies a majority function over each position. This final vector is stored in associative memory (AM) and is called a prototype HD vector if the input was a reference genome. Otherwise, it is called a query HD vector and is used for classification.
The most common approach for classifying whether the query sequence belongs to any of the classes in AM after N-gram encoding is to measure the Hamming distance between the query HD vector (Q) and each of the prototype HD vectors (Ps) and decide based on a fixed distance or threshold (T). This can be easily performed with an XNOR of Q and each P followed by a pop-count and thresholding operation, as shown in Equation 2.
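The two encoding steps and the distance check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Demeter's implementation: binding is assumed to be an XOR of rotated atomic vectors (a common choice for binary HD vectors), the permutation is assumed to be a cyclic rotation, majority ties are broken toward 0, and all function names are hypothetical.

```python
import random

ALPHABET = "ACGT"  # DNA alphabet; one atomic HD vector per character


def make_item_memory(dim, seed=0):
    """Generate one random binary atomic HD vector per character (done once)."""
    rng = random.Random(seed)
    return {ch: [rng.getrandbits(1) for _ in range(dim)] for ch in ALPHABET}


def permute(vec, k):
    """Cyclic rotation by k positions (assumed form of the permutation rho^k)."""
    k %= len(vec)
    return vec[-k:] + vec[:-k] if k else list(vec)


def bind_ngram(ngram, item_memory):
    """Step 1 (binding): XOR the rotated atomic vectors of one N-gram.

    Rotating the i-th character by (N - 1 - i) positions makes the result
    depend on character order. The orientation is an assumption.
    """
    n = len(ngram)
    dim = len(next(iter(item_memory.values())))
    bound = [0] * dim
    for i, ch in enumerate(ngram):
        rotated = permute(item_memory[ch], n - 1 - i)
        bound = [a ^ b for a, b in zip(bound, rotated)]
    return bound


def encode(sequence, item_memory, n=4):
    """Step 2 (bundling): add all N-gram vectors, then binarize by majority."""
    dim = len(next(iter(item_memory.values())))
    num_ngrams = len(sequence) - n + 1
    if num_ngrams < 1:
        raise ValueError("sequence is shorter than the N-gram size")
    counts = [0] * dim
    for start in range(num_ngrams):
        gram = bind_ngram(sequence[start:start + n], item_memory)
        counts = [c + g for c, g in zip(counts, gram)]
    # Majority function per position; ties go to 0 (an arbitrary choice).
    return [1 if 2 * c > num_ngrams else 0 for c in counts]


def hamming_distance(a, b):
    """Number of positions where two binary HD vectors differ."""
    return sum(x != y for x, y in zip(a, b))
```

The same `encode` function would produce a prototype HD vector when given a reference genome and a query HD vector when given a read; only the treatment of the output differs.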

Computing Inside/Near Memory
For decades, processing units have improved at a faster rate than memory units, causing memory to become a bottleneck, especially in data-intensive workloads.
Processing-In-Memory (PIM), also referred to as Computation-In-Memory (CIM), is a promising paradigm that aims to alleviate the data movement bottleneck. In essence, PIM advocates avoiding unnecessary data movement and redesigning systems such that they are no longer processor-centric. Previous works show the potential of various memory technologies for implementing PIM-based architectures [36][37][38][39]. Resistive memories or memristive devices, such as ReRAM and PCM [37,40,41], have recently been introduced as suitable candidates for both storage and computation units that can efficiently perform vector-matrix multiplication [42] and bulk bit-wise logical operations [43,44], since they inherently follow Kirchhoff's law. They also offer non-volatility, high density, and near-zero standby power. Due to its inherent high parallelism, the simplicity of its required operations, and its intrinsic robustness and error tolerance, an HDC-based system fits well with the PIM paradigm. A few recent works propose application-specific hardware accelerators based on memristive devices [41,45,46,47] for such HDC-based systems. We describe the differences between our implementation and the closest previous proposals in Sections 5 and 9.

Demeter
Demeter is a configurable framework for food profiling and is based on three main insights: (1) current food profilers, while accurate, are neither memory- nor energy-efficient, (2) the primary sources of high cost and inefficiency in current food profilers are their large reference data structures and working sets, and (3) one can profile food samples quickly and accurately using HDC. Fig. 1 provides an overview of the five key steps in Demeter: (1) defining an HD space, (2) building an HD reference database (HD-RefDB), (3) converting sample reads into HD space, (4) determining the possible species assignment per sample read, and (5) performing abundance estimation. We describe each step in more detail next.

Step 1: Define the HD Space
As the first step (1 in Fig. 1), Demeter defines an HD space for all subsequent operations and steps. This is a crucial step as it determines the operations in the remaining steps. Unfortunately, many previous HDC-based proposals do not let the user determine the HD space and instead define their space statically. Hence, such designs are more limited.
Demeter defines the HD space in four stages.
Stage 1: Demeter fixes two hyperparameters: (1) the dimension of the HD space, i.e., the dimensionality of the HD vectors (element representations), and (2) the sparsity of each element (HD vector).
Stage 2: Demeter generates a few atomic HD vectors and stores them in memory (commonly called Item Memory (IM)). These vectors can be (1) the HD vectors that represent our genome alphabet or (2) the one-time randomly generated HD vectors that some encoding mechanisms use, for example, to introduce the concept of order between the characters of one input.
Stage 3: Demeter decides on the encoding mechanism to build the space with. This same encoding mechanism is used throughout Steps 2 and 3 of Demeter.
Stage 4: Demeter fixes the similarity metric and any associated parameters (such as thresholds) based on the user's input or a common choice considering the previous stages.
Demeter stores a default value for each stage in a configuration file. Once the user invokes Demeter, Demeter quickly checks whether a configuration file matches the user's requested HD space. Demeter only runs this step if no such file exists or the user asks for a change.
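One way to picture the four stages and the configuration-file check is the sketch below. It is purely illustrative: the field names, the JSON file format, and the values of `sparsity`, `seed`, `ngram_size`, and `threshold` are assumptions made for this example and are not taken from Demeter. Only the dimension (40,000), the binary N-gram encoding, and the Hamming-based similarity correspond to settings mentioned in the text.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class HDSpaceConfig:
    """Hypothetical container for the choices made in Stages 1 to 4."""
    dimension: int = 40_000      # Stage 1: HD vector dimensionality
    sparsity: float = 0.5        # Stage 1: fraction of ones (assumed for dense vectors)
    alphabet: str = "ACGT"       # Stage 2: characters that get an atomic HD vector
    seed: int = 0                # Stage 2: seed for the one-time random vectors (assumed)
    encoding: str = "ngram"      # Stage 3: encoding mechanism
    ngram_size: int = 16         # Stage 3: N (assumed value)
    similarity: str = "hamming"  # Stage 4: similarity metric
    threshold: float = 0.55      # Stage 4: similarity threshold (assumed value)


def save_config(cfg, path):
    """Write the HD space description to a JSON configuration file."""
    Path(path).write_text(json.dumps(asdict(cfg), indent=2))


def load_config(path):
    """Read an HD space description back from a JSON configuration file."""
    return HDSpaceConfig(**json.loads(Path(path).read_text()))


def needs_step1(requested, path):
    """Return True if Step 1 must run: no file yet, or the request differs."""
    if not Path(path).exists():
        return True
    return load_config(path) != requested
```

With such a check, a repeated run with an unchanged request skips Step 1, while changing any field (for example the dimension) triggers it again.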

Step 2: Build Demeter's Reference Data Structure
Demeter takes two sets of inputs in Step 2: (1) the HD space parameters defined in Step 1 and (2) a reference genome database. Subsequently, Demeter builds a new reference database in its HD space out of all the considered reference genomes. This new database, called HD-RefDB, consists of one (or a few) prototype HD vector(s) for each reference genome in the original reference database and is stored in AM. HD-RefDB can be as varied as the number of combinations of possible hyperparameters, atomic vectors, and encoding mechanisms in Step 1.
This step aims to reduce the size of the working set for the classification task while avoiding an accuracy drop. Since this step requires only simple arithmetic and is highly parallelizable, it can also be accelerated on our proposed PIM-enabled accelerator (Section 5).

Step 3: Demeter's Read Conversion
Demeter again takes two inputs in Step 3: (1) the HD space configuration and (2) the read sequences of the food sample under study. Demeter translates each of these read sequences into one query HD vector. To prevent any extra storage cost and to pipeline the computations of Step 3 and Step 4, Demeter forwards each query HD vector to the next step instead of storing them in a memory unit while waiting for all of them to be constructed first (see footnote 4). Query HD vectors created in this step can require more or less space than a read, depending on the initial length of the read sequences and the dimension of the HD space. Therefore, although Steps 2 and 3 share the encoding mechanism, their inputs and how Demeter treats the outcome are quite different.
Step 3 introduces neither a new operation nor a new procedure beyond those already used in Step 2. Therefore, it enjoys benefits similar to Step 2, namely high parallelization and in-memory suitability. Demeter runs Step 3 every time it profiles a new read of a food sample.

Step 4: Multi-Species Classification per Read
In this step, Demeter takes (1) the query HD vector (Step 3), (2) HD-RefDB (Step 2), and (3) the similarity function and its corresponding parameters (Step 1) as inputs. To determine the species that each read belongs to, Demeter performs a similarity check between the query HD vector and each of the prototype HD vectors in HD-RefDB. The similarity measure can vary depending on the vector representations and encoding approaches. Demeter allows various well-known mechanisms for the similarity check, such as Hamming distance [17] and dot product [16]. Usually, this step can be implemented with only simple operations (Section 2.3). It also enjoys high parallelization, similar to the previous steps. A similarity metric and its related parameters are closely tied to (1) the encoding mechanism and (2) the hyperparameters of the HD space, such as the representation, sparsity, and dimension of the HD vectors, so it makes sense not to let them change arbitrarily. Nevertheless, Demeter supports changing them in Step 4 as well, without needing to re-run Steps 2 or 3. This is because some studies show that, for a fixed set of hyperparameters and encoding mechanisms, different similarity metrics and thresholds may outperform others depending on the application and data.
Therefore, if one decides to change their reference database, they may need to tune these settings to find the right match, and Demeter allows such investigations. Currently, Demeter provides a default option.
Demeter may find that the query HD vector is close to one, multiple, or none of the prototype HD vectors in HD-RefDB. This variety in possible outputs differentiates Demeter from many previous HDC-based designs [17,41,48,49,50]. In such works, mostly due to the characteristics of the applications under study, researchers always assume that (1) the query HD vector can only belong to one of the prototype HD vectors, and (2) the class of the query HD vector exists in the AM. However, neither of these assumptions holds for a food profiler. One read from the food sample can be related to one, multiple, or none of the reference genomes in the original reference genome database. This is because the read sequences are mostly short strings with a reasonably high probability of existence in longer reference genome sequences. It is also not uncommon that the query HD vector does not belong to any of the reference genomes in the initial reference genome database.
This case can happen when, for example, (1) there is an unknown species in the food sample, (2) one incorrectly excludes the corresponding reference genome from the initial reference genome database, or (3) an uncorrected sequencing error has occurred. A food profiler should capture such cases.
Footnote 4: This is the default behavior in Demeter. However, we also provide the option for the user to keep and store these query HD vectors in case one needs to analyze them further. If one uses this option, the stored query HD vectors create another database representing the reads in our food sample, called HD-ReadDB hereafter.
This difference in how many prototype HD vectors in HD-RefDB can be assigned to one query HD vector is a key difference that affects both the following abundance estimation step and the final results. It also further distinguishes this work from previous HDC-based proposals for different applications.
Step 4 also enjoys high parallelization and in-memory suitability, similar to the previous steps.
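The multi-species similarity check of Step 4 can be sketched as follows, assuming the XNOR, pop-count, and threshold formulation from Section 2.3. HD vectors are stored here as Python integers used as packed bit vectors; the function names, the tiny 8-bit dimension, and the species labels in the usage note are hypothetical and chosen only for readability. The key point is that the result is a list that may contain zero, one, or several species.

```python
def similarity(query, prototype, dim):
    """Count matching bit positions: XNOR followed by a pop-count."""
    mask = (1 << dim) - 1
    xnor = ~(query ^ prototype) & mask
    return bin(xnor).count("1")


def classify(query, hd_refdb, dim, threshold):
    """Return every species whose prototype is similar enough to the query.

    hd_refdb maps a species name to its prototype HD vector (an int).
    The returned list may be empty (unmapped read), hold one species
    (uniquely mapped read), or hold several (multi-mapped read).
    """
    return [
        species
        for species, prototype in hd_refdb.items()
        if similarity(query, prototype, dim) >= threshold
    ]
```

For example, with `hd_refdb = {"a": 0b11110000, "b": 0b11110001, "c": 0b00001111}` and `dim = 8`, the query `0b11110000` matches both `"a"` and `"b"` at a threshold of 7, only `"a"` at a threshold of 8, and the query `0b10101010` matches nothing at a threshold of 7.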

Step 5: Species Level Abundance Estimation
In Step 5, Demeter performs a relative abundance estimation based on the results of Step 4. This step is particularly needed for a food profiler, in which one query HD vector can be similar to one or more classes/species. Demeter categorizes each query HD vector as (1) uniquely mapped, (2) multi-mapped, or (3) unmapped, and then takes a two-step approach. In the first step, Demeter assigns the uniquely mapped query HD vectors to the species they are similar to. In the second step, Demeter distributes each multi-mapped query HD vector among its candidate species proportionally to the number of reads uniquely mapped to each species in the first step divided by the length of that species' reference genome. Demeter's Step 5 can be extended to support different assignment policies for the multi-mapped reads. We leave investigating the effect of such methods for future work.
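The two-step assignment described above can be sketched as follows. This is one possible reading of the text, not Demeter's code: the function name is hypothetical, the even split used when no candidate has uniquely mapped reads is an assumption, and so is the final normalization over mapped reads.

```python
def estimate_abundance(assignments, genome_length):
    """Estimate relative species abundance from per-read species lists.

    assignments: one list of candidate species per read (output of Step 4).
    genome_length: maps each species to its reference genome length.
    Returns (relative abundance per species, number of unmapped reads).
    """
    unique = {species: 0 for species in genome_length}
    multi_mapped = []
    unmapped = 0

    # First step: categorize reads and count the uniquely mapped ones.
    for hits in assignments:
        if not hits:
            unmapped += 1
        elif len(hits) == 1:
            unique[hits[0]] += 1
        else:
            multi_mapped.append(hits)

    # Weight of a species: uniquely mapped reads divided by genome length.
    weight = {s: unique[s] / genome_length[s] for s in genome_length}
    counts = {s: float(unique[s]) for s in genome_length}

    # Second step: split each multi-mapped read among its candidates.
    for hits in multi_mapped:
        total = sum(weight[s] for s in hits)
        for s in hits:
            # Assumption: split evenly if no candidate has unique reads.
            counts[s] += weight[s] / total if total > 0 else 1 / len(hits)

    mapped = sum(counts.values())
    abundance = {s: (c / mapped if mapped else 0.0) for s, c in counts.items()}
    return abundance, unmapped
```

As a small worked case, take two species with equal genome lengths, three reads uniquely mapped to the first, one uniquely mapped to the second, one read mapped to both, and one unmapped read. The shared read is split 0.75 to 0.25, giving relative abundances of 0.75 and 0.25 with one unmapped read.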

Methodology
We implement a multi-threaded, highly parallelized version of Demeter in C++ using the SeqAn library [51], called C-Demeter. SeqAn is an open-source optimized library for biological data. C-Demeter verifies the accuracy of Demeter. We also implement a GPU version of Demeter, G-Demeter. G-Demeter uses CUDA streams to overlap data copy operations between shared memory and global memory with other computations as much as possible. It implements the similarity check in shared memory using the parallel reduction technique introduced by Harris et al. [52]. All of our experiments run on a 128-core server with AMD EPYC 7742 CPUs [53] and 500 GB of DDR4 DRAM. G-Demeter runs on an NVIDIA RTX 2080Ti GPU. Our sensitivity analysis shows that binary HD vectors of size 40,000, with dense distributed representation (DDR [17]) and an N-gram-based encoding mechanism, strike a sweet spot in the tradeoff between accuracy, required memory, and performance. Therefore, unless otherwise stated, our evaluations use this setup.
Accuracy Metrics. We capture the four fundamental rates of a (food) profiler when considering the presence and absence of each species in the output, i.e., the True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN) rates. Based on these rates, Demeter reports the two standard metrics of Precision and Recall [12,54,55] to assess the accuracy of our (food) profilers.
Performance Metrics. Performance analysis consists of three experiments: (1) build time, (2) query time, and (3) query throughput or speed. This separation has two main reasons. (1) Building is normally a one-time job and does not affect the profiler's overall performance. Therefore, it is only fair to separate build time and query time. (2) Query time is simply the time required for profiling one single read. Throughput, however, is measured in million reads per minute (MR/m) and should be reported separately, as it can easily be affected by other factors such as the size of the data structure, the classifier's parallelization capability, or the infrastructure's computation and storage/memory limitations (e.g., duplicating capabilities).
Datasets. We have two sets of datasets: (1) genome sequences used as a reference database and (2) genome sequences used as food samples and input queries. We consider AFS20 and AFS31 [4,5] as our reference genome datasets. These are two commonly used datasets consisting of 20 and 31 food-related reference genomes of animals, whose sizes vary from 12 MB to 14 GB. AFS31 is currently also the biggest reference dataset used in food profiling. Food sample reads or queries are from calibrator sausage samples from ENA project IDs PRJEB34001 [56] and PRJNA271645 [57]. These reads are real short-read sequences from a mixture of food ingredients such as chicken, turkey, etc., sequenced on an Illumina HiSeq machine.
Baselines. We compare Demeter against MetaCache [4] (the most accurate food profiler), Kraken2 [12], Kraken2+Bracken [58], and CLARK [59], the top 3 alignment-free and fastest metagenomic profilers that are also commonly used for food profiling.
[56,57], respectively. Note that the relative abundance at higher taxonomy levels is not of importance in food profiling. Additionally, those calculations highly depend on the propagation method from the species level to those levels. Therefore, they have been excluded from this study. We observe that Demeter stands very close to the most accurate profiler, MetaCache, and has only 1.4% and 2.6% less precision and recall, respectively, for KLyo samples.
Moreover, Demeter achieves similar results on AFS31 and Kal samples. Note that accuracy is very much data-dependent, and this accuracy drop is acceptable for a food profiler. The results of the latest comparison between current (metagenomic) profilers [60] show a standard error of the mean ranging from 0 to 5% in precision and recall among various widely used profilers on different datasets.
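For reference, the Precision and Recall metrics used in this comparison follow the standard definitions, Precision = TP / (TP + FP) and Recall = TP / (TP + FN), evaluated on the presence or absence of each species in a profiler's output. A minimal sketch, with a hypothetical function name and illustrative species labels:

```python
def precision_recall(true_species, reported_species):
    """Species-level precision and recall from presence/absence sets."""
    truth = set(true_species)
    reported = set(reported_species)
    tp = len(truth & reported)   # species present and reported
    fp = len(reported - truth)   # species reported but not present
    fn = len(truth - reported)   # species present but not reported
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For instance, if a sample truly contains three species and a profiler reports two of them plus one species that is absent, both precision and recall equal 2/3.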

Demeter's Accuracy Analysis
We conclude that Demeter is accurate and achieves high precision and recall for food samples. These results show that Demeter's HDC-based classification approach followed by our abundance estimation technique does not hurt the accuracy of the profiler compared to baselines.
Fig. 4-a and Fig. 5-a present the time that each profiler takes to query one (short) read from the query food sample and classify its species over AFS20 and AFS31, respectively. We observe that both C-Demeter and G-Demeter, although accurate, require a higher query time than Kraken2. The time breakdown, obtained using Intel VTune [20] and cudaEvents, reveals that both implementations are memory bound, meaning there is a significant percentage of under-utilized slots due to data access issues.

Demeter's Software Performance Analysis
We believe that there are two main reasons behind this problem. The first is the shift operation performed per processed character in the encoding mechanism of Demeter. Both implementations store the large HD vectors in multiple registers. Every shift operation translates to multiple copy operations among those registers, which can become costly in terms of time and energy consumption.
This is why the query time is higher than expected. Second, not all prototype HD vectors fit in the caches. Therefore, the software versions take a few cycles to read prototype HD vectors in batches, compare them to the query HD vector, save the results, and continue with the next batch. Note that these also put a limit on the expected throughput.
Fig. 4-b and Fig. 5-b present the throughput of different profilers over AFS20 and AFS31. We make three observations. First, C-Demeter achieves a lower throughput than Kraken2. The reasons are similar to those discussed for its longer query time. Second, G-Demeter improves the throughput by up to 24% (depending on the reference dataset) and therefore can be used for food profiling in the industry in the near future. Third, simply increasing the number of working threads by moving from C-Demeter to G-Demeter does not improve the throughput considerably. We recommend using commodity GPUs for food profiling to cut costs in the short term. In the long term, we propose extending Demeter to ASIC designs (such as those we present next) that solve the new sources of inefficiency we discussed above.
However, our analysis also shows that even a massively parallel implementation of Demeter, G-Demeter, does not fully utilize the parallelism offered by the vector operations of Demeter's HDC classification, while also suffering from expensive copying among registers and an inability to perform the classification efficiently on a large vector in software.
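The register-copy cost of the shift operation discussed in this section can be made concrete with a small sketch. It assumes, for illustration only, that an HD vector is stored as a list of 64-bit words and that the shift is a one-bit cyclic rotation; the actual register width and layout in C-Demeter and G-Demeter are not specified here. Under this assumption, a 40,000-bit vector spans 625 words, and a single shift has to read and rewrite every one of them.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def words_needed(dim):
    """Number of 64-bit words required to hold a dim-bit HD vector."""
    return -(-dim // WORD_BITS)  # ceiling division


def rotate_left_by_one(words):
    """Rotate a multi-word bit vector left by one bit.

    words[0] holds the least significant 64 bits. Each output word needs
    its own shifted bits plus the top bit carried over from the previous
    word, so every word is touched on every shift.
    """
    n = len(words)
    out = [0] * n
    for i in range(n):
        carry_in = words[i - 1] >> (WORD_BITS - 1)  # wraps around for i == 0
        out[i] = ((words[i] << 1) & WORD_MASK) | carry_in
    return out
```

In hardware, the same rotation can be realized by wiring the output of each storage element to its neighbor, which is the zero-overhead shift that motivates the accelerator.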

Demeter's Memory Analysis
To show a key source of improvement in Demeter (and an enabler for Acc-Demeter), we compare the memory requirement of Demeter with that of the other food profilers. Fig. 6 presents the memory required by each profiler on AFS20 and AFS31. We make the following two observations. First, Demeter requires ∼33x and 36x less memory than Kraken2 and MetaCache, respectively, for the AFS20 database, and ∼27x and 30x less memory for the AFS31 database. This makes Demeter the most efficient food profiler from a memory usage perspective. Second, the reduction in memory requirement is large enough that, for the first time, the data structure of the food profiler can fit into a standard-size memory and does not require a colossal RAM to manage further queries. This reduction is the primary enabler behind Acc-Demeter. We conclude that Demeter is very memory efficient.

Demeter's PIM-enabled Accelerator
Demeter is positioned as a platform-independent food profiling framework that uses HDC. Demeter works with large HD vectors, is robust against errors, enjoys high parallelism, and exploits simple operations. These characteristics make Demeter a suitable candidate for hardware acceleration. However, accelerating Demeter on a highly parallelizable and energy-efficient platform, and specifically on a PIM-enabled design, is more than a matter of suitability: it is a requirement, for two main reasons.
Motivation 1: As discussed in Section 4.3, a software version of Demeter incurs a considerable cost on copy operations among the registers holding intermediate HD vectors and on classification. It also performs the classification poorly because HD-RefDB is larger than the caches and the cache hit rate is low. These costs diminish the benefits that Demeter gains from its small data structures and memory requirement. A hardware implementation of Demeter can avoid these costs because it can (1) realize the shift operation for free by only redirecting the output of each register to the next one and (2) perform the classification efficiently.
Motivation 2: A software-based implementation of Demeter still incurs a lot of unnecessary data movement for Steps 2, 3, and 4. A hardware accelerator, especially a PIM-enabled one, can mitigate this problem greatly.
Therefore, we propose a PIM-enabled hardware accelerator for Demeter using PCM cells. One can accelerate Demeter using a PIM-enabled design on different memory technologies. We choose a memristor-enabled design for three main reasons. First, it is well known that memristor-based memory technologies can efficiently perform vector-matrix multiplication (VMM) [61][62][63][64] using Kirchhoff's law, making them suitable for our design. In this work, we propose a hybrid row-major/column-major data mapping and an intelligent data duplication scheme to perform encoding, classification, and profiling efficiently on PCM devices using this operation. Technologies other than memristors do not offer the same features for our hybrid data mapping.
Second, traditional (non-memristor-based) technologies are generally general-purpose and cost-driven. Moreover, their design does not allow even simple circuit modifications without a high penalty in area and cost. As a result, such modifications face a lot of pushback from the industry and are unlikely to be adopted. In contrast, memristors offer high density and scalability, and previous works demonstrate a wide range of accelerators built on them.
Third, researchers have already shown the potential of accelerators based on emerging technologies for other ML-based algorithms [62,65]. Also, multiple memory technologies already exist in current sequencing machines. Therefore, it is not unreasonable to imagine one of these emerging memory technologies also being installed in these machines, especially for performing ML-based algorithms such as the base-calling algorithms that sequencers require [66].
In this work, we focus on PCM devices, a member of the memristor family, because we have access to accurate device measurements and models for them. We leave exploring other technologies for future research.
Fig. 7 shows an overview of the proposed PIM-enabled hardware accelerator for Demeter, Acc-Demeter. Acc-Demeter consists of five key elements: (1) Item Memory (IM), (2) Encoder, (3) Associate Memory (AM), (4) Distance calculator, and (5) Controller. IM and AM are memory units, and we implement them as PCM arrays with their control circuitry. The encoder and distance calculator, in contrast, are computing units implemented in the periphery. The controller is a simple FSM designed to coordinate the required steps of Demeter. The CPU initiates Demeter by gathering the user's input (Step 1) and then booting the controller; i.e., it sends the start command, initializes the registers, and sets the addresses of the food samples and/or reference genomes in the controller. In a nutshell, Acc-Demeter accelerates Steps 2, 3, and 4 of Demeter. The controller returns the results of Step 4 to the CPU for final processing and the relative abundance estimation (Step 5). We discuss these units in more detail next.

Item Memory (IM) Design
We implement our IM using PCM arrays and corresponding circuits, such as decoders. IM stores the atomic HD vectors. Binary "0" and binary "1" in an HD vector translate to the amorphous and crystalline states, respectively. In the beginning, the user (or Demeter) generates 4 atomic HD vectors, one for each letter of the DNA alphabet, in Step 1 of Demeter and stores them in the IM. Acc-Demeter reads these atomic HD vectors from IM every time it meets a new symbol. Once Demeter fixes the HD space, IM becomes a read-only memory. This allows us to prevent unwanted changes to the atomic vectors. Fig. 8-(A) presents the IM design. The gate enabler provides access to the cells that the row decoder has activated. This makes the design of an entire array much easier and also mitigates the write/read disturbance effect to a great extent. However, this design also blocks writes on a row basis and only allows column-wise programming of IM. This does not complicate IM in any way because the atomic vectors are generated once in the beginning by the host CPU and then stored in the IM for a long time. Note that random number generators are already well optimized in CPUs. In addition, random number generation inside memristors is still in its early stages [67][68][69], and Acc-Demeter can be modified later to benefit from a non-intrusive (compatible) random number generator.
Note that data mapping is a critical contribution of Acc-Demeter. Acc-Demeter uses a hybrid data mapping: row-major for the IM unit and column-major for the AM unit. IM uses a row-major data mapping for two reasons. First, a row-major data mapping of HD vectors allows Acc-Demeter to read the cells written in one row in one cycle. This is helpful as IM is used in the encoding procedure, which is the bottleneck. Second, the PCM model we use provides more columns than rows. Therefore, even if there were a method to read the cells of a column all at once but separately, one could only store smaller chunks of an HD vector in that column.
An important design choice regarding IM is related to the limited size of PCM arrays (512×2048 [41]). This limitation on array size (which also exists in mature memory technologies such as DRAM) prevents us from fitting an entire large HD vector in one row or column. Therefore, one needs to break such an HD vector into smaller chunks and store them in separate rows. Three options exist: (1) putting the chunks in the same array, (2) putting them in different arrays, and (3) a hybrid approach. As shown in Section 7, the encoder is the bottleneck of our operation. Therefore, to avoid exacerbating the overhead of the encoding procedure, IM breaks an HD vector into chunks whose size is the largest power of two that is smaller than the number of columns available in an array (2048 in our case) and stores different chunks in different arrays. This is a direct tradeoff between the used area (#arrays) and performance. Fig. 8-(B) also shows this placement.
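As a rough illustration of the chunk placement described above, the following Python sketch splits one HD vector into row-sized chunks and assigns each chunk to a different array. The function names and the returned layout are hypothetical, and the sketch reads "smaller than" as "no larger than" (so a 2048-column array yields 2048-bit chunks); under the strict reading the chunk size would be 1024. None of this is taken from Acc-Demeter's actual implementation.

```python
from typing import List, Tuple


def largest_power_of_two_at_most(n: int) -> int:
    """Largest power of two that does not exceed n (assumed reading of the text)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n.bit_length() - 1)


def split_into_chunks(hd_vector: List[int], columns_per_array: int = 2048) -> List[List[int]]:
    """Break an HD vector into consecutive chunks that each fit in one array row."""
    chunk_size = largest_power_of_two_at_most(columns_per_array)
    return [hd_vector[i:i + chunk_size] for i in range(0, len(hd_vector), chunk_size)]


def place_in_item_memory(
    hd_vector: List[int], row: int, columns_per_array: int = 2048
) -> List[Tuple[int, int, List[int]]]:
    """Map chunk k to (array k, same row, chunk bits).

    Hypothetical layout: every chunk lives in a different array, so all
    chunks of one atomic vector can be read in the same cycle.
    """
    chunks = split_into_chunks(hd_vector, columns_per_array)
    return [(array_idx, row, chunk) for array_idx, chunk in enumerate(chunks)]
```

Under these assumptions, an 8192-bit vector stored at row 0 occupies row 0 of four different arrays, which makes the area-for-performance tradeoff concrete: more arrays are used, but the whole vector is available in one read.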

Encoder Design
The encoder is the main compute unit of Acc-Demeter. The encoder is implemented in the periphery of the arrays and executes the binding and bundling operations via a sequence of commands determined by the controller. Demeter is capable of handling different representations (Section 3). However, to reduce the complexity and make the design hardware friendly, the current design of Acc-Demeter only supports binary representations. In this setup, N-gram encoding is the most common mechanism, and Acc-Demeter supports it. We suspect other choices are possible with the same hardware or with minimal changes. We leave the exploration of those designs for future work.
Based on Equation 1, building an N-gram requires only simple XOR and shift operations. The bitwise XOR operation can be computed quickly, after reading the atomic HD vector from the IM, with an XOR gate in the periphery. Note that one can also implement XOR using bitwise AND (∧) and OR (∨). However, this technique requires breaking the XOR operations into minterms whose number increases exponentially. Any attempt to reduce them, even if it works empirically as in [41], will only produce approximate results and hurt the accuracy. Although some applications can tolerate such an accuracy loss, food profiling cannot. Note that the 2-minterm-based encoding in [41] also affects the sparsity of N-grams (acknowledged in the paper) and limits the size of the N-gram. This is not the case in Acc-Demeter because all the operations use exact XOR gates. This way, Acc-Demeter can benefit from larger N-grams and does not hurt the density of the HD vectors. As discussed in Section 4.3, the shift operation can quickly become a bottleneck for large HD vectors and strings in a software-based implementation. This does not happen here since Acc-Demeter realizes the shift for free by simply redirecting each flip-flop (FF)'s stored value to the neighboring one every clock cycle. Fig. 9 depicts a schematic illustration of the encoder unit.
From the hardware perspective, the encoder separates the binding and bundling components completely. For the binding, the sense amplifier (SA) reads out the value from IM to one input of an XOR gate and uses the previously stored value of the neighboring FF as the second input. The encoder then stores the result in a buffer and repeats the procedure. This design choice provides Acc-Demeter with cascaded logical operations with minimal changes to the memory array, and it avoids any write-back that would put pressure on the endurance of the PCM substrate. The encoder performs this sequence N times (enforced by the signals from the controller) to build an N-gram.
After it finishes creating one N-gram, the encoder passes the N-gram to the bundler unit, resets the buffer, and starts building the next N-gram until it reaches either the last character of the input or the set limit per final HD vector. The bundler takes N-grams and adds them to a global HD vector that represents each position with a counter instead of a single bit. It repeats this operation for M N-grams. Finally, the bundler applies a threshold (T) and produces a final binary HD vector representing all the processed characters. At this point, the encoder is done. It passes the result on to be stored as a prototype HD vector or used as the query HD vector in AM and resets both the integer-based and binary HD vectors.
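The binding and bundling flow described above can be sketched in software as follows. This is a minimal illustration, not the authors' implementation: Equation 1 is not reproduced in this section, so the sketch assumes the common rotate-and-XOR form of N-gram encoding, a cyclic shift by one position, and a majority threshold T = M/2 when none is given. All function names and the tiny vector dimension used in the usage note are hypothetical.

```python
import random
from typing import Dict, List, Optional

HDVector = List[int]  # one bit (0 or 1) per position


def make_item_memory(dim: int, seed: int = 0) -> Dict[str, HDVector]:
    """Generate one random atomic HD vector per DNA letter (done once by the host)."""
    rng = random.Random(seed)
    return {base: [rng.randint(0, 1) for _ in range(dim)] for base in "ACGT"}


def rotate(v: HDVector) -> HDVector:
    """Cyclic shift by one position, mimicking each FF feeding its neighbor."""
    return v[-1:] + v[:-1]


def encode_ngram(symbols: str, item_memory: Dict[str, HDVector]) -> HDVector:
    """Binding: shift the buffer, then XOR it with the atomic vector of each symbol."""
    dim = len(item_memory["A"])
    buffer = [0] * dim
    for s in symbols:
        shifted = rotate(buffer)
        buffer = [a ^ b for a, b in zip(shifted, item_memory[s])]
    return buffer


def encode_sequence(
    sequence: str,
    n: int,
    item_memory: Dict[str, HDVector],
    threshold: Optional[float] = None,
) -> HDVector:
    """Bundling: add M N-grams into per-position counters, then binarize with T."""
    dim = len(item_memory["A"])
    num_ngrams = len(sequence) - n + 1
    if num_ngrams < 1:
        raise ValueError("sequence is shorter than n")
    counters = [0] * dim
    for i in range(num_ngrams):
        ngram = encode_ngram(sequence[i:i + n], item_memory)
        counters = [c + b for c, b in zip(counters, ngram)]
    if threshold is None:
        threshold = num_ngrams / 2  # assumed majority vote; the paper's T may differ
    return [1 if c > threshold else 0 for c in counters]
```

For instance, `encode_sequence("ACGTACGTAC", 3, make_item_memory(64))` returns a 64-bit binary vector. The software version rebuilds a full list for every shift, which is exactly the copy cost that the FF-based shift in the hardware encoder avoids.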

Associate Memory (AM) Design
The AM unit is implemented using PCM arrays and their corresponding circuitry, similar to the IM unit. This unit takes the output of the encoding mechanism (an HD vector) as input. Although the AM and the IM can technically be combined, Acc-Demeter uses separate hardware for three reasons. First, these units serve subsequent and completely different steps in a profiling pipeline, namely the encoding and classification steps. Such a distinct separation enables building a pipeline for them. Second, the row-major data mapping in IM and the column-major data mapping in AM introduce different parallelism opportunities for the encoding and classification steps of a profiling pipeline, respectively. Row-major data mapping of IM parallelizes the encoding of all bits in a single HD vector in each clock cycle. Column-major mapping, on the other hand, parallelizes the similarity check of one query HD vector with all prototype HD vectors stored vertically in that clock cycle. Third, separate hardware helps us simplify the IM design by using sense amplifiers instead of ADCs. Doing so brings various benefits in terms of area, energy consumption, and read-out time. Note that ADCs are usually the bottleneck of a memristor-based memory in terms of energy, area, and time [70], which is why one only uses them when VMM or other logical operations such as Scouting Logic [43] are necessary.
Equation 2 shows that for the classification, we need to count the differences between the query HD vector and each prototype HD vector and then decide whether or not the query can belong to the corresponding class. Although one can realize this in hardware by performing an XNOR operation between the two vectors followed by a pop-count operation, all in the periphery, such a design comes with two drawbacks: (1) it requires the pop-count operation after the XNOR, which introduces an enormous area cost and a significant delay (log2(D) + 1 cycles [71]), and (2) the AM unit, similar to the IM, only allows writing columns, not rows. Since prototype HD vectors are not known from the beginning (unlike atomic HD vectors), this limitation forces us to save them all in another extra unit first and then write them back on a row basis. This is again inefficient.
However, Acc-Demeter proposes a new column-major data mapping and intelligent data duplication for this unit and exploits the characteristics of the PCM substrate to solve all these problems for HDC-based classification. It is well known that memristor-based memory technologies can perform vector-matrix multiplication [61][62][63][64]. Therefore, Acc-Demeter implements the required XNOR and subsequent pop-count operations in Equation 2 in four steps, three of which happen in the AM unit and the last one in the Similarity Check unit.
Step 1: Acc-Demeter stores one prototype HD vector (or a chunk of one HD vector) in one column and its complement in the same column number of a second array. Fig. 10-A shows their placement in the AM unit.
Step 2: Acc-Demeter applies the query HD vector (Q) to the rows of the first array, which holds the prototype HD vectors (Ps), and the complement of the query HD vector (Q̄) to the rows of the second array, which holds the complement prototype HD vectors (P̄s), as shown in Fig. 10-B.
Step 3: Acc-Demeter enables the columns consecutively and effectively reads out the number of ones in Q·P and Q̄·P̄ in the ADCs of each array. This way, it performs two vector-matrix multiplications using Kirchhoff's law, one between Q and all Ps in the first array and one between Q̄ and all P̄s in the second array. Section 5.5 describes Step 4, which realizes the XNOR and pop-count operations simultaneously. Fig. 10-B presents a high-level illustration of the AM design.
Similar to the case in IM, the limited array size of PCM substrates also prevents Acc-Demeter from storing a full HD vector in one row or column of AM. To reduce the required area, and since the encoding is the bottleneck and not the classification (Section 7), Acc-Demeter stores the chunks of an HD vector in the same AM array, unlike in IM. Fig. 10-A uses color coding to depict how Acc-Demeter breaks prototype HD vectors into multiple chunks and stores them in columns of AM within and across tiles. It is worth noting that Acc-Demeter writes to the PCM cells only once in both the IM and AM units, unless either the configuration file in Step 1 or the default reference genome database in Step 2 changes, for example at the user's request. This avoids many writes to the devices, which still have limited endurance compared to traditional memory technologies.

Similarity Check Hardware
The similarity check unit is a small computing unit that takes the outputs of the two ADCs for the matching columns of the two crossbars and adds them together (Step 4). Fig. 10-C depicts all the logic for this unit. The output of this unit is the combined result of the XNOR and pop-count operations. At this stage, the similarity check unit sends the results out to the host CPU, which determines whether the similarity is close to the threshold and should be considered in the abundance estimation (Steps 4-b and 5). The reason for sending the results out instead of using a winner-take-all (WTA) circuit as in previous works [41,72] is twofold. First, a WTA circuit assumes that the query matches one and only one prototype HD vector. However, as discussed in Sections 3.4 and 3.5, this is not always the case when profiling genomics data. Second, the relative abundance estimation techniques (Step 5 in Fig. 1), although simple, require more complex and area-hungry logic circuits, which Acc-Demeter aims to avoid whenever possible. Therefore, since the results will be analyzed outside the PCM substrate anyway and transferring such small data can be easily handled by the interconnect between the host and Acc-Demeter, Acc-Demeter relies on the host CPU to perform the final steps of Demeter. Note that the host is aware of the prototype HD vectors' mappings.
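The reason two dot products plus one addition suffice can be checked in a few lines. For binary vectors, Q·P counts the positions where both bits are 1 and Q̄·P̄ counts the positions where both bits are 0, so their sum equals the pop-count of XNOR(Q, P). The sketch below is a software model of this identity only; the function names are hypothetical, and ADC quantization and device variation are deliberately ignored.

```python
from typing import List

HDVector = List[int]  # one bit (0 or 1) per position


def dot(a: HDVector, b: HDVector) -> int:
    """Binary dot product: number of positions where both vectors hold a 1."""
    return sum(x & y for x, y in zip(a, b))


def complement(v: HDVector) -> HDVector:
    """Bitwise complement of a binary vector."""
    return [1 - x for x in v]


def similarity_two_arrays(query: HDVector, prototypes: List[HDVector]) -> List[int]:
    """Model of the AM steps plus the similarity check adder (idealized, noise-free)."""
    q_bar = complement(query)
    scores = []
    for p in prototypes:
        adc_first = dot(query, p)                # first array: Q . P
        adc_second = dot(q_bar, complement(p))   # second array: Q-bar . P-bar
        scores.append(adc_first + adc_second)    # adder in the similarity check unit
    return scores


def similarity_xnor_popcount(query: HDVector, prototypes: List[HDVector]) -> List[int]:
    """Reference: count matching positions directly (XNOR followed by pop-count)."""
    return [sum(1 for q, p in zip(query, proto) if q == p) for proto in prototypes]
```

With Q = [1, 0, 1, 1] and prototypes [1, 1, 1, 0] and [0, 0, 1, 1], both functions return [2, 3]. The full score list is returned rather than only the best match, mirroring the decision to send all results to the host instead of using a WTA circuit.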

Controller Unit
The controller orchestrates all the operations of Acc-Demeter by generating control signals for the other components. It takes the start signal and the address of the food samples (or reference genomes) in the memory as its inputs. The controller sends the results of the similarity check unit back to the host for the final steps. The controller is designed as a simple FSM and operates based on the parameters set in Step 1.

System Integration of Acc-Demeter
This section discusses Acc-Demeter's system integration stack, which enables it to operate with the host processing system.

Address Translation
Acc-Demeter works with physical addresses instead of virtual ones and is thus relieved of the address translation challenges that previous works face and deal with [73,74]. For an Acc-Demeter instruction, the host CPU uses the same translation lookaside buffer (TLB) lookup mechanism as for normal load/store operations to translate the instruction's virtual memory addresses into physical addresses.

Coherence
Acc-Demeter may require modified and/or generated atomic vectors (for the IM units) or loaded prototype vectors (for the AM units). Similar to previous works [75][76][77], ensuring that the data for Acc-Demeter is up to date is the programmer's responsibility and can be achieved easily by flushing cache lines. Acc-Demeter is also capable of leveraging previous PIM coherence optimizations [78,79] for further performance improvement.

Interrupts
We assume that the pages required by Acc-Demeter's AM and IM units are already present. When this is not the case, we rely on the conventional page fault handling mechanisms to place this data into the correct arrays. Therefore, Acc-Demeter does not face page faults during the execution of food profiling since the pages used by Acc-Demeter are already loaded and pinned into the AM and IM units. Acc-Demeter may, however, face an interrupt during a context switch. In such cases, the context of the control unit in Acc-Demeter is saved and then restored when the profiler resumes.

ISA Extensions and Programming Interface
An expressive and efficient programming interface is a must for Acc-Demeter as it directly impacts the usability of the Demeter framework. To enable easy communication between Acc-Demeter and the programmer, we envision extending the ISA with a few instructions to let the control unit know the required operations, their timing, and where data objects reside in the IM and AM units. ISA extension is possible due to the unused opcode space in the host CPU and has also been adopted in previous PIM-related architectures [36,73].
Acc-Demeter requires 2 types of instructions: (1) bbop_init address, size, n, for the initialization of the IM and AM units, and (2) bbop_op size, n, for performing different operations in Acc-Demeter. bbop_init is the initialization instruction that informs the OS that the memory object is for Acc-Demeter. This way, the OS performs the virtual-to-physical memory mapping required for the AM and IM units. bbop_init takes the base physical address, the size of the vector, and the intended value. For Acc-Demeter's operations, we extend the CPU ISA with bbop_op. Acc-Demeter utilizes an array-based computation model, i.e., src and dst are the source and destination arrays. bbop_op is the opcode, where size and n are the number of elements in the array and the number of bits in each array element, respectively. This paper assumes that the programmer will manually write suitable code for Acc-Demeter operations. We summarize the required CPU ISA extensions for these operations in Table 1.
We emulate the execution of Acc-Demeter using a cycle-accurate RTL model and synthesize it using the UMC 65 nm technology node in Synopsys Design Compiler [80]. We verify the correct behavior of our memory model using test benches and previous in-memory simulators [41,62]. We consider a typical operating condition of 25 °C and 1.2 V when evaluating our energy consumption. All the experiments for the PCM-based Acc-Demeter are carried out based on PCM statistical models that capture the variations in the spatiotemporal conductivity of the devices. The PCM prototypes and analytical models used for validation and further simulations are based on the results of the EU project MNEMOSENE [81], led by TU Delft and concluded in 2020.
Fig. 11 shows the time that each profiler takes to build its initial data structure for the two reference databases, AFS20 and AFS31. We make two observations. First, Acc-Demeter has the lowest build time among all previous food profilers. Acc-Demeter builds the HD-RefDB corresponding to AFS20 and AFS31 ∼3.2x and 2.8x faster, respectively, than MetaCache, the next fastest profiler. Unlike previous HDC-based methods, which are faster than their ML competitors due to the one-shot learning ability of the HDC paradigm, Acc-Demeter outperforms SOTA profilers due to its highly parallel execution of simple operations on Acc-Demeter's hardware. SOTA food profilers parse the reference genomes only once, so the one-shot learning of Demeter is not particularly advantageous.
Second, CLARK exceeds the 500 GB memory of the system when running on AFS31. This is in line with observations in [4]. Therefore, we exclude it from all analyses regarding AFS31 from now on. This case is an excellent example of where metagenomic profilers, although good for lengthy and costly studies, may not be applicable to food profiling and the subsequent food analysis and monitoring.

7.2.2.
Query time. Fig. 12 presents the time that each profiler takes to query one (short) read from the query food sample and classify its species over AFS20 and AFS31. We make two key observations. First, Acc-Demeter improves the query time by ∼74x/88x and 272x/350x compared to Kraken2 and MetaCache, respectively, on AFS20/AFS31. This shows that the acceleration of Demeter pays off and makes Demeter not only an accurate but also a fast food profiler.
Second, the query time for Acc-Demeter remains almost the same for both databases. We investigate this further and observe a bottleneck shift: Step 5 (abundance estimation), which is performed on the CPU, is now the bottleneck of Acc-Demeter. This happens because of the high frequency that Acc-Demeter achieves. In contrast, other profilers spend most of their time querying their massive data structures.

7.2.3.
Query throughput. Fig. 13 shows the throughput of different profilers over AFS20 and AFS31. We make two observations. First, Acc-Demeter provides a throughput improvement of ∼192x and 232x for AFS20 and AFS31, respectively, compared to Kraken2, the second-best food profiler in terms of throughput. This improvement in throughput is larger than the improvement in query time because Acc-Demeter can classify one query read in parallel with the encoding of the following query. Note that the throughput analysis of the previous profilers does not consider the time for loading their data structures. Second, similar to the query time, the throughput is almost the same regardless of the database due to the bottleneck shift. We conclude that Acc-Demeter significantly outperforms all four SOTA baselines for all performance metrics.
We make two observations. First, the logic for the encoder unit is the most energy- and area-hungry unit, accounting for more than 90% of the energy and 78% of the area of the whole Acc-Demeter. This is expected because (1) the encoder consists of many CMOS circuits, whereas AM and IM are small memory units with PCM technology, and (2) the encoder is at the heart of all operations in Demeter, and most of the execution time is spent in this unit. We argue that this amount of logic around our array is still justifiable. Second, compared to the die area of an Intel Xeon E5-2697 CPU [82], Acc-Demeter has an area overhead of less than 2%. We conclude that Acc-Demeter is low-cost in terms of die area.

Acc-Demeter's Power and Area Analysis
Our evaluations show that Acc-Demeter can query 9.45 Mbp per joule. Unfortunately, measuring the energy consumption of other profilers and making an apples-to-apples comparison with Acc-Demeter is hard. However, Merelli et al. [83,84] show that running Kraken2, even when querying a smaller data structure built from a reduced reference genome dataset (minikraken [83,85]), consumes more energy (it achieves a maximum of 0.6 Mbp/J). This considerable difference has three reasons: (1) Kraken2 queries a more complex data structure than Acc-Demeter and requires more complex operations, (2) Kraken2 queries a bigger data structure for its query, and (3) Kraken2 incurs significant data movement between the memory and the processing unit. All of these limitations exist in similar forms in CLARK and MetaCache. We conclude that Acc-Demeter is more energy-efficient than all four SOTA baselines.
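To put the two reported figures side by side, the short sketch below converts them into energy per megabase and a ratio. It uses only the two numbers quoted above; the constant and function names are hypothetical. Because the Kraken2 figure is a reported maximum on a different (reduced) database and setup, the resulting ratio of about 15.75x should be read as a rough indication rather than a like-for-like measurement.

```python
ACC_DEMETER_MBP_PER_JOULE = 9.45   # reported for Acc-Demeter in this section
KRAKEN2_MAX_MBP_PER_JOULE = 0.6    # reported maximum for Kraken2 on minikraken


def joules_per_mbp(mbp_per_joule: float) -> float:
    """Energy needed to query one megabase pair, in joules."""
    return 1.0 / mbp_per_joule


def efficiency_ratio(a_mbp_per_joule: float, b_mbp_per_joule: float) -> float:
    """How many times more bases per joule system A processes than system B."""
    return a_mbp_per_joule / b_mbp_per_joule


ratio = efficiency_ratio(ACC_DEMETER_MBP_PER_JOULE, KRAKEN2_MAX_MBP_PER_JOULE)
print(f"Acc-Demeter: {joules_per_mbp(ACC_DEMETER_MBP_PER_JOULE):.3f} J/Mbp")
print(f"Kraken2 (best case): {joules_per_mbp(KRAKEN2_MAX_MBP_PER_JOULE):.3f} J/Mbp")
print(f"Ratio: {ratio:.2f}x")
```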

Discussions and Future Works
Capacity. We define the capacity of Demeter as the ratio between the number of reference genomes encoded as prototype HD vectors and the size of the HD space for a competitive profiling accuracy target. The more prototype HD vectors there are, the larger the capacity needed, which results in a bigger HD space and lower efficiency. Therefore, if one uses Demeter, as is, as a metagenomics profiler, one cannot expect similar improvements over SOTA metagenomics profilers (e.g., Kraken2) on those datasets. We are currently investigating additional techniques to enable Demeter for those cases as well. However, we leave further analysis of the changes Demeter requires to support metagenomics profiling or other profiling studies with many reference genomes for future work.
Supported functions and representations. As discussed (Sections 2, 3, and 5), Acc-Demeter currently supports only binary representations and the N-gram encoding mechanism. This design choice is made for simplicity and is based on the acceptable accuracy results of the software version. We leave the hardware for other encoding mechanisms and data representations for future work.

Related Works
To our knowledge, Demeter is the first paper to propose a framework to perform food profiling using HDC. Acc-Demeter is also the first hardware accelerator that enables low-cost and accurate in-memory profiling for a typical reference database in food profilers. We have already compared Demeter and Acc-Demeter extensively to SOTA profilers in Sections 4 and 7. This section briefly discusses previous software works for profilers (food or metagenomics), software or hardware of HDC-based systems, and PIM-enabled accelerators.

Metagenomic Profilers
Several recent works propose approaches and techniques to directly or indirectly accelerate or improve the accuracy of metagenomics profiling, the first step of such studies. These works take three approaches: (1) reducing the reference database's size by pre-alignment filtering [86,87] or heuristics for taxonomic classification techniques [55,[88][89][90][91], (2) accelerating read alignment or assembly (only for alignment-/assembly-based profilers) on CPUs, FPGAs, or GPUs [92][93][94][95][96][97][98], and (3) post-alignment/-assembly/-classification presence and abundance estimation heuristics [54,55,99]. Demeter belongs to the first group and takes an HDC-based approach for the first time. However, compared to the first group, Acc-Demeter is much faster and has a lower cost (regarding both energy and area consumption). Note that Demeter and Acc-Demeter are orthogonal to works in the third group, and their Step 5 can adopt the techniques those works propose for the abundance estimation after the initial classification.

HDC-based Systems
Many works exploit the HDC paradigm for specific machine learning applications that require capturing temporal patterns. These works vary from language [100] and voice [101] detection to seizure detection [102]. Demeter is the first work that investigates HDC in the realm of profiling genomics data. Although HDNA [48] and GenieHD [103] propose to use HDC for (partial) sequence alignment of a single reference genome divided into multiple pieces, they never exploit it for any metagenomics or food profiling.
A few works also suggest various hardware platforms such as FPGAs, GPUs, or ASICs [41,48] to improve the performance of HDC-based designs. Acc-Demeter is different from all of these designs in two important aspects. First, Acc-Demeter performs an exact pop-count operation in one cycle by performing two VMMs in parallel and then adding the outputs of the ADCs. This is in contrast to previous works [41,48,102,103] that perform an exact pop-count operation in log2(D) + 1 cycles, where D is the size of an HD vector. Second, unlike [41], Acc-Demeter can perform the required HDC-based operations on long, non-sparse N-grams (discussed in Section 5.3). To our knowledge, Acc-Demeter is the only PIM-enabled accelerator for food profiling.

PIM-enabled Accelerators
Prior works also heavily investigate various forms of compute-capable memories [36,104,105]. Among these, only a few use in-memory capability for HDC designs [41,106]. However, these works are either tuned for single tasks, limited to small N-grams (e.g., only up to 3-grams [106]), or only based on compact models from small prototypes with 256×256 ReRAM arrays. Demeter is the first work that proposes food profiling inside the memory. Theoretically, one can accelerate Demeter using a PIM-enabled design on DRAM, SRAM, and other technologies. However, for the reasons given in Section 5, Acc-Demeter exploits PCM to improve the performance of Demeter.

Conclusion
This paper introduces Demeter, the first framework that enables profiling of food samples via HDC while strictly meeting the accuracy of state-of-the-art profilers. Demeter uses a five-step approach to enable species-level profiling using HDC. This paper also introduces the first PCM-based PIM-enabled hardware accelerator, called Acc-Demeter. We evaluate Demeter in software and Acc-Demeter using a cycle-accurate model based on a small-scale PCM-based prototype. We design Demeter and Acc-Demeter to (1) address the key challenge of HDC systems when facing a massive input, (2) eliminate the need for a powerful machine with very large memories, and (3) prevent unnecessary data movement between memory and processing units and therefore avoid wasting time and energy. We achieve significant performance and energy benefits over the SOTA CPU implementations while achieving the same accuracy. We hope that future work builds on top of our framework and its hardware and extends it to further improve our food profiling systems.