Accelerating Time Series Analysis via Processing using Non-Volatile Memories

Time Series Analysis (TSA) is a critical workload to extract valuable information from collections of sequential data, e.g., detecting anomalies in electrocardiograms. Subsequence Dynamic Time Warping (sDTW) is the state-of-the-art algorithm for high-accuracy TSA. We find that the performance and energy efficiency of sDTW on conventional CPU and GPU platforms are heavily burdened by the latency and energy overheads of data movement between the compute and the memory units. sDTW exhibits low arithmetic intensity and low data reuse on conventional platforms, which leads to poor amortization of the data movement overheads. To improve the performance and energy efficiency of the sDTW algorithm, we propose MATSA, the first Magnetoresistive RAM (MRAM)-based Accelerator for TSA. MATSA leverages Processing-Using-Memory (PUM) based on MRAM crossbars to minimize data movement overheads and exploit parallelism in sDTW. MATSA improves performance by 7.35×/6.15×/6.31× and energy efficiency by 11.29×/4.21×/2.65× over server-class CPU, GPU, and Processing-Near-Memory platforms, respectively.


I. INTRODUCTION
In the era of Internet-of-Things and Big Data, emerging applications operate on petabyte-scale datasets that are increasingly difficult to store and analyze. Small sensors and edge devices continuously generate data sampled over time, resulting in time-ordered observations (e.g., temperature or voltage). Such a collection of data values is referred to as a time series (TS) [1]. TS is a common data representation in many real-world scientific applications, including sensing, genomics, neuroscience, financial markets, epidemiology, and environmental sciences [2].
Time series analysis (TSA) splits the time series into subsequences of consecutive data points to extract valuable information from large datasets. This information can help filter relevant subsequences to minimize the cost of applying complex and expensive domain-specific analysis algorithms. A real-life example is the detection of anomalies in an electrocardiogram and the elimination of subsequences that indicate normal behavior [3]. TSA determines subsequences of interest using different similarity approaches, such as the Euclidean Distance (ED) or the subsequence Dynamic Time Warping (sDTW). Prior work demonstrates that sDTW provides a higher precision than ED in most scenarios [4]; as such, we focus on optimizing the sDTW algorithm for TSA.
sDTW is an embarrassingly parallel workload, because multiple concurrent processing units can execute each query without data dependencies on other queries. However, sDTW builds a 2D dynamic programming matrix that incurs quadratic runtime and memory complexity. To understand the bottlenecks of sDTW in state-of-the-art conventional CPU and GPU architectures, we comprehensively characterize the kernel's performance on these platforms (§II-D). We observe significant performance and energy efficiency overheads in sDTW due to: 1) underutilization of the execution units, and 2) a large number of expensive main memory accesses. The first problem stems from the low number of operations that the sDTW kernel executes per byte brought from memory, which keeps the arithmetic units idle for most of the execution time. The second problem stems from the large memory footprint of the dynamic programming matrix, which causes poor spatial and temporal locality. Consequently, sDTW exhibits poor performance on CPU and GPU platforms.
* Christina Giannoula and Aditya Manglik have equal contribution.
To overcome the memory access challenge, prior works [5]-[7] have considered memory-centric platforms that integrate processing and storage elements on the same chip to reduce data movement across the constrained data bus that connects a CPU to main memory [8], [9]. Based on that, we implement and characterize sDTW on a real Processing-Near-Memory (PNM) platform, UPMEM [10], and observe that this new platform does not provide performance benefits compared to CPU and GPU executions, due to the large latency of simple operations such as additions and comparisons. Overall, we conclude that the sDTW kernel exhibits memory-bound behavior on CPU and GPU platforms and compute-bound behavior on the PNM platform (§II-D).
In contrast to PNM, Processing-Using-Memory (PUM) [7], [11]-[15] executes operations using the memory cells and sense amplifiers, completely eliminating the memory and compute dichotomy. PUM enables 1) performing computation in the memory array, since the memory units that store the data also execute the computation, and 2) exploiting a much larger amount of parallelism available in the memory microarchitectures (as high as the number of crossbar columns available [16], i.e., thousands) compared to conventional CPU and GPU systems. From the technology perspective, non-volatile memories (NVM) offer a promising substrate to implement PUM [17]. However, different NVM substrates exhibit varying latency, energy, and endurance characteristics, a key design constraint for different accelerators [18]. Magnetoresistive RAM (MRAM)-based PUM substrates offer low read/write latencies, low energy per operation, and high endurance [19], [20]. Considering these characteristics, we explore MRAM as a potential NVM substrate to accelerate the sDTW kernel.
To this end, our goal in this work is to leverage MRAM-based PUM to enable high-performance and energy-efficient sDTW execution for a wide range of applications. We propose MATSA, the first MRAM-based Accelerator for TSA. MATSA derives its performance benefits from three key mechanisms.
First, MATSA decomposes sDTW's computational kernel into simple bitwise Boolean computations and executes them in the MRAM crossbar. This key idea significantly reduces data movement overheads because the computation is performed where the data resides. Second, we implement a novel data mapping that reduces the runtime memory footprint of sDTW from quadratic to linear based on four vectors. This key idea enables computing the complete 2D dynamic programming matrix on-the-fly without storing it. Third, MATSA integrates an effective computation scheme that overcomes the inter-cell computation dependencies of the matrix by 1) following an anti-diagonal approach and 2) exploiting pipelining to increase parallelism.
We evaluate MATSA's performance based on state-of-the-art latency and energy characteristics of MRAM devices [21], [22]. To do so, we implement an in-house simulator for MATSA and select 64 synthetic datasets to understand its design tradeoffs. Then, we use six real-world datasets (Human, Song, Penguin, Seismology, Power, and ECG) to compare three different versions of MATSA against other state-of-the-art platforms, showcasing its applicability to a wide range of real case scenarios. Our evaluation shows that MATSA improves performance by 7.35×/6.15×/6.31× and energy efficiency by 11.29×/4.21×/2.65× over server-class CPU, GPU, and PNM platforms, respectively.
In summary, we make the following novel contributions:
• We thoroughly characterize the state-of-the-art sDTW time series analysis (TSA) algorithm's performance and energy efficiency on conventional CPU, GPU, and PNM (UPMEM) platforms. Our characterization leads to new observations about the characteristics of sDTW that limit its acceleration in current conventional hardware.
• We propose MATSA, the first MRAM-based Accelerator for TSA. MATSA 1) exploits a novel data mapping tailored for MRAM substrates that reduces the memory footprint of sDTW, 2) efficiently performs computation in-memory to avoid off-chip data movement, and 3) provides an effective computation scheme to increase parallelism.
• We conduct a comprehensive evaluation of MATSA across a diverse set of synthetic and real-world datasets. Our results showcase a 6.60× average improvement in overall performance and a 6.05× average improvement in energy efficiency over state-of-the-art compute-centric and memory-centric platforms.

II. BACKGROUND & MOTIVATION

A. Time Series Analysis
A time series $T$ is a sequence of $n$ data points $t_i$, where $1 \le i \le n$, collected over time. A subsequence of $T$, also known as a window, is denoted by $T_{i,m}$, where $i$ is the index of the first data point and $m$ is the number of samples in the subsequence, with $1 \le i$ and $m \le n - i$.
There are two main approaches to perform time series analysis: 1) self-join and 2) query filtering. In self-join, all subsequences of a given time series are compared against the remaining subsequences of the same time series. In contrast, query filtering compares a set of queries against a reference. Time series analysis algorithms usually define a distance metric to measure the similarity between two subsequences. Based on this distance metric, the literature classifies subsequences with low distance as motifs [23] (similarities) and subsequences with high distance as discords [24] (anomalies). The state-of-the-art set of tools to perform time series analysis is Matrix Profile [25] (MP). Due to lower computation requirements, prior MP algorithms utilize one-to-one Euclidean Distance as the similarity metric. Recent proposals [4] have started to utilize Dynamic Time Warping (DTW)-based solutions because of their higher precision [26]. DTW enables the detection of events of interest in out-of-sync subsequences, e.g., in subsequences that have different sampling rates.
Figure 1 shows the key difference between the one-to-one and the DTW approaches, in which we compare two similar-shape subsequences that differ in their offset and scale.

Fig. 1: Example of similarity calculation between two subsequences (blue and green). The one-to-one approach in a) provides a low similarity as it only compares the i-th point of blue with the i-th point of green. In contrast, DTW in b) successfully matches the points of the subsequences.
We observe that the DTW algorithm offers better results as it compares a given point with respect to several potential candidates (i.e., determines the best alignment). In contrast, one-to-one executes point-to-point alignment that cannot determine the best alignment in the presence of an offset. One-to-one can be considered as a special case of DTW where the warping window is set to '1'. Therefore, we aim to optimize DTW, a more generic and high-precision algorithm, to provide a TSA accelerator for a wide range of applications.

B. Time Series Analysis Applications
Time series analysis constitutes one of the most important and general data mining primitives for a wide range of real-world applications [27]: epidemiology, genomics, neuroscience, medicine, environmental sciences, economics, and many more. Table I presents a few examples of applications of TSA.
In statistics, econometrics, meteorology, and geophysics, the primary goal of time series analysis is prediction and forecasting. In signal processing, control engineering, and communication engineering, it is used for signal detection and estimation. In data mining, pattern recognition, and machine learning, time series motif and discord discovery are used for clustering, classification, anomaly detection, and forecasting. Finally, the most important application of time series motif and discord discovery is clustering seismic data and discovering earthquake pattern clusters from continuous seismic recordings. Consequently, seismic clustering can be applied to earthquake relocation and volcano monitoring to help improve earthquake and volcanic hazard assessments.

TABLE I: Time Series Analysis main applications
Gesture Recognition [66], [67]
Trajectories [68]
Traffic Monitoring [69]

Within this field, the subsequence Dynamic Time Warping (sDTW) algorithm is a fundamental kernel due to its superior accuracy and generality when compared to other TSA methods [4]. Examples of real-life use cases that can benefit from high-performance and energy-efficient sDTW are:
• Circulatory Failure Detection in Intensive Care Units. TSA consumes 90% of the end-to-end execution time [70]. Figure 2 describes the aforementioned process based on an example processing flow.
• Electrocardiography (ECG). TSA is deployed to monitor and filter ECG readings when monitoring patients [59].
• Earthquake Detection. TSA is critical to process seismograph data and detect anomalies for further analysis [42].

Fig. 2: Example TSA application, where TSA acts as a filter to avoid most of the computation. TSA selects the relevant queries (anomalies) and discards the irrelevant ones.

C. Dynamic Time Warping (DTW)
The DTW algorithm was first introduced in [71]. The first step of DTW is to compute the distance between a particular point of one subsequence and a set of points of another subsequence, keeping only the minimum of them. This process is repeated for all the points of the first subsequence. Then, DTW adds up all the distances, providing a similarity measure between the subsequences (the lower the distance, the higher the similarity).
Assume that we have two subsequences, $Q$ (query) and $R$ (reference), of length $n$ and $m$, respectively, where:

$$Q = q_1, q_2, \ldots, q_i, \ldots, q_n$$
$$R = r_1, r_2, \ldots, r_j, \ldots, r_m$$

DTW constructs an $n$-by-$m$ scoring matrix ($S$) to determine the similarity between the two subsequences. Each $(i, j)$ cell of the matrix ($s_{i,j}$) is filled in two steps. First, the algorithm calculates the distance $d(q_i, r_j)$ between the two corresponding points of the subsequences. There are several approaches to calculate this distance; $d(q_i, r_j) = |q_i - r_j|$ and $d(q_i, r_j) = (q_i - r_j)^2$ are the most common ones [71]. Second, the distance value is added to the minimum of the three neighboring cells as follows:

$$s_{i,j} = d(q_i, r_j) + \min(s_{i-1,j},\ s_{i,j-1},\ s_{i-1,j-1})$$

The algorithm fills the entire matrix using dynamic programming. Then, the goal is to find the best alignment (i.e., minimum accumulated cost), known as the warping path ($W$). $W$ is a contiguous set of matrix cells that defines the best mapping between $Q$ and $R$.

Subsequence Dynamic Time Warping (sDTW). sDTW is a more general DTW algorithm that allows the query to be aligned with part of the reference. Algorithm 1 presents the pseudocode of sDTW.
Algorithm 1 Subsequence DTW (sDTW)
6: for i ← 1 to N do
7: for j ← 1 to M do

First, sDTW initializes the matrix S with zeros. Second, it calculates the distance value of the top-left corner and then the remaining elements of the first row, taking into account the previous values. Third, it fills the remaining elements of the matrix using dynamic programming, row by row. Finally, it returns the minimum element of the last row of the S matrix, which indicates the similarity between the query and the best alignment with (part of) the reference. The nested for loops (lines 6 and 7 in Algorithm 1) are responsible for the quadratic runtime and memory complexities.
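The four steps described above can be sketched in Python as follows. This is a minimal illustration, not the paper's Algorithm 1: the function name `sdtw` is hypothetical, the distance is assumed to be the absolute difference, and the first row is assumed to hold only the local distance (a common sDTW boundary condition that lets the alignment start anywhere in the reference).

```python
def sdtw(query, reference):
    """Return the sDTW distance between `query` and the best-matching part of `reference`."""
    n, m = len(query), len(reference)
    # Full n-by-m scoring matrix S, initialized with zeros (quadratic memory).
    S = [[0] * m for _ in range(n)]
    for i in range(n):          # rows: query elements
        for j in range(m):      # columns: reference elements
            d = abs(query[i] - reference[j])
            if i == 0:
                # Assumption: free start, so the first row holds only the local distance.
                S[i][j] = d
            elif j == 0:
                # First column: only the cell above is available.
                S[i][j] = d + S[i - 1][j]
            else:
                # Distance plus the minimum of the three neighboring cells.
                S[i][j] = d + min(S[i - 1][j], S[i][j - 1], S[i - 1][j - 1])
    # The minimum of the last row is the best alignment with (part of) the reference.
    return min(S[n - 1])
```

For example, `sdtw([1, 2, 3], [5, 5, 1, 2, 3, 5])` evaluates to 0 because the query appears verbatim inside the reference. The two nested loops make both the runtime and the storage of `S` grow with the product of the two lengths.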

D. Bottlenecks of sDTW in Conventional and PNM Platforms
sDTW's quadratic computational complexity is challenging to overcome, especially when accurate results are required and algorithmic optimizations are insufficient [72]. To determine the bottlenecks in conventional platforms, we perform a detailed characterization of parallelized and optimized sDTW kernels on CPU, GPU, FPGA, and PNM platforms.
CPU. We profile the performance of sDTW on an Intel Xeon Phi 7210 CPU using the Intel Advisor tool [73]. We build the roofline plot and present the result in Figure 3-left. First, we observe that sDTW-CPU can utilize only 41% of the system's integer peak performance, i.e., 59 GINTOPS out of 145 GINTOPS, and exhibits low arithmetic intensity (0.55 INTOP/Byte). Second, the total memory traffic generated during runtime is 267 GB, whereas the memory footprint of the sDTW kernel is only 570 MB, i.e., the same data is fetched from memory many times. This demonstrates that sDTW is a memory-bound kernel for CPU targets.
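The CPU figures above can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers reported in the text; the function name `cpu_roofline_summary` is hypothetical, and the "implied bandwidth" is a derived estimate (achieved throughput divided by arithmetic intensity), not a value reported by the profiler.

```python
def cpu_roofline_summary(peak_gintops=145.0, achieved_gintops=59.0,
                         intensity_intop_per_byte=0.55,
                         traffic_gb=267.0, footprint_gb=0.570):
    """Derive utilization, implied bandwidth, and traffic amplification."""
    return {
        # Fraction of the integer peak actually used by sDTW-CPU.
        "utilization": achieved_gintops / peak_gintops,
        # GINTOPS / (INTOP/Byte) = GB/s consumed at the measured operating point.
        "implied_bandwidth_gbs": achieved_gintops / intensity_intop_per_byte,
        # How many times larger the memory traffic is than the footprint.
        "traffic_over_footprint": traffic_gb / footprint_gb,
    }

summary = cpu_roofline_summary()
```

With these inputs the utilization rounds to 41%, and the traffic is several hundred times the footprint, which is consistent with the low data reuse the characterization reports.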
GPU. Several prior works propose accelerating sDTW using GPUs (e.g., [74]). However, these implementations are tailored and optimized for specific workload sizes. They rely on high-latency global memory when working with arbitrary-sized datasets, which results in large performance penalties compared to the optimal input size. To quantify the bottlenecks, we develop an optimized CUDA-based implementation that supports arbitrary subsequence sizes and characterize it on the NVIDIA Tesla V100 GPU. We analyze the sDTW kernel using NVIDIA Visual Profiler [75] and generate the roofline plot in Figure 3-right. We observe that sDTW-GPU's performance improves with respect to sDTW-CPU but utilizes merely 1% of the GPU's available peak performance. We explain this observation by 1) the low arithmetic intensity of sDTW and 2) the limited per-thread available local memory. Even increasing the available local memory does not improve performance, and the algorithm hits the memory roof due to 1), thus greatly underutilizing the platform. Based on this analysis, we conclude that the GPU is not a good target for sDTW kernels executing on arbitrary subsequence sizes, which is the common case in many applications.

FPGA. sDTW acceleration using FPGAs requires large on-board memory to achieve high performance. As most of the prior work based on FPGAs does not provide high on-chip memory capacity, data is distributed over the chip. We develop an optimized FPGA implementation targeting a Xilinx Alveo U50 and build the roofline model in Figure 4-left. We observe that the eight compute units that fit in the FPGA achieve less than 7% of the available peak throughput and are insufficient to exploit the inherent parallelism in the sDTW kernel.

Key Observation 1: Conventional architectures fail to provide a high-performance and energy-efficient acceleration solution because execution time and energy are wasted on the data movement between memory and processing units.
Processing-Near-Memory (PNM). PNM platforms place processing units in the same die as memory units. The idea behind this paradigm is to exploit the lower latency and higher bandwidth available in memory and mitigate the data movement overheads between the processing units and memory. To evaluate the performance and energy efficiency of sDTW on PNM, we implement an optimized version of the algorithm on UPMEM [76], the first commercially available server-class PNM platform. We build the roofline model in Figure 4-right and observe that sDTW is compute-bound in UPMEM. This observation can be attributed to the low-power general-purpose cores in UPMEM that offer poor throughput (146 GINTOPS in contrast to 15700 GINTOPS for the GPU). As arithmetic operations are at the core of sDTW, PNM cannot provide high performance for it. We also observe that UPMEM reduces the energy consumption by 37% with respect to the GPU by reducing the data movement overheads (§IV-C). However, poor performance in contrast to the GPU inhibits the effective usability of the platform for the sDTW kernel.
Key Observation 2: General-purpose PNM substrates provide higher energy efficiency compared to CPU/GPU/FPGA platforms. However, they fail to offer a high-performance solution because of the limited arithmetic computation throughput supported by the hardware.

E. Overcoming bottlenecks in TSA
Need for Processing-Using-Memory (PUM). We observe that when executing the sDTW kernel, 1) CPU, GPU, and FPGA platforms are memory-bound, and 2) PNM platforms are compute-bound. In contrast to these platforms, PUM accelerators execute operations directly using the memory cells where data resides [15]. PUM enables 1) exploiting the large internal memory bandwidth for memory-bound kernels, and 2) exploiting massive computation parallelism (up to one operation per bitline) for compute-bound kernels, overcoming key restrictions of CPU, GPU, FPGA, and PNM architectures. Based on these observations, we argue that a PUM-based accelerator is needed to improve TSA's performance and energy efficiency with a balanced solution.
Cell Technology Choice. A PUM-based accelerator's performance, energy efficiency, and endurance depend on the underlying substrate's cell technology; thus, it is a critical design choice. Non-Volatile Memories (NVM) offer a low-energy substrate for PUM as they do not require periodic refresh operations, in contrast to DRAM-based PUM [16], [77]. However, it is challenging to support frequent write operations in NVM-based PUM architectures due to significant write latency and low endurance [78]. Table II presents the characteristics of the NVM technologies we considered for accelerating the sDTW kernel. We discard NAND Flash, ReRAM, and PCM in the first step due to their low endurance and high write latency. Next, we consider FRAM due to its high endurance but discard it due to its high read latency. We then consider MRAM technologies (§II-F) and discard STT-MRAM due to its high write latency. In contrast to STT-MRAM, SOT-MRAM offers 1) high endurance, 2) low read and write latencies, and 3) CMOS compatibility that eases manufacturability [79]. Considering these characteristics, we argue that SOT-MRAM is a promising substrate for implementing PUM accelerators for kernels with frequent write operations, and evaluate its feasibility for accelerating the sDTW kernel. We conclude that the MRAM-PUM acceleration approach has the potential to overcome TSA's bottlenecks and provide a faster and more efficient solution than the state-of-the-art.

F. MRAM-based PUM Computation
Many prior works demonstrate significant performance and energy efficiency improvements for machine learning workloads via PUM in resistive crossbars [80] by exploiting matrix-vector multiplication. Other approaches can exploit bitwise operations with high performance and energy savings [81]-[83]. Figure 5-a shows a typical crossbar organization with memory cells connected using bitlines and wordlines. An SOT-MRAM cell comprises the following components:
• Magnetic Tunnel Junction (MTJ). Consists of a fixed layer with a pinned magnetization direction, a free layer whose magnetization can be changed, and an insulating tunnel barrier between them.
• Heavy Metal Layer. This layer is placed next to the MTJ to facilitate the spin-orbit torque effect. Common heavy metals used include tantalum (Ta) and tungsten (W).
A change in the orientation of one of the layers of the stack results in a variation in the device's electrical resistance. Compared to Spin-Transfer-Torque MTJ (STT-MTJ) [19], SOT-MTJ features separated read and write paths, enhancing endurance and widening the read/write margin. Then, sense amplifiers interpret the resulting voltage as a Boolean value:
• Read Operation. During a read operation, the resistance of the MTJ is measured. The resistance is sensitive to the relative alignment of the magnetization in the fixed and free layers, allowing the stored data (Boolean values representing 0 or 1) to be read.
• Write Operation. During a write operation, an electric current is applied through the heavy metal layer, inducing a spin current. This spin current exerts torque on the free layer, causing its magnetization direction to switch and changing the stored Boolean data.
Unlike STT-MTJ, which faces read disturbance issues that limit the read circuit frequency, SOT-MTJ allows for flexible adjustment of the current magnitude in the read circuit without concerns about read disturbance effects. As a consequence, it enables more accurate sensing, which is crucial to implement in-memory operations. This suggests SOT-MRAM as a better candidate for PUM applications.
Bitwise PUM Mechanism. The matrix-vector PUM mapping proposed in prior works cannot be applied to dynamic programming (DP) algorithms (e.g., sDTW), since that mapping targets matrix-vector multiplication, whereas DP requires computing a 2D scoring matrix by traversing it row-by-row. Moreover, prior crossbar substrates offer limited support for other operations (e.g., minimum calculation). To overcome this challenge, MAGIC [84] proposes decomposing complex operations into simple Boolean functions (e.g., AND, NOR, XOR) to support them in the substrate. The key idea is to vertically map the operands (e.g., 32-bit integers) to the crossbars' columns using (typically) one bit per cell (e.g., each operand value takes 32 bits of a given column). Then, the desired operation (e.g., addition) is decomposed into simple bitwise operations (e.g., NOR) and performed bit-by-bit by simultaneously activating, at each step, the two cells that hold the corresponding bit of each operand. This approach creates a difference in the voltage over the bitline depending on the content of the activated cells, which depends on the resistance they hold. Then, a modified sense amplifier calculates the result based on that voltage difference and thresholds, storing it in a cell of the same column. While this process is inherently sequential and the latency per operation is higher than a CMOS-based approach, 1) the independence across columns and 2) the lack of data movement enable immense parallelism and, thus, an overall higher throughput than CMOS-based solutions. Figure 5-c shows a sense amplifier (SA) slightly modified with respect to commodity ones, including different voltage thresholds for the operations.
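To illustrate the decomposition idea, the sketch below builds NOT, OR, AND, and XOR from a single NOR primitive and applies them to whole rows at once. It is a software analogy under stated assumptions, not a model of the MAGIC or MATSA circuits: each Python integer stands for one crossbar row, each bit position stands for one column, and `WIDTH` and all function names are hypothetical.

```python
WIDTH = 8                    # hypothetical number of crossbar columns
MASK = (1 << WIDTH) - 1      # keeps results within WIDTH columns

def nor(row_a, row_b):
    """Column-parallel NOR: every bit position (column) is computed independently."""
    return ~(row_a | row_b) & MASK

def not_(row):
    return nor(row, row)

def or_(row_a, row_b):
    return not_(nor(row_a, row_b))

def and_(row_a, row_b):
    # De Morgan: a AND b == NOR(NOT a, NOT b)
    return nor(not_(row_a), not_(row_b))

def xor_(row_a, row_b):
    # 1 only where the bits are neither both 0 (NOR) nor both 1 (AND).
    return nor(nor(row_a, row_b), and_(row_a, row_b))
```

Each call processes all `WIDTH` columns in one step, which mirrors why a per-operation latency that is higher than CMOS logic can still yield high throughput when thousands of columns operate in parallel.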

A. Overview
MATSA is an MRAM-based Accelerator for Time Series Analysis. Figure 6 presents an overview of our proposed architecture. MATSA is composed of several chips divided into multiple banks. Banks belonging to the same chip share buffers and I/O interfaces and work in lock-step. Each bank is composed of several Memory Matrices (MATs). The MATs share a Global Row Buffer (GRB) and are connected to a Global Row Decoder (GRD). We place a Local Row Buffer (LRB) for every pair of subarrays to improve performance. Each subarray is composed of magnetoresistive devices that are connected to the Write Word Lines (WWL), Write Bit Lines (WBL), Read Word Lines (RWL), Read Bit Lines (RBL), and Source Lines (SL). The compute-enabled subarrays perform the sDTW computation using Reconfigurable Sense Amplifiers (RSAs).
The execution flow is orchestrated by a hierarchy of small controllers implemented as finite state machines (FSMs). MATSA comprises 1) a global controller that orchestrates the inter-bank flow, 2) inter-MAT controllers that take care of the inter-MAT flow, and 3) subarray controllers that activate the memory rows and drive the RSAs to run the sDTW algorithm.

B. MATSA Subarrays
MATSA subarrays are composed of MRAM cells following a crossbar organization and can work either in regular memory mode or in compute mode. This is a desirable feature since our design consists of 1) subarrays that temporarily buffer the data until it is processed and 2) subarrays that perform the actual computation. Adjacent subarrays are connected using pass gates and aux columns (purple one in Figure 6) to enable the data flow through the hierarchy.
Memory Subarrays. MATSA subarrays in regular memory mode support both read and write data operations and work in the same way as conventional non-PUM-enabled memory.
Compute Subarrays. MATSA subarrays working in compute mode perform bit-wise operations on input data located in cells of the same column. This enables the parallel execution of many operations since all columns in the subarray work in parallel. The key idea is to select two or three input values simultaneously using the Memory Row Decoder (MRD). This produces an equivalent resistance that depends on the content of the selected cells and modifies the sensing voltage across the column accordingly. MATSA's Ctrl can select different operations from the Reconfigurable Sense Amplifiers (RSAs) that are placed per column. We modify the RSAs to execute operations by equipping them with different resistances to model the voltage thresholds, logic gates (i.e., NOR, XOR, INV), a register, and a multiplexer (see Figure 6). The RSAs in compute subarrays support the same operations as memory subarray RSAs, enabling switching between compute and memory modes.

C. PUM Operations
MATSA implements the following PUM operations to support the execution of sDTW (detailed in Algorithm 1):
• Vertical Row Copy. MATSA executes consecutive memory read and write operations in the same cycle to improve performance by activating two rows simultaneously. In the first half-cycle, the subarray's MRD activates the source row, which is read by the LRB. In the second half-cycle, the destination row is activated to store the data. This mechanism works at MAT and bank levels using the Global Row Buffer (GRB) to accelerate the copies across the hierarchy.
• Diagonal Row Copy. The Ctrl executes a diagonal copy to shift data between adjacent columns. The Ctrl leverages the available registers in the RSA and the interconnections between the RSAs. The operation is executed in two steps. First, the RSA reads the value in the source column. Second, the destination RSA (in an adjacent column) reads the value from the source RSA and writes it to its column.
• Addition/Subtraction. MATSA executes bit-serial addition/subtraction across columns. The Ctrl executes operations starting from the least significant bit of the two operands and proceeding to the most significant bit. Every bit operation requires two memory cycles, further divided into four half-cycles. In the first half-cycle, the RSAs read the voltage difference across the cells activated on each bitline as input operands and calculate the Sum. The RSA updates the Sum based on the Carry value stored in the register. In the second half-cycle, the RSAs write the Sum value to the destination cell. In the third half-cycle, the RSAs calculate the new Carry value based on a majority function of the operand rows and an auxiliary row reserved for the Carry bit. In the fourth half-cycle, the RSAs write the new Carry value to the auxiliary row for the next Carry calculation.
• Absolute Calculation. To calculate the absolute value, MATSA first checks the sign bit, leading to two possible scenarios: 1) if the number is positive, no change is needed; otherwise, 2) if the number is negative, MATSA inverts the bits of the number and adds '1' to the result (similar to 2's complement).
• Minimum Value. To calculate the minimum value between three elements, MATSA performs two comparisons based on the subtraction operation. First, it calculates the difference between the first two numbers. Second, it checks the resulting sign from the previous step and selects one of the two numbers for comparison against the third. The sign of the final comparison determines the minimum of the three values. The logic can be similarly extended to compare more than three values.
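The arithmetic operations listed above can be mimicked in software for a single column. The sketch below is a functional illustration only, under stated assumptions: operands are 32-bit two's complement values stored least-significant-bit first, the Sum is computed as a three-input XOR, and the Carry as a majority function; all function names are hypothetical, and half-cycle timing is not modeled. The sign-based minimum also assumes the subtraction does not overflow.

```python
BITS = 32  # operand width, matching the 32 cells per column described in the text

def to_bits(x):
    """Two's complement bits of x, least significant bit first."""
    return [(x >> k) & 1 for k in range(BITS)]

def from_bits(bits):
    value = sum(b << k for k, b in enumerate(bits))
    return value - (1 << BITS) if bits[-1] else value  # interpret the sign bit

def majority(a, b, c):
    return (a & b) | (a & c) | (b & c)

def add_bits(a, b, carry=0):
    """Bit-serial addition from LSB to MSB: Sum = a^b^carry, Carry = majority."""
    out = []
    for x, y in zip(a, b):
        out.append(x ^ y ^ carry)
        carry = majority(x, y, carry)
    return out

def invert(a):
    return [1 - x for x in a]

def sub_bits(a, b):
    """a - b computed as a + NOT(b) + 1."""
    return add_bits(a, invert(b), carry=1)

def abs_bits(a):
    """Check the sign bit; if negative, invert the bits and add 1."""
    return a if a[-1] == 0 else add_bits(invert(a), to_bits(0), carry=1)

def min2(a, b):
    """The sign of a - b selects the smaller operand (assumes no overflow)."""
    return a if sub_bits(a, b)[-1] else b

def min3(a, b, c):
    """Two comparisons: min of the first two, then against the third."""
    return min2(min2(a, b), c)
```

For instance, `from_bits(min3(to_bits(9), to_bits(4), to_bits(6)))` evaluates to 4. In the accelerator, the same sequence of bit steps would run on every column at once, so the sequential cost per operand is shared across all columns.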

D. Data Mapping
Section II-D demonstrates that sDTW is an embarrassingly parallel algorithm.We design MATSA's data mapping to leverage MRAM's parallel column-wise computation capability.Three data structures are involved in the sDTW computation: 1) reference sequence (of length O(M)), 2) query sequence (of length O(N)), and 3) the warping matrix (dynamic programming matric with size O(NM)).The data structures are mapped to the subarray as follows: • Reference Elements (R[j]).We vertically map each reference element to 32 cells of a column.If 1) the number of available columns is larger than the number of elements in reference, we replicate the reference to multiple columns to increase parallelism (distributing the queries between them).If 2) the number of available columns is lower than the number of elements in reference, we divide the query and complete the process in sequential batches.No action is needed if 3) available columns are equal to the number of elements in reference.• Query Elements (Q[i]).We vertically map each query element to 32 cells of a column.New query elements are introduced on the left side of the crossbar, and they are right-shifted in each successive step (see §III-E).• Current S vector (S[i, j]).We define the current vector of the warping matrix as the S vector.We vertically map each element of the S vector to 32 cells of a column, being aligned with the query and reference elements (i and j indexes, respectively).

• Temporal S vectors (S[i-1, j-1], S[i-1, j],
S[i, j-1]).We vertically map the three temporal vectors along the reference and query elements.Mapping the temporal vectors in the same subarray leverages parallelism in the subarray as each column can compute lines 8-9 of Algorithm 1 completely in parallel.Then, those vectors are efficiently updated also in parallel for the next iteration of the loop thanks to the vertical and diagonal row copies.
• Aux Cells. Each column has a slice of 64 cells used to hold the partial results during the execution flow.
We calculate the distance between each data point in the reference and the query by iterating over the current S vector of the warping matrix (see Algorithm 1). Each element in the S vector (mapped across different crossbar columns) requires accessing previous S vector values that are mapped to the same column (i.e., S[i − 1, j]) and to adjacent columns (i.e., S[i, j − 1], S[i − 1, j − 1]). To break this data dependency, we add three temporal S vectors to the crossbar array (S[i − 1, j − 1], S[i − 1, j], and S[i, j − 1]) that are updated in each step of the computation. Overall, our optimization reduces the memory footprint from O(NM) (the whole matrix) to O(4M) (the S vector plus the three auxiliary ones).
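The footprint reduction can be seen in a plain software analogue. The sketch below assumes the common sDTW formulation (a free start anywhere in the reference, absolute difference as the distance), which may differ in detail from Algorithm 1. The function names are hypothetical. `sdtw_full` keeps the whole O(NM) matrix, while `sdtw_rolling` keeps only the previous and current rows, which play the role of the S vector and its shifted temporal copies.

```python
import math

def sdtw_full(query, ref):
    """Reference version that stores the whole (N+1) x (M+1) warping matrix."""
    n, m = len(query), len(ref)
    # Row 0 is all zeros so a match may start at any reference position;
    # column 0 is infinity for every real query row.
    S = [[0.0] * (m + 1)] + [[math.inf] + [0.0] * m for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(query[i - 1] - ref[j - 1])
            S[i][j] = dist + min(S[i - 1][j - 1], S[i - 1][j], S[i][j - 1])
    return min(S[n][1:])

def sdtw_rolling(query, ref):
    """Same recurrence, keeping only O(M) state between query elements."""
    m = len(ref)
    prev = [0.0] * (m + 1)               # holds S[i-1, *]
    for q in query:
        cur = [math.inf] * (m + 1)       # holds S[i, *]
        for j in range(1, m + 1):
            dist = abs(q - ref[j - 1])
            # prev[j-1] = S[i-1, j-1], prev[j] = S[i-1, j], cur[j-1] = S[i, j-1]
            cur[j] = dist + min(prev[j - 1], prev[j], cur[j - 1])
        prev = cur
    return min(prev[1:])
```

Both functions return the same score. For the query [2, 4] against the reference [1, 3, 5] the best subsequence cost is 2, and a query that appears verbatim inside the reference scores 0.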

E. Execution Flow
MATSA's execution flow follows a wavefront approach [85], which reflects the computation pattern in dynamic programming applications. The motivation is that sDTW's matrix has to be computed in a wavefront manner due to inter-cell dependencies. Figure 7 shows an example of how we tackle this restriction, assuming one reference time series (red) and two queries (green and ocher). The key idea is to make computation flow diagonally by assigning one element of the wavefront to each processing element (PE) and using the diagonal row copy operation (§III-C) to shift data between columns on the wavefront. This is needed because each cell requires values from its left column, so those values must be available prior to computation. Therefore, each PE advances computation in the vertical direction with a one-cell delay with respect to its left PE, ensuring that the data needed to calculate the next value is available. Figure 7-a shows an initial state in which the computation has just started. In this example, only the PEs whose columns contain black rectangles are performing computation. Note that in every step the wavefront introduces a new PE to the active set, achieving maximum performance after as many steps as there are PEs. From this point on, all PEs are able to perform useful work in a given execution step. Figure 7-b shows how this initialization phase can be amortized by pipelining. By introducing a new query to compare against the reference before the prior one finishes, MATSA ensures that all PEs have work to do even during the transitions between queries. Overall, this execution flow enables 1) leveraging the subarray columns in parallel for the query, and 2) creating an inter-subarray pipeline to leverage parallelism across queries, i.e., by processing queries in parallel. The execution flow of each cell goes through the following steps:
1) Distance Calculation. Calculation of the distance between Q[i] and R[j], which provides the first partial result P1. This process implies several substeps depending on the selected distance metric (e.g., subtraction → absolute value).
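The wavefront schedule and the benefit of pipelining queries can be sketched with a simple counting model. The model below is an assumption made for illustration, not the paper's simulator: column j is taken to process element `step - j` of the stream of query elements, and back-to-back queries are treated as one longer stream. The function names are hypothetical.

```python
def active_pes(step, num_pes, stream_len):
    """Columns doing useful work at `step` when column j handles element step - j."""
    return [j for j in range(num_pes) if 0 <= step - j < stream_len]

def total_steps(num_pes, query_len, num_queries=1):
    """Steps until the last column has consumed the last streamed query element."""
    return num_queries * query_len + num_pes - 1

def utilization(num_pes, query_len, num_queries=1):
    """Fraction of (column, step) slots that perform useful work."""
    useful = num_pes * query_len * num_queries
    return useful / (num_pes * total_steps(num_pes, query_len, num_queries))
```

With 4 columns and a 3-element query, `active_pes` returns `[0]` at step 0 and `[0, 1, 2]` at step 2, showing one more column joining per step. In this model a single query keeps the columns busy only half of the time (`utilization(4, 3, 1)` is 0.5), while streaming ten queries back to back raises utilization above 0.9, which is the amortization effect attributed to pipelining.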

F. Programming Interface & System Integration
Programming Interface. We expose an API (Listing 1) that allows invoking MATSA from the host processing unit. MATSA expects the input data to be in a supported type/precision DTYPE (integer: int8, int16, int32 or int64; fixed-point: fp32 or fp64), the selected mode (either query filtering, where queries are compared against the reference, or self join, where slices of the reference are compared against themselves), and the distance metric (abs diff or square diff). MATSA can also take an anomaly threshold, in which case it returns an array with the detected anomalies.
System Integration. MATSA is designed to work synergistically with the CPU to accelerate TSA. We propose three MATSA versions to meet the requirements of different environments, as we describe next.
a) MATSA-HPC. A high-performance PCIe-based accelerator intended to be integrated into servers.
b) MATSA-Embedded. A small chip intended to be integrated with edge devices (e.g., sensors).
c) MATSA-Portable. A USB-based accelerator intended for use in desktop and laptop computers.

A. Methodology
To comprehensively quantify the performance and energy efficiency improvements of MATSA, we compare it with the following systems.
• UPMEM (upmem): Server-class Processing-Near-Memory DIMMs with 2560 DPUs running at 425MHz [10].
• MATSA-Embedded (matsa-embedded): consisting of 128 compute-enabled crossbars (1MB) and 896 regular-memory crossbars (7MB).
• MATSA-Portable (matsa-portable): consisting of 1024 compute-enabled crossbars (8MB) and 7168 regular-memory crossbars (56MB).
• MATSA-HPC (matsa-hpc): consisting of 4096 compute-enabled crossbars (32MB) and 28672 regular-memory crossbars (224MB).
Baselines. We use ZSim+Ramulator [86] and McPAT for the cpuarm platform. For the cpui7 and cpuxeon platforms, we have access to the target hardware and measure performance and energy consumption by averaging five repeated executions. The energy consumption is determined using Intel RAPL tools. To evaluate the performance of the upmem platform, we implement and optimize the sDTW algorithm as shown in Algorithm 1. To evaluate the performance of the fpga platform, we implement the sDTW algorithm using High-Level Synthesis vendor tools from Xilinx and optimize the implementation to utilize eight compute units and maximize the utilization of the available HBM bandwidth. We evaluate the performance of the gpu platform by optimizing a CUDA-based implementation of sDTW to maximize the HBM bandwidth utilization via memory coalescing. We measure the GPU's energy consumption using the nvidia-smi tool.
MATSA. Due to the lack of a cycle-accurate simulator for MRAM-based accelerators, we implement an in-house simulator for MRAM-based PUM. Figure 8 shows an overview of this simulator. We provide the workload characteristics and the MRAM device characteristics under study, and the simulator computes the performance and energy efficiency in return. We plan to release this simulator publicly after acceptance of this work.
We perform a sensitivity analysis by sweeping the MRAM devices' latency and energy from conservative to optimistic values based on MRAM device trends [87] listed in Table III. Based on that, we conservatively select an operating point (highlighted in bold) for the evaluations, taking into account realistic MRAM device progress projections [88]. We input the workload parameters and the MRAM characteristics obtained from the parameter sweep to the simulator to get the workload's execution time and energy consumption.
Datasets. We perform MATSA's design exploration using the datasets in Table IV, which ease understanding of the tradeoffs. Then, we compare MATSA against the baselines in real scenarios using the real datasets in Table V. The data type for these evaluations is int32, which covers the data ranges of all the evaluated workloads.
TABLE V: Real-world workloads used in our evaluation.
[91] 109842 800 32K
Seismology [90] 1727990 64 16K
Power [92] 1754985 1536 16K
ECG [93] 1800000 512 16K

B. MATSA Characterization
We perform a design space exploration of MATSA taking into consideration performance parameters of the cells (i.e., read/write latencies and energies).
Read/Write Latencies. We evaluate how changing the read/write latencies affects the execution time and present the results in Figure 9. We observe that increasing the read latency by 10× incurs a 4.7× execution time penalty, while increasing the write latency incurs a 6.5× penalty.
Key Observation 3: Using a memory technology with low write latency is crucial for MATSA's design.
Read/Write Energies. We evaluate how the total execution energy varies with the per-word write/read energy and show the results in Figure 10. We observe that the contributions of read energy and write energy are similar, so both have to be carefully taken into consideration.
Key Observation 4: Read energy contributes 45% and write energy contributes 55% to the total energy consumption of a given execution.
Dataset Sizes. First, we evaluate how the execution time varies with different dataset sizes (i.e., ref size and query size) and present the results in Figure 11. Second, we evaluate how the execution energy varies with different dataset sizes and present the results in Figure 12. We observe that both the reference size and the query size contribute equally to the execution time and energy. This happens because the total number of operations needed is directly proportional to ref size × query size. Our observation corroborates our earlier analysis stating that query-specific sDTW implementations do not fairly represent GPU performance, and there is a need for a more general solution.
MATSA sizes. We evaluate how the execution time varies when changing the number of MATSA's compute-enabled columns in Figure 13. MATSA provides almost-ideal scaling.
Endurance. Assuming that MATSA is built using 5/10ns rd/wr cells and runs 24/7 for ten years, we estimate that each cell will be written ≈ 4×10^9 times. Based on Table II, limited-endurance cells (e.g., ReRAM) would fail within one day. In contrast, high-endurance cells (10^15 writes for SOT-MRAM) can provide a very large usable lifetime.
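The endurance estimate can be turned into a short back-of-the-envelope calculation. The sketch below uses only the two figures quoted in the text (about 4×10^9 writes per cell over ten years, and 10^15 writes of endurance for SOT-MRAM). It assumes writes are spread evenly over time and considers write wear-out only, and the function names are hypothetical.

```python
TEN_YEAR_WRITES = 4e9        # per-cell writes over ten years of 24/7 use (from the text)
SOT_MRAM_ENDURANCE = 1e15    # writes per cell (from the text)

def writes_per_day(total_writes, years):
    """Average per-cell writes per day, assuming an even spread over time."""
    return total_writes / (years * 365)

def wearout_lifetime_years(endurance_writes, total_writes, years):
    """Years until the endurance budget is exhausted at the same write rate."""
    return endurance_writes / (total_writes / years)

daily = writes_per_day(TEN_YEAR_WRITES, 10)   # about 1.1e6 writes per cell per day
sot_years = wearout_lifetime_years(SOT_MRAM_ENDURANCE, TEN_YEAR_WRITES, 10)
```

Under these assumptions a cell sees roughly 1.1 million writes per day, so any technology whose endurance is below that figure would wear out within a day, while the quoted SOT-MRAM endurance leaves write wear-out far from being the limiting factor.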

Hardware Overheads. MATSA introduces hardware overheads in two components: 1) reconfigurable SAs and 2) MATSA controllers. Reconfigurable SAs add 13 transistors to a traditional SA. Taking into account typical SA and cell areas [94], [95], our design increases the overall crossbar area by less than 1%. MATSA controllers are implemented as small finite-state machines whose area is negligible compared to the memory arrays.

C. System Evaluation
MATSA-Embedded and MATSA-Portable. We compare the performance of MATSA-Embedded (32K compute-enabled columns) and MATSA-Portable (256K compute-enabled columns) with the cpuarm, cpui7, and fpga baselines in Figure 14a. The smallest version, MATSA-Embedded, provides 30.20×/1.30×/8.14× lower execution times than cpuarm, cpui7, and fpga, respectively. MATSA-Portable further improves the performance, by 241.66×/10.40×/65.28× with respect to the same baselines, respectively. These performance improvements stem from the higher available parallelism in PUM, where all compute-enabled columns can compute independently. Next, we compare the energy consumption of MATSA-Embedded and MATSA-Portable with the same baselines in Figure 14b. MATSA-Embedded reduces the energy consumption by 45.67×/10.64×/24.58× with respect to the cpuarm, cpui7, and fpga baselines, respectively. We observe that 1) the energy reduction comes from eliminating the expensive off-chip data movement and 2) MATSA-Portable reduces the energy consumption by roughly the same factor as MATSA-Embedded. We deduce from these results that scaling MATSA improves the performance but does not penalize the energy efficiency.
MATSA-HPC. We first perform a performance comparison of MATSA-HPC and present the results in Figure 15a. We observe that MATSA-HPC achieves 7.3×/6.15×/6.3× lower execution times than cpuxeon, gpu, and upmem, respectively, owing to the enormous available parallelism (one million compute columns). Second, we compare the energy consumption of MATSA-HPC in Figure 15b and observe that it provides 11.29×/4.21×/2.65× lower energy consumption than cpuxeon, gpu, and upmem, respectively. The energy efficiency benefits of MATSA-HPC stem from the elimination of off-chip data movement. We note that cpuxeon is bottlenecked by 1) the limited parallelism (number of cores) and 2) the high data movement costs through the memory hierarchy. The gpu baseline provides high parallelism but is limited by data movement from and to memory. The PNM-based upmem baseline provides high parallelism and lowers data access costs compared to CPUs and GPUs. However, the sDTW kernel is compute-bound in upmem due to its small general-purpose cores, in contrast to MATSA, which is a dedicated accelerator designed for the sDTW kernel.
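The configuration figures quoted in the evaluation can be cross-checked with a few lines of arithmetic. The sketch below takes the crossbar counts, capacities, and column counts from the text and assumes binary units (1MB = 2^20 bytes, 1K = 1024, and "one million" columns read as 2^20). The dictionary layout and function names are hypothetical.

```python
CONFIGS = {
    # name: compute-enabled crossbars, their total capacity in MB, compute columns
    "matsa-embedded": {"crossbars": 128, "mbytes": 1, "columns": 32 * 1024},
    "matsa-portable": {"crossbars": 1024, "mbytes": 8, "columns": 256 * 1024},
    "matsa-hpc": {"crossbars": 4096, "mbytes": 32, "columns": 1024 * 1024},
}

def crossbar_geometry(cfg):
    """Return (bits per crossbar, columns per crossbar, rows per crossbar)."""
    bits = cfg["mbytes"] * 1024 * 1024 * 8 // cfg["crossbars"]
    cols = cfg["columns"] // cfg["crossbars"]
    return bits, cols, bits // cols

# Speedups over cpuarm, cpui7, and fpga, as reported in the text.
EMBEDDED_SPEEDUP = (30.20, 1.30, 8.14)
PORTABLE_SPEEDUP = (241.66, 10.40, 65.28)

def scaling_ratios():
    """Speedup gained by moving from MATSA-Embedded to MATSA-Portable."""
    return [p / e for p, e in zip(PORTABLE_SPEEDUP, EMBEDDED_SPEEDUP)]
```

Under these assumptions every configuration works out to 65536 bits per crossbar arranged as 256 columns by 256 rows. The Portable-to-Embedded speedup ratios all come out close to 8, matching the 8× increase in compute-enabled columns and the almost-ideal scaling reported in the characterization.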

V. RELATED WORK
To our knowledge, MATSA is the first sDTW accelerator that uses MRAM-based PUM. We compare extensively to CPU, GPU, FPGA, and state-of-the-art PNM platforms in §IV. In this section, we describe related works focusing on accelerating sDTW and prior PUM-based accelerators.
Processing Near/Using Memory. There has been significant interest in Processing-[Near/Using]-Memory-based solutions for overcoming the von Neumann bottleneck in modern computation platforms [6], [8], [12], [14]-[16], [76], [84], [99]-[229] for various applications using accelerators or general-purpose cores. In [174], ARM cores are used as NDP compute units to improve data analytics operators (e.g., group, join, sort). IMPICA [230] is an NDP pointer chasing accelerator. Tesseract [231] is a scalable NDP accelerator for parallel graph processing. TETRIS [173] is an NDP neural network accelerator. Lee et al. [232] propose an NDP accelerator for similarity search. GRIM-Filter [146] is an NDP accelerator for pre-alignment filtering [233]-[237] in genome analysis [238]. Boroumand et al. [9] analyze the energy and performance impact of data movement for several widely-used Google consumer workloads, providing NDP accelerators for them. CoNDA [150] provides efficient cache coherence support for NDP accelerators. SparseP [239], [240] provides efficient data partitioning/mapping techniques for the SpMV kernel tailored for near-bank NDP architectures. Finally, an NDP architecture [169] has been proposed for MapReduce-style applications.
Xu et al. [241] propose a memristor-based accelerator for the sDTW kernel. Despite promising performance, they do not discuss the endurance challenges associated with memristors, which restrict the lifetime of the accelerator. In contrast, MATSA considers this challenge and offers a usable lifetime of several decades. Chen and Gu [242] propose an sDTW accelerator that exploits DTW pipelining using a specially designed time flip-flop. Although this work uses memristors for computation, it does not leverage PUM: the data must be moved from/to memory (i.e., the memristors do not store the data). In contrast, MATSA eliminates off-chip data movement to obtain high performance and energy efficiency.

VI. CONCLUSIONS
This paper presents MATSA, the first MRAM-based Accelerator for Time Series Analysis. The key idea is to exploit magnetoresistive crossbars to enable energy-efficient and fast time series computation in memory. MATSA provides the following key benefits: 1) significantly higher parallelism by exploiting column-level bitwise operations, and 2) a reduction in data movement overheads by leveraging PUM. MATSA improves performance and reduces energy consumption compared to CPU, GPU, FPGA, and PNM platforms.

Fig. 3: Roofline plots for sDTW on a many-core CPU platform (left) and a server-class GPU (right).

Fig. 5: a) Crossbar organization. b) Magneto-resistive cell. c) Reconfigurable SA that performs in-memory operations based on the voltage variations across the bitline.

Figure 5-b shows the basic structure of a Spin-Orbit Torque (SOT)-MRAM cell composed of a stack of Magnetic Tunnel Junctions (MTJs) (cyan and red blocks in the figure) and a Heavy Metal Layer (grey block in the figure).

2) Minimum. Calculation, without storing the result, of min(S[i − 1, j − 1], S[i − 1, j], S[i, j − 1]), which produces the value S1 for the next step.
3) Addition. Calculation of the addition between the minimum value selected in the previous step (S1) and the partial result P1.
4) Diagonal Copy. Copying the S[i, j] vector into the S[i, j − 1] vector, shifted by one to the right.
5) Diagonal Copy. Copying the S[i − 1, j] vector into the S[i − 1, j − 1] vector, shifted by one to the right.
6) Vertical Copy. Copying the S[i, j] vector into the S[i − 1, j] vector.
7) Diagonal Copy. Copying the Q[i] vector into the same Q[i] vector, shifted one position to the right.
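The seven steps of the per-cell execution flow (distance, minimum, addition, and the four row copies) can be mimicked in software with one list per crossbar row. This is a minimal sketch under stated assumptions, not the hardware implementation: it handles a single query, uses the absolute difference as the distance, adopts the common sDTW boundary conditions (zeros above the first query row, infinity to the left of the first column), and treats a column as active only while it holds a query element. The name `sdtw_wavefront` and the row names are hypothetical.

```python
import math

def sdtw_wavefront(query, ref):
    """Column-parallel sDTW: column j holds ref[j]; query elements shift right."""
    n, m = len(query), len(ref)
    inf = math.inf
    q_row = [None] * m       # query element currently held by each column
    s_cur = [0] * m          # S[i, j]; the initial zeros stand for the free-start row
    s_up = [0] * m           # S[i-1, j]
    s_left = [inf] * m       # S[i, j-1]
    s_diag = [inf] * m       # S[i-1, j-1]
    for step in range(n + m - 1):
        # A new query element enters at the leftmost column (None once exhausted).
        q_row[0] = query[step] if step < n else None
        for j in range(m):   # conceptually, all columns work in parallel
            if q_row[j] is None:
                continue     # column is idle in this step
            p1 = abs(q_row[j] - ref[j])                # 1) distance calculation
            s1 = min(s_diag[j], s_up[j], s_left[j])    # 2) minimum
            s_cur[j] = s1 + p1                         # 3) addition
        s_left = [inf] + s_cur[:-1]    # 4) diagonal copy: S[i, j] -> S[i, j-1]
        s_diag = [inf] + s_up[:-1]     # 5) diagonal copy: S[i-1, j] -> S[i-1, j-1]
        s_up = list(s_cur)             # 6) vertical copy: S[i, j] -> S[i-1, j]
        q_row = [None] + q_row[:-1]    # 7) shift the query row to the right
    # Each column ends up holding its last-row value S[N-1, j].
    return min(s_cur)
```

Within a step, every column reads only its own entries of the temporal rows, so the inner loop has no cross-column dependency; the data exchange between neighbors happens entirely in the shifted copies. Step 5 must run before step 6, because it needs the old contents of the S[i − 1, j] row.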

Fig. 12: Execution energy when varying dataset sizes (num queries=8K, matsa cols=128K).
Key Observation 5: Total execution time and energy consumption are proportional to both the ref size and the query size.

TABLE III: MATSA design space exploration parameters.

TABLE IV: Workloads used in MATSA characterization.

Table VI summarizes MATSA's benefits.
TABLE VI: MATSA's Speedup and Energy over baselines.