DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Prior NDP works investigate the root causes of data movement bottlenecks using different profiling methodologies and tools. However, there is still a lack of understanding about the key metrics that can identify different data movement bottlenecks and their relation to traditional and emerging data movement mitigation mechanisms. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques (e.g., caching and prefetching) to more memory-centric techniques (e.g., NDP), thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains.
Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.


Introduction
Today's computing systems require moving data from main memory (consisting of DRAM) to the CPU cores so that computation can take place on the data. Unfortunately, this data movement is a major bottleneck for system performance and energy consumption [1,2]. DRAM technology scaling is failing to keep up with the increasing memory demand from applications, resulting in significant latency and energy costs due to data movement [1-3, 5, 6, 30-49]. High-performance systems have evolved to include mechanisms that aim to alleviate data movement's impact on system performance and energy consumption, such as deep cache hierarchies and aggressive prefetchers. However, such mechanisms not only come with significant hardware cost and complexity, but they also often fail to hide the latency and energy costs of accessing DRAM in many modern and emerging applications [1,5,50-52]. These applications' memory behavior can differ significantly from more traditional applications since modern applications often have lower memory locality, more irregular access patterns, and larger working sets [36,45,46,53-61]. One promising technique that aims to alleviate the data movement bottleneck in modern and emerging applications is Near-Data Processing (NDP) [1, 33, 34, 46-48, 54, 55, 59-118], where the cost of data movement to/from main memory is reduced by placing computation capability close to memory. In NDP, the computational logic close to memory has access to data that resides in main memory with significantly higher memory bandwidth, lower latency, and lower energy consumption than the CPU has in existing systems. There is very high bandwidth available to the cores in the logic layer of 3D-stacked memories, as demonstrated by many past works (e.g., [1, 46, 59, 60, 62-64, 67-69, 74, 76, 99, 119]).
To illustrate this, we use the STREAM Copy [120] workload to measure the peak memory bandwidth that the host CPU and an NDP architecture with processing elements in the logic layer of a single 3D-stacked memory (e.g., Hybrid Memory Cube [73]) can each leverage. We observe that the peak memory bandwidth that the NDP logic can leverage (431 GB/s) is 3.7× the peak memory bandwidth that the host CPU can exploit (115 GB/s). This happens since the external memory bandwidth is bounded by the limited number of I/O pins available in the DRAM device [121].
Many recent works explore how NDP can benefit various application domains, such as graph processing [46,47,54,63,74,93,122-126], machine learning [1,61,69,70,84,85,103], bioinformatics [59,60,68], databases [55,61,63,66,67,74,86,102], security [71,105,106], data manipulation [49,86,88,89,127-130], and mobile workloads [1,61]. These works demonstrate that simple metrics such as last-level CPU cache Misses per Kilo-Instruction (MPKI) and Arithmetic Intensity (AI) are useful metrics that serve as a proxy for the amount of data movement an application experiences. These metrics can be used as a potential guide for choosing when to apply data movement mitigation mechanisms such as NDP. However, such metrics (and the corresponding insights) are often extracted from a small set of applications, with similar or not-rigorously-analyzed data movement characteristics. Therefore, it is difficult to generalize the metrics and insights these works provide to a broader set of applications, making it unclear what different metrics can reveal about a new (i.e., previously uncharacterized) application's data movement behavior (and how to mitigate its associated data movement costs).
We illustrate this issue by highlighting the limitations of two different methodologies commonly used to identify memory bottlenecks and often used as a guide to justify the use of NDP architectures for an application: (a) analyzing a roofline model [131] of the application, and (b) using last-level CPU cache MPKI as an indicator of NDP suitability of the application. The roofline model correlates the computation requirements of an application with its memory requirements under a given system. The model contains two roofs: (1) a diagonal line (y = Peak Memory Bandwidth × Arithmetic Intensity) called the memory roof, and (2) a horizontal line (y = Peak System Throughput) called the compute roof [131]. If an application lies under the memory roof, the application is classified as memory-bound; if an application lies under the compute roof, it is classified as compute-bound. Many prior works [99,103,132-144] employ this roofline model to identify memory-bound applications that can benefit from NDP architectures. Likewise, many prior works [1,36,51,54,55,145-150] observe that applications with high last-level cache MPKI are good candidates for NDP.
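The two roofs described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: arithmetic intensity is taken in operations per byte, the 115 GB/s bandwidth is the host CPU figure quoted earlier, and the 460 GOPS peak throughput and the function names are hypothetical values chosen for the example, not parameters of the evaluated system.

```python
def attainable_performance(arithmetic_intensity, peak_bandwidth, peak_throughput):
    """Roofline bound: the lower of the memory roof and the compute roof.

    arithmetic_intensity: operations per byte (assumed unit)
    peak_bandwidth:       GB/s
    peak_throughput:      GOPS
    """
    memory_roof = peak_bandwidth * arithmetic_intensity  # diagonal line
    compute_roof = peak_throughput                       # horizontal line
    return min(memory_roof, compute_roof)


def roofline_class(arithmetic_intensity, peak_bandwidth, peak_throughput):
    """Classify an application by which roof limits it."""
    # The ridge point is the arithmetic intensity where the two roofs meet.
    ridge_point = peak_throughput / peak_bandwidth
    if arithmetic_intensity < ridge_point:
        return "memory-bound"
    return "compute-bound"


# Hypothetical system: 115 GB/s (from the text) and 460 GOPS (assumed).
HOST_BW_GBPS = 115.0
PEAK_GOPS = 460.0

low_ai = roofline_class(1.0, HOST_BW_GBPS, PEAK_GOPS)
high_ai = roofline_class(8.0, HOST_BW_GBPS, PEAK_GOPS)
```

With these assumed numbers the ridge point sits at 4 operations per byte, so the first application falls under the memory roof and the second under the compute roof.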
Figure 1 shows the roofline model (left) and a plot of MPKI vs. speedup (right) of a system with general-purpose NDP support over a baseline system without NDP for a diverse set of 44 applications (see Table 8). In the MPKI vs. speedup plot, the MPKI corresponds to a baseline host CPU system. The speedup represents the performance improvement of a general-purpose NDP system over the baseline (see Section 2.4 for our methodology). We make the following observations. First, analyzing the roofline model (Figure 1, left), we observe that most of the memory-bound applications (yellow dots) benefit from NDP, as foreseen by prior works. We later observe (in Section 3.3.1) that such applications are DRAM bandwidth-bound and are a natural fit for NDP. However, the roofline model does not accurately account for the NDP suitability of memory-bound applications that (i) benefit from NDP only under particular microarchitectural configurations, e.g., either at low or high core counts (green dots, which are applications that are either bottlenecked by DRAM latency or suffer from L3 cache contention; see Sections 3.3.3 and 3.3.4); or (ii) experience performance degradation when executed using NDP (blue dots, which are applications that suffer from the lack of a deep cache hierarchy in NDP architectures; see Section 3.3.6). Second, analyzing the MPKI vs. speedup plot (Figure 1, right), we observe that while all applications with high MPKI benefit from NDP (yellow dots with MPKI higher than 10), some applications with low MPKI can also benefit from NDP in all of the NDP microarchitecture configurations we evaluate (yellow dots with MPKI lower than 10) or under specific NDP microarchitecture configurations (green dots with MPKI lower than 10). Thus, even though both the roofline model and MPKI can identify some specific sources of memory bottlenecks and can sometimes be used as a proxy for NDP suitability, they alone cannot definitively determine NDP suitability because they cannot comprehensively identify different possible sources of memory bottlenecks in a system.
Our goal in this work is (1) to understand the major sources of inefficiency that lead to data movement bottlenecks by observing and identifying relevant metrics and (2) to develop a benchmark suite for data movement that captures each of these sources. To this end, we develop a new three-step methodology to correlate application characteristics with the primary sources of data movement bottlenecks and to determine the potential benefits of three example data movement mitigation mechanisms: (1) a deep cache hierarchy, (2) a hardware prefetcher, and (3) a general-purpose NDP architecture. We use two main profiling strategies to gather key metrics from applications: (i) an architecture-independent profiling tool and (ii) an architecture-dependent profiling tool. The architecture-independent profiling tool provides metrics that characterize the application memory behavior independently of the underlying hardware. In contrast, the architecture-dependent profiling tool evaluates the impact of the system configuration (e.g., cache hierarchy) on the memory behavior. Our methodology has three steps. In Step 1, we use a hardware profiling tool to identify memory-bound functions across many applications. This step allows for a quick first-level identification of many applications that suffer from memory bottlenecks and functions that cause these bottlenecks. In Step 2, we use the architecture-independent profiling tool to collect metrics that provide insights about the memory access behavior of the memory-bottlenecked functions. In Step 3, we collect architecture-dependent metrics and analyze the performance and energy of each function in an application when each of our three candidate data movement mitigation mechanisms is applied to the system. By combining the data obtained from all three steps, we can systematically classify the leading causes of data movement bottlenecks in an application or function into different bottleneck classes.
Using this new methodology, we characterize a large, heterogeneous set of applications (345 applications from 37 different workload suites) across a wide range of domains. Within these applications, we analyze 77K functions and find a subset of 144 functions from 74 different applications that are memory-bound (and that consume a significant fraction of the overall execution time). We fully characterize this set of 144 representative functions to serve as a core set of application kernel benchmarks, which we release as the open-source DAMOV (DAta MOVement) Benchmark Suite [158]. Our analyses reveal six new insights about the sources of memory bottlenecks and their relation to NDP: (1) Applications with high last-level cache MPKI and low temporal locality are DRAM bandwidth-bound. These applications benefit from the large memory bandwidth available to the NDP system (Section 3. [...] benefit from L2/L3 caches. The NDP system improves performance and energy efficiency by sending L1 misses directly to DRAM (Section 3.3.2). (3) A second group of applications with low LLC MPKI and low temporal locality are bottlenecked by L1/L2 cache capacity. These applications benefit from the NDP system at low core counts. However, at high core counts (and thus larger L1/L2 cache space), the caches capture most of the data locality in these applications, decreasing the benefits the NDP system provides (Section 3.3.3). We make this observation using a new metric that we develop, called last-to-first miss-ratio (LFMR), which we define as the ratio between the number of LLC misses and the total number of L1 cache misses. We find that this metric accurately identifies how efficient the cache hierarchy is in reducing data movement. (4) Applications with high temporal locality and low LLC MPKI are bottlenecked by L3 cache contention at high core counts. In such cases, the NDP system provides a cost-effective way to alleviate cache contention over increasing the L3 cache capacity (Section 3.3.4). (5) Applications with high temporal locality, low LLC MPKI, and low AI are bottlenecked by the L1 cache capacity. The three candidate data movement mitigation mechanisms achieve similar performance and energy consumption for these applications (Section 3.3.5). (6) Applications with high temporal locality, low LLC MPKI, and high AI are compute-bound. These applications benefit from a deep cache hierarchy and hardware prefetchers, but the NDP system degrades their performance (Section 3.3.6).

Figure 1: Roofline (left) and last-level cache MPKI vs. NDP speedup (right) for 44 memory-bound applications. Applications are classified into four categories: (1) those that experience performance degradation due to NDP (blue; Faster on CPU), (2) those that experience performance improvement due to NDP (yellow; Faster on NDP), (3) those where the host CPU and NDP performance are similar (red; Similar on CPU/NDP), (4) those that experience either performance degradation or performance improvement due to NDP depending on the microarchitectural configuration (green; Depends).

We publicly release our 144 representative data movement bottlenecked functions from 74 applications as the first open-source benchmark suite for data movement, called DAMOV Benchmark Suite, along with the complete source code for our new characterization methodology [158].

This work makes the following key contributions:
• We propose the first methodology to characterize data-intensive workloads based on the source of their data movement bottlenecks. This methodology is driven by insights obtained from a large-scale experimental characterization of 345 applications from 37 different benchmark suites and an evaluation of the performance of memory-bound functions from these applications with three data-movement mitigation mechanisms. In particular, we evaluate (i) the impact of load balance and inter-vault communication in NDP systems, (ii) the impact of NDP accelerators on our memory bottleneck analysis, (iii) the impact of different core models on NDP architectures, and (iv) the potential benefits of identifying simple NDP instructions. We conclude that our benchmark suite and methodology can be employed to address many different open research and development questions on data movement mitigation mechanisms, particularly topics related to NDP systems and architectures.
a scalability analysis to nail down the sources of memory boundedness, including architecture-dependent characterization. Our methodology takes as input an application's source code and its input datasets, and produces as output a classification of the primary source of memory bottleneck of important functions in an application (i.e., bottleneck class of each key application function).
We illustrate the applicability of this methodology with a detailed characterization of 144 functions that we select from among 77K analyzed functions of 345 characterized applications. In this section, we give an overview of our workload characterization methodology. We use this methodology to drive the analyses we perform in Section 3.

Experimental Evaluation Framework
As our scalability analysis depends on the hardware architecture, we need a hardware platform that can allow us to replicate and control all of our configuration parameters. Unfortunately, such an analysis cannot be performed practically using real hardware, as (1) there are very few available NDP hardware platforms, and the ones that currently exist do not allow us to comprehensively analyze our general-purpose NDP configuration in a controllable way (as existing platforms are specialized and non-configurable); and (2) the configurations of real CPUs can vary significantly across the range of core counts that we want to analyze, eliminating the possibility of a carefully controlled study. As a result, we must rely on accurate simulation platforms to perform an accurate comparison across different configurations. To this end, we build a framework that integrates the ZSim CPU simulator [159] with the Ramulator memory simulator [160] to produce a fast, scalable, and cycle-accurate open-source simulator called DAMOV-SIM [158]. We use ZSim to simulate the core microarchitecture, cache hierarchy, coherence protocol, and prefetchers. We use Ramulator to simulate the DRAM architecture, memory controllers, and memory accesses. To compute spatial and temporal locality, we modify ZSim to generate a single-thread memory trace for each application, which we use as input for the locality analysis algorithm described in Section 2.3 (which statically computes the temporal and spatial locality at word-level granularity).

Step 1: Memory-Bound Function Identification
The first step (labeled ❶ in Figure 2) aims to identify the functions of an application that are memory-bound (i.e., functions that suffer from data movement bottlenecks). These bottlenecks might be caused at any level of the memory hierarchy. There are various potential sources of memory boundedness, such as cache misses, cache coherence traffic, and long queuing latencies. Therefore, we need to take all such potential causes into account. This step is optional if the application's memory-bound functions (i.e., regions of interest, roi, in Figure 2) are already known a priori.
Hardware profiling tools, both open-source and proprietary, are available to obtain hardware counters and metrics that characterize the application behavior on a computing system. In this work, we use the Intel VTune Profiler [161], which implements the well-known top-down analysis [162]. Top-down analysis uses the available CPU hardware counters to hierarchically identify different sources of CPU system bottlenecks for an application. Among the various metrics measured by top-down analysis, there is a relevant one called Memory Bound [163] that measures the percentage of CPU pipeline slots that are not utilized due to any issue related to data access. We employ this metric to identify functions that suffer from data movement bottlenecks (which we define as functions where Memory Bound is greater than 30%).

Step 2: Locality-Based Clustering
Two key properties of an application's memory access pattern are its inherent spatial locality (i.e., the likelihood of accessing nearby memory locations in the near future) and temporal locality (i.e., the likelihood of accessing a memory location again in the near future). These properties are closely related to how well the application can exploit the memory hierarchy in computing systems and how accurate hardware prefetchers can be. Therefore, to understand the sources of memory bottlenecks for an application, we should analyze how much spatial and temporal locality its memory accesses inherently exhibit. However, we should isolate these properties from particular configurations of the memory subsystem. Otherwise, it would be unclear if memory bottlenecks are due to the nature of the memory accesses or due to the characteristics and limitations of the memory subsystem (e.g., limited cache size, too simple or inaccurate prefetching policies). As a result, in this step (labeled ❷ in Figure 2), we use architecture-independent static analysis to obtain spatial and temporal locality metrics for the functions selected in the previous step (Section 2.2). Past works [164-173] propose different ways of analyzing spatial and temporal locality in an architecture-independent manner. In this work, we use the definition of spatial and temporal metrics presented in [166,167].
The spatial locality metric is calculated for a window of memory references of length W using Equation 1. First, for every W memory references, we calculate the minimum distance between any two addresses (stride). Second, we create a histogram called the stride profile, where each bin i stores how many times each stride appears. Third, to calculate the spatial locality, we divide the percentage of times stride i is referenced (stride profile(i)) by the stride length i and sum the resulting value across all instances of i.

$$\text{Spatial Locality} = \sum_{i} \frac{\text{stride profile}(i)}{i} \qquad (1)$$

A spatial locality value close to 0 is caused by large stride values (e.g., regular accesses with large strides) or random accesses, while a value equal to 1 is caused by a completely sequential access pattern. The temporal locality metric is calculated by using a histogram of reused addresses. First, we count the number of times each memory address is repeated in a window of L memory references. Second, we create a histogram called reuse profile, where each bin i represents the number of times a memory address is reused, expressed as a power of 2. For each memory address, we increment the bin that represents the corresponding number of repetitions. For example, reuse profile(0) represents memory addresses that are reused only once. reuse profile(1) represents memory addresses that are reused twice. Thus, if a memory address is reused N times, we increment reuse profile(⌊log₂ N⌋) by one. Third, we obtain the temporal locality metric with Equation 2.
A temporal locality value of 0 indicates no data reuse, while a value close to 1 indicates very high data reuse (i.e., a value equal to 1 means that the application accesses a single memory address continuously).
To calculate these metrics, we empirically set the window lengths W and L to 32. We find that different values chosen for W and L do not significantly change the conclusions of our analysis. We observe that our conclusions remain the same when we set both values to 8, 16, 32, 64, or 128.
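The stride profile, the spatial locality sum, and the reuse profile described above can be sketched as follows. This is a minimal illustration, not the released analysis tool, and it rests on several assumptions that the text does not fix: windows are consecutive and non-overlapping, addresses are word indices, zero distances between repeated addresses are ignored when picking a window's stride, "reused N times" is read as N accesses after the first one, and the bin index is taken as ⌊log2 N⌋. The sketch stops at the reuse profile and does not attempt the final temporal locality score. All function names are hypothetical.

```python
from collections import Counter


def split_windows(trace, length):
    """Consecutive, non-overlapping windows over the trace (assumption)."""
    return [trace[k:k + length] for k in range(0, len(trace), length)]


def spatial_locality(trace, window=32):
    """Sum over strides i of (fraction of windows with stride i) / i."""
    stride_counts = Counter()
    for win in split_windows(trace, window):
        addrs = sorted(set(win))
        if len(addrs) < 2:
            continue  # assumption: a window with no non-zero stride is skipped
        # The minimum distance between any two addresses is the smallest
        # gap between neighbours in sorted order.
        stride = min(b - a for a, b in zip(addrs, addrs[1:]))
        stride_counts[stride] += 1
    total = sum(stride_counts.values())
    if total == 0:
        return 0.0
    return sum((count / total) / stride for stride, count in stride_counts.items())


def reuse_profile(trace, window=32):
    """Histogram of reuse counts per window, binned by floor(log2(N))."""
    profile = Counter()
    for win in split_windows(trace, window):
        for occurrences in Counter(win).values():
            reuses = occurrences - 1  # assumption: first access is not a reuse
            if reuses >= 1:
                # bit_length() - 1 equals floor(log2(reuses)) for reuses >= 1.
                profile[reuses.bit_length() - 1] += 1
    return profile


sequential = list(range(64))        # word addresses 0, 1, 2, ...
strided = list(range(0, 512, 8))    # regular accesses with stride 8
```

Under these assumptions the sequential trace scores 1 and the stride-8 trace scores 1/8, which matches the qualitative behavior described in the text (sequential patterns at 1, larger strides approaching 0).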

Step 3: Bottleneck Classification
While Step 2 allows us to understand inherent application sources for memory boundedness, it is important to understand how hardware architectural features can also result in memory bottlenecks. As a result, in our third step (❸ in Figure 2), we perform a scalability analysis of the functions selected in Step 1, where we evaluate performance and energy scaling for three different system configurations. The scalability analysis makes use of three architecture-dependent metrics: (1) Arithmetic Intensity (AI), (2) Misses per Kilo-Instruction (MPKI), and (3) a new metric called Last-to-First Miss-Ratio (LFMR). We select these metrics for the following reasons. First, AI can measure the compute intensity of an application. Intuitively, we expect an application with high compute intensity to not suffer from severe data movement bottlenecks, as demonstrated by prior work [174]. Second, MPKI serves as a proxy for the memory intensity of an application. It can also indicate the memory pressure experienced by the main memory system [45,47,48,58,151,153,156,175-177]. Third, LFMR, a new metric that we introduce and describe in detail later in this subsection, indicates how efficient the cache hierarchy is in reducing data movement.
As part of our methodology development, we evaluate other metrics related to data movement, including raw cache misses, coherence traffic, and DRAM row misses/hits/conflicts. We observe that even though such metrics are useful for further characterizing an application (as we do in some of our later analyses in Section 3.3), they do not necessarily characterize a specific type of data movement bottleneck. We show in Section 4.1 that the three architecture-dependent and two architecture-independent metrics we select for our classification are enough to accurately characterize and cluster the different types of data movement bottlenecks in a wide variety of applications.
2.4.1 Definition of Metrics. We define Arithmetic Intensity (AI) as the number of arithmetic and logic operations performed per L1 cache line accessed. This metric indicates how much computation there is per memory request. Intuitively, applications with high AI are likely to be computationally intensive, while applications with low AI tend to be memory intensive. We use MPKI at the last-level cache (LLC), i.e., the number of LLC misses per one thousand instructions. This metric is considered to be a good indicator of NDP suitability by several prior works [1,36,51,54,55,145-149]. We define the LFMR of an application as the ratio between the number of LLC misses and the total number of L1 cache misses. We find that this metric accurately identifies how much an application benefits from the deep cache hierarchy of a contemporary CPU. An LFMR value close to 0 means that the number of LLC misses is very small compared to the number of L1 misses, i.e., the L1 misses are likely to hit in the L2 or L3 caches. However, an LFMR value close to 1 means that very few L1 misses hit in L2 or L3 caches, i.e., the application does not benefit much from the deep cache hierarchy, and most L1 misses need to be serviced by main memory.
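The three definitions above reduce to simple ratios over event counts. The sketch below shows one way to compute them; the `Counters` record and its field names are hypothetical stand-ins for whatever a simulator reports, and the example values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hypothetical per-function event counts (field names are assumptions)."""
    instructions: int
    alu_ops: int             # arithmetic and logic operations
    l1_lines_accessed: int   # L1 cache lines accessed
    l1_misses: int
    llc_misses: int


def arithmetic_intensity(c: Counters) -> float:
    """Arithmetic and logic operations per L1 cache line accessed."""
    return c.alu_ops / c.l1_lines_accessed if c.l1_lines_accessed else 0.0


def llc_mpki(c: Counters) -> float:
    """LLC misses per one thousand instructions."""
    return 1000 * c.llc_misses / c.instructions if c.instructions else 0.0


def lfmr(c: Counters) -> float:
    """Last-to-first miss ratio: LLC misses divided by L1 misses."""
    return c.llc_misses / c.l1_misses if c.l1_misses else 0.0


# Made-up counts for a function whose L1 misses mostly reach main memory.
example = Counters(
    instructions=1_000_000,
    alu_ops=500_000,
    l1_lines_accessed=250_000,
    l1_misses=40_000,
    llc_misses=30_000,
)
```

For these made-up counts, three out of every four L1 misses also miss in the LLC, so the LFMR is close to 1 and the function would gain little from the L2 and L3 caches.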

2.4.2 Scalability Analysis and System Configuration. The goal of the scalability analysis we perform is to nail down the specific sources of data movement bottlenecks in the application. In this analysis, we (i) evaluate the performance and energy scaling of an application in three different system configurations; and (ii) collect the key metrics for our bottleneck classification (i.e., AI, MPKI, and LFMR). During scalability analysis, we simulate three system configurations of a general-purpose multicore processor:
• A host CPU with a deep cache hierarchy (i.e., private L1 (32 kB) and L2 (256 kB) caches, and a shared L3 (8 MB) cache with 16 banks). We call this configuration Host CPU.
• A host CPU with a deep cache hierarchy (same cache configurations as in Host CPU), augmented with a stream prefetcher [178]. We call this configuration Host CPU with prefetcher.
• An NDP CPU with a single level of cache (only a private read-only L1 cache (32 kB), as assumed in many prior NDP works [1,46,51,63,66,74,99,101,119,179]) and no hardware prefetcher. We call this configuration NDP.
The remaining components of the processor configuration are kept the same (e.g., number of cores, instruction window size, branch predictor) to isolate the impact of only the caches, prefetchers, and NDP. This way, we expect the performance and energy differences between the three configurations to come exclusively from the different data movement requirements. For the three configurations, we sweep the number of CPU cores in our analysis from 1 to 256, as previous works [46,66,180] show that large core counts are necessary to saturate the bandwidth provided by modern high-bandwidth memories, and because modern CPUs and NDP proposals can have varying core counts. The core count sweep allows us to observe (1) how an application's performance changes when increasing the pressure on the memory subsystem, (2) how much Memory-Level Parallelism (MLP) [176,181-184] the application has, and (3) how much the cores leverage the cache hierarchy and the available memory bandwidth. We proportionally increase the size of the CPU's private L1 and L2 caches when increasing the number of CPU cores in our analysis (e.g., when scaling the CPU core count from 1 to 4, we also scale the aggregated L1/L2 cache size by a factor of 4). We use out-of-order and in-order CPU cores in our analysis for all three configurations. In this way, we build confidence that our trends and findings are independent of a specific underlying general-purpose core microarchitecture. We simulate a memory architecture similar to the Hybrid Memory Cube (HMC) [73], where (1) the host CPU accesses memory through a high-speed off-chip link, and (2) the NDP logic resides in the logic layer of the memory chip and has direct access to the DRAM banks (thus taking advantage of higher memory bandwidth and lower memory latency). Table 1 lists the parameters of our host CPU, host CPU with prefetcher, and NDP baseline configurations.
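As a rough illustration of the core-count sweep, the sketch below enumerates the aggregate cache capacity at each point. The per-core L1 and L2 sizes and the 8 MB shared L3 come from the configuration list above; the intermediate core counts (powers of four), the assumption that the shared L3 stays fixed across the sweep, and all names are hypothetical choices for this example and are not taken from Table 1.

```python
L1_KB_PER_CORE = 32    # private L1 size from the text
L2_KB_PER_CORE = 256   # private L2 size from the text
SHARED_L3_MB = 8       # shared L3 size from the text; assumed fixed across the sweep


def sweep_configurations(core_counts=(1, 4, 16, 64, 256)):
    """Aggregate cache capacity per core count (core counts are an assumption)."""
    configs = []
    for cores in core_counts:
        configs.append({
            "cores": cores,
            # Host CPU: private caches scale with the number of cores.
            "host_total_l1_kb": cores * L1_KB_PER_CORE,
            "host_total_l2_kb": cores * L2_KB_PER_CORE,
            "host_shared_l3_mb": SHARED_L3_MB,
            # NDP: a single private L1 per core, no L2 or L3.
            "ndp_total_l1_kb": cores * L1_KB_PER_CORE,
        })
    return configs


configs = sweep_configurations()
```

Going from 1 to 4 cores multiplies the aggregated L1/L2 capacity by 4, as in the example given in the text, while the NDP configuration only ever gains L1 capacity.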
2.4.3 Choosing an NDP Architecture. We note that across the proposed NDP architectures in literature, there is a lack of consensus on whether the architectures should make use of general-purpose NDP cores or specialized NDP accelerators [36,37]. In this work, we focus on general-purpose NDP cores for two major reasons. First, many prior works (e.g., [1,46,51,63,66,76,99,101,119,147,179,190,192-194]) suggest that general-purpose cores (especially simple in-order cores) can successfully accelerate memory-bound applications in NDP architectures. In fact, UPMEM [83], a start-up building some of the first commercial in-DRAM NDP systems, utilizes simple in-order cores in their NDP units inside DRAM chips [83,140]. Therefore, we believe that general-purpose NDP cores are a promising candidate for future NDP architectures. Second, the goal of our work is not to perform a design space exploration of different NDP architectures, but rather to understand the key properties of applications that lead to memory bottlenecks that can be mitigated by a simple NDP engine. While we expect that each application could potentially benefit further from an NDP accelerator tailored to its computational and memory requirements, such

Characterizing Memory Bottlenecks
In this section, we apply our three-step workload characterization methodology to characterize the sources of memory bottlenecks across a wide range of applications. First, we apply Step 1 to identify memory-bound functions within an application (Section 3.1). Second, we apply Step 2 and cluster the identified functions using two architecture-independent metrics (spatial and temporal locality) (Section 3.2). Third, we apply Step 3 and combine the architecture-dependent and architecture-independent metrics to classify the different sources of memory bottlenecks we observe (Section 3.3).
We also evaluate various other aspects of our three-step workload characterization methodology. We investigate the effect of increasing the last-level cache on our memory bottleneck classification in Section 3.4. We provide a validation of our memory bottleneck classification in Section 3.5. We discuss the limitations of our proposed methodology in Section 3.6.

Step 1: Memory-Bound Function Identification
We first apply Step 1 of our methodology across 345 applications (listed in Appendix C) to identify functions whose performance is significantly affected by data movement. We use the previously proposed top-down analysis methodology [162] that has been used by several recent workload characterization studies [5,195,196]. As discussed in Section 2.2, we use the Intel VTune Profiler [161], which we run on an Intel Xeon E3-1240 processor [197] with four cores. We disable hyper-threading for more accurate profiling results, as recommended by the VTune documentation [198]. For the applications that we analyze, we select functions (1) that take at least 3% of the clock cycles, and (2) that have a Memory Bound percentage that is greater than 30%. We choose 30% as the threshold for this metric because, in preliminary simulation experiments, we do not observe significant performance improvement or energy savings with data movement mitigation mechanisms for functions whose Memory Bound percentage is less than 30%.
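The two selection criteria above amount to a simple filter over per-function profiling results. The sketch below assumes the profile has already been exported into plain records; the record layout, the field names, and the sample functions are hypothetical and do not reflect the profiler's actual output format.

```python
def select_memory_bound(functions, min_cycles_pct=3.0, memory_bound_threshold_pct=30.0):
    """Return the names of functions that pass both Step 1 criteria.

    Each record is a dict with hypothetical keys:
      "name", "cycles_pct" (share of clock cycles), "memory_bound_pct".
    """
    selected = []
    for fn in functions:
        takes_enough_cycles = fn["cycles_pct"] >= min_cycles_pct
        is_memory_bound = fn["memory_bound_pct"] > memory_bound_threshold_pct
        if takes_enough_cycles and is_memory_bound:
            selected.append(fn["name"])
    return selected


# Made-up profile of one application.
profile = [
    {"name": "copy_kernel",  "cycles_pct": 42.0, "memory_bound_pct": 71.5},
    {"name": "hash_lookup",  "cycles_pct": 2.1,  "memory_bound_pct": 55.0},
    {"name": "matrix_inner", "cycles_pct": 30.4, "memory_bound_pct": 12.3},
]

regions_of_interest = select_memory_bound(profile)
```

Only the first function passes both checks: the second takes too few cycles to matter, and the third is not limited by data access.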
The applications we analyze come from a variety of sources, such as popular workload suites (Chai [199], CORAL [200], Parboil [201], PARSEC [202], Rodinia [203], SD-VBS [204], SPLASH-2 [205]), benchmarking (STREAM [120], HPCC [206], HPCG [207]), bioinformatics [208], databases [209,210], graph processing frameworks (GraphMat [211], Ligra [212]), a map-reduce framework (Phoenix [213]), and neural networks (AlexNet [214], Darknet [215]). We explore different input dataset sizes for the applications and choose real-world input datasets that impose high pressure on the memory subsystem (as we expect that such real-world inputs are best suited for stressing the memory hierarchy). We also use different inputs for applications whose performance is tightly related to the input dataset properties. For example, we use two different graphs with varying connectivity degrees (rMat [216] and USA [217]) to evaluate graph processing applications and two different read sequences to evaluate read alignment algorithms [60,218,219]. In total, our application analysis covers more than 77K functions. To date, this is the most extensive analysis of data movement bottlenecks in real-world applications. We find a set of 144 functions that take at least 3% of the total clock cycles and have a value of the Memory Bound metric greater than or equal to 30%, which forms the basis of DAMOV, our new data movement benchmark suite. We provide a list of all 144 functions selected based on our analysis and their major characteristics in Appendix A.
After identifying memory-bound functions over a wide range of applications, we apply Steps 2 and 3 of our methodology to classify the primary sources of memory bottlenecks for our selected functions. We evaluate a total of 144 functions out of the 77K functions we analyze in Step 1. These functions span across 74 different applications, belonging to 16 different widely-used benchmark suites or frameworks.
From the 144 functions that we analyze further, we select a subset of 44 representative functions to explore in-depth in Sections 3.2 and 3.3 and to drive our bottleneck classification analysis. We use the 44 representative functions to ease our explanations and make figures more easily readable. Table 8 in Appendix A lists the 44 representative functions that we select. The table includes one column that indicates the class of data movement bottleneck experienced by each function (we discuss the classes in Section 3.3), and another column representing the percentage of clock cycles of the selected function in the whole application. We select representative functions that belong to a variety of domains: benchmarking, bioinformatics, data analytics, databases, data mining, data reorganization, graph processing, neural networks, physics, and signal processing. In Section 3.5, we validate our classification using the remaining 100 functions and provide a summary of the results of our methodology when applied to all 144 functions.

Step 2: Locality-Based Clustering
We cluster the 44 representative functions across both spatial and temporal locality using the K-means clustering algorithm [221]. Figure 3 shows how each function is grouped. We find that two groups emerge from the clustering: (1) low temporal locality functions (orange boxes in Figure 3), and (2) high temporal locality functions (blue boxes in Figure 3). Intuitively, the closer a function is to the bottom-left corner of the figure, the less likely it is to take advantage of a multi-level cache hierarchy. These functions are more likely to be good candidates for NDP. However, as we see in Section 3.3, the NDP suitability of a function also depends on a number of other factors.
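To illustrate the clustering step, the following is a minimal sketch of K-means (Lloyd's algorithm) over (spatial locality, temporal locality) pairs. It is not the clustering code used in this work: the six data points and the two initial centroids are hypothetical values chosen only to make the example deterministic.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (spatial locality, temporal locality)


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           max_iters: int = 100) -> Tuple[List[int], List[Point]]:
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    centroids = list(centroids)
    labels: List[int] = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return labels, centroids


# Hypothetical locality values, not measurements from this work.
points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.2), (0.3, 0.8), (0.8, 0.9), (0.5, 0.85)]
labels, centroids = kmeans(points, centroids=[(0.5, 0.0), (0.5, 1.0)])
print(labels)  # cluster 0 = low temporal locality, cluster 1 = high temporal locality
```

With these inputs, the two clusters separate along the temporal locality axis regardless of spatial locality, which mirrors the grouping described above.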

Step 3: Bottleneck Classification
Within the two groups of functions identified in Section 3.2, we use three key metrics (AI, MPKI, and LFMR) to classify the memory bottlenecks. We observe that the AI of the analyzed low temporal locality functions is low (i.e., always less than 2.2 ops/cache line, with an average of 1.3 ops/cache line). Among the high temporal locality functions, there are some with low AI (minimum of 0.3 ops/cache line) and others with high AI (maximum of 44 ops/cache line). LFMR indicates whether a function benefits from a deeper cache hierarchy. When LFMR is low (i.e., less than 0.1), a function benefits significantly from a deeper cache hierarchy, as most misses from the L1 cache hit in either the L2 or L3 caches. When LFMR is high (i.e., greater than 0.7), most L1 misses are not serviced by the L2 or L3 caches, and must go to memory. A medium LFMR (0.1-0.7) indicates that a deeper cache hierarchy can mitigate some, but not a very large fraction, of L1 cache misses. MPKI indicates the memory intensity of a function (i.e., the rate at which requests are issued to DRAM). We say that a function is memory-intensive (i.e., it has a high MPKI) when the MPKI is greater than 10, which is the same threshold used by prior works [151-157].
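The thresholds above can be captured in a few helper functions. This is a minimal sketch under stated assumptions: it takes LFMR to be the ratio of last-level cache misses to L1 cache misses and MPKI to be last-level cache misses per thousand instructions, which is how the discussion here uses them; the function names and the counter values are hypothetical.

```python
def lfmr(llc_misses: int, l1_misses: int) -> float:
    """Last-to-first miss ratio: fraction of L1 misses that also miss in the LLC."""
    return llc_misses / l1_misses


def mpki(llc_misses: int, instructions: int) -> float:
    """Last-level cache misses per kilo-instruction."""
    return 1000 * llc_misses / instructions


def lfmr_band(value: float) -> str:
    """Band an LFMR value using the thresholds in the text (0.1 and 0.7)."""
    if value < 0.1:
        return "low"
    if value > 0.7:
        return "high"
    return "medium"


def is_memory_intensive(value: float) -> bool:
    """High MPKI means more than 10 misses per kilo-instruction."""
    return value > 10


# Hypothetical counter values used only to exercise the helpers.
print(lfmr_band(lfmr(llc_misses=94, l1_misses=100)))
print(is_memory_intensive(mpki(llc_misses=20_000, instructions=1_000_000)))
```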
We find that six classes of functions emerge, based on their temporal locality, AI, MPKI, and LFMR values, as we observe from Figures 3 and 4. We observe that spatial locality is not a key metric for our classification (i.e., it does not define a bottleneck class) because the L1 cache, which is present in both host CPU and NDP system configurations, can capture most of the spatial locality for a function. Figure 4 shows the LFMR and MPKI values for each class. Note that we do not have classes of functions for all possible combinations of metrics. In our analysis, we obtain the temporal locality, AI, MPKI, and LFMR values and their combinations empirically. Fundamentally, not all value combinations of different metrics are possible. We list some of the combinations we do not observe in our analysis of 144 functions:
• A function with high LLC MPKI does not display low LFMR. This is because a low LFMR happens when most L1 misses hit the L2/L3 caches. Thus, it becomes highly unlikely for the L3 cache to suffer many misses when the L2/L3 caches do a good job in fulfilling L1 cache misses.
• A function with high temporal locality does not display both high LFMR and high MPKI. This is because a function with high temporal locality will likely issue repeated memory requests to few memory addresses, which will likely be serviced by the cache hierarchy.
• A function with low temporal locality does not display low LFMR since there is little data locality to be captured by the cache hierarchy.
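The six classes can be summarized as a small decision procedure over the four metrics. The sketch below is an illustration only, not the classification procedure used in this work: the `FunctionProfile` fields are hypothetical, the class labels follow the class names used in this section, and the `trend` threshold that decides whether LFMR "increases" or "decreases" with core count is an assumed value.

```python
from dataclasses import dataclass


@dataclass
class FunctionProfile:
    """Hypothetical per-function summary of the classification metrics."""
    high_temporal_locality: bool
    high_ai: bool
    high_mpki: bool
    lfmr_low_cores: float   # LFMR measured at a low core count
    lfmr_high_cores: float  # LFMR measured at a high core count


def classify(p: FunctionProfile, trend: float = 0.3) -> str:
    """Map a profile to a bottleneck class; `trend` is an assumed LFMR-change cutoff."""
    change = p.lfmr_high_cores - p.lfmr_low_cores
    if not p.high_temporal_locality:
        if p.high_mpki:
            return "1a"  # DRAM bandwidth-bound
        if change <= -trend:
            return "1c"  # LFMR decreases with core count: L1/L2 capacity
        return "1b"      # DRAM latency-bound
    if p.high_ai:
        return "2c"      # compute-bound
    if change >= trend:
        return "2a"      # LFMR increases with core count: L3 contention
    return "2b"          # L1 capacity


# LFMR values below are the ones quoted in this section for two functions.
drkres = FunctionProfile(False, False, False, lfmr_low_cores=0.5, lfmr_high_cores=0.09)
plygramsch = FunctionProfile(True, False, False, lfmr_low_cores=0.03, lfmr_high_cores=0.97)
print(classify(drkres), classify(plygramsch))
```

The sketch deliberately ignores the metric combinations that are not observed in the analysis (for example, low temporal locality with low LFMR), so it should be read as a summary of the class definitions rather than a complete classifier.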
We discuss each class in detail below, identifying the memory bottlenecks for each class and whether the NDP system can alleviate these bottlenecks. To simplify our explanations, we focus on a smaller set of 12 representative functions (out of the 44 representative functions) for this part of the analysis. Figure 5 shows how each of the 12 functions scales in terms of performance for the host CPU, host CPU with prefetcher, and NDP system configurations.

Class 1a: Low Temporal Locality, Low AI, High LFMR, and High MPKI (DRAM Bandwidth-Bound Functions)
Functions in this class exert high main memory pressure since they are highly memory intensive and have low data reuse. To understand how this affects a function's suitability for NDP, we study how performance scales as we increase the number of cores available to a function, for the host CPU, host CPU with prefetcher, and NDP system configurations. Figure 5(a) depicts performance as we increase the core count, normalized to the performance of one host CPU core, for two representative functions from Class 1a (HSJNPO and LIGPrkEmd; we see similar trends for all functions in the class). We make three observations from the figure. First, as the number of host CPU cores increases, performance eventually stops increasing significantly. For HSJNPO, host CPU performance increases by 27.5× going from 1 to 64 host CPU cores but by only 27% going from 64 host CPU cores to 256 host CPU cores. For LIGPrkEmd, host CPU performance increases by 33× going from 1 to 64 host CPU cores but decreases by 20% going from 64 to 256 host CPU cores. We find that the lack of performance improvement at large host CPU core counts is due to main memory bandwidth saturation, as shown in Figure 6. Given the limited DRAM bandwidth available across the off-chip memory channel, we find that Class 1a functions saturate the DRAM bandwidth once enough host CPU cores (e.g., 64) are used, and thus these functions are bottlenecked by the DRAM bandwidth. Second, the host CPU system with prefetcher slows down the execution of the HSJNPO (LIGPrkEmd) function compared with the host CPU system without prefetcher by 43% (38%), on average across all core counts. The prefetcher is ineffective since these functions have low temporal and spatial locality. Third, when running on the NDP system, the functions see continued performance improvements as the number of NDP cores increases. By providing the functions with access to the much higher bandwidth available inside memory, the NDP system can greatly outperform the host CPU system at a high enough core count. For example, at 64/256 cores, the NDP system outperforms the host CPU system by 1.7×/4.8× for HSJNPO, and by 1.5×/4.1× for LIGPrkEmd.
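The scaling figures quoted above can be chained into a short worked calculation that makes the saturation explicit. This is derived arithmetic rather than additional measured data: it assumes the 64-to-256-core changes (+27% and -20%) are relative to 64-core performance, and that the NDP speedups are relative to the host CPU system at the same core count.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Speedup over one core divided by the number of cores."""
    return speedup / cores


# Host CPU speedups over one host core, from the numbers quoted in the text.
hsjnpo_host_64 = 27.5
hsjnpo_host_256 = hsjnpo_host_64 * 1.27        # +27% from 64 to 256 cores
ligprkemd_host_64 = 33.0
ligprkemd_host_256 = ligprkemd_host_64 * 0.80  # -20% from 64 to 256 cores

# NDP speedups over one host core at 256 cores (NDP is 4.8x / 4.1x the host).
hsjnpo_ndp_256 = hsjnpo_host_256 * 4.8
ligprkemd_ndp_256 = ligprkemd_host_256 * 4.1

print(f"HSJNPO host efficiency at 64 cores:  {parallel_efficiency(hsjnpo_host_64, 64):.2f}")
print(f"HSJNPO host efficiency at 256 cores: {parallel_efficiency(hsjnpo_host_256, 256):.2f}")
print(f"HSJNPO NDP speedup at 256 cores:     {hsjnpo_ndp_256:.1f}x")
print(f"LIGPrkEmd NDP speedup at 256 cores:  {ligprkemd_ndp_256:.1f}x")
```

Under these assumptions, host parallel efficiency for HSJNPO drops sharply between 64 and 256 cores, which is consistent with the bandwidth saturation described above.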
Figure 7 depicts the energy breakdown for our two representative functions. We make two observations from the figure. First, for HSJNPO, the energy spent on DRAM is similar for the host CPU system and the NDP system. This is due to the function's poor locality, as 98% of its memory requests miss in the L1 cache. Since LFMR is near 1, L1 miss requests almost always miss in the L2 and L3 caches and go to DRAM in the host CPU system for all core counts we evaluate, which requires significant energy to query the large caches and then to perform off-chip data transfers. The NDP system does not access L2, L3, and off-chip links, leading to a large system energy reduction. Second, for LIGPrkEmd, the DRAM energy is higher in the NDP system than in the host CPU system. Since the function's LFMR is 0.7, some memory requests that would be cache hits in the host CPU's L2 and L3 caches are instead sent directly to DRAM in the NDP system. However, the total energy consumption on the host CPU system is still larger than that on the NDP system, again because the NDP system eliminates the L2, L3, and off-chip link energy.
DRAM bandwidth-bound applications such as those in Class 1a have been the primary focus of a large number of proposed NDP architectures (e.g., [1,46,54,69,76,132,133,192,222,223]), as they benefit from increased main memory bandwidth and do not have high AI (and, thus, do not benefit from complex cores on the host CPU system). An NDP architecture for a function in Class 1a needs to extract enough MLP [57,176,181-184,224-229] to maximize the usage of the available internal memory bandwidth. However, prior work has shown that this can be challenging due to the area and power constraints in the logic layer of a 3D-stacked DRAM [1,46]. To exploit the high memory bandwidth while satisfying these area and power constraints, the NDP architecture should leverage application memory access patterns to efficiently maximize main memory bandwidth utilization. We find that there are two dominant types of memory access patterns among our Class 1a functions. First, functions with regular access patterns (DRKYolo, STRAdd, STRCpy, STRSca, STRTriad) can take advantage of specialized accelerators or Single Instruction Multiple Data (SIMD) architectures [1,66], which can exploit the regular access patterns to issue many memory requests concurrently. Such accelerators or SIMD architectures have hardware area and thermal dissipation that fall well within the constraints of 3D-stacked DRAM [1,46,64,230]. Second, functions with irregular access patterns (HSJNPO, LIGCompEms, LIGPrkEmd, LIGRadiEms) require techniques to extract MLP while still fitting within the design constraints. This requires techniques that cater to the irregular memory access patterns, such as prefetching algorithms designed for graph processing [46,231-235], pre-execution of difficult access patterns [57,58,151,183,184,236-243], or hardware accelerators for pointer chasing [55,56,149,193,244-246].

Class 1b: Low Temporal Locality, Low AI, High LFMR, and Low MPKI (DRAM Latency-Bound Functions)
While functions in this class do not effectively use the host CPU caches, they do not exert high pressure on the main memory due to their low MPKI. Across all Class 1b functions, the average DRAM bandwidth consumption is only 0.5 GB/s. However, all the functions have very high LFMR values (the minimum is 0.94 for CHAHsti), indicating that the host CPU L2 and L3 caches are ineffective. Because the functions cannot exploit significant MLP but still incur long-latency requests to DRAM, the DRAM requests fall on the critical path of execution and stall forward progress [57,58,151,176,247]. Thus, Class 1b functions are bottlenecked by DRAM latency. Figure 5(b) shows performance of both the host CPU system and the NDP system for two representative functions from Class 1b (CHAHsti and PLYalu). We observe that while performance of both the host CPU system and the NDP system scale well as the core count increases, NDP system performance is always higher than the host CPU system performance for the same core count. The maximum (average) speedup with NDP over host CPU at the same core count is 1.15× (1.12×) for CHAHsti and 1.23× (1.13×) for PLYalu.
We find that the NDP system's improved performance is due to a reduction in the Average Memory Access Time (AMAT) [248]. Figure 8 shows the AMAT for our two representative functions. Memory accesses take significantly longer in the host CPU system than in the NDP system due to the additional latency of looking up requests in the L2 and L3 caches, even though data is rarely present in those caches, and going through the off-chip links. Figure 9 shows the energy breakdown for Class 1b representative functions. Similar to Class 1a, we observe that the L2/L3 caches and off-chip links are a large source of energy usage in the host CPU system. While DRAM energy increases in the NDP system, as L2/L3 hits in the host CPU system become DRAM lookups with NDP, the overall energy consumption in the NDP system is much smaller (by 69% maximum and 39% on average) due to the lack of L2 and L3 caches. Class 1b functions benefit from the NDP system, but primarily because of the lower memory access latency (and energy) that the NDP system provides for memory requests that need to be serviced by DRAM. These functions could benefit from other latency and energy reduction techniques, such as L2/L3 cache bypassing [51,249-260], low-latency DRAM [15,22-26,89,127,261-276], and better memory access scheduling [153-157,175-177,247,277-290]. However, they generally do not benefit significantly from prefetching (as seen in Figure 5(b)), since infrequent memory requests make it difficult for the prefetcher to successfully train on an access pattern.
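The AMAT argument can be made concrete with the standard recursive AMAT formula, in which each cache level contributes its lookup latency plus its miss rate times the cost of the next level. The sketch below is illustrative only: all latencies and miss rates are hypothetical placeholders, not the simulated configuration used in this work. They are chosen so that nearly every L1 miss also misses in L2 and L3, as in Class 1b.

```python
from typing import List, Tuple


def amat(levels: List[Tuple[float, float]], memory_latency: float) -> float:
    """Average memory access time for a cache hierarchy.

    levels: (lookup_latency, miss_rate) per cache level, ordered from L1 outward.
    memory_latency: latency of an access that reaches DRAM (including any links).
    """
    total = memory_latency
    # Fold from the outermost cache inward: time = lookup + miss_rate * next level.
    for lookup_latency, miss_rate in reversed(levels):
        total = lookup_latency + miss_rate * total
    return total


# Hypothetical cycle counts and miss rates (not from the evaluated systems).
# Host: L1, L2, L3, then DRAM over off-chip links; L2/L3 almost never hit.
host_amat = amat([(4, 0.5), (10, 1.0), (30, 1.0)], memory_latency=140)
# NDP: L1 only, then DRAM without the L2/L3 lookups or off-chip links.
ndp_amat = amat([(4, 0.5)], memory_latency=100)
print(host_amat, ndp_amat)
```

With these placeholder values, the host pays the L2 and L3 lookup latencies on every L1 miss without getting hits in return, so its AMAT is higher than that of the NDP configuration.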

Class 1c: Low Temporal Locality, Low AI, Decreasing LFMR with Core Count, and Low MPKI (L1/L2 Cache Capacity Bottlenecked Functions)
We find that the behavior of functions in this class depends on the number of cores they are using. Figure 5(c) shows the host CPU system and the NDP system performance as we increase the core count for two representative functions (DRKRes and PRSFlu). We make two observations from the figure. First, at low core counts, the NDP system outperforms the host CPU system. With a low number of cores, the functions have medium to high LFMR (0.5 for DRKRes at 1 and 4 host CPU cores; 0.97 at 1 host CPU core and 0.91 at 4 host CPU cores for PRSFlu), and behave like Class 1b functions, where they are DRAM latency-sensitive. Second, as the core count increases, the host CPU system begins to outperform the NDP system. For example, beyond 16 (64) cores, the host CPU system outperforms the NDP system for DRKRes (PRSFlu). This is because as the core count increases, the aggregate L1 and L2 cache size available at the host CPU system grows, which reduces the miss rates of both L2 and L3 caches. As a result, the LFMR decreases significantly (e.g., at 256 cores, LFMR is 0.09 for DRKRes and 0.35 for PRSFlu). This indicates that the available L1/L2 cache capacity bottlenecks Class 1c functions.
Figure 10 shows the energy breakdown for Class 1c functions. We make three observations from the figure. First, for functions with larger LFMR values (PRSFlu), the NDP system provides energy savings over the host CPU system at lower core counts, since the NDP system eliminates the energy consumed due to L3 and off-chip link accesses. Second, for functions with smaller LFMR values (DRKRes), the NDP system does not provide energy savings even for low core counts. Due to the medium LFMR, enough requests still hit in the host CPU system L2/L3 caches, and these cache hits become DRAM accesses in the NDP system, which consume more energy than the cache hits. Third, at high-enough core counts, the NDP system consumes more energy than the host CPU system for all Class 1c functions. As the LFMR decreases, the functions effectively utilize the caches in the host CPU system, reducing the off-chip traffic and, consequently, the energy Class 1c functions spend on accessing DRAM. The NDP system, which does not have L2 and L3 caches, pays the larger energy cost of a DRAM access for all L2/L3 hits in the host CPU system. We find that the primary source of the memory bottleneck in Class 1c functions is limited L1/L2 cache capacity. Therefore, while the NDP system improves performance and energy of some Class 1c functions at low core counts (with lower associated L1/L2 cache capacity), the NDP system does not provide performance and energy benefits across all core counts for Class 1c functions.

Class 2a: High Temporal Locality, Low AI, Increasing LFMR with Core Count, and Low MPKI (L3 Cache Contention Bottlenecked Functions)
Like Class 1c functions, the behavior of the functions in this class depends on the number of cores that they use. Figure 5(d) shows the host CPU system and the NDP system performance as we increase the core count for two representative functions (PLYGramSch and SPLFftRev). We make two observations from the figure. First, at low core counts, the functions do not benefit from the NDP system. In fact, for a single core (16 cores), PLYGramSch slows down by 67% (3×) when running on the NDP system, compared to running on the host CPU system. This is because, at low core counts, these functions make reasonably good use of the cache hierarchy, with LFMR values of 0.03 for PLYGramSch and lower than 0.44 for SPLFftRev up to 16 host CPU cores. We confirm this in Figure 11, where we see that very few memory requests for PLYGramSch and SPLFftRev go to DRAM (5% for PLYGramSch, and at most 13% for SPLFftRev) at core counts lower than 16. Second, at high core counts (i.e., 64 for PLYGramSch and 256 for SPLFftRev), the host CPU system performance starts to decrease. This is because Class 2a functions are bottlenecked by cache contention. At 256 cores, this contention undermines the cache effectiveness and causes the LFMR to increase to 0.97 for PLYGramSch and 0.93 for SPLFftRev. With the last-level cache rendered essentially ineffective, the NDP system greatly improves performance over the host CPU system: by 2.23× for PLYGramSch and 3.85× for SPLFftRev at 256 cores. One impact of the increased cache contention is that it converts these high-temporal-locality functions into memory latency-bound functions. We find that with the increased number of requests going to DRAM due to cache contention, the AMAT increases significantly, in large part due to queuing at the memory controller. At 256 cores, the queuing becomes so severe that a large fraction of requests (24% for PLYGramSch and 67% for SPLFftRev) must be reissued because the memory controller queues are full. The increased main memory bandwidth available to the NDP cores allows the NDP system to issue many more requests concurrently, which reduces the average length of the queue and, thus, the main memory latency. The NDP system also reduces memory access latency by eliminating the L2/L3 cache lookup and interconnect latencies.
Figure 12 shows the energy breakdown for the two representative Class 2a functions. We make two observations. First, the host CPU system is more energy-efficient than the NDP system at low core counts, as most of the memory requests are served by on-chip caches in the host CPU system. Second, the NDP system provides large energy savings over the host CPU system at high core counts. This is due to the increased cache contention, which increases the number of off-chip requests that the host CPU system must make, increasing the L3 and off-chip link energy. We conclude that cache contention is the primary scalability bottleneck for Class 2a functions, and the NDP system can provide an effective way of mitigating this cache contention bottleneck without incurring the high area and energy overheads of providing additional cache capacity in the host CPU system, thereby improving the scalability of these applications to high core counts.
3.3.5 Class 2b: High Temporal Locality, Low AI, Low/Medium LFMR, and Low MPKI (L1 Cache Capacity Bottlenecked Functions)
Figure 5(e) shows the host CPU system and the NDP system performance for PLYgemver and SPLLucb. We make two observations from the figure. First, as the number of cores increases, performance of the host CPU system and the NDP system scale in a very similar fashion. The NDP system and the host CPU system perform essentially on par with (i.e., within 1% of) each other at all core counts. Second, even though the NDP system does not provide any performance improvement for Class 2b functions, it also does not hurt performance. Figure 13 shows the AMAT for our two representative functions. When PLYgemver executes on the host CPU system, up to 77% of the memory latency comes from accessing L3 and DRAM, which can be explained by the function's medium LFMR (0.5). For SPLLucb, even though up to 73% of memory latency comes from L1 accesses, some requests still hit in the L3 cache (its LFMR is 0.2), translating to around 10% of the memory latency. However, the latency that comes from L3 + DRAM for the host CPU system is similar to the latency to access DRAM in the NDP system, resulting in similar performance between the host CPU system and the NDP system. We make a similar observation for the energy consumption for the host CPU system and the NDP system (Figure 14). Even though a small number of memory requests hit in L3, the total energy consumption for both the host CPU system and the NDP system is similar due to L3 and off-chip link energy. For some functions in Class 2b, we observe that the NDP system slightly reduces energy consumption compared to the host CPU system. For example, the NDP system provides a 12% average reduction in energy consumption, across all core counts, compared to the host CPU system for PLYgemver.
We conclude that while the NDP system does not solve any memory bottlenecks for Class 2b functions, it can be used to reduce the overall SRAM area in the system without any performance or energy penalty (and sometimes with energy savings).

Class 2c: High Temporal Locality, High AI, Low LFMR, and Low MPKI (Compute-Bound Functions)
Aside from one exception (PLYSymm), all of the 11 functions in this class exhibit high temporal locality. When combined with the high AI and low memory intensity, we find that these characteristics significantly impact how the NDP system performance scales for this class. Figure 5(f) shows the host CPU system and the NDP system performance for HPGSpm and RODNw, two representative functions from the class. We make two observations from the figure. First, the host CPU system performance is always greater than the NDP system performance (by 44% for HPGSpm and 54% for RODNw, on average). The high AI (more than 12 ops per cache line), combined with the high temporal locality and low MPKI, enables these functions to make excellent use of the host CPU system resources. Second, both of the functions benefit greatly from prefetching in the host CPU system. This is a direct result of these functions' high spatial locality, which allows the prefetcher to be highly accurate and effective in predicting which lines to retrieve from main memory.
Figure 15 shows the energy breakdown for the two representative Class 2c functions. We make two observations. First, the host CPU system is 77% more energy-efficient than the NDP system for HPGSpm, on average across all core counts. Second, the NDP system provides energy savings over the host CPU system at high core counts for RODNw (up to 65% at 256 cores). When the core count increases, the aggregate L1 cache capacity across all cores increases as well, which in turn decreases the number of L1 cache misses. Compared to executing on a single core, executing on 256 cores decreases the L1 cache miss count by 43%, reducing the memory subsystem energy consumption by 40%. However, due to RODNw's medium LFMR of 0.5, the host CPU system still suffers from L2 and L3 cache misses at high core counts, which require the large L3 and off-chip link energy. In contrast, the NDP system eliminates the energy of accessing the L3 cache and the off-chip link energy by directly sending L1 cache misses to DRAM, which, at high core counts, leads to lower energy consumption than the host CPU system. We conclude that Class 2c functions do not experience large memory bottlenecks and are not a good fit for the NDP system in terms of performance. However, the NDP system can sometimes provide energy savings for functions that experience medium LFMR.

Effect of the Last-Level Cache Size
The bottleneck classification we present in Section 3.3 depends on two key architecture-dependent metrics (LFMR and MPKI) that are directly affected by the parameters and the organization of the cache hierarchy. Our analysis in Section 3.3 partially evaluates the effect of caching by scaling the aggregated size of the private (L1/L2) caches with the number of cores in the system while maintaining the size of the L3 cache fixed at 8 MB for the host CPU system. However, we also need to understand the impact of the L3 cache size on our bottleneck classification analysis. To this end, this section evaluates the effects on our bottleneck classification analysis of using an alternative cache hierarchy configuration, where we employ a Non-Uniform Cache Architecture (NUCA) [291] model to scale the size of the L3 cache with the number of cores in the host CPU system.
In this configuration, we maintain the sizes of the private L1 and L2 caches (32 kB and 256 kB per core, respectively) while increasing the shared L3 cache size with the core count (we use 2 MB/core) in the host CPU system. The cores, shared L3 caches, and DRAM memory controller are interconnected using a 2D-mesh Network-on-Chip (NoC) [292-299] of size ( + 1) × ( + 1) (an extra interconnection dimension is added to place the DRAM memory controllers). To faithfully simulate the NUCA model (e.g., including network contention in our simulations), we integrate the M/D/1 network model proposed by ZSim++ [300] in our DAMOV simulator [158]. We use a latency of 3 cycles per hop in our analysis, as suggested by prior work [301]. We adapt our energy model to account for the energy consumption of the NoC in the NUCA system. We consider router energy consumption of 63 pJ per request and energy consumed per link traversal of 71 pJ, same as previous work [251].
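The NUCA parameters above translate into a few simple per-request cost functions. The following is a minimal sketch that uses only the constants stated in the text (2 MB of L3 per core, 3 cycles per hop, 63 pJ router energy per request, 71 pJ per link traversal). The function names are hypothetical, the hop count is taken as an input rather than derived from the mesh dimensions, and contention (which the M/D/1 network model accounts for) is not modeled.

```python
# Constants quoted in the text; everything else in this sketch is an assumption.
L3_MB_PER_CORE = 2
HOP_LATENCY_CYCLES = 3
ROUTER_ENERGY_PJ_PER_REQUEST = 63
LINK_ENERGY_PJ_PER_TRAVERSAL = 71


def l3_size_mb(cores: int) -> int:
    """Shared L3 capacity when it scales at 2 MB per core."""
    return L3_MB_PER_CORE * cores


def noc_latency_cycles(hops: int) -> int:
    """Contention-free NoC latency for a request that travels `hops` links."""
    return HOP_LATENCY_CYCLES * hops


def noc_energy_pj(hops: int) -> int:
    """NoC energy for one request: router energy plus one charge per link traversal."""
    return ROUTER_ENERGY_PJ_PER_REQUEST + LINK_ENERGY_PJ_PER_TRAVERSAL * hops


for cores in (1, 4, 16, 64, 256):
    print(f"{cores:>3} cores -> {l3_size_mb(cores)} MB L3")
print(noc_latency_cycles(4), "cycles and", noc_energy_pj(4), "pJ for a 4-hop request")
```

At 256 cores this yields the 512 MB L3 capacity referred to in the rest of this section, and it shows how both latency and energy per L3 request grow linearly with the number of hops.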
Figure 16 shows the performance scalability curves for representative functions from each one of our bottleneck classes presented in Section 3.3 for the baseline host CPU system (Host with 8MB Fixed LLC), the host CPU NUCA system (Host with NUCA 2MB/Core LLC), and the NDP system. We make two observations. First, the observations we make for our bottleneck classification (Section 3.3) are not affected by increasing the L3 cache size for Classes 1a, 1b, 1c, 2b, and 2c. We observe that Class 1a functions benefit from a large L3 cache size (by up to 1.9×/2.3× for HSJNPO/LIGPrkEmd at 256 cores). However, the NDP system still provides performance benefits compared to the host CPU NUCA system. We observe that increasing the L3 size reduces some of the pressure on main memory but cannot fully reduce the DRAM bandwidth bottleneck for Class 1a functions. Functions in Class 1b do not benefit from extra L3 capacity (we do not observe a decrease in LFMR or MPKI). Functions in Class 1c do not benefit from extra L3 cache capacity. We observe that the private L1 and L2 caches capture most of their data locality, as mentioned in Section 3.3.3, and thus, these functions do not benefit from increasing the L3 size. Functions in Class 2b do not benefit from extra L3 cache capacity, which can even lead to a decrease in performance at high core counts for the host CPU NUCA system in some Class 2b functions due to long NUCA L3 access latencies. For example, we observe that PLYgemver's performance drops 18% when increasing the core count from 64 to 256 in the host CPU NUCA system. We do not observe such a performance drop for the host CPU system with fixed LLC size. The performance drop in the host CPU NUCA system is due to the increase in the number of hops that L3 requests need to travel in the NoC at high core counts, which increases the function's AMAT. Class 2c functions benefit from a larger last-level cache. We observe that their performance improves by 1.3×/1.2× for HPGSpm/RODNw compared to the host CPU system with 8MB fixed LLC at 256 cores. Second, we observe two different types of behavior for functions in Class 2a. Since cache conflicts are the major bottleneck for functions in this class, we observe that increasing the L3 cache size can mitigate this bottleneck. In Figure 16, we observe that for both PLYGramSch and SPLFftRev, the host system with NUCA 2MB/Core LLC provides better performance than the host system with 8MB fixed LLC. However, the NDP system can still provide performance benefits in case of contention on the L3 NoC (e.g., in SPLFftRev). For example, the NDP system provides 14% performance improvement for SPLFftRev compared to the NUCA system (with 512 MB L3 cache) for 256 cores.
In summary, we conclude that the key takeaways and observations we present in our bottleneck classification in Section 3.3 are also valid for a host system with a shared last-level cache whose size scales with core count. In particular, different workload classes get affected by an increase in L3 cache size as expected by their characteristics distilled by our classification.
Figure 17 shows the energy consumption for representative functions from each one of our bottleneck classes presented in Section 3.3. We observe that the NDP system can provide substantial energy savings for functions in different bottleneck classes, even compared against a system with very large (e.g., 512 MB) cache sizes. We make the following observations for each bottleneck class:
• Class 1a: First, for both representative functions in this bottleneck class, the host CPU NUCA system and the NDP system reduce energy consumption compared to the baseline host CPU system. However, we observe that the NDP system provides larger energy savings than the host CPU NUCA system. On average, across all core counts, the NDP system and the host CPU NUCA system reduce energy consumption compared to the host CPU system for HSJNPO/LIGPrkEmd by 46%/65% and 25%/22%, respectively. Second, at 256 cores, the host CPU NUCA system provides larger energy savings than the NDP system for both representative functions. This happens because at 256 cores, the large L3 cache (i.e., 512 MB) captures a large portion of the dataset for these functions, reducing costly DRAM traffic. The host CPU NUCA system reduces energy consumption compared to the host CPU system for HSJNPO/LIGPrkEmd at 256 cores by 2.0×/2.2×, while the NDP system reduces energy consumption by 1.6×/1.8×. The L3 cache capacity needed to make the host CPU NUCA system more energy efficient than the NDP system is very large (512 MB SRAM), which is likely not cost-effective.
• Class 1b: First, for CHAHsti, the host CPU NUCA system increases energy consumption compared to the host CPU system by 9%, on average across all core counts. In contrast, the NDP system reduces energy consumption by 57%. Due to its low spatial and temporal locality (Figure 3), this function does not benefit from a deep cache hierarchy. In the host CPU NUCA system, the extra energy from the large amount of NoC traffic further increases the cache hierarchy's overall energy consumption. Second, for PLYalu, the host CPU NUCA system and the NDP system reduce energy consumption compared to the host CPU system by 76% and 23%, respectively, on average across all core counts. Even though the increase in LLC size does not translate to performance improvements, the large LLC sizes in the host CPU NUCA system help reduce DRAM traffic, thereby providing energy savings compared to the baseline host CPU system.
• Class 1c: First, for DRKRes, the host CPU NUCA system reduces energy consumption compared to the host CPU system by 15%, on average across all core counts. In contrast, the NDP system increases energy consumption by 30%, which is due to the function's medium LFMR (Section 3.3.3). Second, for PRSFlu, we observe that the NDP system provides larger energy savings than the host CPU NUCA system. The host CPU NUCA system reduces energy consumption compared to the host CPU system by 21%, while the NDP system reduces energy consumption by 25%, on average across all core counts. However, the energy savings of both the host CPU NUCA and the NDP systems compared to the host CPU system shrink at high-enough core counts (the energy consumption of the host CPU NUCA system (NDP system) is 0.6× (0.9×) that of the host CPU system at 64 cores and 1.1× (1.3×) that of the host CPU system at 256 cores). This result is expected for Class 1c functions since the functions in this class have decreasing LFMR, i.e., the functions effectively utilize the private L1/L2 caches in the host CPU system at high-enough core counts.
• Class 2a: First, for PLYGramSch, compared to the host CPU system, the host CPU NUCA system reduces energy consumption by 2.53× and the NDP system increases energy consumption by 55%, on average across all core counts. Even though at high core counts (64 and 256 cores) the host CPU NUCA system provides larger energy savings than the NDP system compared to the host CPU system (the host CPU NUCA system and the NDP system reduce energy consumption compared to the host CPU system by 9× and 65%, respectively, averaged across 64 and 256 cores), such large energy savings come at the cost of very large (e.g., 512 MB) cache sizes. Second, for SPLFftRev, the host CPU NUCA system and the NDP system reduce energy consumption compared to the host CPU system by 42% and 7%, respectively, on average across all core counts. The NDP system increases energy consumption compared to the host CPU system at low core counts (an increase of 33%, averaged across 1, 4, and 16 cores). However, it provides similar energy savings as the host CPU NUCA system for large core counts (99% and 75% energy reduction compared to the host CPU system for the host CPU NUCA system and the NDP system, respectively, averaged across 64 and 256 cores). Since the function suffers from high network contention, the increase in core count increases NoC traffic, which in turn increases energy consumption for the host CPU NUCA system. We conclude that the NDP system provides energy savings for Class 2a applications compared to the host CPU system at lower cost than the host CPU NUCA system.
• Class 2b: First, for PLYgemver, the host CPU NUCA system increases energy consumption compared to the host CPU system by 2%, on average across all core counts. In contrast, the NDP system reduces energy consumption by 13%. This function does not benefit from large L3 cache sizes since Class 2b functions are bottlenecked by L1 capacity. Thus, the NoC only adds extra static and dynamic energy consumption. Second, for SPLLucb, the host CPU NUCA system consumes the same energy as the host CPU system while the NDP system increases energy consumption by 5%, averaged across all core counts.
• Class 2c: For both representative functions in this class, the host CPU NUCA system reduces energy consumption compared to the host CPU system while the NDP system increases energy consumption. For HPGSpm/RODNw, the host CPU NUCA system reduces energy consumption by 6%/9% while the NDP system increases energy consumption by 74%/22%, averaged across all core counts. This result is expected since Class 2c functions are compute-bound and highly benefit from a deep cache hierarchy.

Figure 17: Energy of the host and the NDP system as we vary the LLC size (y-axis: Energy (J); panel (f): Class 2c, RODNw). Host refers to the host system with a fixed 8 MB LLC size; Host NUCA refers to the host system with 2 MB/core LLC.
In conclusion, the NDP system can provide substantial energy savings for functions in different bottleneck classes, even compared against a system with very large (e.g., 512 MB) cache sizes.

Validation and Summary of Our Workload Characterization Methodology
In this section, we present the validation and a summary of our new workload characterization methodology. First, we validate the methodology by calculating the accuracy of our workload classification on the remaining 100 memory-bound functions we obtain from Step 1 (see Section 3.1), which were not used to identify the six classes we found and described in Section 3.3. Second, we present a summary of the key metrics we obtain for all 144 memory-bound functions, including our analysis of the host CPU system and the NDP system using two types of cores (in-order and out-of-order).

Validation of Our Workload Characterization Methodology
Our goal is to evaluate the accuracy of our workload characterization methodically on a large set of functions. To this end, we apply Step 2 and Step 3 of our memory bottleneck classification methodology (as described in Sections 2.3 and 2.4) to the remaining 100 memory-bound functions we obtain from Step 1 (in Section 3.1). Then, we perform a two-phase validation to calculate the accuracy of our workload characterization.
In phase 1 of our validation, we calculate the threshold values that define the low/high boundaries of each of the four metrics we use to cluster the initial 44 functions in the six memory bottleneck classes in Section 3.3 (i.e., temporal locality, LFMR, LLC MPKI, and AI). We also include the LFMR curve slope to indicate whether the LFMR increases, decreases, or stays constant as we scale the core count. We calculate the threshold value for a metric M by computing the midpoint between (i) the average value of M across the memory bottleneck classes with low values of M and (ii) the average value of M across the memory bottleneck classes with high values of M, both computed over the initial 44 functions.
In phase 2 of our validation, we calculate the accuracy of our workload characterization by classifying the remaining 100 memory-bound functions using the threshold values obtained from phase 1 and the LFMR curve slope. After phase 2, a function is considered to be accurately classified into a memory bottleneck class if and only if it (1) fits the definition of the assigned class using the threshold values obtained from phase 1 and (2) follows the expected performance trends of the assigned class when the function is executed in the host CPU system and the NDP system. For example, a function is correctly classified into Class 1a if and only if (1) it displays low temporal locality, low AI, high LFMR, and high MPKI, and (2) the NDP system outperforms the host CPU system as we scale the core count when executing the function. The final accuracy of our workload characterization methodology is the percentage of the functions that are accurately classified into one of the six memory bottleneck classes.
First, by applying phase 1 of our two-phase validation, we obtain the following threshold values: 0.48 for temporal locality, 0.56 for LFMR, 11.0 for MPKI, and 8.5 for AI. Second, by applying phase 2 of our two-phase validation, we find that we can accurately classify 97% of the 100 memory-bound functions into one of our six memory bottleneck classes (i.e., the accuracy of our workload characterization methodology is 97%). We observe that three functions (Ligra:ConnectedComponents:compute:rMat, Ligra:MaximalIndependentSet:edgeMapDense:USA, and SPLASH-2:Oceanncp:relax) could not be accurately classified into their correct memory bottleneck class (Class 1a). We observe that these functions have LLC MPKI values lower than the MPKI threshold expected for Class 1a functions. We expect that the accuracy of our methodology can be further improved by incorporating more workloads into our workload suite and fine-tuning each metric to encompass an even larger set of applications.
We conclude that our workload characterization methodology can accurately classify a given new application/function into its appropriate memory bottleneck class.
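The two validation phases can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' tooling: the function names, the dictionary keys, the strict/non-strict comparison at the boundary, and the example per-class averages are all assumptions. Only the four threshold values and the Class 1a definition come from the text, and the sketch covers condition (1) of the accuracy criterion only; condition (2), the performance trend on the host CPU and NDP systems, requires simulation results.

```python
from statistics import mean

def midpoint_threshold(low_class_means, high_class_means):
    """Phase 1: midpoint between the average of the 'low' classes
    and the average of the 'high' classes for one metric."""
    return (mean(low_class_means) + mean(high_class_means)) / 2.0

# Threshold values reported in the text for the initial 44 functions.
THRESHOLDS = {"temporal_locality": 0.48, "lfmr": 0.56, "mpki": 11.0, "ai": 8.5}

def fits_class_1a(metrics, thresholds=THRESHOLDS):
    """Phase 2, condition (1) for Class 1a: low temporal locality,
    low AI, high LFMR, high MPKI. Treating 'high' as >= threshold
    is an assumption of this sketch."""
    return (metrics["temporal_locality"] < thresholds["temporal_locality"]
            and metrics["ai"] < thresholds["ai"]
            and metrics["lfmr"] >= thresholds["lfmr"]
            and metrics["mpki"] >= thresholds["mpki"])

def accuracy_percent(num_correct, num_total):
    """Percentage of functions that are accurately classified."""
    return 100.0 * num_correct / num_total

if __name__ == "__main__":
    # Hypothetical per-class MPKI averages, chosen only for illustration.
    print(midpoint_threshold([2.0, 4.0], [16.0, 22.0]))
    # Hypothetical function profile.
    profile = {"temporal_locality": 0.2, "ai": 1.5, "lfmr": 0.9, "mpki": 25.0}
    print(fits_class_1a(profile))
    print(accuracy_percent(97, 100))
```

A function such as the three misclassified ones above would fail `fits_class_1a` because its MPKI falls below the 11.0 threshold even though its other metrics match the class.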

Summary of Our Workload Characterization Results
Figure 18a summarizes the metrics we collect for all 144 functions across all core counts (i.e., from 1 to 256 cores) and different core microarchitectures (i.e., out-of-order and in-order cores). The figure shows the distribution of the key metrics we use during our workload characterization for each memory bottleneck class in Section 3.3, including architecture-independent metrics (i.e., temporal locality) and architecture-dependent metrics (i.e., AI, LFMR, and LLC MPKI). We report the architecture-dependent metrics for two core models: (i) in-order and (ii) out-of-order cores (see footnote 12). Together with the out-of-order core model that we use in Section 3.3, we incorporate an in-order core model into our analysis, so as to show that our memory bottleneck classification methodology focuses on data movement requirements and works independently of the core microarchitecture. Figure 18b shows the distribution of speedups we observe when we offload the function to our general-purpose NDP cores, while employing the same core type as the host CPU system.
We make two key observations from Figure 18. First, we observe similar values for each architecture-dependent key metric (i.e., LFMR, MPKI, AI) regardless of core type for all 144 functions (in Figure 18a). Second, we observe that the NDP system achieves similar speedups over the host CPU system when using both in-order and out-of-order core configurations (in Figure 18b). The speedup provided by the NDP system compared to the host CPU system when both systems use out-of-order (in-order) cores for Classes 1a, 1b, 1c, 2a, 2b, and 2c is 1.59 (1.77), 1.22 (1.15), 0.96 (0.95), 1.04 (1.22), 0.94 (1.01), and 0.56 (0.76), respectively, on average across all core counts and functions within a memory bottleneck class. The NDP system greatly outperforms the host CPU system across all core counts for Class 1a and 1b functions, with a maximum speedup for the out-of-order (in-order) core model of 4.8 (3.5) and 3.4 (2.9), respectively. The NDP system greatly outperforms the host CPU system at low core counts for Class 1c functions and at high core counts for Class 2a functions, with a maximum speedup for the out-of-order (in-order) core model of 2.3 (2.4) and 3.8 (3.4), respectively. The NDP system provides a modest speedup compared to the host CPU system across all core counts for Class 2b functions and a slowdown for Class 2c functions, with a maximum speedup for the out-of-order (in-order) core model of 1.2 (1.1) and 1.0 (1.0), respectively. We observe that, averaged across all classes and core types, the average speedup provided by the NDP system using in-order cores is 11% higher than the average speedup offered by the NDP system using out-of-order cores. This is because the host CPU system with out-of-order cores can hide the performance impact of memory access latency to some degree (e.g., using dynamic instruction scheduling) [57,58,183,184,240,302]. On the other hand, the host CPU system using in-order cores has little ability to hide memory access latency [57,58,183,184,240,302].
We conclude that our methodology to classify memory bottlenecks of applications is robust and effective since we observe similar trends for the six memory bottleneck classes across a large range of (144) functions and two very different core models.

Limitations of Our Methodology
We identify three limitations to our workload characterization methodology. We discuss each limitation next.
NDP Architecture Design Space. Our methodology uses the same type and number of cores in the host CPU and the NDP system configurations for our scalability analysis (Section 3.3) because our main goal is to highlight the performance and energy differences between the host CPU system and the NDP system that are caused by data movement. We do not consider practical limitations related to area or thermal dissipation that could affect the type and the maximum number of cores in the NDP system, because our goal is not to propose NDP architectures but to characterize data movement and understand the different data movement bottlenecks in modern workloads. Proposing NDP architectures for the workload classes that our methodology identifies as suitable for NDP is a promising topic for future work.
Function-level Analysis. We choose to conduct our analysis at a function granularity rather than at the application granularity for two major reasons. First, general-purpose NDP architectures are typically leveraged as accelerators to which only parts of the application or specific functions are offloaded [1,47,48,54,59,64,65,83,86,89,92,98,100,102,133,192,193,303-307], rather than the entire application. Functions typically form natural boundaries for parts of algorithms/applications that can potentially be offloaded. Second, it is well-known that applications go through distinct phases during execution. Each phase may have different characteristics (e.g., a phase might be more compute-bound, while another one might be more memory-bound) and thus fall into different classes in our analysis. A fine-grained analysis at the function level enables us to identify each of those phases and hence identify more fine-grained opportunities for NDP offloading. However, the main drawback of function-level analysis is that it does not take into account data movement across function boundaries, which affects the performance and energy benefits the NDP system provides over the host CPU system. For example, the NDP system might hurt overall system performance and energy consumption when a large amount of data needs to be continuously moved between a function executing on the NDP cores and another executing on the host CPU cores [63,74].
Overestimating NDP Potential. Offloading kernels to NDP cores incurs overheads that our analysis does not account for (e.g., maintaining coherence between the host CPU and the NDP cores [63,74], efficiently synchronizing computation across NDP cores [101,140], providing virtual memory support for the NDP system [47,55,308], and dynamic offloading support for NDP-friendly functions [48]). Such overheads can impact the performance benefits NDP can provide when considering the end-to-end application. However, deciding whether and how to offload computation to NDP is an open research topic, which involves several architecture-dependent components in the system, such as the following two examples. First, maintaining coherence between the host CPU and the NDP cores is a challenging task that recent works tackle [63,74]. Second, enabling efficient synchronization across NDP cores is challenging due to the lack of shared caches and hardware cache coherence protocols in NDP systems. Recent works, such as [101,309], provide solutions to the NDP synchronization problem. Therefore, to focus our analysis on the data movement characteristics of workloads and the broad benefits of NDP, we minimize our assumptions about our target NDP architecture, making our evaluation as broadly applicable as possible.

DAMOV: The Data Movement Benchmark Suite
In this section, we present DAMOV, the DAta MOVement Benchmark Suite. DAMOV is the collection of the 144 functions we use to drive our memory bottleneck classification in Section 3. The benchmark suite is organized according to the six classes of memory bottlenecks presented in Section 3. DAMOV is the first benchmark suite that encompasses real applications from a diverse set of application domains tailored to stress different memory bottlenecks in a system. We present the complete description of the functions in DAMOV in Appendix A. We highlight the benchmark diversity of the functions in DAMOV in Section 4.1. We open source DAMOV [158] to facilitate further rigorous research in mitigating data movement bottlenecks, including research in near-data processing.

Benchmark Diversity
We apply a hierarchical clustering algorithm to the 44 representative functions we employ in Section 3.3 (see footnote 13). Our goal is to showcase our benchmark suite's diversity and observe whether a clustering algorithm produces a noticeable difference from the application clustering presented in Section 3. The hierarchical clustering algorithm [310] takes as input a dataset containing features that define each object in the dataset. The algorithm works by incrementally grouping objects in the dataset that are similar to each other in terms of some distance metric (called linkage distance), which is calculated based on the features' values. Two objects with a short linkage distance have more affinity to each other than two objects with a large linkage distance. To apply the hierarchical clustering algorithm, we create a dataset where each object is one of the 44 representative functions from DAMOV. We use as features the same metrics we use for our analysis, i.e., temporal locality, MPKI, LFMR, and AI. We also include the LFMR curve slope to indicate whether the LFMR increases, decreases, or stays constant when scaling the core count. We use Euclidean distance [310] to calculate the linkage distance across features in our dataset. We also evaluate other linkage distance metrics (such as Manhattan distance [310]) and observe similar clustering results.
Figure 19 shows the dendrogram that the hierarchical clustering algorithm produces for our 44 representative functions. We indicate in the figure the application class each function belongs to, according to our classification. We make three observations from the figure.
First, our benchmarks exhibit a wide range of behavior diversity, even among those belonging to the same class. For example, we observe that the functions from Class 1a are divided into two groups, with a linkage distance of 3. Intuitively, functions in the first group (HSJNPO, STRAdd, STRCpy, STRSca, STRTriad) have regular access patterns while functions in the second group (DRKYolo, LIGCompEms, LIGPrkEmd, LIGRadiEms) have irregular access patterns. We observe a similar clustering in Section 3.3.1.
Second, we observe that our application clustering (Section 3.3) matches the clustering that the hierarchical clustering algorithm provides (Figure 19). From the dendrogram root, we observe that the right part of the dendrogram consists of functions with high temporal locality (from Classes 2a, 2b, and 2c). Conversely, the left part of the dendrogram consists of functions with low temporal locality (from Classes 1a, 1b, and 1c). The functions in the right and left parts of the dendrogram have a high linkage distance (higher than 15), which implies that the metrics we use for our clustering are significantly different from each other for these functions.
Third, we observe that functions within the same class are clustered into groups with a linkage distance lower than 5. This grouping matches the six classes of data movement bottlenecks present in DAMOV. Therefore, we conclude that our methodology can successfully cluster functions into distinct classes, each one representing a different memory bottleneck.
We conclude that (i) DAMOV provides a heterogeneous and diverse set of functions to study data movement bottlenecks and (ii) our memory bottleneck clustering methodology matches the clustering provided by a hierarchical clustering algorithm (this section; Figure 19).
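To make the clustering procedure concrete, the following is a minimal pure-Python sketch of agglomerative (bottom-up) hierarchical clustering with Euclidean distance. It is not the implementation used for Figure 19. The text does not state which linkage criterion merges clusters, so average linkage is an assumption here, and the feature vectors (temporal locality, MPKI, LFMR, AI, LFMR slope, all pre-normalized) are hypothetical values chosen only for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(points, ca, cb):
    """Mean pairwise distance between members of two clusters (assumed criterion)."""
    total = sum(euclidean(points[p], points[q]) for p in ca for q in cb)
    return total / (len(ca) * len(cb))

def agglomerative(points):
    """Repeatedly merge the two closest clusters.
    Returns the merge history as (members_a, members_b, linkage_distance)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(points, clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical, normalized features per function:
# (temporal locality, MPKI, LFMR, AI, LFMR slope)
FEATURES = [
    (0.1, 0.9, 0.9, 0.1, 0.0),  # low temporal locality
    (0.2, 0.8, 0.9, 0.1, 0.0),  # low temporal locality
    (0.9, 0.1, 0.1, 0.2, 1.0),  # high temporal locality
    (0.8, 0.1, 0.3, 0.2, 1.0),  # high temporal locality
]

if __name__ == "__main__":
    for a, b, dist in agglomerative(FEATURES):
        print(a, b, round(dist, 3))
```

In this toy input, the two low-temporal-locality functions merge first, the two high-temporal-locality functions merge next, and the two resulting groups join last at a much larger linkage distance, which mirrors the left/right split at the dendrogram root described above.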

Case Studies
In this section, we demonstrate how our benchmark suite is useful to study open questions related to NDP system designs. We provide four case studies. The first study analyzes the impact of load balance and communication on NDP execution. The second study assesses the impact of tailored NDP accelerators on our memory bottleneck analysis. The third study evaluates the effect of different core designs on NDP system performance. The fourth study analyzes the impact of fine-grained offloading (i.e., offloading small blocks of instructions to NDP cores) on performance.

Case Study 1: Impact of Load Balance and Inter-Vault Communication on NDP Systems
Communication between NDP cores is one of the key challenges for future NDP system designs, especially for NDP architectures based on 3D-stacked memories, where accessing a remote vault incurs extra latency overhead due to network traffic [46,101,311]. This case study aims to evaluate the load imbalance and inter-vault communication that the NDP cores experience when executing functions from the DAMOV benchmark suite. We statically map a function to an NDP core, and we assume that NDP cores are connected using a 6×6 2D-mesh Network-on-Chip (NoC), similar to previous works [66,70,312-314]. Figure 20 shows the performance overhead that the interconnection network imposes on NDP cores when running several functions from our benchmark suite. We report performance overheads of functions from different bottleneck classes (i.e., from Classes 1a, 1b, 2a, and 2b) that experience at least 5% of performance overhead due to the interconnection network. We calculate the interconnection network performance overhead by comparing performance with the 2D-mesh versus that with an ideal zero-latency interconnection network. We observe that the interconnection network performance overhead varies across functions, with a minimum overhead of 5% for SPLOcpSlave and a maximum overhead of 26% for SPLLucb.
We further characterize the traffic of memory requests injected into the interconnection network for these functions, aiming to understand the communication patterns across NDP cores. Figure 21 shows the distribution of all memory requests (y-axis) in terms of how many hops they need to travel in the NoC between NDP cores (x-axis) for each function. We make the following observations. First, we observe that, on average, 40% of all memory requests need to travel 3 to 4 hops in the NoC, and less than 5% of all requests are issued to a local vault (0 hops). Even though the functions follow different memory access patterns, they all inject similar network traffic into the NoC (see footnote 14). Therefore, we conclude that the NDP design can be further optimized by (i) employing more intelligent data mapping and scheduling mechanisms that can efficiently allocate data near the NDP core that accesses the data (thereby reducing inter-vault communication and improving data locality) and (ii) designing interconnection networks that can better fit the traffic patterns that NDP workloads produce. The DAMOV benchmark suite can be used to develop new ideas as well as evaluate existing ideas in both directions.

Figure 20: Interconnection network performance overhead in our NDP system (y-axis: Performance Overhead (%)).

Figure 21: x-axis: NoC Hops Traveled (0 to 10); y-axis: Memory Requests (%); functions shown: CHABsBez, CHAHsti, LIGBfsEms, SPLFftRev, SPLFftTra, SPLLucb, SPLOcnpLap, SPLOcpSlave, STRSca, STRTriad.
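The hop-count distribution discussed above can be illustrated with a short Python sketch. It assumes minimal (Manhattan-distance) routing on the 6×6 mesh and a row-major numbering of the NoC nodes; both are assumptions of this sketch, since the text does not specify the routing algorithm or how vaults map to mesh nodes. The uniform all-to-all traffic used in the example is synthetic and is not the measured traffic of any DAMOV function.

```python
from collections import Counter

MESH_DIM = 6  # 6x6 2D mesh, as assumed in this case study

def hops(src, dst, dim=MESH_DIM):
    """Hops between two nodes (row-major ids) under minimal routing."""
    sx, sy = src % dim, src // dim
    dx, dy = dst % dim, dst // dim
    return abs(sx - dx) + abs(sy - dy)

def hop_distribution(requests, dim=MESH_DIM):
    """Percentage of requests per hop count.
    `requests` is a list of (source_node, destination_node) pairs."""
    counts = Counter(hops(s, d, dim) for s, d in requests)
    total = len(requests)
    return {h: 100.0 * c / total for h, c in sorted(counts.items())}

# Synthetic uniform traffic: every node sends one request to every node.
UNIFORM = [(s, d) for s in range(MESH_DIM ** 2) for d in range(MESH_DIM ** 2)]

if __name__ == "__main__":
    for h, pct in hop_distribution(UNIFORM).items():
        print(f"{h:2d} hops: {pct:5.2f}% of requests")
```

In a 6×6 mesh the longest minimal route is 10 hops (corner to corner). Feeding per-function request traces into `hop_distribution` instead of the synthetic traffic would produce the kind of per-function histogram the text describes.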

Case Study 2: Impact of NDP Accelerators on Our Memory Bottleneck Analysis
In our second case study, we aim to leverage our memory bottleneck classification to evaluate the benefits an NDP accelerator provides compared to the same accelerator accessing memory externally. We use the Aladdin accelerator simulator [315] to tailor an accelerator for an application function. Aladdin works by estimating the performance of a custom accelerator based on the data-flow graph of the application. The main difference between an NDP accelerator and a regular accelerator (i.e., compute-centric accelerator) is that the former is placed in the logic layer of a 3D-stacked memory device and thus can leverage larger memory bandwidth, shorter memory access latency, and lower memory access energy, compared to the compute-centric accelerator, which is representative of existing accelerator designs.
To evaluate the benefits of NDP accelerators, we select three functions from our benchmark suite for this case study: DRKYolo (from Class 1a), PLYalu (from Class 1b), and PLY3mm (from Class 2c). We select these functions and memory bottleneck classes because we expect them to benefit the most (or to show no benefit) from the near-memory placement of an accelerator. According to our memory bottleneck analysis, we expect the functions we select to (i) benefit from NDP due to its high DRAM bandwidth (Class 1a), (ii) benefit from NDP due to its shorter DRAM access latency (Class 1b), or (iii) not benefit from NDP in any way (Class 2c).

Footnote 14: We use the default HMC data interleaving scheme in our experiments (Table 1).
Figure 22 shows the speedup that the NDP accelerator provides for the different functions compared to the compute-centric accelerator. We make four observations. First, as expected based on our classification, the NDP accelerator provides performance benefits compared to the compute-centric accelerator for functions in Classes 1a and 1b. It does not provide performance improvement for the function in Class 2c. Second, the NDP accelerator for DRKYolo shows the largest performance benefits (1.9× performance improvement compared to the compute-centric accelerator). Since this function is DRAM bandwidth-bound (Class 1a, Section 3.3.1), the NDP accelerator can leverage the larger memory bandwidth available in the logic layer of the 3D-stacked memory device. Third, we observe that the NDP accelerator also provides speedup (1.25×) for the PLYalu function compared to the compute-centric accelerator, since the NDP accelerator provides shorter memory access latency to the function, which is latency-bound (Class 1b, Section 3.3.2). Fourth, the NDP accelerator does not provide performance improvement for the PLY3mm function since this function is compute-bound (Class 2c, Section 3.3.6).
In conclusion, our observations for the performance of NDP accelerators are in line with the characteristics of the three memory bottleneck classes we evaluate in this case study. Therefore, our memory bottleneck classification can be applied to study other types of system configurations, e.g., the accelerators used in this section. However, since NDP accelerators are often employed under restricted area and power constraints (e.g., limited area available in the logic layer of a 3D-stacked memory [63,74]), the core model of the compute-centric and NDP accelerators cannot always be the same. We leave a thorough analysis of NDP accelerators that takes area and power constraints into consideration for future research.

Case Study 3: Impact of Different Core Models on NDP Architectures
This case study aims to analyze when a workload can benefit from different core models and numbers of cores while respecting the area and power envelope of the logic layer of a 3D-stacked memory. Many prior works employ 3D-stacked memories as the substrate to implement NDP architectures [1, 46-48, 54, 55, 59-61, 63-70, 74-77, 79, 80, 99, 101-103, 137, 146, 192, 194, 305, 316-324]. However, 3D-stacked memories impose severe area and power restrictions on NDP architectures. For example, the area and power budget of the logic layer of a single HMC vault are 4.4 mm² and 312 mW, respectively [1,63].
In this case study, we perform an iso-area and iso-power performance evaluation of functions from our benchmark suite. We configure the host CPU system and the NDP system to guarantee an iso-area and iso-power evaluation, considering the area and power budget for a 32-vault HMC device [1,63]. We use four out-of-order cores with a deep cache hierarchy for the host system configuration and two different NDP configurations, both without a deep cache hierarchy: (1) one using six out-of-order NDP cores (NDP+out-of-order) and (2) one using 128 in-order NDP cores (NDP+in-order). We choose functions from Classes 1a, 1b, and 2b for this case study since the major effects distinct microarchitectures have on the memory system are: (a) how much DRAM bandwidth they can sustain, and (b) how much DRAM latency they can hide. Classes 1a, 1b, and 2b are the most affected by memory bandwidth and access latency (as shown in Section 3). We choose two representative functions from each of these classes, for a total of six functions.
Figure 23 shows the speedup provided by the two NDP system configurations compared to the baseline host system. We make two observations. First, in all cases, the NDP+in-order system provides higher speedup than the NDP+out-of-order system, both compared to the host system. On average across all six functions, the NDP+in-order system provides 4× the speedup of the NDP+out-of-order system. The larger speedup the NDP+in-order system provides is due to the high number of NDP cores in the NDP+in-order system. We can fit 128 in-order cores in the logic layer of the 3D-stacked memory as opposed to only six out-of-order cores in the same area/power budget. Second, we observe that the speedup the NDP+in-order system provides compared to the NDP+out-of-order system does not scale with the number of cores. For example, the NDP+in-order system provides only 2× the performance of the NDP+out-of-order system for DRKYolo and PLYalu, even though the NDP+in-order system has 21× the number of NDP cores of the NDP+out-of-order system. This implies that even though the functions benefit from the large number of NDP cores available in the NDP+in-order system, static instruction scheduling limits performance on the NDP+in-order system. We believe, and our previous observations suggest, that an efficient NDP architecture can be achieved by leveraging mechanisms that can exploit both dynamic instruction scheduling and manycore design while fitting in the area and power budget of 3D-stacked memories. For example, past works [57,58,183,184,224,325-343] propose techniques that enable the benefits of simple and complex cores at the same time, via heterogeneous or adaptive architectures. These ideas can be examined to enable better core and system designs for NDP systems, and DAMOV can facilitate their proper design, exploration, and evaluation.
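The scaling argument in the second observation can be checked with simple arithmetic. The sketch below uses only the core counts (128 in-order vs. six out-of-order cores) and the 2× performance ratio reported above for DRKYolo and PLYalu; the function and variable names are hypothetical, and "relative per-core throughput" is an illustrative metric introduced here, not one defined by the paper.

```python
IN_ORDER_CORES = 128      # NDP+in-order configuration
OUT_OF_ORDER_CORES = 6    # NDP+out-of-order configuration

def relative_per_core_throughput(perf_ratio, core_ratio):
    """Throughput of one in-order core relative to one out-of-order core,
    given the system-level performance ratio and the core-count ratio."""
    return perf_ratio / core_ratio

core_ratio = IN_ORDER_CORES / OUT_OF_ORDER_CORES   # about 21.3x more cores
per_core = relative_per_core_throughput(2.0, core_ratio)

if __name__ == "__main__":
    print(f"core-count ratio: {core_ratio:.1f}x")
    print(f"per-core throughput of in-order vs. out-of-order: {per_core:.3f}")
```

Under these numbers, each in-order core delivers under a tenth of the throughput of an out-of-order core for these two functions, which is why the 21× increase in core count yields only a 2× system-level gain.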

Case Study 4: Impact of Fine-Grained Offloading to NDP on Performance
Several prior works on NDP (e.g., [47,54,86,89,100,146,303,306,344-346]) propose to identify and offload to the NDP system simple primitives (e.g., instructions, atomic operations). We refer to this NDP offloading scheme as fine-grained NDP offloading, in contrast to a coarse-grained NDP offloading scheme that offloads whole functions and applications to NDP systems. A fine-grained NDP offloading scheme provides two main benefits compared to a coarse-grained NDP offloading scheme. First, a fine-grained NDP offloading scheme allows for a reduction in the complexity of the processing elements used as NDP logic, since the NDP logic can consist of simple processing elements (e.g., arithmetic units, fixed function units) instead of the entire in-order or out-of-order cores often utilized when employing a coarse-grained NDP offloading scheme. Second, a fine-grained NDP offloading scheme can help in developing the simple coherence mechanisms needed to allow shared host and NDP execution [47]. However, identifying arbitrary NDP instructions can be a daunting task since there is no comprehensive methodology that indicates what types of instructions are good offloading candidates.
As the first step in this direction, we exploit the key insight provided by [151,347] to identify potential regions of code that can be candidates for fine-grained NDP offloading. Prior works [151,347,348] show that few instructions are responsible for generating most of the cache misses during program execution in memory-intensive applications. Thus, these instructions are naturally good candidates for fine-grained NDP offloading. Figure 24 shows the distribution of unique basic blocks (x-axis) and the percentage of last-level cache misses (y-axis) each basic block produces for three representative functions from our benchmark suite. We select functions from Classes 1a (LIGKcrEms), 1b (HSJPRH), and 1c (DRKRes) since functions in these classes have higher L3 MPKI than functions in Classes 2a, 2b, and 2c. We observe from the figure that 1% to 10% of the basic blocks in each function are responsible for up to 95.3% of the LLC misses. We call these basic blocks the hottest basic blocks (see footnote 15). We investigate the data-flow of each basic block and observe that these basic blocks often execute simple read-modify-write operations, with few arithmetic operations. Therefore, we believe that such basic blocks are good candidates for fine-grained offloading. Figure 25 shows the speedup obtained by offloading (i) the hottest basic block we identified for the three representative functions and (ii) the entire function to the NDP system, compared to the host system. Our initial evaluations show that offloading the hottest basic block of each function to the NDP system can provide up to 1.25× speedup compared to the host CPU, i.e., half of the performance gain of the 1.5× speedup achieved when offloading the entire function. Therefore, we believe that methodically identifying simple NDP instructions can be a promising research direction for future NDP system designs, which our DAMOV Benchmark Suite can help with.
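One simple way to pick out the hottest basic blocks from a profile is to rank blocks by LLC miss count and take the smallest set that covers a target fraction of all misses. The Python sketch below illustrates that idea; it is not the procedure used to produce Figure 24. The function name, the 95% coverage target, and the miss counts in the example profile are all hypothetical.

```python
def hottest_blocks(misses_per_block, coverage=0.95):
    """Return (blocks, covered_fraction): the highest-miss basic blocks that
    together account for at least `coverage` of all LLC misses."""
    total = sum(misses_per_block.values())
    if total == 0:
        return [], 0.0
    ranked = sorted(misses_per_block.items(), key=lambda kv: kv[1], reverse=True)
    selected, covered = [], 0
    for block, misses in ranked:
        if covered >= coverage * total:
            break
        selected.append(block)
        covered += misses
    return selected, covered / total

# Hypothetical per-basic-block LLC miss counts for one function.
PROFILE = {"bb0": 900, "bb1": 60, "bb2": 25, "bb3": 10, "bb4": 5}

if __name__ == "__main__":
    blocks, fraction = hottest_blocks(PROFILE)
    print(blocks, f"{100 * fraction:.1f}% of LLC misses")
```

The selected blocks would then be inspected (e.g., for simple read-modify-write behavior, as described above) before being considered for fine-grained offloading.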

Key Takeaways
We summarize the key takeaways from our extensive characterization of 144 functions using our new three-step methodology to identify data movement bottlenecks. We also highlight when NDP is a good architectural choice to mitigate a particular memory bottleneck.
Figure 26 pictorially represents the key takeaways we obtain from our memory bottleneck classification. Based on four key metrics, we classify workloads into six classes of memory bottlenecks. We provide the following key takeaways:
(1) Applications with low temporal locality, high LFMR, high MPKI, and low AI are DRAM bandwidth-bound (Class 1a, Section 3.3.1). They are bottlenecked by the limited off-chip memory bandwidth as they exert high pressure on main memory. We make three observations for Class 1a applications. First, these applications do benefit from prefetching since they display a low degree of spatial locality. Second, these applications highly benefit from NDP architectures because they take advantage of the high memory bandwidth available within the memory device. Third, NDP architectures significantly improve energy for these applications since they eliminate the off-chip I/O traffic between the CPU and the main memory.
(2) Applications with low temporal locality, high LFMR, low MPKI, and low AI are DRAM latency-bound (Class 1b, Section 3.3.2). We make three observations for Class 1b applications. First, these applications do not significantly benefit from prefetching since infrequent memory requests make it difficult for the prefetcher to train successfully on an access pattern. Second, these applications benefit from NDP architectures since they take advantage of NDP's lower memory access latency and the elimination of deep L2/L3 cache hierarchies, which fail to capture data locality for these workloads. Third, NDP architectures significantly improve energy for these applications since they eliminate costly (and unnecessary) L3 cache look-ups and the off-chip I/O traffic between the CPU and the main memory.
(3) Applications with low temporal locality, decreasing LFMR with core count, low MPKI, and low AI are bottlenecked by the available L1/L2 cache capacity (Class 1c, Section 3.3.3). We make three observations for Class 1c applications. First, these applications are DRAM latency-bound at low core counts, and thus take advantage of NDP architectures, in terms of both performance improvement and energy reduction. Second, NDP's benefits diminish as the core count grows, since a larger core count allows the working sets of such applications to fit inside the cache hierarchy. Third, NDP architectures can be a good design choice for such workloads in systems with a limited area budget, since NDP architectures do not require large L2/L3 caches to outperform or perform similarly to the host CPU (in terms of both system throughput and energy) for these workloads.
(4) Applications with high temporal locality, increasing LFMR with core count, low MPKI, and low AI are bottlenecked by L3 cache contention (Class 2a, Section 3.3.4). We make three observations for Class 2a applications. First, these applications benefit from a deep cache hierarchy and do not take advantage of NDP architectures at low core counts. Second, the number of cache conflicts increases when the number of cores in the system increases, leading to more pressure on main memory. We observe that NDP can effectively mitigate such cache contention for these applications without incurring the high area and energy overheads of providing additional cache capacity in the host. Third, NDP can improve energy for these workloads at high core counts, since it eliminates the costly data movement between the last-level cache and the main memory.
(5) Applications with high temporal locality, low LFMR, low MPKI, and low AI are bottlenecked by L1 cache capacity (Class 2b, Section 3.3.5). We make two observations for Class 2b applications. First, NDP can provide performance and energy consumption similar to those of the host system by leveraging lower memory access latency and avoiding off-chip energy consumption for these applications. Second, NDP can be used to reduce the overall SRAM area (by eliminating L2/L3 caches) in the system without a performance or energy penalty.
(6) Applications with high temporal locality, low LFMR, low MPKI, and high AI are compute-bound (Class 2c, Section 3.3.6). We make three observations for Class 2c applications. First, these applications suffer performance and energy penalties due to the lack of a deep L2/L3 cache hierarchy when executed on the NDP architecture. Second, these applications highly benefit from prefetching due to their high temporal and spatial locality. Third, these applications are not good candidates to execute on NDP architectures.
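The six takeaways above amount to a lookup from four metric levels to a bottleneck class. The following minimal sketch encodes that mapping. It takes already-discretized levels as inputs because this section does not state numeric thresholds; the function name `classify_bottleneck` and the string labels are hypothetical.

```python
from typing import Optional

# (temporal locality, LFMR behavior, MPKI, AI) -> bottleneck class,
# transcribed from the six key takeaways. Inputs are categorical levels
# (an assumption of this sketch), not raw measurements.
_CLASS_TABLE = {
    ("low", "high", "high", "low"): "1a: DRAM bandwidth-bound",
    ("low", "high", "low", "low"): "1b: DRAM latency-bound",
    ("low", "decreasing", "low", "low"): "1c: L1/L2 cache capacity",
    ("high", "increasing", "low", "low"): "2a: L3 cache contention",
    ("high", "low", "low", "low"): "2b: L1 cache capacity",
    ("high", "low", "low", "high"): "2c: compute-bound",
}


def classify_bottleneck(
    temporal_locality: str, lfmr: str, mpki: str, ai: str
) -> Optional[str]:
    """Return the bottleneck class for a combination of metric levels.

    temporal_locality, mpki, ai: "low" or "high".
    lfmr: "low", "high", "decreasing", or "increasing" (trend with core count).
    Returns None for combinations the takeaways do not cover.
    """
    return _CLASS_TABLE.get((temporal_locality, lfmr, mpki, ai))


if __name__ == "__main__":
    print(classify_bottleneck("low", "high", "high", "low"))
    print(classify_bottleneck("high", "low", "low", "high"))
```

Returning `None` for uncovered combinations keeps the sketch from guessing a class that the takeaways do not define.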

Shaping Future Research with DAMOV
A key contribution of our work is DAMOV, the first benchmark suite for main memory data movement studies. DAMOV is a collection of 144 functions from 74 different applications, belonging to 16 different benchmark suites or frameworks, classified into six different classes of data movement bottlenecks.
We believe that DAMOV can be used to explore a wide range of research directions on the study of data movement bottlenecks, appropriate mitigation mechanisms, and open research topics on NDP architectures. We highlight DAMOV's usability and potential benefits with four brief case studies, which we summarize below:
• In the first case study (Section 5.1), we use DAMOV to evaluate the interconnection network overheads that NDP cores placed in different vaults of a 3D-stacked memory suffer from. We observe that a large portion of the memory requests an NDP core issues go to remote vaults, which increases the memory access latency for the NDP core. We believe that DAMOV can be employed to study better data mapping techniques and interconnection network designs that aim to minimize (i) the number of remote memory accesses the NDP cores execute and (ii) the interconnection network latency overheads.
• In the second case study (Section 5.2), we evaluate the benefits that NDP accelerators can provide for three applications from our benchmark suite. We compare the performance improvements an NDP accelerator provides against the compute-centric version of the same accelerator. We observe that the NDP accelerator provides significant performance benefits compared to the compute-centric accelerator for applications in Classes 1a and 1b. At the same time, it does not improve performance for an application in Class 2c. We believe that DAMOV can aid the design of NDP accelerators that target different memory bottlenecks in the system.
• In the third case study (Section 5.3), we perform an iso-area/power performance evaluation to compare NDP systems using in-order and out-of-order cores. We observe that the in-order cores' performance benefits for some applications are limited by the cores' static instruction scheduling mechanism. We believe that better NDP systems can be built by leveraging techniques that enable dynamic instruction scheduling without incurring the large area and power overheads of out-of-order cores. DAMOV can help in the analysis and development of such NDP architectures.
• In the fourth case study (Section 5.4), we evaluate the benefits of offloading small portions of code (i.e., a basic block) to NDP, which simplifies the design of NDP systems. We observe that for many applications, a small percentage of basic blocks is responsible for most of the last-level cache misses. By offloading these basic blocks to an NDP core, we observe a performance improvement of up to 1.25×. We believe that DAMOV can be used to identify simple NDP instructions that enable building efficient NDP systems in the future.
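To make the remote-vault observation of the first case study concrete, the following minimal sketch computes the fraction of memory requests that target a vault other than the one hosting the issuing NDP core. The block-interleaved address-to-vault mapping, the default vault count, and the block size are assumptions made for illustration; they are not the configuration evaluated in the paper.

```python
from typing import Iterable, Tuple


def vault_of(address: int, num_vaults: int = 32, block_bytes: int = 64) -> int:
    """Map a physical address to a vault (assumed block-interleaved mapping)."""
    return (address // block_bytes) % num_vaults


def remote_access_fraction(
    requests: Iterable[Tuple[int, int]],
    num_vaults: int = 32,
    block_bytes: int = 64,
) -> float:
    """Fraction of requests whose target vault differs from the issuing core's vault.

    `requests` is a hypothetical trace of (core_vault, address) pairs.
    """
    total = 0
    remote = 0
    for core_vault, address in requests:
        total += 1
        if vault_of(address, num_vaults, block_bytes) != core_vault:
            remote += 1
    return remote / total if total else 0.0


if __name__ == "__main__":
    # Made-up trace: a core in vault 0 streams over consecutive blocks.
    trace = [(0, block * 64) for block in range(8)]
    print(remote_access_fraction(trace, num_vaults=4))
```

Under such an interleaved mapping, a streaming access pattern spreads requests across all vaults, which is one reason data mapping is a natural knob for reducing remote accesses.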

Related Work
To our knowledge, this is the first work that methodically characterizes data movement bottlenecks and evaluates the benefits of different data movement mitigation mechanisms, with a focus on Near-Data Processing (NDP), for a broad range of applications. This is also the first work that provides an extensive open-source benchmark suite, with a diverse range of real-world applications, tailored to stress different memory-related data movement bottlenecks in a system.
We highlight two prior works, [426] and [1], since they also focus on characterizing applications for NDP architectures. In [426], the authors provide the first work that characterizes workloads for NDP. They analyze five applications (FFT, ray tracing, method of moments, image understanding, data management). The NDP organization that [426] targets is similar to [429], where vector processing compute units are integrated into the DDRx memory modules. Even though [426] has a similar goal to our work, it understandably does not provide insights into modern data-intensive applications and NDP architectures, as it dates from 2001. Also, [426] focuses its analysis on only a few workloads, whereas we conduct a broader workload analysis starting from 345 applications. Therefore, a new, more comprehensive and rigorous analysis methodology of data movement bottlenecks in modern workloads and modern NDP systems is necessary. A more recent work investigates the memory bottlenecks in widely-used consumer workloads from Google and how NDP can mitigate such bottlenecks [1]. This work focuses its analysis on a small number of consumer workloads. Our work presents a comprehensive analysis of a much broader set of applications (345 different applications, and a total of 77K application functions), which allows us to provide a general methodology, a comprehensive workload suite, and general takeaways and guidelines for future NDP research. With our comprehensive analysis, this work is the first to develop a rigorous methodology to classify applications into six groups, which have different characteristics with respect to how they benefit from NDP systems as well as other data movement bottleneck mitigation techniques.

Conclusion
This paper introduces the first rigorous methodology to characterize memory-related data movement bottlenecks in modern workloads and the first data movement benchmark suite, called DAMOV. We perform the first large-scale characterization of applications to develop a three-step workload characterization methodology that introduces and evaluates four key metrics to identify the sources of data movement bottlenecks in real applications. We use our new methodology to classify the primary sources of memory bottlenecks of a broad range of applications into six different classes of memory bottlenecks. We highlight the benefits of our benchmark suite with four case studies, which showcase how representative workloads in DAMOV can be used to explore open research topics on NDP systems and reach architectural as well as workload-level insights and conclusions. We open-source our benchmark suite and our bottleneck analysis toolchain [158]. We hope that our work enables further studies and research on hardware and software solutions for data movement bottlenecks, including near-data processing.

APPENDIX A Application Functions in the DAMOV Benchmark Suite
We present the list of application functions in each one of the six classes of data movement bottlenecks we identify using our new methodology.

Figure 2: Overview of our three-step workload characterization methodology.

Figure 5: Performance of 12 representative functions on three systems: host CPU, host CPU with prefetcher, and NDP, normalized to one host CPU core.

Figure 6: Host CPU system IPC vs. utilized DRAM Bandwidth for representative Class 1a functions.

Figure 8: Average Memory Access Time (AMAT) for representative Class 1b functions.

Figure 9: Energy breakdown for representative Class 1b functions.

Figure 10: Energy breakdown for representative Class 1c functions.

Figure 12: Energy breakdown for representative Class 2a functions.

Figure 14: Energy breakdown for representative Class 2b functions.

Figure 15: Energy breakdown for representative Class 2c functions.

Figure 16: Performance of the host and the NDP system as we vary the LLC size, normalized to one host core with a fixed 8MB LLC size.

Figure 18: Summary of our characterization for all 144 memory-bound functions. Each box is lower-bounded by the first quartile and upper-bounded by the third quartile. The median falls within the box. The inter-quartile range (IQR) is the distance between the first and third quartiles (i.e., box size). Whiskers extend to the minimum and maximum data point values on either side of the box.

Figure 21: Distribution of NoC hops traveled per memory request.

Figure 22: Speedup of the NDP Accelerators over the Compute-Centric Accelerators for three functions from Classes 1a, 1b, and 2c.

Figure 23: Speedup of NDP architectures over 4 out-of-order host CPU cores for two NDP configurations: using 128 in-order NDP cores (NDP+in-order) and 6 out-of-order NDP cores (NDP+out-of-order) for representative functions from Classes 1a, 1b, and 2b.

Figure 25: Speedup of offloading to NDP the hottest basic block in each function versus the entire function.

Figure 26: Summary of our memory bottleneck classification.
• Table 2 lists application functions in Class 1a, i.e., that are DRAM bandwidth-bound (characterized in Section 3.3.1);
• Table 3 lists application functions in Class 1b, i.e., that are DRAM latency-bound (characterized in Section 3.3.2);
• Table 4 lists application functions in Class 1c, i.e., that are bottlenecked by the available L1/L2 cache capacity (characterized in Section 3.3.3);
• Table 5 lists application functions in Class 2a, i.e., that are bottlenecked by L3 cache contention (characterized in Section 3.3.4);
• Table 6 lists application functions in Class 2b, i.e., that are bottlenecked by L1 cache size (characterized in Section 3.3.5);
• Table 7 lists application functions in Class 2c, i.e., that are compute-bound (characterized in Section 3.3.6).
In each table, we list the benchmark suite, the application name, and the function name. We also list the input size/problem size we use to evaluate each application function.