I. Introduction
Conventional computer systems adopt separate processing (CPUs and GPUs) and data storage components (memory, flash, and disks). As the volume of data to process has skyrocketed over the last decade, data movement between the processing units (PUs) and the memory is becoming one of the most critical performance and energy bottlenecks in various computer systems, ranging from cloud servers to end-user devices. For example, the data transfer between CPUs and off-chip memory consumes two orders of magnitude more energy than a floating point operation [1]. Recent progress in processing-in-memory (PIM) techniques introduce promising solutions to the challenges [2]–[5], by leveraging 3D memory technologies [6] to integrate computation logic with the memory.