I. Introduction
Data movement is a major contributor to the total system energy consumption in deeply scaled CMOS ICs [1]. Studies show that data movement accounts for 40% of the total system energy in scientific applications [2] and 35% in mobile applications [3]. The energy cost of data movement is substantially higher than that of computation. For example, when performing a double-precision addition on a graphics processing unit (GPU) implemented at the 22 nm technology node, fetching the two operands from memory consumes more energy than moving the operands from the edge of the chip to its center, which in turn consumes more energy than the addition itself [4]. This orders-of-magnitude energy gap between data movement and computation is expected to widen with further technology scaling [5]. Thus, reducing data movement energy is critical to future computer systems.