1 Introduction
Machine learning (ML) algorithms [1–6] have become ubiquitous in many fields of science and technology due to their ability to learn from data and improve with experience, with minimal human intervention. These algorithms train by updating their model parameters iteratively to improve overall prediction accuracy. However, training ML algorithms is a computationally intensive process that requires large amounts of training data [7–9]. Accessing training data in current processor-centric systems (e.g., CPU, GPU) requires costly data movement between memory and processors, which results in high energy consumption and accounts for a large fraction of the total execution cycles. This data movement can become the bottleneck of the training process if there is not enough computation and locality to amortize its cost [10–15].
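To make the iterative training pattern and its memory-access cost concrete, the following is a minimal sketch of stochastic-gradient-style training on a simple linear model with a mean-squared-error loss; the model, data, and hyperparameters are illustrative assumptions, not taken from the works cited above.

```python
import numpy as np

# Illustrative sketch: iterative parameter updates for a linear model
# trained with gradient descent on a mean-squared-error loss.
rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 16))          # training data (samples x features)
y = X @ rng.standard_normal(16)              # synthetic targets

w = np.zeros(16)                             # model parameters
lr = 0.01                                    # learning rate (assumed value)
for epoch in range(100):
    # Each iteration streams the full training set from memory to the
    # processor; this repeated movement is the cost discussed above.
    grad = (2.0 / len(X)) * (X.T @ (X @ w - y))  # gradient of the MSE loss
    w -= lr * grad                               # iterative parameter update
```

Note how every epoch re-reads all of X and y: when the model computes little per byte fetched and the data does not fit in on-chip caches, the memory traffic, rather than the arithmetic, dominates execution time.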