I. Introduction
Deep learning is an extraordinarily popular machine-learning technique that is revolutionizing many aspects of our society. It addresses a wide variety of decision-making tasks such as image classification [1], audio recognition [2], autonomous vehicle control [3], and cancer detection [4]. Deep learning relies on models that are trained on large data sets using neural networks with many layers. Matrix multiplication is an essential but time-consuming operation in these computations: it is the most time-consuming step in both the feedforward and backpropagation stages of deep neural networks (DNNs) during training and inference, and it dominates the computation time and energy of many workloads [1]–[3], [5], [6]. Because DNNs have high computational complexity, recent years have seen many efforts to move beyond general-purpose processors toward dedicated accelerators that provide higher processing throughput and better energy efficiency.
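To make the role of matrix multiplication concrete, consider a single fully connected layer. The following minimal NumPy sketch (illustrative only; the layer sizes are arbitrary assumptions, not taken from this paper) shows that both its forward pass and its backpropagated gradients reduce to matrix products, which is why accelerating matrix multiplication directly accelerates DNN training and inference.

    import numpy as np

    # Illustrative sketch: one fully connected layer with arbitrary sizes.
    batch, d_in, d_out = 64, 1024, 512
    rng = np.random.default_rng(0)

    X = rng.standard_normal((batch, d_in))   # layer input
    W = rng.standard_normal((d_in, d_out))   # weight matrix

    # Forward pass: a single (batch x d_in) @ (d_in x d_out) matrix product.
    Y = X @ W

    # Backpropagation: given the upstream gradient dL/dY ...
    dY = rng.standard_normal((batch, d_out))

    # ... the weight and input gradients are again matrix products.
    dW = X.T @ dY    # (d_in x batch) @ (batch x d_out)
    dX = dY @ W.T    # (batch x d_out) @ (d_out x d_in)

Convolutional layers are commonly lowered to the same primitive (e.g., via im2col), so in practice matrix multiplication accounts for the bulk of a DNN's arithmetic work.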