I. Introduction
In the past decade, deep learning has been successfully applied in diverse domains such as computer vision, image classification, speech recognition, and natural language processing. The success of deep learning is attributed to its ability to learn rich representations of the input data through many layers of artificial neurons [1]. GPUs have played a key role in this success by significantly reducing training time [2]. To make the development of new deep neural networks more efficient, many open-source deep learning toolkits have recently been released, including Caffe from UC Berkeley [3], CNTK from Microsoft [4], TensorFlow (TF) from Google [5], and Torch [6], as well as other tools such as Theano [7] and MXNet [8]. All of these tools support multi-core CPUs and many-core GPUs for high performance. One of the main computational tasks in deep learning is learning a huge number of weights, which is implemented largely with vector and matrix operations. TensorFlow uses Eigen [9] as its accelerated matrix-operation library, while Caffe, CNTK, and Torch employ OpenBLAS [10] or cuBLAS [11] to speed up matrix-related calculations.
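To make this reliance on BLAS routines concrete, the forward pass of a fully connected layer reduces to a single general matrix multiplication (GEMM). The following minimal C sketch calls OpenBLAS's cblas_sgemm to compute the output activations Y = X * W for a small batch; the layer sizes and values are illustrative assumptions, not taken from any of the toolkits above (compile and link with -lopenblas).

/* Forward pass of a fully connected layer as one GEMM call.
 * Minimal sketch using OpenBLAS; all sizes/values are illustrative. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    enum { BATCH = 2, IN = 3, OUT = 4 };  /* hypothetical layer sizes */
    float X[BATCH * IN] = {               /* input activations, row-major */
        1.0f, 2.0f, 3.0f,
        4.0f, 5.0f, 6.0f };
    float W[IN * OUT] = {                 /* learned weights, row-major */
        0.1f, 0.2f, 0.3f, 0.4f,
        0.5f, 0.6f, 0.7f, 0.8f,
        0.9f, 1.0f, 1.1f, 1.2f };
    float Y[BATCH * OUT] = { 0.0f };      /* output activations */

    /* Y = 1.0 * X * W + 0.0 * Y, where Y is BATCH x OUT */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                BATCH, OUT, IN,
                1.0f, X, IN, W, OUT, 0.0f, Y, OUT);

    for (int i = 0; i < BATCH; ++i) {
        for (int j = 0; j < OUT; ++j)
            printf("%6.2f ", Y[i * OUT + j]);
        printf("\n");
    }
    return 0;
}

cuBLAS exposes an analogous routine (cublasSgemm) for GPUs, which is one reason a toolkit can switch between CPU and GPU back ends behind the same matrix abstraction.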