I. Introduction
Deep convolutional neural networks (CNNs) have achieved impressive results on a wide range of computer vision tasks, from object classification, detection, and segmentation to image and video editing and interpolation. To achieve state-of-the-art results on these tasks, CNNs have become larger, deeper, and more complex. On the one hand, the excellent accuracy of computationally heavy CNNs creates demand for their use in various applications. On the other hand, their cost calls for computationally efficient implementations. Several research directions aim to make CNNs run efficiently, such as efficient implementations of CNNs on different hardware [1]–[6], novel convolutional neural network architectures designed for memory and computational efficiency [7]–[11], knowledge distillation to decrease the number of parameters [12]–[15], pruning techniques to decrease the network size [16]–[18], and weight or activation quantization of CNNs from 32-bit floating point into lower bit-width representations [19]–[22].
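To make the last of these directions concrete, the following is a minimal sketch of post-training symmetric uniform quantization, which maps 32-bit floating-point weights to signed low-bit integers plus a scale factor. The function name uniform_quantize and the symmetric scheme are illustrative assumptions for exposition, not the specific method of any of the cited works.

```python
import numpy as np

def uniform_quantize(w: np.ndarray, bits: int = 8):
    """Symmetric uniform quantization of a tensor to signed `bits`-bit integers.

    Illustrative sketch: returns integer codes and the scale needed to
    recover an approximation of the original float values.
    """
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8-bit signed
    scale = np.max(np.abs(w)) / qmax           # map the largest magnitude to qmax
    dtype = np.int8 if bits <= 8 else np.int32
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(dtype)
    return q, scale

# Example: quantize a hypothetical conv layer's weights and check the error.
w = np.random.randn(64, 3, 3, 3).astype(np.float32)
q, scale = uniform_quantize(w, bits=8)
w_hat = q.astype(np.float32) * scale           # dequantized approximation
print("max abs quantization error:", np.max(np.abs(w - w_hat)))
```

Storing q and a single scale per tensor replaces 32-bit floats with 8-bit integers, a 4x reduction in weight memory at the cost of a bounded rounding error.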