I. Introduction
Despite the recent success of Transformer-based neural networks, convolutional neural networks (CNNs) remain widely used for image classification [1], object detection [2], and natural language processing [3]. At the heart of CNNs is the convolution operation (Conv) [4], which often accounts for more than 90% of the execution time in classical CNN models such as VGG [5]. Considerable effort has therefore gone into optimizing convolution implementations to accelerate CNNs [6], [7].