I. Introduction
Over the past decade, Convolutional Neural Networks (CNNs) have become an essential workhorse for computer vision tasks. Building on this success, CNNs have been extended from image processing to spatio-temporal reasoning tasks. Recently, many successful models have been developed in different application domains, such as visual tracking [1], [2], acoustic perception [3], [4], and biomedical information extraction [5], [6]. As shown in Fig. 1a, these designs typically include the time axis as an additional dimension of the input feature map, which allows conventional state-of-the-art (SotA) CNN structures to be reused with a reshaped spatio-temporal input. However, adding the required time dimension inflates these CNN models and incurs massive but unnecessary hardware overhead, which hinders real-time edge processing.