1. INTRODUCTION
Machine learning with visual data has been described as the means to translate "pixels to concepts" [1], e.g., to classify active pixel sensor (APS) video according to the human activity it depicts ("tennis match", "cooking", "people marching",…). However, APS-based video representations are known to be cumbersome for machine learning systems due to several factors [2]: limited frame rate, high redundancy between successive frames, calibration problems under irregular camera motion, blurriness caused by shutter adjustment under varying illumination, and high power requirements. Motivated by these observations, hardware designs of neuromorphic sensors, a.k.a. silicon retinas [3], [4], have recently been proposed. Silicon retinas mimic the photoreceptor-bipolar-ganglion cell information flow of biological retinas by producing coordinates and timestamps of on/off spikes asynchronously, i.e., whenever the logarithm of the intensity value at a CMOS sensor grid position changes beyond a threshold due to scene luminance changes. Unlike conventional frame-based cameras, which tend to blur the image due to slow shutter speed, silicon retinas capture the illumination changes caused by fast object motion and are inherently differential in nature. In practice, this means that neuromorphic vision sensing (NVS) data from hardware such as the iniLabs DAVIS and the Pixium Vision ATIS cameras [4], [5], [6], [7] can be rendered into representations comprising up to 2000 frames per second (fps), while operating robustly under changing lighting conditions and at low power, on the order of 10 mW. In contrast, a typical APS video camera captures at most 60 fps at more than 20 times the active power consumption, and suffers shutter-induced blurriness artifacts when rapid illumination changes occur.
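To make the event-generation principle concrete, the following minimal sketch simulates the log-intensity thresholding described above. All names, the contrast threshold value, and the frame layout are illustrative assumptions, not the API of any real camera or driver: each pixel keeps a reference log intensity, and an ON/OFF event is emitted whenever the current log intensity drifts beyond the threshold relative to that reference.

```python
import math
from dataclasses import dataclass

@dataclass
class Event:
    """One asynchronous spike: pixel coordinates, timestamp, on/off polarity."""
    x: int
    y: int
    t: float
    polarity: int  # +1 for ON (brighter), -1 for OFF (darker)

def emit_events(ref_log, frame, t, threshold=0.2):
    """Compare each pixel's log intensity against its stored reference
    (per-pixel memory); emit an event when the change exceeds the
    contrast threshold, then reset the reference to the new level.
    `ref_log` and `frame` are 2-D lists; `threshold` is an assumed
    contrast sensitivity, not a calibrated hardware value."""
    events = []
    for y, row in enumerate(frame):
        for x, intensity in enumerate(row):
            log_i = math.log(max(intensity, 1e-6))  # guard against log(0)
            diff = log_i - ref_log[y][x]
            if abs(diff) >= threshold:
                events.append(Event(x, y, t, 1 if diff > 0 else -1))
                ref_log[y][x] = log_i  # pixel updates its own reference
    return events
```

Because only pixels whose luminance actually changed produce output, the representation is inherently differential: a static scene generates no data, which is the source of the low power and low redundancy noted above.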
The combination of these advantages makes NVS particularly appealing in Internet-of-Things (IoT) and robotics contexts [8], where NVS data can be gathered at very low power and streamed to cloud computing servers for back-end analysis with deep convolutional neural networks (CNNs).