I. Introduction
In recent years, deep learning methods, and convolutional neural networks (CNNs) in particular, have achieved considerable success in a range of computer vision applications, including object recognition [25], object detection [10], semantic segmentation [37], action recognition [46], and face recognition [42]. This success stems from three factors: (i) large annotated training datasets are now available for a variety of recognition problems, making it possible to learn rich models with millions of free parameters; (ii) massively parallel GPU implementations greatly improve the training efficiency of CNNs; and (iii) new, effective CNN architectures continue to be proposed, such as the VGG-16/19 networks [47], Inception networks [55], and deep residual networks [13].