I. Introduction
Recent years have seen increasing adoption of Convolutional Neural Networks (CNNs) in AI applications due to their superior accuracy on many computer vision tasks. Making CNNs deeper and wider is an effective way to improve task accuracy, but this development is hindered by the memory constraints of GPUs, since deeper and wider CNN models consume more memory during training. For example, ResNet152 [1] consumes around 18 GB of memory at a batch size of only 32, while the NVIDIA P100, a mainstream GPU in cloud platforms, has only 16 GB of memory. Because GPU memory capacity grows more slowly than the memory requirements of large CNNs, there is a strong need for memory management techniques that support training large CNN models on a single GPU.