I. Introduction
With the rapid development of artificial intelligence (AI) and communication technology, intelligent services based on deep learning (DL) are becoming increasingly popular. As one of the principal classes of deep neural networks (DNNs), convolutional neural networks (CNNs) are widely used in vision applications such as object detection, object classification, and virtual reality [1], [2], [3]. Given the limited computing resources, storage space, and power of end devices and edge servers [4], it is difficult to deploy a complete CNN model, with its large number of parameters and heavy computational load, and run its inference efficiently on an edge device [5], [6]. One traditional solution to this problem is to offload the CNN inference task to the cloud server. With its powerful computing capability, the cloud server can infer a large CNN model with low latency and thereby improve the quality of intelligent services (QoS).