Human Segmentation Based on Compressed Deep Convolutional Neural Network

Most semantic segmentation models based on deep convolutional neural networks (CNNs) require a large number of weight parameters and substantial hardware resources for storage and computation. Moreover, redesigning a compact network from scratch suffers from training problems such as under-fitting. A human segmentation algorithm based on a compressed deep CNN is proposed to optimize the convolutional layers and filters. PSPNet-50 is fine-tuned on a human segmentation dataset to obtain a human segmentation model with high accuracy. Then, convolutional-layer-level pruning and the corresponding structure optimization are performed so that the parameters of the model are substantially reduced. Finally, a two-stage global filter-level pruning strategy is applied. Compared with layer-by-layer pruning and retraining, our strategy not only reduces the parameters of the model and saves retraining time, but also keeps a high IoU (Intersection over Union) accuracy. In addition, by adding auxiliary losses to the network during training, the supervised training of the network is improved and IoU is further increased. Extensive experiments show that, compared to the model before compression, the parameter number, computation cost, memory consumption, and parameter storage are decreased by 1/7.5, 5.6/6.6, 0.7/1, and 6.5/7.5, respectively, while the segmentation speed is accelerated by 2.4 times and IoU on the test set reaches 93.2%.


I. INTRODUCTION
Human segmentation refers to segmenting human regions from the background. As the foundation for the analysis and understanding of human behavior, segmentation results are important to subsequent tasks such as 3D reconstruction, recognition, detection, and tracking.
Traditional image segmentation methods based on digital image processing and mathematical morphology are not robust enough to resist noise, and accurate segmentation typically requires a lot of human-computer interaction. With the rapid development of deep learning, especially convolutional neural networks, image segmentation methods based on deep learning outperform traditional methods in many aspects. The models trained with these segmentation algorithms can be used for portrait images; however, it is difficult to obtain a high IoU because they are not specifically trained on human body images. Moreover, deep network models need to compute convolutions on large feature maps. The number of weight parameters of a segmentation model is substantially large, so the segmentation speed is slow and the requirements for computation and storage are high. Human photography has become a major application of mobile devices such as mobile phones, and human segmentation there requires both high accuracy and real-time performance.
In this article, a human segmentation algorithm based on a compressed deep CNN is proposed, which reduces the resources required for computation and storage while yielding a high IoU. First, PSPNet-50 [1] is fine-tuned specifically for human segmentation, which both preserves the compactness of the original network structure and avoids under-fitting. Then, the model is pruned and simplified at the convolutional-layer level and the filter level. The convolutional layers and filters that have little impact on accuracy are pruned, so the number of parameters and the amount of computation are reduced while a high IoU is preserved. Extensive experiments on the human segmentation dataset demonstrate that our algorithm outperforms the commonly used models in terms of model size, segmentation speed, and accuracy.
The remainder of this article is organized as follows: Section 2 presents a review of related works. Section 3 describes the fine-tuning of PSPNet-50 for human segmentation. Section 4 introduces the compression and acceleration of our method. Section 5 describes the training of the compressed deep CNN for human segmentation. Experiments are presented in Section 6, followed by the conclusion in Section 7.

II. RELATED WORKS
There are many traditional human segmentation methods, which can be roughly divided into brush-based, scribble-based, and boundary-based methods. Brush-based methods are typically based on graph cut [2], [3], geodesic distance [4], random walk [5], and the fully connected conditional random field [6]. They require users to specify some foreground and background regions with a brush as boundary conditions, and then optimize the solution. In contrast to brush-based methods, the scribble-based method [7] only requires users to simply draw on the object. Boundary-based methods [8], [9] calculate the boundary of the object by tracking a rough boundary input by users. Although these methods are widely used in image processing software, the tedious and complex interaction limits their application in automatic image processing.
With the rapid development of deep learning, many semantic segmentation methods based on deep CNNs have emerged, such as FCN [10], DeepLab [11], PSPNet [1], and MDCCNets [12], which can automatically learn deep features of object representations and yield better results than traditional segmentation methods. These deep learning models can be used for human segmentation; however, their accuracies are not high since they are not specifically trained on human images. Wang et al. [13] proposed a human segmentation method based on the FCN structure combined with deconvolution.
Based on FCN, Shen et al. [14] added the position and shape information of portraits into the network training to improve human segmentation. However, deep network models typically require considerable hardware because of their substantial computational complexity and storage. For example, the widely used VGG-16 [15] has 138 million trainable parameters; it requires more than 15 billion floating-point operations (FLOPs) to classify a 224 × 224 image and about 500 MB of storage space. Thus, it is difficult to deploy such models on devices with limited resources, such as mobile phones.
Therefore, many methods have been proposed for the compression and acceleration of classification networks, which can be divided into quantization-based methods, pruning-based methods, and methods for designing compact networks. Quantization-based methods quantize each weight of the network into an element of a finite set. Zhou et al. [16] proposed incremental network quantization, which gradually converts a pre-trained full-precision floating-point neural network into a nearly lossless low-bit model. Gong et al. [17] directly performed k-means clustering on the weights of the network; each weight is represented by a cluster center, which can realize a high compression ratio. These quantization methods reduce storage consumption, but they do not greatly accelerate the neural networks.
Pruning-based methods [18]-[22] reduce storage and computation by removing some connections in the neural network. Through linear discriminant analysis, Tian et al. [23] found that many filters in the last convolutional layer of the VGG-16 model [15] are highly uncorrelated with facial gender classification, so filters with high within-class variance and low between-class variance can be removed. Some filter-level pruning strategies have been developed by evaluating neuron importance. Li et al. [24] measured the importance of each filter by the sum of its absolute weights; whole filters with small sums are removed from the network together with their connected feature maps. This approach does not result in sparse connectivity patterns, and the computation cost is reduced significantly. ThiNet [25] pruned the filters of the network based on the statistical information of the next layer and achieved a 16.63× parameter compression on VGG-16. To evaluate the importance of each filter in the network, Liu et al. [26] employed the scaling factor of the Batch Normalization (BN) layer [27]; filters whose scaling factors are close to zero are considered unimportant and can be deleted. Chen et al. [28] combined weight-based pruning with adaptive architecture squeezing to obtain a high compression ratio in CNNs and utilized pruning to find an appropriate squeezing ratio. In [29], a lossless lightweight CNN design strategy is explored for SAR target recognition by using structured pruning and knowledge distillation. Pruning methods can reduce the number of parameters and the computational complexity, but accuracy may be degraded when each network layer is pruned and retrained iteratively.
Designing a compact network architecture can achieve low storage cost, low computational complexity, and high network performance. Since classical deep networks, including AlexNet [30] and VGG-16, require a large amount of computational resources, some researchers have proposed more efficient neural network architectures, such as the residual network (ResNet) [31], which reduce model parameters while maintaining or even improving performance. SqueezeNet [32] further reduced the weight parameters by replacing 3 × 3 filters with 1 × 1 filters and reducing the number of channels of the 3 × 3 filters. Unfortunately, redesigning a compact network still encounters the under-fitting problem.

III. HUMAN SEGMENTATION MODEL BASED ON PSPNet-50
The first 50 layers of PSPNet-50 [1] use a 50-layer residual network [31] as the feature extractor, and feature maps at various levels are generated by pyramid pooling with various pooling kernel sizes. Since fusion can improve performance [33]-[38], these features are pooled under four different pyramid scales to form a pyramid module, and the different levels of features obtained by the pooling are then fused. Upsampling and convolutions are performed to form the final feature representation, which achieves end-to-end output of both local and global context information. In this article, we first fine-tune the PSPNet-50 network for human segmentation.
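For reference, a minimal PyTorch sketch of the pyramid pooling idea described above is given below; the pool sizes (1, 2, 3, 6), the 2048-channel ResNet-50 input, and the use of concatenation follow the original PSPNet paper and are assumptions rather than the exact configuration of our model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Sketch of a PSPNet-style pyramid pooling module (assumed configuration)."""
    def __init__(self, in_channels=2048, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        branch_channels = in_channels // len(pool_sizes)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(size),                 # pool to size x size
                nn.Conv2d(in_channels, branch_channels, 1), # reduce channels
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for size in pool_sizes
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample every pooled branch back to the input resolution and concatenate.
        feats = [x] + [
            F.interpolate(branch(x), size=(h, w), mode="bilinear", align_corners=False)
            for branch in self.branches
        ]
        return torch.cat(feats, dim=1)
```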

A. TRAINING AND TESTING DATASETS
The segmentation dataset is derived from the Baidu dataset [36], which contains 5,387 human images in various poses and scenes together with their segmentation labels. According to our observation, a few segmentation labels in the dataset are incorrect or inaccurate, so we remove these labels and the corresponding images to avoid degrading the training; finally, 4,512 human images and their segmentation labels are selected. 4,112 images are used as training samples, and the remaining 400 images are test samples. Each image is of size 473 × 473.

B. MODEL TRAINING
We set the number of output feature maps of the last convolutional layer in PSPNet-50 [1] to 2, corresponding to the background and the foreground (human), respectively, and then use the training set to fine-tune the human segmentation model.
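A minimal sketch of this adaptation, assuming a PyTorch implementation in which the final classifier is a convolutional layer exposed as a hypothetical attribute named final_conv:

```python
import torch.nn as nn

def adapt_to_binary_segmentation(model, in_channels=512):
    # Replace the final classifier so it predicts 2 classes:
    # channel 0 = background, channel 1 = human.
    # "final_conv" and "in_channels" are assumptions about the model layout.
    model.final_conv = nn.Conv2d(in_channels, 2, kernel_size=1)
    return model
```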
When the network performs forward propagation, the average of the prediction errors over all pixels is calculated as the training error, and the weight parameters are updated by minimizing this training error/loss. The training loss is

L = -\frac{1}{n}\sum_{i=1}^{n} \log p(x_i = k_i)   (1)

p(x_i = k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}   (2)

In Eq. (1) and (2), L denotes the model training loss, p(x_i = k) is the probability that the i-th pixel x_i belongs to the k-th category, k_i denotes the ground-truth category of pixel x_i, and n is the number of pixels in the current training image. z_k is the score of the k-th category; the larger z_k is, the higher the probability that the pixel belongs to the k-th category. z_j is the j-th element of the score vector z, which is turned into probabilities by the softmax function in Eq. (2). K is the total number of categories. In this article, there are only two categories, human and background, so K = 2.
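A minimal sketch of this per-pixel loss, assuming raw score maps (logits) of shape (N, K, H, W) and integer label maps of shape (N, H, W); it is numerically equivalent to the standard softmax cross-entropy provided by deep learning frameworks:

```python
import torch
import torch.nn.functional as F

def pixelwise_softmax_ce(logits, labels):
    # logits: (N, K, H, W) raw scores z; labels: (N, H, W) with values in [0, K).
    log_probs = F.log_softmax(logits, dim=1)              # Eq. (2), in log space
    picked = log_probs.gather(1, labels.unsqueeze(1))     # log p(x_i = k_i)
    return -picked.mean()                                 # Eq. (1): average over pixels

# Equivalent one-liner with the built-in loss:
# loss = F.cross_entropy(logits, labels)
```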
Stochastic gradient descent (SGD) [37] is used to update the weight parameters by a linear combination of the negative gradient and the previous weight update:

V_{t+1} = \mu V_t - \alpha \nabla L(W_t)   (3)

W_{t+1} = W_t + V_{t+1}   (4)

In Eq. (3) and (4), W_t is the weight matrix at the t-th iteration, V_t is the weight-update matrix at the t-th iteration, α is the basic learning rate applied to the negative gradient, and µ is the momentum weight of V_t, which controls the influence of the previous gradient direction on the current descent direction. Here µ = 0.9.
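A framework-agnostic sketch of one update step of Eq. (3) and (4); W, V, and grad stand for the weight matrix, the update matrix, and the gradient of the loss, respectively:

```python
def sgd_momentum_step(W, V, grad, lr=1e-4, mu=0.9):
    # Eq. (3): blend the previous update with the new negative gradient.
    V = mu * V - lr * grad
    # Eq. (4): apply the update to the weights.
    W = W + V
    return W, V
```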
During the iterative computation, in order to accelerate the convergence of the model loss, the learning rate is adjusted with a polynomial decay:

LR = base\_lr \times \left(1 - \frac{iter}{max\_iter}\right)^{power}   (5)

In Eq. (5), LR denotes the actual learning rate, base_lr is the basic learning rate, max_iter is the maximum number of iterations, iter is the current iteration number, and power is the learning-rate decay exponent. Here base_lr = 0.0001 and power = 0.9. In addition, for data augmentation, training images are randomly mirrored horizontally during training to improve the generalization ability.
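A minimal sketch of the "poly" schedule in Eq. (5); the max_iter value in the usage comment is a hypothetical example, while base_lr and power follow the values above:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # Eq. (5): polynomial ("poly") learning-rate decay.
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Example: at iteration 30,000 of a hypothetical 100,000-iteration run,
# poly_lr(1e-4, 30_000, 100_000) is about 7.25e-5.
```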

IV. COMPRESSION AND ACCELERATION
The strategies for compression and acceleration are as follows. Hierarchical structure pruning is applied to the model initially trained in Section 3, and the pruned model is then retrained on the human dataset. Finally, filter-level pruning is applied to the retrained model.

A. HIERARCHICAL STRUCTURE PRUNING
As shown in Figure 1(a), there are 4 parallel network paths in the initial pyramid pooling module obtained by fine-tuning PSPNet-50, which form a complicated structure and require a large amount of computation. We prune the convolutional-layer structure by removing the 4 parallel convolutional layers, and then merge the parallel output features of the pyramid pooling module by using Eltwise (element-wise summation) instead of Concat (as shown in Figure 1(b)). In Figure 1(a), the first upsample interpolates the feature maps to 1/4 of the original image size, and the second one interpolates the feature maps to the original image size. In Figure 1(b), the first, second, and third upsamples interpolate the feature maps to 1/4, 1/2, and the full size of the original image, respectively. In other words, the concatenation of the 4 parallel output features is replaced with a merging operation, i.e., the sum of the feature values. The reason is that the filter-parameter counts of these 4 parallel convolutional layers are the largest among all convolutional layers except the penultimate one. Figure 2 shows the number of parameters of each convolutional layer in the PSPNet-50 network; the 4 parallel convolutional layers account for 4.2M (million) parameters. Removing the 4 parallel convolutional layers also ensures that the numbers of input channels of the multiple bottoms of the Eltwise layer are equal so that the sum operation can be performed. This process not only halves the number of channels in the output feature maps of the pyramid pooling module and speeds up the computation, but also substantially decreases the number of weight parameters. In addition, because the penultimate convolutional layer has the largest number of weight parameters among all convolutional layers in the model, its filter size is changed from 3 × 3 to 1 × 1 so that the number of parameters in this layer is decreased from 18.9M to 2.1M. Table 1 compares the segmentation speed, parameter storage, and IoU before and after the hierarchical structure pruning. IoU is the intersection-over-union ratio between the ground-truth human region and the human region predicted by the segmentation network. Results show that the segmentation speed is accelerated after the hierarchical structure pruning, while the parameter storage is decreased by 47% and the IoU remains high.
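A minimal sketch of the two merging strategies, assuming four already-upsampled branch feature maps with identical spatial size; element-wise summation additionally requires identical channel counts, which is why the 4 parallel convolutional layers can be removed:

```python
import torch

def merge_concat(branches):
    # Original pyramid pooling: channel-wise concatenation (Concat).
    return torch.cat(branches, dim=1)   # channel counts add up

def merge_eltwise(branches):
    # Pruned structure: element-wise summation (Eltwise).
    out = branches[0]
    for b in branches[1:]:
        out = out + b                   # channel count stays the same
    return out

# Hypothetical example: with four 512-channel branches, Concat outputs
# 2048 channels while Eltwise keeps 512 channels for the downstream layers.
```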

B. FILTER-LEVEL PRUNING
Generally, the smaller the absolute value of a weight, the lower its importance and the smaller its impact on the convolution result [38], [39]. To compress the network efficiently, reduce the number of weight parameters, and speed up the computation, we do not prune single weights [38], because this strategy typically results in sparse connections, and graphics processing units (GPUs) are not good at dealing with sparse matrices. In other words, the number of valid parameters would be reduced while the network speed would not be accelerated and the storage cost would not be reduced. Therefore, the filter pruning approach in [24] is applied for further compression. The relative importance of a filter in each layer is measured by the sum of its absolute weights, and the unimportant filters are removed directly. Filter-level pruning is a structural pruning, which avoids sparse matrices, substantially reduces the number of parameters, and accelerates the computation. Figure 3 shows the effect of pruning/removing a filter. Suppose that the number of input channels of the i-th convolutional layer is n_i and the size of the input feature maps is h_i × w_i. We use n_{i+1} 3D filters F_{i,j} ∈ R^{n_i × k_i × k_i} (k_i × k_i is the 2D kernel size) to convolve the n_i-channel input feature maps x_i ∈ R^{n_i × h_i × w_i} into the n_{i+1}-channel feature maps x_{i+1} ∈ R^{n_{i+1} × h_{i+1} × w_{i+1}}. The number of multiplications performed by this convolution is n_{i+1} × n_i × k_i^2 × h_{i+1} × w_{i+1}. Therefore, removing one 3D filter from the i-th convolutional layer removes one feature map, removes n_i × k_i × k_i weight parameters that would need to be trained and stored, and removes n_i × k_i^2 × h_{i+1} × w_{i+1} multiplications. Moreover, the number of channels of the input feature maps of the next convolutional layer is also reduced by one, so n_{i+2} × k_{i+1} × k_{i+1} weight parameters no longer need to be trained and stored, and n_{i+2} × k_{i+1}^2 × h_{i+2} × w_{i+2} multiplications are canceled.
Consequently, the total number of removed weight parameters is n_i × k_i × k_i + n_{i+2} × k_{i+1} × k_{i+1}, which is larger than 1/n_i of the total parameter number of these convolutional layers. The total number of removed operations is n_i × k_i^2 × h_{i+1} × w_{i+1} + n_{i+2} × k_{i+1}^2 × h_{i+2} × w_{i+2}, which is larger than 1/n_{i+1} of the total number of calculations of these convolutional layers. Therefore, removing filters from convolutional layers not only largely decreases the number of weight parameters and the storage consumption, but also reduces the computation cost and increases the computation speed.
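As a purely hypothetical numeric illustration (the layer sizes below are not taken from the model in this article), the savings from removing one filter can be computed as:

```python
def removed_cost(n_i, k_i, h_out, w_out, n_i2, k_i1, h_out2, w_out2):
    # Parameters removed in layer i and in layer i+1 (one fewer input channel).
    params = n_i * k_i * k_i + n_i2 * k_i1 * k_i1
    # Multiplications removed in layer i and in layer i+1.
    mults = n_i * k_i**2 * h_out * w_out + n_i2 * k_i1**2 * h_out2 * w_out2
    return params, mults

# Hypothetical layer sizes: 256 input channels, 3x3 kernels, 60x60 outputs,
# and 512 filters in the following layer.
print(removed_cost(256, 3, 60, 60, 512, 3, 60, 60))   # (6912, 24883200)
```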
When pruning the filters, it is necessary to measure the importance of each filter in a convolutional layer and remove the less important ones. For each filter of a convolutional layer, the L1 norm of the filter weights is computed, i.e., the sum of the absolute values of all its weights. The L1 norms of all filters of a convolutional layer are sorted in descending order. Generally, if the L1 norm of a filter is small, its weight parameters are small, and accordingly the filter has a relatively weak impact on the final performance in practice, so the filters with small L1 norms can be removed. The process of pruning m filters from the i-th convolutional layer is as follows (a code sketch follows the steps below): Step 1. Calculate the L1 norm s_j = \sum_{l=1}^{n_i} \sum |K_l| for each filter F_{i,j}, where K_l denotes the l-th 2D kernel of F_{i,j} and the inner sum runs over the k_i × k_i entries of K_l, and sort the filters in descending order according to s_j.
Step 2. Prune the m filters with the smallest L1 norms and their corresponding feature maps, and also remove the kernels corresponding to the pruned feature maps in the next convolutional layer.
Step 3. Build new kernel matrices for the i-th and (i+1)-th layers, and copy the remaining kernel weights to the new model.
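A minimal PyTorch sketch of Steps 1-3 for a pair of adjacent convolutional layers; the helper name and the surgery details are hypothetical, since the original implementation of the pruned PSPNet-50 may organize this differently:

```python
import torch
import torch.nn as nn

def prune_conv_pair(conv_i, conv_next, m):
    """Remove the m filters of conv_i with the smallest L1 norms, and the
    matching input channels of conv_next (Steps 1-3 in the text)."""
    # Step 1: L1 norm of each filter (sum of absolute weights over in_ch x k x k).
    l1 = conv_i.weight.data.abs().sum(dim=(1, 2, 3))
    keep = torch.argsort(l1, descending=True)[: conv_i.out_channels - m]
    keep, _ = torch.sort(keep)                       # keep original filter order

    # Steps 2-3: build new layers and copy the surviving weights.
    new_i = nn.Conv2d(conv_i.in_channels, len(keep), conv_i.kernel_size,
                      stride=conv_i.stride, padding=conv_i.padding,
                      bias=conv_i.bias is not None)
    new_i.weight.data = conv_i.weight.data[keep].clone()
    if conv_i.bias is not None:
        new_i.bias.data = conv_i.bias.data[keep].clone()

    new_next = nn.Conv2d(len(keep), conv_next.out_channels, conv_next.kernel_size,
                         stride=conv_next.stride, padding=conv_next.padding,
                         bias=conv_next.bias is not None)
    new_next.weight.data = conv_next.weight.data[:, keep].clone()
    if conv_next.bias is not None:
        new_next.bias.data = conv_next.bias.data.clone()
    return new_i, new_next, keep
```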
The convolutional layers of our model are followed by batch normalization (BN) layers. BN layers normalize the input features, and each feature channel corresponds to trainable parameters. When a filter of a convolutional layer is deleted, its corresponding output feature channel disappears. Therefore, when the convolutional layers are pruned at the filter level, the parameters of the corresponding channels of the BN layers must also be removed.
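Continuing the sketch above, the BN parameters of the pruned channels can be removed with the same keep index (again a hypothetical helper, not the article's code):

```python
import torch.nn as nn

def prune_bn(bn, keep):
    new_bn = nn.BatchNorm2d(len(keep))
    # Copy the scale/shift parameters and running statistics of the kept channels.
    new_bn.weight.data = bn.weight.data[keep].clone()
    new_bn.bias.data = bn.bias.data[keep].clone()
    new_bn.running_mean = bn.running_mean[keep].clone()
    new_bn.running_var = bn.running_var[keep].clone()
    return new_bn
```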
Our model is based on the residual network (ResNet) [31], which is more complicated than FCN, so the pruning of the residual modules must also be considered. As shown in Figure 4, since a sum operation is performed between the output feature maps of the pruned convolutional layer in the residual branch and the output feature maps of the corresponding shortcut layer, the pruning on one side must be taken as the benchmark for the corresponding pruning on the other side. Since the feature maps output by the shortcut layer of ResNet are the master features learned by the network and are more important than the added residual maps, the feature maps to be pruned are determined by the pruning result of the shortcut layer, and the residual branch is pruned with the same filter indices as selected by the pruning of the shortcut layer.
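A sketch of this constraint under the same assumptions: the keep index is computed once from the shortcut (projection) convolution and then applied both to the last convolution of the residual branch, so that the element-wise sum still lines up channel by channel, and to the input channels of the layer after the block:

```python
import torch
import torch.nn as nn

def prune_residual_block(shortcut_conv, residual_last_conv, next_conv, m):
    # Rank filters on the shortcut branch only (the "master" features).
    l1 = shortcut_conv.weight.data.abs().sum(dim=(1, 2, 3))
    keep = torch.argsort(l1, descending=True)[: shortcut_conv.out_channels - m]
    keep, _ = torch.sort(keep)

    # Apply the same filter indices to both branches feeding the Eltwise sum.
    pruned = []
    for conv in (shortcut_conv, residual_last_conv):
        new_conv = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                             stride=conv.stride, padding=conv.padding, bias=False)
        new_conv.weight.data = conv.weight.data[keep].clone()
        pruned.append(new_conv)

    # The layer after the block loses the same input channels.
    new_next = nn.Conv2d(len(keep), next_conv.out_channels, next_conv.kernel_size,
                         stride=next_conv.stride, padding=next_conv.padding, bias=False)
    new_next.weight.data = next_conv.weight.data[:, keep].clone()
    return pruned[0], pruned[1], new_next, keep
```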

C. TWO-STAGE FILTER PRUNING STRATEGY
We do not use layer-by-layer pruning and fine-tuning retraining [25], [38] to compress the model, because it is too time-consuming and labor-intensive to train a deep network layer by layer. Instead, we prune all the weight parameters at one time and then retrain the model.
To quickly reduce the complexity of the network structure without degrading the accuracy, a two-stage pruning method is used. We do not directly compress the model to the target size, because the model would change too abruptly and become under-fitted. As shown in Table 2, if the model is directly compressed by 1/2, the accuracy decreases to 87.7%, and if the remaining ratio is set to a smaller value, the accuracy decreases even more sharply.
To balance the remaining ratio and the accuracy, the number of filters in all convolutional layers is compressed by 1/4 in the first pruning stage, and then the model is retrained on the dataset. In the second stage, the retrained model is further compressed by half and then retrained again to ensure a stable compression. It can be seen from Table 3 that the two-stage compression is a simple but highly effective method, and it yields a higher IoU than the model obtained by direct compression.
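A minimal sketch of the two-stage schedule; prune_all_layers and retrain are hypothetical helpers that globally prune every convolutional layer by the given ratio and fine-tune on the human dataset, and the remaining ratio stated in the comment reflects one reading of the ratios above:

```python
def two_stage_prune(model, prune_all_layers, retrain):
    # Stage 1: remove 1/4 of the filters in every convolutional layer, then retrain.
    model = prune_all_layers(model, ratio=0.25)
    model = retrain(model)
    # Stage 2: remove half of the remaining filters, then retrain again.
    model = prune_all_layers(model, ratio=0.5)
    model = retrain(model)
    # Under this reading, 0.75 * 0.5 = 0.375 of the original filters remain.
    return model
```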

V. TRAINING OF COMPRESSED HUMAN SEGMENTATION NETWORK
The training of the compressed human segmentation network is similar to the training before compression. In addition, we add two auxiliary losses to the loss function as extra supervision during training to improve the accuracy.

A. AUXILIARY LOSSES
Many studies show that neural networks with deeper layers can achieve better performance. However, as the network deepens, it also encounters optimization difficulties [40], [41]. ResNet effectively relieves this problem by residual learning; therefore, our network training is further optimized on the ResNet framework. An initial segmentation is generated during the training process and supervised by an auxiliary loss, while the output loss of the final network supervises the whole training, so the learning optimization of the deep network is decomposed into two parts that are easier to solve. Auxiliary loss1 and loss2 in Figure 5 are added to improve the training. Figure 5 shows our supervised learning approach for the deep network. In addition to the softmax cross-entropy loss of the output of the last network layer, the feature maps output from the 4th and 5th parts of the ResNet-50 backbone used in our network are respectively passed through a convolutional layer and an upsampling layer to obtain the corresponding classification outputs (score maps), which are used to calculate the auxiliary softmax cross-entropy losses. Each of these three losses acts on the ResNet-50 network. The auxiliary losses help optimize the learning process, and the three losses are weighted to obtain a more reasonable total loss. The loss function of the whole network is

L(x) = \alpha L_1(x) + \beta L_2(x) + \gamma L_3(x)   (6)

In Eq. (6), L_1(x), L_2(x), and L_3(x) are the master loss, auxiliary loss1, and auxiliary loss2, respectively, each calculated by Eq. (1) from the corresponding output score map: the normalized score of each pixel for its correct class is read from the score map and measures the loss of that pixel. α, β, and γ denote the weights of the three losses, and x denotes the output score map from each part of the network. Four sets of values of α, β, and γ are tested in the experiments. Table 4 compares the accuracy with and without the auxiliary losses; the accuracy is improved by 0.6% after the auxiliary losses are added.
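A minimal sketch of the weighted total loss in Eq. (6), reusing the pixelwise_softmax_ce helper sketched in Section 3; the default weight values are placeholders, not the values evaluated in Table 4:

```python
def total_loss(main_logits, aux4_logits, aux5_logits, labels,
               alpha=1.0, beta=0.4, gamma=0.4):
    # Eq. (6): weighted sum of the master loss and the two auxiliary losses.
    # Assumes pixelwise_softmax_ce from the earlier sketch is in scope.
    l1 = pixelwise_softmax_ce(main_logits, labels)   # final output, L1(x)
    l2 = pixelwise_softmax_ce(aux4_logits, labels)   # stage-4 score map, L2(x)
    l3 = pixelwise_softmax_ce(aux5_logits, labels)   # stage-5 score map, L3(x)
    return alpha * l1 + beta * l2 + gamma * l3
```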

VI. EXPERIMENTAL RESULTS AND ANALYSIS
Our method is compared with some state-of-the-art methods in terms of model size, speed, and IoU accuracy. The evaluation includes three parts. Since our compression model is built from PSPNet-50, we compare the fine-tuned PSPNet-50 before and after compression on the Baidu dataset. We also compare our model with some deep-learning-based segmentation approaches on the Baidu dataset, including FCN-8S and DeepLab-v2. In addition, we compare our model with some compression models on both the Baidu dataset and the Flickr dataset.

A. DATASETS
The evaluation is performed on the Baidu dataset [36] and the Flickr dataset [14]. There are 1,800 portrait images in the Flickr dataset, some of which have large variations in color, background, clothing, accessories, etc.

B. COMPARISON BEFORE AND AFTER COMPRESSION
The comparisons of the human segmentation network based on fine-tuned PSPNet-50 before and after compression are shown in Table 5. The accuracy is slightly decreased from 94.8% to 93.2% for three reasons. First, in order to improve efficiency and reduce the training complexity, we retrain all the convolutional layers of the network once after pruning, rather than iteratively retraining after pruning each convolutional layer. Training after one-time pruning requires only one retraining pass, while layer-by-layer pruning requires retraining for each pruned convolutional layer; although layer-by-layer pruning can attain a better IoU accuracy, it consumes much more time. Second, the L1 norm is a somewhat rough measure of the importance of the filters. Third, the importance of the filters in different convolutional layers is not equal; although it is convenient to compress them all by the same proportion, some relatively important filters may be removed.

C. COMPARISON WITH UNCOMPRESSED SEGMENTATION NETWORKS
For an effective evaluation, our model is compared with some popular deep-learning-based segmentation networks that are also trained on the human image dataset. Results are shown in Table 6. FCN-8S-VOC and PSPNet-50-ADE are the FCN [10] and PSPNet-50 [1] trained on the PASCAL VOC dataset [42] and the ADE20K dataset [43], respectively; since they are not trained on human images, they yield the lowest and the second lowest accuracies. FCN-8S is a powerful variation of FCN [10]; we train FCN-8S on the human image dataset to build FCN-8S-Person. We also use the human image dataset to train DeepLab-v2_VGG-16 [11] and DeepLab-v2_resnet-101 [11], which are improved segmentation models based on VGG-16 [15] and resnet-101 [31], respectively. Our model is the best compression architecture in terms of computational cost, parameter number, speed, memory consumption, and parameter storage. Although DeepLab-v2_resnet-101 yields a 0.6% higher accuracy than our model, our model outperforms it in terms of the other 5 indices. For example, DeepLab-v2_resnet-101 requires more than 11 GB of memory for training with 473 × 473 images, which runs out of memory on our computer. In fact, we had to reduce the image size to 352 × 352, and it still yields the lowest segmentation speed (4.9 fps), which makes it difficult to run in real time. Figure 6 shows example segmentation results on the Baidu dataset. The models trained on the human image dataset perform much better than those that are not. In many cases, although FCN-8S-Person is trained on the human image dataset, its edges are not accurate (shown in the rectangular boxes) because the FCN model is not good at extracting details. The segmentation performances of PSPNet-50 and DeepLab-v2 are improved by the fusion of multi-scale features. Although our model is a compressed model of PSPNet-50, it retains the pyramid pooling structure of PSPNet-50 to obtain global context information, so it has almost the same performance as the original PSPNet-50.

D. COMPARISON WITH COMPRESSION NETWORKS
Our model is compared with several state-of-the-art compression networks, including NISP [20], VGGNet-pruning [26], PSPNet-50-pruning [38], and MBNet_DPLab [44], on both the Baidu dataset and the Flickr dataset. NISP prunes a CNN by removing the neurons with the least importance and is then fine-tuned to recover the CNN's predictive power. VGGNet-pruning is built from a VGGNet model, which is pruned with the method of [26] and trained on the human image dataset. MBNet_DPLab is a lightweight semantic segmentation model that combines the MobileNet-v2 and DeepLab-v3 [45] frameworks to simplify the design for mobile terminals. PSPNet-50-pruning is built from a PSPNet-50 model, which is pruned with the method of [38] and trained on the human image dataset.
As shown in Table 7, MBNet_DPLab requires the lowest memory consumption and parameter storage and has the highest speed, but also the lowest accuracy. The reason is that DeepLab-v3 only compresses the weight parameters of the fully connected layer; for a segmentation network without a fully connected layer, it only prunes a small part of the convolutional-layer weights to preserve accuracy, because the convolutional-layer weights are more sensitive than those of the fully connected layer. In our method, the convolution kernels of the filters are pruned for compression and acceleration. Filter-level pruning belongs to structural pruning, which avoids the sparse connections caused by pruning single weights, so the GPU does not need to handle sparse-matrix computations; therefore, our method not only greatly reduces the parameters but also accelerates the running speed. Since our method only uses the L1 norm for filter pruning, it does not fully capture the importance of the filters. Although our model is not the best in all aspects, its performance is comprehensively optimized. Figure 7 shows an example of the segmentation results on the Flickr dataset; the failure details are shown in rectangular boxes. Table 8 further reports a quantitative evaluation. In this experiment, none of the models is retrained with portrait images; they are tested directly on these images. Because of its strong generalization ability, our method still achieves performance competitive with that on the Baidu dataset. It can be seen from Table 7 and Table 8 that all models attain better performance on the Flickr dataset. This can be explained by the fact that the face and the upper part of the human body occupy the majority of the features in a portrait image, so it is easier to segment the face and the upper body than the whole body.

VII. CONCLUSION
This article proposes a method for human segmentation based on a compressed deep CNN. First, PSPNet-50 is fine-tuned on the human image dataset to obtain an initial segmentation model with high accuracy; then, convolutional-layer-level pruning and structural optimization are performed on the initial model, which substantially reduces the number of parameters. Finally, a two-stage filter-level pruning strategy is employed. Compared with the methods that prune and retrain layer by layer, our method not only decreases the retraining time and the parameters, but also guarantees a high accuracy. In addition, by adding two auxiliary losses to the network during training, the supervised training of the network is better realized and the IoU is improved.
The segmentation model in this article is well suited to applications on mobile devices. However, the algorithm can be further improved, since the IoU is degraded by the compression. Future work will focus on the following two points: (1) besides the L1 norm, other indices will be used to judge the importance of the filters; (2) the number of filters retained in each convolutional layer can differ according to the importance of that layer.