A Part-Based Deep Neural Network Cascade Model for Human Parsing

Human parsing is important for image-based human-centric and clothing analyses. With the development of deep neural networks, some deep human parsing methods were recently proposed, which substantially improve the parsing accuracy. However, some localized small regions (such as sunglasses) are not parsed well by these methods. In this paper, we propose a Part-based Human Parsing Cascade (PHPC) to segment human images, imitating the observational mechanism of how people, when first looking at a human image, quickly scan the entire photograph to first locate the face and then the body parts to see what clothing the person is wearing. This observational mechanism of human vision is used to establish a cascade relationship in designing our network, in which a head-parsing sub-network and a body-parsing sub-network are integrated into a cascade of human parsing networks. The head- and body-parsing sub-networks focus on the head and body classes, respectively, and add attention to the head and body in the final neural networks. Comprehensive evaluations on the ATR dataset demonstrate the effectiveness of our method.


I. INTRODUCTION
Due to its importance to both human-centric and clothing analyses, human parsing has become an attractive subject for research over the past few years. Human parsing involves segmenting the person in a fashion image into regions according to their different body parts (e.g. face, left-arm, and right-leg) and the clothing (e.g. upper-clothing, dress, and trousers) that the person is wearing. Fig. 1 shows an example of human parsing. After human parsing, each pixel of the input image is given a label.
For human parsing research, researchers have mainly adopted one of two approaches: (1) a bottom-up approach [1], [2], where input images are first analysed using a superpixel technique, and conditional random fields (CRFs) are then used to group and refine the initial superpixel results into larger segments and labels; and (2) a top-down approach [2], [3], where input images are first segmented into regions that are further classified into given labels.
The associate editor coordinating the review of this manuscript and approving it for publication was Jeon Gwanggil. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ VOLUME 7, 2019
Following the bottom-up approach, Yamaguchi et al. [1], [2] proposed to segment an image into superpixels and then predicted the clothing labels for each superpixel using a CRF model. This method performed quite well on the constrained parsing problem, where test images are parsed given user-provided tags that indicate the depicted clothing items. This approach was less effective at unconstrained clothing parsing, however, where test images are parsed in the absence of any textual information.
Following the top-down approach, another group of researchers first aligned human parts by using the parselet representation as building blocks for a parsing model [3]. Parselets are groups of parsable segments that can generally be obtained from segmentation algorithms using low-level features. Dong et al. [4] built a deformable mixture-parsing model (DMPM) for human parsing to simultaneously handle the deformation and multimodalities of parselets. A DMPM seamlessly formulates the human parsing and pose estimation problem within a unified framework via a tailored And-Or graph, using parselets and a mixture of joint-group templates as the semantic components. Their work is limited by the suboptimal performance of many hand-designed intermediate components, such as handcrafted feature extraction and pose estimation [5].
Inspired by the remarkable improvement in accuracy introduced by the use of deep networks, deep human parsing methods have recently been proposed. Liang et al. [6] proposed a contextualised convolutional network, built on a fully convolutional network, to address the human parsing task. They integrated the cross-layer and global image-level contexts with the within-superpixel and cross-superpixel neighbourhood contexts into a unified network. To increase the network capability, they incorporated Long Short-Term Memory (LSTM) layers into the convolutional neural networks (CNNs) in their extension work [7], which allowed memorisation of previous contextual interactions from local neighbouring positions and the whole image in previous LSTM layers. However, these parsing methods did not take local and regional information into consideration, which is fundamentally important because some items are so small that special attention must be drawn to specific part regions to identify and describe such items.
In this paper, we propose a new human parsing network cascade, the Part-based Human Parsing Cascade (PHPC), inspired by the observational mechanism of how people, when first looking at a human photo, quickly scan the entire photograph to first locate the face and then the body parts to see what clothing the person is wearing. To imitate human observation, we integrate a head-parsing sub-network and a body-parsing sub-network into a cascade of human parsing networks. The head- and body-parsing sub-networks focus on the head classes and body classes, respectively, and add attention to the head and body in the final neural networks. We chose the FCN-8s network [9] as our baseline network, as it is efficient and has shown great improvement on semantic segmentation. Semantic segmentation and human parsing are closely related, as discussed in the next section. To evaluate the effectiveness of our PHPC method, we conducted several experiments on the ATR dataset [8], on which we also trained our PHPC model. For comparison purposes, we also trained an FCN-8s model using the method proposed by Long et al. [9] and a CRFasRNN model using the method proposed by Zheng et al. [10]. We also discuss the effectiveness of superpixel and CRF refinement in the discussion section. The main contribution of our work is a novel PHPC model that mimics human vision.

II. RELATED WORK
We review the related work of this study, including human parsing and some recent deep-learning-based developments in semantic segmentation, a research area closely related to human parsing. Both semantic segmentation and human parsing attempt to assign a label to each pixel in an image.

A. SEMANTIC SEGMENTATION
A wide range of approaches use deep learning to tackle semantic image segmentation. These approaches can be categorized into two main strategies. The first strategy is to extract more meaningful features by improving mechanisms such as superpixels, multi-scale image inputs and optimized filters. Mostajabi et al. [12] first obtained superpixels from the image and then applied a feature extraction process to each of them. Chen et al. [13] combined CNN outputs from multiple image scales such that each feature vector represents a large contextual window around each pixel. Hariharan et al. [14] combined features from the intermediate layers to enhance the feature extraction. Yu and Koltun [15] proposed dilated convolutions to support exponential expansion of the receptive field without loss of resolution or coverage. Pinheiro and Collobert [16] employed an RNN to model the spatial dependencies during scene parsing. The second strategy is to incorporate a CRF into the CNN to refine the result. Chen et al. [17] exploited a pre-trained CNN to generate deep features for CRF learning and illustrated that CRF learning with CNN features yields astounding results. Subsequent works [10], [18] have taken the idea further by incorporating a CRF as layers within a deep network and then learning the parameters of both the CRF and the CNN together via back propagation. For example, Zheng et al. [10] formulated a CRF as an RNN and then plugged it into the network as a part of a CNN. However, these approaches have not employed higher order potentials, which have previously been shown to significantly improve segmentation performance. Arnab et al. [18] combined object-detection based potentials and superpixel-based potentials into the CRF embedded within a deep network.

B. HUMAN PARSING
Similar to semantic segmentation, human parsing also predicts the label of each pixel in the image, but focuses on human images, namely the segmentation of human body parts and clothing regions from the background. Much research has been devoted to human parsing in recent years. Most previous methods rely on complicated preprocessing, such as human pose estimation, bottom-up hypotheses and template dictionary learning. For example, Yamaguchi et al. [2] performed human pose estimation and attribute labeling sequentially and then used a retrieval-based approach to improve clothes parsing. For a query image, they found similar styles from a large database of tagged fashion images and used these examples to parse the query. Their approach combined parsing from pre-trained global clothing models, local clothing models learned on the fly from retrieved examples and transferred parse masks (paper doll item transfer) from retrieved examples. Similarly, Liu et al. [5] also adopted a retrieval-based method. They retrieved the best matching clothing region of the test image from the annotated and parsed human image corpus and then used a convolutional network to learn the inference and displacement coefficients. Dong et al. [4] proposed to use Parselet hypotheses to build the parsing model. Liang et al. [8] formulated human parsing as an active template regression problem, where the template coefficients for each label mask and their corresponding locations were predicted using convolutional networks. However, none of these methods can be trained in a fully end-to-end way over raw image pixels. In the recent past, a few methods have started using deep convolutional networks to train the network end-to-end. Liang et al. [6] built on the fully convolutional network and proposed the contextualized convolutional network, which integrated the cross-layer context, global image-level context, within-superpixel context and cross-superpixel neighborhood context into a unified network. In their extension work [7], they incorporated short-distance and long-distance spatial dependencies into the feature learning through Local-Global Long Short-Term Memory (LG-LSTM) layers. In [11], they split the feature map into several cells and considered only the local neighboring positions, proposing a Graph Long Short-Term Memory (Graph-LSTM) network, which is more naturally aligned with visual patterns in the image. However, these methods have not considered the effect of the scale and localization of objects on parsing efficiency. Xia et al. [23] proposed to detect object and part regions based on the parsing results and then zoom into proper scales to refine the parsing. Gong et al. [24] proposed a Part Grouping Network (PGN), which jointly unifies semantic part segmentation and instance-level human parsing, so that these two correlated tasks can mutually refine each other. Ruan et al. [25] conducted a great deal of rigorous experiments to clarify the properties affecting the performance of human parsing, including feature resolution, global context information and edge details. These methods have demonstrated the effectiveness of focusing on part regions, but all of their attention is at the object level.
In sum, existing human parsing methods have advanced substantially in segmentation accuracy, but with a known drawback: not all fine-grained parts of the human are segmented well by a single classifier [26].

III. THE PROPOSED PHPC NETWORKS
Fig. 2 shows the framework of the part-based human parsing cascade (PHPC). As shown, the PHPC consists of three networks: (1) an image-level parsing network, (2) a head-parsing sub-network, and (3) a body-parsing sub-network. These three networks are all built on the Fully Convolutional Network (FCN) and generate three feature maps, which are finally combined to refine the parsing results. In Fig. 2, $w_h$ and $h_h$ indicate the width and height of the head-part feature map, while $w_b$ and $h_b$ correspond to the width and height of the body-part feature map. First, our image-level parsing network (a) generates an initial parsing result for the whole image. Second, based on the initial result, we detect the head and body regions of the image. Third, we input the head part and body part into our (b) head-parsing sub-network and (c) body-parsing sub-network, respectively. To capture the details of small items, the head and body sub-images are scaled up to double their original size before being input to sub-networks (b) and (c). Finally, we combine all of the feature maps.
In sections III-A and III-B, we introduce in detail the image-level parsing network and the head- and body-parsing sub-networks. In section III-C, we introduce the fusion of all the networks.

A. IMAGE-LEVEL PARSING NETWORK
For the image-level parsing network, we used the FCN proposed for semantic segmentation by Long et al. [9]. Compared with ordinary CNNs, FCN replaces all of the fully connected layers with convolutional layers. The FCN can therefore operate on an input of any size and produce an output of the same size, so it can be trained end-to-end, providing pixel-to-pixel labels from raw images.
In our method, we used the VGG 16-layer network [19] as our base network. There are 13 convolutional layers with Rectified Linear Units (ReLU), 5 pooling layers, and 3 fully connected layers in the VGG-16 network. To use the network for segmentation, the 3 fully connected layers were converted to convolutional layers, resulting in feature maps that are 32 times smaller than the original size. The number of feature maps (also called channels) is the same as the number of class labels. Next, upsampling and skip layers are added to convert and fuse feature maps from different convolutional layers to obtain the final feature maps with the same size as the original image. In our method, inputs to the image-level parsing network were 384 × 384 colour images, passing through a stack of convolutional and pooling layers.
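The factor of 32 follows from the five pooling layers. The short sketch below checks the arithmetic for the 384 × 384 input; it assumes each pooling layer halves the spatial size and that the convolutions preserve it, which the text does not state explicitly, and `output_size` is a hypothetical helper name.

```python
def output_size(input_size, n_pool=5, pool_stride=2):
    """Spatial size after n_pool poolings (convolutions assumed size-preserving)."""
    size = input_size
    for _ in range(n_pool):
        size //= pool_stride  # each pooling layer halves the resolution
    return size

coarse = output_size(384)   # size of the coarsest score map
factor = 384 // coarse      # overall downsampling factor
```

Under these assumptions the coarsest score map is 12 × 12, which is why upsampling and skip layers are needed to recover a full-resolution prediction.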
We defined the loss $L$ by averaging the cross-entropy loss over all image pixels. More specifically, the loss function is defined as follows:
$$L = -\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{k=1}^{N}\hat{y}_{i,j,k}\,\log y_{i,j,k} \tag{1}$$
where $y_{i,j,k}$ represents the likelihood for pixel $(i, j)$ to belong to label $k$; $\hat{y}_{i,j,k}$ represents the ground-truth value indicating whether the label of pixel $(i, j)$ is $k$; $N$ is the total number of labels; and $H$ and $W$ are the height and width of the input image. The likelihood $y_{i,j,k}$ can be computed by the softmax function:
$$y_{i,j,k} = \frac{\exp(z_{i,j,k})}{\sum_{k'=1}^{N}\exp(z_{i,j,k'})} \tag{2}$$
where $z$ is the output of the last layer of the network. The softmax function ensures a diffuse network output, so that the class with a high probability score is highlighted and the classes with lower scores are suppressed. During training, the goal was to adjust the neural network weights so that the predictions matched the ground truth by updating the weights in the direction of a decreasing loss function value. We trained the parameters of the network to minimise the loss using Stochastic Gradient Descent (SGD), a common optimising algorithm in network training. SGD computes the gradient in every layer, updating the parameters layer by layer until the loss function (1) converges. The label prediction for pixel $(i, j)$ is obtained by $\arg\max_k y_{i,j,k}$.
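The per-pixel softmax, the averaged cross-entropy loss and the arg max prediction described above can be sketched in NumPy as follows. This is a minimal illustration rather than the training code used in the paper: the function names are hypothetical, the ground truth is assumed to be an integer label map (equivalent to a one-hot encoding), and the toy scores are made up.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis (labels), computed stably."""
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max to avoid overflow
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_cross_entropy(z, labels):
    """Cross-entropy averaged over all H x W pixels.

    z:      (H, W, N) raw scores from the last layer
    labels: (H, W) integer ground-truth label per pixel
    """
    y = softmax(z)                          # per-pixel likelihoods
    h, w, _ = z.shape
    rows, cols = np.indices((h, w))
    picked = y[rows, cols, labels]          # likelihood of the true label
    return float(-np.log(picked).mean())

# Toy example: a 2 x 2 image with 3 labels.
z = np.zeros((2, 2, 3))
labels = np.array([[0, 1], [2, 0]])
loss = mean_cross_entropy(z, labels)        # uniform scores give log(3)

# Scores that favour the true label are decoded back by the arg max.
prediction = softmax(z + np.eye(3)[labels]).argmax(axis=-1)
```

With uniform scores every label has likelihood 1/3, so the loss equals log 3; as the scores for the true labels grow, the loss falls towards zero.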

B. HEAD AND BODY PARSING SUB-NETWORKS
By inputting an image into the image-level parsing network, we obtained an initial parsing result, which provided a label for each pixel. The feature maps generated by the image-level parsing network provided 18 labelled parts, so the head and body regions can be easily localised on the input image by combining some of the labelled parts together. The head and body parts of the image can be obtained by cropping the image with the smallest rectangular bounding box that contains all pixels of the corresponding classes. Table 1 shows the classes of each network.
The head and body part images are resized to double the original size and then input to the corresponding sub-networks for training. The head- and body-parsing sub-networks have the same design as the image-level parsing network (discussed in section III-A), except that the output layer is different because of the different ground truth.
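The localisation and resizing steps can be sketched as below. The label ids in `HEAD_LABELS` are hypothetical placeholders (the actual grouping is the one listed in Table 1), and the doubling uses simple pixel repetition because the interpolation method is not specified in the text.

```python
import numpy as np

# Hypothetical label ids for the head group; the real grouping is given in Table 1.
HEAD_LABELS = (1, 2, 3)

def part_bounding_box(label_map, part_labels):
    """Smallest axis-aligned box (top, left, bottom, right) covering the given labels.

    bottom and right are exclusive. Returns None when no pixel carries any of the labels.
    """
    mask = np.isin(label_map, part_labels)
    if not mask.any():
        return None
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1

def crop_and_double(image, box):
    """Crop the box from the image and double its size by pixel repetition."""
    top, left, bottom, right = box
    crop = image[top:bottom, left:right]
    return crop.repeat(2, axis=0).repeat(2, axis=1)

label_map = np.zeros((6, 6), dtype=int)
label_map[1:3, 2:4] = 2                      # a small "head" blob
image = np.arange(36 * 3).reshape(6, 6, 3)
box = part_bounding_box(label_map, HEAD_LABELS)
head_input = crop_and_double(image, box)
```

The same two functions would serve the body region by passing the body label group instead.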

C. SUB-NETWORK COMBINATION
We obtain three feature maps from the image-level network and the head- and body-parsing sub-networks, which are then combined. Fig. 3 compares the results before and after the combination of sub-networks. As shown in Fig. 3(b) and Fig. 3(c), the response for sunglasses in the heat map is weak, but after combination it becomes much stronger (see Fig. 3(f)) because of the fusion of the head-parsing sub-network (see Fig. 3(d)).

IV. CONFIGURATION AND IMPLEMENTATION DETAILS

A. DATA SET PARTITIONING AND PRE-PROCESSING
We trained the proposed PHPC on the ATR dataset [8]. This dataset contains 7702 images, where each image is paired with a ground truth, a mask of pixels in 18 semantic labels. We split the available data into two sets (as shown in Fig. 4): the first set of 6898 images for training and the second set of 804 images for testing.
We augment the training data by randomly mirroring and cropping the images. The images are also normalised by subtracting the mean RGB value of all the training data. To balance computational efficiency and practicality (e.g., GPU memory), all images are resized to a resolution of 384 × 384 with zero padding.
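A minimal sketch of this pre-processing is given below. It assumes the image has already been resized so that both sides fit within 384 pixels, places the padding at the bottom and right (the text does not say where the padding goes), and omits random cropping for brevity; `preprocess` and the mean value used in the example are hypothetical.

```python
import numpy as np

TARGET = 384  # network input resolution stated in the text

def preprocess(image, mean_rgb, rng, train=True):
    """Mirror (training only), subtract the mean RGB value, zero-pad to TARGET x TARGET.

    image: (H, W, 3) array with H, W <= TARGET (assumed already resized to fit)
    """
    img = image.astype(np.float32)
    if train and rng.random() < 0.5:
        img = img[:, ::-1, :]                       # horizontal mirror
    img = img - np.asarray(mean_rgb, dtype=np.float32)
    h, w, _ = img.shape
    out = np.zeros((TARGET, TARGET, 3), dtype=np.float32)
    out[:h, :w, :] = img                            # pad bottom/right with zeros
    return out

rng = np.random.default_rng(0)
image = np.full((300, 200, 3), 128, dtype=np.uint8)   # made-up uniform image
x = preprocess(image, (100.0, 110.0, 120.0), rng)
```

Because the padding is applied after mean subtraction, padded pixels sit exactly at zero in the normalised input.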

B. TRAINING DETAILS
The final network models, namely the image-level parsing network (FCN) and the body and head sub-networks, were built with Caffe and trained with mini-batch stochastic gradient descent with a momentum of 0.9, a weight decay of 0.0005 and a fixed learning rate of 0.0001. The setup for training the PHPC networks is shown in Table 3.
To train the three sub-networks, we sped up the training process by fine-tuning parameters pre-trained on the ImageNet dataset and training in phases, following the fine-tuning strategy of Long et al. [9]. We trained the model on an NVIDIA Titan X GPU for 30 epochs, where each epoch meant one pass over the full training set. We did not separate the images into batches for iterations. In other words, each iteration contained one single image from the training set.
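For reference, a single parameter update with these hyperparameters can be sketched as follows. The update form (weight decay added to the gradient, followed by a momentum step) is the commonly used SGD-with-momentum rule and is an assumption here, not something the text spells out; the weights and gradient are made-up values.

```python
import numpy as np

LEARNING_RATE = 1e-4   # fixed learning rate from the text
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4

def sgd_momentum_step(weights, grad, velocity):
    """One parameter update; returns the new (weights, velocity)."""
    grad = grad + WEIGHT_DECAY * weights                 # L2 weight decay added to the gradient
    velocity = MOMENTUM * velocity - LEARNING_RATE * grad
    return weights + velocity, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
g = np.array([0.5, 0.5])          # made-up gradient
w1, v1 = sgd_momentum_step(w, g, v)
```

With one image per iteration, this step would be applied once per training image, 6898 times per epoch.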

C. EVALUATION METRICS
Evaluation metrics were defined to assess the performance of the network. We defined the Percentage of Correctly Localised Parts (PCP) metric to evaluate the location accuracy for part detection as follows:
$$\mathrm{PCP} = \frac{\text{number of correctly localised parts}}{\text{total number of parts}} \tag{4}$$
where a detected part is correctly localised if and only if the overlap between the bounding box of the detected part and the bounding box of the ground truth is over 50%. A threshold of 50% overlap is used because the input to the sub-networks need not be very accurate, as the sub-networks parse the detected region again.
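The PCP computation can be sketched as below. The text does not say how the overlap is measured, so intersection over union of the two boxes is assumed here; the function names and the example boxes are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (left, top, right, bottom)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def pcp(detections, ground_truths, threshold=0.5):
    """Fraction of parts whose detected box overlaps the ground truth by more than threshold."""
    correct = sum(box_iou(d, g) > threshold for d, g in zip(detections, ground_truths))
    return correct / len(ground_truths)

dets = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 8), (20, 20, 30, 30)]
score = pcp(dets, gts)   # first box overlaps well, second not at all
```

In this toy case one of the two parts passes the 50% threshold, so the PCP is one half.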
In addition, three metrics commonly used for evaluating semantic segmentation performance, including per-pixel accuracy, average per-class accuracy and mean accuracy on intersection over union region (IoU) are used in this study. The same metrics were also used in [9].
Let $n_{ij}$ be the number of pixels of class $i$ predicted as belonging to class $j$, and $n_{cl}$ be the number of classes. Therefore, $\sum_j n_{ij}$ is the total number of pixels in class $i$. Per-pixel accuracy is computed as follows:
$$\text{pixel acc} = \frac{\sum_i n_{ii}}{\sum_i \sum_j n_{ij}} \tag{5}$$
Mean per-class accuracy is computed as follows:
$$\text{mean acc} = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{\sum_j n_{ij}} \tag{6}$$
Mean accuracy of IoU is computed as follows:
$$\text{mean IoU} = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{\sum_j n_{ij} + \sum_j n_{ji} - n_{ii}} \tag{7}$$
We did not set aside a validation dataset in this study. Instead, we trained the models over a set of epochs by updating the network weights. The network weights are saved after every epoch. Inference was carried out to yield pixelwise predictions for the test data using all models optimised during the training process. Fig. 5 shows the achieved mean IoU on the test data set after every epoch. The mean IoU appears to improve quickly during the first 5 epochs and becomes stable (converged) after 10 epochs.
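The three segmentation metrics of this subsection can all be derived from a confusion matrix whose entry (i, j) counts the pixels of class i predicted as class j. The sketch below assumes every class occurs at least once in the ground truth, since otherwise the per-class ratios would divide by zero; the 2-class matrix is a made-up example.

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels of class i predicted as class j."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)                    # n_ii
    gt_total = conf.sum(axis=1)           # sum_j n_ij, pixels that truly belong to class i
    pred_total = conf.sum(axis=0)         # sum_j n_ji, pixels predicted as class i
    pixel_acc = tp.sum() / conf.sum()
    mean_acc = np.mean(tp / gt_total)
    mean_iou = np.mean(tp / (gt_total + pred_total - tp))
    return pixel_acc, mean_acc, mean_iou

conf = np.array([[8, 2],
                 [1, 9]])
pixel_acc, mean_acc, mean_iou = segmentation_metrics(conf)
```

Mean IoU is the strictest of the three because each class is penalised for both missed pixels and false alarms, which is why small classes such as sunglasses weigh heavily on it.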

D. SUPERPIXEL AND CRFS REFINEMENT
As reviewed in section II, superpixels and CRFs have been reported as effective strategies for refining segmentation results. We experimented with adding superpixel and CRF refinement as a post-processing step to the final results of PHPC, as shown in the dotted region of Fig. 2. To do so, we used the Simple Linear Iterative Clustering (SLIC) algorithm [20], a modified version of the K-means algorithm, to calculate superpixels in the input images. The SLIC algorithm segments image $I$ into a set of superpixels $R = \{R_1, R_2, \ldots, R_m\}$. Each superpixel $R_m$ is associated with one of the possible labels $L = \{l_1, l_2, \ldots, l_N\}$. In the last section, we obtained 18 label probability score maps. Given image $I$, for every superpixel $R_m$, we computed the probability score of $R_m$ belonging to label $l_k$ by averaging the feature score for label $l_k$ over all inner pixels of $R_m$, as follows:
$$P(R_m, l_k) = \frac{1}{c}\sum_{(i,j)\in R_m} s_{i,j,k} \tag{8}$$
where $s_{i,j,k}$ denotes the feature score of pixel $(i, j)$ for label $l_k$ and $c$ is the number of pixels within superpixel $R_m$. We assigned superpixels to labels with the maximum probability score $P$.
Considering the relationship between neighbouring superpixels, we also adopted a CRF model based on the superpixels for better segmentation. Given an image $I$, our objective is to minimise the following energy function:
$$E = \sum_{m}\psi_m + \sum_{(m,n)}\psi_{m,n} \tag{9}$$
where $\psi_m$ is a unary energy involving the superpixel $R_m$, $\psi_{m,n}$ is a pairwise energy involving a pair of superpixels $R_m$ and $R_n$, $l_y$ is the true label and $\hat{l}_y$ is the predicted label. The unary energy is computed for each superpixel individually. The pairwise term $\psi_{m,n}$ models the similarity between two superpixels. We only adopted a pairwise term for adjacent superpixels and considered the similar appearance of adjacent superpixels. The pairwise term combines a function $u$, where $u(l_y, \hat{l}_y) = 1$ if $l_y = \hat{l}_y$ and $u(l_y, \hat{l}_y) = 0$ otherwise, with an appearance kernel. The appearance kernel is inspired by the observation that nearby pixels with similar colour are likely to have the same label. The degrees of nearness and similarity are controlled by the parameters $\omega$, $\delta_c$ and $\delta_t$. We took the most likely label assignment as the prediction $\hat{y}$. We minimised the CRF energy (9) using alpha-expansion [21]. We trained the CRF model for post-processing using the PyStruct tool.
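The superpixel step (averaging the label scores inside each superpixel and assigning the label with the maximum averaged score) can be sketched as follows. The superpixel map is assumed to be given as an integer id per pixel, for example from a SLIC implementation; the function name and the toy scores are hypothetical, and the CRF stage is not covered.

```python
import numpy as np

def superpixel_labels(scores, segments):
    """Average per-pixel label scores inside each superpixel and pick the best label.

    scores:   (H, W, N) label score maps
    segments: (H, W) integer superpixel id per pixel
    Returns an (H, W) label map in which every superpixel carries one label.
    """
    n_labels = scores.shape[-1]
    ids, inverse = np.unique(segments, return_inverse=True)
    inverse = inverse.reshape(-1)                        # flat superpixel index per pixel
    flat = scores.reshape(-1, n_labels)
    sums = np.zeros((ids.size, n_labels))
    np.add.at(sums, inverse, flat)                       # sum of scores per superpixel
    counts = np.bincount(inverse, minlength=ids.size)    # c, the pixel count per superpixel
    mean_scores = sums / counts[:, None]
    best = mean_scores.argmax(axis=1)                    # label with the maximum averaged score
    return best[inverse].reshape(segments.shape)

segments = np.array([[0, 0, 1, 1]])
scores = np.array([[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]]])
refined = superpixel_labels(scores, segments)
pixelwise = scores.argmax(axis=-1)
```

In this toy case the second pixel individually prefers label 1 but is overruled by its superpixel, which illustrates how the averaging can absorb small regions into larger ones.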

E. COMPARATIVE STUDY
To evaluate the effectiveness of our proposed PHPC network, we conducted a comparative study. We compared the accuracy of PHPC (Fig. 2) with two other networks: (1) the basic FCN-8s image-level parsing model (Fig. 2(a)) and (2) the CRFasRNN network proposed by Zheng et al. [10], as shown in Fig. 6. For comparative purposes, we describe the CRFasRNN model here. As shown in Fig. 6, CRFasRNN refines the coarse segmentation from the FCN by integrating CRF optimisation into the model for end-to-end training.
Let $X_i$ be the variable associated with pixel $i$, where $X_i \in L$. Given an image $I$, the probability score of pixel $i$ being assigned to $l_i$ is initialised from $U_i$, the $i$-th feature map of the outputs from the FCN. The best label assignment $l_i$ is obtained by minimising the total CRF energy function (13). Energy functions (9) and (13) have the same formulation, including unary and pairwise energies. The key difference is that in our PHPC, the CRF is based on superpixels and modelled as a post-processing step (see section IV-D), while CRFasRNN is based on every pixel and uses the mean-field approximation to minimise the CRF energy in an end-to-end model. In the pixel-level formulation, computing the pairwise energy is very expensive. In view of this, mean-field inference is used to approximate the distribution $P(X|I)$ by a simpler distribution $Q(X|I)$. The mean-field inference update of a DenseCRF model can be broken into a series of small steps that correspond to neural network operations: initialisation, message passing, weighting filter outputs, compatibility transform, adding unary potentials and normalisation. The message passing step involves a bilateral filter, which can be viewed as convolutional. The weighting filter outputs and compatibility transform steps can be viewed as convolutions with 1 × 1 kernels. The adding unary potentials step is a common operation in neural networks. The initialisation and normalisation steps are both equivalent to the softmax operation. Fig. 6(a) illustrates the equivalent network layers of a mean-field iteration. By performing multiple mean-field iterations (Fig. 6(b)), where the output of one iteration becomes the input of the next iteration, the mean-field inference algorithm can be formulated as an RNN. Therefore, the model is called the CRFasRNN network, and it is able to provide end-to-end training. We compared our PHPC networks with FCN-8s [9] and this CRFasRNN (Fig. 6), and the detailed results are discussed in the next section.
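To make the sequence of mean-field steps concrete, the sketch below runs them on a toy problem with four pixels and two labels. It is a deliberately simplified stand-in for CRFasRNN: the bilateral filter is replaced by an explicit dense affinity matrix, a single fixed weight and a Potts compatibility matrix are assumed instead of learned parameters, and all names and numbers are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def mean_field(unary, kernel, weight, compat, n_iters=5):
    """Simplified dense-CRF mean-field inference.

    unary:  (P, N) scores U for P pixels and N labels (higher is better)
    kernel: (P, P) pairwise affinities with a zero diagonal
    weight: scalar filter weight
    compat: (N, N) label compatibility matrix (penalty for each label pair)
    """
    q = softmax(unary)                       # initialisation
    for _ in range(n_iters):
        msg = kernel @ q                     # message passing
        msg = weight * msg                   # weighting filter outputs
        pairwise = msg @ compat.T            # compatibility transform
        q = softmax(unary - pairwise)        # adding unary potentials, then normalisation
    return q

# Four pixels on a line; pixel 2 weakly prefers label 1, its neighbours prefer label 0.
pos = np.arange(4, dtype=float)
kernel = np.exp(-0.5 * (pos[:, None] - pos[None, :]) ** 2)
np.fill_diagonal(kernel, 0.0)
unary = np.array([[2.0, 0.0], [2.0, 0.0], [0.0, 0.2], [2.0, 0.0]])
potts = 1.0 - np.eye(2)                      # penalise disagreeing labels
q = mean_field(unary, kernel, weight=1.0, compat=potts)
before = unary.argmax(axis=1)
after = q.argmax(axis=1)
```

Each loop iteration corresponds to one unrolled step of the recurrent formulation, with the output distribution of one iteration fed into the next.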

V. EXPERIMENTAL RESULTS AND DISCUSSIONS
We evaluate the proposed PHPC networks in this section. For comparative purposes, we also trained FCN-8s and CRFasRNN on the ATR dataset using the same data partitioning and preprocessing, as outlined in section IV-A. We evaluated the overall effectiveness of the PHPC networks using the metrics defined in section IV-C.
The core idea of our PHPC lies in the detection of the head and body regions of the image, so we first evaluated the part detection performance on the test dataset in section V-A. In section V-B, we used the trained networks to infer pixelwise predictions on the test data. To do so, the weights determined in training were first loaded into the network and then inference was applied. Finally, the output prediction metrics were evaluated and compared. We evaluated the effectiveness of our PHPC without refinement and compared the results with the inference results of the FCN-8s and CRFasRNN networks. We discuss the effectiveness of superpixel and CRF refinement in section V-C.

A. EVALUATION OF HEAD AND BODY PART DETECTIONS
We calculated the PCP for all the test data using Equation (4). Table 4 shows the localisation accuracy for the head and body parts of our PHPC model and that of FCN-8s. As shown, the head localisation accuracy of FCN-8s and our PHPC is close to 100%, while the body part localisation accuracy of PHPC is 1.37% higher than that of FCN-8s on the test data. This illustrates that our PHPC method is effective in localising and detecting the head and body regions of the input images.

B. EVALUATION OF HUMAN PARSING RESULTS
The per-pixel accuracy (pixel acc), per-class accuracy (mean acc) and mean accuracy on intersection over union region (mean IoU) are calculated by running inference with the trained PHPC model on the test data and are listed in Table 5, which also compares the metrics of the FCN-8s and CRFasRNN models. As shown, the proposed PHPC model achieves the best results in all performance metrics. In comparison to the FCN-8s and CRFasRNN models, the mean IoU of our PHPC has improved by 4.58% and 3.49%, respectively, and the mean acc has improved by 3.21% and 3.02%, respectively. Table 6 also shows the per-class mean IoU comparison of our PHPC model, FCN-8s and CRFasRNN. Our PHPC improves the IoU score in every category compared to FCN-8s and CRFasRNN. In particular, the accuracy of small items improves significantly. For example, the accuracy of sunglasses is 11.433% greater than that of FCN-8s and 7.278% greater than that of CRFasRNN. As another example, the accuracy of belt is 3.430% greater than that of FCN-8s and 3.416% greater than that of CRFasRNN. Also, compared to CRFasRNN, the accuracy of scarf has improved by 5.938%. Our model performs better than FCN-8s and CRFasRNN because it strengthens the attention on the head and body parts and combines the parsing results of the sub-networks (which focus attention on local parts) into the image-level parsing result.
Fig. 7 shows some parsing results generated by the different models. As shown in Fig. 7 row (i), neither FCN-8s nor CRFasRNN detects the sunglasses, but our PHPC model parses them accurately. Fig. 7 row (ii) also shows that our PHPC model performs better at parsing the belt than FCN-8s and CRFasRNN. Fig. 7 row (iii) shows that FCN-8s and CRFasRNN confuse upper-clothing and dress, but our PHPC model segments the upper-clothing and trousers properly. In Fig. 7 row (iv), the skirt was not accurately parsed by the FCN-8s and CRFasRNN models, but our PHPC model parsed the skirt well.
We also compare the speed of the three networks in Table 7. As shown, compared with FCN-8s and CRFasRNN, the running time needed to parse an image using our PHPC networks is much longer. This is because our PHPC networks contain three networks in total and these networks are arranged as a cascade and process the input image in turn. The output of the image-level parsing network is used to generate the input of the head-and body-parsing sub-networks, which substantially extends the running time needed.

C. EVALUATION OF HUMAN PARSING RESULTS
As discussed in section IV-D, we experimented with using superpixels and a CRF as a post-processing step to further refine the results. A comparison of the overall accuracy of our PHPC models with and without superpixel and/or CRF refinement is shown in Table 8. Although the literature reported that superpixels and CRFs could improve the segmentation accuracy [1], [2], [10], our results do not show improvements in accuracy. Instead, the overall pixel accuracy, overall per-class accuracy and mean IoU all dropped after the refinement process, and the CRF refinement results in a bigger drop than using superpixels alone. Therefore, we consider superpixel and CRF post-processing ineffective for the PHPC model, and we do not recommend the refinement shown in the dotted region of Fig. 2.
By looking at the per-class mean IoU comparison in Table 9, we can see that the accuracy of big items, such as upper-clothing, trousers and dress, increased, but the accuracy for small items decreased. This demonstrates that the proposed superpixel and/or CRF refinement is only effective for big items. The accuracy of small items may have decreased because the superpixel and CRF model we used is only a post-processing step (as discussed in section IV-D), not an end-to-end model like CRFasRNN. Fig. 8 shows some of the parsing results of the models with and without refinement. The post-processing step may incorrectly update or group small item pixels into the groups of large items.
To conclude, CRF refinement based on superpixels, applied as a post-processing step to the final results of the current model, does not appear to improve the parsing accuracy for small items because the CRF model was not integrated or trained end-to-end. We will improve the refinement model and integrate all of the components of our method into a unified network in our future work.

VI. CONCLUSION
In this paper, we proposed a novel part-based human parsing cascade (PHPC) of networks for the parsing of human images, which consists of an image-level parsing network, two part-based parsing networks and a combination module. The inputs to the part-based parsing networks are partial images detected based on the results from the image-level parsing network and then scaled up to double the original size. We have demonstrated that this network design is beneficial for extracting more detailed features from input images. By doubling the size of the partial raw image in the localised areas of interest, the part-based parsing sub-networks decreased the impact of the background and improved the parsing accuracy of small items. The experimental results have demonstrated the effectiveness of the proposed PHPC networks.
The proposed PHPC networks do, however, have some known limitations. First, the inputs for the part-based parsing sub-networks rely on the output of the image-level parsing network, and the image-level and part sub-networks do not share any parameters. As a result, the running speed of the PHPC is slow in comparison to other end-to-end parsing models.
YANGHONG ZHOU received the bachelor's degree from Fujian Normal University, in 2011, and the master's degree from the University of Electronic Science and Technology of China, in 2014. She is currently pursuing the Ph.D. degree with The Hong Kong Polytechnic University. Her research interests include deep learning and image analysis.
P. Y. MOK received the B.Eng. degree (Hons.) majoring in industrial and manufacturing systems engineering and the Ph.D. degree from the University of Hong Kong, in 1998 and 2002, respectively. She is currently an Associate Professor with The Hong Kong Polytechnic University. Her current research interests include fashion pattern engineering, fashion 2D and 3D CAD, digital human modeling, 3D scanning and sizing, cloth simulation, deep learning, computer generated textile, sketch and pattern designs, computer vision and computer graphics in fashion applications, advanced data analysis, and artificial intelligent applications in the fashion industry.
SHIJIE ZHOU received the bachelor's degree from the Lanzhou University of Technology, in 1995, and the Ph.D. degree from the University of Electronic Science and Technology of China, in 2004. He is currently a Professor with the University of Electronic Science and Technology of China. His research interests include network security, traffic simulation, and artificial intelligence.