Neural Network Approach for 2-Dimension Person Pose Estimation With Encoded Mask and Keypoint Detection

We present an innovative approach for 2D person pose estimation by developing a convolutional neural network for 2-channel human mask prediction and human 2D pose estimation. Conceptually, our idea is simple and inspired by prior image segmentation research: explicitly encoded mask data can serve as a critical feature for person pose estimation. We propose a convolutional neural network model that combines the image segmentation technique with the bottom-up approach to human pose estimation. We observe that constructing a two-stage network trained in an end-to-end manner is mutually beneficial for person mask prediction and 2D person pose estimation. At the pose estimation stage, we detect heatmaps of person keypoint locations from the mask information, together with their mutual connection relations. These are then used to estimate the final pose while removing unwanted or occluded keypoints, since such keypoints may propagate across the network and lead to redundant pose estimates. We train and test our system on the MS-COCO dataset, and the experimental results validate the effectiveness of the proposed methodology.


I. INTRODUCTION
Multi-person pose estimation is used broadly in different applications of computer vision. The purpose of this research topic is to estimate different parts of a human body in an image or a video and, after detecting all human poses automatically, form a skeleton structure of the human body. Pose estimation is a challenging problem as several key factors need to be taken into account, such as the background, a variety of clothing, different lighting conditions, etc. In contrast with single-person pose estimation, multi-person pose estimation is more challenging because neither the positions nor the number of people is provided. There are several different techniques used in research to address this problem.
Formerly used techniques are based on hand-crafted features, e.g., Edgelet [1] and HOG (Histogram of Oriented Gradients) [2], [3], but they are inadequate for locating the exact positions of human body parts. On the other hand, deep-learning-based techniques are more effective at extracting features from the provided data; they have achieved remarkable results and have greatly surpassed previous non-deep-learning approaches. Deep-learning-based techniques offer pixel-to-pixel correspondence through convolutions in pose estimation, yet many improvements are still needed for the development of a real-time pose estimation network. Detecting person keypoints and their locations are two challenges to solve in pose estimation tasks. To address these two problems, two well-known approaches have been introduced, known as the top-down and the bottom-up approach.
Prior works such as [4], [5] show that the top-down approach can achieve better accuracy. On the other hand, the study in [6] reports that the bottom-up approach can achieve the same accuracy with multi-scale result analysis. Both pipelines have been explored with deep learning techniques. It is hard to say which of the two is better, because in real-world implementations several key factors are involved, such as the speed and the accuracy of the system. The main drawback of the top-down method is its lack of speed, as the network is first required to detect all humans with a human detector. Every detected human is then processed one by one to estimate the person's pose on the basis of keypoints in the image, which increases the overall pose estimation time linearly with the number of people. The bottom-up approach uses the reverse strategy: human keypoints are detected first, and a person pose is then estimated from those keypoints. Another factor is the hardware constraint. During network training, given the same network and the same hardware, the effective resolution available for a single person in a bottom-up approach is lower than in a top-down approach; hardware constraints therefore matter significantly in processing. Mask R-CNN [7] can perform segmentation tasks to predict the mask as well as the pose estimation in the same instance, although Mask R-CNN treats the pose as a soft segmentation mask. The network therefore produces keypoints as soft segmentation masks alongside the person segmentation. The basic intuition of this paper is that explicitly encoded mask information can serve as a critical feature for human pose estimation in the structure of generative methods.
Consequently, we estimate the mask segmentation and the person poses sequentially. We combine image segmentation with the bottom-up approach, using CNN techniques, for person pose estimation. The key idea of our work is to take advantage of the encoded mask information of a human body. We find that a two-stage cascaded network can increase the accuracy of both mask estimation and human 2D pose detection. Fig. 1 shows the flow of our work. Fig. 5 and Table 3 report qualitative and quantitative comparison results of the proposed method, respectively. The main contributions of this work are stated as follows:
• We introduce a multi-person pose estimation approach, which contains two stages: (a) a mask prediction stage and (b) a pose prediction stage, executed sequentially. The results show that our method improves accuracy by a large margin.
• We develop a unique multi-stage convolutional neural network that extracts person keypoint features from encoded mask data, generates heatmaps, and sequentially learns the relationships between different body parts for their mutual connections.
• We show that the mask and pose estimation stages are mutually beneficial in the end-to-end pose estimation network. The proposed system consists of two parts: the mask segmentation part and the pose estimation part. The person mask and the mask-prediction feature maps contribute to the creation of body-part heatmaps, and through back-propagation in end-to-end training, the person pose prediction stage also improves the accuracy of the person mask estimation.
• We introduce a simple but effective method to discard occluded or poorly visible keypoints to improve overall pose estimation accuracy.

II. RELATED WORK
Several methods have been developed for human pose estimation; traditional methods adopt the pictorial structure technique [8], [9], whereas recent research introduces deep-learning-based methods. Early works such as [7], [10] identify that mask information is one of the essential factors for human pose estimation. More recently, Wang et al. [11] propose a wavelet frame-based fuzzy clustering algorithm and apply it to the image segmentation problem defined in fixed Euclidean and uneven fields. A study in [12] by Bhandari et al. introduces a multi-level thresholding technique to increase the quality of segmented images by calculating the 3D Otsu alongside the fusion phenomena. The study advocates that, in contrast with current segmentation methods, the fusion-based multi-level thresholding approach is a simpler procedure with higher-quality outcomes, and decreases the implementation time as the number of threshold levels increases. Human pose estimation is categorized into two groups depending on the structure of the network, namely the top-down and the bottom-up pipeline. We concisely analyze and discuss both groups of multi-human pose estimation pipelines. Earlier, for person pose estimation in an image or video, the rough location of a person was provided before the estimation process, e.g., by a bounding box or a rough target location [13].
Recently presented models solve this by joint detection, since these joints (keypoints) have more precise positions and are associated naturally. Single-person pose estimation is classified into two frameworks. The first directly regresses keypoints from features; this is a linear regression-based structure, called DeepPose [14]. Carreira et al. [15] use a self-correcting approach. In some works, such as [16], researchers generate heatmaps first and extrapolate keypoint locations from the heatmaps, known as a heatmap-based framework. Convolutional Pose Machines (CPM) [17] detect body keypoints and then associate those keypoints to each body part independently for the final pose estimation. Banzi et al. [18] propose a semi-supervised latent tree dependency model (LDTM), which transforms internal joint locations into an unambiguous representation. The study further combines the established hand topology with the pose estimator using a data-dependent system to jointly learn the latent variables of the posterior pose presence and the pose configuration, respectively. The studies in [19], [20] show efforts to decrease the computational burden in pose estimation networks. Compared with single-person pose estimation, multi-person pose estimation is more challenging, since the number of people and their positions are unknown in an image or video. Detection of human positions and detection of their keypoints are the two main challenges in this task. Two well-known pipelines have been proposed to overcome these challenges.

A. TOP-DOWN FRAMEWORK
The most commonly used framework is top-down, which first uses a human detector to find persons in a given image and then estimates a pose for every detected person individually. The first proposed top-down framework is DeepPose by Toshev and Szegedy [14], using a face detector. They further use a cascaded Deep Neural Network (DNN) regressor to estimate the person keypoints and test their approach on the FLIC [21] dataset. Radosavovic et al. [22] exploit all the available labeled and unlabeled data, utilizing omni-supervised learning for model training. The technique shows that both related and unrelated information is valuable and practical for pose prediction. Some methods integrate person detectors and estimation, such as Rmpe [23], while others simultaneously predict human bounding boxes and keypoints in a unified network, e.g., Mask R-CNN [7]. The studies in [7], [23] have examined the detection of humans and the alignment of bounding boxes. Mask R-CNN [7] concurrently predicts human bounding boxes as well as keypoint locations, which makes the detection process faster, while Rmpe [23] by Fang et al. shows that pose estimation is highly dependent on the accuracy of the human detector. They implement a Symmetric Spatial Transformer Network (SSTN) in parallel with a Single Person Pose Estimator to estimate the human area more precisely, and propose Non-Maximum Suppression (NMS) to address occlusion issues in pose estimation. As severe occlusion or confused joint detection is still challenging, Papandreou et al. [24] use the Non-Maximum Suppression approach to remove false positives while keeping true positives.
Other research shows that the top-down framework is heavily dependent on the human detector. For instance, Chen et al. [25] propose a cascaded pyramid network, showing that with a higher-quality human detector, human pose estimation achieves better accuracy. Their system contains two parts: GlobalNet and RefineNet. In practice, a commonly used human detector for person pose estimation is built on Faster R-CNN [26], with a variety of modifications based on ResNet [27], Inception-ResNet [28], VGG [29], and inclusive configurations such as the Feature Pyramid Network (FPN) [30].

B. BOTTOM-UP FRAMEWORK
In bottom-up approaches [4]-[6], [31], [32], person keypoints are detected first for each human in an image, and these keypoints are then associated using different techniques to estimate a human pose. Pishchulin et al. [4] propose DeepCut as the first bottom-up framework built on deep learning techniques. This work formulates the pose estimation problem as a minimum-cost multi-cut problem, representing the joint candidates as vertices and the relationships among them as edges. Insafutdinov et al. upgrade DeepCut [4] by using a deeper network that generates the body-part proposals more effectively, and name it DeeperCut [31]. Both the speed and the accuracy are improved in their system compared to DeepCut [4] by using different expansion approaches. Even after significant improvements, DeeperCut [31] still lacks speed, particularly when it comes to solving the minimum-cost multi-cut problem in real time. Cao et al. [5] propose an efficient method based on a non-parametric representation called Part Affinity Fields (PAFs). This work generates heatmaps around the keypoint regions, and subsequently uses a greedy algorithm for pose parsing after the prediction of heatmaps and PAFs, enabling real-time performance. Zhu et al. [33] make several changes to [5] to improve the overall results. The amendments include a deeper network and redundant PAFs that support immature connections with a broken origin connection. Li et al. [34] further improve the work in [5] and propose a better technique for pose parsing by treating each pose individually for pose refinement, although discarding a single redundant keypoint is still a problem.
The essential advantage of the top-down pipeline is noticeable: it uses a very high level of distinct prediction for each person in the image, though the accuracy of the network is highly dependent on the human detector. Furthermore, when the number of people in the input image grows, the overall prediction time also increases linearly because the pose estimation process runs for every person individually. The bottom-up approach, on the other hand, keeps the total computation constant under this scenario. However, because the relationships between body parts may fail to use global contextual data, they may not be as reliable as a human detector, since human detection is becoming more and more accurate due to recent advances in detection and classification.

III. PROPOSED METHOD
Given a color image $I \in \mathbb{R}^{h \times w \times 3}$, where the entries of $I$ are the pixel values of the image, our aim is to estimate a 2-dimensional human pose. In our approach, it is essential to first detect the human mask $M \in \mathbb{R}^{h \times w}$ in the input image, since the mask contains essential information about the human body. We train the network to classify the human masks from an input image and the human keypoint annotations from the dataset. We characterize keypoint locations as $X_k \in \mathbb{R}^2$ with $k \in \{1, \dots, K\}$. Every keypoint $k$ denotes a body part of a human, such as the left shoulder, right wrist, etc. Our experiment uses $K = 17$ keypoints, which are defined by the COCO dataset [35]. The goal is to train the person mask segmentation stage and the person keypoint detector from a given RGB image $I$, mask $M$, and keypoints $X_k$, where the mask and keypoint annotations come from the dataset.

A. NETWORK ARCHITECTURE
The proposed convolutional neural network consists of two stages. Stage one predicts the human masks, while the second stage of the system, which is implemented in two parallel branches, estimates person poses. Branch one is for person keypoint location detection, and the second branch learns the relations between keypoints, as shown in Fig. 3. The mask prediction stage generates mask information for the second stage from the input RGB (color) image. These encoded mask layouts can be addressed naturally in the pixel-to-pixel correspondence carried by the convolutions, similar to [36], [37]. The network uses VGG-19 [29] up to conv4_4 to produce a 128-channel feature map $F$. The feature $F$ is further convolved by a fully convolutional network (FCN) [38] to forecast a 2-channel human mask of size $m \times m$. At this stage, the output size of the mask for an input image is $m^* = m / 8$, where $m$ is the size of the annotated mask of the original image, divided by eight due to the three pooling layers in the network. Table 1 reports the network configuration in detail, i.e., the number of layers and their parameters. Cross-entropy is used to calculate the loss of the mask prediction stage. Suppose that a single batch contains $N$ samples. Then the loss function for the mask prediction stage is
$$L_{mask} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j \in \{0,1\}} y_{i,j} \log K_j(x_i),$$
where $y$ serves as the ground-truth label and $K_j$ is the output score of the $j$-th label of the mask forecasting stage, $j \in \{0, 1\}$.
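As a concrete illustration, the following is a minimal Keras/TensorFlow sketch of the mask prediction stage under the description above. The exact layer counts and channel widths of the FCN head are given in Table 1 and are only approximated here; the 368 × 368 input resolution is an assumption.

```python
# Minimal sketch of the mask prediction stage (Keras/TensorFlow).
# The FCN head depth and widths are illustrative, not the exact Table 1 values.
import tensorflow as tf
from tensorflow.keras import layers, Model, applications

def build_mask_stage(input_shape=(368, 368, 3)):
    image = layers.Input(shape=input_shape, name="rgb_image")

    # VGG-19 backbone truncated at conv4_4 (stride 8 after three pooling layers).
    vgg = applications.VGG19(include_top=False, weights="imagenet", input_tensor=image)
    backbone = vgg.get_layer("block4_conv4").output

    # Reduce to the 128-channel feature map F shared with the pose stage.
    F = layers.Conv2D(128, 3, padding="same", activation="relu", name="feature_F")(backbone)

    # Fully convolutional head predicting the 2-channel (person / background) mask,
    # trained with cross-entropy.
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(F)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    mask = layers.Conv2D(2, 1, padding="same", activation="softmax", name="mask")(x)

    return Model(image, [F, mask], name="mask_stage")

model = build_mask_stage()
model.summary()
```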
The human pose estimation stage follows the mask generation stage and predicts keypoint heatmaps and Part Affinity Fields (PAFs) for joint connections, similar to [5]. We execute this second stage of our network with several sequential sub-stages $(S_1, S_2, S_3, \dots, S_T)$. In every sub-stage $S_t$, the human pose is predicted from the data produced at the mask prediction stage and from the poses anticipated in the earlier sub-stage $S_{t-1}$. The first sub-stage uses the input data from the mask prediction: in particular, we concatenate the 128-channel feature $F$ with the predicted mask $M$ to generate a 130-channel feature map $F^*$ as the input of the human pose estimation stage. Let $K = [K_1, K_2, K_3, \dots, K_J]$ denote the keypoint heatmaps and $L = [L_1, L_2, L_3, \dots, L_C]$ the Part Affinity Fields. $K_j \in \mathbb{R}^{h \times w}$, with $j \in \{1, \dots, J\}$ (one per keypoint), represents the 2-dimensional confidence score of the $j$-th keypoint at every pixel of image $I$. Correspondingly, $L_c \in \mathbb{R}^{h \times w \times 2}$, with $c \in \{1, \dots, C\}$ (one per connection), defines the 2-dimensional vector field representing the relationship between two keypoints or body parts, also called a limb; the connection between two body parts is called a limb. Every sub-stage $S_t$ of the pose estimation calculates $K + 1$ 2-dimensional human pose heatmaps, where the input of $S_t$ is $F^*$ together with the $K + 1$ heatmaps from the earlier sub-stage $S_{t-1}$; similarly, for the second branch, the input is $F^*$ together with the $L + 1$ PAF maps from $S_{t-1}$. Hence the entire input for each sub-stage $S_t$ is $F^* + (K + 1)$ for the heatmap branch and $F^* + (L + 1)$ for the PAF branch; both branches share weights to improve the efficiency of the network. Note that $F^*$ alone serves as the input only for the first sub-stage $S_1$. Fig. 2 represents the step-by-step working sequence of the proposed method: Fig. 2(a), stage 1, shows the predicted mask for an input color image, and Fig. 2(b), stage 2, represents the estimated heatmap at the right wrist, along with the single connection between the right wrist and the right elbow. Additionally, Fig. 3 demonstrates the overall network architecture of the proposed method.
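The cascaded pose estimation stage can be sketched in the same Keras style. The per-branch depth, kernel sizes, and the number of limbs (set to 19 here) are illustrative assumptions, not the exact Table 1 configuration.

```python
# Illustrative sketch of the cascaded pose estimation stage (Keras/TensorFlow).
from tensorflow.keras import layers

def pose_sub_stage(x, num_keypoints=17, num_limbs=19, name="stage1"):
    """One sub-stage S_t with two parallel branches reading the same input."""
    def branch(inp, out_ch, tag):
        h = inp
        for i in range(3):
            h = layers.Conv2D(128, 7, padding="same", activation="relu",
                              name=f"{name}_{tag}_conv{i}")(h)
        return layers.Conv2D(out_ch, 1, padding="same", name=f"{name}_{tag}_out")(h)

    heatmaps = branch(x, num_keypoints + 1, "heat")   # K + 1 maps (keypoints + background)
    pafs = branch(x, 2 * num_limbs, "paf")            # 2 channels per limb (x, y vector field)
    return heatmaps, pafs

def pose_stage(F, mask, num_stages=6):
    # 128-channel F concatenated with the 2-channel mask gives the 130-channel F*.
    F_star = layers.Concatenate(name="F_star")([F, mask])

    heat, paf = pose_sub_stage(F_star, name="stage1")
    outputs = [(heat, paf)]
    for t in range(2, num_stages + 1):
        # Each later sub-stage sees F* plus the previous stage's predictions.
        x = layers.Concatenate(name=f"stage{t}_in")([F_star, heat, paf])
        heat, paf = pose_sub_stage(x, name=f"stage{t}")
        outputs.append((heat, paf))
    return outputs
```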

B. CONFIDENCE MAPS FOR JOINT DETECTION
The keypoint heatmap, also known as a confidence map, measures the likelihood of a body part appearing at each location. The ground-truth confidence map is formed via a Gaussian peak at the part location. Suppose that $K^*_{j,y}$ denotes the confidence map of body part $j$ belonging to human $y$, and the location of that particular part is $X_{j,y}$. The confidence score at a location $l$ is
$$K^*_{j,y}(l) = \exp\left(-\frac{\|l - X_{j,y}\|_2^2}{\sigma^2}\right),$$
where the spread of the Gaussian peak is controlled by $\sigma$. Further, if there are multiple humans, the confidence map of part $j$ is computed by aggregating the individual maps,
$$K^*_{j}(l) = \max_{y} K^*_{j,y}(l).$$
We use Depth First Search (DFS) to order all limbs, e.g., from the right shoulder to the right elbow or the left knee to the left ankle, etc. In contrast with Breadth First Search (BFS), DFS is more suitable for puzzle-like problems when it comes to decision making. For a single connection between two body joints, see Fig. 4, which represents a unique connection between the elbow and the wrist. If the connection field between two keypoints is the vector $V$ inside a rectangle of width $2\delta$, as shown in Fig. 4, then the value of the field $L^*_{c,y}$ at a point $l$ equals $V$ only if the location $l$ lies within that rectangle, and 0 otherwise. The vector $V$ is the unit vector along the direction of limb $c$, computed as $V = (X_{j_2,y} - X_{j_1,y}) / \|X_{j_2,y} - X_{j_1,y}\|_2$.
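A short NumPy sketch of the ground-truth confidence map construction, assuming the max aggregation over people used in [5]:

```python
# Ground-truth confidence map: a Gaussian peak at each annotated keypoint,
# aggregated over people by taking the pixel-wise maximum.
import numpy as np

def confidence_map(keypoints, height, width, sigma=1.5):
    """keypoints: list of (x, y) locations of one body part, one entry per person."""
    ys, xs = np.mgrid[0:height, 0:width]
    conf = np.zeros((height, width), dtype=np.float32)
    for (px, py) in keypoints:
        d2 = (xs - px) ** 2 + (ys - py) ** 2
        # Gaussian peak; sigma controls the spread of the peak.
        conf = np.maximum(conf, np.exp(-d2 / sigma ** 2))
    return conf
```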
The ground-truth connection field for limb $c$ over an input region with multiple humans is calculated as
$$L^*_{c}(l) = \frac{1}{q_c(l)} \sum_{y} L^*_{c,y}(l),$$
where $q_c(l)$ is the number of non-zero vectors at location $l$. During testing, the predicted fields (PAFs) are used to connect the detected parts. For each candidate pair $j_1$ and $j_2$, we verify whether $j_1$ and $j_2$ form a connection as a limb. Suppose that $X_{j_1}$ and $X_{j_2}$ are the locations of $j_1$ and $j_2$. Then the direction of the path is $P = X_{j_1} - X_{j_2}$. We sample a set $A$ of points along the line connecting $j_1$ and $j_2$, and the confidence score of a connection is calculated as
$$E = \sum_{l \in A} L_c(l) \cdot \frac{P}{\|P\|_2},$$
where $|A|$ is the number of sampling points, with $|A| \propto \|P\|$.
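The connection scoring at test time can be sketched as follows; the helper name, sample count, and the averaging over samples are illustrative choices rather than the paper's exact formulation.

```python
# Hedged sketch: score a candidate limb between two detected keypoints by
# sampling the PAF along the segment joining them.
import numpy as np

def connection_score(paf_x, paf_y, p1, p2, num_samples=10):
    """paf_x, paf_y: 2D PAF channels for one limb; p1, p2: (x, y) keypoint candidates."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    direction = p2 - p1
    norm = np.linalg.norm(direction)
    if norm < 1e-6:
        return 0.0
    unit = direction / norm

    score = 0.0
    for a in np.linspace(0.0, 1.0, num_samples):
        x, y = (p1 + a * direction).round().astype(int)
        # Dot product between the sampled PAF vector and the limb direction.
        score += paf_x[y, x] * unit[0] + paf_y[y, x] * unit[1]
    # Averaged here so the score does not grow with limb length (a design choice).
    return score / num_samples
```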

C. LOSS FUNCTION
We supervise the loss at each stage to solve the problem of vanishing gradients [23]. At the same time, we use a weighted loss function to avoid penalizing true-positive predictions during training; the further processing follows [24]. The loss function at a particular stage $t$ is computed as
$$f^t = \sum_{j=1}^{J} \sum_{l} W(l)\, \| K_j^t(l) - K_j^*(l) \|_2^2 \; + \; \sum_{c=1}^{C} \sum_{l} W(l)\, \| L_c^t(l) - L_c^*(l) \|_2^2,$$
where the ground-truth keypoint map for part $j$ is $K^*_j$, $K^*_B$ is the background map, and for connection $c$ the ground-truth field is $L^*_c$. $W$ is a binary mask with $W(l) = 0$ if the annotation is absent for a human keypoint at location $l$, and $\lambda$ is fixed to 0.04 to weight the foreground and background loss. The entire loss function is computed as
$$f = \sum_{t=1}^{T} f^t.$$
After detection of the heatmaps over $K$ and the PAFs $L$, we finally estimate the full human pose by connecting body keypoints. Suppose that we have roughly one mask for each person in a given image. The initial pose creation is similar to [34], except for two main differences. First, the parsing is restrained to each detected mask in sequence. Second, at the end of pose completion, before the final pose plotting, we search the concerned pose to remove redundant keypoints, keeping in mind that redundant keypoints are possible since a person, or part of a person, may be visible in two or more masks. In the pose estimation stage the network may detect redundant keypoints (for example, when the second keypoint of a mutual connection is not visible in the image because of overlap or absence); such keypoints may further lead to occluded pose estimates. Besides, a confidence map (keypoint) may contain only background data; in that case, the keypoint calculation deducts the count of the $K$ confidence maps by 1. Note that, even if a single 2D keypoint of a pair is absent, as in Fig. 4 (the representation of a limb connection between the elbow and the wrist), i.e., the keypoint at the elbow or the wrist is not observable or is occluded, the single remaining keypoint (confidence map) of that particular pair is set to zero to avoid redundant pose parsing.
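A minimal TensorFlow sketch of the per-stage weighted L2 supervision described above; the foreground/background weighting with $\lambda$ is omitted for brevity, and the tensor layout is an assumption.

```python
# Per-stage weighted L2 loss, supervised at every sub-stage.
import tensorflow as tf

def stage_loss(pred_heat, gt_heat, pred_paf, gt_paf, W):
    """pred_*/gt_*: (batch, h, w, channels); W: (batch, h, w, 1) binary annotation mask."""
    heat_term = tf.reduce_sum(W * tf.square(pred_heat - gt_heat))
    paf_term = tf.reduce_sum(W * tf.square(pred_paf - gt_paf))
    return heat_term + paf_term

def total_pose_loss(stage_outputs, gt_heat, gt_paf, W):
    # Summed over all sub-stages to counter vanishing gradients.
    return tf.add_n([stage_loss(h, gt_heat, p, gt_paf, W) for (h, p) in stage_outputs])
```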

D. NON-MAXIMUM SUPPRESSION (NMS)
A set of joint candidates is created with the help of the Non-Maximum Suppression (NMS) technique. There may be more than one candidate for every joint because of the several masks in an image. Given this, the most confident estimated pose $Z^*$ is first selected, as proposed in [34]. Other poses similar to $Z^*$ are then abolished according to a redundant-pose abolishing criterion. We repeat this process until only one pose remains per person. We compute the pose confidence from the area covered by the pose, the confidence of the joints, and the confidence of the joint connections. Consider a pose $Z$ with $J$ joints; then the set of joints of $Z$ is
$$Z = \{Z_1, Z_2, \dots, Z_J\},$$
where $Z_j$ denotes the location of the $j$-th joint. Suppose that $s_1(Z)$ and $s_2(Z)$ are the average confidence scores of the joints and the connections, respectively. Then the confident pose $Z^*$ is the one maximizing
$$s(Z) = \alpha\, s_1(Z) + \beta\, s_2(Z) + \lambda\, \frac{m(Z)}{M},$$
where $m(Z)$ is the mask of $Z$ and $M$ is the area covered by the mask. Furthermore, the weights $\alpha$, $\beta$ and $\lambda$ are set to 0.2, 0.2 and 0.6, respectively. For redundant pose removal, we use a distance metric $d(Z, Z^*)$ to calculate the pose similarity between a redundant pose and the confident pose, where $\eta_Z$ and $\eta_{Z^*}$ count the visible joints and $d(Z, Z^*)$ measures the percentage of unmatched joints in $Z$ and $Z^*$. A threshold on this distance is then set as the abolishment criterion for the removal of redundant poses.
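A hedged Python sketch of the redundant-pose suppression loop; the similarity function, its 5-pixel matching tolerance, and the 0.5 threshold are illustrative assumptions standing in for the paper's exact distance metric.

```python
# Redundant-pose suppression: keep the most confident pose, discard similar ones, repeat.
import numpy as np

def pose_similarity(Z, Z_star):
    """Fraction of jointly visible joints of Z that closely match Z_star (assumed metric)."""
    visible = [j for j in range(len(Z)) if Z[j] is not None and Z_star[j] is not None]
    if not visible:
        return 0.0
    matched = sum(np.linalg.norm(np.subtract(Z[j], Z_star[j])) < 5.0 for j in visible)
    return matched / len(visible)

def pose_nms(poses, scores, sim_threshold=0.5):
    """poses: list of joint-location lists; scores: combined confidence per pose."""
    keep = []
    remaining = list(np.argsort(scores)[::-1])
    while remaining:
        best = remaining.pop(0)
        keep.append(best)
        # Abolish poses too similar to the most confident one.
        remaining = [i for i in remaining
                     if pose_similarity(poses[i], poses[best]) < sim_threshold]
    return [poses[i] for i in keep]
```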

E. FINAL POSE PLOTTING
The purpose of this stage is to fix the disconnected or occluded joints that stem from the same poses. Inspired by single-pose estimation works, we implement an elementary method for every absent point of a pose $Z$: we select the point at which the corresponding disconnected confidence map has the highest confidence score. If this point is already assigned to another pose, the search continues to the next available location with the highest score; the selected point is then added to pose $Z$. If a redundant keypoint does not belong to any pose in the image, this particular keypoint is set to zero, as it may otherwise affect other poses.
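A small sketch of this joint completion step, with assumed data structures (a pose as a list of joint locations or None, plus a shared set of locations already assigned to other poses):

```python
# Fill each missing joint of a pose from the highest-scoring free location
# of its confidence map.
import numpy as np

def complete_pose(pose, confidence_maps, assigned):
    """pose: list of (x, y) or None per joint; confidence_maps: (J, h, w) array;
    assigned: set of (joint_id, x, y) already claimed by other poses."""
    for j, joint in enumerate(pose):
        if joint is not None:
            continue
        # Candidate locations in decreasing order of confidence.
        flat_order = np.argsort(confidence_maps[j], axis=None)[::-1]
        for idx in flat_order:
            y, x = np.unravel_index(idx, confidence_maps[j].shape)
            if (j, x, y) not in assigned:
                pose[j] = (x, y)
                assigned.add((j, x, y))
                break
    return pose
```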

IV. IMPLEMENTATION
For the proposed model implementation, we initialize training by giving an RGB image as input to VGG-19 to produce the feature $F$, which is the input of the mask generation stage. The convolved feature maps $F^*$ (described in the network architecture section) are the input of the first sub-stage of the pose prediction stage, where the network predicts heatmaps (confidence maps) for each person keypoint location together with their PAFs. The input image is resized by the ratio $\tau = \min(m / w, m / h)$. The resized image is placed as an RGB image of size $m \times m$ from the upper-left corner, and the remaining hollow part is filled with the gray value (128, 128, 128). Similarly, for the human mask $M$, the only difference is that the hollow area is filled with zeros.
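This preprocessing can be sketched as follows, assuming OpenCV-style arrays; the input size m = 368 is an assumption, as the value of m is not specified here.

```python
# Resize by tau = min(m/w, m/h), paste at the upper-left corner of an m x m canvas,
# pad the image with gray (128, 128, 128) and the mask with zeros.
import numpy as np
import cv2

def resize_and_pad(image, mask, m=368):
    h, w = image.shape[:2]
    tau = min(m / w, m / h)
    new_w, new_h = int(round(w * tau)), int(round(h * tau))

    img_canvas = np.full((m, m, 3), 128, dtype=np.uint8)
    img_canvas[:new_h, :new_w] = cv2.resize(image, (new_w, new_h))

    mask_canvas = np.zeros((m, m), dtype=np.uint8)
    mask_canvas[:new_h, :new_w] = cv2.resize(mask, (new_w, new_h),
                                             interpolation=cv2.INTER_NEAREST)
    return img_canvas, mask_canvas
```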
The proposed method is trained in Keras with TensorFlow on a laptop with a single GTX 1060 6GB GPU. We set the Gaussian variance $\sigma = (1.5, 1.5)^T$ for the heatmaps over the $K$ keypoints (confidence maps). The initial learning rate is $4 \times 10^{-5}$, and the momentum is set to 0.9. The overall training loss is the weighted sum of the mask and pose prediction stage losses. We find that six sub-stages are a good balance for predicting human keypoint locations effectively. We also observed a large magnitude difference between the human mask and keypoint heatmap losses; considering this variance, we fix the weights for the mask and the heatmaps to [0.05, 1], respectively. We trained our model on the COCO dataset for 100,000 iterations, with a batch size of 64 and 16 subdivisions, and a stride of 8 for both stages, since there is no further pooling in the second stage. The results in Fig. 5 show that the mask prediction can significantly improve the overall accuracy of the system.
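For reference, a compact, self-contained sketch of this training configuration in Keras; the model here is a dummy stand-in, and the output names ("mask", "pose") are placeholders for the combined two-stage network's outputs.

```python
# Training configuration described above: SGD with momentum 0.9, initial lr 4e-5,
# loss weights 0.05 (mask) and 1.0 (heatmaps/PAFs).
import tensorflow as tf
from tensorflow.keras import layers, Model

# Dummy stand-in for the combined two-stage network.
inp = layers.Input(shape=(368, 368, 3))
x = layers.Conv2D(8, 3, padding="same", activation="relu")(inp)
mask_out = layers.Conv2D(2, 1, activation="softmax", name="mask")(x)
pose_out = layers.Conv2D(18, 1, name="pose")(x)
model = Model(inp, [mask_out, pose_out])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=4e-5, momentum=0.9),
    loss={"mask": "categorical_crossentropy", "pose": "mse"},
    loss_weights={"mask": 0.05, "pose": 1.0},
)
```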

V. EXPERIMENTS
We perform many experiments to train and evaluate the proposed method. The mask prediction stage is evaluated with the COCO object detection evaluation metric, which is publicly available. The second stage of our network, pose estimation, is evaluated with the COCO keypoint evaluation metric. The experimental results show strong performance, particularly in terms of accuracy.

A. MAIN RESULTS
Both stages of our proposed network are trained and tested on the COCO dataset [35]. The COCO dataset offers 80 different classes for image segmentation tasks, including various objects in the wild. We use only two classes for this experiment: background and person (1 + 1). For evaluation, the standard COCO metric uses AP over IoU thresholds, reported as AP50 and AP75, and as APS, APM, APL at different scales. As for keypoint detection, the train, validation, and test sets contain more than 200,000 images and 250,000 person instances labeled in the wild with keypoints, mostly from medium-size to large-scale body images. Annotations on the train and validation sets (with over 150,000 people and 1.7 million labeled keypoints) are publicly available. The COCO benchmark for keypoint evaluation uses Object Keypoint Similarity (OKS) to compute the AP over 10 OKS thresholds. Table 2 reports AP as AP, AP50, and AP75 in percentage for the mask stage evaluation; here we use the COCO object detection evaluation metric. For further details, please visit the COCO dataset website. Fig. 5 demonstrates qualitative results from our experiments. We report the quantitative evaluations one by one in the following.
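For clarity, a hedged NumPy sketch of the standard COCO OKS computation; the per-keypoint constants k and the use of the object segment area for s² follow the usual COCO definition, and this is the benchmark's metric rather than something introduced by the paper.

```python
# Object Keypoint Similarity (OKS), as defined by the COCO keypoint benchmark.
import numpy as np

def oks(pred, gt, visibility, k, area):
    """pred, gt: (K, 2) keypoint arrays; visibility: (K,) flags; k: (K,) per-keypoint
    constants; area: object segment area (used as s^2)."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    e = d2 / (2 * area * k ** 2 + np.finfo(float).eps)
    vis = visibility > 0
    return np.sum(np.exp(-e)[vis]) / max(np.sum(vis), 1)
```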
1) Human Mask Prediction: To evaluate the mask prediction efficiency of the proposed system, the second stage of the network (the FCN for human pose prediction) is unplugged, and only the first stage of the network is trained. Average Precision over mask IoU is used for the evaluation, as reported in [30], [39]. Table 2 shows the quantitative results with and without the pose prediction stage, while a few qualitative results are shown in Fig. 5, where one can visualize a noticeable difference: the proposed architecture enhances human mask prediction for the pose estimation stage when both stages are present in a single network architecture trained in an end-to-end manner.
2) Human Pose Estimation: We assess the human pose estimation stage with and without the mask generation stage. We first train the whole network shown in Fig. 3 to evaluate the network performance with the mask generation stage. Secondly, in another experiment, we drop the final two layers of the mask generation stage and omit the calculation of the loss in (1), to evaluate the network performance without the mask prediction. The human pose prediction configuration without the mask is similar to Realtime Multi-Person 2D Pose Estimation [5].
For the evaluation of pose estimation, we use Average Precision (AP) as described in the main results section. The quantitative results with and without the mask prediction are reported in Table 3. It is evident that, without the mask prediction stage, pose estimation performs poorly in the proposed architecture, and the overall accuracy is compromised. This finding indicates that the mask prediction stage enhances the performance of the pose estimation stage. The reason behind this phenomenon is that the predicted mask provides constraints for human keypoint detection. Furthermore, to evaluate the overall performance of our approach, we compare the proposed approach with known methods: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, also known as CMU [5], Mask R-CNN for human keypoint detection [7], and Convolutional Pose Machines [17]. Fig. 6 and Table 3 demonstrate that our approach attains better performance qualitatively and quantitatively. Table 3 includes the quantitative comparison results with the other methods. It shows that the proposed method is slightly better than Mask R-CNN [7] keypoint detection with the mask, while Mask R-CNN, when it comes to pose parsing, fails to connect the appropriate joints, as shown in Fig. 6(b). We can also observe in Fig. 7(a) that at AP50 with threshold OKS = 0.5, our approach is considerably better than Convolutional Pose Machines (CPM) [17].
Since the mask detection stage in our approach explicitly encodes the masks that contribute to the pose estimation stage for keypoint detection, the pose estimation for medium-scale and large-scale persons is greatly improved. However, for small-scale people, pose estimation is still a challenge. For Realtime Multi-Person 2D Pose Estimation [5], as shown in Fig. 6(c), the network becomes confused when two or more people overlap in an image. Additionally, unwanted keypoints that do not belong to the same human body can mis-propagate through poses, leading to redundant poses. In contrast, our approach is more stable in such scenarios. Compared with the work in [19] (Osokin's lightweight OpenPose on CPU and the 2-stage network retrained with all the refinement stages), our proposed method achieves a far better overall AP, as shown in Table 3. Table 4 reports that if we perform the NMS operation, the output results improve by 3.5% over those reported in [5]. At the last stage, before plotting the final pose, if we search for and remove redundant keypoints, pose estimation is further improved by 3.3%. Similarly, against Mask R-CNN and CPM in Table 3, the proposed approach achieves 2.1% and 2.5% overall improvements, respectively. Table 3 reports the quantitative results against CMU [5], Mask R-CNN for human keypoint detection [7], and Convolutional Pose Machines (CPM) [17].

VI. DISCUSSION AND FUTURE WORK
This paper proposes a novel method for pose estimation from simple color images in the wild. Our technique consists of two cascaded stages: the first stage predicts the mask and the second estimates the final person pose. We test the proposed approach on the MS-COCO dataset and report the results in this paper. MS-COCO provides annotations for mask segmentation as well as for person keypoints, which are the basic requirements of our method. The main idea of this work is that, in human pose estimation, explicitly encoded mask data serves as a critical feature in the structure of generative methods. Fig. 5 shows that, with back-propagation in an end-to-end training manner, person pose estimation also improves the quality of person mask generation. Compared with former top-down approaches, the proposed scheme is not prone to errors from bounding-box shift or rigidity, and is therefore more robust. In addition, it does not struggle to determine which joint belongs to which person, because the added mask prediction technique handles the overlapping-people problem more efficiently in crowded images. We hope that the combination of the image segmentation technique and the bottom-up approach, which is comparatively less explored, will be investigated further for practical pose estimation.
Our proposed method has a few limitations: (a) we have not tested our system under a real-time scenario, and (b) the speed of our network is slow, especially when there is a large number of persons in a single image. The reason is that sequentially predicting masks for every single person in an image increases the overall computation time of the system. We use back-propagation, which calculates the gradient in the weight space of a feed-forward neural network with respect to the loss function. Other optimization heuristics, such as those in [40], [41], can be explored in the future to train the network for real-time performance. Perhaps a multi-stage design that cascades both the mask prediction stage and the human pose estimation stage can improve the speed. Furthermore, although we have greatly enhanced the mask prediction stage, as we can see in Fig. 5(c), there are still a few small errors when discarding the background of an image. Considering the dimensionality reduction and the bias error, an additional penalty (regularization) term, such as that suggested in [42], may be added to the loss function to improve the system optimization. A better design of the mask encoding stage can improve the overall accuracy of the network.
EMAD ABOUEL NASR received the Ph.D. degree in industrial engineering from the University of Houston, Houston, TX, USA, in 2005. He is currently a Professor with the Industrial Engineering Department, College of Engineering, King Saud University, Saudi Arabia, and an Associate Professor with the Mechanical Engineering Department, Faculty of Engineering, Helwan University, Egypt. His current research interests include CAD, CAM, rapid prototyping, advanced manufacturing systems, and collaborative engineering.
HAITHAM A. MAHMOUD (Senior Member, IEEE) received the Ph.D. degree in industrial engineering from Helwan University, Egypt, in 2012. He worked as an Engineering Consultant for several industrial organizations in Egypt. He is currently an Assistant Professor with the Department of Industrial Engineering, College of Engineering, King Saud University, Riyadh, Saudi Arabia, and the Mechanical Engineering Department, Faculty of Engineering, Helwan University. His current research interests include optimization modeling, theory, and algorithm design with applications in waste management and energy management, financial engineering, and big data.