Improvement of Human Pose Estimation and Processing With the Intensive Feature Consistency Network

The modeling of human body key-points is the most significant aspect of pose estimation. Computer vision algorithms identify human pose, body movement, and action in many ways. Most previous works pursued either accuracy or efficiency in terms of speed; however, many techniques suffer from intensive computational demands when low latency or high processing speed is required. We have designed a unique approach for single-person pose estimation and action recognition that is well suited to fitness applications and mobility activities. The proposed framework has been developed with a base network that provides an initial pose for further refinement through the Intensive Feature Consistency (IFC) network. The IFC network enforces high-level constraints on the global body intensity correction and local body part adjustments. The proposed module reduces the impact of body joint movement diversity by interpreting a long-term consistent view. We have illustrated the effectiveness of the proposed framework through pose estimation accuracy improvement on two benchmark datasets, which demonstrates the state-of-the-art performance of the IFC network at the required real-time processing speed on a CPU platform. The IFC network achieves 99.1% PCK-body and 94.7% PCK-torso accuracy at 31 FPS, which is higher than existing work.


I. INTRODUCTION
A wide range of artificial intelligence applications that use human pose estimation, such as action recognition, sports performance analysis, human-computer interaction, and augmented reality, have made it an active research field [1] in recent years. Human pose estimation and tracking in video frames are much more difficult because of the extensive variation of human body movement, the numerous degrees of freedom, and body joint occlusions. Intensive and useful features contribute to long-term consistency of the human pose, and the coherence of intensive feature consistency makes human pose estimation smoother across video frames. A survey [2] shows that computer vision applications have improved human pose estimation results significantly in the field of AI applications. The pose estimation network modifies joint motion offsets by ensuring feature consistency across the frames, from the previous frame to the next. A single-person pose estimation model creates heatmaps with basic information about each articulation for the body joint coordinates [3], generated by utilizing heatmap choice scales with minimal overhead. Heatmap techniques and regression algorithms are less computational and more scalable; those methods are stable for predicting the mean values of joint coordinates but frequently fall into underlying uncertainty. A Hybrid-Pose model [4] refined the joint locations and improved the human pose with body parts. We have addressed these issues in this study by adopting the BlazePose [5] model for single-person pose identification and extending it with the intensive feature consistency (IFC) network architecture.

(The associate editor coordinating the review of this manuscript and approving it for publication was Orazio Gambino. VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
Besides, we have considered a discriminative function to predict accurate and stable joint coordinates, which improves human pose estimation accuracy under real-time inference acceleration. The pose estimation technique is employed through detector- and tracker-based applications. It performs very well at real-time speed on tasks like MediaPipe Hands: hand landmark prediction [6] and BlazeFace: face landmark prediction [7]. A compact body pose detector precedes the pose tracker in the MediaPipe network. The pose tracker predicts 33 joint coordinates of a person in the frame; the current frame refines the body joint coordinates at the region of interest (RoI). The RoI is calculated from the posture detection using the detected body landmarks.
MediaPipe is a framework of machine learning (ML) solutions for performing efficient inference over arbitrary sensory data. It is utilized to identify and track the positions of 33 skeleton points of body joint landmarks from RGB inputs at real-time processing speed. Recently, many researchers have utilized this tool in their active research [8], [9], [10]. The pipeline initially locates the region of interest (ROI) inside the frame using a detector. The tracker uses the ROI-cropped frame as input to estimate the body joints and pose markers. The MediaPipe graph recognizes posture landmarks and skips previously identified landmarks as needed for new pictures. Intensive features with long-term consistency help to localize landmarks in the current and next frame. If the detector already recognized the pose landmarks in the previous frame, they are considered as input for the next frame to start a new cycle of pose detection; the ROI is then calculated from the detection or the previously detected landmarks. If the tracker does not find the object in the current frame, the detector restarts its search. In the video domain, successive frames have a high degree of regular consistency, allowing compact networks to conduct effective pose prediction when the intensive features are transmitted between frames. Therefore, the proposed technique is graphically more precise and improves human pose estimation accuracy.
The primary process of human pose estimation is body joint key-point localization and grouping those joints into valid posture configurations. The first stage focuses on locating each body joint of a person, such as the head, shoulder, arm, hand, knee, and ankle, as shown in Fig. 1. Our technique extracts intensive features for long-term consistency to detect accurate and stable human pose results. The following list represents the significant contributions of our proposed work to mitigate the impacts of perspective variability on human pose estimation performance.
❖ We have designed an efficient IFC framework to properly address intensive feature information from stable body joint key-points and their coordinate visibility for single-person pose estimation improvement.
❖ The intensive feature learning method has been divided into two sections, global body intensive feature extraction and local body part adjustment, for further optimization.
❖ We have designed a discriminative module to refine the 33 skeleton joints for body and torso configurations, which alleviates the effects of body joint coordinate diversity in the IFC network for estimating stable offsets on the frames of pose appearance.
❖ The experimental results show state-of-the-art performance in human pose estimation in the video domain at real-time processing speed, measured in frames per second (FPS).
The remainder of this paper is organized as follows. In Section II, we review the related work as background knowledge. In Section III, we present the details of the proposed IFC network. Data analysis and experimental results are presented in Section IV. The discussion is clarified in Section V. Finally, the conclusion is drawn in Section VI.

II. RELATED WORK
The most common and successful classical methods for articulated human pose estimation were pictorial structures [11], [12], [13], [14]. The spatial correlation of body parts is represented by a tree-structured graphical model, which works well when the limbs are visible in the frames. However, the tree structure fails to express the pose properly when the variables of the limbs' connections are invisible between frames. To boost performance, deep learning-based algorithms used convolutional neural networks; DeepPose [15] is one of the first human pose estimation methods for detecting human body key-points. Later, many approaches typically generated heatmaps that describe the probability of each key-point at various places. Many researchers experimented with deep convolutional neural network-based regression techniques, such as regressing joint coordinates or regressing joint heatmaps [16], [17], [18], [19], [20]. In addition, deep learning algorithms predict poses from input images, videos, and live events. Although many researchers achieved good results, physical exercise and mobility activities still have limitations.
Most previous pose estimation methods were trained and evaluated on pose estimation datasets such as Human3.6M [21], which contains simple postures like sitting, standing, and walking. 2D pose estimators such as the Stacked Hourglass Network [22] achieved good performance. The sequence-to-sequence regression model [23] used a recurrent neural network to leverage temporal information for 3D human pose prediction. Pavllo et al. [24] employed temporal convolutions to collect long-term feature information. A deep convolutional neural network [25] achieved notable performance on diverse images by combining low-level input features and higher-level weak spatial features to utilize various kinds of feature information.
The human pose estimation network with a new feature learning method [26], based on a dual-source structure corresponding to partial feature patches and body patches, performs pose estimation. The feature appearance of the local body part element in the frame combines the local and global features to obtain more reliable estimation data. The Part Affinity Fields (PAFs) [27] network records the location and orientation of a specific limb at each point in the picture domain, within a sequential pose prediction framework that allows global feature information to update the approximations of body parts and their relationships. These fields are learned and forecasted together with confidence maps for each body joint.
Video pose estimation attracts less attention due to the scarcity of consistent unique features across video frames. Previous researchers focused on obtaining temporal information, such as optical flow [28], [29], [30], [31], to support the refinement of frame-wise results produced by massive networks, although there have been efforts to incorporate spatial feature information from the video to produce robust predictions with less noise. The sequence-to-sequence learning method [32] encoded a 2D video pose sequence into a fixed-size vector and then decoded it into 3D poses. The objective of that network is to enhance long-term feature consistency by tracking pose estimation feature information throughout the training period. The long-term consistency feature is used either in a direct optimization technique or in a short-term consistency propagation method. The optimization method enhances long-term feature consistency by optimizing intensive features across the video frames.
The human pose estimation process joins each frame to choose optimal performance while maintaining feature coherence across video frames. Alternative methods, investigated by Andriluka et al. [33] and Ramakrishna et al. [34], treat each body joint key-point individually in the frame to obtain the best feature at the initial steps. Those networks are employed with object detection and tracking methods for extracting consistent frames. Pfister et al. [35] proposed a constructive network with receptive fields that captured global feature information with body relationships and implicit body joints. This technique required the entire video stream for temporal and spatial feature information.
The short-term consistency model focuses on neighboring frames, while long-term consistency is obtained by propagating short-term features. By integrating short-term coherence and spreading it over a long period, Chen et al. [36] created an efficient video pose transmission network that assures feature consistency over the long term. Ruder et al. [37] proposed a single-image style transfer system for video series; they added a temporal constraint that penalized departures between two frames while relying on the optical flow of the original video frames. Pavlakos et al. [38], for example, used a coarse-to-fine supervision system and a volumetric model to estimate 3D human pose from a single image. Other methods, for instance Martinez et al. [39] and Chen et al. [40], employed convolutional neural networks to predict 3D pose estimation results.
Considering the above discussion, this research aims to identify and classify the human body joints through intensive features with long-term consistency and to create pose estimation for real-time applications. We focused on capturing each joint coordinate (body and torso), which is called a key-point. The connection between two points is recognized as a pair in single-person pose estimation, which provides a foundation for others interested in studying long-term intensive feature consistency for video pose estimation and tracking challenges. This pose estimation method allows for the non-invasive collection of pose data from RGB images, videos, and live webcam streams, which is more practical than using observation, sensors, or depth cameras. The pose actions of physical exercise such as yoga, dance, and fitness activities remain an unsolved subject in previous work. We have designed an intensive feature consistency network to alleviate those issues by refining the global body correction and local body joint part adjustment. The proposed tracking mechanism continuously trails the body joint paths to estimate from low- to high-resolution features.

III. PROPOSED POSE ESTIMATION METHOD
This section introduces the overall framework and working procedure of the proposed intensive feature consistency (IFC) network in Fig. 2, which refines the 33-skeleton body pose landmarks for single-person pose estimation and improvement. Specifically, subsection A illustrates the flow chart of the pose estimation pipeline using MediaPipe machine learning tools. Subsection B describes the proposed framework and its architecture comprehensively. Moreover, subsection C explains the intensive feature consistency network and its working process. Subsection D depicts the loss function of the proposed IFC network. Besides, subsection E exhibits the single-person pose landmarks and their coordinate distribution system. Subsection F demonstrates the forward propagation and subsection G presents the backpropagation process of the proposed method.

FIGURE 2. Proposed framework for the human pose refinement and improvement process. The given inputs are resized by the convolution and pooling layers; the base network predicts an initial pose, which is refined through the intensive feature consistency network and discriminator.

A. POSE ESTIMATION PIPELINE
MediaPipe is a machine learning tool that utilizes a high-resolution body posture tracking system, which infers pose landmarks and background segmentation masks for the body pose in video frames. The MediaPipe solution employs a two-step detector and tracker in the frame. The detector identifies the object's region of interest (ROI) in the frame. The ROI crops the object area from the input frame; then the tracker predicts the posture landmarks and segmentation mask inside the ROI for a single person located within about 4 meters of camera distance. Figure 3 represents the flow chart of the human body key-point localization, detection, tracking, and pose estimation approach. At first, the input window has a detector which distinguishes the object. The ROI setup locates the object position through the detection window. In the second step, the proposed work refines the detected 33 key-points, which extend the 17-key-point COCO topology, from the current frame. These additional key-points provide vital information about the face, hand, and foot locations with scale and rotation. The body parts are categorized as global body key-points (body) and local body parts (torso). A tracker identifies whether the object is still available in the frame. The next frame is used by activity monitoring algorithms for fitness mobility applications. If the input video has not ended, the process repeats for the following frame; otherwise, the process ends and shows the result.
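The detect-once/track-thereafter loop described above can be sketched as follows. This is a minimal illustration, not the MediaPipe implementation: `detect_pose_roi` and `track_landmarks` are hypothetical stand-ins for the detector and tracker, and the ROI is taken as the bare bounding box of the previous frame's landmarks (padding, rotation, and scale handling omitted).

```python
def roi_from_landmarks(landmarks):
    """Bounding box of the previously tracked landmarks, reused as the
    next frame's region of interest (padding omitted for brevity)."""
    xs = [x for x, y in landmarks]
    ys = [y for x, y in landmarks]
    return (min(xs), min(ys), max(xs), max(ys))

def run_pipeline(frames, detect_pose_roi, track_landmarks):
    """Detect-once/track-thereafter loop: the full detector runs only
    when no landmarks are carried over from the previous frame."""
    results, prev = [], None
    for frame in frames:
        # Run the detector only when there is no pose to carry over.
        roi = detect_pose_roi(frame) if prev is None else roi_from_landmarks(prev)
        landmarks = track_landmarks(frame, roi)  # returns None when the track is lost
        results.append(landmarks)
        prev = landmarks                         # losing the track forces re-detection
    return results
```

When the tracker loses the person, `prev` becomes `None` and the next iteration falls back to the detector, matching the restart behavior described in the text.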

B. PROPOSED FRAMEWORK
The proposed framework has been designed with several modules: input sampling convolution layers, a baseline network, the intensive feature consistency network, a pose refiner, and a discriminator. The initial module scales the input frame through the convolution and pooling layers. The IFC network extracts intensive features for long-term consistency and confidence maps to refine the pose landmarks through the pose detection discriminator. The proposed intensive feature consistency network refines the 33 body joint key-points predicted by the base network.
The proposed framework aims to detect accurate and stable 33-skeleton key-points of body joint coordinates for better-quality human pose estimation. The base network predicts an initial pose, defined as P_f^i, which is transferred for further processing and refinement through the intensive feature consistency network. This generates the refined poses P_IFC through the global body intensity extraction and local joint part adjustment. Moreover, we enhance the high-level intensive feature extraction for long-term consistency over the body configuration. The intensive feature consistency network and the pose estimator are jointly trained with the pose estimation generator to distinguish the ground-truth pose P_f^gt and the refined pose P_IFC. The intensive-feature-based discriminator is trained by adversarial learning. The discriminator compares the refined pose P_IFC and the ground-truth pose P_f^gt, which decides whether to proceed with forward or backward propagation. The fully connected layer connects the pose landmarks and the forward propagation landmarks. The estimated real pose P_R is transferred to the pose landmarks as the final pose.
Equation (1) stands for the input of the network, where P_n is the single input object and h, w, c is the real size of the object. Equation (2) represents the input frames of the network, where f is the successive frame and n is the number of successive frames. For each frame, the baseline network's pose heatmap (Ĥ) is shown in equation (3), whereas P_f^n illustrates the baseline pose, which is considered as the ground-truth pose G = P_f^gt for comparison with the IFC network's refined pose P_IFC. The ground-truth heatmap of each key-point is calculated through equation (4), where s is the stride of the convolution and k is the number of body joint key-points of the object.
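The ground-truth heatmap of equation (4) is typically a Gaussian peak centered at each key-point, evaluated on the stride-reduced output grid. The sketch below assumes that standard form; the standard deviation `sigma` is a hyperparameter not fixed in the text.

```python
import math

def gt_heatmap(keypoint, out_h, out_w, stride, sigma=2.0):
    """Ground-truth heatmap for one key-point: a Gaussian centered at
    the key-point location mapped onto the stride-s output grid."""
    kx, ky = keypoint[0] / stride, keypoint[1] / stride
    return [[math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2 * sigma ** 2))
             for x in range(out_w)] for y in range(out_h)]
```

The heatmap peaks at 1.0 exactly on the key-point's grid cell and decays smoothly with distance, which is what allows the baseline network to regress soft joint locations.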

C. INTENSIVE FEATURE CONSISTENCY NETWORK
We have developed an intensive feature consistency (IFC) network through global body intensity extraction, local body joint adjustment, and confidence value and visibility acknowledgement, as shown in Figure 4. The IFC network refines the baseline pose body key-points by eliminating the influence of the flexibility and diversity of the base network's body configuration. In addition, to enhance the IFC network's competence over the body configuration, a significant refiner has been incorporated under the consistent views as a high-level constraint, which effectively distinguishes the reliable and scalable global body joint key-points and local parts accurately. The proposed IFC network refines the intensive feature consistency and the estimated poses with the global and local intensive feature transformations. The intensive feature long-term consistency discriminator is used to enhance the performance of the proposed IFC network. The proposed IFC network enforces advanced constraints on the intensive feature extraction during the training period.
Generally, the human pose estimation mapping function is defined as P ∈ R^(3×k), formulated in equation (5) to refine the human pose in the IFC network, where α denotes the confidence parameters of the module function and k signifies the number of body joint key-points of the object. The objective of this module is to estimate each body joint for the global body (body) and local part (torso) as close to the ground truth as possible.

1) GLOBAL BODY INTENSITY ADJUSTMENT
We utilized the instance normalization [41] layer to normalize the base network pose for intensive feature abstraction. We focused on the global body parts to determine the exact body joint locations and achieved intensive feature consistency for the body configuration before and after transformation, as shown in Figure 5. The global body intensity module tracks and turns the body joints into a continuous interpretation by examining the object position in the frame. We considered the global positions of three body joints for each calculation, which are traced by smoother-following [42], [43] methods from their nearest neighboring vertices. Furthermore, we obtain a new coordinate frame where the element vector V_E is taken as the z-axis, V_S as the x-axis, and the transformed vector V_T = V_E × V_S = [T_x, T_y, T_z]^T as the y-axis. The transformation module of the global body pose is sent for further refinement through the connected layer. The intensity element outputs a feature map k ∈ R^(c×l×d) corresponding to the global body appearance, where c stands for the channel number, l represents the length, and d the distance, respectively.
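The coordinate-frame construction above (V_E as the z-axis, V_S as the x-axis, and V_T = V_E × V_S as the y-axis) can be sketched as follows; for the third axis to be orthogonal to the other two, the product must be the vector cross product.

```python
def cross(a, b):
    """Vector cross product a x b in R^3."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def body_frame(v_e, v_s):
    """Global-body coordinate frame: V_E -> z-axis, V_S -> x-axis,
    and V_T = V_E x V_S -> y-axis."""
    return {"x": v_s, "y": cross(v_e, v_s), "z": v_e}
```

For unit, mutually perpendicular V_E and V_S this yields a right-handed orthonormal frame; in practice the two input vectors would be normalized (and re-orthogonalized) first, a step omitted here for brevity.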

2) LOCAL JOINT PART ADJUSTMENT
Although the global intensity adjustment minimizes the key-point diversity, some local body sections exhibit substantial variation. After the global body transformation, some flexible body joints, such as the wrists and arms, still maintain a dispersed distribution. We relied on structural [44], [45] connections describing the human pose to distinguish the information between distinct local body parts. There are two types of human structural relationships: kinematic [46] relations define chain connections like nose to neck or left wrist to left elbow, while symmetrical relations focus on the bilateral symmetry of key-points, like left wrist to right wrist or left ankle to right ankle. These create feature uniformity by significantly reducing the key-point variability of each body component. We investigated five body parts for local joint adjustments: the left and right arms, the left and right legs, and the chest-thorax-jaw-head jointing region. As a result, after the local body joint configuration, the assessments are much more consistent, and thus the location distribution for each joint is more concentrated, as shown in Figure 6.
We have considered two joint points for each k-th body portion from the element vector E_k. The local joint adjustment module then transforms the element vector E_k to the x-axis and the standard vector S_k of the arm component parallel to the z-axis. Two joints, such as the shoulder and elbow, work together to achieve this effect. The element vectors (E_k) and standard vectors (S_k) generate the transformation parameters. The arm, comprising the shoulder and elbow joints, produces the vector V_k for the arm position. The upper part of the leg (hip and knee joints) is used as the sub-part to generate the E_k vector. The connection of the chest and thorax joints is taken as the sub-component to produce the E_k vector for the chest-thorax-jaw-head joint chain portion.
For each landmark, the Z coordinate determines the depth value, while the X and Y coordinates convey the equivalent image pixels. The z-axis runs perpendicularly between the person's hips and the head, as well as the camera focus point. The origin of the z-axis is around the center point of the hips, which concerns the camera, left-right, and front-back motion configuration. A negative Z value points toward the camera position, whereas a positive Z value points away from it. There are no upper or lower bounds for the Z coordinate value estimation.

3) CONFIDENCE VALUE CONFIGURATION
The proposed IFC network module interprets the input video streams when the static image mode is set to false. The confidence detector identifies the confidence value in the first frame and recognizes the human body key-points (body and torso) upon successful detection. It monitors these landmarks in the successive frames without invoking another detection until the object path is lost, which reduces the computational complexity and decreases the time delay. If the static image mode is set to 'true,' then detection runs on every input, which is costly for processing a batch of static images. The confidence value ranges from 0.0 to 1.0 for a successful pose detection and improvement. In our feature confidence configuration setup, the lowest value is 0.2 and the maximum value is 1.0 for successful pose tracking; otherwise, detection is invoked automatically in the next frame. A higher confidence value makes the performance more robust at the cost of higher latency.
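The confidence gating above reduces to a simple threshold check, sketched below. The 0.2 tracking threshold comes from the configuration described in the text; the function name itself is illustrative.

```python
def should_redetect(static_image_mode, tracking_confidence,
                    min_tracking_confidence=0.2):
    """Decide whether the detector must run again on the next input.
    In static-image mode every input is detected; in video mode the
    detector is re-invoked only when the tracking confidence drops
    below the threshold (0.2 in our configuration)."""
    if static_image_mode:
        return True
    return tracking_confidence < min_tracking_confidence
```

Skipping the detector whenever tracking stays above the threshold is what keeps the per-frame cost low enough for real-time CPU processing.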

4) INTENSIVE FEATURE REFINEMENT
We take the patch map of each key-point heatmap in the frame with the peak value and enlarge it to the size of l × d, which is multiplied with the matching intensity at each key-point. After that, we acquire posture feature maps P_IFC ∈ R^(c×l×d), which correspond to a person's key-point pattern, such as the head, left shoulder, right shoulder, and so on. After vectorization, each key-point area has an intensity feature vector V_f ∈ R^(I_v). The lower-level posture information is fused in the common layers into the similarity representation module without additional procedures. The module includes higher pose features in the key-point feature maps. The label direction graph considers each joint region as a vertex and the structural relationships between key-points as the edges. Each dimensional vector reflects the characteristics of each key-point area, and we accumulate the pose feature vectors V ∈ R^(I_v). Each vertex encodes intensive feature information from its neighboring vertices through a directional label in the IFC network. The refinement sub-network of each body joint adjusts the global body and local joints to make them further consistent. After the refinement, the five local joints are inversely converted back based on the transform parameters and joined to obtain a full-body posture following the local body part rectification. Similarly, the full-body pose is inversely transformed back to the original perspective depending on the global transform parameters to get the final position. Our technique is more precise and stable for extensive feature consistency maintenance; this method overcomes uncertain vital faults and helps achieve better performance.
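The first step above, locating each key-point as the peak of its heatmap before extracting the surrounding patch, can be sketched as a simple argmax over the heatmap grid:

```python
def heatmap_peak(heatmap):
    """Return the (x, y) grid coordinate of a key-point heatmap's
    maximum response, i.e. the key-point location estimate."""
    best, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            if v > best:
                best, best_xy = v, (x, y)
    return best_xy
```

In a real implementation this would be a vectorized argmax, possibly with sub-pixel refinement; the loop form is kept here only to make the peak selection explicit.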

D. LOSS FUNCTION
The loss function is determined using the final posture heatmaps across the frames. The calculation is driven by both sets of joint heatmaps since the beginning appears twice in the feed-forward loops. The standard deviation of each joint position is measured to compute the weight of each joint position. The mean-square error (MSE) is calculated from the difference between each frame's output and the ground-truth value. We considered the mean-square error so that the weights of the joints would spread more evenly. The probability (P_b) of each body joint key-point coordinate is denoted along the x, y, and z axes. The OpenPose [27] estimation reliably predicts p_b(x|I). We focused on p_b(z|y|x) in this study, which implies predicting visibility using the continuous variables of the x, y, and z axes. The probability P_b(x, y, z, I) of the joint locations based on the RGB input is

P_b(x, y, z, I) = p_b(x|y|z) · p_b(x|I) · p_b(y|I) · p_b(z|I) · p_b(I)  (6)

The weights of the body joint key-points W_k are calculated to normalize the standard deviation, as shown in (7), by denoting the weight vector of the joints as W_k = [w_k | k ∈ K]. Equation (8) determines the training loss for the data feature, equation (9) calculates the error, and equation (10) estimates the loss function. Ê_nk − E_nk denotes the error of the key-point position: the distance between the estimated location and the ground-truth location.
The frame loss, the training loss L_t (averaged over the 33 key-points), and the total loss are computed according to equations (8)-(10).
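A minimal sketch of the weighted mean-square error described above, assuming each joint's weight W_k is derived from its position's standard deviation (the exact normalization in equation (7) is not spelled out in the text, so the weights are passed in as given):

```python
def weighted_mse(pred, gt, weights):
    """Per-joint weighted mean-square error between the estimated
    key-point positions and the ground truth, averaged over joints."""
    n = len(pred)
    total = 0.0
    for (px, py), (gx, gy), w in zip(pred, gt, weights):
        total += w * ((px - gx) ** 2 + (py - gy) ** 2)
    return total / n
```

Weighting by (inverse) positional spread keeps highly variable joints such as the wrists from dominating the loss over stable joints such as the hips.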

F. PROPOSED FORWARD PROPAGATION PROCESS
To generate the output of the network, the input data are fed only in the forward direction; the input data must not flow in the reverse direction during output generation, otherwise a cycle would form and the output would never be generated. This network configuration and working procedure is also known as a feed-forward network, which enables forward propagation. Forward propagation follows two essential steps: summing the products and sending the sum to the activation function. The hidden layers of the proposed network accept data from the input layer, process them through the activation function, and pass them to the output layer or the successive layers. Initially, the network multiplies the weight vectors with the input vectors to get the product and runs it through the activation function. The process continues until the last layer is reached and the final decision is made. Each layer receives the sum of the weighted inputs to ensure the output choice. This method continues until the activation function of the output layer is attained. During the forward propagation of each node in the hidden layers and the output layer, the pre-activation and activation functions take place. For instance, for the first node of the hidden layer, the pre-activation function is calculated first and then the final activation function is calculated by equation (11).
Here, X_fp represents the initial input, b_1 represents the bias, W_fp represents the weight, and U_fp is the sum of products after applying the activation function. Every layer has the LeakyReLU activation function. We predicted the cost value and calculated the training loss, the error, and the total loss of the function.
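The pre-activation/activation step of equation (11) can be sketched as follows, with LeakyReLU as the activation at every layer; the 0.01 negative slope is a common default rather than a value stated in the text.

```python
def leaky_relu(u, slope=0.01):
    """LeakyReLU activation: identity for positive inputs, a small
    linear slope for negative inputs."""
    return u if u > 0 else slope * u

def forward_node(x, w, b):
    """One node's forward pass: pre-activation U = sum(w_i * x_i) + b,
    followed by the LeakyReLU activation."""
    u = sum(wi * xi for wi, xi in zip(w, x)) + b
    return leaky_relu(u)
```

Stacking this node computation layer by layer, each layer's activations becoming the next layer's inputs, yields the full feed-forward pass described above.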

G. PROPOSED BACK PROPAGATION PROCESS
The core part of neural network training is the back-propagation process. It is essential for fine-tuning the weights depending on the preceding epoch's error rate. We need to determine how to update the weight values of the network so that the cost function improves. In addition, any given path from an input to the output neuron is essentially just a composition of functions, so partial derivatives and the chain rule can define the relationship between a given weight and the cost function. Weight tweaking reduces the error rate and increases the model's generalization, making it more dependable. Backpropagation computes the gradient of the loss function with respect to the weights and biases of the network. We used the delta for the cost calculation; it starts at the output and works its way back through the layers using partial derivatives. After obtaining the delta of the cost at the end layer, the algorithm creates a new weight. We repeat the above steps to minimize the total error of our proposed IFC network. The algorithm continues to the previous hidden layer to compute the delta value and new weight value after the last layer. This process employs the chain rule to calculate the delta value and multiplies it by the learning rate in each layer.
We calculated the individual gradients of the loss function using the chain rule during the back-propagation process. It is essential to pay particular attention to acquiring the gradient values of the convolutional parameters. Finally, the human pose is calculated through equation (13).
Here, E_P stands for the estimated pose from the IFC network and G_T represents the ground-truth pose from the base network.
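The delta-and-chain-rule weight update described above can be sketched for a single linear layer. This is a generic gradient-descent step, not the IFC network's exact update; the learning rate value is illustrative.

```python
def backprop_step(w, x, delta, lr=0.1):
    """One gradient step for a linear node: by the chain rule the
    gradient of the loss w.r.t. each weight is dL/dw_i = delta * x_i,
    so every weight moves against its gradient scaled by the
    learning rate."""
    return [wi - lr * delta * xi for wi, xi in zip(w, x)]
```

Repeating this step layer by layer, with each layer's delta computed from the next layer's delta via the chain rule, is the iteration that the text describes for minimizing the total error.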

IV. EXPERIMENTS AND RESULT ANALYSIS
The previous sections depicted the overall proposed network architecture. We have designed a deep convolutional intensive feature consistency network that is incorporated into the base network. We included instance normalization and strided convolution for global body intensity correction and local joint adjustment, which minimizes parameters and reduces the computational complexity of the framework while maintaining body key-point correction. Furthermore, we enforce body-configuration constraints to learn posture refinement under body joint movement diversity. To alleviate implausible results, we employed an adversarial learning discriminator to distinguish between the estimated pose and a forged pose.

A. DATASETS OF THIS WORK
We tested our work with several standard benchmark datasets, including the Penn Action dataset [47], the Sub-JHMDB dataset [48], and UCF101 [49]. The Penn Action dataset includes 2326 video sequences depicting 15 different actions, comprising 1258 clips for training and 1068 clips for testing. Each frame contains annotations of the key-point positions and visibility of the body joints, including the torso key-points and body key-points. Sub-JHMDB is another dataset for video-based posture estimation, a subset of JHMDB employed in our tests for a fair comparison with earlier work. The Sub-JHMDB dataset covers the whole human body with 316 video segments totaling 11,200 frames. The UCF101 dataset is a collection of videos in 101 action categories from YouTube, with a total of 13,320 videos exhibiting substantial differences in camera movement, object appearance and pose, object scale, viewpoint, background, illumination conditions, and so on. The UCF101 action videos are divided into 25 groups, each containing four to seven action videos. Videos from the same group share certain similarities, such as comparable settings and views. To evaluate the effectiveness of the proposed framework, we also employed webcam videos and a variety of short videos taken from YouTube, for example, street dance, workout, body motion, and indoor-outdoor activities for a single person.

B. TRAINING DETAILS
The proposed framework has been developed with the MediaPipe library on the Python platform using the PyCharm integrated development environment. The processor is a Core-i7-4790 CPU with a maximum clock speed of 3.6 GHz, and the graphics card is an NVIDIA GeForce GTX 1080 Ti. The installed RAM is 20,480 MB with 21,318 MB of graphics memory. The training procedure employs CUDA and cuDNN version 11.5 on a 64-bit Windows 10 operating system. The computation volume of the proposed framework is 4.4M FLOPs. The adaptive moment estimation (Adam) optimizer updates the weights. The Mean Absolute Error (MAE) between the previous and current key-point distances is measured to estimate the training loss. The initial learning rate was set to 0.001 for training and gradually decreased over the training period. Training runs until the loss function value is minimized and convergence is reached.
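As an illustrative sketch (not the actual training code, and written for a single scalar parameter rather than the full network), the Adam weight update and the MAE loss between previous- and current-frame key-point distances described above can be expressed as:

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update, with the paper's initial learning rate of 0.001.

    m, v : first and second moment estimates; t : 1-based step counter.
    Returns the updated parameter and moment estimates.
    """
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

def mae_loss(prev_kpts, curr_kpts):
    """Mean Absolute Error between previous- and current-frame coordinates."""
    return sum(abs(p - c) for p, c in zip(prev_kpts, curr_kpts)) / len(prev_kpts)
```

On the first step (t = 1) with a unit gradient, both bias-corrected moments equal 1.0, so the parameter moves by approximately one learning-rate unit.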

C. EVALUATION METRICS
The Percentage of Correct Key-points (PCK) [12] is a precision metric that determines whether a predicted key-point falls within a certain threshold distance of the original joint; the threshold is normally set relative to the scale of the bounding box. It is a standard metric for evaluating articulated human pose estimation, reporting the percentage of detections that fall within a normalized distance of the ground truth. We used PCK to evaluate the model accuracy and compared the result with other methods. In this paper, we also calculated the average accuracy evaluation metric, where A_cck refers to the recognition accuracy of the k-th category classifier. For this experiment, we considered PCK@0.2, in which the distance between the predicted and true joint must be less than 0.2 times the torso diameter. A detected joint is considered accurate if the distance between the predicted and ground-truth joint falls within this threshold.
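A minimal sketch of the PCK@0.2 computation described above, assuming 2D joint coordinates (the function name is illustrative, not from the paper):

```python
import math

def pck_at_02(pred, gt, torso_diameter):
    """Percentage of Correct Key-points at a 0.2 * torso-diameter threshold.

    pred, gt : lists of (x, y) joint coordinates in the same order.
    A joint counts as correct when its Euclidean distance to the
    ground-truth joint is within the threshold.
    """
    thresh = 0.2 * torso_diameter
    correct = sum(
        1 for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= thresh
    )
    return 100.0 * correct / len(gt)
```

For instance, with a torso diameter of 10 the threshold is 2 pixels, so a prediction 1 pixel from the ground truth counts as correct while one 10 pixels away does not.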

D. ABLATION STUDIES
The ablation studies for the baseline network and the proposed IFC network are discussed here. We investigated the accuracy, FPS, and FLOPs results of different versions of the baseline network and compared them with our proposed work. The baseline network has light and full versions, as explained in Table 1. The light version has a lower FLOPs value of only 2.7M and lower accuracy of 46.0%, 53.5%, and 53.8% for the sports, body-motion, and workout categories of the UCF101 dataset, but a higher FPS of 35. The full version of the baseline network has good accuracy of 62.6% for sports, 67.4% for body-motion, and 68.2% for workout compared to the light version, but a lower FPS of only 19 with a large 6.9M FLOPs value. Moreover, the proposed work achieved better accuracy (65.5% for sports, 70.5% for body-motion, and 80.5% for workout activities) and PCK@0.2 metrics (97.8 for sports, 98.1 for body-motion, and 97.4 for workout) on the UCF101 dataset at a real-time processing speed of 31 FPS. The proposed work also reduced the computational complexity by 2.5M FLOPs in comparison to the BlazePose full version. The intensive feature consistency mechanism achieved excellent performance through our proposed IFC network architecture. It can be remarked from Table 1 that, despite its weaker 4.4M FLOPs and 31 FPS in comparison to the BlazePose light version, the proposed IFC network achieved the best accuracy and PCK@0.2 metrics at the required real-time processing speed. The competence of the lightweight network in the video domain generated high-quality human poses, and the IFC network alleviated the issue of body movement diversity while enhancing the model accuracy across frames. The effectiveness of the IFC network was assessed largely in terms of object detection, tracking, and pose categorization. We use frame-wise evaluation to assess single-person pose estimation with global body correction and local part adjustment.
Table 2 shows the state-of-the-art performance of Pose for Action [50], Thin-Slicing Networks [51], LSTM Pose Machines [52], DKD Efficient Pose [53], PosePropagationNet [3], and our proposed work on the Penn Action benchmark dataset. Among these models, Pose for Action, designed on a VGG-16 backbone architecture, has the lowest evaluation results with only 81.1% for PCK-body and 92.6% for PCK-torso. The Thin-Slicing Networks and LSTM Pose Machines models are both developed following the CPM architecture. Thin-Slicing Networks has 96.5% accuracy for PCK-body, while LSTM Pose Machines achieves better outcomes of 97.7% on PCK-body and 92.6% on PCK-torso than Thin-Slicing Networks. DKD Efficient Pose, with ResNet-50 as its backbone architecture, has an average performance of 97.8% on PCK-body and 92.9% on PCK-torso. PosePropagationNet is designed with two different backbone architectures: ResNet-18 and MobileNet-V2. PosePropagationNet with MobileNet-V2 achieves good results of 98.5% for PCK-body and 93.8% for PCK-torso, while PosePropagationNet with ResNet-18 performs better, with 98.8% PCK-body and 94.2% PCK-torso, compared to the other four models. On the other hand, the proposed work, developed with BlazePose as the backbone architecture, achieved the best evaluation results of 99.1% and 94.7% on the PCK-body and PCK-torso metrics, as reported in Table 2. For real-time applications, the inference speed accelerates device performance, and the proposed work reached maximum accuracy at real-time processing speed. Therefore, the proposed architecture can be a good solution for single-person posture estimation in real-time applications, and it is particularly well suited for fitness and workout mobility activities.

E. RESULTS ANALYSIS
Fig. 7 shows the single-person body joint detection rate for specific key-points achieved by the proposed IFC network. Panel (a) expresses the shoulder detection rate, panel (b) the elbow detection rate, panel (c) the wrist detection rate, panel (d) the hip detection rate, panel (e) the knee detection rate, and panel (f) the ankle detection rate for the previous models and our proposed IFC network. In every respect, Pose for Action [50] has the lowest PCK-torso value. In comparison, some of the models have good mean PCK-body and PCK-torso detection values, but they are still lower than our proposed work on the Sub-JHMDB dataset. Although the Sub-JHMDB dataset contains small objects, our proposed work still achieves the best state-of-the-art performance among these models.
To evaluate the improvement in human pose estimation results with our proposed IFC network, we considered diverse input videos with different views and angles, as shown in Figure 10. Column (a) presents dancing video pose estimation results, column (b) displays pose performance on running and walking videos, column (c) shows pose estimation for workout and physical activities, and column (d) illustrates pose outputs for webcam live-stream videos. The proposed IFC system makes object identification, location, and tracking easier compared to the other methods.
The first column (a) demonstrates the pose estimation and improvement results for a single-person dancing video, where it can be observed that the IFC network detected the 33 body joint key-points properly by alleviating body movement diversity. The second column (b) shows the performance for running, walking, and standing pose estimation; the proposed IFC network accurately detects the human pose despite the various body movements displayed in the input. The third column (c) represents the physical exercise pose estimation outcomes for indoor and outdoor scenarios. We also applied an algorithm to calculate the angle of body key-points from the successfully estimated human body pose; this function works for any joint coordinates of the body. Furthermore, the last column (d) shows the pose estimation performance on live webcam video, where the proposed IFC network is capable of identifying, locating, and tracking the human body key-points appropriately. The proposed network estimated the human body pose correctly in many respects. Moreover, the proposed IFC network improved single-person pose estimation performance with higher accuracy and lower computational complexity at real-time inference speed.
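The joint-angle calculation mentioned for column (c) can be sketched as follows. The paper does not specify the exact algorithm, so this shows one common approach: the angle at a middle joint (e.g. the elbow) computed from three key-point coordinates via `atan2`:

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by key-points a-b-c.

    Example: the elbow angle from (shoulder, elbow, wrist) coordinates,
    each given as an (x, y) pair.  Works for any joint triple of the pose.
    """
    angle = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0])
        - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    angle = abs(angle)
    return 360.0 - angle if angle > 180.0 else angle  # fold into [0, 180]
```

A fully extended arm (three collinear key-points) yields 180 degrees, while a right-angle bend yields 90 degrees.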
Nevertheless, recognizing fine-grained human body joint coordinates is difficult, especially when dealing with small objects or with barely visible or invisible joints caused by issues such as shadows, overlapping, and occlusion of the object's position. When the camera view changes, the intensive feature consistency information also changes, and comparable key-points can become confused in the appearance of the discriminative model, so the approach may fail. IFC network pose estimation is now considerably more efficient because our proposed method reduces the AI model's size and computational complexity. These are the basic requirements for real-world application and implementation of human pose detection. It is a promising direction to pursue in the future for single-person posture estimation and its performance improvement.

V. DISCUSSION
Identifying object locations and tracking are usually complicated processes with traditional methods. When the camera view changes, the feature information becomes inappropriate, which confuses the appearance of the discriminative model and may render the process ineffective. Recognizing fine-grained joint coordinates is also a difficult process, especially for small inputs, overlapping images, and obscured objects.
Pose estimation techniques are now considerably more efficient because AI technology reduces the model's complexity and improves performance. We employed a deep neural network model using a regression approach to forecast the locations of the body key-points. The proposed framework accepts different input image sizes, videos, and webcam live streams to create X, Y, and Z coordinates for each body key-point. Furthermore, deep convolutional networks with a large amount of labeled data are used to train the model and evaluate performance. Adversarial learning is used to refine the body joint key-point positions with the IFC network. The PCK metric is considered for the accuracy evaluation of the human pose detection application. Increasing the number of training and testing samples may further improve the model's performance. However, the proposed method works for single-person pose estimation and improvement; future work may extend it to multi-person pose improvement.

VI. CONCLUSION
Human body movement and key-point configuration are flexible and unpredictable because of bending arms and legs. Arbitrary occlusions due to viewing angles, background contexts, and body key-point positions that differ relative to other objects make human body key-point detection and tracking more complicated.
We have designed a unique approach for estimating a single-person pose with higher accuracy, which exploits the long-term consistency of intensive features across frames. The proposed framework detects and tracks 33 body joint key-points, including the body, torso, arms, legs, and face, and is ideal for fitness applications and mobility activities. We utilized the PCK evaluation metric to assess performance during the training and testing operations. The proposed framework achieved state-of-the-art performance and obtained better results than most previous works.
The ability to extract intensive features for long-term consistency opens new options for future research in augmented reality, animation, gaming, robotics, and other fields. Autonomous driving is one example where this technology has already proven its worth.