Key Algorithm for Human Motion Recognition in Virtual Reality Video Sequences Based on Hidden Markov Model

This paper provides an in-depth discussion of human motion recognition in Virtual Reality (VR) video sequences based on hidden Markov models, organized into four steps: VR video acquisition and pre-processing, foreground detection, extraction of human feature parameters, and hidden-Markov-model-based human motion recognition. A mixture-of-Gaussians model was used to build a background model in real time based on changes in the VR video information, and the foreground was obtained by background differencing. The optical flow method was used for foreground detection of the target, and the effects of sparse and dense optical flow were compared to obtain the motion characteristics and optical flow information of the target human body. Features were extracted for human motion in terms of both common geometric features of the body and optical flow information. For the geometric information, the width-to-height ratio, perimeter-to-area ratio, center of mass, eccentricity, and feature angle were extracted. For the optical flow information, optical flow descriptors were constructed using a grid-based approach. The two sets of parameters were then fused by the k-means method to construct a bag-of-words model. The hidden Markov model was used for the recognition of human motion: the model parameters were obtained by training on the human feature parameters of each of four motions, and recognition of these four common human movements was realized with the forward-backward algorithm. The test results show that the proposed motion recognition method has high recognition performance and good anti-interference capability.
Time-sequence pooling is used to order the feature sequences of the effective video frames into feature vectors that represent the dynamic temporal changes of the video; finally, these time-sequence feature vectors are used to train a support vector machine for classification. The recognition accuracy is 65.2% and 89.4% on the HMDB51 and UCF101 datasets, respectively.

the spatial and temporal dynamic changes of the human body in the image sequence. The importance and practicality of human motion recognition research are becoming increasingly prominent: it is not only widely used in intelligent monitoring, human-computer interaction, and motion analysis, but also has important prospects in identification and abnormal-motion monitoring at security-sensitive sites [2]. Motion analysis is the process of recording, with specific markers, the actions of each part of the human body in a sequence performed by an operator, and converting the recorded material into diagrams to judge the quality of the actions; it is widely used in intelligent driver-assistance systems, motion analysis in sports, and physical rehabilitation and physiotherapy [3]. An intelligent driver-assistance system uses a computer to observe the driver's movements in real time and raises an alarm when it observes a misoperation. Motion analysis in sports can be used to detect whether an athlete's movements are standard or within guidelines [4], [5]. Current motion recognition technology can be divided into two main categories according to how the feature information is collected: non-vision-based approaches and vision-based approaches. In non-vision-based approaches, the user wears a sensor, or a sensor is installed in the space where the user lives, to collect human activity data. By wearing a sensor, the movement characteristics of the limbs, head, and torso can be obtained during movement, and the human motion derived from them [6].
This way of acquiring human movement data is accurate, and most films and games use it to acquire three-dimensional human movement. Its disadvantage is that it inconveniences the user's movement and, in a daily monitoring system, can have physical and psychological effects on the user, so it is not well suited to home, hospital, and nursing-home settings [7]. For high-level motion recognition and understanding, the sensor is mostly placed in the human activity space, and motion information is obtained by perceiving the shift of the human position in the environment. Although this method does not require the user to wear a sensor, which is convenient, it cannot accurately judge the user's specific actions. Therefore, non-vision-based motion recognition, although simpler to implement, has a limited application scope [8].
Prior work proposed a CNN-based noise-robust feature learning method that reduces intra-class diversity while increasing the distance between classes during training. CNNs were placed in a multi-task learning framework to learn richer facial features with semantic properties and discriminative ability by combining individual information with high-level human attributes such as gender and age [9], [10]. A multi-stream CNN structure was also proposed in which the combination of six dynamic features provides richer information for motion recognition. In recent years, great progress has been made in recognition rates on the Pascal VOC and ImageNet datasets by trying different CNN structures for target recognition tasks. Beyond CNN structures, the unsupervised-learning-based Auto-Encoder (AE) is also gaining attention [11]. The AE is a generative model whose advantage over CNNs is that it enables cross-domain learning, and embedding nonlinear units in the AE network allows it to learn domain-invariant features between source and target data more efficiently [12].
The impact of domain-to-domain inconsistency on classification can be effectively reduced through deep network structures. A Stacked Progressive AE (SPAE) structure was proposed to model the complex nonlinear transformation from non-frontal to frontal images and learn motion-robust features asymptotically [13]. Each AE maps large face motion variations to a virtual view with a smaller motion angle, so that the output contains only small motion variations. An Adaptive Deep Supervised Network Template (ADSNT) based on the AE was also proposed for the case where labeled training data are not available [14]; effective target features can still be extracted for reconstructing similar facial images even when they are heavily corrupted by noise [15]. The above review of target recognition algorithms shows that, judging from the recognition results of various algorithms on publicly available datasets in recent years, deep-learning-based methods are highly effective for target recognition. CNN-based methods in particular achieve much higher recognition accuracy than traditional methods based on representation learning and hand-crafted features, provided adequate training samples are available [16]. However, the lack of interpretability between the layers of a deep network makes the recognition effect poor in the absence of labeled data, whereas recognition based on sparse representation needs only a very small number of training samples to complete the model construction, giving it obvious advantages with small-scale training sets [17].
The feature parameters commonly used to judge human motion include the center of mass, aspect ratio, perimeter-to-area ratio, feature angle, block-area features, and optical flow information. The selection of feature parameters is particularly important and directly affects the result of motion recognition. Inappropriate feature parameters degrade the accuracy of judgment. Too many feature parameters increase the dimensionality of the whole model, which raises computational complexity and slows the whole system [18], [19]; too few degrade recognition accuracy. We must therefore select appropriate feature parameters for identification. The system consists of a camera acquisition and pre-processing module, a moving-target detection module that extracts the human body in motion, and a module that extracts feature parameters and uses the hidden Markov model to recognize the human motion [20]. To accomplish the expected goal of the system, the project is divided into three main parts. Supervision information is used to obtain the temporal structure of the sequence, to maintain the overall differences within the sequence, and to delete redundant information to reduce the feature size. In-depth research on the mathematical model of locally structured sparse representation shows that the structural information of the representation coefficients can be used to measure and judge the target category without relying on a residual measure based on the Gaussian assumption.
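As an illustrative sketch (not the paper's implementation), the width-to-height ratio, perimeter-to-area ratio, and center of mass can be computed from a segmented binary silhouette; the 4-neighbour perimeter estimate and the function name are assumptions of this example:

```python
import numpy as np

def silhouette_features(mask):
    """Geometric features of a binary human silhouette (0/1 array)."""
    ys, xs = np.nonzero(mask)
    height = ys.max() - ys.min() + 1
    width = xs.max() - xs.min() + 1
    area = int(mask.sum())
    # Crude perimeter estimate: foreground pixels with at least one
    # background pixel among their 4 neighbours.
    padded = np.pad(mask.astype(bool), 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = int((padded[1:-1, 1:-1] & ~interior).sum())
    return {
        "aspect_ratio": width / height,          # width-to-height ratio
        "perimeter_area_ratio": perimeter / area,
        "centroid": (xs.mean(), ys.mean()),      # centre of mass
    }
```

For example, a solid upright 10-by-20 rectangle yields an aspect ratio of 0.5; eccentricity and feature angles would be derived from second-order moments of the same mask.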

II. HIDDEN MARKOV MODEL HUMAN MOTION RECOGNITION DESIGN
A. IMPROVED HIDDEN MARKOV MODEL ALGORITHM
The Hidden Markov Model (HMM) is a probabilistic model describing the statistical properties of stochastic processes. It is based on Markov chains and consists of a double stochastic process: one stochastic process describes state transitions with a finite Markov chain [21], and the other describes the statistical correspondence between each state and the observed values. The hidden states of an HMM cannot be observed directly but are reflected indirectly through the observation vectors they influence. HMMs have been widely used in image processing, speech recognition, and other fields, and have gradually become an effective technique for posture assessment, as shown in Figure 1.
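A minimal sketch of the forward algorithm for evaluating the observation likelihood P(O | λ) of a discrete HMM (parameter names are illustrative, not the paper's code):

```python
import numpy as np

def forward_prob(pi, A, B, obs):
    """Evaluate P(O | lambda) with the forward algorithm.

    pi  - initial state distribution, shape (N,)
    A   - state transition matrix, shape (N, N)
    B   - emission probabilities, B[i, k] = P(o_t = k | state i)
    obs - sequence of observation indices
    """
    alpha = pi * B[:, obs[0]]          # initialisation
    for o in obs[1:]:                  # induction over time steps
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())          # termination
```

With two hidden states and deterministic emissions, the probabilities of all length-2 observation sequences sum to one, as expected of a proper likelihood.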
The algorithm is trained so that P(O|λ) increases monotonically until it converges, at which point the optimal model parameters are obtained. Given initial parameters, P(O|λ) is computed with the forward-backward algorithm; if the improvement in P(O|λ) is still above the convergence threshold ε, training is repeated until the stopping condition is satisfied. From the forward-backward algorithm, two quantities can be derived: the probability that, given the model parameters λ and the observation sequence O, the chain is in state s_i at time t; and the probability that it is in state s_i at time t and in state s_j at time t+1. The Baum-Welch re-estimation formulas are expressed in terms of these quantities. In the hybrid Gaussian background model, we assume that the pixels are mutually uncorrelated, and the change of each pixel in the video is treated as a process that constantly generates new pixel values. As described above, a mixture-of-Gaussians model extends the single Gaussian model by expressing the gray-value distribution of the background through multiple Gaussian distribution functions, which suits complex backgrounds. Suppose there are n data samples X whose distribution cannot be captured by a single Gaussian model; we then assume the data are generated by m single Gaussian models whose mixture weights sum to one [22]. This distribution pattern is called a Gaussian mixture model, with probability density function p(x) = Σ_{j=1}^{m} w_j N(x; μ_j, Σ_j).
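A heavily simplified per-pixel sketch of such a mixture background model for grayscale frames, in the spirit of the Stauffer-Grimson scheme; the learning rate, matching threshold, and zero initialisation are illustrative choices (production code would typically rely on a library implementation such as OpenCV's MOG2):

```python
import numpy as np

class MixtureBackground:
    """K Gaussians per pixel; a pixel matching no component is foreground."""

    def __init__(self, shape, k=3, alpha=0.05, var0=225.0, match_sigma=2.5):
        self.k, self.alpha, self.s2, self.var0 = k, alpha, match_sigma ** 2, var0
        self.w = np.full(shape + (k,), 1.0 / k)   # component weights
        self.mu = np.zeros(shape + (k,))          # means (zero init for the sketch)
        self.var = np.full(shape + (k,), var0)    # variances

    def apply(self, frame):
        d2 = (frame[..., None] - self.mu) ** 2
        match = d2 < self.s2 * self.var           # within match_sigma std devs
        best = np.argmin(np.where(match, d2, np.inf), axis=-1)
        matched = match.any(axis=-1)
        idx = np.eye(self.k, dtype=bool)[best] & matched[..., None]
        # Online update of the matched component only.
        self.w = (1 - self.alpha) * self.w + self.alpha * idx
        self.mu = np.where(idx, (1 - self.alpha) * self.mu
                           + self.alpha * frame[..., None], self.mu)
        self.var = np.where(idx, (1 - self.alpha) * self.var
                            + self.alpha * d2, self.var)
        self.w /= self.w.sum(axis=-1, keepdims=True)
        # Where nothing matched, replace the weakest component.
        weakest = np.eye(self.k, dtype=bool)[np.argmin(self.w, axis=-1)]
        rep = weakest & ~matched[..., None]
        self.mu = np.where(rep, frame[..., None], self.mu)
        self.var = np.where(rep, self.var0, self.var)
        return ~matched                           # foreground mask
```

A new, unmodelled intensity is first flagged as foreground; once a matching component exists, a static pixel is absorbed into the background.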
Then the probability density function of the j-th individual Gaussian model is given, and the weights, the expectations (means), and the covariance matrices are updated in turn. In a target-tracking task, the sample to be represented is often not just one but multiple candidate target samples obtained during acquisition. The joint sparse constraint is often used to model the target in multi-task-learning-based tracking methods; it holds that all candidate samples should correspond to the same base signals in the dictionary when represented sparsely, but this ignores the differences in spatial similarity between samples [23]. Graph-guided sparsity solves this problem effectively with the concept of graphs from graph theory: the samples to be represented are treated as nodes, and the similarity between two samples determines whether they are connected and the weight of the connecting edge. Besides relationships between samples, there are also classification tasks in which the atoms in the dictionary are interrelated [24]. In this case, when an atom participates in the representation of a sample, it can be assumed that atoms belonging to the same category in the dictionary should also participate in representing that sample. This form of constraint encourages the use of a subset of same-category atoms in the dictionary during representation.
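One EM iteration for a one-dimensional Gaussian mixture, sketching the responsibility computation and the weight / mean / covariance updates referred to above (function and variable names are illustrative):

```python
import numpy as np

def gmm_em_step(x, w, mu, var):
    """One EM iteration for a 1-D Gaussian mixture.

    x   - samples, shape (n,)
    w   - mixture weights, mu - means, var - variances, each shape (m,)
    """
    # E-step: responsibility of component j for each sample.
    pdf = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    r = w * pdf
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means (expectations) and variances.
    nk = r.sum(axis=0)
    w_new = nk / len(x)
    mu_new = (r * x[:, None]).sum(axis=0) / nk
    var_new = (r * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk
    return w_new, mu_new, var_new
```

On two well-separated clusters, a single step already moves the component means close to the cluster centers.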

B. VR VIDEO SEQUENCE HUMAN MOTION RECOGNITION DESIGN
Human motion videos are usually acquired with cameras, camcorders, and similar devices, so some background information is inevitably captured in the video sequence. It is important to pre-process the video before feature extraction; this operation removes the interference of background data with motion recognition accuracy while reducing the amount of data to be processed [25]. This paper describes the preprocessing work for two types of videos in detail. Under the visual attention mechanism of the human eye, moving human targets have some degree of saliency in both the spatial and temporal dimensions. It is therefore necessary to extract the salient features of the input detection video sequence to obtain better recognition results. The flow of saliency detection by fusing static and motion features is shown in Figure 2.
In static feature saliency detection, the most commonly used features are the color, luminance, and orientation of an image; saliency detection aims to distinguish an object from the background, and the degree of difference between these features is an important metric for achieving that goal [26]. The color of an object is the most direct perception obtained by the human eye when observing it, so color features are the most commonly used features in visual saliency, with RGB, HSV, and Lab the most frequently used color spaces. RGB is an additive model that mixes colors by superimposing three colors of light, which makes it suitable for light-emitting displays such as monitors [27]. An RGB image is an array of color pixels, where each pixel is a triple corresponding to the red, green, and blue components at a particular spatial location. Dynamic feature saliency detection captures salient features in the time dimension: the greater the difference between target motion and background motion, the more salient the information.
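A minimal color-saliency sketch in the frequency-tuned spirit: each pixel's distance from the mean image color. The literature usually computes this in the Lab color space; plain RGB is used here only to keep the example self-contained:

```python
import numpy as np

def color_saliency(img):
    """Saliency map = per-pixel distance from the mean image colour.

    img: float array of shape (H, W, 3); returns (H, W) in [0, 1].
    """
    mean = img.reshape(-1, 3).mean(axis=0)
    sal = np.linalg.norm(img - mean, axis=-1)
    return sal / sal.max() if sal.max() > 0 else sal
```

A single red pixel in an otherwise gray image receives the maximum saliency score, matching the intuition that color contrast drives static saliency.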
Effective features can accurately represent human motion and improve the accuracy of final motion recognition [28]. The extraction and representation of motion features are crucial to the final classification and constitute the most basic and critical module in motion recognition. The dense trajectory is a typical representative of methods that mix global and local features [29]. After acquiring dense trajectories, the features must be represented. For both HOG and HOF, the orientation is quantized into 8 bins, and HOF has an additional zero bin. The optical flow calculation depicts absolute motion, so camera motion is inevitably included and difficult to remove; however, since the MBH descriptor is derived from the horizontal and vertical derivatives of the optical flow, the influence of camera motion can be removed to some extent [30].
Human motion features are usually represented by global features, human body models, or local features. A global feature treats the entire motion sequence of the human body as a statistical model and derives the motion feature by processing that model. A human body model usually takes the joints, skeleton, and other body parts as key points and models them for movement identification [31]. Local feature representations usually extract spatio-temporal interest points (STIPs) and use them to express the motion feature information effectively. Each kind of feature has its advantages and disadvantages, so hybrid features are gradually becoming the mainstream of research, with the dense trajectory a typical representative of the mixed-feature approach. Trajectories are sampled by dividing a dense grid over the optical flow field; the sampled feature points are tracked to obtain trajectories, and the trajectory displacement vectors are then calculated as the shape descriptor [32]. On this basis the machine ultimately achieves human motion recognition.
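The trajectory shape descriptor can be sketched as follows: a sampled point is tracked through successive optical flow fields, and the displacement vectors are normalised by the total displacement magnitude. The nearest-neighbour flow lookup and the function name are simplifications of this example, not the dense-trajectory implementation itself:

```python
import numpy as np

def trajectory_descriptor(flows, point):
    """Track `point` = (x, y) through a list of flow fields.

    Each flow is an (H, W, 2) array of (dx, dy); returns the
    magnitude-normalised displacement vectors (the shape descriptor).
    """
    x, y = point
    disps = []
    for flow in flows:
        dx, dy = flow[int(round(y)), int(round(x))]
        disps.append((dx, dy))
        x, y = x + dx, y + dy           # follow the point to the next frame
    disps = np.asarray(disps, dtype=float)
    norm = np.sum(np.linalg.norm(disps, axis=1))
    return disps / norm if norm > 0 else disps
```

For a point drifting uniformly to the right over two frames, the descriptor is [[0.5, 0], [0.5, 0]]: the normalisation makes it invariant to the speed of the motion.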
To ensure the objectivity and fairness of the evaluation, the proposed model is validated on multiple datasets using the EDN evaluation method. First, we introduce the self-acquired datasets and the data pre-processing operations; second, we describe the implementation details and use the datasets to carry out comparative experimental verification and evaluation of the proposed source-scene-preserving video generation algorithm. In motion-guided character video generation, the quality of appearance and background transitions is subjective; to measure the effectiveness of the proposed method, we check whether the edges of the fusion region are embedded naturally and smoothly. The fusion results of the Poisson image editing algorithm are compared with those obtained by the EDN method without background fusion.
The MVC-based Poisson fusion method replaces the original Poisson equation with the Laplace equation and then uses mean-value coordinates to approximate the solution, turning it into an interpolation problem of reduced complexity whose simpler algorithm can run interactively in real time. This fusion method blends the foreground smoothly into the background, keeping the picture unobtrusive and smooth with no obvious color difference at the border. Compared to other fusion methods, the MVC-based Poisson image fusion technique is more powerful in color manipulation, allowing two different color versions to blend seamlessly and adjusting the portrait color to obtain a newly fused image while preserving the person in full, including details such as edges and corners. Experiments show that the MVC-based Poisson fusion acceleration method is well suited to the fusion operation in character video generation, with outstanding picture quality.
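A one-dimensional toy version of Poisson editing illustrates the underlying idea: inside the blend region the result keeps the gradients of the source while matching the target at the boundary. Real systems solve the same equation over a 2-D region, or approximate the solution with mean-value coordinates as in the MVC method above; the discretisation below is only a sketch:

```python
import numpy as np

def poisson_blend_1d(source, target, lo, hi):
    """Blend `source` into `target` over indices [lo, hi).

    Solves f'' = source'' inside the region with Dirichlet boundary
    values taken from `target` (discrete 1-D Poisson equation).
    """
    n = hi - lo
    # Tridiagonal discrete Laplacian.
    A = (np.diag(np.full(n, -2.0))
         + np.diag(np.ones(n - 1), 1)
         + np.diag(np.ones(n - 1), -1))
    # Right-hand side: source Laplacian, minus boundary contributions.
    b = (source[lo + 1:hi + 1] - 2.0 * source[lo:hi]
         + source[lo - 1:hi - 1]).astype(float)
    b[0] -= target[lo - 1]
    b[-1] -= target[hi]
    out = target.astype(float).copy()
    out[lo:hi] = np.linalg.solve(A, b)
    return out
```

Blending a linear ramp (zero Laplacian) into a constant target reproduces the constant exactly: with no source curvature to preserve, the interior interpolates the boundary values, which is precisely the membrane behaviour of Poisson editing.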

III. DESIGN OF THE DATA SET AND EVALUATION INDICATORS
A. DATA SET DESIGN
The target dataset is a ten-minute video recording of arbitrary actions that should show the full appearance of the target object and a sufficient range of motion to ensure the learning quality of the target video. To ensure image quality, multiple sets of videos at frame rates of 100-120 were used as training datasets. The DeepLabv3+ algorithm segments the figure foreground from the background, and the source background serves as the background data in the fusion operation.
The DeepLabv3+ algorithm is used for accurate segmentation of foreground and background, the motion transformation algorithm for high-precision motion estimation and high-quality motion transformation, and the Poisson fusion algorithm to achieve a natural and seamless switching effect. Using the segmented foreground figure dataset for motion detection reduces the effects of screen occlusion and background interference. The obtained video frames of the source and target characters are used as input to the source-scene-preserving video generation network, and the poses are estimated and pose sketches drawn with a pre-trained pose detector [33]. High-resolution human motion video is synthesized by the motion conversion network based on a generative adversarial network. The fused frames are then assembled in an image-to-video operation, with Poisson image editing applied in the fusion step to obtain the video with the target effect.
As shown in Figure 3, five consecutive frames are displayed in each section: the first column shows the source character sequence, and the second column the standardized pose skeleton diagram. The third column shows the result generated by the EDN motion transformation model trained on the source video frames, the fourth column the proposed source-scene-preserving video generation model, which outputs the target character with the source background using the Poisson fusion algorithm, and the fifth column the output of the Laplacian pyramid fusion algorithm [34]. To achieve fusion with the background style of the source object, the Poisson image editing algorithm is selected to fuse the foreground image with the source background image under the dim stage atmosphere. The fused video frames are assembled in an image-to-video operation, and extensive experiments show that the visual quality of the final synthesized video is greatly improved.
One performance measure of tracking results is the Euclidean distance error between the predicted and true target center positions. Precision is the percentage of all detection results that are correctly detected [33]. A threshold of 20 pixels is set for position error; a detection below this threshold is considered correct, and the precision under this threshold is calculated. Because of differences between algorithms, a fixed position-error criterion of 20 pixels may not be fair to some of them, so precision curves over the position-error threshold are compared across algorithms. Another metric is the overlap, the ratio of the intersection to the union of the tracking target box and the ground-truth box. If the tracking result exceeds the overlap criterion, the frame is considered successfully tracked, and statistics can be gathered over the entire video. By sweeping the overlap threshold, a tracking success curve is obtained, and the AUC of this curve is used as an overall measure of tracking.
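The three measures above, centre position error, overlap (intersection-over-union), and precision at a threshold, can be sketched directly; boxes are assumed to be (x, y, w, h) tuples:

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between the centres of two (x, y, w, h) boxes."""
    cp = (pred[0] + pred[2] / 2, pred[1] + pred[3] / 2)
    cg = (gt[0] + gt[2] / 2, gt[1] + gt[3] / 2)
    return ((cp[0] - cg[0]) ** 2 + (cp[1] - cg[1]) ** 2) ** 0.5

def overlap(pred, gt):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2 = min(pred[0] + pred[2], gt[0] + gt[2])
    y2 = min(pred[1] + pred[3], gt[1] + gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

def precision_at(errors, tau=20.0):
    """Fraction of frames whose centre error is below `tau` pixels."""
    return float((np.asarray(errors) < tau).mean())
```

Sweeping `tau` yields the precision curve, and sweeping the overlap threshold yields the success curve whose area under the curve serves as the overall score.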
The running time of each method on the CVPR-TB50 dataset is shown in Table 1, from which it can be seen that KCF, CT, and MST-ours have a big speed advantage; the MST-ours method can process more than 140 frames per second on this dataset.
By using joint sparse constraints, the spatial relationship between templates is considered during representation, which is better than learning each sample independently. In the second set of experiments, after the weight matrix on the representation coefficients was removed, candidate samples with higher similarity to the newly added templates were no longer preferentially selected, indicating that the spatio-temporal prior relationship between templates and candidate targets can effectively correct the representation results. The third set of experiments was designed to verify the effectiveness of the contribution-indication matrix in handling occlusion [35]. On several video sequences containing occlusions, the algorithm proposed in this chapter is significantly improved compared to the third set of comparison experiments.
The multiscale conditional generative adversarial network is implemented in the TensorFlow framework, using the Adam optimizer with the optimization hyperparameters. The one-discriminator network contains a single discriminator whose inputs are true pairs and false pairs: the true pair consists of the final synthetic image and the conditional person image x, while the false pair consists of the target person image y and the conditional person image. The two-discriminator network has two discriminators, one identical in structure to the single-discriminator network, while the input to the other consists of true and false pairs formed from the conditional person image x, the final composite image, and the target person image y after a two-fold downsampling operation [36]. The three-discriminator network consists of three discriminators, two of which are identical in structure to those in the two-discriminator network, while the input to the third consists of true and false pairs formed from the conditional person image, the final composite image, and the target person image after a four-fold downsampling operation. Note that we used the same generator, the same dataset, and the same hyperparameters to train and test all of the above comparison experiments.

B. EVALUATION DESIGN
During the experiment, the search area is 3 times the target size. In the bottom-up part, the filter regularization parameters and the learning rate are the same as in the KCF tracker settings, and the response map is generated from color and HOG features. In the size estimation, a set of scales is included with a size change factor of 1.006. In the top-down part, 50 positive samples are selected in the first frame to initialize the dictionary. Each target template and candidate sample is resized to 32 × 32 pixels and split into local image blocks of 16 pixels with a step of 8 pixels between blocks. In the sparse representation, the penalty factors for the sparse constraint term and the differential constraint term are set to 0.01 and 0.03, respectively.
The OTB dataset contains a total of 100 test sequences. During tracking, the target in the first frame of each sequence was manually calibrated. For the quantitative evaluation of each tracker, two metrics, center position error and overlap rate, are used, from which the precision plot and success plot are obtained, respectively. The precision plot shows the pixel difference between the predicted and actual box centers; its performance score is the proportion of frames whose difference is within 20 pixels. The success plot is calculated from the overlap ratio, which accounts for the size variation of the target in addition to the center position; its performance score is the area under the curve, as shown in Figure 4.
We used the same evaluation criteria as previous person image generation work, together with their masked versions, mask-SSIM and mask-IS. The SSIM value combines luminance, contrast, and structure comparisons between the synthetic image and the ground truth; it ranges from 0 to 1, with higher values indicating greater structural similarity. The Inception Score (IS) is a common evaluation metric closely related to human subjective judgment; it evaluates synthetic images by capturing the distinctiveness of a single sample and the diversity among multiple samples. Since the generator does not know what background the generated image should have, the image background is removed to mitigate its effect when computing mask-SSIM and mask-IS; the masked assessment follows the method presented in PG2. The multistage conditional generative adversarial network achieved the highest and second-highest scores on the IS and mask-IS criteria, respectively, but still lags behind the latest methods on the SSIM and mask-SSIM criteria. Because of the random and ambiguous nature of the samples, it is not yet possible to generate high-resolution, realistic person images for all samples. Moreover, the three-discriminator network generates images inferior to those of the two-discriminator network, for three main reasons: the three discriminators are difficult to reconcile and training is less stable; the repeated downsampling destroys the spatial structure of the image; and for low-resolution images, repeated downsampling extracts invalid feature information.
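For illustration, the SSIM combination of luminance, contrast, and structure terms can be computed over a whole image in a few lines; real implementations use an 11 × 11 sliding Gaussian window, so the single-window simplification below is an assumption of this sketch:

```python
import numpy as np

def global_ssim(x, y, L=255.0):
    """SSIM of two images computed over a single global window."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2      # stabilising constants
    mx, my = x.mean(), y.mean()                    # luminance terms
    vx, vy = x.var(), y.var()                      # contrast terms
    cov = ((x - mx) * (y - my)).mean()             # structure term
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

An image compared with itself scores exactly 1, while comparison with its negative (anti-correlated structure) drives the score below zero.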
In the confusion matrix, each row represents the actual motion and each column the recognized motion. The results show that the recognition rate is relatively high for simple movements. Because background subtraction uses the mixture-of-Gaussians model, foreground detection is effective, the foreground target is extracted well, and anti-interference is good; even in a more complex indoor environment, human bodies wearing clothing of different colors can be identified reliably. Thanks to feature extraction based on mixed features, the various movements of the human body in the indoor environment are detected accurately, the recognition rate is high, and the recognition error rate remains within a controllable range. Overall, the recognition results for simple movements are acceptable, and the actual movements of the human body are detected fairly accurately.

A. MOTION PREDICTION EXPERIMENTAL PROCESS AND ANALYSIS OF RESULTS
The FOA-SVR-based posture prediction method is implemented with Sklearn, the machine learning library in Python, and the results are validated and analyzed, showing that the method can effectively predict the posture values. In the experimental scheme of this paper, the population size of the improved fruit fly optimization algorithm is 20, the maximum number of iterations is 200, and the initial search step is 5; a single-step prediction scheme is used, i.e., the preceding posture values are used to predict the next posture value. Since 270 potential values are used to construct the training set and 30 to construct the test set, 260 training samples and 20 test samples are eventually obtained. The improved fruit fly algorithm is then used to optimize the penalty coefficient C and the parameter g of the RBF kernel in the SVR algorithm, and the globally optimal parameter values are obtained through an iterative optimization process. As Figure 5 shows, the closer the coefficient of determination is to 1, the closer the predicted values are to the actual values and the higher the prediction accuracy. It can be seen that with continued iterations the parameters keep changing and updating, and the coefficient of determination approaches 1 until it converges to the globally optimal parameters.
After the iterative search of the improved fruit fly optimization algorithm, the optimal parameter values are obtained. These parameters are substituted into the SVR model, the SVR is trained with the training set, and the prediction performance of the method is verified on the test set to obtain the predicted potential sequence. Figure 6 compares the predicted and actual results for the 20 test potential values, where "predict" denotes the predicted values; the coefficient of determination reaches 0.906. The figure shows that the method achieves a good prediction effect: the trend of the prediction curve is consistent with that of the actual curve, the fit is good, and the prediction accuracy is high.
The results are shown in Figure 7. Analyzing the results, the difference between the predicted potential value and the actual potential value is very small, which further illustrates that the IFOA-SVR-based safety posture prediction method has a good prediction effect. To further demonstrate the superiority of this paper's prediction method over other methods, this section conducts a comparative experiment based on the experimental results in the previous subsection. The prediction method proposed in this paper improves potential prediction by using the improved optimization algorithm to tune support vector regression.
In the baseline FOA-SVR posture prediction scheme, the fruit fly population size is 20, the maximum number of iterations is 200, and the fixed step length is 5. After iterative optimization of the fruit fly algorithm, the coefficient of determination was 0.881067350, lower than that of the IFOA-SVR prediction model, with optimal parameter values of 3.38 and 12.32. For the particle swarm comparison, the learning factor C1 is related to the local search ability of the particle swarm algorithm, while the learning factor C2, set to 1.6, is related to its global search ability; the same support vector regression parameters are optimized by iterative search. After the iterative optimization of the particle swarm algorithm, a posture prediction model was constructed with a coefficient of determination of 0.829. The comparison shows that both coefficients of determination remain lower than that obtained with the improved fruit fly optimization algorithm. For each of the three swarm intelligence algorithms used to select the optimal SVR parameters C and g, 200 iterations were performed, the coefficient of determination obtained at each iteration was used as the fitness of the algorithm, and the fitness curves of the improved fruit fly, fruit fly, and particle swarm algorithms were plotted separately. The fitness curve of the improved fruit fly algorithm is shown in Figure 8.
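For comparison, the particle swarm baseline can be sketched as follows. The text specifies C2 = 1.6; the value of C1, the inertia weight, and the search bounds for (C, g) below are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

def pso_svr(X, y, n_particles=20, max_iter=20, c1=1.6, c2=1.6, w=0.7):
    """Particle-swarm search over (C, gamma) maximizing cross-validated R^2.
    c2 = 1.6 follows the text; c1, inertia w, and the bounds are assumptions."""
    lo = np.array([0.1, 1e-3])              # lower bounds for (C, gamma)
    hi = np.array([1000.0, 10.0])           # upper bounds for (C, gamma)
    pos = rng.uniform(lo, hi, size=(n_particles, 2))
    vel = np.zeros_like(pos)

    def fit(p):
        return cross_val_score(SVR(kernel="rbf", C=p[0], gamma=p[1]),
                               X, y, cv=3, scoring="r2").mean()

    pbest = pos.copy()
    pbest_f = np.array([fit(p) for p in pos])
    gbest = pbest[np.argmax(pbest_f)].copy()
    for _ in range(max_iter):
        rr1, rr2 = rng.random((n_particles, 2)), rng.random((n_particles, 2))
        vel = w * vel + c1 * rr1 * (pbest - pos) + c2 * rr2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)    # keep particles inside the bounds
        f = np.array([fit(p) for p in pos])
        improved = f > pbest_f
        pbest[improved], pbest_f[improved] = pos[improved], f[improved]
        gbest = pbest[np.argmax(pbest_f)].copy()
    return gbest, pbest_f.max()

# Synthetic regression data as a stand-in for the posture samples.
Xs = rng.standard_normal((80, 5))
coef = rng.standard_normal(5)
ys = Xs @ coef + 0.1 * rng.standard_normal(80)
(best_C, best_g), best_r2 = pso_svr(Xs, ys, n_particles=8, max_iter=5)
```

The per-iteration best fitness values, collected over 200 iterations, would give the fitness curves compared in Figure 8.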

B. ANALYSIS OF HUMAN MOTION RECOGNITION RESULTS
On the KTH dataset, videos of 9 people are selected as the test set, while videos of the remaining 16 people form the training set, satisfying a training-to-test video ratio of approximately 2:1. On the UCF Sports dataset, the same 2:1 training-to-test ratio is used as for KTH. To depict the accuracy of the classification results obtained by the proposed algorithm, confusion matrices are computed on the KTH dataset and the UCF Sports dataset, respectively, as shown in Figure 9.

Because the surveillance camera was mounted high up, the moving targets to be detected are small. The moving objects in this video scene include pedestrians and people riding electric vehicles in addition to vehicles, but due to the shooting angle such objects are small and effective features cannot be extracted in the feature extraction stage; therefore, in the moving object detection stage, moving objects whose pixel area is too small are actively filtered out, and only large moving vehicles are detected. Besides, there are no other moving targets or dynamic backgrounds in the scene, so the video can be used as experimental data for moving target detection.

As shown in Figure 9, in the KTH dataset some motions are misrecognized as ''clapping'' because of their partial resemblance to the ''clapping'' movement, and due to the high similarity between ''jogging'' and ''running'' there is a 10-15% error in the recognition rate between the two. In the UCF Sports dataset, the instability of the samples and the somewhat higher background complexity make the dataset itself more difficult to recognize.
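The confusion-matrix evaluation described above can be sketched with Scikit-learn. The six class names are the standard KTH actions; the labels below are synthetic stand-ins for the classifier's output on the 9-person test split, with a simulated confusion rate that is only illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Standard KTH action classes.
classes = ["boxing", "handclapping", "handwaving", "jogging", "running", "walking"]
rng = np.random.default_rng(0)
y_true = rng.integers(0, 6, size=120)
y_pred = y_true.copy()
flip = rng.random(120) < 0.12                     # simulate ~12% confusions
y_pred[flip] = rng.integers(0, 6, size=int(flip.sum()))

cm = confusion_matrix(y_true, y_pred, labels=list(range(6)))
# Row-normalized diagonal = per-class recognition rate
# (rows: actual motion, columns: recognized output).
per_class = cm.diagonal() / np.maximum(cm.sum(axis=1), 1)
```

In Figure 9, the off-diagonal mass in the "jogging" and "running" rows corresponds to the 10-15% mutual confusion noted in the text.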
To evaluate the efficiency of the algorithm in this chapter, the number of sampled trajectories is used as the evaluation index, and the ratio of the number of trajectories sampled here to the number sampled in the literature is computed. As Figure 10 shows, because the algorithm uses saliency detection to locate the moving target subject and then extracts dense trajectories only in the subject region, the number of dense trajectories obtained is only 34.2%-80.3% of that in the literature, which improves the efficiency of the algorithm to some extent. However, because the saliency detection algorithm is introduced, the algorithm in this chapter is more time-consuming than that in the literature. Using processing speed as an evaluation metric, a set of videos is selected from the KTH dataset and the UCF Sports dataset, respectively. First, the static saliency values of multiple video frames are calculated to locate the moving subject, and this is linearly combined with the motion region obtained from dynamic saliency detection to obtain the subject motion region. Dense trajectories are extracted only in this region, and a spatio-temporal volume is constructed along each trajectory to obtain trajectory descriptors. Finally, the Fisher vector is used to encode the feature descriptors, and an SVM is used for classification. The simulation results show that the algorithm achieves good recognition results both on the simple KTH dataset and on the more complex UCF Sports dataset, and to some extent improves the validity and comprehensiveness of motion feature expression in the dense-trajectory-based human motion recognition algorithm. The next step is to expand the application scenarios and reduce the complexity of the algorithm. The recognition rates of the experimental results are shown in Figure 11.
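The Fisher vector encoding step can be sketched as below, using the standard improved-FV formulation over a diagonal-covariance GMM. The descriptor dimensionality, the number of GMM components, and the synthetic descriptors are assumptions; real inputs would be the trajectory descriptors extracted from the subject motion regions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(desc, gmm):
    """First- and second-order Fisher vector of local descriptors under a
    diagonal-covariance GMM (standard improved-FV formulation)."""
    n, d = desc.shape
    q = gmm.predict_proba(desc)                       # (n, K) soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (desc[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    u = (q[:, :, None] * diff).sum(0) / (n * np.sqrt(w)[:, None])
    v = (q[:, :, None] * (diff**2 - 1)).sum(0) / (n * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([u.ravel(), v.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))            # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)          # L2 normalization

# Fit the codebook GMM on (synthetic stand-in) trajectory descriptors.
rng = np.random.default_rng(0)
train_desc = rng.standard_normal((500, 32))           # e.g. 32-D trajectory features
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(train_desc)
video_fv = fisher_vector(rng.standard_normal((200, 32)), gmm)
# video_fv has dimension 2 * K * d = 2 * 8 * 32 = 512, ready for a linear SVM.
```

One Fisher vector per video is then fed to the SVM classifier mentioned in the text.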
Here, too, each row represents the actual motion and each column the recognized output. The results again confirm a high recognition rate for simple movements: the hybrid Gaussian background subtraction yields reliable foreground extraction and strong interference resistance, and the mixed-feature extraction keeps the error rate within a controllable range even for differently clothed subjects in a more complex indoor environment.
The human body geometry information and optical flow descriptors were extracted and fused into a bag-of-words model using the k-means algorithm, after which human motion recognition was performed with the HMM algorithm, completing HMM parameter training and motion recognition, respectively. The experimental results show that the recognition rate of the HMM is high, the error rate is within a controllable range, and the anti-interference ability is good thanks to the use of several mixed features for feature extraction. The method can thus complete human motion recognition well in an indoor environment.
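The training-then-recognition pipeline above can be sketched end to end: fused per-frame features are quantized into a k-means codebook, one discrete HMM per action is trained with Baum-Welch, and a test sequence is assigned to the model with the highest forward-algorithm likelihood. The state count, codebook size, action names, and toy random sequences below are illustrative assumptions, not the paper's actual data.

```python
import numpy as np
from sklearn.cluster import KMeans

def baum_welch(seqs, n_states=4, n_symbols=16, n_iter=15, seed=0):
    """Baum-Welch (EM) estimation of (pi, A, B) for a discrete-observation HMM
    from the training sequences of one action class (scaled recursions)."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(n_states))
    A = rng.dirichlet(np.ones(n_states), size=n_states)
    B = rng.dirichlet(np.ones(n_symbols), size=n_states)
    for _ in range(n_iter):
        pi_acc = np.zeros(n_states)
        A_num = np.zeros((n_states, n_states))
        B_num = np.zeros((n_states, n_symbols))
        for obs in seqs:
            T = len(obs)
            alpha = np.zeros((T, n_states)); c = np.zeros(T)
            alpha[0] = pi * B[:, obs[0]]
            c[0] = alpha[0].sum(); alpha[0] /= c[0]
            for t in range(1, T):                          # scaled forward pass
                alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
                c[t] = alpha[t].sum(); alpha[t] /= c[t]
            beta = np.zeros((T, n_states)); beta[-1] = 1.0
            for t in range(T - 2, -1, -1):                 # scaled backward pass
                beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
            gamma = alpha * beta                           # state posteriors
            pi_acc += gamma[0]
            for t in range(T - 1):                         # accumulate xi statistics
                A_num += alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / c[t + 1]
            for t in range(T):
                B_num[:, obs[t]] += gamma[t]
        pi = pi_acc / pi_acc.sum()
        A = (A_num + 1e-8) / (A_num + 1e-8).sum(axis=1, keepdims=True)
        B = (B_num + 1e-8) / (B_num + 1e-8).sum(axis=1, keepdims=True)
    return pi, A, B

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a symbol sequence under an HMM (scaled forward algorithm)."""
    alpha = pi * B[:, obs[0]]
    s = alpha.sum(); ll = np.log(s); alpha /= s
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum(); ll += np.log(s); alpha /= s
    return ll

rng = np.random.default_rng(0)
frames = rng.standard_normal((400, 12))          # fused geometric + optical-flow features
codebook = KMeans(n_clusters=16, n_init=4, random_state=0).fit(frames)
actions = ["walk", "run", "wave", "clap"]
# Toy per-action corpora of quantized frame sequences (stand-ins for labeled videos).
corpora = {a: [codebook.predict(rng.standard_normal((30, 12)))
               for _ in range(6)] for a in actions}
models = {a: baum_welch(seqs, seed=i) for i, (a, seqs) in enumerate(corpora.items())}
test_seq = corpora["run"][0]
scores = {a: forward_loglik(test_seq, *m) for a, m in models.items()}
decision = max(scores, key=scores.get)           # highest output probability wins
```

The small smoothing constant in the M-step keeps unseen symbols from producing zero emission probabilities when a test sequence is scored under another action's model.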

C. ANALYSIS OF THE RESULTS OF THE EVALUATION OF INDICATORS
The proposed method is compared with 10 innovative methods that perform well on the OTB benchmark, including one correlation-filter-based tracker, SRDCF. As shown in Figure 12, the tracker proposed in this chapter ranks first in tracking precision and third in success rate. Combining the precision and success-rate plots, the proposed tracker outperforms most algorithms in OTB. Figure 12 also presents the precision and success rate of each tracker in eight different challenge scenarios, showing that the proposed tracker handles background clutter and occlusion effectively. This is mainly because the tracker contains both short-term and long-term memory information of the target: when the appearance pattern of the target is disrupted by interference from surrounding objects and then reappears in the scene, the long-term memory component can still reflect it accurately. At the same time, the bottom-up part provides discriminative particles for the top-down part, so that useless particles are eliminated in large numbers and the remaining particles, although limited in number, cover a wide range, greatly increasing the representation efficiency of the top-down part.
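The OTB precision and success metrics behind Figure 12 can be computed as below. This follows the common OTB conventions (center-location error with a 20-pixel threshold; success as the area under the IoU threshold curve); the boxes are synthetic stand-ins for real tracker output.

```python
import numpy as np

def iou(a, b):
    """IoU of axis-aligned boxes given as (x, y, w, h)."""
    ax2, ay2 = a[..., 0] + a[..., 2], a[..., 1] + a[..., 3]
    bx2, by2 = b[..., 0] + b[..., 2], b[..., 1] + b[..., 3]
    iw = np.clip(np.minimum(ax2, bx2) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    ih = np.clip(np.minimum(ay2, by2) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = iw * ih
    union = a[..., 2] * a[..., 3] + b[..., 2] * b[..., 3] - inter
    return inter / np.maximum(union, 1e-12)

def otb_metrics(pred, gt):
    """OTB-style precision (center error <= 20 px) and success AUC."""
    pc = pred[:, :2] + pred[:, 2:] / 2.0          # predicted box centers
    gc = gt[:, :2] + gt[:, 2:] / 2.0              # ground-truth box centers
    err = np.linalg.norm(pc - gc, axis=1)
    precision = (err <= 20).mean()
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0, 1, 21)
    success_auc = np.mean([(overlaps > t).mean() for t in thresholds])
    return precision, success_auc

# Synthetic example: 50 frames, five gross misses.
gt = np.tile(np.array([20.0, 30.0, 50.0, 80.0]), (50, 1))
pred = gt.copy()
pred[:5, 0] += 200.0
precision, success_auc = otb_metrics(pred, gt)
```

Ranking trackers by these two numbers reproduces the precision and success-rate orderings discussed for Figure 12.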
In other scenarios, the proposed tracker performs essentially the same as the other two deep-learning-based trackers, CREST and ADNet, but not as well as the GCT tracker. It should be noted that the method proposed in this chapter still has a significant advantage over methods without pre-trained parameters. When the appearance of the target changes continuously, most manually extracted features, such as color and HOG features, cannot accurately describe the target's appearance and shape, which is a common problem of most methods without pre-trained parameters. The approach in this chapter redefines the target appearance through the GRS model, which mitigates this limitation by adding high-level perceptual information such as contextual information and spatio-temporal continuity. However, the problems with manual features can still reduce the effectiveness of particle generation in the short-term tracker and lead to tracking drift. Finding more stable target features for short-term trackers is therefore one direction for future research.
The VOT2016 benchmark contains 60 video sequences. Unlike the OTB benchmark, which initializes each tracker only once at the start of a video, the VOT benchmark uses a restart mechanism that re-initializes the tracker at the true target position five frames after a tracking failure. Accuracy is the average overlap between the predicted and ground-truth boxes during the successful tracking phases; robustness measures the number of tracking failures. The expected average overlap (EAO) combines accuracy and robustness, as shown in Figure 13.
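The accuracy/robustness protocol just described can be sketched as follows. This is a simplified rendering: a failure is a frame with zero overlap, the five post-restart frames are excluded from the accuracy average, and the full EAO computation (expected overlap curves averaged over a sequence-length interval) is omitted.

```python
import numpy as np

def vot_accuracy_robustness(overlaps, burnin=5):
    """Simplified VOT-style accuracy/robustness from per-frame overlaps.
    An overlap of 0 counts as a failure; the tracker restarts and the next
    `burnin` frames are excluded from the accuracy average."""
    failures, skip, kept = 0, 0, []
    for o in overlaps:
        if skip > 0:                 # frames consumed by the re-initialization
            skip -= 1
            continue
        if o == 0.0:
            failures += 1
            skip = burnin
        else:
            kept.append(o)
    accuracy = float(np.mean(kept)) if kept else 0.0
    return accuracy, failures

# One failure at frame 3; the five following frames are excluded.
acc, fails = vot_accuracy_robustness(
    [0.8, 0.6, 0.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.9])
```

Plotting accuracy against failure counts for all trackers gives the pooled AR plots referenced in Figure 13.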
The proposed tracking algorithm is compared with 29 related and recently proposed tracking methods, including 16 correlation-filter trackers based on manually extracted features. The tracking performance of each method is evaluated by pooled AR plots and the expected average overlap ranking. As shown in Figure 13, the FCF tracker, the SSKCF tracker, and the combined bottom-up and top-down tracking framework proposed in this paper achieve the highest localization accuracy during the successful tracking phase, while the CCOT, DeepSRDCF, and DDC trackers have the fewest target losses during tracking. The proposed integration framework is tested on the OTB-100 and VOT2016 datasets. The dictionary atoms collected by the top-down part are also important for downstream tasks such as target action recognition and multi-target tracking.

V. CONCLUSION
We provide an in-depth study of a recognition algorithm for human motion in VR video sequences based on an improved hidden Markov model. In target recognition, global sparse classification models often suffer degraded recognition accuracy under local occlusion and shape variation of the samples because they impose Gaussian assumptions on the noise distribution; this work addresses that problem.
The background subtraction of the ViBe algorithm is applied to the video frames to obtain the human movement region, and dense trajectory features are extracted from that region. The Fisher vector is used to encode the features, and linear sequence difference analysis linearly maps the feature vectors of the sequence into a low-dimensional subspace; supervised information is used to capture the temporal structure of the sequence, preserving its overall discriminability while removing redundant information to reduce the feature dimension. Through an in-depth study of the mathematical model of local-structure sparse representation, it is found that the structural information of the representation coefficients can be used to judge the target category without relying on a residual metric based on Gaussian assumptions. At the same time, the local-structure sparse model is improved to correct noisy data using dictionary atoms, which improves target recognition accuracy and lays the foundation for subsequent target tracking. The Baum-Welch algorithm is used to train the parameters of the human body under each of the four motions, yielding the hidden Markov model parameter files. In hidden Markov model-based motion recognition, the model with the highest output probability is selected as the decision result, and the four common human motions are recognized. The recognition rate of the four actions is between 94% and 98%, the misjudgment rate is within a controllable range, and the method has good anti-interference ability during recognition. These results lay a solid foundation for future work.
LEI LIU received the degree from the Department of Physical Education, Qiqihar University, in 2000, and the master's degree in sports training from Harbin Engineering University, in 2010. She is a Lecturer with the College of Physical Education, Jiamusi University. For a long time, she has been teaching with the College of Physical Education, Jiamusi University, responsible for the teaching and training of volleyball, swimming, touch rugby, and sports and leisure programs. She has published 15 articles, and hosted and participated in a number of provincial and municipal scientific research projects and won awards.
YUFENG JIAO received the degree in applied electronics technology from Harbin Engineering University, in 1999, the master's degree in material science and engineering from Jiamusi University, in 2009, and the Ph.D. degree in material science and chemical engineering from Harbin Engineering University, in 2016. She is an Associate Professor with Jiamusi University. For a long time, she has been teaching with the College of Materials Science and Engineering, as well as many courses such as ''Casting Test Technology.'' She has published more than ten articles, and hosted and participated in a number of national, provincial, and municipal scientific research projects and won awards.
FANWEI MENG received the degree from the Department of Physical Education, Qiqihar University, in 2000, and the master's degree from the College of Sports Science, Shenyang Normal University, in 2009. He is an Associate Professor with the College of Physical Education, Jiamusi University. For a long time, he has been teaching fitness theory and practice, basketball, football, speed skating, and other courses with the College of Physical Education, Jiamusi University. He has published more than ten articles, and hosted and participated in a number of provincial and municipal scientific research projects and won awards. VOLUME 8, 2020