Behavioral Feature Description Method Based on the Vector Module Ratio and Vector Angle of Human Body Structure

In the field of computer vision, depth image sequences collected by a depth camera are insensitive to interference from illumination, occlusion, and the background environment. In recent years they have therefore often been used to collect behavior data, from which skeletal joint features are extracted as behavior information. However, directly using the joint coordinates collected by the depth camera for behavior recognition is easily affected by individual differences in behavior and by changes in shooting distance. Considering the position information of human joints together with the implicit angle and length information of the limbs, a behavior feature description method based on the vector angle and vector modulus ratio of human body structure is proposed on the basis of skeleton data. This method effectively addresses the above problems, obtains good results on a self-built data set, and is suitable for recognizing simple daily behaviors.


I. INTRODUCTION
In today's society, the pace of life is fast, work pressure is high, and population aging is severe; widowed elderly people and left-behind children have become an unavoidable social problem. To address the daily monitoring of such isolated groups in small-scale indoor scenes, this paper proposes a behavior feature description method based on the vector modulus ratio and vector angle of human body structure.
According to the characteristics of acquisition equipment at different stages, researchers have extracted global features [1]–[4], local features [5]–[8], and skeletal joint features from the acquired images for behavior recognition research. At present, skeletal joint features are a research hotspot, and the behavior feature description method proposed in this paper is based on them.
The associate editor coordinating the review of this manuscript and approving it for publication was Dalin Zhang. (VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License; see https://creativecommons.org/licenses/by/4.0/.)

Skeletal joint features are the joint information obtained from an individual's action sequence, including relative positions, trajectories, etc. Shotton et al. [9] proposed dividing the human body into 31 parts through machine learning; in 2013, Shotton et al. [10] further determined the three-dimensional coordinates of the target's joints from the depth information of the acquired images and drew joint images. Wang et al. [11] used Shotton's method to track 20 human skeletal joints and used the Fourier Temporal Pyramid (FTP) to model the temporal pattern of the joint feature vectors. Their main contribution is the Actionlet Ensemble (AE) model, which can cope with skeleton-tracking errors and better characterize intra-class variation; tested on the MSRAction-3D data set, it reached a recognition accuracy of 88.2%. Packer et al. [12] fused object recognition with behavior recognition, so that the system not only detects the subject's own motion but also uses held objects to help judge behavior. They applied discriminative techniques to the pose traces of the target person and modeled them; by combining these elements into one model they could simultaneously recognize actions and track object positions, achieving good results on a cooking-action data set. Wang et al. [13] combined the positional relationships of body parts with pixel-to-camera distances to obtain features for analyzing human behavior. Sempena et al. [14] obtained the spatial positions of skeletal points, transformed them into quaternion form, and used a dynamic programming algorithm for recognition.
That method mainly classifies upper-limb movements. Pisharady and Saerbeck [15] selected the angles between limb vectors as the distinguishing feature and used a Support Vector Machine (SVM) classifier to identify the target movements. Liu et al. [16] fused multiple depth features to train a classification model and achieved a significant improvement in classification on the data set used. Some scholars proposed a new skeleton representation that uses rotations and translations in 3D space to explicitly model the 3D geometric relationships between body parts, classified with a combination of dynamic time warping [17], the Fourier Temporal Pyramid [18], and a linear SVM [19]–[27]; experiments on three action data sets showed that this representation outperforms many existing skeleton representations. In general, behavior recognition is a promising and valuable research topic; researchers are still studying it in depth, and new achievements will continue to appear. To better recognize human behavior in small indoor spaces, this paper studies previous research results in depth. In earlier work on behavior recognition based on joint information, the 3D coordinate sequences of skeletal joints are usually used to construct feature vectors directly. Such methods emphasize only the position information of the joints and ignore the limb length and angle information inherent in human physiological structure during activity.
Therefore, this paper comprehensively considers the position information of the human joints together with the implicit limb angle and length information: the joint positions are used to construct human body structure vectors, and human behavior characteristics are then described by the angles between the structure vectors and by the vector modulus ratios. This method mainly solves the following two problems:
• Change of shooting distance: the distance between the actor and the camera directly affects the three-dimensional coordinates of the joints. If the joint coordinate sequence is used directly without processing, the classification effect suffers.
• Individual differences in behavior: when completing the same behavior, differences in age, gender, height, and body shape lead to differences in the proportions and lengths of body parts, which affects the recognition effect.

II. METHOD
A. SKELETON SPACE COORDINATE SYSTEM
In this paper, Kinect is used as the behavior acquisition device; it has three cameras [28]. The middle camera captures color images and transmits them as sequences of static images, while the left and right cameras work together to obtain depth images. Kinect can therefore provide a color image stream, a depth image stream, and a skeleton data stream at the same time. The skeleton stream is delivered as skeleton frames; each frame of the first-generation Kinect used in this paper records 20 joint points, as shown in Table 1.
When collecting skeleton stream data, Kinect maps the spatial coordinate system of the human depth image to the skeleton space coordinate system through coordinate conversion. In the skeleton space coordinate system, the optical axis of Kinect's infrared camera is the z-axis, and the intersection of the z-axis with the image plane is the origin. The axes follow the right-hand rule: with the right palm facing the same direction as the emitted light, the thumb points along the x-axis and the remaining fingers along the y-axis. The Kinect skeleton space coordinate system is shown in Figure 1.

B. BEHAVIORAL FEATURE DESCRIPTION METHOD BASED ON THE VECTOR MODULE RATIO AND VECTOR ANGLE OF HUMAN BODY STRUCTURE
The physiological structure of the human body shows that joints cannot break through their natural physiological constraints during activity; that is, each joint has a physiological limit angle. Previous research on behavior recognition using joint information usually constructs feature vectors directly from the three-dimensional coordinate sequences of skeletal joints. Such methods emphasize only the position information of the joints and ignore the limb lengths and inter-limb angles of human physiological structure during activity. This paper considers the joint position information together with the implicit limb angle and length information: the joint positions are used to construct human body structure vectors, and the angles between the structure vectors together with the vector modulus ratios are used to describe human behavior characteristics. This approach addresses difficulties (1) and (2) mentioned above.
In this paper, Kinect's skeleton data stream is used to obtain the three-dimensional coordinates of the target's joints and to construct human body structure vectors between the joints; the angles between vectors are then calculated using the principles of geometric space vectors. The left hand, right hand, left foot, and right foot are too close to their adjacent joints, so they can be ignored in the application scenario assumed in this paper. In addition, the spine center can be represented by the hip center and shoulder center. Therefore, only the remaining 15 joint positions are used, and the skeleton diagrams below all contain 15 joints.

Taking the joints marked in Figure 2 as an example, the calculation of the angles between structure vectors consists of two steps:
Step 1: compute the human body structure vectors from the three-dimensional joint coordinates.
Step 2: use the cosine theorem to calculate the angles between the structure vectors.

The angles between 16 structure vectors are extracted by (1) to (4), representing the angles between the various parts of the body during motion, as shown in Figure 3. Using only the angles between the body structure vectors cannot fully express the behavior information: strong discriminative information is also implied in the relative position changes of body parts. Therefore, as another part of the behavioral features, we abstract the relative positions between body parts as the ratios of the corresponding structure vector moduli, referred to as vector modulus ratios, as shown in Figure 4. Eq. (5) gives the modulus-ratio features selected in this paper. The angles between the body structure vectors and the vector modulus ratios are combined as behavioral features to describe the behavior information, as shown in Table 2.
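As an illustration, the two steps above and the modulus-ratio feature can be sketched in a few lines of NumPy. This is a minimal sketch: the joint coordinates and the shoulder–elbow–wrist example below are hypothetical, not the paper's actual 16 angle or ratio definitions.

```python
import numpy as np

def structure_vector(joint_a, joint_b):
    """Human body structure vector from joint A to joint B (3-D skeleton coordinates)."""
    return np.asarray(joint_b, dtype=float) - np.asarray(joint_a, dtype=float)

def vector_angle(u, v):
    """Angle (degrees) between two structure vectors, via the cosine theorem."""
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def modulus_ratio(u, v):
    """Ratio of the two structure vectors' moduli (lengths)."""
    return np.linalg.norm(u) / np.linalg.norm(v)

# Hypothetical example joints: shoulder, elbow, wrist of one arm.
shoulder, elbow, wrist = [0.0, 1.4, 2.5], [0.2, 1.1, 2.5], [0.4, 0.9, 2.5]
upper_arm = structure_vector(shoulder, elbow)   # shoulder -> elbow
forearm   = structure_vector(elbow, wrist)      # elbow -> wrist
elbow_angle = vector_angle(upper_arm, forearm)  # one angle feature
ratio = modulus_ratio(upper_arm, forearm)       # one modulus-ratio feature
```

Because both features are built from differences of joint coordinates, a global translation of the skeleton (i.e., a change of shooting position) leaves them unchanged, which is the point of the construction.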
This method does not take absolute position information as a behavioral feature, so the position of the individual relative to the Kinect no longer affects the recognition effect, which solves problem (1) above. In addition, because joint activity has physiological limits and the ratios between bones fall within a bounded range, this method largely avoids individual differences in the description of behavioral features, which solves problem (2) above. In summary, this method can be used for behavior recognition in the assumed application scenarios.

C. KEY FRAME EXTRACTION STRATEGY
The key frame extraction strategy used in this paper is as follows:
Step 1: extract the start frame and the end frame as key frames.
Step 2: the number of key frames extracted from an image sequence is 20. Besides the start and end frames extracted in Step 1, 18 more frames are extracted from the remaining sequence. The specific method is given in (6) and (7), where formula (6) gives the extraction step size s and formula (7) gives the extracted frame numbers:

1 + 1·s, 1 + 2·s, …, 1 + 18·s    (7)

Assuming it takes 1.6 seconds for the target to complete the action of ''drinking'', a total of 48 skeleton frames are captured at Kinect's shooting speed of 30 frames per second. As shown in Table 3, with this key frame extraction strategy, the proposed feature description method reduces the dimension of the feature vector.
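The strategy above can be sketched as follows. Since the step-size formula (6) is not reproduced in the text, the step size used below is one plausible equal-interval choice and should be read as an assumption, not the paper's exact formula.

```python
def extract_keyframes(n_frames, n_key=20):
    """Select n_key key frames: the start frame, the end frame, and the
    frames 1 + 1*s, 1 + 2*s, ..., 1 + 18*s in between (formula (7)).
    Assumption: step size s = (n_frames - 1) // (n_key - 1); the paper's
    exact step formula (6) is not reproduced in the text."""
    s = (n_frames - 1) // (n_key - 1)
    middle = [1 + k * s for k in range(1, n_key - 1)]
    return [1] + middle + [n_frames]

# 1.6 s of "drinking" at 30 fps gives 48 frames, as in the paper's example.
frames = extract_keyframes(48)
```

Whatever the exact step size, the result is a fixed-length sequence of 20 frames per action, which is what makes the feature vector dimension independent of action duration.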

III. EXPERIMENT
A. SETTING OF THE EXPERIMENTAL ENVIRONMENT
The distance and angle between the individual and the camera affect the classification effect. In this paper, the control variable method is used to fix the pitch angle and the mounting height of the Kinect: after testing the shooting effect many times, the Kinect is placed on a one-meter-high platform with its pitch angle set to zero. In addition, to ensure that the depth distance measured by Kinect is consistent with the actual distance of the test subject, the following experiment was conducted: the tester moved backward while facing the Kinect, the distance at which the whole body first appeared in the Kinect image was observed and recorded, and then the measured depth distance and the actual distance were recorded every 0.1 m. It was found that when the target is 2–4 meters from the Kinect camera, the depth distance agrees with the actual distance, which suits the application scenarios assumed in this paper. Based on these tests, the experimental conditions for behavior information collection were determined, as shown in Table 4.

B. THE ESTABLISHMENT OF BEHAVIOR DATABASE
According to the assumed application scenario, experiments are conducted on a self-built data set containing six common indoor behaviors.
The Kinect is used to build the following behavior data set for testing the recognition accuracy of the proposed feature description method: 50 testers are selected for behavior information collection, and each tester completes 6 different behaviors: walking, drinking, bending, squatting, sitting, and falling. Thus each behavior has 50 groups of data, and all behaviors together total 300 groups. The behaviors recorded are those of each tester in a natural state, without any element of performance.
The last frame of the key frame image sequence of each of the six behaviors is shown in Figure 5; from left to right and top to bottom, they are walking, drinking, sitting, bending, squatting, and falling.

C. CLASSIFICATION METHOD OF BEHAVIOR RECOGNITION
1) BP NEURAL NETWORK
We choose a BP neural network [29]–[31] as the classification model and use Matlab to test the effect of the proposed feature description method on the self-built data set described above. Table 5 shows the parameters of the neural network we use.
In Table 5, WetN denotes the number of layers of the BP neural network; InputN the number of input-layer nodes; OutputN the number of output-layer nodes; and HindN the number of hidden-layer nodes. The number of hidden nodes is selected according to (8) and (9), with the test results shown in Table 6; accordingly, the number of hidden-layer nodes is set to 24.
TF denotes the transfer function of the hidden layer; OTF the transfer function of the output layer; MinMSE the minimum mean square error; MAXI the maximum number of training iterations; and TMethod the weight-update algorithm.
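For illustration, the shape of such a network (input layer → 24 sigmoid hidden nodes → 6 output nodes, one per behavior class) can be sketched as a single forward pass in NumPy. The feature dimension and the random weights below are placeholders, not the paper's trained Matlab model or its actual InputN from Table 5.

```python
import numpy as np

def sigmoid(x):
    """Logistic transfer function, analogous to Matlab's logsig."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_n, hidden_n, output_n = 100, 24, 6   # hidden_n = 24 per Table 6; input_n is a placeholder

# Untrained placeholder weights; a real BP network would learn these by backpropagation.
W1 = rng.normal(scale=0.1, size=(input_n, hidden_n))
b1 = np.zeros(hidden_n)
W2 = rng.normal(scale=0.1, size=(hidden_n, output_n))
b2 = np.zeros(output_n)

x = rng.normal(size=input_n)               # one synthetic behavior feature vector
hidden = sigmoid(x @ W1 + b1)              # hidden-layer activations
scores = sigmoid(hidden @ W2 + b2)         # output-layer activations, one per class
predicted_class = int(np.argmax(scores))   # index of the recognized behavior
```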
2) SUPPORT VECTOR MACHINE
The support vector machine (SVM) algorithm was originally designed for binary classification; for a multi-class problem, a multi-class classifier must be constructed. Common ways to build a multi-class SVM are the one-to-many method and the one-to-one method. The one-to-many method treats the samples of one class as positive and all remaining samples as negative, so k classes require k SVMs; the one-to-one method trains one SVM between every pair of classes, so k classes require k(k − 1)/2 SVMs. In this paper, a total of 15 SVMs are built with the one-to-one method.
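The one-to-one construction can be made concrete for the six behaviors in this paper. This small sketch only enumerates the class pairs; the SVMs themselves are not shown.

```python
from itertools import combinations

behaviors = ["walking", "drinking", "bending", "squatting", "sitting", "falling"]

# One-to-one multi-class SVM: one binary classifier per unordered pair of classes.
pairs = list(combinations(behaviors, 2))
n_svms = len(pairs)        # k(k - 1)/2 = 6 * 5 / 2 = 15 binary SVMs
```

At prediction time, each of the 15 binary classifiers votes for one of its two classes, and the class with the most votes is taken as the result.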
In the sample space, 250 samples were randomly selected to train the model and the remaining 50 were used for testing; the average over 10 repetitions was computed. The detection results obtained with different kernel functions in the simulation experiment are shown in Table 7.
According to these results, the Gaussian kernel is finally selected as the kernel function. In summary, the parameter settings of the SVM model are shown in Table 8.

3) ANALYSIS OF THE RECOGNITION EFFECT OF THE TWO CLASSIFICATION ALGORITHMS
As shown above, the optimal average recognition rate of the BP neural network model is 96.2%, and that of the support vector machine model is 92.0%. In other words, under the conditions of this study the best average recognition accuracy of the BP neural network exceeds that of the SVM, so this paper selects the BP neural network as the classification model.

D. ANALYSIS OF EXPERIMENTAL RESULT
This experiment runs on a MacBook Air notebook with a 1.6 GHz Intel Core i5 processor, 8 GB of memory, 2.20 GHz flash memory, and 4.00 GB of running memory. The software environment includes the Windows 7 operating system, Matlab R2014b, and the Kinect for Windows SDK. Kinect V1 is used to collect the behavior information in our paper.
To test actual behavior recognition and classification performance, one feature vector in the behavior database is selected as test data, the experiment is repeated many times, and the results are recorded; the test results are shown in Table 9. According to these experiments, the average recognition accuracy of the system is 96.245%, so the proposed behavior feature description method achieves high accuracy in recognizing indoor daily behaviors. To further describe the classification effect, the confusion matrix of the recognition process is given in Figure 6. The confusion matrix shows that recognition based on this feature description method still confuses some similar behaviors; for example, squatting and sitting are easily confused, possibly because both require lowering the center of gravity and the main moving parts of both are concentrated in the lower body. The angles between structure vectors and the vector modulus ratios are therefore not sufficient to characterize such similar behaviors.

IV. CONCLUSION
The above shows that the behavior feature description method based on the vector modulus ratio and vector angle of human body structure can solve the problems raised at the beginning of this paper, and it achieves high recognition accuracy on the self-built data set. Since the self-built data set contains common indoor behaviors, the proposed method is suitable for monitoring people who live alone.
The proposed behavior description method still has limitations. The vector modulus ratio and vector angle of human body structure are not sufficient to distinguish similar behaviors. In addition, the key frame extraction method in this paper is simple ''equal-interval extraction'', which may discard useful frames. Therefore, as future work we will study better key frame extraction methods to improve the recognition rate for similar behaviors.