Human Behavior Recognition Based on Multiscale Convolutional Neural Network

The key problem in human behavior recognition is how to build a spatiotemporal feature extraction and classification network. Aiming at the problem that existing channel attention mechanisms directly pool the global average information of each channel and ignore its local spatial information, this paper proposes two improved channel attention modules, namely a space-time (ST) interaction module based on matrix operations and a depthwise separable convolution module, and applies them to human behavior recognition. Combining the superior performance of convolutional neural networks (CNNs) in image and video processing, a multi-scale convolutional neural network method for human behavior recognition is proposed. First, the behavior video is segmented and low-rank learning is performed on each video segment to extract the corresponding low-rank behavior information; these pieces of low-rank behavior information are then concatenated along the time axis to obtain the low-rank behavior information of the whole video, which effectively captures the behavior information in the video while avoiding tedious extraction steps and restrictive assumptions. The ability of a neural network to model human behavior can be transferred and reused in networks with different structures. According to the characteristics of data features at different network levels, two effective feature difference measurement functions are introduced to reduce the difference between features extracted by different network structures. Experiments on several public datasets show that the proposed method achieves a good classification effect and good accuracy in human behavior recognition. The results demonstrate that the proposed model not only improves recognition accuracy but also effectively reduces the computational complexity of the output weights and improves the compactness of the model structure.


I. INTRODUCTION
In the field of computer vision, research on human behavior recognition can not only develop the relevant theoretical basis but also expand its engineering applications. On the theoretical side, behavior recognition integrates knowledge from many disciplines, such as image processing, computer vision, artificial intelligence, human kinematics, and bioscience. Human behavior recognition is an important method for processing video content with computer vision technology and an important research direction [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Khoa Luu.
According to the form of the convolution kernel, behavior recognition methods based on deep learning can be divided into two categories: 2D convolutional networks and 3D convolutional networks. Many researchers have applied deep learning to motion recognition, trying various methods to realize behavior recognition based on computer vision and achieving good results; the specific methods and literature are analyzed in detail in Chapter 1. These behavior recognition methods can be roughly divided into two categories: behavior recognition based on traditional classification methods, and behavior recognition based on deep learning. Combining the advantages of both, the mainstream research direction of current behavior recognition technology is manual feature extraction combined with deep learning [2], [3]. However, because human behavior is inherently complex and easily disturbed by cluttered backgrounds, occlusion, lighting, and other environmental factors, most current feature extraction methods are cumbersome and prone to error propagation, and it is difficult to effectively model relatively slow or static behaviors. In addition, a single-scale convolutional neural network cannot fully describe human behavior characteristics from multiple angles, which is not conducive to the final behavior recognition.
In research in this domain, a large number of efficient network structures have emerged, such as C3R [4], ECO [5], and TSN [6]. Although these network models differ in structure, they all have high modeling ability for video data and can effectively distinguish different human behaviors in natural scenes. Theoretically, the feature description vectors obtained from different network models are sensitive to category information (taking the classification task as an example) and become linearly separable at the output layer of the network. Even if they come from different modeling processes, the feature vectors obtained should be similar. Whether the knowledge acquired by different network structures can be learned and shared is therefore a problem worth discussing. Chen et al. [7] increased the width and depth of the original network, used the decomposition of the original parameters or the unit matrix to initialize the weight parameters, and realized cross-structure transfer learning. Ali et al. [8] used a 2D network to supervise the input and output of a 3D network, made the 3D network fit the output feature distribution of the 2D network, and indirectly realized cross-structure learning. Inspired by this, this paper further relaxes the constraints on the model structure and adopts effective measurement strategies [9], [10] between two networks with greater structural differences to achieve transfer learning in a more general sense, which is called soft transfer.
At present, human behavior recognition methods are mainly divided into two categories: traditional manual feature extraction methods and deep learning methods. Manual feature extraction methods usually include three consecutive steps: feature extraction, local descriptor calculation, and classification [11]. Sullivan et al. [12] matched edge information with the key posture and position of the mark, and then tracked between consecutive frames according to contour information. Oikonomo et al. [13] proposed a detector that calculates the entropy characteristics of the cylindrical neighborhood around a specific space-time position, highlighting motion features to represent different positions in the video. Patrona et al. [14] introduced automatic and dynamic motion data weighting to adjust the importance of human data on the premise of action participation, so as to achieve more effective action detection and recognition.
Compared with the 2D convolution method, the method based on 3D convolutional networks has a simpler and more efficient network structure. In research on smartphone-based HAR, the behavior data collected by smartphone sensors (triaxial acceleration and gyroscope sensors) are difficult to use directly because of noise. Therefore, feature engineering is widely used in various HAR models to extract robust human behavior features from sensor data. In paper [13], a human behavior pattern recognition framework is designed, in which random forest (RF) achieves higher recognition accuracy. Paper [14] combines human behavior data with environment information and proposes a human behavior recognition framework based on environment perception; experiments using decision trees (DT), support vector machines (SVM), and k-nearest neighbors (k-NN) show that a behavior recognition framework based on environment information helps to improve the recognition performance of the model. According to the requirements of different fields, paper [15] proposed a human behavior recognition learning model based on cascade integration. Each layer in the model is composed of extreme gradient boosting trees (EGBT), RF, extremely randomized trees (ERT), and softmax regression. In the first layer, the four models are trained with sensor data to obtain probability vectors representing the different categories of each sample. The initial input data and probability vectors are then concatenated together as the input of the next-level classifier, and the prediction results are finally obtained from the last-level classifier. Experimental results show that, compared with existing recognition methods, this method obtains better recognition accuracy, and the model training process is simpler and more effective. Research on human behavior recognition based on extended convolutional neural network models still requires manual feature labeling, and the computational cost, generalization ability, and feature acquisition ability of such models need to be further improved. Based on the basic module of DenseNet [11], a new network structure, MDN, is designed, and the soft transfer technique is used to learn and inherit the video feature modeling ability of existing networks. Different network models differ in structure; the MDN-I3D pair forms a semi-supervised ''learner-supervisor'' combination.
This paper proposes a human behavior recognition method based on a multi-scale convolutional neural network. Combined with the study of human behavior recognition, two improved channel attention modules are proposed, namely the space-time interaction module based on matrix operations and the depthwise separable convolution module. The ability of a neural network to model human behavior can be transferred and reused in networks with different structures. According to the characteristics of data features at different network levels, two effective feature difference measurement functions are introduced to reduce the differences between features extracted by different network structures. Experiments on several public datasets show that the proposed method achieves a good classification effect and good accuracy.

A. BEHAVIOR RECOGNITION NETWORK UNDER ATTENTION MECHANISM
In a convolutional neural network, each picture is initially represented by three RGB channels. After different convolution operations, each channel generates new information. The features of each channel represent the components of the input on different convolution kernels and how much these components contribute to the key information. Therefore, inspired by the human attention mechanism, adding a channel attention mapping module to the network can effectively model the relationship between channels and thus improve the network's feature extraction ability. Hu et al. [17] proposed a lightweight squeeze-and-excitation (SE) module, whose structure is shown in Figure 1.

1) EXISTING CHANNEL ATTENTION MODULE
The main components of the module are dimension compression, excitation, and weighting. The module first uses a global average pooling operation to turn each 2D feature channel into a real number, then uses fully connected operations and activation functions (ReLU, sigmoid) to obtain a more comprehensive channel-level weight relationship, and finally uses element-wise multiplication to fuse the obtained weights with the original features.
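The squeeze-excitation-weighting pipeline described above can be sketched in NumPy as follows; the channel count, spatial size, and reduction ratio `r` are illustrative assumptions, not values from the paper:

```python
import numpy as np

def se_block(x, w1, w2):
    """SE attention over a feature map x of shape (C, H, W).

    w1 (C//r, C) and w2 (C, C//r) play the role of the two fully
    connected layers; the reduction ratio r is an illustrative choice."""
    # Squeeze: global average pooling turns each 2D channel into one real number.
    z = x.mean(axis=(1, 2))                      # shape (C,)
    # Excitation: FC -> ReLU -> FC -> sigmoid gives per-channel weights in (0, 1).
    s = np.maximum(w1 @ z, 0.0)                  # ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))          # sigmoid, shape (C,)
    # Weighting: element-wise multiplication fuses the weights with the input.
    return x * s[:, None, None]

rng = np.random.default_rng(0)
c, r = 8, 2
x = rng.standard_normal((c, 6, 6))
w1 = rng.standard_normal((c // r, c))
w2 = rng.standard_normal((c, c // r))
y = se_block(x, w1, w2)
```

Because the sigmoid weights lie in (0, 1), the output is a per-channel attenuation of the input, which is exactly the re-weighting behavior the module is meant to provide.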

2) IMPROVED CHANNEL ATTENTION MODULE
The subject of behavior recognition is a person, and for a person the weights of the central position and the boundary positions should be different. In the SE block, the global average pooling operation gives the same weight to each position of the feature map, which to some extent strengthens unimportant information and suppresses important information. In order to give each position of the feature map a learnable weight, this paper considers two improved attention modules: (1) the spatial-temporal (ST) module based on matrix operations, shown in Figure 2 (a); and (2) the depthwise separable (DS) module based on depthwise separable convolution, shown in Figure 2 (b). Like the SE module, the improved attention modules proposed in this paper are plug-and-play, so they can be directly added to an existing base network to form a new recognition network. Taking the DS module and ResNet as an example, Figure 3 shows a schematic diagram of the network module: Figure 3 (a) shows the original ResNet residual block, and Figure 3 (b) shows the network module after adding the DS module.

B. NEURAL NETWORK MODEL FOR HUMAN BEHAVIOR RECOGNITION
In the traditional neural network model, the layers are fully connected while the nodes within each layer are unconnected. In an RNN [18], by contrast, the hidden-layer nodes are connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. Concretely, the network remembers previous information and applies it to the calculation of the current output.
LSTM is a special recurrent neural network that uses a gating mechanism to better model long-term dependencies in data. Three gates [19] are placed in one cell of the model, called the input gate, forget gate, and output gate. GRU is a simplified version of LSTM that retains the long-term memory ability of the LSTM model. The main change in GRU is to replace the input, forget, and output gates of the LSTM cell with an update gate and a reset gate, and to combine the cell state and output vectors into one [20]. BLSTM [21] is an enhanced version of LSTM: the forward layer and backward layer used for training sequence data correspond to two LSTMs, and both LSTMs are connected to one output layer. This structure provides complete context information for each point of the input sequence at the output layer. In practical applications, the BLSTM and LSTM models are highly comparable. See Figure 3 for the feature-creation process of the recurrent neural network model. For labeled datasets, each example is a sequence with a different number of frames, and all frames have the same fixed number of features. X represents a sequence in the dataset, l represents the number of frames in the sequence X, and k represents the number of features per frame, so the size of X is k × l. The model transmits the window frame by frame to the hidden layer to extract features, then uses the output of the hidden layer to calculate the extracted feature window, and finally averages the extracted features.
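The GRU cell described above — an update gate and a reset gate replacing the LSTM's three gates, with cell state and output merged into one vector — can be sketched as follows. The weight shapes, the particular gate-equation convention, and the final state-averaging step follow one common formulation and are assumptions for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h, p):
    """One GRU step. p holds weight matrices (shapes chosen for illustration)."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)               # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)               # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde                   # blended new state

rng = np.random.default_rng(1)
k, d = 4, 3   # k features per frame, d hidden units (illustrative sizes)
p = {n: rng.standard_normal((d, k)) for n in ("Wz", "Wr", "Wh")}
p.update({n: rng.standard_normal((d, d)) for n in ("Uz", "Ur", "Uh")})

# Run a short sequence frame by frame and average the hidden states,
# mirroring the feature-averaging step described in the text.
h = np.zeros(d)
states = []
for t in range(5):
    h = gru_cell(rng.standard_normal(k), h, p)
    states.append(h)
feature = np.mean(states, axis=0)
```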

C. MCNN MODEL
According to the characteristics of human behavior data, the MCNN model is designed to extract behavior features and complete the recognition of human behavior. When people observe things, they usually glance over the unimportant parts, moving their eyes quickly, while the important parts are observed carefully, with slower eye movement. To simulate this phenomenon and obtain motion information under different receptive fields, the model applies convolution kernels of three sizes (1 × 1, 3 × 3, and 5 × 5) to the low-rank behavior information for feature extraction, and under each size four different convolution kernels are used to obtain feature information in four different directions. After the three sizes of convolution kernel operations, the feature map extracted by the 1 × 1 kernel is the largest, that of the 3 × 3 kernel is next, and that of the 5 × 5 kernel is the smallest. The zero-padded feature maps and the feature maps of the 1 × 1 kernels are passed together through 2 × 2 pooling to compress the feature maps, followed by two 3 × 3 convolutions and another 2 × 2 pooling. After each convolution layer, the network model uses the rectified linear unit (ReLU) for activation, and finally connects a fully connected layer and outputs the behavior category through the sigmoid function. The specific network model structure is shown in Figure 4. To simplify the figure, only the operation diagram of the first frame's feature map is given; the other feature maps on the same layer undergo similar operations [22]. The low-rank behavior information is fused with the multi-scale cross-channel convolutional neural network model.
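The parallel multi-scale extraction described above can be sketched as follows. The naive single-channel convolution, random stand-in kernels, and ReLU placement are simplifying assumptions; in the actual model the kernels are learned and four directional kernels are used per scale:

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive 'same'-padded 2D convolution on a single-channel image."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def multiscale_features(img, rng):
    """Apply 1x1, 3x3 and 5x5 kernels in parallel and stack the ReLU
    outputs, mimicking the three kernel sizes described in the text."""
    maps = []
    for k in (1, 3, 5):
        kernel = rng.standard_normal((k, k))   # stand-in for a learned kernel
        maps.append(np.maximum(conv2d_same(img, kernel), 0.0))  # ReLU
    return np.stack(maps)   # shape (3, H, W)

rng = np.random.default_rng(2)
img = rng.standard_normal((8, 8))
feats = multiscale_features(img, rng)
```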
The number of segments n of the behavior video is very important for training the whole model and greatly affects the calculation speed and recognition effect of the network model. In addition, for a given n, the optimal weight parameters of each layer in the multi-scale cross-channel convolutional neural network model also depend on the specific behavior dataset; the size and number of video frames of different datasets are inconsistent [23], and these parameters are obtained in training for each specific dataset. When training the model, the loss function of the whole model is set first. Suppose there are L labeled samples (z_1, y_1), (z_2, y_2), . . . , (z_L, y_L), where each label y_l is a one-of-c label. For sample l, let the output of the network model be o_l, and define its error as E_l = (1/2)||y_l − o_l||^2. The overall loss function of the network model is then defined as E = Σ_{l=1}^{L} E_l. Let W be the vector composed of all network model parameters; the optimal set of parameters W* is defined as W* = arg min_W E(W). In the optimization of the specific parameters, it is still very complicated to use E_l to solve all parameters directly. Therefore, E_l is first differentiated with respect to the weighted input, and the result is then differentiated with respect to the parameters [24]. After defining the loss function of the network model, an alternating training method is proposed to train the whole model: when the number of video segments n is fixed, the MCNN model is trained in the above way; then the network model parameters are fixed, the value of n is changed, the model loss function is minimized, and the best value of n at this time is calculated. By repeating this alternating training many times, the optimal parameter combination of the whole model for a specific dataset is obtained.
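The loss computation described above can be sketched as follows; a standard squared-error form is assumed for the per-sample error E_l, since the paper's exact formula is not reproduced here:

```python
import numpy as np

def one_of_c(label, c):
    """One-of-c (one-hot) target vector y_l for a class index."""
    y = np.zeros(c)
    y[label] = 1.0
    return y

def sample_error(o_l, y_l):
    """Per-sample error E_l = 1/2 * ||y_l - o_l||^2 (assumed form)."""
    return 0.5 * np.sum((y_l - o_l) ** 2)

def total_loss(outputs, labels, c):
    """Overall loss E: the sum of E_l over the L labeled samples."""
    return sum(sample_error(o, one_of_c(y, c)) for o, y in zip(outputs, labels))

# Two toy network outputs over c = 3 classes with their true labels.
outputs = [np.array([0.8, 0.1, 0.1]), np.array([0.2, 0.7, 0.1])]
labels = [0, 1]
loss = total_loss(outputs, labels, 3)   # 0.5*0.06 + 0.5*0.14 = 0.10
```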

III. NETWORK IMPROVEMENT
A. IMPROVED DENSE LINK NETWORK
Figure 5 shows the MDN network structure; the figure gives the scale and stride of each 3D convolution kernel. In order to reflect the structural difference between the MDN and the supervision network, a densely connected design is adopted. However, compared with the original 3D DenseNet, some improvements are made: the number of densely connected layers is reduced, and the 3D convolution kernels of the dense layers are split into (2+1)D, which makes the model lightweight and adapts it to variable-length temporal input. See Table 1 for detailed network parameters.
Random initialization of the MDN network causes an excessively large deviation between its output and the supervision network, which is not conducive to convergence, especially when the structure of the MDN network is very different from that of the supervision network. Therefore, it is not enough to supervise only at the last output layer; a gradual, step-by-step supervision strategy must be adopted to strengthen supervision at the bottom of the network, ensuring that the MDN network can find the optimization direction smoothly. At the final output layer of the network, the MDN is constrained to generate a feature space similar in distribution to that of the supervision network, forming the second stage of supervision, explicitly fitting the feature distribution and realizing the supervisor's ''teaching'' of knowledge to the learner [25]. The first two supervision stages do not use label information and, strictly speaking, belong to unsupervised learning. Finally, supervised training is carried out using the label information of the data, so that the learner not only outputs features similar to the supervisor's, but also features that are effective for classification. This constitutes a three-stage supervision strategy that accelerates the convergence of the network [26].
The closer a convolutional neural network layer is to the bottom of the network, the more general the data characteristics are and the less related they are to a specific task. Here, a stricter loss measurement function can be used to keep the learner and supervisor consistent at least in the shallow features, and to clamp the output value of the network so that it fluctuates within a certain range. Cosine similarity can be used as the loss measurement function of the first stage [27]. Figure 6 shows the supervision points set in the I3D and MDN intermediate layers. The output features of the ''mixed 3C'' layer of the I3D network are grid_s ∈ R^(D×H×W×480), and the output features of the ''transition1'' layer of the MDN network are grid_t ∈ R^(D×H×W×480). The video feature is composed of n = D × H × W regularly arranged local space-time cells g (g ∈ R^(1×480)), and the cosine similarity error over the n cell pairs (g_sj, g_tj) reflects the degree of mismatch between the features of the two networks. Equation (1) describes the calculation of the similarity loss function, where <·,·> denotes the inner product: L_cos = (1/n) Σ_{j=1}^{n} (1 − <g_sj, g_tj>). (1) Before calculation, the channel dimensions of grid_s and grid_t are normalized in advance, so the cosine similarity calculation simplifies to an inner product [28].
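A plausible reading of the similarity loss in Equation (1) — normalize the channel dimension, then average 1 minus the inner product over the n cells — can be sketched as follows; the exact form of the loss is an assumption:

```python
import numpy as np

def cosine_mismatch_loss(grid_s, grid_t):
    """Cosine-similarity mismatch between two feature grids of shape
    (n, 480), where n = D*H*W space-time cells. After L2-normalizing the
    channel dimension, cosine similarity reduces to an inner product,
    so the loss is the mean of (1 - <g_sj, g_tj>) over the n cells."""
    gs = grid_s / np.linalg.norm(grid_s, axis=1, keepdims=True)
    gt = grid_t / np.linalg.norm(grid_t, axis=1, keepdims=True)
    return np.mean(1.0 - np.sum(gs * gt, axis=1))

rng = np.random.default_rng(3)
a = rng.standard_normal((10, 480))
loss_same = cosine_mismatch_loss(a, a)                           # identical grids -> ~0
loss_diff = cosine_mismatch_loss(a, rng.standard_normal((10, 480)))
```

Identical learner and supervisor features give a loss near zero, while unrelated features give a positive loss, which matches the intended role of the first supervision stage.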

B. CONSTRUCTION OF CONVOLUTIONAL NEURAL NETWORK STRUCTURE
Traditional CNNs use single-layer linear convolution in the convolution layer, which is not outstanding at extracting nonlinear features or the implicit abstract features of complex images. The activation function has strong fitting ability and can fit arbitrary feature patterns when the number of neurons is sufficient. Therefore, the combination of a nested Maxout-MLP layer and activation functions is used to improve the fitting ability of the algorithm and the recognition accuracy of the model.
The number of linear regions in a neural network with nested maxout layers increases with the number of maxout layers and with the ReLU activation function [29]. A maxout network easily overfits the training dataset without model regularization, because the maxout network can identify the most valuable input information during training and easily performs feature co-adaptation [30]. The method in this paper is tested on the dataset with different numbers of maxout-layer pieces, as shown in Figure 7, which combines the test results for different numbers of maxout pieces with and without the batch normalization (BN) layer. When the number of maxout pieces is 5, the nested model reaches saturation.
Generally, researchers select the maximum pooling layer for downsampling, which is more representative in feature extraction [31]. Here, average pooling is used in all aggregation layers to aggregate effective features: irrelevant feature information in the input image is suppressed by average pooling, whereas it would be discarded outright by maximum pooling. The average pool is an extension of the global average pool, in which the model attempts to extract information from each local patch to facilitate abstraction into the feature map. The nested structure can obtain abstract, representative information from each part, so that more discriminative information is embedded in the feature map, and spatial average pooling is used in each pooling layer to aggregate local spatial information [32]. On the CIFAR-10 dataset without data augmentation, the comparison of test error rates for maximum and average pooling layers is shown in Table 2.
The convolution layer of the nested multilayer maxout network, i.e., the Maxout-MLP layer, is used for feature extraction based on the nested network structure. The constructed convolutional neural network model uses batch normalization to reduce saturation and prevent overfitting [33]. In addition, in order to increase robustness to spatial transformations of objects, average pooling is applied in all pooling layers to aggregate the extracted basic features.
where (i, j) is the position of a pixel in the feature map; x_{i,j} is the input patch centered at pixel (i, j); k_m indexes the channels of the feature map f_{i,j,k}; and n is the number of MLP layers. From another point of view, the maxout unit is equivalent to a cross-channel maximum pooling layer on the convolution layer: the cross-channel maximum pooling layer selects the maximum output to pass to the next layer. Maxout units help to alleviate vanishing gradients because the gradient can flow through each maximum unit [34].
The feature mapping in the nested Maxout-MLP layer module is calculated as follows: where BN(·) denotes the batch normalization layer; (i, j) is the position of an element in the feature map; x_{i,j} is the input patch centered at pixel (i, j); k_n is the serial number of each channel in the feature map; and n is the number of layers of nested Maxout-MLPs. The batch normalization layer can be applied before the activation function; in this case, the nonlinear units tend to produce activations with a stable distribution, reducing saturation [35].
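The cross-channel maximum-pooling view of the maxout unit can be sketched as follows; the piece count of 5 matches the saturation experiment above, and the spatial size is illustrative:

```python
import numpy as np

def maxout(feature_maps):
    """Maxout as cross-channel max pooling: given k candidate channels
    per unit (shape (k, H, W)), keep the element-wise maximum, so only
    the largest response is passed to the next layer."""
    return feature_maps.max(axis=0)

rng = np.random.default_rng(4)
z = rng.standard_normal((5, 4, 4))   # 5 maxout pieces over a 4x4 map
out = maxout(z)                      # shape (4, 4)
```

Because the output at each position equals one of the candidate channels, the gradient at that position flows unattenuated through the selected channel, which is the property the text credits for mitigating vanishing gradients.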

C. FEATURE EXTRACTION MODULE OF DEEP SEPARABLE CONVOLUTION
Although the ST module can meet the requirement of modeling the relationship between channels, its operation is complex and introduces too many additional parameters. Therefore, the DS module (Fig. 8 (a)) is proposed as a more effective module. The DS module is mainly divided into two parts: dimension compression and excitation weighting. In the dimension compression part, depthwise separable convolution is used; see Figure 8 (b) for the detailed operation [36]. Make the following assumptions: input (c_in × H × W), convolution kernel (k1 × k2), number of convolution kernels c_out, number of groups g. For normal convolution, the number of parameters is c_in × k1 × k2 × c_out. With group convolution, the number of parameters is (1/g) × c_in × k1 × k2 × c_out, a g-fold reduction. When c_in = c_out = g, the group convolution becomes a depthwise convolution. Furthermore, when c_in = c_out = g, k1 = H, and k2 = W, the output feature map size becomes c_out × 1 × 1: the function of global pooling is realized, and each position of the feature map is given a learnable weight. In the excitation weighting part (Figure 8 (c)), two changes are made compared with the SE and ST modules. First, since the mean and variance of each BatchNorm [37] operation are calculated within a batch, if the batch size is too small the calculated mean and variance are not enough to represent the whole data distribution. Therefore, GroupNorm [38] is used instead, which is independent of the batch size and not constrained by it; when good pre-training is available, it may be omitted [39]. Second, considering that the sigmoid function saturates at both ends and easily discards information during propagation, it may also be discarded [40].
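The parameter counts above can be checked with a small helper; the channel sizes are illustrative:

```python
def conv_params(c_in, k1, k2, c_out, groups=1):
    """Parameter count for a (possibly grouped) convolution, ignoring bias.
    Standard conv: c_in * k1 * k2 * c_out; grouped conv divides this by g."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * k1 * k2 * c_out

c_in = c_out = g = 32
normal = conv_params(c_in, 3, 3, c_out)             # 32 * 3*3 * 32 = 9216
grouped = conv_params(c_in, 3, 3, c_out, groups=g)  # depthwise case: 288
```

With c_in = c_out = g, each group sees a single input channel, i.e. the depthwise convolution described in the text, and the parameter count drops by exactly a factor of g.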

IV. EXPERIMENT AND ANALYSIS
A. DATA ACQUISITION AND PROCESSING
A public dataset and a self-built dataset are used to verify the effectiveness of the proposed model on HAR. UCI HAR dataset [31]: this dataset collected behavior data from 30 volunteers aged 19-48. Each volunteer wore a smartphone (Samsung Galaxy S II) at the waist while carrying out 6 activities (walking, upstairs, downstairs, sitting, standing, lying), and behavioral data were collected at a constant rate of 50 Hz using the smartphone's built-in accelerometer and gyroscope sensors. The obtained human behavior data were then preprocessed with a noise filter and segmented with a fixed-width sliding window of 2.56 seconds and 50% overlap, and finally robust features were extracted by feature engineering. The dataset includes 10299 human behavior samples with 561 features, of which 7209 are in the training set and 3090 in the test set. UCI HAR is an unbalanced dataset; the specific data distribution is shown in Figure 9. Self-built dataset: to further explore the performance difference between the proposed model and other advanced models on balanced and unbalanced datasets, this paper uses an iPhone XR to collect human behavior data. The dataset contains 4200 human behavior samples with 102 features, including 2940 training samples and 1260 test samples. The data collection involved 2 subjects, and the specific distribution of the dataset is shown in Figure 10.
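The fixed-width sliding-window segmentation described for the UCI HAR data (2.56 s windows with 50% overlap at 50 Hz, i.e. 128-sample windows with a 64-sample step) can be sketched as follows:

```python
import numpy as np

def sliding_windows(signal, rate_hz=50, window_s=2.56, overlap=0.5):
    """Segment a 1-D sensor stream into fixed-width overlapping windows.
    At 50 Hz, a 2.56 s window is 128 samples, and 50% overlap gives a
    64-sample step between consecutive windows."""
    width = int(round(rate_hz * window_s))        # 128 samples per window
    step = int(round(width * (1.0 - overlap)))    # 64-sample stride
    return np.array([signal[i:i + width]
                     for i in range(0, len(signal) - width + 1, step)])

sig = np.arange(512, dtype=float)   # a toy 1-D sensor stream
wins = sliding_windows(sig)
```

For a 512-sample stream this yields 7 windows of 128 samples, with each window starting 64 samples after the previous one.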
In this paper, the simulation experiments for the different datasets are carried out on a hardware platform with an i7-9700 CPU at 3.00 GHz and 8 GB RAM, using the MATLAB R2019b software platform. The manifold regularization parameter C1 of the MR-SCNS model is determined by 5-fold cross validation; the experimental results show that C1 = 0.001 is best. In order to verify the effectiveness of the proposed method, this paper compares it with SVM, IRVFLNs, SCNs, CNN, LSTM, and other models. Among them, the SVM uses a radial basis function kernel. The LSTM consists of an LSTM layer with 200 hidden nodes, a fully connected layer of size 6, and a softmax layer. The CNN contains three convolution layers, three ReLU layers, three pooling layers, a fully connected layer, and a softmax layer. One training run of the CNN and LSTM algorithms contains 100 batches. The activation function and Lmax of IRVFLNs, SCNs, and MR-SCNS are sigmoid and 500, respectively. The other parameters of SCNs and MR-SCNS are: λ = 1:1:10, ε = 0.05, and Tmax = 20. In addition, to ensure a fair comparison of the various learning models, the CNN and LSTM model settings refer to the human behavior recognition solution on the official MATLAB website [32], with optimal model parameters obtained through cross validation; the structures and parameters of IRVFLNs, SCNs, and MR-SCNS were obtained on the basis of literature [22] combined with cross validation.

B. EXPERIMENTAL ENVIRONMENT AND PARAMETER SETTING
The deep learning framework TensorFlow is adopted, and the integrated development environment is PyCharm. The activity data samples collected in the experiment are divided into two parts, with the training samples accounting for 70% and the testing samples for 30%. First, the training data are input into the model for calculation, and the results are compared with the sample labels to adjust the parameters of the model. Second, the training data are used to train the parameters of the model, and the final parameters are used to build the model for feature extraction. Finally, the softmax function is used to classify the behavior and obtain the behavior label. In a deep learning system, adjusting model parameters is a manual process; in order to achieve the optimal model, this paper adjusts the corresponding model parameters (including the number of neurons in the hidden layer, the number of hidden layers, etc.) and explores a variety of different configurations of these parameters.

1) DETERMINATION OF THE NUMBER OF HIDDEN LAYERS
Theoretically, increasing the number of hidden layers can reduce the network error, obtain more comprehensive information, and improve the accuracy. When using different models for activity recognition, the number of hidden layers needs to be verified experimentally. In this paper, the average accuracy of 10 experiments is taken, and the results are shown in Table 3. It can be seen from Table 3 that the classification accuracy of the model first increases and then decreases as the number of hidden layers grows. This is because increasing the number of hidden layers has the following effects: training more hidden layers requires higher training costs, and higher-level models compress the data through the lower hidden layers, so a large amount of raw data is needed to train a high-level model well. According to the experimental results, in order to give the models high accuracy while maintaining good performance, three hidden layers are used for the RNN algorithm, two hidden layers for the GRU and LSTM algorithms, and one hidden layer for the BLSTM algorithm.

2) DETERMINATION OF THE NUMBER OF HIDDEN LAYER NEURONS
The number of hidden-layer nodes is an important parameter governing the deep neural network's ability to extract data features. Increasing the number of hidden-layer nodes can reduce the systematic error of the network and improve activity-recognition accuracy; with too few nodes, the network may not train at all or may perform very poorly. The network loss for different numbers of hidden-layer nodes in each model is investigated; the results are shown in Figure 11. As Figure 11 shows, the network loss decreases as the number of hidden-layer nodes h increases, but beyond h = 32 the loss of the GRU and RNN models begins to rise. This is because an excessive number of hidden-layer nodes prolongs network training and adds cost to model training. Weighing the conditions of each model, the number of hidden-layer nodes is set to 32.
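The width selection reduces to picking the h with the lowest observed loss. The loss values below are illustrative placeholders mimicking the U-shaped trend in Figure 11, not measured results.

```python
# Hypothetical average losses per hidden-layer width h (illustrative only).
losses = {8: 0.42, 16: 0.31, 32: 0.24, 64: 0.27}

# Choose the width with the lowest loss; scanning in sorted order means a
# tie would favor the smaller (cheaper) network.
best_h = min(sorted(losses), key=losses.get)
```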

3) DETERMINATION OF MODEL LEARNING RATE
The learning rate controls the magnitude of parameter updates. Too large a step may cause the parameters to oscillate back and forth around the optimum; too small a step guarantees convergence but greatly slows optimization. With the parameters above held constant, Figure 12 shows the network loss of each model under different learning rates. As Figure 12 shows, the loss of each model stabilizes as the learning rate increases. Based on the behavior of each model, a learning rate of η = 0.0025 is selected to train the models.
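The trade-off described above can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x² with three step sizes; the function and values are assumptions chosen only to make the oscillation/slow-convergence behavior visible, not the paper's models.

```python
def gd_final_loss(lr, steps=200):
    """Final loss after gradient descent on f(x) = x^2 from x = 5."""
    x = 5.0
    for _ in range(steps):
        x -= lr * 2 * x  # gradient of x^2 is 2x
    return x * x

# Too small converges slowly; well-chosen converges; too large oscillates
# with growing amplitude and diverges.
small = gd_final_loss(1e-3)
good = gd_final_loss(0.1)
large = gd_final_loss(1.1)
```

For lr > 1 on this problem the update multiplies x by a factor of magnitude greater than one each step, which is exactly the back-and-forth divergence the text warns about.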

C. COMPARISON OF COMPUTATIONAL COMPLEXITY BEFORE AND AFTER INTRODUCING QR DECOMPOSITION
To clearly illustrate how QR decomposition addresses the increased computational complexity of SCNs, and the resulting load on smartphone devices, caused by introducing manifold regularization, this paper compares the modeling time of MR-SCNs with and without QR decomposition, as shown in Figure 13. Under QR decomposition, when the number of hidden-layer nodes is below 5, the average modeling times of the two are almost identical; from 5 nodes onward, the gap between their average modeling times grows gradually, reaching a maximum difference of 2.78 s at 12 hidden-layer nodes; notably, once the number of hidden-layer nodes reaches 17, the average modeling times maintain a similar difference. This is because the model structure is not yet stable during the early adjustment phase of HAR modeling, so the gap first increases and then levels off. In addition, this paper also measures the overall modeling time of the MR-SCNs model before and after introducing QR decomposition on the UCI HAR dataset.
The experimental results are shown in Table 4. Before QR decomposition is introduced to solve the output weights, the modeling time ranges from 342.81 s to 348.25 s, a gap of nearly 6 s. After introducing QR decomposition, the maximum and minimum modeling times are 322.14 s and 318.57 s respectively, narrowing the gap to 3.57 s. The main reason is that the computational complexity of the original solution method grows gradually as the number of hidden-layer nodes increases (the hidden-layer output matrix becomes larger), whereas the computational complexity of QR decomposition is much lower than that of the original method, so the corresponding gap also shrinks.
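The two output-weight solution routes compared above can be sketched in NumPy. The matrix sizes and variable names below are illustrative assumptions; the point is that factoring the hidden-layer output matrix H once and back-solving a small triangular system replaces the costlier pseudoinverse, while giving the same least-squares weights when H has full column rank.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical hidden-layer output matrix H (500 samples, 32 nodes) and
# target matrix T (6 activity classes); sizes are illustrative.
H = rng.normal(size=(500, 32))
T = rng.normal(size=(500, 6))

# Original route: output weights beta via the Moore-Penrose pseudoinverse.
beta_pinv = np.linalg.pinv(H) @ T

# QR route: factor H = QR once, then solve the small 32x32 triangular
# system R @ beta = Q.T @ T, which is much cheaper for large H.
Q, R = np.linalg.qr(H)  # reduced QR: Q is (500, 32), R is (32, 32)
beta_qr = np.linalg.solve(R, Q.T @ T)
```

Both routes return the same beta here; only the cost of obtaining it differs, which is the source of the modeling-time gap reported in Table 4.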
By constructing behavior pictures, 10,900 experimental samples were generated; in each experiment, 70% of the samples were randomly selected as the training set and 30% as the test set. After dividing the training and test sets in this proportion, a single-sensor dataset (behavior pictures composed of ax, ay, and az) and a two-sensor dataset (behavior pictures composed of AAS, GR, and pitch) are used as the input of the F-DCNN for comparative experiments. Figure 14 shows that, compared with a single sensor, the training-accuracy and test-accuracy curves of the multi-signal fusion method fit each other more closely, reaching their highest values after 35 iterations and then stabilizing, showing better generalization ability and robustness.
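The single-sensor behavior-picture construction above amounts to stacking the channel signals of one window into a 2-D array. The sketch below uses random stand-in signals and an assumed window length; it only fixes the shape handed to the CNN, not the paper's actual preprocessing.

```python
import numpy as np

rng = np.random.default_rng(2)
window = 128  # illustrative window length, not the paper's value

# Hypothetical triaxial accelerometer window (ax, ay, az), as in the
# single-sensor behavior pictures described above.
ax, ay, az = rng.normal(size=(3, window))

# Stack the three channels into a 2-D "behavior picture" for the CNN input.
picture = np.stack([ax, ay, az], axis=0)
```

The two-sensor variant would stack the fused channels (e.g. AAS, GR, pitch) the same way, so both inputs share one F-DCNN input shape.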

V. CONCLUSION
In this paper, a human behavior recognition method based on an improved attention mechanism is proposed. By analyzing the shortcomings of the existing channel attention mechanism, an improved attention module is proposed, and its effectiveness is verified through experiments covering visualization results, network accuracy gains, the additional network parameters introduced, and so on. Multi-scale convolution kernels are used to obtain behavior features under different receptive fields, and the convolutional, pooling, and fully connected layers are carefully designed to further refine the features, verifying that cross-structure learning is feasible. The necessity of the multi-stage progressive supervision strategy is verified by comparing supervision at different stages, and the influence of model structure on the effect of soft transfer is discussed: the network converges more easily when the structure of the supervising network is similar to that of the learning network. In future work, more sensors can be used to increase the data dimensionality and further improve recognition accuracy. The model modules of our method contain many parameters, and future work will focus on making the model more lightweight.