A Multichannel CNN-GRU Model for Human Activity Recognition

Human activity recognition (HAR) is an important research area in pervasive computing. Within HAR, sensor-based activity recognition refers to acquiring high-level knowledge about human activities from the readings of many low-level sensors. In recent years, deep learning (DL) methods have been widely used for sensor-based HAR with good performance, but they still face challenges in time series problems, such as feature extraction and characterization and the segmentation of continuous actions. In this study, a multichannel fusion model is proposed, based on the idea of dividing feature extraction among several channels. In the proposed architecture, a multichannel convolutional neural network (CNN) is used to enhance the ability to extract features at different scales, and the fused features are then fed into a gated recurrent unit (GRU) for feature labeling and enhanced feature representation through the learning of temporal relationships. Finally, the multichannel CNN-GRU model uses global average pooling (GAP) to connect the feature maps with the final classification. The model was evaluated on three benchmark datasets, WISDM, UCI-HAR, and PAMAP2, achieving accuracies of 96.41%, 96.67%, and 96.25%, respectively. The results show that the proposed model has better activity detection capability than some previously reported results.


I. INTRODUCTION
Human activity recognition (HAR) refers to inferring the current action and predicting the following action from a series of observations and analyses of human behavior and the environment [1]. There are two mainstream techniques for HAR: video-based [2] and sensor-based systems [3]. A video-based system classifies video clips containing various types of human actions [4]. This approach is very intrusive to the life of the target individual and makes it difficult to ensure his/her privacy. Besides, the quality of the video captured by the camera is affected by complex environmental factors, such as lighting, background noise, and occlusion of the target object [5], leading to performance degradation. Moreover, the recognition of video images faces more difficulties and higher costs [6]. Sensor-based HAR extracts features of human activity from the raw sensor data and recognizes the human activity [7]. Sensors have a wider range of application scenarios, such as healthcare, sports, smart homes, and human-computer interaction, due to their stability, non-intrusive nature, and excellent ability to protect privacy [8]. Wearable devices such as smartphones and smartwatches have inertial sensors such as gyroscopes, accelerometers, and magnetometers embedded in them. These increasingly capable computational devices make it possible to collect time series data efficiently and infer details of human activities [9], and they serve as very useful monitoring tools in smart homes.
The associate editor coordinating the review of this manuscript and approving it for publication was Jon Atli Benediktsson.
In recent years, sensor-based HAR has become a popular research area, with researchers first using traditional machine learning (ML) methods for the HAR task. The general process of HAR includes [10]: collecting motion data using sensors, pre-processing the data, action segmentation, feature extraction, and action classification. Fig. 1 shows the whole process of the HAR task. Traditional ML methods, including SVM [11], decision trees [12], Bayes [13], and random forest [14], have shown excellent performance in classifying actions. However, ML has many limitations and relies heavily on manual feature extraction due to its shallow learning process. Manual feature extraction, such as statistical and frequency-domain features, always depends on elaborate feature selection based on human experience and domain knowledge [15]. Besides, hand-crafted features can characterize only simple human activities, not complex ones. As a result, shallow ML algorithms find it difficult to adapt to new, complex HAR scenarios [16]. DL achieves automatic feature extraction through end-to-end neural networks, largely removing the time-consuming and labor-intensive manual extraction of features and simplifying feature engineering. Meanwhile, the features extracted by DL are deep [17], [18]. Currently, DL methods, with higher efficiency and higher classification accuracy, have found wide use in HAR and have become effective methods for it. CNN and recurrent neural network (RNN) are two typical neural networks. CNN evolved from the multilayer perceptron and has features such as weight sharing, local connectivity, and down-sampling [19], which give it excellent performance in the field of computer vision.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
RNN is a DL neural network used to model sequence data [20]. Its connected neurons save information from the previous inputs to abstractly characterize the whole sequence, and it generates a new sequence in the end. RNN handles the variable-length sequences and long-distance dependencies that are intractable for feedforward neural networks (FNN), and it is widely used in fields such as sequence annotation and image annotation. Long short-term memory (LSTM) [21] and GRU [22], two variants of RNN, are used to solve the vanishing and exploding gradient problems of RNNs. Compared with LSTM, GRU has one fewer gate, has fewer parameters, and is easier to train, yet obtains similar results.
DL networks have also achieved good results in other fields. Liu et al. optimized the structure of the GRU network and proposed a new modulation recognition method based on feature extraction and a DL algorithm [23]. Hartpence and Kwasinski used ensembles to defend against data poisoning attacks that attempt to create classification errors [24]. However, HAR using sensor-based DL methods still faces some problems. The first is the extraction and characterization of features and the resulting classification accuracy [25]. Despite the advantage of DL in extracting data features automatically, different network structures differ in their ability to characterize features. Besides, the time series of HAR activities have backward and forward dependencies, which makes the sequences difficult to label. Thus, the performance of feature extraction directly affects the accuracy of classification. The second is the computational cost, i.e., the number of parameters [26]. Lightweight wearable devices, despite improvements in chip computing power, still place strict requirements on the computational cost of the model, requiring it to be relatively lightweight and fast in responding to real-time data in practical applications [27].
For better feature extraction within a limited computational cost, we propose a three-channel CNN structure that extracts features from the input samples. The features at different scales extracted by these three channels are concatenated and fed into a GRU to learn the sequence features. Using GAP instead of a fully connected (FC) layer improves the training speed of the model and yields more accurate classification. Our model achieves higher performance on the benchmark HAR datasets.
The main contributions of the proposed model are:
• A multichannel convolutional neural network is first used for initial feature extraction at different scales, before connection and fusion. Next, a GRU neural network is used for sequence labeling to further extract sample features. Mapping the feature maps to the final classification with GAP makes the transformation smoother and makes the model more robust against overfitting.
• Compared with other fusion models and similar multichannel models, the proposed multichannel CNN-GRU model has fewer parameters and higher accuracy on WISDM, UCI-HAR, and PAMAP2.
The rest of this paper is structured as follows: the second section introduces related work on HAR, especially DL methods; the third section presents the methodology of the proposed multichannel model; the fourth section describes the experiments and the experimental procedure; and the fifth section summarizes the work of this paper and identifies areas for improvement.

II. RELATED WORK
Sensors can easily be built into smartphones, smartwatches, and other wearable products. Owing to its high portability and its accurate and rapid collection of motion data, sensor-based HAR has many application scenarios [28], such as motion classification, fall detection, and human-computer interaction.

A. MACHINE LEARNING METHOD
Researchers have done a great deal of work with traditional ML. Initially, they used traditional ML methods for the action classification task of HAR with some success. Bao and Intille [29] presented the earliest HAR system, which used five wearable dual-axis accelerometers and machine learning classifiers. It could identify 20 categories of activities of daily living with 84% classification accuracy, a fairly good result given the relatively large number of activities. However, traditional ML requires manual feature extraction from the raw data, which is a very large undertaking, and the effectiveness of the extracted features depends on domain knowledge, which makes it difficult to improve the accuracy of action classification.

B. DEEP LEARNING METHOD
In recent years, DL methods have been used in HAR with impressive performance. Zeng et al. developed a CNN-based method that can capture the local dependency and scale invariance of a signal. They also proposed a partial weight sharing approach and applied it to accelerometer signals to obtain further improvements [30]. Yang et al. [31] further used 1-dimensional convolution (Conv1D) in the same time window to unify and share the weights of time series data from multiple sensors.
Ronao and Cho [10] proposed a model consisting of alternating convolutional and pooling layers, in which the extracted features are passed to the FC and Softmax layers to predict human activities. Andrey [32] combined CNN and statistical learning to implement a real-time classification framework. In [25], the authors designed a cell phone sensor-based HAR model using CNNs. Wang and Liu [33] proposed a hierarchical LSTM approach to identify human activities. CNNs have also been used in the HAR task to extract temporal features, with significant performance improvements. Bianchi et al. [34] proposed a CNN model consisting of four convolutional layers and one FC layer for human activity recognition, which achieved good results on a small training set. A hybrid CNN-LSTM model was proposed in [35] for multi-mode wearable sensor devices. In [36], the authors designed an LSTM-RNN architecture for HAR.

C. SIMILAR FUSION MODEL
Recently, there have been some new studies using fusion models for sensor-based HAR. Dua et al. [37] used CNN and GRU in a three-head module to extract features and an FC layer for classification. They achieved satisfactory results on several datasets, but the overly complex heads increase the number of model parameters, so the model fails to meet the lightweight requirement of HAR. Hamza et al. [38] and Ronald et al. [39] utilized the Inception module from the Inception-ResNet model [40] in HAR DL models to perform the HAR classification task; in [38], the authors used Inception modules consisting of 1D convolutional layers together with a DenseNet network. The Inception module essentially extracts features of different dimensions to enlarge the receptive field and to increase the depth of the model. However, the improvement obtained by directly applying the Inception module to the HAR model, or by only modifying its convolutional kernel sizes, is not obvious enough. We extend the idea of Inception by optimizing the network structure and parameters of each channel. Specifically, following the input, the model connects multiple CNN channels with different convolutional kernel sizes; a batch normalization (Batch Norm) layer is added between the two convolutional layers of each channel to speed up network convergence, and a max pooling layer is added at the end of each channel. Finally, the output features at different scales from each channel are joined by a Concatenation layer.
The fused features are fed into the GRU neural network for feature labeling, and using GAP instead of FC layers makes the transformation smoother, gives the model stronger resistance to overfitting, and reduces the number of parameters significantly. An excessive number of parameters limits the application of sensor-based HAR DL models in real-world environments. Although deeper models can express features more richly, the pursuit of complexity can lead to huge system overhead, making them difficult to apply in the real world. Our model achieves better results with fewer parameters and is more adaptable to practical applications.

A. MULTICHANNEL CNN
CNN has been widely used in DL and has given rise to many classical structures, such as FCN and ResNet. These methods play important roles in the classification tasks of HAR. Fig. 2 depicts the process of extracting time series features using Conv1D. The convolution kernel is convolved with a medium-length window of the sample to obtain the corresponding features, and it is then shifted down to convolve with the data that follow. The dimension of the resulting features equals the number of convolution kernels. We apply the structural idea of the Inception module to our model, in which different channels obtain features at different scales. This enlarges the receptive field of the network when performing HAR tasks. Three channels with different convolutional scales are designed, with a Batch Norm layer added between the convolutional layers for normalization; the sample data go through each channel, and the output features are concatenated. Extracting features at multiple scales gives the model a stronger feature representation, which effectively improves classification accuracy.
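As a minimal sketch of the Conv1D operation described above (not the authors' implementation), the following pure-Python function slides each kernel along the time axis of a multi-feature window. The function name `conv1d_valid`, the toy input, and the choice of stride 1 without padding are assumptions made for illustration.

```python
def conv1d_valid(window, kernels, biases):
    """1-D convolution over time with stride 1 and no padding.

    window:  list of T time steps, each a list of C feature values
    kernels: list of F kernels, each a list of K rows of C weights
    biases:  list of F bias values
    Returns a list of (T - K + 1) time steps, each with F features.
    """
    k = len(kernels[0])
    out = []
    for start in range(len(window) - k + 1):
        step = []
        for kernel, bias in zip(kernels, biases):
            acc = bias
            for dt in range(k):
                for c, w in enumerate(kernel[dt]):
                    acc += w * window[start + dt][c]
            step.append(acc)
        out.append(step)
    return out


# Toy window: 6 time steps of 3-axis accelerometer-like values (made up).
window = [[float(t), 1.0, 0.0] for t in range(6)]
# Two kernels of size 3: a moving sum of the x-axis and a difference of x.
kernels = [
    [[1.0, 0.0, 0.0]] * 3,
    [[-1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
]
features = conv1d_valid(window, kernels, biases=[0.0, 0.0])
print(len(features), len(features[0]))  # time steps, features per step
```

With two kernels, every output time step carries two features, which illustrates the statement that the feature dimension equals the number of convolution kernels.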

B. GRU
GRU solves the vanishing and exploding gradient problems of the general RNN. Fig. 3 depicts the principle of GRU, which has the same input and output structure as the RNN. The current input $x_t$ and the hidden state $h_{t-1}$ passed down from the previous node go through the GRU to produce the output $y_t$ and the hidden state $h_t$ passed to the next node. GRU needs only one gate to perform both forgetting and selecting memory (LSTM needs multiple gates for this function); Formula 1 is the update expression of the GRU unit.
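For reference, a commonly used formulation of the GRU update is given below; it is an assumption about what Formula 1 denotes, not the authors' exact notation. The symbols $z_t$ (update gate), $r_t$ (reset gate), $\tilde{h}_t$ (candidate state), the weight matrices $W$ and $U$, and the biases $b$ are introduced here for illustration; $\sigma$ is the logistic sigmoid and $\odot$ the element-wise product.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

In this form the single update gate $z_t$ decides both how much of the old state is forgotten and how much of the candidate state is selected, and the output $y_t$ is taken to be $h_t$. Some references swap the roles of $z_t$ and $1 - z_t$.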
GRU performs better on longer sequence data than RNN. GRU controls the transmitted state through its gates, remembering the critical information that needs to be kept for a long time and forgetting the unimportant information. Compared with LSTM, GRU has fewer parameters. The features at different scales extracted from multiple channels are fused and put into the GRU layer for labeling the time-dependent sequences, enhancing the feature representation. A model consisting of CNN networks only cannot solve the problem of error tolerance: as the amount of wrong or invalid data increases, the recognition rate of the CNN decreases, because such a model fails to filter dirty data from the input samples. In contrast, the GRU network gives the model fault-tolerant capability. The input samples correspond to several feature maps at several consecutive moments; even if a wrong channel occurs in the feature map at a certain moment, the GRU can predict and erase the erroneous channel from the other features, because there is a time dependency among the feature maps.

C. GAP
The FC layer connects the convolutional layer and the normal layer: it takes the data from the previous layer and passes the result to the normal layer through a nonlinear transformation, as shown in Fig. 4 (a). The FC layer is prone to overfitting because it trains a large number of parameters. The GAP layer instead averages the feature data over the height and width dimensions, as shown in Fig. 4 (b), and thus has a more stable performance. There are two advantages of using GAP instead of FC. First, the transformation between the feature maps and the final classification is simpler and more natural with GAP. Second, GAP does not need a large number of trainable parameters as the FC layer does, which reduces the number of parameters and makes the model more stable.
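The parameter saving can be illustrated with a small sketch. It assumes a one-dimensional feature map of T time steps by C channels (as produced by a recurrent layer that returns its full sequence) and a single dense classifier in both cases; the function names and the sizes used are illustrative, not taken from the paper.

```python
def global_average_pooling_1d(feature_map):
    """Average a (T x C) feature map over its time axis, giving C values."""
    steps = len(feature_map)
    channels = len(feature_map[0])
    return [sum(row[c] for row in feature_map) / steps for c in range(channels)]


def fc_head_params(steps, channels, classes):
    """Weights and biases of a dense classifier on the flattened T*C features."""
    return steps * channels * classes + classes


def gap_head_params(channels, classes):
    """Weights and biases of a dense classifier on the C pooled features."""
    return channels * classes + classes


pooled = global_average_pooling_1d([[1.0, 10.0], [3.0, 30.0]])
print(pooled)  # [2.0, 20.0]
# Illustrative sizes: 127 time steps, 64 channels, 6 classes.
print(fc_head_params(127, 64, 6), gap_head_params(64, 6))  # 48774 390
```

The GAP operation itself has no trainable parameters, so the classifier that follows it only needs one weight per channel and class.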

D. PROPOSED MODEL
In this study, a multichannel CNN-GRU model is proposed for HAR. The input samples are fed into three channels with convolutional kernels of different scales; after feature fusion, they pass through two GRU layers and are sent to the final classification layer through GAP and Batch Norm layers. The structure of the model is shown in Fig. 5. The three channels are similar in structure except for the size of the convolutional kernels in the convolutional layers; kernels at different scales obtain features of different scales from the samples and enlarge the receptive field of the neural network.
The samples first pass through the first Conv1D layer of each channel, which accepts input data in three dimensions: the first dimension is the number of samples, the second is the sliding window size of 128, and the third is the original number of features (3 for the WISDM dataset and 9 for the UCI-HAR dataset). The data then pass through an activation layer with the rectified linear unit (ReLU) activation function, followed by a Batch Norm layer, which normalizes the data to a mean of 0 and a standard deviation of 1. This speeds up the training and convergence of the model, controls gradient explosion, prevents the gradient from vanishing, and reduces overfitting. The data then go through a second Conv1D layer and an activation layer that also uses ReLU, and finally a 1-dimensional max pooling (MaxPooling1D) layer with a pool size of 2 and a stride of 1. The number of convolution kernels is 64 in the first Conv1D layer and 128 in the second, for all channels. The structure of the three channels is the same except that the convolution kernel sizes are 3, 5, and 7, respectively. The extracted features are concatenated in the concatenation layer and fed into two GRU layers with 128 and 64 units, respectively. They then go through the GAP layer and a Batch Norm layer for normalization, and finally through a Dense layer with the softmax activation function, which serves as the classifier and produces the normalized output.
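The layer sequence described above can be sketched in Keras as follows. This is a hedged reconstruction from the text, not the authors' code: the function name `build_multichannel_cnn_gru` is hypothetical, `padding="same"` is an assumption made so that the three channels keep the same time length and can be concatenated along the feature axis, and both GRU layers are assumed to return full sequences so that GAP can average over time.

```python
KERNEL_SIZES = (3, 5, 7)  # one convolutional channel per kernel size


def build_multichannel_cnn_gru(window=128, n_features=3, n_classes=6):
    """Build a three-channel CNN-GRU classifier following the text above."""
    # Imported lazily so this module can be loaded without TensorFlow.
    from tensorflow.keras import Model, layers

    inputs = layers.Input(shape=(window, n_features))
    branches = []
    for k in KERNEL_SIZES:
        # padding="same" is an assumption: it keeps the time axis aligned
        # across channels so the outputs can be concatenated on features.
        x = layers.Conv1D(64, k, padding="same")(inputs)
        x = layers.Activation("relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Conv1D(128, k, padding="same")(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling1D(pool_size=2, strides=1)(x)
        branches.append(x)

    x = layers.Concatenate()(branches)
    x = layers.GRU(128, return_sequences=True)(x)
    x = layers.GRU(64, return_sequences=True)(x)  # keep time axis for GAP
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.BatchNormalization()(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return Model(inputs, outputs)
```

Calling `build_multichannel_cnn_gru(n_features=9)` would give the UCI-HAR variant under the same assumptions, and `model.summary()` can be used to inspect the layer shapes and parameter counts.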
The spatial complexity of CNN networks is low, and the number of parameters is related to the feature dimension, the number of convolutional kernels, and so on. Our model uses Conv1D, whose input has two feature dimensions, the sliding window size and the number of features, so its number of parameters is low. The number of GRU parameters is the sum of the parameters of the update gate, the reset gate, and the candidate hidden state, and its size depends on the input dimension and the number of units. In our experiments, the number of parameters of the proposed model is compared with that of the comparison models, as it is an important evaluation criterion for a HAR framework.
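The dependence of the parameter count on kernel size, input dimension, and number of units can be made concrete with the standard counting formulas. The sketch below uses the classic GRU formulation with one bias vector per weight set; the helper names are illustrative, and framework implementations may differ slightly (for example, some keep a second bias vector per gate).

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Trainable parameters of a Conv1D layer: weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def gru_params(input_dim, units):
    """Trainable parameters of a classic GRU layer.

    Three weight sets (update gate, reset gate, candidate state), each with
    an input matrix, a recurrent matrix, and a bias vector.
    """
    return 3 * (input_dim * units + units * units + units)


# First Conv1D of each channel on WISDM (3 input features, 64 filters).
for k in (3, 5, 7):
    print(k, conv1d_params(k, 3, 64))
# A GRU layer with 128 inputs and 64 units, compared with an LSTM of equal
# size, which has four weight sets instead of three.
print(gru_params(128, 64), 4 * (128 * 64 + 64 * 64 + 64))
```

Under these formulas a GRU layer needs three quarters of the parameters of an LSTM layer with the same input and hidden sizes.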

A. DATASETS
To verify the validity of the model, experiments were conducted using the WISDM dataset (single sensor), the UCI-HAR dataset (multi sensors), and the PAMAP2 dataset (multi sensors). The basics of the three datasets are described below.

1) WISDM DATASET
WISDM is a benchmark HAR dataset provided by the Wireless Sensor Data Mining (WISDM) Lab research team. Thirty-six participants, carrying Android smartphones in their front leg pockets, performed specific activities in a controlled environment [41]. A total of 1,098,207 samples (sampled at 20 Hz) were obtained using the three-axis accelerometer embedded in the Android phone. Participants were asked to perform six activities: sitting, standing, walking, walking upstairs, walking downstairs, and jogging. Each sample consists of six attributes: user ID, activity, timestamp, x-acceleration, y-acceleration, and z-acceleration. Some of the data in WISDM are displayed in Fig. 6.

2) UCI-HAR DATASET
UCI-HAR was collected from 30 volunteers between the ages of 19 and 48 wearing a smartphone (Samsung Galaxy SII) on the waist [42]. Each individual performed six activities: three dynamic activities (walking, walking upstairs, and walking downstairs) and three static ones (sitting, standing, and laying). The data were recorded by the developed software. Two three-axial linear acceleration signals and one three-axial angular velocity signal were captured at a constant rate of 50 Hz using the built-in accelerometer and gyroscope of the smartphone. The training and test sets have already been divided and pre-processed in UCI-HAR, so they can be used directly. A brief description of UCI-HAR is shown in Table 2.

3) PAMAP2 DATASET
PAMAP2 was created and made publicly available by Reiss et al. [43]. It records 18 activities performed by 9 subjects wearing three inertial measurement units (IMUs) and a heart rate monitor. These relatively lightweight and small IMUs contain 3-axis MEMS sensors, including two accelerometers, a gyroscope, and a magnetometer, all sampled at 100 Hz. Participants followed a protocol of 12 activities (lie, sit, stand, walk, run, cycle, Nordic walk, iron, vacuum clean, rope jump, ascend and descend stairs) and 6 optional activities (watch TV, computer work, drive car, fold laundry, clean house, play soccer). The 9 volunteers were aged from 24 to 32, and each performed some of these activities. The description of this dataset is presented in Table 3.

B. DATASET PREPROCESSING
The original data need to be pre-processed due to their unbalanced distribution. By normalization, we make the data have a mean of 0 and a standard deviation of 1. To better evaluate the effectiveness of the proposed model, special attention is given to how the dataset is divided. The original data consist of time series of different activities organized by user ID, and the data of the user to be predicted are completely unknown when the model is applied in reality. If the original data are split with a sliding window and the resulting samples are then randomly divided into a training set and a test set according to a certain ratio, some samples of the same user's activity may appear in both sets. Dividing the dataset in this way may improve the measured accuracy of the model, but it does not reflect its true validity.
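A minimal sketch of the normalization step, assuming it is applied independently to each feature column (the text does not state this), is given below. The function name `zscore_columns` and the sample readings are illustrative.

```python
from statistics import fmean, pstdev


def zscore_columns(rows):
    """Standardize each column of a list of rows to mean 0 and std 1."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Made-up accelerometer readings (x, y, z).
raw = [[0.0, 2.0, 9.0], [2.0, 4.0, 9.0], [4.0, 6.0, 9.0]]
scaled = zscore_columns(raw)
print(scaled[0])
```

When the split is made by user, a common design choice is to compute the means and standard deviations on the training users only and reuse them for the test users.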
In reality, the data of the user to be tested are completely unknown when the model is applied. Thus, we divide the training and test sets by user ID to ensure that the samples from the same ID exist in only one of the two sets. The size of the sliding window and the overlap have a great impact on the partitioning of time-series data. A sliding window size of 128 and an overlap rate of 50% are used for all three datasets (WISDM, UCI-HAR, and PAMAP2), chosen according to the sampling frequency and human activity habits.
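The windowing and user-wise split can be sketched as follows. The window size of 128 and the 50% overlap come from the text; the function names, the toy recordings, and the rule of labelling each window with its most frequent activity are assumptions made for illustration.

```python
from collections import Counter


def sliding_windows(samples, labels, size=128, overlap=0.5):
    """Cut one user's recording into fixed-size windows with overlap.

    Each window is labelled with its most frequent activity label
    (an assumption; the text does not state the labelling rule).
    """
    step = int(size * (1 - overlap))
    windows = []
    for start in range(0, len(samples) - size + 1, step):
        segment = samples[start:start + size]
        label = Counter(labels[start:start + size]).most_common(1)[0][0]
        windows.append((segment, label))
    return windows


def split_by_user(recordings, test_users, size=128, overlap=0.5):
    """Window each user's data and keep every user in exactly one set."""
    train, test = [], []
    for user_id, (samples, labels) in recordings.items():
        target = test if user_id in test_users else train
        target.extend(sliding_windows(samples, labels, size, overlap))
    return train, test


# Toy data: two hypothetical users with 320 three-axis samples each.
recordings = {
    1: ([[0.0, 0.0, 0.0]] * 320, ["walking"] * 320),
    2: ([[1.0, 1.0, 1.0]] * 320, ["sitting"] * 320),
}
train, test = split_by_user(recordings, test_users={2})
print(len(train), len(test))
```

Because windows are generated per user before being assigned to a set, overlapping windows from one person can never end up on both sides of the split.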

C. EVALUATION METRICS
Commonly used evaluation metrics for classification models include accuracy, precision, recall, and F1-score. The following metrics are used to evaluate the proposed model.
Accuracy: For a given test dataset, the ratio of the number of samples correctly classified by the classifier to the total number of samples.
F1-score: The harmonic mean of precision and recall, a measure of a model's accuracy on a dataset that is commonly used to evaluate binary classification systems.
Confusion matrix (CM): A square matrix that gives the full performance of the classification model. Rows of the CM represent the true class labels and columns represent the predicted class labels. The diagonal elements of this matrix give the number of samples whose predicted labels equal the true labels.
Parameters: The number of trainable parameters in the model, which measures the spatial complexity of the model.
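These definitions can be illustrated with a short sketch that derives accuracy and F1-score from the confusion matrix. The text does not say how the per-class F1-scores are combined for the multi-class case, so the unweighted (macro) average used here is an assumption, as are the function names and the toy labels.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def accuracy(cm):
    """Correctly classified samples (the diagonal) over all samples."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def macro_f1(cm):
    """Unweighted mean of the per-class F1-scores."""
    scores = []
    for i in range(len(cm)):
        tp = cm[i][i]
        predicted = sum(row[i] for row in cm)
        actual = sum(cm[i])
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(accuracy(cm), macro_f1(cm))
```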

D. RESULTS AND DISCUSSION
In this section, we test the proposed model on three benchmark datasets to evaluate its effectiveness. We carry out four experiments: the first evaluates the performance of the proposed model on the three datasets, the second compares the three-channel model with models that have other numbers of channels, the third compares GRU with LSTM, and the fourth compares the model using GAP with the one using an FC layer. The model is built and trained with the Keras DL framework and TensorFlow-gpu 2.6.0. The labels are transformed into one-hot encoding, the model is trained using the Adam optimizer with a learning rate of 0.001, and categorical cross-entropy serves as the loss function. The batch size is 96 and the number of training steps is 100. All the experiments in this study are performed on a Windows 10 system with an R9-5900HX CPU, 16 GB of memory, and an NVIDIA GeForce RTX3060 GPU.
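To make the label encoding and loss concrete, the following sketch computes the categorical cross-entropy for one sample by hand. It is an illustration of the standard definitions rather than the framework's internal code; the function names, the clipping constant `eps`, and the probabilities are made up.

```python
import math


def one_hot(label, n_classes):
    """Encode an integer class label as a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def categorical_cross_entropy(target, probs, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))


target = one_hot(2, n_classes=6)
probs = [0.05, 0.05, 0.7, 0.1, 0.05, 0.05]  # made-up softmax output
loss = categorical_cross_entropy(target, probs)
print(target, round(loss, 4))
```

With a one-hot target, only the predicted probability of the true class contributes, so the loss reduces to the negative logarithm of that probability.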

1) RESULT ON WISDM DATASET
The samples in the WISDM dataset were divided according to user IDs. The first 30 users (ID: 1-30) were used as the training set and the last 6 users (ID: 31-36) were used as the test set. The training set had a total of 14,035 samples and the test set had a total of 3,121. Fig. 7 shows the confusion matrix obtained from the trained model on the test set. The experimental results show that the model achieved accuracy over 97% for four action categories (walking, jogging, standing, sitting), while upstairs and downstairs scored lower because the two actions are similar. Table 4 shows the evaluation metrics of the proposed model on the WISDM dataset, with the accuracy and F1-score reaching 96.41% and 96.39%, respectively. Our model is compared with existing models in Table 5, which lists the F1-score and accuracy of each model and shows that the proposed model outperforms the other compared methods for HAR.

2) RESULT ON UCI-HAR DATASET
In the UCI-HAR dataset, 7,352 samples are used as the training set and 2,947 samples as the test set. Fig. 8 shows the confusion matrix obtained by evaluating the trained model on the test set. The results show that the model achieved over 95% accuracy for five action categories (walking, walking upstairs, walking downstairs, standing, laying). Table 6 shows the evaluation metrics of the model on the UCI-HAR dataset, with its accuracy reaching 96.67% and F1-score reaching 96.72%.
This model is also compared with existing models. Table 7 compares the F1-score and accuracy of the proposed model with those of other models, showing that this model outperforms the compared methods.

3) RESULT ON PAMAP2 DATASET
In this dataset, 11 protocol activities are chosen for classification. Rope jumping (activity ID 24) is not chosen because it has very little recording time, and some users did not perform this activity at all. The other activities are more balanced across categories. The data of users No. 6 and No. 7 of the nine users were selected as the test set, and we performed linear interpolation of the missing values in the corresponding activities for the selected users. Meanwhile, the first 10 seconds and the last 10 seconds of each activity were deleted to reduce mislabeling. All 52 features were selected, and 19,700 training samples and 6,727 test samples were obtained. Fig. 9 shows the confusion matrix obtained by evaluating the trained model on the test set. The proposed model has a lower recognition rate on sitting and vacuum cleaning, but performs better on the other activities. Both standing and vacuum cleaning are easily misclassified as ironing, due to their similar activity characteristics. Table 8 shows the evaluation metrics of the model on the PAMAP2 dataset, with its accuracy reaching 96.25% and F1-score reaching 96.59%.
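The two cleaning steps mentioned above, linear interpolation of missing values and trimming the start and end of each activity, can be sketched as follows. The 100 Hz rate and the 10-second margin come from the text; the function names, the use of `None` to mark missing values, and the handling of gaps at the edges are assumptions for illustration.

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between known neighbours.

    Leading or trailing gaps are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    filled = list(values)
    for i in range(len(values)):
        if filled[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def trim_edges(samples, rate_hz=100, seconds=10):
    """Drop the first and last `seconds` of one activity recording."""
    cut = rate_hz * seconds
    return samples[cut:len(samples) - cut]


heart_rate = [100.0, None, None, 106.0, None]  # made-up readings
print(interpolate_missing(heart_rate))
print(len(trim_edges(list(range(3000)))))  # 1000 samples remain
```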
In Table 9, the F1-score and accuracy of the proposed model are compared with those of other models, and the results show that this model outperforms the other compared methods.

4) COMPARISON OF MULTICHANNEL CNN-GRU MODEL WITH FUSION MODEL
Dua et al. performed sequence-based convolution. The samples then went through pooling and flattening operations and were input to GRU layers, and a concatenation layer was used to connect the features from the multi-head module. The final classification was obtained by connecting them to an FC layer. Each head contains two GRU layers, which makes the heads heavy and results in a large number of parameters. Thus, it is not reasonable to perform classification directly after fusing the features.
Some researchers have incorporated the Inception module into HAR DL models, using one or more Inception modules in the hope that the model will have the depth and width to extract more comprehensive and effective features. In this study, we increase the depth and width of the model while taking the parameter number into account, optimizing both the structure of the model and the network layers. This design allows the proposed model to perform well. Table 10 compares the F1-score, accuracy, and parameter number of similar multichannel models that fuse the Inception module with those of the proposed model on the WISDM, UCI-HAR, and PAMAP2 datasets.
It is clear that the model in this study is better than the other two similar Inception-based multichannel fusion models in both accuracy and parameter number. Our framework performs better than the multi-input GRU-CNN on UCI-HAR and PAMAP2, and its number of parameters is much smaller.

5) PERFORMANCE COMPARISON OF MODELS WITH DIFFERENT NUMBER OF CHANNELS
A larger number of channels means that more convolutional kernels of different sizes can be involved in feature extraction, so is it true that more channels yield higher classification accuracy? We designed 1-channel, 2-channel, 4-channel, and 5-channel models for comparison. The 1-channel and 2-channel model structures are illustrated in Fig. 10 (a) and Fig. 10 (b), respectively, and the 4-channel and 5-channel model structures in Fig. 11 (a) and Fig. 11 (b), respectively. These models have the same layers and parameters as the proposed model except for the number of channels and the sizes of the convolutional kernels. In the 1-channel model, the kernel size of the two convolutional layers is 5. In the 2-channel model, the kernel size is 3 in the first channel and 5 in the second. The first three channels of the 4-channel model are the same as in the proposed 3-channel CNN-GRU model, and the kernel size of the fourth channel is 11. The first four channels of the 5-channel model have the same settings as the 4-channel model, and the kernel size of the fifth channel is 13. All other settings are the same as in the 3-channel CNN-GRU model.
The confusion matrices of the 1-, 2-, 4-, and 5-channel models are shown in Fig. 12 (a)-(d) for WISDM, Fig. 13 (a)-(d) for UCI-HAR, and Fig. 14 (a)-(d) for PAMAP2, respectively. They were obtained on the test sets of the three datasets for each model separately, and they show that different numbers of channels affect the recognition of different actions differently. On the WISDM dataset, the 1-channel CNN-GRU is prone to identify standing as sitting and upstairs as walking, and the 2-, 4-, and 5-channel models are prone to identify upstairs as downstairs. On the UCI-HAR dataset, all the models have the problem of identifying sitting as standing, but this improves slightly as the number of channels increases. On the PAMAP2 dataset, both standing and vacuum cleaning are easily misclassified as ironing. Table 11 records the accuracy, F1-score, and parameter number of the models with different numbers of channels on WISDM, UCI-HAR, and PAMAP2. The parameter number measures how lightweight a model is, and the table shows that the proposed model is higher in accuracy than the models with other numbers of channels while also having a reasonable number of parameters.

6) COMPARISON OF MODEL USING GRU WITH LSTM LAYER
The GRU layer has a similar effect to the LSTM layer but has fewer parameters, making the model more lightweight. Fig. 15 shows the structure of the multichannel CNN-LSTM model, in which the two layers after feature fusion differ from the proposed model, and the rest of the settings are the same.
The confusion matrices of the multichannel CNN-LSTM model on WISDM, UCI-HAR, and PAMAP2 are shown in Fig. 16 (a), Fig. 16 (b), and Fig. 17 respectively, which were obtained under the same experimental environment and settings. On the WISDM dataset, the upstairs recognition accuracy of the multichannel CNN-LSTM model is higher than that of the multichannel CNN-GRU model, but its downstairs recognition rate is lower; on the UCI-HAR and PAMAP2 datasets, the two models have similar results. Table 12 compares the evaluation metrics of the two models. They are similar in terms of accuracy and F1-score, but the multichannel CNN-GRU has fewer parameters. At comparable accuracy, the more lightweight model is preferable.
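The difference in parameter count follows from the gate structure: an LSTM layer has four weight blocks (input, forget, and output gates plus the cell candidate), whereas a GRU layer has three (update and reset gates plus the candidate state). The sketch below applies the standard counting formulas. The input dimension and the number of hidden units are hypothetical values, not the settings of this study, and some implementations (for example Keras with its default `reset_after=True`) keep two bias vectors per GRU gate, which the optional flag accounts for.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Four weight blocks, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim: int, units: int,
               separate_recurrent_bias: bool = False) -> int:
    """Three weight blocks; optionally two bias vectors per block."""
    bias = 2 * units if separate_recurrent_bias else units
    return 3 * (units * (input_dim + units) + bias)


INPUT_DIM = 64   # hypothetical size of the fused feature vector per time step
UNITS = 128      # hypothetical number of recurrent units

lstm = lstm_params(INPUT_DIM, UNITS)
gru = gru_params(INPUT_DIM, UNITS)

print("LSTM parameters:", lstm)
print("GRU parameters: ", gru)
print("GRU / LSTM ratio:", gru / lstm)
```

With a single bias vector per block, a GRU layer needs three quarters of the parameters of an LSTM layer of the same size, regardless of the dimensions chosen.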

7) COMPARISON OF MODELS USING THE GAP LAYER WITH THE FC LAYER
The model in this study uses a GAP layer instead of an FC layer to connect the feature maps from the GRU layer with the final classification output. For comparison, we built a multichannel CNN-GRU-FC model that has the same settings as the proposed model except for the connection layer. Its structure is shown in Fig. 18.
The trained model predicts the test set of each dataset to obtain the confusion matrices on WISDM, UCI-HAR, and PAMAP2, as shown in Fig. 19 (a), Fig. 19 (b), and Fig. 20 respectively.
Connecting the feature maps with the final classification using a fully connected layer is prone to overfitting. As seen in the confusion matrix on the WISDM dataset, the multichannel CNN-GRU-FC model identifies quite a few of the standing actions as upstairs. Replacing the FC layer with the GAP layer can effectively suppress this phenomenon. Table 13 compares the F1-score, accuracy, and number of parameters of the multichannel CNN-GRU-GAP model with those of the multichannel CNN-GRU-FC model on the WISDM, UCI-HAR, and PAMAP2 datasets. It is clear that the proposed model with the GAP layer outperforms the FC-connected model in all evaluation metrics.
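The parameter saving of GAP can be made concrete with a small sketch. GAP averages each feature map over the time axis and has no trainable weights, whereas a fully connected head applied to the flattened feature maps needs one weight per time step, feature, and class. The sequence length, feature count, and class count below are hypothetical, and the sketch assumes that a softmax layer follows the pooled (or flattened) features; the exact head used in this study may differ.

```python
def global_average_pool(feature_maps):
    """Average a (time_steps x features) nested list over the time axis."""
    time_steps = len(feature_maps)
    return [sum(column) / time_steps for column in zip(*feature_maps)]


def fc_head_params(time_steps: int, features: int, classes: int) -> int:
    """Flatten followed by a dense softmax layer."""
    return time_steps * features * classes + classes


def gap_head_params(features: int, classes: int) -> int:
    """GAP (no weights) followed by a dense softmax layer."""
    return features * classes + classes


TIME_STEPS = 128  # hypothetical window length
FEATURES = 64     # hypothetical number of feature maps from the GRU layer
CLASSES = 6       # illustrative number of activity classes

pooled = global_average_pool([[1.0, 2.0], [3.0, 6.0]])
fc = fc_head_params(TIME_STEPS, FEATURES, CLASSES)
gap = gap_head_params(FEATURES, CLASSES)

print("pooled features:", pooled)
print("FC head parameters: ", fc)
print("GAP head parameters:", gap)
```

Because the GAP head no longer depends on the sequence length, it has far fewer weights to fit, which is consistent with the reduced tendency to overfit reported above.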

V. CONCLUSION
In this study, we proposed a multichannel CNN-GRU model that identifies user activity more accurately from raw sensor data. The multichannel CNN structure extracts features at different scales, the GRU extracts time-dependent features, and the GAP layer reduces the number of parameters. These advantages allow the model to identify human activity categories accurately and quickly. By comparing models with different numbers of channels, we demonstrate that the three-channel CNN-GRU model balances the number of parameters against accuracy. Experiments show that the proposed model performs well on all three datasets and outperforms the other HAR models compared. Meanwhile, we observe the impact of data pre-processing on the classification results. Despite the fault tolerance of the model, noise in the raw sensor data, such as illegal and erroneous values, still affects the classification result. Normalization alone is not sufficient pre-processing for the datasets, and in future work we intend to process the data more effectively. Besides, although we tested the model on three benchmark datasets, the number of training samples is still relatively small, and in the future we will train the model on larger benchmark datasets or on activity data we collect ourselves to verify its generality for sensor-based HAR.