Multi-Temporal Sampling Module for Real-Time Human Activity Recognition

Human activity recognition, which identifies human activities from time-series signals collected by sensors, is an important task in human-centered intelligent systems such as healthcare and smart vehicles. In these applications, the system must respond rapidly because critical events such as an elderly person falling or drowsy driving require immediate action. A straightforward way to meet this requirement is to reduce the amount of information the model must handle. To this end, traditional studies have abstracted the original signal by sampling it at a pre-defined interval. However, it is difficult to achieve the best efficiency because the ideal sampling interval is unknown in advance. In this study, we propose a multi-temporal sampling module that allows neural networks to consider multiple sampling intervals simultaneously. Experiments on four benchmark datasets showed that the proposed model achieved the best F1 score compared with seven conventional models under a computation budget of 10M multiply-accumulate operations. In particular, an experiment on the PAMAP2 dataset demonstrated that the proposed model can achieve the best trade-off between efficiency and accuracy when the input signal is oversampled at a high sampling frequency. In addition, the proposed model achieved a ~1,000× improvement in model size compared with the conventional methods.


I. INTRODUCTION
Human Activity Recognition (HAR) aims to identify the daily activities of humans from time-series signals collected by sensors [1]. In most HAR studies, machine learning approaches have reported high performance [2]-[4] but require sophisticated feature engineering that demands considerable expert knowledge and time. For this reason, recent studies have increasingly considered Deep Learning (DL) models because they yield superior learning performance by engineering features automatically [5], [6]. As most DL models have high latency due to their complex architectures, 1D Convolutional Neural Networks (CNNs), which are supported by various hardware accelerators [7], are often considered to achieve real-time HAR [1], [6], [8].
In traditional HAR [9], [10], the original signal is sampled according to a pre-defined interval to improve efficiency, because the signal is typically oversampled [11]. Conventional CNNs for HAR, alternatively, can abstract the signal or feature map with pooling layers [12], [13]. However, it is difficult to achieve the best efficiency because the ideal sampling interval is unknown in advance. A straightforward solution is to allow CNNs to consider multiple sampling intervals simultaneously, yet the stride of pooling is fixed, and pooling is applied only after the vanilla convolutions have already consumed a significant computational cost [14], [15].
To address this issue, we design a novel Multi-Temporal Sampling (MTS) module. Specifically, the module contains a new convolution unit, namely Sparse Sampling Convolution (S2Conv), that takes the random permuted form of a diagonal matrix to abstract the information delivered from the prior layer based on the sampling. Multiple S2Conv units that work with different sampling intervals are integrated into our MTS module to consider multiple sampling intervals simultaneously. By stacking MTS modules, the information to be handled can be reduced significantly as it passes through each layer. Our experiments demonstrate that our MTS-ConvNet achieves ~1,000× and ~15× improvements with respect to model size and Multiply-Accumulate (MAC) operations, respectively, compared to the existing methods without incurring a significant degradation of accuracy.

FIGURE 1: Design goal of a new component to replace a vanilla convolution (VanillaConv) unit. As temporal features pass through components, the amount of information to be handled is exponentially reduced, resulting in a reduction of computational cost and model size. (a): new feature maps are generated through VanillaConv from the original signals; (b): our components efficiently extract higher-level features by abstracting the information delivered from prior layers; (c): based on the final features extracted by the components, an activity on each pattern is predicted. All types of convolutions come with rectified linear units.

II. RELATED WORK
In recent years, many HAR studies have proposed large DL models to solve various challenges derived from real-world applications. Some studies proposed DL models for complex activity recognition, including composite activities [16], [17] and concurrent activities [18], [19]. To address multimodal sensory datasets, other studies integrated an attention mechanism into the DL models [20]-[24]. These models, which include vanilla convolutions and recurrent units, demonstrated their effectiveness, but they are insufficient to ensure a real-time response when the available computation budget is reduced by low-cost devices or background apps [25]. In particular, recurrent units often require infeasible computational costs in edge implementations [7], [26], which makes real-time HAR difficult.
In HAR literature, many comprehensive surveys have emphasized the importance of real-time applications [1], [6], [8], [27], [28]. Most prior studies have focused on sampling the signals and experimentally demonstrating that the ideal sampling interval depends on the type of activity [9], [11], [29]. Furthermore, Cheng et al. attempted to predict an instance-wise sampling interval [10]. Similarly, Yang et al. proposed an instance-wise dynamic sensor selection method [30]. However, these approaches, which focus on data acquisition, may result in irretrievable information loss.
Some early studies proposed DL models for real-time HAR, including pre-processing that transforms the input data from the time domain to the frequency domain [31], [32]. However, this complex pre-processing must run continuously, which increases overhead. Ignatov first attempted to design a CNN architecture for real-time HAR without any transformation of the data [12]. Wan et al. proposed a CNN architecture including three convolutional layers for real-time HAR using accelerometers and gyroscopes [13]. However, these studies have not examined the effects of the sampling interval on HAR performance.

III. MULTI-TEMPORAL SAMPLING MODULE
Our goal is to design a novel component to replace a vanilla convolution (VanillaConv) unit for real-time HAR. The component abstracts the information delivered from the prior layer, which improves efficiency. Figure 1 shows the forward pass of temporal information in the proposed neural network. Given the original signals, a VanillaConv unit generates new temporal feature maps by learning intra-modality information or inter-modality correlations [8]. As the components are stacked layer by layer, the information to be handled is exponentially reduced. Because the ideal sampling interval may also vary as the information is abstracted, the component must consider multiple sampling intervals at each layer. To this end, we design an integrated module as the basic component. Finally, a fully-connected (FC) layer predicts a human activity from the extracted features.

A. VANILLA CONVOLUTION AND NOTATIONS
Let $I \in \mathbb{Z}^{T,M}$ and $O \in \mathbb{Z}^{T,N}$ be the input and output of VanillaConv, respectively, where $T$ is the temporal resolution, and $M$ and $N$ are the numbers of feature maps. Given an input and a kernel $K \in \mathbb{Z}^{D_k,M,N}$ with kernel size $D_k$, the $n$-th feature map of $O$ at time $t$ is calculated as follows [33]:

$$O_{t,n} = \sum_{i=1}^{D_k} \sum_{m=1}^{M} K_{i,m,n} \cdot I_{t+i-1,m} \tag{1}$$

In the field of HAR, the purpose of convolution is to emphasize local temporal patterns closely related to the activities from input feature maps [8]. In this regard, VanillaConv can be divided into feature map-wise convolution and time-wise convolution, similarly to [34]. Given the kernel $K^1 \in \mathbb{Z}^{D_k,M}$ of feature map-wise convolution, the $m$-th feature map at time $t$ is calculated as

$$\hat{O}_{t,m} = \sum_{i=1}^{D_k} K^1_{i,m} \cdot I_{t+i-1,m} \tag{2}$$

Next, for a given kernel $K^2 \in \mathbb{Z}^{M,N}$ of time-wise convolution, the $n$-th feature map is calculated as

$$O_{t,n} = \sum_{m=1}^{M} K^2_{m,n} \cdot \hat{O}_{t,m} \tag{3}$$

where the computational cost is $T \times M \times (D_k + N)$. Based on this decomposition, we integrate two sampling processes into VanillaConv, i.e., time sampling and feature map sampling, as shown in Figure 2.
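To make the saving from the decomposition concrete, the following sketch counts MACs for both forms. The decomposed count T × M × (D_k + N) is the one stated above; the vanilla count T × D_k × M × N is the standard count for a 1D convolution and is an assumption here, as are the function names and the example layer shape.

```python
def vanilla_conv_macs(T, M, N, Dk):
    """MACs of a vanilla 1D convolution (standard count, assumed here):
    each of the T*N output values sums over Dk time steps and M input maps."""
    return T * Dk * M * N


def decomposed_conv_macs(T, M, N, Dk):
    """MACs of the two-step decomposition: T*M*Dk for the feature map-wise
    step plus T*M*N for the time-wise step, i.e. T*M*(Dk + N)."""
    return T * M * (Dk + N)


if __name__ == "__main__":
    # Hypothetical layer shape, chosen only for illustration.
    T, M, N, Dk = 128, 16, 16, 5
    v = vanilla_conv_macs(T, M, N, Dk)
    d = decomposed_conv_macs(T, M, N, Dk)
    print(f"vanilla: {v} MACs, decomposed: {d} MACs, ratio: {v / d:.2f}")
```

For this illustrative shape the decomposition needs roughly a quarter of the MACs, and the gap widens as D_k or N grows.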

B. SPARSE SAMPLING CONVOLUTION
We design an efficient convolution unit that extracts higher-level features at a specific sampling interval within MTS modules. In Eq. (3), the feature map sampling is applied to $\hat{O}$. Let $G_n \in \mathbb{Z}^{T,M_s}$ be the sampled feature maps used to generate the $n$-th output feature map $O_{:,n}$, where $M_s \ll M$ is the number of sampled feature maps. As a result, the size of the kernel $K^2$ is reduced, i.e., $K^2 \in \mathbb{Z}^{M_s,N}$. Meanwhile, the time sampling is applied to $K^1$ in Eq. (2). We consider the kernel size $D_k$ as the sampling interval, and thus the parameters of $K^1$ are sampled with $D_k$. The sampled kernel $\tilde{K}^1$ is given by Eq. (4), where $\theta$ is a trainable parameter and $i_m$ is a feature map-dependent index that has only one value in $\tilde{K}^1_{:,m} \in \mathbb{Z}^{D_k}$. To extract temporal information, $O_{t,n}$ should be computed by interacting with features at adjacent times. To this end, the time sampling can be conducted by spreading $i_m$ of Eq. (4) across adjacent times, as shown in Figure 2(c), while satisfying the condition in Eq. (5). Furthermore, the feature map sampling can be applied directly to $I$ because the order of the input feature maps is unchanged by $\tilde{K}^1$. Additionally, the parameters of $\tilde{K}^1$ can be replaced with one by the associative law of multiplication between them and the parameters of $K^2$ in Eq. (3). As a result, the two kernels can be integrated into $\tilde{K} \in \mathbb{Z}^{M_s,N}$ without additional computation.
Meanwhile, it is difficult to find the optimal $M_s$ with respect to the trade-off between accuracy and efficiency. Therefore, we simply set $M_s$ to $D_k$, which is the minimum number of parameters within $\tilde{K}_{:,n}$ that satisfies Eq. (5). Consequently, we propose a Sparse Sampling Convolution (S2Conv), whose computational cost is $T \times D_k \times N$. It extracts the $n$-th output feature map at time $t$ according to Eq. (6).

[Algorithm listing: only its final step survives, $\tilde{I} \leftarrow \mathrm{ReLU}(\mathrm{BN}(\tilde{I}))$, where BN denotes batch normalization.]
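The following sketch illustrates one possible reading of S2Conv and is not the authors' implementation. It assumes that each output feature map n combines exactly D_k sampled input maps, each read at a distinct temporal offset, followed by a rectified linear unit. The function name `s2conv`, the `groups` and `weights` arguments, and the absence of padding are all assumptions made for illustration.

```python
def s2conv(I, groups, weights):
    """Sparse-sampling convolution sketch (assumed form).

    I       : T x M input as a list of rows (time steps).
    groups  : groups[n] lists the Dk input-map indices sampled for output n.
    weights : weights[n][d] is the single weight applied at temporal offset d.
    Returns a (T - Dk + 1) x N output after ReLU (no padding in this sketch).
    """
    T, N, Dk = len(I), len(groups), len(groups[0])
    out = []
    for t in range(T - Dk + 1):
        row = []
        for n in range(N):
            # One multiply-accumulate per offset: Dk MACs per output value.
            acc = 0.0
            for d in range(Dk):
                acc += weights[n][d] * I[t + d][groups[n][d]]
            row.append(max(acc, 0.0))  # ReLU
        out.append(row)
    return out


if __name__ == "__main__":
    I = [[1, 0, 2], [0, 1, 0], [3, 0, 1], [1, 1, 1]]   # T = 4, M = 3
    groups = [[0, 1, 2], [2, 0, 1]]                    # N = 2, Dk = 3
    weights = [[1, 1, 1], [1, -1, 0.5]]
    print(s2conv(I, groups, weights))
```

Each output value costs D_k multiply-accumulates regardless of M, which matches the T × D_k × N cost stated above once the input is padded to keep T output steps.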

C. PERMUTATION OF FEATURE MAPS
To avoid loss of information resulting from the random sampling of feature maps, the following issues should be considered. First, mapping $M$ input feature maps to $N$ sample groups generates a large search space, with ${}_{M}C_{D_k}$ possible cases per group. Second, the order of the sampled feature maps within $G_n$ should be considered because their dependency can vary across time. To handle these issues, we permute the input feature maps and map them to each sample group in order. Let $P_1$ and $P_2$ be permutation functions for the grouping and for the arrangement of feature maps within each group, respectively. Based on this, Eq. (6) can be rewritten using the composition of $P_1$ and $P_2$, where $\circ$ denotes function composition and $\tilde{K}$ is computed by Eq. (6). Because $P_1$ and $P_2$ can each be computed through a permutation matrix, their composition can be applied by multiplying $I \in \mathbb{Z}^{T,M}$ by a permutation matrix $P \in \mathbb{Z}^{M,M}$. However, $P$ is difficult to optimize by stochastic gradient descent because it takes discrete values. This issue can be resolved by training a permuted time-wise convolution directly [35]. Specifically, the input feature maps to each module are permuted by a time-wise convolution (TWConv). After that, the feature maps can simply be sampled in sequence: given a permuted input $\tilde{I} \in \mathbb{Z}^{T,M}$ with $M = N$, each feature map group $G_n$ is sampled in order from $\tilde{I}$.
In the MTS module, the time complexity is determined by the computations of TWConv and S2Conv. We assume that the elements of $S_k$, the set of kernel sizes used by the S2Conv units, form an arithmetic sequence whose first term and common difference are three and two, respectively. Under this assumption, the time complexity of the MTS module is $O((|S_k| + M|S_k|)NT)$. Suppose that the $D_k$ of VanillaConv is the mean of $S_k$. Computing its time complexity in the same way shows that our MTS module has lower time complexity than VanillaConv.
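The two quantities discussed above, the per-group search space and the kernel-size set, can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: `kernel_sizes` builds the arithmetic sequence 3, 5, 7, ..., and `s2conv_total_macs` assumes that every S2Conv unit produces N output maps, which the text does not confirm. All names and example values are hypothetical.

```python
from math import comb


def kernel_sizes(count):
    """Arithmetic sequence with first term 3 and common difference 2."""
    return [3 + 2 * i for i in range(count)]


def s2conv_total_macs(T, N, sizes):
    """Sum of T * Dk * N over all kernel sizes.
    Assumes each S2Conv unit outputs N feature maps (not stated in the text)."""
    return sum(T * dk * N for dk in sizes)


if __name__ == "__main__":
    sizes = kernel_sizes(3)
    print("kernel sizes:", sizes)
    # Sum of the sequence is count**2 + 2*count; its mean is count + 2.
    print("sum:", sum(sizes), "mean:", sum(sizes) / len(sizes))
    # Ways to pick Dk = 3 of M = 16 input maps for one sample group.
    print("search space per group:", comb(16, 3))
    print("S2Conv MACs:", s2conv_total_macs(128, 16, sizes))
```

The closed form count² + 2·count for the sum of kernel sizes is why the S2Conv cost grows quadratically in the number of units under this assumption.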
Built on the MTS module, we constructed our MTS-ConvNet architecture as shown in Figure 3. For brevity, we represent each type of convolution with a specific kernel size as "convolution (kernel size)". We used VanillaConv (1) to examine the temporal modeling capability of the MTS module and to improve the efficiency of MTS-ConvNet. Because a delayed reduction of resolution may lead to higher classification accuracy [36], the pooling layers were concentrated toward the end of the network. MTS-ConvNet is trained for all datasets in PyTorch [37] with the Adam optimizer [38], using β1 = 0.9, β2 = 0.999, and ϵ = 10^−8. Both the learning rate and the weight decay are set to 0.0005. We train MTS-ConvNet for 500 epochs with a batch size of 128 on a 2080Ti Graphics Processing Unit (GPU).

FIGURE 4: Comparison between our MTS module and VanillaConv with respect to model size.

IV. EXPERIMENTAL RESULTS
We conducted experiments using four benchmark datasets. The UCI-HAR [39] and WISDM [40] datasets include six activities such as "jogging" and "walking". They were collected at 50 and 20 Hz, and their segment lengths are 128 and 60, respectively. The UCI-HAR dataset was collected from an accelerometer and a gyroscope, while the WISDM dataset was collected from an accelerometer only.
VOLUME 4, 2016
The OPPORTUNITY [41] and PAMAP2 [42] datasets contain 18 classes, including complex activities such as "playing soccer". For real-time HAR, we considered only the on-body sensors. These datasets were collected at 30 and 100 Hz, and their segment lengths are 150 and 512, respectively. To evaluate the performance of our MTS-ConvNet for HAR, we used three metrics: the number of Multiply-Accumulate (MAC) operations, model size, and F1 score. The MAC operation is the basic computation of neural networks and takes the form w × x + b; we therefore used MACs to measure the computation of the neural networks. Model size is measured as the number of parameters in a neural network. Lastly, because HAR datasets often involve class imbalance [8], the F1 score is commonly used as an alternative to accuracy. We adopted seven baseline networks, as follows.
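To illustrate why the F1 score is preferred over accuracy under class imbalance, the sketch below computes a macro-averaged F1 score from scratch. The averaging scheme is an assumption, since the text does not say whether macro or weighted averaging was used, and the labels in the example are made up.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up imbalanced example: four "walk" segments, two "jog" segments.
    y_true = ["walk", "walk", "walk", "walk", "jog", "jog"]
    y_pred = ["walk", "walk", "walk", "walk", "walk", "jog"]
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this example the minority class is misclassified half of the time, which the macro F1 score penalizes more strongly than plain accuracy does.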
• Real-time HAR models. Ignatov [12] introduced a CNN for real-time HAR that consists of a single convolutional layer and a single FC layer. In addition, Wan et al. [13] proposed a CNN that consists of three convolutional and two FC layers. Their experiments provide baseline performance for real-time HAR models.
• Time-Series Classification (TSC) models. We adopted two CNNs that have reported significant success for the TSC problem. The 1D Residual Neural Network (ResNet) [43] and Fully Convolutional Network (FCN) [44] include eleven and three convolutional layers, respectively.
• Efficient CNNs. We also used three CNNs: MobileNetV2 [45], SqueezeNet [36], and ResNext [46]. These were designed to produce computationally efficient models for image classification. To conduct experiments on HAR datasets, we replaced 2D operations with 1D operations.
Table 1 shows the overall comparison of our MTS-ConvNet with all the baseline networks. We obtained the average values by repeating each experiment 10 times. Our MTS-ConvNet exhibited comparable F1 scores while attaining the lowest MACs and model size. Specifically, MTS-ConvNet achieved higher F1 scores than existing real-time HAR models on the UCI-HAR, WISDM, and PAMAP2 datasets. Furthermore, MTS-ConvNets trained on the OPPORTUNITY and PAMAP2 datasets had smaller model sizes than real-time HAR models trained on the UCI-HAR and WISDM datasets, which makes real-time HAR on complex activities possible.
Meanwhile, ResNet and FCN, which have the highest MACs, achieved the best F1 scores on the UCI-HAR, WISDM, and OPPORTUNITY datasets. These results indicate that a deeper network and more output channels can produce better F1 scores. However, ResNet and FCN require at least 23× the MACs of MTS-ConvNet and may therefore be insufficient to ensure a real-time response on wearable devices with various hardware specifications. On the other hand, our MTS-ConvNet outperformed ResNet and FCN in terms of the F1 score on PAMAP2, which was collected at 100 Hz, while achieving 49× and 29× improvements in MACs, respectively. This result indicates that the proposed method can achieve the best trade-off between efficiency and accuracy when the input signal is oversampled at a high sampling frequency.

V. ANALYSIS ON MODEL SIZE AND SAMPLING INTERVALS
Our MTS-ConvNet demonstrated substantial efficiency with respect to model size. Figure 4 compares MTS-ConvNet with the model of Wan et al. [13], a real-time HAR model built on VanillaConv, using the UCI-HAR dataset. The two horizontal axes indicate a scale factor that controls the number of output channels and the depth of VanillaConv units or MTS modules in each model, respectively. As shown in Figure 4, the model size of MTS-ConvNet grows at a modest rate as the network becomes wider or deeper.
To demonstrate the benefit of considering multiple sampling intervals, we compared the feature maps extracted with different kernel sizes $D_k$ as the information for two activities passes through our MTS modules. For better visibility, we used the WISDM dataset, which was collected from only an accelerometer. As shown in Figure 5, the activity "Ascending stairs" has higher variations in the input signals, while "Standing" hardly has variations. For "Standing," the temporal features tended to be emphasized by all S2Conv units, regardless […] (7), the accuracy can be degraded. However, the ideal $D_k$ is unknown in advance, as shown in Figure 5. Consequently, our MTS module can achieve a better trade-off between the efficiency gain and the accuracy by using multiple S2Conv units with diverse $D_k$.
Furthermore, we examined the impact of the number of kernel sizes $|D_k|$ and the depth of the network on the F1 score. As shown in Figure 6, an MTS-ConvNet that uses more kernel sizes or is deeper tends to achieve a higher F1 score. Meanwhile, we found that a small number of filters in the first TWConv led to a poor F1 score on the OPPORTUNITY dataset, which was collected in a sensor-rich environment. This may be because 113 input channels are immediately compressed to 16 channels, resulting in a loss of information. Therefore, we changed the number of output channels of the first MTS module to 32 only for the OPPORTUNITY dataset. Table 2 shows the improved F1 scores.

VI. REAL-TIME ACTIVITY RECOGNITION
To estimate the actual response time of our MTS-ConvNet, we used a Samsung Galaxy S10 smartphone with an octa-core processor (2 × 2.7 GHz + 2 × 2.3 GHz + 4 × 1.9 GHz) and 8 GB of RAM. Our implementation did not use graphics processing units because real-time HAR should run continuously as a background app. Specifically, input signals are acquired in real time from the smartphone's built-in accelerometer and gyroscope at 50 Hz. Our MTS-ConvNet, which was pretrained on the UCI-HAR dataset and embedded into the smartphone, then runs activity recognition continuously, starting a new prediction whenever the previous one has finished. A short demo video is available at https://youtu.be/Ie47soUp6Bs. The measured inference time of our MTS-ConvNet was between 20 and 45 ms. Because the recognition interval for the UCI-HAR dataset is 128 samples (2.56 s segments), our model is sufficient to meet the real-time requirement.
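The real-time margin above follows from simple arithmetic, sketched below. The segment lengths and sampling rates are those reported for the four datasets, and the 45 ms figure is the worst measured inference time. Only the UCI-HAR model was measured on the smartphone, so the other rows show segment durations only. The helper names are hypothetical.

```python
# (segment length in samples, sampling rate in Hz), as reported in Section IV.
SEGMENTS = {
    "UCI-HAR": (128, 50),
    "WISDM": (60, 20),
    "OPPORTUNITY": (150, 30),
    "PAMAP2": (512, 100),
}


def segment_seconds(length, rate_hz):
    """Duration covered by one input segment."""
    return length / rate_hz


def latency_share(latency_ms, length, rate_hz):
    """Fraction of a segment's duration spent on one inference."""
    return (latency_ms / 1000.0) / segment_seconds(length, rate_hz)


if __name__ == "__main__":
    for name, (length, rate) in SEGMENTS.items():
        print(f"{name}: {segment_seconds(length, rate):.2f} s per segment")
    share = latency_share(45, *SEGMENTS["UCI-HAR"])
    print(f"UCI-HAR worst case: {share:.1%} of a segment")
```

Even at the worst measured latency, inference occupies less than 2% of the time needed to collect one UCI-HAR segment.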

VII. CONCLUSIONS
In this paper, we present a novel approach to integrating a traditional sampling process into a neural network for real-time human activity recognition. To this end, we introduce a sparse sampling convolution unit that allows neural networks to abstract the information delivered from the prior layer based on the sampling, resulting in improved efficiency. Furthermore, we propose a novel multi-temporal sampling module that contains the proposed convolution units with multiple kernel sizes, resulting in a better trade-off between the efficiency gain and the accuracy. The proposed module enables architectures tailored to various resource-limited sensor devices. Therefore, a promising direction for future work is neural architecture search. For example, the trade-off between accuracy and efficiency can be optimized by directly measuring the latency and memory of sensor devices and reflecting them in the search process.