Efficient Human Activity Recognition Using Lookup Table-Based Neural Architecture Search for Mobile Devices

Mobile devices play a crucial role in human activity recognition as they enable real-time sensing of user interaction for learning algorithms like neural networks. To facilitate human activity recognition on mobile devices, it is important to deploy efficient neural network architectures due to the limited computational capacity of these devices. However, conventional neural architecture search methods often generate less effective architectures because they neglect the specific requirements of target devices on which the neural network would operate in real-time. Moreover, these methods are impractical in the mobile device environment due to their high computational cost for architecture search. To address these challenges, we propose an efficient neural architecture search method based on a latency lookup table. Our proposed method efficiently performs the network search process based on differentiable NAS while considering the actual latency of mobile devices, which is stored in a lookup table. The experimental results on public datasets provide evidence that the proposed method outperforms conventional methods in terms of speed. We achieved a search time of under 1.5 hours on each dataset, which is more than seven times faster on average compared to conventional methods. Furthermore, our in-depth analysis shows that the optimal architecture can vary depending on the target mobile devices, such as Galaxy A31 and S10. By tailoring the models to each device, optimized models achieved an additional 4-5% improvement in inference time for each respective device.


I. INTRODUCTION
Human activity recognition (HAR) using wearable and mobile devices has attracted considerable research attention for applications in fields such as healthcare [1], surveillance [2], and smart home [3]. Many studies have focused on offline mobile device-based activity monitoring to address challenges such as privacy, communication cost, latency, and network traffic to the cloud [4]. However, mobile devices have limited resources and diverse hardware specifications, making the design of HAR-specific models crucial for effective activity recognition on these devices [5]. (The associate editor coordinating the review of this manuscript and approving it for publication was Chan Hwang See.)
Deep learning algorithms have shown excellent performance in most HAR studies, particularly the DeepConvLSTM approach [6], which has achieved state-of-the-art results. This approach combines the strengths of Convolutional Neural Networks (CNNs) [7] and Recurrent Neural Networks (RNNs) [8], creating a hybrid network architecture. In particular, designing the CNN architecture to extract valuable features effectively plays a crucial role in performance [9]. Only a few studies have attempted to automate the CNN design process using neural architecture search (NAS) methods for human activity recognition [9], [10], [11], [12]. These studies are inspired by reinforcement learning and evolutionary algorithms, which are commonly employed in computer vision tasks [13], [14]. However, existing HAR NAS methods often produce less effective architectures because they do not consider the computational capacity of the target devices, and they require computationally expensive architecture searches.
To address these challenges, we propose a mobile HAR NAS approach based on differentiable NAS (DNAS), which incorporates the latency of real mobile devices during the architecture search process. The DNAS-based approach [15] addresses the computational expense associated with exploring discrete search spaces in traditional NAS approaches by relaxing the search space into a continuous domain. This relaxation enables efficient optimization using gradient-based methods, allowing faster convergence and broader exploration of architectural configurations. Because only a single training process is required, the architecture search time of the proposed method is significantly reduced. Furthermore, the effectiveness of the mobile device optimization method was verified by deploying the searched models on real smartphones. This paper focuses on addressing the limitations of conventional NAS methods in real-life mobile HAR systems. The model's latency is important for capturing changes in human behavior using mobile devices, and each device has unique hardware specifications, requiring optimal architectures based on device characteristics. Conventional NAS methods suffer from high architecture search time and rely on indirect metrics instead of target device-specific optimization. To overcome these limitations, the proposed approach incorporates the latency measured on the target device into the objective function and enables fast search using a DNAS-based approach. The goal is to find an efficient HAR model that adapts quickly to different edge devices. By doing so, this approach offers an efficient and device-specific solution for mobile HAR NAS. The main contributions of this paper are summarized as follows:
1) We propose a mobile HAR neural architecture search based on differentiable NAS (DNAS) that reflects the latency of real mobile devices in the architecture search process.
2) By using a single training process, the architecture search time of the proposed method is significantly reduced compared to previous methods.
3) The effectiveness of the proposed method is verified by deploying searched models on real smartphones.

II. RELATED WORK
A typical choice for handling the time-series characteristics of HAR is an RNN, which can address the long-range time dependency of the given data [16]. In RNNs, long short-term memory (LSTM) and gated recurrent unit (GRU) structures are used to capture the long-range dependency of sequence data. However, recent studies in HAR have focused on CNNs because they are capable of capturing human behaviors and hierarchically extracting features, from low-level to high-level, through multiple convolutional layers. In DeepConvLSTM [6], CNN and LSTM structures were integrated to achieve state-of-the-art performance on various public HAR datasets.
In conventional studies, deep learning architectures have been designed manually, which is difficult and time-consuming due to a series of factors involved in architecture design. These factors include operations, the number of layers, the connection between layers, the target device, and the recognition of activities. For example, the neural network architecture may become dense or loose depending on the complexity of the activity and the target device. Although designing a neural network architecture for various image recognition tasks has been automated using neural architecture search (NAS) [13], [14], [15], few studies have considered NAS for human activity recognition. To the best of our knowledge, there are two types of NAS methods, namely reinforcement learning (RL)-based NAS [11] and evolutionary algorithm (EA)-based NAS [12].
In both methods, HAR neural architectures are searched to achieve a high F1-Score with low floating point operations (FLOPs) or memory access cost (MAC). In RL-based NAS, a meta-controller is trained to generate an optimal architecture by receiving reward feedback, such as the F1-Score and FLOPs on a HAR dataset, for each generated architecture. In EA-based NAS, NSGA-II [17], a multi-objective optimization method, is used to design a lightweight and fast neural architecture. The generated model is a neural architecture that achieves Pareto optimality for HAR tasks with respect to three objective metrics: F1-Score, FLOPs, and MAC.

III. MATERIALS AND METHOD
In this section, we explain the entire algorithm of the proposed method, from search space to search strategy and the way to reflect the latency of the target devices.

A. MOTIVATION
In HAR systems using a mobile device, inference latency is important because quickly capturing changes in human behavior is critical in fields such as fall detection for elderly people or rehabilitation exercise recognition for patients [18], [19]. In particular, each mobile device exhibits distinct hardware specifications, so the optimal architecture can differ depending on the latency characteristics of the device. Conventional NAS methods exhibit two disadvantages when applied to real-life mobile HAR NAS. First, indirect metrics, such as FLOPs or MAC, are used instead of a metric directly related to the target device. Therefore, target device-specific models may not be optimized, because conventional methods exploit these indirect metrics during the NAS process rather than optimizing the actual latency measured on the target devices. Second, previous HAR NAS studies require repeated architecture sampling, training, and evaluation, considerably increasing the architecture search time. For example, existing HAR NAS methods spend considerable time searching for only one convolutional block architecture in the ConvLSTM baseline network. In our empirical tests, RL-based NAS [11] required approximately nine GPU hours on average for each of the well-known public HAR datasets, and EA-based NAS [12] required approximately seven GPU hours. The high architecture search time of these methods can be attributed to the retraining of each sampled architecture, which requires the sampling, training, and evaluation procedures to be repeated.

B. APPROACH
To address the limitations of conventional methods, we constructed an overparameterized network with a directed acyclic graph (DAG) structure that encompasses all possible architectures. We then conducted a DNAS-based architecture search, as depicted in the architecture search module of FIGURE 1. Utilizing the overparameterized network enables a quick search for the optimal architecture, as the network only needs to be trained once.
To directly optimize the latency for each mobile device, instead of relying on FLOPs as an indirect measure of latency, the latency used for optimization is measured in advance on the target mobile device, as shown in FIGURE 1, and stored in a latency lookup table (LUT). To generate the desired network, we utilized a loss function that combines a latency term as an objective metric with a cross-entropy term for learning accuracy. FIGURE 1 provides an overview of our framework, which consists of two main modules. The first is the architecture search module responsible for the search process, while the second is the Device LUT module designed to incorporate latency information specific to the target device. The architecture search module includes a feature extractor network that takes behavioral data as input and outputs features useful for prediction; its primary role is finding the optimal network architecture for feature extraction. The Device LUT module measures the latency of each operation defined in the search space on the target device and stores the latency information in a lookup table. During the search process, the Device LUT module provides the stored latency information as needed.
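The device-side half of this pipeline can be sketched in Python/PyTorch as follows. This is a minimal, hypothetical sketch of how per-operation latencies could be timed and collected into a LUT; the operation names, channel counts, and window length are illustrative and are not the exact entries of Table 1.

```python
import time

import torch
import torch.nn as nn

# Hypothetical candidate operations for 1-D HAR sensor windows; the names,
# channel counts, and window length are illustrative, not the paper's Table 1.
def candidate_ops(channels: int) -> dict:
    return {
        "conv_3": nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        "dil_conv_3": nn.Conv1d(channels, channels, kernel_size=3,
                                padding=2, dilation=2),
        "max_pool_3": nn.MaxPool1d(kernel_size=3, stride=1, padding=1),
        "skip": nn.Identity(),
    }

def build_latency_lut(channels: int = 64, window: int = 128,
                      runs: int = 100) -> dict:
    """Time each candidate operation on this device; return {name: ms}."""
    x = torch.randn(1, channels, window)
    lut = {}
    with torch.no_grad():
        for name, op in candidate_ops(channels).items():
            op.eval()
            for _ in range(10):              # warm-up before timing
                op(x)
            start = time.perf_counter()
            for _ in range(runs):
                op(x)
            lut[name] = (time.perf_counter() - start) / runs * 1000.0
    return lut
```

In practice, the table would be measured once per target device with the on-device runtime and serialized, so the search process only needs to load the table for the chosen target.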

C. SEARCH SPACE
To ensure robust analysis of one-dimensional (1-D) time-series sensor data on human behavior, it is essential to design a search space with operations tailored for time-series data. In this regard, 1-D convolution is a suitable operation for extracting features from time-series data, as it applies the kernel in a single direction. Additionally, 1-D convolution can be categorized into two types of operations, normal and dilated, based on the dilation size. Despite having the same kernel size, the dilated convolution's kernel covers a wider receptive area than the normal convolution's kernel. As a result, the dilated convolution can extract more temporal features while requiring less computation [20]. For example, in FIGURE 2, the normal convolution covers a kernel area of 3 × 1, whereas the dilated convolution expands the kernel by a dilation size of 2, resulting in a coverage area of 5 × 1. Considering this, our search space encompasses both normal convolutions with kernel sizes of {1, 3, 5, 7, 9} and dilated convolutions with kernel sizes of {3, 5}. Additionally, we incorporate average and max pooling operations for extracting higher-level features, a skip connection operation to transmit information from previous layers to subsequent layers, and a no-connection operation to disconnect nodes. Therefore, as summarized in Table 1, the entire search space is primarily divided into six types of operations based on the ''Operation'' column, with each operation further categorized according to the kernel size. In the case of the convolution operation, the operation type is additionally subdivided based on the dilation size, resulting in a total search space of 14.

An overparameterized network is a DAG structure composed of nodes and edges. Each edge is calculated through a weighted sum that applies the operations in the search space to the data. For instance, when the probability of each operation on the edge from node1 to node3 is provided, such as 0.61 for Conv_1 and 0.10 for MaxPool_5, the output of each operation is multiplied by its corresponding probability value. These multiplied outputs are then summed with the outputs of the other operations, enabling each operation to contribute in proportion to its assigned probability. Once the parameter update is complete, each node, such as node3, ultimately selects the two operations with the highest probability among all of its incoming edges.
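The receptive-field arithmetic behind the FIGURE 2 example can be checked with a one-line helper (a sketch; the function name is ours, and the formula is the standard effective-kernel-size relation for dilated convolutions):

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Temporal span covered by a 1-D (dilated) convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# A normal convolution with kernel size 3 covers a 3 x 1 area,
# while a dilation size of 2 expands the same kernel to a 5 x 1 area.
print(effective_kernel_size(3, 1))  # 3
print(effective_kernel_size(3, 2))  # 5
```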

D. SEARCH STRATEGY
The high cost of architecture search in previous HAR studies has been a bottleneck for modeling, as it involves repeatedly sampling and training models during the search process. To address this issue, we utilized the DNAS method [15] to design a mobile HAR neural network architecture. We denote an architecture space S, in which we find an optimal architecture arch ∈ S after training its weights weight. We formulated the neural network architecture search problem as follows:

  min_{arch ∈ S} L(arch, weight),  (1)

where L is a loss function to be defined. To employ the differentiable architecture search method, the discrete search space consisting of individual operations needs to be transformed into a continuous space. To achieve this,
we initially conducted a process called continuous relaxation. During this process, each operation is associated with an architecture parameter, enabling a smooth transition within the search space. The set of candidate operations is expressed as

  O = {o_1, o_2, . . . , o_M},

where M is the number of candidate operations (M = 13, as listed in Table 1). The architecture parameters corresponding to the candidate operations for continuous relaxation of this search space are A = {α_1, α_2, . . . , α_M}. All operations were connected by applying the softmax function to the architecture parameter α corresponding to each operation. As shown in the example of FIGURE 3, each edge is composed of a weighted sum of the operations in the search space of Table 1. The connection of these edges forms a DAG structure, creating an overparameterized network. Therefore, on each edge, a mixed operation ō is defined as follows:

  ō(x) = Σ_{i=1}^{M} Prob(i) · o_i(x),  (2)

where Prob(i) denotes the probability of selecting the ith operation, and o_i(x) represents the output of the ith operation for the given input x. The probability Prob(i) is defined using the softmax function,

  Prob(i) = exp(α_i) / Σ_{j=1}^{M} exp(α_j),

where α_i represents the architecture parameter associated with the ith operation. The architecture search process using the overparameterized network is summarized in Algorithm 1. All edges are searched simultaneously by training the architecture parameters α of the overparameterized network. Finally, the optimal architecture is generated by selecting the operation with the highest probability on each edge. Our search space has 13 candidate operations, configured as presented in Table 1, on each edge of the DAG. The overall search space has 13 candidate operations for every one of the 14 edges from the two stem nodes to the four nodes of the DAG; therefore, the number of possible architectures is 13^14 ≈ 4 × 10^15. The architecture search time is considerably reduced because these vast possible architectures are relaxed into one overparameterized network.
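The mixed operation on a single edge can be sketched in PyTorch as follows. This is an illustrative reimplementation of the idea, not the authors' code; the class and method names, channel sizes, and candidate operations are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge of the DAG: a softmax-weighted sum of candidate operations."""

    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        # one architecture parameter alpha_i per candidate operation
        self.alpha = nn.Parameter(torch.zeros(len(ops)))

    def forward(self, x):
        probs = F.softmax(self.alpha, dim=0)  # Prob(i) from the alphas
        return sum(p * op(x) for p, op in zip(probs, self.ops))

    def discretize(self):
        """After the search, keep only the highest-probability operation."""
        return self.ops[int(self.alpha.argmax())]

# Example edge with two candidate operations (illustrative channel sizes).
edge = MixedOp([nn.Conv1d(8, 8, kernel_size=3, padding=1), nn.Identity()])
out = edge(torch.randn(2, 8, 32))  # same shape as the input: (2, 8, 32)
```

Because every candidate contributes in proportion to its softmax weight, gradients flow to all architecture parameters at once, which is what allows the entire search space to be explored within a single training run.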

E. LATENCY METRIC OPTIMIZATION PROCEDURE
The loss function L in (1) should reflect not only the accuracy of the architecture but also the latency of the target device. Therefore, we define the following loss function:

  L(arch, weight) = CE(arch, weight) + λ · θ,  (3)

where θ is the weight factor defined as follows:

  θ = (E[LAT] / Target)^β,

where β is the weight factor for E[LAT] versus Target. The first term CE(arch, weight) denotes the cross-entropy loss of the architecture with the weight parameters. The term E[LAT] denotes the estimated latency of the architecture on the target device, and Target is the target latency specified by the user. The coefficient λ is a scaling factor that adjusts the relative scale of CE and E[LAT] in the loss function.

Algorithm 1 (excerpt):
  8: Update α by ∇_α L_val(α, weight) with Data_val
  9: end while
  10: output: S with trained parameters
  11: Extract the final architecture from S according to α

To calculate the E[LAT] term, we used a latency LUT to estimate the overall latency of a network based on the actual latency of each operation [21]. The latency of each operation is measured on specific devices and stored in a LUT, which contains the actual latency values for each operation on different target devices. During the search, the LUT for the target device under consideration is loaded, and its latency information is incorporated into the objective function. This allows the search algorithm to evaluate and optimize the architecture based on latencies measured on the target device.
The overall latency of the overparameterized network was estimated by summing the expected latency of each edge:

  E[LAT] = Σ_i Σ_j Prob(i, j) · lat(o(i, j)),

where Prob(i, j) denotes the probability of selecting operation o(i, j), and lat(o(i, j)) denotes the actual latency of the jth operation on the ith edge, loaded from the target device's LUT. Thus, we can estimate the overall latency of the overparameterized network.
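The E[LAT] estimate and the combined loss can be sketched as follows. This is an illustrative reading rather than the authors' implementation; in particular, the exact functional form combining CE, λ, β, and Target is our assumption.

```python
import torch
import torch.nn.functional as F

def expected_latency(alphas, edge_lats):
    """E[LAT]: sum over edges of the probability-weighted operation latencies.

    alphas:    list of per-edge architecture-parameter tensors (length M each).
    edge_lats: list of per-edge latency tensors from the target device's LUT,
               aligned with the operation order of the alphas.
    """
    total = torch.zeros(())
    for alpha, lat in zip(alphas, edge_lats):
        total = total + (F.softmax(alpha, dim=0) * lat).sum()
    return total

def search_loss(ce, alphas, edge_lats, target_ms, lam=0.5, beta=1.0):
    """Cross-entropy plus a latency penalty scaled against the target latency.

    The form CE + lam * (E[LAT] / Target) ** beta is one plausible reading of
    the paper's loss; the original functional form may differ.
    """
    return ce + lam * (expected_latency(alphas, edge_lats) / target_ms) ** beta
```

With lam = 0.5 and beta = 1 (the settings reported in Section IV-A), an architecture whose expected latency exactly matches the target adds 0.5 to the cross-entropy term, and faster architectures are penalized proportionally less.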

IV. EXPERIMENTAL RESULTS
The experimental results are organized into three parts. Section IV-A introduces the comparison methods, datasets, and implementation details. Comparison results are discussed in Section IV-B, and an in-depth analysis across target devices is presented in Section IV-C.

A. EXPERIMENTAL SETUP
We implemented state-of-the-art RL- and EA-based methods as baselines for comparison with our proposed method. Our experiments were conducted on five public HAR datasets, where we measured the architecture search time, parameter size, and F1-Score as comparison metrics against the existing methods. The architecture search time serves as an indicator of search-process efficiency, while the parameter size and F1-Score provide insights into the resulting architectures' size and accuracy, respectively. In addition to these evaluations, we conducted experiments targeting various devices, including the Galaxy A31 and Galaxy S10, to assess the proposed method's optimization metric. To validate the effectiveness of our approach, we measured the latency on the Galaxy A31 with the MediaTek MT6768 CPU and the Galaxy S10 with the Exynos 9820 CPU. In Section IV-B, we employed FLOPs as the optimization metric for the RL-based method, FLOPs and MAC for the EA-based method, and CPU latency for our proposed method. Furthermore, in Section IV-C, we specifically used the latency of the Galaxy A31 and Galaxy S10 devices as the optimization metric.

FIGURE 4. Comparison of architecture search time results for the proposed method and existing NAS methods [11], [12] on five HAR datasets.

FIGURE 5. Comparison of parameter size results for the proposed method and existing NAS methods [11], [12] on five HAR datasets.

The five public HAR datasets used in the experiments are as follows: UCI-HAR [22], UniMiB-SHAR [23], WISDM [24], OPPORTUNITY [25], and KU-HAR [26]. The data in these datasets were collected by wearable sensors and mobile devices, which are widely used in HAR. The UCI-HAR dataset was collected from 30 subjects wearing a mobile device on their waist and performing six daily activities. The UniMiB-SHAR dataset was collected by recording 17 activities performed by 30 participants using the acceleration sensors of mobile phones. The WISDM dataset was collected from 36 subjects carrying a mobile device in their pants pocket and consists of six daily activities. The OPPORTUNITY dataset was collected by placing wearable devices on the upper body, buttocks, and legs of four subjects, who were then asked to perform 17 daily activities. The KU-HAR dataset was collected by recording 18 activities performed by 90 subjects using the acceleration and gyroscope sensors of their mobile phones.
We divided each dataset into a training set and a test set in a proportion of 8:2; the test set was also used to validate the overparameterized network. The overparameterized network was trained for 100 epochs with a batch size of 256, and SGD and Adam were used as the optimizers for learning the weights and the architecture parameters, respectively. To balance the cross-entropy and latency terms in the loss function (3), we set λ to 0.5 and β to 1, which were the optimal settings in our experiments. The experiments were implemented with the PyTorch deep learning library and performed on an RTX 2080Ti GPU with 11 GB of memory. Target device experiments were conducted on the Android 10 OS.

B. COMPARISON RESULTS

FIGURE 7. FLOPs comparison of the A31 and S10 models (a) and latency comparison deployed on Galaxy A31 (b) and Galaxy S10 (c) for the models searched by the proposed method on the WISDM dataset.

FIGURE 5 displays the parameter size of the searched architecture for each method on each dataset. The consideration of multiple objectives led to a decrease in parameter size on the majority of datasets. This reduction was evident even in the results of the proposed method, which integrates the latency term, across four datasets. Despite having the shortest architecture search time, the proposed method consistently produced architectures with the smallest parameter size compared with the other two methods across most datasets. In particular, in FIGURE 5 (a), the parameter size of the RL and EA methods is 0.60 MB, whereas the proposed method generates a model that is approximately three times smaller at 0.23 MB. These results reveal that optimizing latency, instead of FLOPs or MAC, can also produce a lightweight model. FIGURE 6 displays the F1-Score of the searched architecture for each method on each dataset.
Even without considering multiple objectives, the F1-Score is generally high across the different methods. Notably, on the OPPORTUNITY dataset shown in FIGURE 6 (e), the proposed method exhibited improved performance despite a reduction in parameter size, whereas the EA method suffered a decrease in performance despite an increase in parameter size. In FIGURE 6 (c), the F1-Score of the proposed method is 0.86, while those of the RL and EA methods are 0.85 and 0.87, respectively. The proposed method thus achieves a similar F1-Score despite a much shorter architecture search time than the RL and EA methods. On the remaining datasets, the proposed method searches architectures with performance similar to that of the RL and EA methods using at least seven times less architecture search time.

C. IN-DEPTH ANALYSIS
In previous HAR NAS studies, HAR architecture models have been optimized for FLOPs. However, the objective of this study is to create an optimized HAR architecture model for each target device using the latency of that device. To demonstrate that the optimal model architecture can vary depending on the specifications of the target device, we configured an entry-level device (Galaxy A31, CPU: MediaTek MT6768) and a high-end device (Galaxy S10, CPU: Exynos 9820). We measured the latency of the operations in our search space on the Galaxy A31 and Galaxy S10 and, to facilitate the optimization process, stored the measured latencies in advance in a latency LUT. In the proposed method, the objective function includes a term related to latency. During the optimization process, we use the latency LUT to look up and incorporate the latency of each operation specific to each device. By doing so, we can effectively consider and optimize for the latency characteristics of both devices. Since latency optimization depends on the device rather than the dataset, we used the OPPORTUNITY dataset together with each device's latency LUT to perform the architecture search for the Galaxy A31 and Galaxy S10. After the architecture search for each target device, each model was deployed on both mobile devices to benchmark the latency. Moreover, we examined the portability of the obtained models by migrating them to the remaining four datasets. In FIGURE 7 to FIGURE 11, (a) illustrates the comparison of FLOPs between the A31-optimized model and the S10-optimized model, while (b) and (c) provide detailed latency comparison results for the A31-optimized and S10-optimized models on each respective device.
Comparing the FLOPs of each model in FIGURE 7 (a) reveals that the FLOPs of the S10-optimized model are 0.05 MFLOPs lower than those of the A31-optimized model. However, when the two models were deployed on the Galaxy A31 and Galaxy S10, the latency differed depending on the device. As displayed in FIGURE 7 (b), the latency of the A31-optimized model is 10.08 ms, which is lower than the 10.45 ms of the S10-optimized model despite the latter's fewer FLOPs. By contrast, in FIGURE 7 (c), the latency of the S10-optimized model is 6.24 ms, which is lower than the 6.56 ms of the A31-optimized model. These experimental results reveal that the latency-based architecture search method proposed in this study is more appropriate for mobile device optimization than the FLOPs-based architecture search method. Similar results were observed when the models were migrated to the other datasets: although the S10 model has higher FLOPs on those datasets, the model optimized for each device consistently shows lower latency on that device. In the KU-HAR dataset experiment in particular, FIGURE 10 (b) and (c) illustrate a significant latency difference of nearly 1 ms between the two models on both the Galaxy A31 and Galaxy S10.

FIGURE 14. Comparison of operation latency on each mobile device (Galaxy A31, Galaxy S10); e.g., the latency of Conv3 shows the most prominent difference between devices.
Moreover, we conducted an investigation based on the latency LUT to analyze the differences between architectures using target mobile devices. We visualized the block diagrams of the model architectures optimized for each device in FIGURE 12 and FIGURE 13. Additionally, we depicted the latency LUT for candidate operations on Galaxy A31 and Galaxy S10 in FIGURE 14. A noticeable difference can be observed in the latency of each operation between Galaxy A31 and Galaxy S10. While Galaxy S10 benefits from ample computational resources, Galaxy A31 exhibits more significant variations in operation latency. In particular, unlike Galaxy S10, Galaxy A31 displays high latency in convolution with a kernel size of three and max pooling with a kernel size of five. Conversely, Galaxy S10 exhibits lower latency in the convolution with a kernel size of three compared to other operations, and the difference in latency between each operation is smaller than that of Galaxy A31. By comparing and interpreting the latency of each operation on each device, it becomes evident that the model generated through the proposed method was tailored to reflect the characteristics of each device. FIGURE 12 shows that the convolution operation with a kernel size of three and the max pooling operation with a kernel size of five, which exhibited the highest latency on Galaxy A31, were not selected. Instead, operations with relatively lower latency were chosen. Conversely, as shown in FIGURE 13, on Galaxy S10, the convolution operation with a kernel size of three was selected due to its relatively low latency. This comparison highlights how the proposed method adapts the model architecture to the specific latency constraints of each device.
In summary, based on the additional experiments conducted, it is evident that the optimal model for HAR can vary across different devices due to variations in operation latency. Consequently, an architecture design that takes into direct consideration the hardware specifications of the target device was confirmed to be effective in environments with limited computational resources, such as smartphones.

V. CONCLUSION
In this study, we proposed a mobile HAR NAS based on differentiable neural architecture search to automatically design the architecture of a HAR model for a mobile device. By utilizing a single training process over an overparameterized network encompassing all candidate operations, the search time on each dataset was reduced to under 1.5 hours. As a result, the proposed method exhibited an average speed improvement exceeding sevenfold compared with conventional methods. Furthermore, our assessments on various target devices revealed that the proposed method, which utilizes the target device's latency instead of the FLOPs optimization of conventional HAR NAS methods, facilitates the exploration of hardware-specific architectures. In additional experiments conducted with the Galaxy A31 and Galaxy S10 smartphones as target devices, the latency of the A31-optimized model was on average 2% lower than that of the S10-optimized model on the A31 device. Specifically, when running the OPPORTUNITY dataset on the A31, the A31 model showed a 4% improvement in latency compared with the S10 model; conversely, when running the same dataset on the S10, the S10 model exhibited a 5% reduction in latency compared with the A31 model. However, some concerns remain. The time and effort required to construct the latency LUT were not considered, and this cost is not trivial as the number of devices and candidate operations increases. This problem could be alleviated by using a regression model trained on the latency LUT values to predict the latency of other devices.