Coarse-to-Fine Activity Annotation and Recognition Algorithm for Solitary Older Adults

Older adults, especially those living alone, want to remain independent with dignity for as long as possible. Activity recognition plays an essential role in elderly care and rehabilitation by detecting activity shifts among the elderly population. Despite over a decade of research and development in activity recognition, few accurate and reliable systems for older adults are in use. We propose an automatic data collecting and labeling system that addresses the annotation issue, and a novel coarse-to-fine activities of daily living (ADLs) recognition algorithm for older adults that combines supervised and unsupervised machine learning methods. The automatic data collecting and labeling system targets the annotation issue caused by the diversity of ADLs in free-living situations. A multiple-sensor fusion strategy is employed to interpret and annotate the ADLs. Leveraging supervised and unsupervised machine learning methods, we can discover and recognize ambulatory and trivial ADLs for older adults. The performance of the automatic data collecting and labeling system is verified in a four-day test. With the reliable ground truth, we evaluate the coarse-to-fine ADLs recognition algorithm. The performance of our algorithm is promising: the recognition accuracy is higher than 91%.


I. INTRODUCTION
As the world's population ages at an accelerating pace, population aging has become an unavoidable social problem. Significant problems are reported in elderly care and rehabilitation, associated with the waste of public medical resources. Older adults, regardless of nationality, want to remain independent with dignity for as long as possible, especially those living alone. Activity recognition plays an essential role in this area by detecting activity shifts among the elderly population, which can alleviate stress on limited medical resources and help older adults maintain functional ability and live independently longer, by detecting and diagnosing illnesses early for early warning [1]. Based on device diversity and sensor modality, previous research in activity recognition can be roughly categorized into four classes: video-based, radar-based, WiFi signal-based, and inertial sensor-based. Several good reviews can be found in [1]-[6].
The associate editor coordinating the review of this manuscript and approving it for publication was Honggang Wang.
In the past decades, a large number of video datasets and benchmarks for activity recognition have been released, and various video-based activity recognition algorithms have emerged one after another. Thus, the video-based methods are more realistic and appropriate for recognizing human postures and activities, compared with the other three classes. Video data can be accurately labeled manually from the data itself, whereas methods of the other three classes may still need video or a human journal for labeling. It is natural and intuitive to recognize human postures and activities from video data. However, most existing video datasets are captured from surveillance cameras, which always suffer from the privacy issue [2]. The privacy issue can be mitigated with depth video and infrared thermal video, which are also more resistant to illumination changes and poor imaging light than visible video (RGB or gray). Therefore, depth video and infrared video for action recognition have gained much attention. However, several challenges remain unsolved in the video-based activity recognition area: occlusion, annotation, visibility range, and illumination variation. These challenges prevent video-based activity recognition methods from wide use, especially in outdoor situations.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
In recent years, radar has been employed to detect, recognize, and understand human activities, because of its effective sensing capability and penetration through obstacles [5]. Passive radar techniques for indoor human activity recognition have been extensively investigated [5], [7]-[10]. Most of them show high precision (more than 85% accuracy) in the non-line-of-sight environment, which is the typical situation of activity recognition for older adults. To ensure high accuracy, an ambient radar solution is essential for the radar-based methods, which requires a set of ambient radar sensors. The number of sensors increases with the environment size, and the cost rises sharply with it. Moreover, the complexity of a multiple-radar system also grows with the number of radars, making it too sophisticated to handle. As a result, trained professionals are needed to interpret the radar signals, which makes the radar-based methods hard to implement [4].
Nowadays, researchers employ wireless remote sensing technologies with commercial off-the-shelf wireless fidelity (WiFi) devices to perceive and identify activities [4]. Instead of special and professional devices, common WiFi devices are used in a passive manner for activity detection, which makes the approach cheap, universal, convenient, and unobtrusive. WiFi signals propagate indoors and carry rich human body information, which can be modeled for human tracking and activity identification. The multi-path fading rule is employed to extract activity features from the channel state and the frequency modulated wave [6]. Although satisfactory accuracy and robustness are reported for the WiFi signal-based methods, complex environments limit their application in the real world. Device placement scheme, coverage, multiple targets, and through-wall attenuation are all among the challenges of the WiFi signal-based methods [4].
Thanks to the ubiquity of smart devices, such as mobile phones, wrist bands, and smart watches, the use of inertial measurement unit (IMU) sensors embedded in smart devices dominates the activity recognition area [1], [11]. Inertial sensors have obvious advantages over other sensor modalities: ubiquity, cheap installation, unobtrusiveness, ease of use, and freedom from the privacy issue. Since the inertial sensors are attached to the human body, they carry much more body information than other sensors, from which human activities can be inferred in detail. Moreover, no occlusion or visibility range issues exist in this case. Since activities of daily living (ADLs) are ambulatory and trivial with a high level of diversity, unsupervised machine learning algorithms alone are unlikely to identify ADLs. Thus, most activity recognition algorithms are based on supervised learning, where training data with reliable labels is critical. However, annotating the training data requires huge human effort to scan through the raw data for manual labels. In order to model activities effectively and increase the generalization of the model, a large dataset with various human activity modalities is needed, which makes the annotation process more challenging. The annotation issue is one of the major challenges that remain unsolved in the inertial sensor-based activity recognition approaches.
In this paper, we propose an automatic data collecting and labeling system that addresses the annotation issue, and a novel coarse-to-fine ADLs recognition strategy for older adults that combines supervised and unsupervised machine learning methods. Since older adults, especially solitary older adults, want to remain independent with dignity for as long as possible, we aim at helping solitary older adults with functional ability to live independently longer by modeling their activities. For this purpose, we collect data from smart homes with multiple sensors and annotate the data automatically by using a multiple-sensor fusion strategy. Previous studies tend to focus on differentiating solely among ambulatory activities in controlled environments, which is the most fundamental form of the problem in the activity recognition sense. Although success has been achieved in this regime, with reported high sensitivity rates and low misclassification rates, this sort of scenario is too idealized to be deployed. The diversity of ADLs in free-living situations may confuse models trained in controlled environments. Addressing this challenge, we propose a coarse-to-fine strategy to first discover activity clusters and then recognize them. An unsupervised machine learning method is employed to discover and segment the raw sensing data into big clusters. Grouping the raw sensing data into big clusters first is a natural step, since large clusters are much easier to identify [12]. Then, a Hidden Markov Model (HMM) is used to assign the clusters to a well-defined activity set. Since an HMM employs a mathematical model based on a random process to describe the courses of activities, its statistical property gives the HMM the capability of modeling a random signal sequence (activity) with multiple features.
The rest of this paper is organized as follows. In Section II, we describe the automatic data labeling system and the coarse-to-fine ADLs recognition strategy. More specifically, we introduce the data collection protocol and smart home settings for solitary older adults in Section II-A. In Section II-B, we present the details of the proposed automatic data collecting and labeling system. Section II-C describes the coarse-to-fine ADLs recognition strategy. The evaluation experiment setup and the experimental results are shown in Section III. Finally, we discuss the obtained results and draw conclusions in Section IV.

II. METHOD
In order to help solitary older adults remain independent, maintain functional ability, and live at home longer, we study the activity model of solitary older adults and secure their safety at home. Collaborating with several local independent aging facilities, we build a smart home for older adults with an unobtrusive and continuous monitoring system. This monitoring system consists of multiple sensors: a two dimensional (2D) laser scanner, a visual-depth information (RGB-D) camera, and a smart watch. The 2D laser scanner and the RGB-D camera are environmentally mounted, while the smart watch is worn on one hand.
Due to the limitations of environmentally mounted sensors, such as occlusion and coverage, signals describing activities from different angles are collected from different sensors and then fused to interpret and model activities. The labeling issue is one of the biggest challenges in interpreting and modeling activities. Thus, we propose an automatic labeling algorithm based on sensor fusion, and model the activities of older adults with IMU signals, to avoid the limitations of environmentally mounted sensors. Since we target modeling ADLs of older adults in free-living situations, we do not have any training process for the older adults. Each older adult was invited to an apartment in the smart home and lived there for a week or more, as shown in Fig. 1. The smart watch was required to be worn on the non-dominant hand, to reduce the motion noise level. The older adults were asked to charge the smart watches at any time convenient to them, such as while taking a shower or having dinner. Since several ADLs are rare in daily living, the environment and layout of the smart home are designed to guide the activities of participants, with the completeness and balance of the dataset we are going to create in mind. The end table is 0.55m away from the sofa, which is out of reach when sitting or lying on the sofa. The end table is about 0.42m high, which means the participant has to get up and bend to reach the things on it. An electric kettle is placed on the ground, so the participant needs to crouch for a while when operating the kettle.

B. MULTIPLE SENSORS FUSION BASED AUTOMATIC DATA COLLECTING AND LABELING SYSTEM
With great advances in embedded computing technologies, smart wearable devices with inertial sensors play significant roles in our daily living, especially in health care. Thanks to the pervasiveness of these wearable devices, great potential has been demonstrated in characterizing human activities and alleviating stress on limited medical resources. The sensing data, such as acceleration and gyroscope signals, can be explored to model the activities of older adults and uncover their health status. However, after a decade of research in modeling activities of older adults, few accurate and reliable systems are in use. In order to train models, tremendous effort has been placed in data collecting and labeling. Many of these efforts suffer from huge human effort as well as technical and privacy limitations [13]. It is still challenging to collect accurate and reliable activity data with labels. As a result, no commonly used dataset for older adults exists. Addressing this challenge, we propose a data collection system with an automatic annotation strategy, to collect the ADL sensing data of older adults and label it automatically. In the annotation process, we focus on an activity set of standing, walking, bending, lying, crouching, and sitting, which are ADLs closely related to basic independent living.

1) HARDWARE DEVICES
The automatic data collecting and labeling system consists of a 2D laser scanner (3irobotix lidar C0602), an RGB-D camera (Intel RealSense camera D415), a smart watch (Huawei watch 2), and a Raspberry Pi computer (Raspberry Pi 3B+), as shown in Fig. 2. The 2D laser scanner is horizontally mounted on the ground under the sofa, to capture the foot locomotion. Ranging laser technology provides a continuous stream of distance data, with the advantages of a wide field of view and consistency under any lighting condition. It gives us the potential to capture the foot locomotion rapidly and measure walking speed precisely. Considering the high levels of occlusion and partial self-occlusion, sofas with thin legs are chosen, and the laser scanners are mounted as far as possible from the feet. The RGB-D camera is on the ceiling above the television, facing the sofa in the living room. The depth images captured by the RGB-D camera are used to identify the posture of older adults. Since we intend to model the ADLs of solitary older adults with the least invasiveness, the smart watch is the only sensor mounted on the body. As discussed previously, the laser scanner and the camera are environmentally mounted, so they cannot collect the activity data continuously when the participant moves out of sight of the devices. Therefore, we model the activities of older adults with the smart watches, in combination with the laser scanners and the cameras. The Raspberry Pi computer transfers the collected data to the AWS cloud. The de-identified and encrypted data on the AWS cloud can be used for labeling and activity modeling. Finally, the ADLs data is annotated by using our automatic labeling system, to form the ADLs dataset for activity modeling.
FIGURE 2. The automatic data collecting and labeling system for modeling ADLs of solitary older adults. The 2D range data (2D laser scanner), the depth images (RGB-D camera), and the IMU sensing data (smart watch) are collected and transferred to the AWS cloud for storage, by using a Raspberry Pi computer. The data is subsequently downloaded to a local server and annotated by using our automatic labeling system.
The 2D range data is generated by the laser scanner at 6.2Hz with an angular resolution of 0.5°, and stored in point clouds. The range radius of the point cloud is 10m. The RGB-D camera has two depth imagers, an infrared projector, and an RGB module. The two depth imagers enhance the capability of the RGB-D camera to capture depth images with a wide field of view. The infrared projector enables the camera to work under poor lighting conditions, with a large distance range from 0.2m to over 10m. Considering the computational burden on the Raspberry Pi computer, the transmission speed limit of USB2.0, and the network bandwidth, we empirically set the resolution and the frequency of the camera to 640×480 and 6Hz, respectively. From the smart watch, we collect the barometer signal at 20Hz, and three axes of acceleration and gyroscope at 50Hz. By switching off the network and uploading the data only when charging, we make the smart watch maintain a reasonable battery life: the watch operational cycle can reach as long as 28 continuous hours. With the long battery life, participants can charge the watches at any time convenient to them, such as while taking a shower. Moreover, the built-in memory of the smart watch can store continuous data for 14 days without uploading.

2) RANGE DATA BASED WALKING SPEED ESTIMATION
Ranging laser technology provides a continuous stream of distance data, which gives us the potential of capturing foot locomotion rapidly and measuring walking speed precisely. When the point cloud is generated by the laser scanner, K-means clustering [14] is employed to partition the point cloud into different clusters (including the foot point cluster), as shown in Fig. 3. In our work, a foot is assumed to have an approximately circular shape. Since the laser scanner is environmentally mounted on the ground under the sofa, we only have the foot silhouette in one direction. Thus, a point cluster that has an arc or convex shape within a reasonable ratio will be classified as a foot. Based on this foot shape assumption, several foot features are employed to identify and trace the two feet, which are detailed below.
1) Reasonable distance ratio $F_d$. The central point of each point cluster is calculated. Every point of a foot point cluster should lie within a reasonable Euclidean distance, $F_d$, of the cluster center, as follows:
$$|p - P_c| \le F_d, \quad \forall p \in P$$
where $|\cdot|$ denotes the $L_2$ norm; $p = \{x, y\}$ is one point in the point cluster $P$, with $\{x, y\}$ the coordinates of point $p$; $P_c$ is the center of the point cluster $P$; the distance ratio $F_d$ is set to 0.14m, which is the maximum distance from the cluster edge to the cluster center.
2) Foot length $F_l$. The length of one point cluster in the moving direction is estimated as the foot length, as follows:
$$F_l = |p_f - p_b|$$
where $p_f$ and $p_b$ are the two points with maximum distance in the moving direction.
3) Foot circularity $F_c$. The radian of each point in the cluster edge is calculated from two adjacent edge points, $p_{i-1}$ and $p_i$, and the mean value of the radians over the $n$ points in the cluster edge is considered as the approximate foot circularity, $F_c$.
4) Foot arc length $F_a$. The sum of the Euclidean distances between adjacent points is calculated as the foot arc length, as follows:
$$F_a = \sum_{i=2}^{n} |p_i - p_{i-1}|$$
where $p_{i-1}$ and $p_i$ are two adjacent points in the cluster edge, and $n$ is the number of points in the cluster edge.
FIGURE 3. The foot silhouette in one direction generated by the laser scanner. A point cluster is assumed to have an arc or convex shape, which is a typical shape of a foot. Other directions give similar observations.
Due to the dynamics of walking speed and trivial foot motion in daily living, estimating walking speed in daily living is really challenging. Thus, we adopt a framework combining Random Forest and Kalman Filter to detect feet and track them. In this framework, we assume that the background is static and the foot appearance is fixed. Since the real-world environment is highly diverse, accurate walking speed estimation requires distinguishing feet from other moving objects, such as crutches and animals. Random Forest is employed to detect feet, and results in foot candidates. Then the foot candidates are fed into the Kalman Filter for foot tracking. Finally, the walking speed is estimated based on the tracking trajectory.
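As a rough illustration of the geometric foot features listed above, the following minimal sketch computes $F_d$, $F_l$, and $F_a$ for one ordered point cluster. It is not the authors' implementation: the function names, the input layout (a list of `(x, y)` edge points in metres plus a moving-direction vector), and the choice of $p_f$ and $p_b$ as the points with extreme projections on the moving direction are assumptions made for this example. The circularity $F_c$ is omitted because its exact formula is not given in the text.

```python
import math

def foot_features(points, direction):
    """Compute hypothetical F_d, F_l, F_a for an ordered list of (x, y) edge points."""
    n = len(points)
    # Cluster center P_c: mean of all points.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # F_d: largest Euclidean distance from any point to the center.
    f_d = max(math.hypot(x - cx, y - cy) for x, y in points)
    # F_l: distance between the front-most and back-most points
    # along the moving direction (assumed reading of p_f and p_b).
    norm = math.hypot(direction[0], direction[1])
    ux, uy = direction[0] / norm, direction[1] / norm
    p_f = max(points, key=lambda p: p[0] * ux + p[1] * uy)
    p_b = min(points, key=lambda p: p[0] * ux + p[1] * uy)
    f_l = math.hypot(p_f[0] - p_b[0], p_f[1] - p_b[1])
    # F_a: sum of distances between adjacent edge points.
    f_a = sum(
        math.hypot(points[i][0] - points[i - 1][0], points[i][1] - points[i - 1][1])
        for i in range(1, n)
    )
    return {"F_d": f_d, "F_l": f_l, "F_a": f_a}

def within_distance_ratio(features, max_radius=0.14):
    """Check the 0.14 m edge-to-center limit quoted in the text."""
    return features["F_d"] <= max_radius
```

In the pipeline described here, such features would be the input to the Random Forest foot detector rather than a standalone decision rule; the threshold check is shown only to make the 0.14m limit concrete.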
Random Forest is one of the top performers for classification tasks. By combining multiple decision trees into an ensemble, Random Forest embeds randomness in the training phase, which reduces overfitting and gives it the capability of dealing with missing data and unseen data.
In our research, missing data and unseen data were observed in occlusion and partial self-occlusion situations, and Random Forest helps keep the foot tracking process accurate. There are many open source implementations of Random Forest, such as OpenCV for C++ [15] and Scikit-learn for Python [16]. We combine OpenCV and Scikit-learn to gain both detection accuracy and computational efficiency, by executing the OpenCV implementation of the Random Forest algorithm with partially modified Scikit-learn parameters. Here, we change the default setting of the maximum tree depth to 20, and the number of training iterations to 100.
In order to avoid human effort in the annotation process, we collect two kinds of training datasets: no-person and single-person datasets. Three indoor environments have been chosen. For each environment, a single-person dataset and a no-person dataset are collected, each 12 minutes long. Since the backgrounds are static and identical in the single-person and no-person datasets for each environment, it is easy to annotate the background (negative dataset) and human feet (positive dataset).
With the trained Random Forest classifier, the foot candidates are detected for each frame of the point cloud. The centroids of the foot candidates are fed into the Kalman Filter for foot tracking. Each centroid is assigned to a Kalman tracker (Kalman filter). The Kalman filter has been proved to be effective and accurate in estimating a target's velocity and position [17]. Addressing the occlusion issue in the foot tracking, we employ the Kalman filter [18] to model the walking activity as a dynamic system, by using the OpenCV implementation, as follows:
$$X_k = A_k X_{k-1} + B_k u_k + w_k$$
$$z_k = H_k X_k + v_k$$
where $X_k = [x_k, y_k, \dot{x}_k, \dot{y}_k]$ is the state vector of a centroid $c_k$ in frame $k$, and $z_k = [x_k, y_k]$ is the measurement vector. $x_k$ and $y_k$ are the position components, while $\dot{x}_k$ and $\dot{y}_k$ are the velocity components; $A_k$ is the state transition matrix; $u_k$ is the control vector containing the acceleration force; $B_k$ is the control matrix of the acceleration effect; $w_k$ is the process noise; $H_k$ is the transformation matrix, mapping $X_k$ into the measurement $z_k$; $v_k$ is the observation noise. Since the frequency of the 2D laser scanner is 6.2Hz, the interval between two frames is 0.16s, which is small for walking activity. Thus, we assume that the foot velocity is constant. As a result, the control vector $u_k$ is set to 0.
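The constant-velocity tracking model described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the OpenCV-based implementation used in the paper: the class names `AxisKalman` and `FootTracker` are hypothetical, the x and y axes are filtered independently (equivalent to the four-dimensional state when the noise covariances are block-diagonal), and the noise levels `q` and `r` are arbitrary placeholder values.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one axis; state is [position, velocity]."""

    def __init__(self, pos, dt, q=1e-2, r=1e-2):
        self.x = [pos, 0.0]                 # initial velocity set to 0, as in the text
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance (assumed initial value)
        self.dt, self.q, self.r = dt, q, r  # q, r: assumed process/measurement noise

    def predict(self):
        dt = self.dt
        pos, vel = self.x
        self.x = [pos + dt * vel, vel]      # X_k = A X_{k-1}, with u_k = 0
        (a, b), (c, d) = self.P
        # P = A P A^T + Q with A = [[1, dt], [0, 1]] and Q = q * I
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.x[0]

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r                      # innovation covariance, H = [1, 0]
        k0, k1 = a / s, c / s               # Kalman gain
        y = z - self.x[0]                   # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


class FootTracker:
    """Tracks one foot centroid; pass measurement=None when the foot is occluded."""

    def __init__(self, x, y, rate_hz=6.2):
        dt = 1.0 / rate_hz                  # about 0.16 s between laser frames
        self.kx, self.ky = AxisKalman(x, dt), AxisKalman(y, dt)

    def step(self, measurement=None):
        self.kx.predict()
        self.ky.predict()
        if measurement is not None:
            self.kx.update(measurement[0])
            self.ky.update(measurement[1])
        return self.kx.x[0], self.ky.x[0]

    @property
    def velocity(self):
        return self.kx.x[1], self.ky.x[1]
```

Calling `step(None)` keeps only the prediction, which mirrors how a tracker can coast through a short occlusion until a matching foot candidate reappears.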
Here, the Kalman filtering process takes two stages: prediction and measurement. Take one tracker as an example. For frame $k$, the centroid $c_k$ is predicted based on the transition matrix $A_{k-1}$ and the centroid $c_{k-1}$. Then, we search for a foot candidate around the predicted centroid $c_k$. If a foot candidate exists in the predicted area, the centroid $c_k$ is updated, and the association between $c_k$ and $c_{k-1}$ is recorded in the transition matrix $A_k$, which will be used for prediction in the next iteration. If no foot candidate exists in the predicted area, occlusion may have happened. The predicted centroid $c_k$ is kept, and the transition matrix $A_k$ is also updated based on $c_k$. The tracker is kept waiting for a foot candidate that matches the prediction. If there is no match for the predictions in the following frames, the tracker is removed, and the last detected centroid $c_{k-1}$ is considered as the end of a walking activity. The step length between footprints in the foot trail is used to calculate the walking speed, as shown in Fig. 4. The left foot and the right foot are identified according to the moving direction and the distance to the scanner, and the two feet are tracked separately. Finally, the average speed of the two feet is defined as the walking speed for each step in the tracking period. If one footprint is occluded (missing), the other footprint is used to calculate the walking speed. In the worst case, where both footprints are missing, the next pair of footprints is used for the calculation. In our experiment, we observed only one pair of missing footprints, and it did not affect the accuracy of the walking speed estimation. The foot candidates in the first frame are used as the initial positions, and the initial foot velocity is set to 0.
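The per-step walking speed rule described above (average of the two feet, fall back to one foot when the other is occluded, skip the step when both are missing) can be sketched as below. The function names and the footprint representation, a `(t, x, y)` tuple in seconds and metres or `None` when occluded, are assumptions made for this example.

```python
import math

def foot_speed(prev, curr):
    """Speed of one foot between two consecutive footprints, or None if unavailable."""
    if prev is None or curr is None:
        return None                      # footprint occluded (missing)
    dt = curr[0] - prev[0]
    if dt <= 0:
        return None                      # guard against invalid timestamps
    step_length = math.hypot(curr[1] - prev[1], curr[2] - prev[2])
    return step_length / dt

def step_walking_speed(left_pair, right_pair):
    """Average the speeds of both feet; use the remaining foot if one is missing."""
    speeds = [s for s in (foot_speed(*left_pair), foot_speed(*right_pair))
              if s is not None]
    if not speeds:
        return None                      # both missing: caller moves to the next pair
    return sum(speeds) / len(speeds)
```

A caller would iterate over consecutive footprint pairs in each foot trail and ignore steps for which `None` is returned, which corresponds to using the next pair of footprints.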

3) DEPTH IMAGE BASED POSTURE ESTIMATION
The posture estimation runs in parallel with the walking speed estimation, recognizing activities from a complementary perspective. Addressing the privacy issue, depth images are chosen for posture estimation instead of RGB color images. The pre-trained Convolutional Neural Network (CNN) of OpenPose is fine-tuned to extract the human skeleton structure, by using transfer learning from RGB images to depth images. OpenPose [19] is an open source library for multi-person skeleton detection. It is worth mentioning that this skeleton detection system achieves high accuracy and real-time performance. Then, the skeleton-based features are fed into a Random Forest for posture recognition.
In the ADLs dataset mentioned previously, depth images and aligned RGB images are collected simultaneously. First, we employ a pretrained CNN of OpenPose to extract human skeleton structures from the RGB images. The extracted human skeleton structures are mapped to the corresponding depth images. The obtained human skeleton structures and the corresponding depth images form the posture dataset, which is used to obtain our own CNN for depth images. A human skeleton structure with 25 keypoints is used in this work: nose, eyes, ears, neck, shoulders, elbows, wrists, mid-hip, hips, knees, ankles, big toes, small toes, and heels, as shown in Fig. 5. Less important keypoints, such as eyes and heels, are not shown for demonstration purposes. In order to avoid confusion, the 6 activities in the activity set (standing, walking, bending, lying, crouching, and sitting) are used as postures here. The postures are marked by 3 human raters independently. Since the dataset of depth images and aligned RGB images is huge, labeling the postures manually by going through the images frame by frame is time consuming and unrealistic. We take a semi-automatic method to mark the dataset, which is described later. Second, the pretrained CNN of OpenPose is fine-tuned by using the Caffe fine-tuning toolkit [20] on our posture dataset. Our depth image based task is related to the task of extracting human skeleton structures from RGB images. Thus, there is no need to train a new model with costly relearning. Moreover, our posture dataset is a relatively small collection compared with the datasets used to train the CNN of OpenPose, MPII [21] and COCO [22]. Due to the small size of our dataset, training a brand new CNN could largely increase the possibility of overfitting, whereas transfer learning requires a much smaller dataset [23]. The fine-tuning process is implemented in Matlab 2017b with the Caffe toolkit, and runs on a Titan GPU with 12GB video memory under Linux (Ubuntu 16.04).
Once the skeleton structure with 25 keypoints is extracted from the depth image, as shown in Fig. 6, geometrical angles and three dimensional (3D) pairwise distances are calculated as posture features. The body trunk angle, $a_{trunk}$, is designed to distinguish bending and lying from other postures, and is defined as the angle between the body trunk and the horizontal plane. The definition of $a_{trunk}$ uses $\vec{n} = (1, 0, 0)$, the normal vector of the horizontal plane, and $p_{neck} = \{x_{neck}, y_{neck}, z_{neck}\}$, the 3D coordinates of keypoint neck (Fig.), where $x_{neck}$, $y_{neck}$, and $z_{neck}$ are the x-coordinate, y-coordinate, and intensity of keypoint neck in the depth image. The hip angle, $a_{hip}$, is designed to distinguish sitting and crouching from standing and walking. Since occlusion and partial self-occlusion frequently occur in ADLs, distances between paired keypoints are calculated for posture recognition in a complementary way. To calculate distance features, we create two keypoint clusters, $L$ = {left knee, right knee, left ankle, right ankle} and $U$ = {left elbow, right elbow, left wrist, right wrist}. Paired keypoints are selected randomly from these two clusters and keypoint neck. For example, right wrist and left knee are chosen from clusters $U$ and $L$, respectively. The distance between right wrist and left knee, $d_{right.wrist\&left.knee}$, is calculated as follows:
$$d_{right.wrist\&left.knee} = \frac{|p_{right.wrist} - p_{left.knee}|}{s}$$
where $s$ is a scaling parameter. The 3D distances are invariant to rotation. However, they vary widely between different people. To ensure scale invariance, we normalize the distances with the scaling parameter $s$. The distance between the two shoulder keypoints is defined as the scaling parameter. The two shoulder keypoints are always visible to our RGB-D camera. The distance between them is proportionally related to the height of one particular person, and it is stable and does not change dramatically across frames.
These are the reasons we choose the two shoulder keypoints to calculate the scaling parameter. Please note that missing keypoints caused by occlusion are inevitable. When a keypoint is missing, the affected features are set to Null. The seven geometrical angle features are used to mark the postures semi-automatically, under the supervision of the human raters. $a_{trunk}$, $a_{left.knee}$, $a_{right.knee}$, $a_{left.hip}$, and $a_{right.hip}$ can be employed to roughly divide standing, lying, sitting, and crouching into several categories according to continuous time, by using a threshold based method. Thresholds for the features are decided empirically. $a_{trunk}$, $a_{left.shoulder}$, and $a_{right.shoulder}$ are used for the bending posture, while $a_{left.hip}$, $a_{right.hip}$, $a_{left.knee}$, and $a_{right.knee}$ are used for the walking posture. Since the postures of older adults are continuous in time, the human raters focus on the beginning and end of each category, and mark the RGB images. The labels for the depth images can be obtained subsequently.
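To make the shoulder-normalized pairwise distance feature concrete, here is a minimal sketch. It assumes each keypoint is an `(x, y, z)` tuple, or `None` when occluded, and that normalization means dividing by the shoulder distance; the function names are hypothetical and not taken from the paper.

```python
import math

def euclid3(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def normalized_distance(kp_a, kp_b, left_shoulder, right_shoulder):
    """Pairwise keypoint distance divided by the shoulder width s.

    Returns None (the Null feature) when any required keypoint is missing.
    """
    if any(k is None for k in (kp_a, kp_b, left_shoulder, right_shoulder)):
        return None
    s = euclid3(left_shoulder, right_shoulder)   # scaling parameter s
    if s == 0:
        return None                              # degenerate skeleton, treat as missing
    return euclid3(kp_a, kp_b) / s
```

Because both the numerator and the denominator scale with body size, the feature is unchanged when every coordinate is multiplied by the same factor, which is the scale invariance argued for in the text.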
Once the features are calculated for one frame, a feature vector is formed and used to train a Random Forest classifier, as introduced in Section II-B.2. Since the Random Forest implementation does not accept Null data, the Null features are set to 2π and 100 for geometrical angles and 3D pairwise distances, respectively. In our experiment, we found that the posture estimation subsystem cannot accurately distinguish standing from walking. As can be observed from Fig. 6 (standing and walking), the two postures are very similar to each other. The walking speed estimation subsystem helps in recognizing these two postures: a walking speed of more than 0.05m/s is considered as walking. Moreover, the posture estimation subsystem also has difficulty in distinguishing bending from crouching, because of partial self-occlusion. It is worth mentioning that a small amount of occlusion does not affect the skeleton extraction; see standing and sitting in Fig. 6.
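The two rules stated above, replacing Null features with sentinel values and using the walking speed to separate standing from walking, can be sketched as follows. The constants come from the text (2π, 100, and 0.05m/s); the function names and the use of `None` to represent Null are assumptions for this example.

```python
import math

ANGLE_FILL = 2 * math.pi     # sentinel for missing geometrical angles
DISTANCE_FILL = 100.0        # sentinel for missing 3D pairwise distances
WALKING_SPEED_MIN = 0.05     # m/s; faster than this counts as walking

def build_feature_vector(angles, distances):
    """Concatenate angle and distance features, replacing None with sentinels."""
    filled_angles = [ANGLE_FILL if a is None else a for a in angles]
    filled_distances = [DISTANCE_FILL if d is None else d for d in distances]
    return filled_angles + filled_distances

def resolve_standing_walking(posture, walking_speed):
    """Override the standing/walking decision with the walking speed when available."""
    if posture in ("standing", "walking") and walking_speed is not None:
        return "walking" if walking_speed > WALKING_SPEED_MIN else "standing"
    return posture
```

Other postures pass through unchanged, since the text only describes the speed-based correction for the standing and walking pair.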

4) AUTOMATIC IMU DATA LABELING
In this research, we model the ADLs of older adults with the streaming barometer, acceleration, and gyroscope data collected from smart watches. ADLs are more ambulatory and trivial than the activities performed in labs, and there is no clearly distinct boundary separating them from each other. In daily living, there are numerous situations in which one activity is interrupted by other activities and divided into small activity segments. For example, walking is interrupted by short stops, such as standing. Traditional activity recognition algorithms cluster the walking segments and the stops as different activities, and fragment the walking into several parts. The fragmented walking activities with different interruptions have different motion features, which confuses the modeling algorithm.
Inspired by the unsupervised activity discovery algorithm, Unbounded Unsupervised Activity Discovery using the Temporal Behaviour Assumption (UnADevs) [24], we classify the sensing data into big clusters by defining the minimum duration of ADLs. The UnADevs algorithm is capable of discovering activity clusters corresponding to periodic and stationary activities in sensing data, not limited to acceleration and gyroscope data. UnADevs has three key parameters: the number of active clusters, the duration that a cluster can remain active (tolerance), and the minimum duration of a cluster. UnADevs discovers activities in a growing way. First, the data is segmented into overlapping windows by using a sliding window technique [27]. Then, the feature vector $[b, SMV_{acc}, a_x, a_y, a_z, SMV_{gyro}, g_x, g_y, g_z]$ is calculated for each window, where $b$ is the barometer reading, $SMV_{acc}$ is the signal magnitude vector (SMV) [13] of acceleration, $a_x$, $a_y$, and $a_z$ are the three axes of acceleration, and $SMV_{gyro}$, $g_x$, $g_y$, and $g_z$ have the same definitions for the gyroscope. For one window waiting for clustering, the distances between the active clusters and the window are calculated. The cluster with the minimum distance is chosen for growing, by adding the window to the cluster. The number of active clusters determines the deviation of the target window. The tolerance parameter decides which cluster should be turned inactive, and moves a new cluster into the active set. Meanwhile, the minimum duration parameter prevents small clusters from being created. The size of the sliding window is set to 2s, and two consecutive windows overlap by 1s.
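The windowing and per-window feature vector described above can be sketched as follows. This is an illustration only: it assumes the barometer and IMU streams have already been resampled to a common rate, that each sample is a tuple `(b, ax, ay, az, gx, gy, gz)`, and that each feature is the mean over the window. The text does not specify the aggregation, so the mean is an assumption, as are the function names.

```python
import math

def sliding_windows(samples, rate_hz, size_s=2.0, step_s=1.0):
    """Split samples into 2 s windows that overlap by 1 s (step of 1 s)."""
    size = int(round(size_s * rate_hz))
    step = int(round(step_s * rate_hz))
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]

def smv(x, y, z):
    """Signal magnitude vector of one three-axis sample."""
    return math.sqrt(x * x + y * y + z * z)

def window_features(window):
    """Return [b, SMV_acc, a_x, a_y, a_z, SMV_gyro, g_x, g_y, g_z] as window means."""
    n = len(window)

    def mean(values):
        return sum(values) / n

    b = mean([s[0] for s in window])
    acc = [mean([s[i] for s in window]) for i in (1, 2, 3)]
    gyro = [mean([s[i] for s in window]) for i in (4, 5, 6)]
    smv_acc = mean([smv(s[1], s[2], s[3]) for s in window])
    smv_gyro = mean([smv(s[4], s[5], s[6]) for s in window])
    return [b, smv_acc] + acc + [smv_gyro] + gyro
```

With 50Hz data, each window holds 100 samples and consecutive windows start 50 samples apart, matching the 2s size and 1s overlap given in the text.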
A voting strategy is employed to label the sensing data, as shown in Fig. 7. Each discovered activity cluster is segmented into overlapping windows by the sliding window technique. The windows of walking speed estimation vote for walking and still, while the windows of posture estimation vote for the other activities in a complementary way, as described at the end of Section II-B.3. Since the walking speed estimation is highly accurate, its votes are given high priority. Although the walking speed estimation and the posture estimation are combined to improve the labeling accuracy, misclassifications were still reported in several sliding windows. Given the interruptions by other activities, misclassifications in a few windows are to be expected. More importantly, because the misclassified windows are far fewer than the correctly classified ones, they do not affect the final labeling results (more details in Section III-B and III-C).
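One way to picture the prioritized voting is the sketch below. It is an illustration under assumptions rather than the authors' procedure: the 0.1 m/s threshold is borrowed from the evaluation in Section III-B, treating a slower window as still is a simplification, and the function names and the rule that a "walking" posture with a still speed estimate becomes "standing" are hypothetical.

```python
from collections import Counter

WALK_THRESHOLD = 0.1  # m/s; assumed threshold separating walking from still

def fuse_window(speed, posture):
    """Combine one window's speed estimate with its posture estimate.

    speed: estimated walking speed in m/s, or None when no estimate exists.
    posture: label proposed by the posture estimation for the same window.
    The speed estimate takes priority whenever it is available.
    """
    if speed is None:
        return posture
    if speed > WALK_THRESHOLD:
        return "walking"
    if posture == "walking":
        # Speed says the person is still, so refine the posture label.
        return "standing"
    return posture

def label_cluster(windows):
    """Return (winning label, vote share) for a list of (speed, posture) windows."""
    votes = Counter(fuse_window(speed, posture) for speed, posture in windows)
    label, count = votes.most_common(1)[0]
    return label, count / len(windows)
```

For example, `label_cluster([(0.5, "standing"), (0.4, "walking"), (0.0, "walking"), (0.6, "bending")])` returns `("walking", 0.75)`: one window is refined to standing, but it is outvoted.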
Before the labeling process, the voting strategy is used to tune the parameters of the UnADevs algorithm. The initial parameters are set to 3, 22s, and 16s for the number of active clusters, the tolerance, and the minimum duration of a cluster, respectively. For each cluster, the activity with the highest voting percentage wins the vote. The mean of these highest percentages over all clusters is used to tune the three key parameters: the larger the mean value, the better the parameters. We empirically varied the parameters to achieve the highest mean value. As a result, we obtained the parameter set 7, 30s, and 25s for the three key parameters.
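The tuning criterion can be written down compactly. The sketch below assumes a hypothetical callable `cluster_fn(n_active, tolerance_s, min_duration_s)` that runs activity discovery and returns, for each cluster, the list of per-window vote labels; neither that callable nor the function names come from the paper.

```python
from collections import Counter

def mean_top_vote_share(cluster_votes):
    """Mean over clusters of the winning activity's share of the votes.

    cluster_votes: list of clusters, each a non-empty list of window labels.
    """
    shares = []
    for votes in cluster_votes:
        top_count = Counter(votes).most_common(1)[0][1]
        shares.append(top_count / len(votes))
    return sum(shares) / len(shares)

def pick_parameters(candidates, cluster_fn):
    """Return the candidate parameter triple with the highest mean vote share.

    candidates: iterable of (n_active, tolerance_s, min_duration_s) tuples.
    cluster_fn: hypothetical discovery-plus-voting routine (see lead-in).
    """
    return max(candidates, key=lambda p: mean_top_vote_share(cluster_fn(*p)))
```

A cluster with votes `["walking"] * 3 + ["standing"]` contributes 0.75 and a cluster with votes `["sitting"] * 2` contributes 1.0, giving a score of 0.875 for that parameter set.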
With the tuned parameters, the sensing data is classified into clusters, as discovered activities. The voting strategy is used again to label the clusters, as we described previously. For one cluster, the activity with the highest voting percentage will be assigned to this cluster. Finally, the sensing data clusters with activity labels form the ADL dataset for older adults.

C. COARSE-TO-FINE ADLS RECOGNITION STRATEGY
Since it is difficult to separate ADLs from each other, as discussed previously, the coarse-to-fine strategy first classifies ADLs roughly into big clusters and then refines them. Firstly, the sensing data goes through the tuned UnADevs algorithm, and clusters are generated as discovered activities. Secondly, the clusters are divided into overlapping windows, $W = \{\ldots, w_i, \ldots\}$, by using the sliding window technique. The overlapping windows are further divided into small slots, which are defined as sampling periods. Let us denote the slots as $w_i = \{\ldots, s^{i}_{j}, \ldots\}$, where $s^{i}_{j}$ is the $j$th slot in window $w_i$. We divide each window into ten slots in this work. Since the acceleration signal cannot identify activities with small motions, such as standing, lying, and sitting, the gyroscope signal is combined with the acceleration signal to detect pose changes of the forearm. The barometer signal is employed to recognize activities at different altitudes. Feature vectors with nine features (mentioned in Section II-B.4) are calculated for the slots. Thirdly, the feature vectors of the slots are fed into the HMM training process [29]. Finally, the trained HMM estimates, for each slot, the posterior probabilities of the activities in the activity set. For each slot, the activity with the highest probability wins and is assigned to the slot. Slots then vote for windows, and windows vote for clusters.
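The final slot-to-window-to-cluster voting can be illustrated with a short sketch. It assumes the trained HMM has already produced a posterior distribution per slot, represented here as a plain dictionary; the function names are hypothetical, and ties are broken in favour of the label seen first, which the text does not specify.

```python
from collections import Counter

def slot_label(posteriors):
    """Pick the activity with the highest posterior probability for one slot."""
    return max(posteriors, key=posteriors.get)

def majority(labels):
    """Most frequent label; ties resolve to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]

def vote_cluster(cluster):
    """Label a cluster by letting slots vote for windows and windows for the cluster.

    cluster: list of windows; each window is a list of per-slot posterior
    dicts mapping activity name to probability (ten slots in the paper).
    """
    window_labels = [
        majority([slot_label(slot) for slot in window]) for window in cluster
    ]
    return majority(window_labels)
```

Because the decision is taken by majority at two levels, a few slots or windows labeled incorrectly leave the cluster label unchanged.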
The HMM can naturally identify ADLs by modeling the temporal dependencies between consecutive activities, and numerous strong results have been obtained by using HMMs for modelling ADLs [28]. Thus, an HMM is employed in this research to recognize activities from the sensing IMU data.
The HMM for activity recognition can be expressed as a five-tuple $\varphi = (M, N, \pi, A, B)$, where $M$ is the number of invisible (hidden) states, i.e., activities. Since there are 6 activities in the activity set, the number of invisible states is set as $M = 6$; $N$ is the number of observation values, which will be described later; $\pi$ is the initial state distribution corresponding to the invisible states, $\pi = \{\pi_m\}$, $m = 1, \ldots, M$, with $\sum_{m=1}^{M} \pi_m = 1$; $A$ is the state transition matrix of size $M \times M$, which describes the transition probability between two states; $B$ is the emission matrix of size $N \times M$, which describes the emission distribution of the HMM. The ADLs dataset for older adults that we created is used to train the HMM.
The HMM for our research was implemented with the Seqlearn library for Python.
The initialization conditions for training the HMM are summarized as follows: 1) The number of invisible states is M = 6, since we take 6 activities into consideration; 2) The number of observation values is N = 26; 3) The initial state distribution is π_1 = 1, π_j = 0, j = 2, . . . , 6, corresponding to the 6 activities; 4) The initial state transition matrix A follows a uniform distribution by general principle [10]; 5) The initial emission matrix B also follows a uniform distribution.
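The five initialization conditions translate directly into three arrays. The sketch below builds them with plain Python lists; it does not use the Seqlearn API, and the function name is hypothetical. The emission matrix is laid out as N × M to match the definition above, so each column (one per state) sums to one.

```python
M = 6   # number of invisible states (activities)
N = 26  # number of observation values

def initial_hmm(m=M, n=N):
    """Return (pi, A, B) for the initialization described in the text.

    pi: all initial probability mass on the first state.
    A:  m x m transition matrix, every row uniform.
    B:  n x m emission matrix, every column uniform over the n observations.
    """
    pi = [1.0] + [0.0] * (m - 1)
    A = [[1.0 / m] * m for _ in range(m)]
    B = [[1.0 / n] * m for _ in range(n)]
    return pi, A, B
```

Starting from uniform A and B leaves all structure to be learned from the labeled ADLs dataset during training.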

A. EVALUATION ON WALKING SPEED ESTIMATION SUBSYSTEM
Various protocols for measuring walking speed exist in the literature, such as Timed-Up and Go (TUG) and Timed 25 Feet Walk (T25FW). These physical performance instruments are unreliable for elderly populations, as they suffer from significant intra-individual test-retest variability [25]. Habitual Gait Speed (HGS) is reported to be reliable and is considered a useful indicator in clinical trials [30]. HGS is easy to measure and requires no doctor or clinical equipment. Therefore, HGS is chosen as the ground truth to evaluate the range-data-based walking speed estimation subsystem. The automatic labeling system we proposed consists of two subsystems, the walking speed estimation subsystem and the posture estimation subsystem, and we evaluate these two subsystems separately.
Fifteen older adults were recruited to take part in the evaluation experiment. The distance used in the HGS measurement is the main factor influencing the accuracy of the measured gait speed. HGS measured over 4 metres has been reported to have excellent reliability in clinical trials [30]. In our experiment, the participants were asked to walk a 5.5-metre path at their normal speed, and to repeat the test 5 times. Before the experiment, the participants could practice walking on the path. The 2D laser scanner was mounted on the ground just beside the walking path, as shown in Fig. 8. The range data was collected while the participants were walking on the path. At the same time, we timed each walk with a stopwatch to calculate the ground truth walking speed.
Since we estimate the walking speed of each step, the absolute error range, mean absolute error, and error variance are employed to validate the walking speed estimation subsystem, as shown in Table 1. The overall intraclass mean absolute error is 0.06 m/s. A slightly higher mean absolute error was reported for the youngest female participant (age 57, walker), with the highest error of 0.11 m/s. However, even this highest error is small when set against the mean absolute error (0.06 m/s), which is evidence of the accuracy of the walking speed estimation subsystem. The slower the walking is, the more accurate the estimation is. As far as we know, most walking activities of older adults in free-living situations are slow, below 0.60 m/s. Considering the manual operation error (timing with a stopwatch), the accuracy of our walking speed estimation subsystem is reasonable and acceptable for labeling the dataset in the complementary way.
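The three error statistics can be computed as in the following sketch. The 5.5 m path length is taken from the experiment description; the function names are hypothetical, the choice of population variance is an assumption, and the numbers in the usage note are made up for illustration and are not the values of Table 1.

```python
from statistics import mean, pvariance

PATH_LENGTH_M = 5.5  # length of the walking path used in the experiment

def ground_truth_speed(elapsed_s):
    """Reference walking speed in m/s from one stopwatch timing."""
    return PATH_LENGTH_M / elapsed_s

def error_summary(estimated, reference):
    """Absolute error range, mean absolute error, and error variance.

    estimated, reference: equal-length sequences of speeds in m/s.
    Assumption: the variance is the population variance of the absolute errors.
    """
    errors = [abs(e - r) for e, r in zip(estimated, reference)]
    return {
        "range": (min(errors), max(errors)),
        "mae": mean(errors),
        "variance": pvariance(errors),
    }
```

A walk timed at 11 s gives a reference speed of 0.5 m/s. Comparing hypothetical estimates `[0.5, 0.75, 0.25]` against that reference yields an error range of `(0.0, 0.25)` and a mean absolute error of about 0.17 m/s.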

B. EVALUATION ON THE POSTURE ESTIMATION SUBSYSTEM
The posture estimation is designed as a subsystem of the automatic labeling system and runs in parallel with the walking speed estimation in a complementary manner, since the posture estimation is assumed to be less accurate due to slow movements and the occlusion issue. Thus, the evaluation here is designed to check the complementarity between the walking speed estimation and the posture estimation. Firstly, the classification accuracy on the ADLs dataset we created is used to locate the limitations of the posture estimation subsystem. Then, the complementarity is checked in the misclassification areas, to find out whether the posture estimation can be refined by the estimated walking speeds.
The confusion matrix of classification accuracy is used to locate the limitations, as shown in Table 2. The classification accuracies of the six postures are shown in bold, while noticeable misclassification rates are shown in gray. The accuracy of the posture estimation subsystem is high (average accuracy 89.3%). As the posture estimation is designed as a subsystem of a ground truth labeling system, we look into the gray (misclassification) areas to check the complementarity between the walking speed estimation and the posture estimation. The biggest limitation of the posture estimation subsystem lies in distinguishing walking from standing, bending, and crouching; walking has the lowest recognition accuracy (78.5%). A graphical user interface (GUI) was developed to help us look into the misclassification areas, as shown in Fig. 9. With the help of the speed estimation, walking postures can be easily distinguished from the other three postures (walking speed larger than 0.1 m/s). With the still estimation (walking speed 0 m/s), standing postures that were misclassified as walking can be corrected, as in the first two rectangles of Fig. 9.
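Reading such a confusion matrix amounts to two small computations: the per-class accuracy on the diagonal and the off-diagonal cells that exceed some rate. The sketch below assumes rows are true classes and columns are predicted classes, that the average is an unweighted mean over classes, and a hypothetical 5% threshold for a "noticeable" misclassification; the three-class matrix in the usage note is invented and is not Table 2.

```python
def per_class_accuracy(matrix):
    """Fraction of each true class (row) that lands on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def average_accuracy(matrix):
    """Unweighted mean of the per-class accuracies."""
    acc = per_class_accuracy(matrix)
    return sum(acc) / len(acc)

def noticeable_confusions(matrix, labels, threshold=0.05):
    """List (true label, predicted label, rate) for off-diagonal rates >= threshold."""
    found = []
    for i, row in enumerate(matrix):
        total = sum(row)
        for j, count in enumerate(row):
            if i != j and count / total >= threshold:
                found.append((labels[i], labels[j], count / total))
    return found
```

For the invented matrix `[[90, 10, 0], [5, 80, 15], [0, 4, 96]]` with labels walking, standing, and bending, the per-class accuracies are 0.90, 0.80, and 0.96, and three confusions are flagged: walking as standing, standing as walking, and standing as bending.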
The remaining limitations lie in identifying bending and crouching. Since these two postures are relatively static, the walking speed cannot refine the posture estimation. We further looked into the classification areas of these two postures, and found that the misclassifications were scattered rather than clustered together. Because the misclassification rates are much lower than the accuracy rates, the misclassifications of these two postures do not affect the final voting labels.

C. DOUBLE-CHECK ON THE AUTOMATIC LABELING SYSTEM
Since the automatic labeling system is designed to have extremely high accuracy, we double-checked the reliability of the system by running a long test (four days) under the supervision of three human raters, before generating the reliable ground truth. The three raters went through the labels generated by the automatic labeling system with the help of the GUI. The GUI organizes the labels along with the discovered activity clusters, the voting percentages, the walking speeds, and the estimated postures, as shown in Fig. 10. The raters can zoom in and out to check the details or skip ahead quickly. They can also click on a point to view the RGB images and confirm the final label. In the long test, only 32 cases in which the winning voting percentage was comparatively low (around 74%) were reported, due to occlusion and partial self-occlusion, as shown in Fig. 11. We confirmed that these cases did not affect the final labels. We therefore believe that the automatic labeling system is reliable enough to generate ground truth for training and testing.
Table 3 gives the evaluation of the coarse-to-fine ADLs recognition algorithm in confusion matrix form. The recognition algorithm produces promising results, with accuracy over 91%. Lower accuracies are obtained in recognising bending and crouching, similar to the results of the posture estimation. This similarity indicates that identifying bending and crouching is challenging. Older adults spend most of their time lying and sitting, followed by standing and walking, and spend the least time bending and crouching. From the results shown in the table, we conclude that the longer the activity duration is, the more accurate the activity recognition is. Compared with bending and crouching, the accuracy in identifying standing and walking is higher. Since recognizing these two activities is more important in modeling activity shifts among the elderly population, this higher accuracy makes our ADLs recognition algorithm more valuable in practice. The highest accuracies are obtained in recognizing lying and sitting, which are rest activities with small motions (less energy consumption). Duration statistics of these two activities are useful for habit preference modeling.

IV. DISCUSSION AND CONCLUSION
To address the challenges in ground truth generation, we propose an automatic data collecting and labeling system. An unsupervised machine learning method is employed for activity discovery, to deal with the ambulatory and trivial ADLs. Then, a multiple-sensor fusion strategy is used to interpret and annotate the discovered ADLs. The system was double-checked in a long test and found to be reliable. The unsupervised machine learning method is the key to the automatic labeling system. Because the boundaries in the definition of ADLs are ambiguous, it is impractical for human raters to discover and annotate ADLs. The unsupervised machine learning method gives us the opportunity to interpret ADLs as a whole.
The unsupervised machine learning method is also the key to the coarse-to-fine ADLs recognition algorithm, since fragmented activities make ADLs modeling unrealistic. Moreover, discovering activities first and then recognising them is more natural than recognising activities directly. Another advantage of our coarse-to-fine recognition algorithm is its high accuracy. Because activities are first discovered as big clusters, misclassifications in a few sliding windows do not affect the final detection results.
Although the automatic data collecting and labeling system is considered reliable, the activity discovery model does not adapt to individual people. In order to deploy personalized activity discovery and target identification, we plan to model the three key parameters of the unsupervised machine learning method in association with other sensing signals, such as heart rate and blood pressure. Moreover, we plan to work with local independent aging facilities to create a large ADLs dataset and make it available online.