A Random Forest Approach to Body Motion Detection: Multisensory Fusion and Edge Processing

Low-complexity and privacy-respecting human sensing is a challenging task in smart environments, as it requires the orchestration of multiple sensors, low-impact machine learning (ML) methods, and resource-constrained Internet of Things (IoT) devices. Client/server architectures are typically employed to support sensor fusion. However, these architectures require data to be moved to/from the cloud or data centers, which conflicts with the fundamental requirement of IoT applications to limit costs, complexity, memory footprint, processing, and communication resources. In this article, we propose the design and implementation of an integrated edge device targeting human sensing for the indoor smart space applications envisioned in Industry 5.0. The proposed device implements the cumulative sum (CUSUM) method for data distillation from multiple sensors and adopts a low-complexity random forest algorithm (RFA) to sense and classify body movements; in particular, the device integrates both infrared (IR) and ultrasonic (US) sensors. This article discusses the benefits of the combined use of the CUSUM and RFA methods against classical ML approaches in terms of accuracy, complexity, computing time, and storage. The proposed architecture and processing steps are validated experimentally by targeting the fall detection problem in a smart space environment. RFA reduces the complexity by at least three times compared to classical ML tools based on the analysis of spatial and temporal features (convolutional neural networks and long short-term memory): the processing time is on the order of 0.1 s, while the accuracy is about 94%.

solution [4]. However, due to the limited computational and communication capacity of edge devices, it is also essential to apply low-complexity but high-accuracy designs [5] for the processing, learning, and inference stages. For example, data distillation methods, such as feature extraction and processing [4], must be implemented inside the IoT devices themselves to avoid moving large-size data over bandwidth- or energy-limited networks.
Through feature-based distillation, the field devices, or sensors, that produce data are responsible for part of the data preprocessing, as they share and distribute features rather than raw data. This must be accomplished not only to respect data ownership, integrity, and privacy but also to limit complexity by reducing the data dimensionality to the small set of features/parameters exchanged.
This article targets the design, implementation, and validation of an integrated edge device with real data for human sensing in indoor smart spaces. The device is equipped with multiple sensors and implements real-time sensor fusion, feature extraction, and edge processing methods optimized to achieve good accuracy with a significant reduction in memory, processing time, and complexity. In particular, this article proposes the adoption of a cumulative sum (CUSUM) approach for efficient data distillation from multiple sensors. It also compares different machine learning (ML) approaches to support joint body motion and fall detection applications. A decision tree classifier, based on the random forest algorithm (RFA), is proposed and compared with classical data-intensive neural network-based approaches in terms of algorithm complexity, memory and processing time requirements, and accuracy. Fig. 1(a) and (b) shows the proposed sensor fusion architecture: the edge device is equipped with infrared (IR) and ultrasonic (US) sensors, preprocessing units (PUs), and a feature fusion center (FC). The IR and US sensors acquire data independently, namely, thermal and range information, and send them to the PUs for feature extraction. Each PU applies raw data distillation by implementing the CUSUM algorithm [30], [31], from which features are extracted and forwarded to the FC [implemented on the microcontroller unit (MCU)]. The FC performs feature selection and aggregation. It provides the results for the learning and decision steps based here on RFA. Feature evaluation and decision tree-based processing are performed to reduce the data complexity and mitigate the computational cost at the edge.
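As a concrete illustration of this pipeline, the PU/FC split can be sketched in a few lines of Python. This is a minimal sketch, not the implemented firmware: the class names, the drift constant, and the background-tracking rule are illustrative assumptions.

```python
# Minimal sketch of the edge pipeline in Fig. 1: each preprocessing unit (PU)
# distills one raw sensor stream into a scalar CUSUM feature, and the fusion
# center (FC) aggregates the per-sensor features into one input vector for
# the classifier. Names and constants are illustrative assumptions.

class PreprocessingUnit:
    """Distills a raw sensor stream into a running CUSUM feature."""
    def __init__(self, drift=0.1):
        self.drift = drift
        self.g = 0.0       # running CUSUM statistic
        self.mean = None   # background (empty-environment) level estimate

    def update(self, sample):
        if self.mean is None:
            self.mean = sample
        # one-sided CUSUM recursion: accumulate deviations above the drift
        self.g = max(0.0, self.g + abs(sample - self.mean) - self.drift)
        self.mean = 0.99 * self.mean + 0.01 * sample  # slow background tracking
        return self.g

class FusionCenter:
    """Collects the per-sensor CUSUM features into one feature vector."""
    def __init__(self, pus):
        self.pus = pus

    def step(self, samples):
        return [pu.update(s) for pu, s in zip(self.pus, samples)]

# Two PUs: one "IR pixel" (temperature) and one "US range" stream.
fc = FusionCenter([PreprocessingUnit(), PreprocessingUnit()])
for _ in range(50):
    feats = fc.step([20.0, 1.5])   # empty room: readings stay at baseline
assert feats == [0.0, 0.0]         # no change -> CUSUM stays at zero
feats = fc.step([26.0, 0.8])       # person appears: both streams deviate
assert feats[0] > 0 and feats[1] > 0
```

The classifier then consumes only the small feature vector produced by the FC, which is the data-reduction step motivating the architecture.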

A. Related Works
Human-centric smart building applications are part of Industry 5.0 trends and target real-time human body monitoring in indoor/outdoor areas [7], [8]. Passive sensing of people/workers and monitoring of their health conditions can be based on wireless local area network (WLAN) signal processing tools, as reviewed in [9]. For example, a device-free fall detection algorithm, based on the analysis of ambient radio signals, is proposed in [10] by exploiting low-power IEEE 802.15.4-compliant devices already deployed in the monitored area for machine-to-machine (M2M) communication purposes. Besides WLAN sensing, multisensor fall detection algorithms are analyzed by implementing information fusion of features extracted from experimental data collected by different sensors [11], namely, triaxial accelerometers, micro-Doppler radars, and depth cameras. However, compared with this approach, device-free sensing does not require the subject to wear any device; thus, it does not disclose any personal data (i.e., it is privacy-preserving by design).
In order to improve the accuracy and the efficient use of multiple sensors, a cloud-based multisensor fusion is proposed in [12], while in [13], a fusion algorithm is developed to combine range information from different radar sensors. In this case, a radar data fusion system is implemented for human gait estimation, in addition to fall detection. An approach capable of selecting the best fusion architecture (from predefined options) for a specified sensor dataset is also proposed in [14], where the sequential forward floating selection (SFFS) algorithm is exploited to reduce the number of features and limit it to a given value.
Smart buildings and industrial environments resort to a massive deployment of resource-constrained IoT devices, sensors, and actuators. Sensors generate a large volume of data in real time, which is an appealing target for AI and data fusion systems [15]. However, deploying data-intensive ML models on such end devices is nearly impossible due to limited computing power, missing specialized AI processing units, and memory constraints. In some applications, the cost of building the computing/communication infrastructure and the related fees to maintain it over time are also much higher than the benefits obtained from gathering and analyzing the same endpoint data. Besides costs, there are also concerns over bandwidth limitations and processing time, as data need to be sent to the cloud, while the analysis results must be returned to the endpoints in a timely manner.
Edge and fog computing methods [16], [17], [18] have been proposed to migrate computations from the data center to the sensors producing the data, or to other colocated devices, thereby reducing the system latency, bandwidth occupation [19], and computing resources [20]. Processing at the edge also allows inference to be performed near the data source; on the other hand, resource constraints prevent the adoption of data-intensive and/or complex models [21].
Existing works on ML-based activity recognition [23], [24] exploit long short-term memory (LSTM) [25] and convolutional neural networks (CNNs) [26] architectures as they take advantage of temporal and spatial features, respectively. Focusing on low-complexity designs, RFA-based methods are expected to provide comparable accuracy to LSTM and CNN while being more easily implemented on embedded devices with limited computational resources [27].
Running ML tools on embedded edge devices brings new challenges that include: 1) data distillation, to extract low-dimensional features from raw training data; 2) ML model optimization for resource-constrained computing environments; 3) real-time performance (latency and computing time); and 4) accuracy versus complexity tradeoff analysis.
To address these issues, prior works on ML at the edge focused on deep learning (DL) methods [16], [17], [18]. Typical approaches keep the ML model small by using few trainable parameters, minimizing the number of computations [15], and adopting quantization, sparsification, pruning [22], and DNN partitioning [28]. However, these methods remain debatable in dynamic network scenarios with resource-constrained devices, since they require additional preprocessing steps, increased communication with edge/cloud servers due to offloading, or specific hardware support to maintain a reasonable accuracy.
Unlike previous studies, this article proposes a data-agnostic architecture that combines a CUSUM technique [29], [30] for data distillation, followed by RFA processing. CUSUM implements a semiblind change point detection and is adopted for feature extraction and fusion of raw sensor data. CUSUM was originally proposed for anomaly detection [30] and recently adopted also for data segmentation [6], [31]. Compared to raw data, CUSUM produces low-dimensional features [6] suitable for resource-constrained devices. These features are fed to an RFA model designed to classify human worker motions.

B. Contributions
This article proposes the design, implementation, and validation of an edge device that integrates IR and US sensors targeting body motion and activity recognition for smart space applications. The device supports preprocessing and feature-based data fusion techniques optimized for implementation on resource-constrained hardware. Rather than sharing, and directly processing, the raw data from the IR and US sensors, the proposed technique performs low-dimensional feature extraction by exploiting, for the first time, a change point detector tool based on the CUSUM approach. Features are then used as inputs for body motion detection, which is performed by an ad hoc RFA. The approach is shown to be more efficient than classical DL-based approaches in terms of complexity, computing time, storage, and accuracy. Besides, the proposed approach does not need data to be moved to the cloud and thus protects data ownership, integrity, and safety.
The proposed approach is analyzed in terms of complexity, namely, the expected number of operations to produce a detection output, computing time for training/inference, memory footprint, and accuracy. The algorithm also performs the selection of the best pool of features to maximize the detection performance targeting the human body fall detection problem.
This article extends [31] with novel contributions summarized as follows.
1) An edge-based feature processing model that integrates multiple sensors, data distillation, and data fusion is proposed for body motion and activity recognition applications. The CUSUM method is used to extract features (i.e., for data distillation), while feature selection, grouping, and fusion are based on the RFA method optimized to recognize critical body motion events.
2) An edge-based integrated board is designed to implement the proposed algorithm: it includes the sensors, the PUs, and the FC. The multisensor board is equipped with a low spatial resolution IR sensor, composed of 64 thermopile elements arranged as a squared array of 8 × 8 IR pixels, and a US sensor for US ranging. Hardware and sensor integration are also discussed.
3) The approach is validated experimentally in an indoor environment. In particular, the study focuses on body walk versus fall classification, which is critical in smart space applications. The RFA method is also compared against popular data-intensive ML designs based on CNN and LSTM [32], [33].

This article is organized as follows. Section II discusses the CUSUM-based sensor fusion model, the feature extraction, and the feature reduction method, tailored for IR arrays and US sensors. Section III describes the proposed RFA method for activity recognition (i.e., body walking versus falling). Section IV analyzes a case study for fall detection in an indoor smart space environment. Accuracy-complexity tradeoffs, training, and processing time are analyzed in detail. Finally, Section V draws the conclusions and discusses current limitations and future works.

II. EDGE-BASED SENSOR FUSION MODEL: SENSOR DATA AND FEATURE PROCESSING

This section analyzes the data fusion architecture targeting the integration of the IR and US sensors. It includes the data acquisition and the feature fusion models for the human sensing problem. Among the various sensors and IoT technologies, thermal vision systems based on low-cost IR array sensors allow the tracking of thermal signatures induced by moving people. Moreover, US sensors estimate the distance from obstacles or other people to improve safe indoor navigation. Unlike contact tracing applications based on short-range communications, IR- and US-based sensing systems are passive, as they require neither the cooperation of the subject(s) nor the use of any wearable device. Thus, they do not pose any threat to user privacy and are also suitable for deployment in large spaces such as smart buildings or industrial workplaces [31]. Fig. 2(a) shows the general multifeature fusion model and the components of the integrated multisensor platform, namely, the edge device and its functionalities. The integrated edge device includes the sensor devices ($\mathrm{SD}_k$), the corresponding processing units ($\mathrm{PU}_k$), and the FC. Each sensor device produces the raw data that are fed to the PUs (Section II-A), while the PUs implement a change detector algorithm and extract the local features obtained from the CUSUM metric (described in Section II-B). Finally, the FC learns a global RFA model for body motion detection using the local CUSUM features. In addition, it also performs the feature selection, grouping, and fusion stages.
Human body movements may be captured by exploiting both the temperature perturbations in the readings of the IR array sensor $\mathrm{SD}_1$ and the changes in the distance measurements produced by the US sensor $\mathrm{SD}_2$. Fig. 2(b) shows the setup and the classification problem targeting body motion detection (in our case, fall versus nonfall events) using the data collected by both sensors and their corresponding features. During the learning phase of the RFA model, the FC selects an optimal subset of features (i.e., number and types) that provides the maximum accuracy and the lowest training time. As described in Section III, each tree makes a local decision using the feature subset selected in the training phase. A final decision is made by majority voting over all individual trees.
In the following, we introduce the statistical model for the raw data captured by the IR array and the US sensor sources [31], [34]. Notice that the proposed model and method can also serve as a general framework for multisensor deployments. Data acquisition and CUSUM-based feature extraction steps are described as well.

A. Multisensor Data Processing
IR sensors typically consist of a single or multiple arrangement (i.e., array) of thermopile elements equipped with lenses, mostly deployed in indoor scenarios to estimate the temperature of the environment and the objects/people therein [34], [35]. For instance, passive IR (PIR) sensors are widely employed in consumer security systems [36], also in conjunction with radar sensors [37], to detect the presence of people in the field of view (FoV) of the sensor. However, IR arrays have recently been employed not only for human body occupancy detection but also for localization and tracking applications [34].
Typical US sensors are noncontact micro-electromechanical system (MEMS) devices that provide only range (i.e., distance) information about targets inside the sensor FoV. Since they do not extract any visual image of the target(s), as in the case of low-resolution IR sensor arrays, both sensors are privacy-preserving and can also be fused together to improve the detection accuracy.
1) IR Sensor Array: Low-cost and low-resolution IR sensor arrays, presently available on the market, usually consist of M thermopile elements arranged as a single element (m = 1) or multiple elements (m = 1, . . . , M) in a 1-D linear or 2-D grid, whose format depends on the specific IR device. In our case, the IR array, namely, the Grid-EYE [38], acquires raw thermal images organized as 2-D frames of 8 × 8 pixels (M = 64), while the frame sampling frequency of the array is set to 10 Hz. The IR array includes a built-in silicon lens with an FoV of 60° and features 8 pixels vertically and 8 pixels horizontally, with an opening angle of about 7.5° per pixel (i.e., a very low spatial resolution). This corresponds to the approximate full-width at half-maximum (FWHM) value. The sensor is designed for IR wavelengths in the range of 5-13 μm with a noise equivalent temperature difference (NETD) of 0.16 °C according to [38].
IR arrays can be wall- or ceiling-mounted [39]. However, without any loss of generality, we consider a ceiling-mounted device, as shown in Fig. 3(a). During the experiments, a person enters an area of about 12 m² and walks or falls several times, each for a period of approximately 30 s. Fig. 3 shows an example of human body motion captured by the mth thermopile of the IR array sensor (the one with index m = 30) during fall and walk activities, with the empty environment taken as a reference. As the example confirms, when a person enters the area and walks (blue color), the temperature readings with respect to the empty environment increase while the target moves within the FoV of the considered thermopile element (see the corresponding peaks in the figure). On the contrary, the samples highlighted in yellow indicate a person who loses balance and falls. These informative patterns are captured and used to distinguish walk and fall activities from the empty environment case.
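To give a sense of the spatial resolution involved, the footprint of one pixel on the floor can be estimated from the opening angles quoted above and the 2.7 m mounting height used in Section IV. This is a back-of-the-envelope sketch assuming an ideal pinhole geometry, not a calibrated optical model:

```python
import math

# Back-of-the-envelope footprint of one 8x8 Grid-EYE pixel when the array
# is ceiling-mounted at h = 2.7 m (the mounting height of Section IV),
# assuming the nominal 60-degree FoV and ~7.5 degrees per pixel quoted
# above. Real optics will deviate from this ideal model.

def footprint_width(height_m, angle_deg):
    """Width on the floor subtended by a given full opening angle."""
    return 2.0 * height_m * math.tan(math.radians(angle_deg / 2.0))

h = 2.7
pixel = footprint_width(h, 7.5)   # one pixel
total = footprint_width(h, 60.0)  # whole 8x8 array

print(f"per-pixel footprint ~{pixel:.2f} m, array coverage ~{total:.2f} m")
assert 0.3 < pixel < 0.4
assert 3.0 < total < 3.2
```

A per-pixel footprint of roughly a third of a meter is consistent with the "very low spatial resolution" characterization above: the array sees coarse thermal blobs, not images of the subject.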
The sensor readings at time t are collected in the vector $\boldsymbol{\varepsilon}_t^{(\mathrm{IR})} = [\varepsilon_{t,1}, \ldots, \varepsilon_{t,M}]^{\mathrm{T}}$ and depend linearly on the body occupancy vector $\mathbf{r}_t = [r_{t,1}, \ldots, r_{t,m}, \ldots, r_{t,M}]^{\mathrm{T}}$, whose elements $r_{t,m} \in \{0, 1\}$ provide binary information about the presence, or absence, of the subject at time t as detected by the mth thermopile element. In the following, we also define the human body state $\Theta_{t,m}$ at time t to represent the activity of the subject in the environment (i.e., fall versus nonfall state) detected by the mth thermopile element at time t. The vector $\boldsymbol{\varepsilon}_t$ is modeled as
$$\boldsymbol{\varepsilon}_t = \mathbf{s}(\Theta_t) \odot \mathbf{r}_t + \mathbf{w}_t$$
where, for each mth element, $s_m(\Theta_{t,m})$ denotes the body-induced temperature contribution. The background temperature $\mathbf{w}_t$ is described by the column vector $\mathbf{w}_t = [w_{t,1}, \ldots, w_{t,M}]^{\mathrm{T}}$ that conveys information about noisy heat sources that are not caused by body movements but still characterize the empty space. As clarified in the following, the background temperature in the absence of the target is modeled by an autoregressive (AR) model [6].
2) US Sensor: The adopted US sensor is a time-of-flight (ToF) sensor, namely, the CH201 from TDK [40], which internally includes a MEMS-based piezoelectric micromachined ultrasonic transducer (PMUT) [41] and a system-on-chip (SoC) unit for range preprocessing. The US device uses the PMUT to transmit (and receive) short pulses of sound waves through the air [31] (and references therein), with an FoV of about 45°. The US sensor range readings are collected in the vector $\boldsymbol{\varepsilon}_t^{(\mathrm{US})}$. As shown in [31], the round-trip time $t_s(\Theta_{t,k})$, due to the sound waves' propagation from and back to the PMUT, depends on the human body state $\Theta_{t,k}$ and is measured by the built-in SoC as
$$t_s(\Theta_{t,k}) = \frac{2\,\big\|\mathbf{p}_M - \mathbf{p}_S(\Theta_{t,k})\big\|}{c_{\mathrm{ac}}}$$
where $c_{\mathrm{ac}} = 343$ m/s is the speed of sound at room temperature (i.e., 20 °C), $\mathbf{p}_M = [x, y]^{\mathrm{T}}$ is the known transducer position, and $\mathbf{p}_S(\Theta_{t,k}) = [x_k, y_k]^{\mathrm{T}}$ is the position of the human body in state $\Theta_{t,k}$. Notice that the detection performance typically suffers from degradation in the presence of adverse environmental factors such as non-line-of-sight (NLOS) and multipath propagation effects [42], which typically induce biased received range data. However, these phenomena are generally unavoidable in real deployments, due to the inherent nature of the sound waves used for ranging, which are prone to be reflected, refracted, and diffracted when encountering different kinds of obstacles (e.g., walls, ceiling, and ground) and even human beings present in the monitored area [43], [44].
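The round-trip ToF relation underlying this measurement can be sketched as follows; the positions and function names are illustrative, not the CH201 firmware API:

```python
import math

# Sketch of the round-trip time-of-flight relation used by the US sensor:
# tau = 2 * distance / c_ac, with c_ac = 343 m/s at room temperature.
# Positions and names (round_trip_time, range_from_tof) are illustrative.

C_AC = 343.0  # speed of sound in air at ~20 C, m/s

def round_trip_time(transducer_pos, body_pos):
    """Round-trip ToF for a pulse reflected by the body."""
    return 2.0 * math.dist(transducer_pos, body_pos) / C_AC

def range_from_tof(tau):
    """Range recovered by the SoC from the measured round-trip time."""
    return C_AC * tau / 2.0

# Ceiling-mounted transducer at (0, 2.7) m, body on the floor at (1, 0) m.
tau = round_trip_time((0.0, 2.7), (1.0, 0.0))
assert abs(range_from_tof(tau) - math.dist((0.0, 2.7), (1.0, 0.0))) < 1e-9
```

The inversion is exact only in line-of-sight conditions; the NLOS and multipath effects discussed above bias `tau` and hence the recovered range.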

B. Change Detection and CUSUM Features
In what follows, we investigate the distillation and segmentation algorithms applied to the IR and US sensors' raw data. To this aim, we exploit the CUSUM method (see Fig. 2) to extract the features related to the changes in the sensor readings due to specific human body motion activities.
Considering here an indoor environment without targets ($\Theta_{t,m} = \emptyset$), the sensor reading time series $\varepsilon^{(\mathrm{IR})}_t$ and $\varepsilon^{(\mathrm{US})}_t$, obtained from the IR and US sensors, are modeled by a pth-order AR model [6]. More specifically, the signal produced at time t by the mth thermopile element of the IR array is modeled iteratively as
$$\varepsilon^{(\mathrm{IR})}_{t,m} = \sum_{i=1}^{p} \zeta^{(\mathrm{IR})}_{i,m}\, \varepsilon^{(\mathrm{IR})}_{t-i,m} + \zeta^{(\mathrm{IR})}_{0,m} + n_{t,m}$$
where the column vector $\boldsymbol{\zeta}^{(\mathrm{IR})}_m = [\zeta^{(\mathrm{IR})}_{0,m}, \ldots, \zeta^{(\mathrm{IR})}_{p,m}]^{\mathrm{T}}$ of size $p+1$ collects the background-specific AR parameters and $n_{t,m}$ is a Gaussian white residual with zero mean and variance $\sigma^2_m$. Likewise, focusing on the US sensor, the AR model is defined as
$$\varepsilon^{(\mathrm{US})}_{t} = \sum_{i=1}^{p} \zeta^{(\mathrm{US})}_{i}\, \varepsilon^{(\mathrm{US})}_{t-i} + \zeta^{(\mathrm{US})}_{0} + n_{t}$$
where the column vector $\boldsymbol{\zeta}^{(\mathrm{US})} = [\zeta^{(\mathrm{US})}_{0}, \ldots, \zeta^{(\mathrm{US})}_{p}]^{\mathrm{T}}$ of size $p+1$ includes the background-specific AR parameters and $n_t$ models the residual range as a zero-mean Gaussian white stochastic process with standard deviation $\sigma_u$. In this article, we assume $p = 1$, while the proposed AR models are validated experimentally in Section IV.
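As an illustration of the AR(1) background model, its two parameters (AR coefficient and constant term) can be recovered from a synthetic empty-environment trace by ordinary least squares. The numbers below are arbitrary demo values; the article fits the model to real IR/US recordings:

```python
import random

# Illustrative fit of an AR(1) background model on synthetic data: recover
# the AR coefficient and constant term of an "empty-environment" trace via
# ordinary least squares. (a_true, c_true, sigma) are demo assumptions.

random.seed(0)
a_true, c_true, sigma = 0.8, 4.0, 0.1
x = [c_true / (1 - a_true)]            # start at the stationary mean
for _ in range(1999):
    x.append(a_true * x[-1] + c_true + sigma * random.gauss(0, 1))

xp, xn = x[:-1], x[1:]                 # regress x[t] on x[t-1]
mp, mn = sum(xp) / len(xp), sum(xn) / len(xn)
a_hat = sum((a - mp) * (b - mn) for a, b in zip(xp, xn)) / \
        sum((a - mp) ** 2 for a in xp)
c_hat = mn - a_hat * mp                # the p + 1 = 2 AR parameters

assert abs(a_hat - a_true) < 0.05
assert abs(c_hat - c_true) < 1.0
```

The fitted pair plays the role of the empty-environment parameter vector against which the CUSUM detector tests for changes.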
A change in the environment, e.g., due to an object or a person standing or moving in the area covered by the IR and US sensors, alters the corresponding AR parameters $\boldsymbol{\zeta}^{(\mathrm{IR})}_m$ and $\boldsymbol{\zeta}^{(\mathrm{US})}$. Detection of changes in the AR parameters is implemented iteratively by monitoring the CUSUM function [30]. For a given inspection interval $[t-T, t]$ of duration $T$, the CUSUM $g_t$ is defined at time t for the IR array and US sensors as
$$g^{(\mathrm{IR})}_{t,m} = \Bigg(\sup_{\boldsymbol{\zeta}^{(\mathrm{IR})}_{1,m}} \sum_{j=t-T}^{t} \ell_{j,m}\Bigg)^{+} \quad \text{and} \quad g^{(\mathrm{US})}_{t} = \Bigg(\sup_{\boldsymbol{\zeta}^{(\mathrm{US})}_{1}} \sum_{j=t-T}^{t} \ell_{j}\Bigg)^{+}$$
respectively, where the operator $(\cdot)^+$ is defined as $(\cdot)^+ = \max[0, \cdot]$, while the terms $\ell_{j,m}$ and $\ell_j$ are the log-likelihood ratios for the $\boldsymbol{\zeta}^{(\mathrm{IR})}_{1,m}$, $\boldsymbol{\zeta}^{(\mathrm{IR})}_{0,m}$ and $\boldsymbol{\zeta}^{(\mathrm{US})}_{1}$, $\boldsymbol{\zeta}^{(\mathrm{US})}_{0}$ pairs, respectively. The supremum (sup) is computed over the (unknown) changed parameter sets $\boldsymbol{\zeta}^{(\mathrm{IR})}_{1,m}$ and $\boldsymbol{\zeta}^{(\mathrm{US})}_{1}$. For small changes of the $\boldsymbol{\varepsilon}_t$ dynamics with respect to the empty environment, as typically observed during body movements, the terms $\ell_{j,m}$, $\ell_j$ simplify to [30]
$$\ell_{j,m} \approx \nu_m\, \mathbf{u}_{\mathrm{IR}}^{\mathrm{T}} \mathbf{z}_{j,m} \quad \text{and} \quad \ell_{j} \approx \nu\, \mathbf{u}_{\mathrm{US}}^{\mathrm{T}} \mathbf{z}_{j}$$
where the constant terms $\nu, \nu_m > 0$ and $\mathbf{z}_j$, $\mathbf{z}_{j,m}$ are the score function vectors of size $(p+2) \times 1$ defined as in [30]. The unitary vectors $\mathbf{u}_{\mathrm{IR}}$ and $\mathbf{u}_{\mathrm{US}}$ indicate the predominant directions of the change that maximize the impact of the perturbations on the selected model parameters (i.e., AR parameters and residual standard deviation). In our case, we assume $\nu = \nu_m = 0.2$ and an AR-1 (i.e., $p = 1$) model; in addition, $\mathbf{u}_{\mathrm{IR}} = \mathbf{u}_{\mathrm{US}} = [\sin(\alpha), \cos(\alpha)]^{\mathrm{T}}$, with $\alpha = \pi/20$. The score functions further simplify to closed-form expressions of the AR residuals, as detailed in [30]. The CUSUM results from the IR and US sensors, $g^{(\mathrm{IR})}_{t,m}$, $m = 1, \ldots, M$, and $g^{(\mathrm{US})}_t$, are then transferred to the FC, and the human body state $\Theta_t$ is recognized using the fused feature set
$$\Phi(\Theta_t) = \big\{g^{(\mathrm{IR})}_{t,1}, \ldots, g^{(\mathrm{IR})}_{t,M},\, g^{(\mathrm{US})}_{t}\big\}. \tag{10}$$
Finally, the proposed RFA method uses the features (10) as input for the classification of body motion (fall detection).
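For intuition, the windowed, linearized CUSUM statistic can be exercised numerically. In this toy sketch, the two-element score vectors z_j are synthetic stand-ins, not the AR score functions of [30]:

```python
import math

# Toy evaluation of the windowed, linearized CUSUM statistic: over the
# inspection interval of T samples, the log-likelihood increments are
# approximated by nu * u^T z_j, with u = [sin(alpha), cos(alpha)] and
# alpha = pi/20 as assumed above, and the sum is clipped by (.)^+.
# The score vectors z_j below are synthetic two-element stand-ins.

ALPHA = math.pi / 20
U = (math.sin(ALPHA), math.cos(ALPHA))   # predominant change direction
NU = 0.2                                 # assumed change magnitude

def cusum_window(scores, T):
    """g_t = ( sum_{j=t-T}^{t} nu * u^T z_j )^+ over the latest window."""
    window = scores[-(T + 1):]
    s = sum(NU * (U[0] * z[0] + U[1] * z[1]) for z in window)
    return max(0.0, s)

background = [(0.0, 0.0)] * 10            # empty room: zero-mean scores
assert cusum_window(background, T=5) == 0.0
perturbed = background + [(1.0, 2.0)] * 5  # a change perturbs the scores
assert cusum_window(perturbed, T=5) > 0.0
```

The statistic stays at zero while the AR model fits the data and grows as soon as the scores acquire a component along the assumed change direction, which is what the FC thresholds downstream.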
Note that CUSUM-based processing is a suitable solution for handling heterogeneous sensor inputs (i.e., it is not specific to IR and US signals), as it can accurately capture environmental changes by analyzing multidimensional time series regardless of the raw data sources [6].

C. Edge-Based Board: Hardware and Sensor Integration
In this section, we present the edge-based integrated board that has been used to experimentally verify the proposed feature fusion methods. The edge device is assumed to be ceiling-mounted, as shown in Fig. 3(a). The developed edge board is shown in Fig. 5, which also highlights the specific hardware architecture and the related design details. The board integrates the following main components.
1) A power management integrated circuit (PMIC) that provides four voltage levels as outputs (i.e., 1.0, 1. to external units (e.g., Modbus devices) for further processing. The transceiver is interfaced to the ARM Cortex R4F through its embedded UART. The device may also be equipped with an optional module supporting a LoRaWAN radio for remote data sharing. In this case too, the LoRaWAN module is interfaced to the ARM Cortex R4F through another embedded UART. In Section IV, we will exploit the RS485 interface for data extraction, while the LoRaWAN radio module will not be considered.

III. RANDOM FOREST MODEL AND DECISION-MAKING FOR HUMAN SENSING

This section discusses the human body fall detection problem by exploiting the proposed multisensory fusion setup. Human body motions are detected through the CUSUM features, which discriminate a body fall event from a (safe) walking activity of a person inside the monitored space. In what follows, we describe the classification problem and the adopted RFA method based on the fused multisensory features.

A. Random Forest and Decision Tree Ensemble Model
Decision trees are frequently encountered in ML applications and widely used for classification and regression [45], also on resource-constrained edge devices [33], [46]. A decision tree is represented in general as a function $T : \mathbb{R}^n \to \mathbb{R}$ on an n-dimensional feature space. It is defined by recursively partitioning the input feature space $\Phi \subseteq \mathbb{R}^n$ into regions and implementing a binary test in each resulting region. The function is represented by a tree consisting of nodes (each representing a test) and branches (the specific test outcomes). Classification is typically based on the analysis of the paths in the tree. One way to improve the performance of decision tree classifiers is to combine the responses of multiple trees. In [47], a set of decision trees is trained using a random subset of the input features. This has the effect of decorrelating the outputs of the individual trees in the forest, combining the advantages of bagging and random feature selection [48], and thus improving the performance. RFA is reported to be faster than boosting and bagging, and it is robust to outliers and noise. It is also an effective solution for handling large datasets and missing data, as often encountered in sensor fusion applications [49].
The RFA model $\mathcal{R}$ is described as an ensemble of N decision trees $\mathcal{R} = \{T_i\}_{i=1}^{N}$, where each classifier tree $T_i$ contributes a single vote for the assignment of the most frequent class to the input vector $\Phi$. For the proposed sensor fusion setup detailed in Section II, we collect $M + 1 = 65$ CUSUM terms, where 64 are obtained from the IR sensor array (8 × 8) [50] and 1 from the US sensor readings. Next, we select a group of $k_f \le M + 1$ features targeting a fall detection accuracy of 0.9. The feature group serves as the input feature space in (10) for $\mathcal{R}$. Finally, the RFA [51] uses a randomly selected subset of $f$ features $\Phi_i \subseteq \Phi$ as an input to each tree, while each tree grows independently using the C4.5 algorithm [52].
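A from-scratch miniature of this ensemble is sketched below: each tree is reduced to a one-level decision stump trained on a bootstrap sample and a random subset of f = 8 out of 65 features, and the forest classifies by majority vote. This is a deliberately simplified stand-in for the C4.5-grown trees of the article, run on synthetic data:

```python
import random

# Miniature random forest: bagging + random feature subsets + majority
# voting, with each "tree" reduced to a one-level decision stump. A toy
# stand-in for the C4.5-grown trees of the article; all data synthetic.

def train_stump(X, y, feat_ids, thresholds=(-0.5, 0.0, 0.5)):
    """Pick the (feature, threshold, sign) with the fewest training errors."""
    best = None
    for j in feat_ids:
        for thr in thresholds:
            for sign in (1, -1):
                err = sum(int(sign * (x[j] - thr) > 0) != t
                          for x, t in zip(X, y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda x: int(sign * (x[j] - thr) > 0)

def train_forest(X, y, n_trees=25, f=8, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        boot = [rng.randrange(len(X)) for _ in range(len(X))]  # bagging
        feats = rng.sample(range(len(X[0])), f)  # random feature subset
        trees.append(train_stump([X[i] for i in boot],
                                 [y[i] for i in boot], feats))
    return trees

def predict(trees, x):
    return int(sum(t(x) for t in trees) * 2 > len(trees))  # majority vote

# Toy data: 65 weakly informative "CUSUM features" per sample.
rng = random.Random(1)
y = [rng.randrange(2) for _ in range(80)]
X = [[(2 * t - 1) * 0.8 + rng.gauss(0, 1) for _ in range(65)] for t in y]
forest = train_forest(X, y)
acc = sum(predict(forest, x) == t for x, t in zip(X, y)) / len(X)
assert acc > 0.8
```

Even though each stump is individually weak, the vote over 25 decorrelated stumps is accurate, which is the mechanism the RFA exploits.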

B. Motion Classification: Body Fall Versus Walking Detection
In the following, we tackle the problem of body fall versus walking detection. The goal is to recognize the human body state $\Theta_t \in \{\Theta_s, \Theta_f\}$ at time t, where $\Theta_s$ represents a subject in a safe state, i.e., walking/moving inside the coverage area, and $\Theta_f$ indicates a subject falling, comprising the pre-fall and fall events.
The change detector scheme described in Section II-B is used first to detect the body presence: the CUSUM features $\Phi(\Theta_t)$ in (10) are thus computed continuously with an inspection interval of $T = 5$ samples for both sensors (corresponding to 1 s for a $T_s = 200$ ms sampling time). Motion events are segmented using two different thresholds applied to the IR array $g^{(\mathrm{IR})}_{t,m}$ and US $g^{(\mathrm{US})}_t$ CUSUM functions, optimized as shown in [31]. The segmented CUSUM features, indicating the presence of a subject, are then used as an input to the RF algorithm for classification.
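The segmentation step can be sketched as a simple thresholding of the CUSUM sequence, grouping consecutive above-threshold samples into motion events. The threshold value here is arbitrary, whereas the article optimizes it per sensor as in [31]:

```python
# Sketch of CUSUM-based event segmentation: a motion event is declared
# whenever the statistic exceeds a sensor-specific threshold, and
# consecutive above-threshold samples form one segment. The threshold
# and the sample values below are illustrative.

def segment(g, threshold):
    """Return (start, end) index pairs where the CUSUM g exceeds threshold."""
    segments, start = [], None
    for i, v in enumerate(g):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(g) - 1))
    return segments

g_us = [0, 0, 0.2, 1.5, 3.0, 2.1, 0.3, 0, 0, 4.0, 4.2, 0.1]
assert segment(g_us, 1.0) == [(3, 5), (9, 10)]
```

Only the segments flagged this way reach the RFA, which keeps the classifier idle while the environment is empty.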
The human body state at time t, $\Theta_t$, is obtained by majority voting over the local decisions $\hat{\Theta}_{i,t}$, where $\hat{\Theta}_{i,t}$ is the class prediction of the ith tree in the forest using the input feature subset $\Phi_i$ obtained at time t. The fall state $\Theta_f$ is an unpredictable event leading the subject to lose balance and rest on the ground/floor. As described above, a fall is detected through majority voting over the N class predictions $\hat{\Theta}_{i,t}$ obtained from each RFA tree, namely,
$$\hat{\Theta}_t = \arg\max_{y \in \{\Theta_s, \Theta_f\}} \sum_{i=1}^{N} \mathbb{1}_y\big(\hat{\Theta}_{i,t}\big)$$
where $\mathbb{1}_y(x)$ is the indicator function, namely, $\mathbb{1}_y(x) = 1$ if $x = y$ and $\mathbb{1}_y(x) = 0$, otherwise.

IV. CASE STUDY IN A SMART BUILDING ENVIRONMENT
The experimental setup described above is deployed in an indoor smart space and employed for safety operations. The data acquisition, feature extraction, fusion, and, finally, classification steps are discussed with special focus on fall detection applications. The multisensor device shown in Fig. 5 is ceiling-mounted (2.7 m height) and equipped with one US sensor and one IR array module.
During the on-field tests, first, we collect data corresponding to the empty environment, $\Theta_t = \emptyset$; then, we consider two scenarios: 1) a person entering the area, walking for 30 s, and exiting from the room and 2) a person entering the area, falling down on the floor, and then staying on the ground for 30 s. In this last case, the sensors are able to track the acceleration toward the floor, the shock between the faller and the floor (pre-fall activities), and the body lying on the ground (fall) [10].

A. Fall Detection Performance
First, we analyze the performance of the RFA-based human state classification using, alternatively, only one of the two sensors (i.e., IR or US). Next, we analyze the performance of the fusion algorithm. This is useful to compare the accuracy results of each sensor before and after the data fusion with features obtained from CUSUM. We define k f and f as the number of input features and the size of the feature subsets used by each tree T i , respectively. Finally, we consider the impact of feature fusion and perform the complexity analysis.
1) IR Sensor Array: Targeting the IR sensor array, Fig. 6 shows the fall detection accuracy observed for each thermopile element (here represented as pixels) using the proposed RFA model $\mathcal{R} = \{T_i\}_{i=1}^{N}$ with N = 25 trees. As shown in this example, and as found in almost all measurements, the pixels located in the center of the 2-D array show the largest accuracy. This observation suggests that selecting the best zone of pixels before implementing the RFA [see the features in (10)] trades a marginal performance degradation for a significant reduction of the computational time, since fewer input features are used. As shown in Fig. 7, we define the zones z = 1, . . . , 5 as groups of pixels corresponding to colocated thermopile elements. The goal is thus to select the zone providing real-time activity recognition results with the largest accuracy. Fig. 7(a) shows the zones and the corresponding fall detection accuracy for each individual thermopile element. Focusing on the zones z = 3 and z = 5, Fig. 7(b) and (c) presents the accuracy results using a varying number of input features ($k_f$) obtained from the corresponding zones and a different number of trees N. In this case, each tree randomly chooses feature subsets of size $f \le \sqrt{M} = 8$. The results confirm that pixel selection improves the fall detection accuracy, while the input features obtained from the thermopile elements located in the central zone z = 5 give the best performance. Fig. 8 summarizes the fall detection accuracy in terms of the feature subset size $f$ and the number N of RFA trees: accuracy gains are marginal when using more than N = 25 trees. Comparing both cases, namely, choosing the input features $k_f$ according to the zones z without optimizing $f$ and using all the input features $k_f = 64$ while optimizing the feature groups $f$, it is apparent that the latter option provides higher accuracy. In the following, we will exploit the latter case.
2) US Sensor: The fall activity results using both the raw data and the CUSUM feature are now presented for the US sensor. Fig. 9(a) shows an example of raw data observed during a subject fall: a person enters the monitored area, walks, and at 120 s loses his balance and falls down. The CUSUM feature g_t^(US) is computed in Fig. 9(b): notice that the peaks clearly indicate the fall events. Both CUSUM and raw data can be used as inputs to the RFA; therefore, Fig. 9(c) compares these two settings for a varying number of RF trees N. If US raw data are employed as input to the RFA method, the accuracy reaches 0.75 (for N = 20 trees), while it reaches 0.93 using the US CUSUM features. Using the CUSUM for raw data distillation is thus the preferred choice.
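A minimal sketch of a one-sided CUSUM feature over a raw US stream is given below. The drift value, the running-mean update, and the synthetic signal are illustrative assumptions; the paper's actual CUSUM definition is the one in (5) and (6).

```python
import numpy as np

def cusum_feature(x, drift=0.5):
    """One-sided CUSUM statistic g_t over a raw sensor stream x.

    Accumulates absolute deviations of x_t from a slowly tracked mean,
    minus a drift term, resetting at zero. Sustained growth of g flags
    an abrupt change such as a fall. Drift value is an assumption.
    """
    g = np.zeros_like(x, dtype=float)
    mu = x[0]
    for t in range(1, len(x)):
        g[t] = max(0.0, g[t - 1] + abs(x[t] - mu) - drift)
        mu = 0.99 * mu + 0.01 * x[t]   # slow running estimate of the mean
    return g

# Synthetic US range signal: steady walking distance, then an abrupt
# drop (the fall) at sample 120.
x = np.concatenate([np.full(120, 2.0), np.full(30, 0.4)])
g = cusum_feature(x)
print(float(g[-1]))
```

Before the change the statistic stays at zero; after the fall it accumulates, so feeding g (rather than x) to the RFA hands the classifier a feature that already isolates the event.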

3) Fused IR and US Sensors: We now discuss the benefits of feature fusion using both the IR sensor array and the US sensor in terms of fall detection accuracy. Fig. 10 shows the accuracy observed after the fusion of the CUSUM features g_t = {g_t^(IR), g_t^(US)} obtained from both sensors. In this example, for the IR sensor, we use the pixels in the zone z = 5. Fall detection accuracy is now in the range of 0.94-0.98 while, in line with the previous analyses, the optimized number of RF trees is N = 25. While an appropriate pixel selection increases the accuracy, it also reduces the sensor FoV. Even if this drawback is tolerable for specific applications, it may not be acceptable for general-purpose scenarios. For this reason, in the following, we will not perform any pixel optimization. The computational cost/complexity of the proposed fall detection algorithm, including the CUSUM feature calculation and the RFA, is discussed in Section IV-B. To highlight the benefits of the proposed tool, a tradeoff analysis between accuracy and complexity is also considered with respect to conventional ML solutions.
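Feature-level fusion here amounts to concatenating the per-window CUSUM features of the two sensors into a single input vector before classification. A sketch, with synthetic data and illustrative window counts:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_windows = 400

# Illustrative CUSUM features per decision window:
# 64 values from the 8x8 IR thermopile array + 1 value from the US sensor.
g_ir = rng.random((n_windows, 64))
g_us = rng.random((n_windows, 1))
labels = rng.integers(0, 2, size=n_windows)   # fall / no-fall

# Feature-level fusion: concatenate into a k_f = 65 input vector per window.
X_fused = np.hstack([g_ir, g_us])

# RFA over the fused features with the N = 25 trees found adequate above.
rf = RandomForestClassifier(n_estimators=25, max_features="sqrt", random_state=1)
rf.fit(X_fused, labels)
print(X_fused.shape)
```

Because fusion happens at the (distilled) feature level rather than on raw streams, the classifier input stays at k_f = 65 values per window regardless of the sensors' raw sampling rates.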

B. Complexity Analysis
Computational complexity is discussed here through a comparative analysis between the proposed tool and conventional DL methods, as summarized in Fig. 11. The figure also reports the parameters used to evaluate the complexity of all algorithms; for a description of all other parameters, see the related sections in the following. The proposed complexity analysis accounts for the CUSUM-based feature calculations as well as for the ML-based classification tasks. Considering the LSTM model [56], the complexity of one forward pass is ruled by the input layer size, the LSTM network edges, the FC neural network, and the output layer. Similarly, for the CNN [55], the complexity depends on the input size, the 2-D convolutional layers, the FC, and the output layers. Finally, the RFA complexity [53] is determined by the input feature size, the number of trees, and the output layers. We now provide a detailed description of the parameters that rule the complexity for each case.
1) Random Forest Algorithm: It takes k_f ≤ M + 1 = 65 features as inputs (from CUSUM) and chooses f ≤ k_f features for each tree T_i. The complexity of the proposed algorithm provides a measure of how many computing cycles (operations) the algorithm takes given an input of k_f features. According to [53] and [54], the complexity is defined as

C_RFA = N × D × f + T × k_f    (12)

where N is the number of trees, D is the average depth, f is the feature number, and the last term T × k_f accounts for the CUSUM processing, with T being the number of samples used for the CUSUM calculation as in (5) and (6). Notice that, at the training phase, the complexity should also account for the selection of the k_f input features.

2) Convolutional Neural Network: The CNN complexity is calculated as

C_CNN = Σ_{l=1..L_n} M × N × m_l × n_l × L_{f_l} + N_f × n_o + T × k_f    (13)

where M × N is the input size, m_l × n_l is the filter size, L_{f_l} is the number of filters per layer, L_n is the number of CNN layers, N_f is the number of fully connected input units, and n_o is the number of outputs [55]. Similarly, as in (12), the last term T × k_f relates to the CUSUM calculation.
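The operation count for the RFA (trees × average depth × features per tree, plus the CUSUM term) can be evaluated directly. The values of D and T below are illustrative assumptions, not the paper's measured parameters:

```python
def rfa_complexity(N, D, f, T, k_f):
    """Operation count N*D*f + T*k_f: RFA tree traversal plus CUSUM distillation."""
    return N * D * f + T * k_f

# Illustrative values (D and T are assumptions, not the paper's Table III):
ops = rfa_complexity(N=25, D=10, f=8, T=100, k_f=65)
print(ops)
```

This makes the scaling explicit: the classifier cost grows linearly in each of N, D, and f, while the CUSUM term depends only on the window length T and the feature count k_f.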
3) Long Short-Term Memory: The complexity of the LSTM algorithm is ruled by the total number of edges in the network (please refer to [56] for further details) and is defined as

C_LSTM = L_n × [4 × n_c × (k_f + n_c + 1) + n_c × n_o] + N_f × n_o + T × k_f    (14)

where L_n is the number of LSTM layers, n_c is the number of memory cells, k_f is the number of input units, N_f is the fully connected input size, and n_o is the number of outputs. Notice that the computational time for a network with a moderate number of inputs is dominated by the term L_n × n_c × (4 × k_f + n_o) since k_f ≫ n_c. Therefore, the run-time complexity can be simplified as

C_LSTM ≈ L_n × n_c × (4 × k_f + n_o).
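The simplified dominant term L_n × n_c × (4 k_f + n_o) stated above can be computed the same way; the layer and cell counts below are illustrative assumptions:

```python
def lstm_complexity(L_n, n_c, k_f, n_o):
    """Simplified LSTM run-time cost L_n * n_c * (4*k_f + n_o), valid for k_f >> n_c."""
    return L_n * n_c * (4 * k_f + n_o)

# Illustrative values (assumptions, not the paper's Table III):
print(lstm_complexity(L_n=2, n_c=16, k_f=65, n_o=2))
```

Comparing this with the RFA count above shows why the input feature size k_f weighs so heavily on the LSTM: it multiplies every memory cell in every layer.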

C. Complexity and Accuracy Tradeoff
The accuracy and computational complexity tradeoff is discussed here by assessing the proposed RFA method alone and by comparing it with conventional DL designs. Table I focuses in more detail on the RFA algorithm: in particular, it analyzes the tradeoff between training complexity and accuracy for a variable number of trees N and feature subset size f selected for RFA processing at each tree T_i.
In the table, we set the number of input features to k_f = 65. For all cases, the results confirm that increasing the number of trees and features almost always yields an improvement in accuracy in exchange for an increased complexity. Considering the hardware architecture of Fig. 5, balancing accuracy and computational cost is critical for practical implementation due to resource constraints. For example, setting an accuracy target of 0.96, the optimal combination of trees N and features f is found to be f = 10, N = 50, with a complexity of 1.15 × 10^5 operations.
Similarly, by setting N = 20, Fig. 12 shows the relation among the number of input features k_f, the feature subset size f selected for RFA, the computational complexity, and the corresponding accuracy. For example, for N = 20 and k_f = 65 features, the best choice for the feature subset size is f = 5, as this maximizes the accuracy up to 0.94.
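A sweep of this kind, selecting the feature subset size f that maximizes cross-validated accuracy for a fixed N = 20, can be sketched as follows; the data are synthetic and the candidate f values are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.random((300, 65))            # k_f = 65 fused CUSUM features per window
y = rng.integers(0, 2, size=300)     # fall / no-fall labels (synthetic)

# Sweep the per-split feature subset size f for fixed N = 20 trees,
# mirroring the accuracy/complexity tradeoff analysis of Fig. 12.
scores = {}
for f in (2, 5, 10, 20):
    rf = RandomForestClassifier(n_estimators=20, max_features=f, random_state=2)
    scores[f] = cross_val_score(rf, X, y, cv=3).mean()

best_f = max(scores, key=scores.get)
print(best_f, round(scores[best_f], 3))
```

On real data this selects the smallest f that meets the accuracy target, which directly caps the N × D × f term in the complexity count.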
In Table II, we compare the complexity results with LSTM and CNN using the complexity parameters of Fig. 11 with the corresponding values of Table III. CUSUM processing is excluded from the complexity calculation so as to compare only the ML tools. It is apparent that the RFA computational complexity is three times lower than that of the LSTM and 13 times lower than that of the CNN.

D. Training and Processing Time Analysis
In what follows, we analyze the processing time, the training time, and the memory usage of the proposed algorithms. Since the actual processing time depends on the specific hardware and software implementation, we will highlight relative comparisons. To simplify the analysis, computing times are measured using the same device, namely, a CPU @ 2.90 GHz with 16-GB RAM. The RFA, CNN, and LSTM methods are compared, and the main results are summarized in Table II.

TABLE III: Summary of the parameter values adopted for the comparison of the computational complexity of the LSTM, CNN, and RFA methods.
According to the complexity analysis previously discussed, RFA has the lowest training and inference processing time. In particular, its inference time is 50% less than that of the LSTM and 75% less than that of the considered CNN algorithm. Regarding the training time, RFA requires ten times less time than CNN and LSTM. These results also confirm that a sensor fusion algorithm using RFA is a promising candidate for integration in the proposed multisensory edge device.
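Relative timing comparisons of this kind can be sketched with wall-clock measurements on the same machine; dataset sizes here are illustrative and only the RFA case is shown:

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.random((500, 65))            # k_f = 65 fused CUSUM features (synthetic)
y = rng.integers(0, 2, size=500)

rf = RandomForestClassifier(n_estimators=25, random_state=3)

t0 = time.perf_counter()
rf.fit(X, y)                          # training time for the full dataset
t_train = time.perf_counter() - t0

t0 = time.perf_counter()
rf.predict(X[:1])                     # inference time for one decision window
t_infer = time.perf_counter() - t0

print(t_train, t_infer)
```

Because absolute timings depend on hardware and implementation, only ratios between methods measured on the same device (as done in Table II) are meaningful.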
In Table II, we also quantify the memory usage of all the proposed methods. The goal is to compare the memory efficiency and functionality of the proposed algorithm with conventional DL methods when integrated in an edge device with limited memory. Memory usage is quantified here as the size of memory used by each algorithm to produce a fall detection update, reported in megabytes (MB) in Table II. The results confirm that the proposed RFA method uses about 83% of the memory required by the LSTM (3.1 MB) and about 78% of that required by the CNN algorithm.

V. CONCLUSION
In this article, we propose an integrated device that implements real-time feature-based fusion using data obtained from an IR array and a US sensor. The device implements data distillation by exploiting a change detection algorithm and extracts the features from both sensors using the CUSUM function. A low-complexity RFA is adopted for feature selection and processing. The algorithm has been validated using real data obtained from different field experiments to classify body fall and walk events in a smart space environment. The accuracy-complexity tradeoff is analyzed and compared with conventional ML approaches.
The results of the experimental analysis confirm that the proposed method is practical for implementation on resource-constrained edge devices with limited computational capacity and memory, specifically in IoT applications. They demonstrate that the proposed approach achieves an accuracy of 0.94, a reduction of the computational complexity of at least three times, 80% less processing time for training, 50% less in the inference phase, and 10% less memory usage than the best competing ML method.
Although this article targeted IR and US sensors integration, the proposed method can be easily extended to integrate more sensors since the proposed CUSUM feature processing is (raw) data agnostic.
Future works will consider the deployment of the proposed device in an industrial environment targeting a human-robot workspace multisensory ecosystem.

Vittorio Rampa joined the National Research Council (CNR), Milan, as a Researcher in 1986. From 1999 to 2014, he was an Adjunct Professor at POLIMI, where he taught courses on software-defined radio algorithms and architectures, and radio communication and navigation systems. From 2001 to 2022, he was a Senior Researcher at the Institute of Electronics, Computer, and Telecommunication Engineering (IEIIT-CNR), where he is currently a Research Associate. His research interests include signal processing algorithms and architectures for wireless communications, radio vision techniques for people detection and localization, body models for radio passive sensing, and federated learning algorithms for cooperative and green applications.
Mr. Rampa has been a member of the Steering Committee GTTS-4 of the National Cluster Fabbrica Intelligence since 2018.
Leonardo Costa is an Electronic Engineer with signal processing and digital systems expertise. He is also the Research and Development Manager at Cognimade S.r.l., Melzo, Italy. His job experience includes software development for radars, embedded software development for different architectures, such as ARM Linux and Labview-based systems, firmware development on ARM Cortex-M microcontrollers and fieldprogrammable gate array (FPGA), printed circuit board (PCB) design for the Internet of Things (IoT) devices, sensors, low-power radio-based products, industrial I/O, gateways, and data loggers. He also gained experience with different standards and rules, such as radio equipment device directive 2014/53/EU, ECSS software development life cycle, and electronic safety-related systems IEC 61508.
Denis Tolochenko is an Electronic Technician with printed circuit board (PCB) design and PCB testing expertise. He is also a Research and Development Expert at Cognimade S.r.l., Melzo, Italy. His job experience includes software development for radars and embedded software development for different architectures, such as ARM Linux. His job experience includes PCB design and testing for digital systems and radiobased systems, embedded software development for data loggers, and web-based Internet of Things (IoT) platforms.
Open Access funding provided by 'Consiglio Nazionale delle Ricerche-CARI-CARE' within the CRUI CARE Agreement.