On Edge Human Action Recognition Using Radar-Based Sensing and Deep Learning

In this article, we propose a radar-based human action recognition system, capable of recognizing actions in real time. Range-Doppler maps extracted from a low-cost frequency-modulated continuous wave (FMCW) radar are fed into a deep neural network. The system is deployed on an edge device. The results show that the system can recognize five human actions with an accuracy of 93.2% and an inference time of 2.95 s. Raising an alarm when a harmful action happens is a crucial feature in an indoor safety application. Thus, the performance during the binary classification, i.e., fall vs nonfall actions, is also assessed, achieving an accuracy of 96.8% with a false-negative rate of 4%. To find the best tradeoff between accuracy and computational cost, the energy precision ratio of the system deployed on the edge is measured. The system achieves a 1.04 energy precision ratio value, where an ideal ratio would be close to zero.


Christian Gianoglio, Member, IEEE, Ammar Mohanna, Ali Rizik, Laurence Moroney, and Maurizio Valle, Senior Member, IEEE

Index Terms: Action recognition, deep neural networks (DNNs), edge deployment, frequency-modulated continuous wave (FMCW) radar.

I. INTRODUCTION
Falls are a major public health concern and the main cause of accidental death in the senior population worldwide. Timely and accurate detection permits immediate assistance after a fall and thereby reduces fall-related complications [1]. Edge-based approaches are essential to support time-dependent healthcare applications [2].
Due to the advantages of portability, low cost, and availability, wearable devices are regarded as one of the key types of sensors for fall detection and have been widely studied [3], [4], [5], [6]. The main drawback of these systems is battery life, which can limit the usability of the wearable devices. The second drawback is that the monitored subjects must always wear them, which causes discomfort, especially for the elderly.
Vision-based fall detection is another prominent method. Extensive effort in this direction has shown promising results [7], [8], [9]. Although cameras are not as portable as wearable devices, they offer other advantages that make them good options depending on the scenario: most static RGB cameras are wired and not intrusive, hence there is no need to worry about battery limitations. One major drawback of vision-based detection is the potential violation of privacy due to the level of detail that cameras can capture, such as personal information, appearance, and visuals of the living environment. The second drawback is the high sensitivity to clutter and environmental conditions (e.g., smoke, light). In addition, a vision-based approach could introduce issues related to color bias [10].
Ambient sensors (e.g., ultrasonic sensors, WiFi antennas, radars) provide another nonintrusive means of fall detection. Ambient sensing is drawing more attention because it is device-free for users and avoids the problems of privacy and color bias. Ultrasonic sensor network systems are one of the solutions for fall detection [11]. One drawback is that the waves are affected by environmental factors, such as humidity and smoke, which affect the accuracy of the measurement. In [12], a fall detection approach uses WiFi signals, showing promising results in detecting falls. This approach is susceptible to security threats and has a limited effective range.
Radars have become popular in recent years, proving to be an effective sensor to recognize human actions in cluttered environments [13]. In [14] and [15], human action recognition is performed by a machine learning (ML) algorithm trained on hand-crafted features extracted from the radar signals. The main drawback is the time and effort required for the processing of the data and the feature extraction operation. Other works propose deep neural networks (DNNs) to automatically extract features for human action recognition and fall detection. In [16], stacked autoencoders (AEs) are used to automatically extract the features from the gray-scale spectrogram and to classify four activities, including the fall. In [17], the authors combine convolutional neural networks (CNNs) and AEs to classify 12 actions based on micro-Doppler signatures. In [18], two DNNs automatically extract the features from the time series corresponding to the fast time of an ultra-wideband (UWB) radar return signal and classify fall actions. In [19], bidirectional long short-term memory (LSTM) networks classify activities, in real time, based on the fusion of data collected using radar and wearable devices. In [20], LSTMs with and without bidirectional neurons classify the activities based on the micro-Doppler spectrograms. The data are considered as a continuous temporal sequence. In [21], the authors adopt a generative adversarial network (GAN) to enrich, with synthetic samples, a dataset containing a low number of micro-Doppler signatures representing human actions. Using this method, the capability of generalizing on new data is increased. In [22], a DNN takes as input binary-masked spectrograms, computed on the signals of a UWB radar, and classifies falls. In [23], the authors propose an algorithm to extract the optimal range bin from the range-Doppler spectrograms of a moving target for the subsequent time-frequency analysis; then, a DNN built upon a pretrained model classifies the falls based on the resulting optimal spectrograms.
This article proposes a system for human action recognition based on deep learning (DL). The blocks of the system are designed for edge deployment. The data are collected using a low-cost frequency-modulated continuous wave (FMCW) radar connected to the edge device. The device transforms the signals into range-Doppler maps, treated as a series of images by the DNN. The performance of the system is evaluated both in terms of generalization accuracy and computational cost measured on the edge device. This evaluation covers the multiclass classification (i.e., five human action classes) and the binary classification (i.e., fall versus nonfall classes).
The following are the novel contributions of this article.
1) The system is low-cost and deployed on the edge.
2) The proposed DNN classifies a series of range-Doppler maps by means of a time-distributed layer (TDL).
3) The DNN is small enough to achieve real-time edge inference.
The rest of this article is organized as follows. In Section II, the multiclass human action recognition method is described. In Section III, the radar specifications, data collection, training procedure, and edge deployment procedure are detailed. In Section IV, experimental results including both model and edge system assessment are presented. Finally, Section V concludes this article.

II. METHODOLOGY
In this section, the authors present the multiclass action recognition system. Fig. 1 presents the block diagram of the action elaboration, from the acquisition of the signals to the classification. The block diagram consists of four stages after the radar. In the following, all the stages are detailed.

A. 2-D FFT
In the 2-D FFT stage, the signals received by the Position2Go FMCW radar [24] are processed by the edge device (i.e., the Raspberry Pi 4) to obtain range-Doppler matrices, also known as range-Doppler maps [25]. The radar receiver antenna receives a delayed and attenuated copy of the transmitted wave. An I/Q mixer demodulates the received wave and returns an intermediate frequency (IF) signal. The signal is sampled by an analog-to-digital converter (ADC) and the data are organized in a 2-D matrix $q_{IF}$. Following the notation adopted in [26], the matrix can be represented as in (1), where $A_{IF}$ is proportional to the received echo, $T_s$ is the sampling period, $f_D$ is the Doppler frequency shift, $f_b$ is the beat frequency, $T_{PRI}$ is the pulse repetition interval, and $N_c$ is the number of chirps with $N_s$ samples per chirp. According to (1), the signals in $q_{IF}$ are filtered, producing a filtered matrix $u(n_s, n_c)$. The range-Doppler (RD) map is computed by applying a windowed 2-D FFT to $u$, where $u$ is the filtered matrix after removing the clutter [25] and $\omega$ are the Kaiser windows applied on the beat frequency and Doppler dimensions with a shape factor $K_{sf}$. Since an RD map is a 3-D tensor, it can be considered an image and can be formalized as $RD \in \mathbb{N}^{N_R \times N_D \times C}$, where $N_R$ and $N_D$ represent the height and width of the image, while $C$ represents the number of channels (e.g., $C = 1$ in a gray-scale image and $C = 3$ in RGB format).
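As an illustration of the windowed 2-D FFT described above, the following is a minimal NumPy sketch and not the authors' implementation. The function name `range_doppler_map`, the Kaiser parameter value `beta`, the 64 × 64 frame size, and the synthetic single-target signal are all assumptions made for the example; the input is assumed to be already clutter-filtered.

```python
import numpy as np


def range_doppler_map(u, beta=6.0):
    """Windowed 2-D FFT of a clutter-filtered IF matrix.

    u    : complex array of shape (n_samples, n_chirps)
    beta : Kaiser shape parameter (illustrative value, not from the paper)
    """
    n_s, n_c = u.shape
    # Separable Kaiser window over the fast-time and slow-time axes.
    window = np.outer(np.kaiser(n_s, beta), np.kaiser(n_c, beta))
    spectrum = np.fft.fft2(u * window)
    # Move zero Doppler to the centre of the chirp (Doppler) axis.
    spectrum = np.fft.fftshift(spectrum, axes=1)
    return np.abs(spectrum)


# Synthetic target: beat frequency on range bin 10, Doppler on bin 5.
n_s, n_c = 64, 64
ns = np.arange(n_s)[:, None]
nc = np.arange(n_c)[None, :]
u = np.exp(2j * np.pi * (10 * ns / n_s + 5 * nc / n_c))

rd = range_doppler_map(u)
peak = np.unravel_index(np.argmax(rd), rd.shape)
print(peak)  # range bin 10, Doppler bin 5 + 32 after the shift
```

The magnitude map would still need to be scaled and rendered as an image before being passed to the later stages.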

B. Image Transformation
A transformation can be applied to an image to convert it from one domain to another. Viewing an image in different domains enables the identification of features that may not be as easily detected in the initial domain. Among the image transformation techniques, edge detectors have been shown to increase accuracy in DL applications [27]. In this article, five transformations have been applied to the RD maps (RD in Fig. 1), three of which are based on edge detectors (i.e., Canny, Sobel, and Roberts). During the transformation, the RD maps are also resized to match the dimension of the input tensor of the classifier. Fig. 2 shows the different transformations applied to an exemplary RD map. The RD map format is RGB [Fig. 2(a)].
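To make the transformations more concrete, here is a small NumPy sketch of four of them (grayscale, binary thresholding, Roberts, and Sobel). It is only an illustration under assumptions: the luma weights, the threshold of 128, and the function names are choices made for the example, and the Canny detector and the resizing step are omitted.

```python
import numpy as np


def to_gray(rgb):
    # Weighted sum of the RGB channels (ITU-R BT.601 luma weights, assumed).
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114


def to_binary(gray, threshold=128):
    # 1 where the pixel reaches the (assumed) threshold, 0 elsewhere.
    return (gray >= threshold).astype(np.uint8)


def roberts(gray):
    # Roberts cross: differences along the two diagonals of each 2x2 patch.
    g = gray.astype(float)
    gx = g[:-1, :-1] - g[1:, 1:]
    gy = g[:-1, 1:] - g[1:, :-1]
    return np.hypot(gx, gy)


def sobel(gray):
    # Sobel gradients over each 3x3 neighbourhood (valid region only).
    g = gray.astype(float)
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]) - (
        g[:-2, :-2] + 2 * g[1:-1, :-2] + g[2:, :-2])
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]) - (
        g[:-2, :-2] + 2 * g[:-2, 1:-1] + g[:-2, 2:])
    return np.hypot(gx, gy)


# Toy RGB image with a vertical step edge between columns 3 and 4.
img = np.zeros((8, 8, 3), dtype=np.uint8)
img[:, 4:, :] = 255

gray = to_gray(img)
binary = to_binary(gray)
roberts_edges = roberts(gray)
sobel_edges = sobel(gray)
```

Both edge detectors respond only around the step, which is the kind of structure they would emphasize in an RD map.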

C. Image Sequence Collection
The content of RD maps varies over time. Therefore, the sequence of RD maps is time-dependent. To classify the action, $T$ RD images of each action are collected as a 4-D tensor datum, as shown in Fig. 3. The 4-D tensor can be formalized as $X \in \mathbb{N}^{N_R \times N_D \times C \times T}$. The adoption of sequences of RD maps results in an increase in classification accuracy over the single image, as demonstrated for human action recognition in [28].
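The collection step can be sketched as stacking the maps along a new time axis. The sizes below (224 × 224 pixels, one channel, 15 maps) are taken from the values reported elsewhere in the article; the variable names and the all-zero placeholder images are assumptions for the example.

```python
import numpy as np

T, N_R, N_D, C = 15, 224, 224, 1

# Placeholder RD images; in practice each one comes from the previous stages.
rd_maps = [np.zeros((N_R, N_D, C), dtype=np.uint8) for _ in range(T)]

# One data sample as a 4-D tensor with time as the last axis.
sample = np.stack(rd_maps, axis=-1)         # shape (N_R, N_D, C, T)

# Layers that iterate over time usually expect the time axis first.
sample_time_first = np.moveaxis(sample, -1, 0)   # shape (T, N_R, N_D, C)
```

Which axis ordering is used is a convention of the training framework, so the second form is shown only as a common alternative.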

D. Classifier
Fig. 4 shows the proposed DNN (Classifier in Fig. 1) used to classify the 4-D tensor.
Human actions are dynamic and occur over time, and are thus captured in multiple RD maps, with each map constructed in approximately 30 ms, as described in Section III-A. To address the classification problem, we propose a hybrid model that utilizes a CNN to extract image features and an LSTM to capture the dependencies between consecutive maps. A TDL applies the same layer(s) to every time step of the input [29]. In this article, the TDL uses a CNN to extract T feature maps from the T images of the 4-D input tensor, and the output of the TDL is a sequence of feature maps. An LSTM layer learns the dependencies between the feature maps in the sequence. Finally, two dense layers are the output layers of the DNN. The first one is a fully connected layer with a ReLU activation function. The second is made of Neu neurons, i.e., the number of classes, with a Softmax activation function to assign the action label.
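The idea of a time-distributed layer can be illustrated without a DL framework: the same function, with the same parameters, is applied independently to every time step. In the sketch below, `toy_feature_extractor` is a hypothetical stand-in for the CNN (it only computes per-channel mean and maximum), so this shows the data flow and not the proposed network.

```python
import numpy as np


def time_distributed(layer, sequence):
    """Apply the same layer to each time step of a (T, H, W, C) sequence."""
    return np.stack([layer(frame) for frame in sequence], axis=0)


def toy_feature_extractor(frame):
    # Hypothetical stand-in for a CNN: two features per channel.
    return np.concatenate([frame.mean(axis=(0, 1)), frame.max(axis=(0, 1))])


rng = np.random.default_rng(0)
x = rng.random((15, 224, 224, 1))      # T = 15 maps, time axis first

features = time_distributed(toy_feature_extractor, x)
print(features.shape)                  # one feature vector per time step
```

The resulting sequence of feature vectors is what a recurrent layer such as an LSTM would consume next.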
Three CNNs have been designed. The use of CNNs aims to automatically extract features by learning the kernels (i.e., filters) that convolve with the data. This practice is adopted to replace the domain-specific feature extraction, which is usually manually crafted by experts. The CNNs have been designed as a proof of concept for the feasibility of the action recognition system. The model's number of parameters is correlated with the size/performance ratio. Therefore, three CNN architectures of different sizes are adopted to investigate the effect of the number of layers on the performance of the overall system. Two different instances of the CNN2 architecture have also been taken into account. CNN1 has an architecture similar to CNN2 but with a lower number of layers: it contains five blocks instead of six, each block has one convolutional layer fewer than CNN2, and the global pooling is applied to the output of the fifth block (the last one in this case). Conversely, CNN3 has one more convolutional layer in each block with respect to CNN2, thus presenting seven main blocks. In the following, DNN1, DNN2, and DNN3 will refer to the DNN presented in Fig. 4 that uses CNN1, CNN2, and CNN3, respectively.

TABLE I POSITION2GO RADAR SPECIFICATIONS

A. FMCW Radar Specifications
The radar used in the experiments is the Position2Go FMCW radar model by Infineon Technologies, operating in the 24-GHz ISM band [24]. The radar is equipped with arrays of microstrip patch antennas (one for transmitting and two for receiving) characterized by a 12-dBi gain and 19 × 76 degree beam widths, defining the field of view (FoV) in the elevation and azimuth axes, respectively. The sampling rate used for the data collection is 213 kHz to detect the high-frequency components of the signal (around 60 Hz) [28]. The development kit allows the user to implement and test several applications in the 24-GHz ISM band, such as localizing, tracking, and collision avoidance [24]. Table I lists the radar parameters that are set for the data collection procedure. According to the specifications, each RD map is built upon 64 chirps. Every chirp consists of 300 μs of ramp-up, 100 μs of ramp-down, and 100 μs of steady-state period until the next chirp is generated. As a result, each map is computed over 32 ms of data. It takes 168 ms to transform a frame of chirps into the RD map by applying the 2-D FFT. The next acquisition begins once the processing of the previous frame is complete and the RD map has been saved.

TABLE II COLLECTED DATASET SUMMARY
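The timing figures above can be checked with a few lines of arithmetic. The constants are the values stated in the text (including the 15 maps per data sample reported in the data collection section); only the variable names are introduced here.

```python
CHIRPS_PER_MAP = 64
RAMP_UP_US, RAMP_DOWN_US, STEADY_US = 300, 100, 100
PROCESSING_MS = 168        # time to turn one frame of chirps into an RD map
MAPS_PER_SAMPLE = 15       # consecutive RD maps in one data sample

chirp_us = RAMP_UP_US + RAMP_DOWN_US + STEADY_US        # 500 us per chirp
acquisition_ms = CHIRPS_PER_MAP * chirp_us / 1000       # 32 ms per map
map_period_ms = acquisition_ms + PROCESSING_MS          # 200 ms per map
sample_duration_s = MAPS_PER_SAMPLE * map_period_ms / 1000

print(acquisition_ms, map_period_ms, sample_duration_s)  # 32.0 200.0 3.0
```

Because acquisition and processing alternate, one map is produced every 200 ms, which is consistent with 15 maps covering 3 s of an action.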

B. Data Collection
The data have been collected in two different environments at the University of Genova, Italy. The environments contain clutter, such as desks, PCs, and metal lockers, as shown in Fig. 6.
A data sample consists of 15 consecutive range-Doppler maps acquired during 3 s of a human action performed in the FoV of the radar. The dataset contains five classes as follows.
1) Fall: The subject falls from a walking or standing state onto a mattress.
2) Bed-Fall: A couch [Fig. 6(b)] represents a bed. It has been moved around the environment to perform the "fall from bed" action from different angles with respect to the radar's FoV.
3) Sit: The subject sits down on the couch or chairs positioned in different locations of the environment.
4) Stand: The subject stands up from the couch or chairs positioned in different locations of the environment.
5) Walk: The subject walks in the FoV of the radar.
The actions have been performed by five healthy subjects aged between 25 and 30. Only one subject performed one action at a time. To limit possible data-collection selection bias, the subjects executed the same action in different ways (e.g., changing the starting and/or ending positions with respect to the radar, performing the action faster or slower). In total, 100 data samples, i.e., 20 per subject, have been collected for each class (50 for each environment). Table II summarizes the collected data. The first column shows the performed actions, the second column the class corresponding to each action, and the last column the number of data samples acquired in each environment per class.
Five datasets have been generated from the $D_{Orig}$ dataset by applying the transformation techniques described in Section II-B. The datasets are denoted as $D_d$, with $d \in \{Gray, Canny, Sobel, Roberts, Binary\}$. Fig. 7 illustrates five RD maps per action, highlighting the most significant differences. Each map represents a snapshot of the action captured by the radar, where the x-axis denotes the distance of the human target from the radar, and the y-axis represents the speed retrieved by the Doppler FFT. A negative speed indicates that the moving target is approaching the radar, while a positive speed means that the target is walking away. The number in the bottom-right corner of each figure corresponds to the map position in the sequence. In the first row, the Walk action shows a target moving away from the radar, with an increasing range and a speed lower than 5 km/h. The last map (number 12) displays a vertical distribution along the speed axis, indicating a Doppler spread caused by the different ways in which the human body joints move, resulting in different Doppler frequencies. The Sit action in the second row shows a motion similar to the Walk action until the subject sits on the chair, resulting in a fixed position on the range axis in maps 5 and 15, and different frequencies due to body movements on the chair. On the other hand, the Stand action in the third row shows the subject sitting on a chair in maps 1 and 2, then standing up and starting to walk in maps 7, 10, and 15. The target approaches the radar, resulting in a negative speed and a decreasing range. In the Fall action in the fourth row, the subject approaches the radar by walking in maps 1, 2, and 4. Map 6 captures the falling action, presenting a spread across both axes, most noticeable in the Doppler one. In the last map, the subject is lying on the mattress, resulting in a zero Doppler frequency. The Bed-Fall action in the last row presents a similar pattern across all maps, with a zero Doppler frequency at a fixed range. Map 4 shows a frequency spread on the speed axis due to the falling action.

D. Training
The training procedure has been implemented offline on a desktop PC with an Nvidia GeForce RTX 2080Ti GPU. All the DNNs are trained with the following parameters.
3) Batch size bs = 10.
4) Loss function lf = categorical cross entropy.
5) Early stop on validation accuracy with patience p = 10.
The stratified K-fold technique is adopted to provide fair results. According to the technique, a labeled dataset (population) is split into K parts containing the same proportion of data per class as in the population. This mechanism guarantees that the training and test sets contain the same proportion of data in each fold without affecting the approximation of the generalization accuracy. Each kth part is used, in turn, as the test set. The remaining K − 1 folds compose the training set. In our experiments, K is set to 5. In this way, each fold contains 100 data samples, 20 per class. The generalization accuracy results presented in the next section are averaged over the five folds. An early stop criterion is adopted: from the K − 1 folds used during the training, a validation set is extracted randomly, using a ratio of 1/4 over the number of training data. Finally, a learning rate decay is adopted: during the training process, the learning rate is reduced every 10 epochs by multiplying it by 0.9. When the early stop criterion is satisfied (i.e., the validation accuracy decreases continuously for p epochs), the training procedure ends.
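The splitting scheme and the learning rate decay described above can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the function names, the random seeds, and the round-robin assignment used to balance the classes are assumptions, and the initial learning rate is left as a free argument because it is not stated here.

```python
import random
from collections import defaultdict

CLASSES = ["Fall", "Bed-Fall", "Sit", "Stand", "Walk"]


def stratified_k_fold(labels, k=5, seed=0):
    """Split sample indices into k folds with equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)     # round-robin keeps classes balanced
    return folds


def split(folds, test_fold, val_ratio=0.25, seed=0):
    """Return (train, validation, test) indices for one fold."""
    test = list(folds[test_fold])
    train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
    random.Random(seed).shuffle(train)
    n_val = int(len(train) * val_ratio)    # validation drawn at random
    return train[n_val:], train[:n_val], test


def decayed_lr(initial_lr, epoch, factor=0.9, every=10):
    """Learning rate multiplied by `factor` every `every` epochs."""
    return initial_lr * factor ** (epoch // every)


labels = [c for c in CLASSES for _ in range(100)]   # 100 samples per class
folds = stratified_k_fold(labels)
train_idx, val_idx, test_idx = split(folds, test_fold=0)
print(len(train_idx), len(val_idx), len(test_idx))  # 300 100 100
```

With 500 samples and K = 5, each test fold holds 100 samples (20 per class), and one quarter of the remaining 400 is held out for early stopping.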

E. Edge Deployment
During inference, all stages (Fig. 1) are deployed on an edge device, where the tradeoff between generalization accuracy and the time for retrieving the action label becomes crucial. When it comes to classifiers, it is necessary to evaluate the tradeoff between model size, latency, and accuracy. Hence, model optimization options must be considered during the model's conversion for edge deployment.
TensorFlow Lite (TF-Lite) is used to deploy DL models on mobile and edge devices, such as development boards and microcontrollers, offering optimization options for converting a TensorFlow model into the TF-Lite format. The adopted optimization option is quantization, which represents the model with lower precision [e.g., floating-point 16 (FP16) instead of the default floating-point 32 (FP32) representation]. In general, this reduces the memory occupied by the model and the inference time. In this work, two quantization options have been used: no quantization, where the model parameters are represented as FP32, and FP16, where the parameters are converted to FP16, reducing the model size without affecting the inference time. The edge device used in this research is the Raspberry Pi 4, which includes a high-performance 64-bit quad-core processor and 4 GB of RAM. The 4 GB version of the Raspberry Pi has been proven to be a reliable edge computing device for the signal processing of data collected by an FMCW radar [26], [30]. Posttraining quantization was performed on the host PC, where the TensorFlow-trained models were saved, converting all the weights of the network from FP32 to FP16. In the case of no quantization, the TensorFlow model was simply converted to the TF-Lite format. Fig. 8 shows the representation of FP32 and FP16 numbers and provides a snippet of C++ code used to convert an FP32 number to an FP16 number according to the IEEE 754 standard. Since the Raspberry Pi 4 does not support any operation but FP32, the FP16 numbers are cast back to FP32 when the TF-Lite model is run. Once the models are converted, they are deployed on the Raspberry Pi 4. With Python code, it is possible to load the sequence of RD maps, transform the original data (if necessary), and predict the label by running the TF-Lite model. In a real-time application, data are acquired from the radar by the Raspberry Pi 4 using the script for data collection, and then the Python script is run to predict the human action.
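The effect of storing a weight as FP16 and widening it again at run time can be reproduced with the Python standard library, whose `struct` module supports the IEEE 754 half-precision format through the `e` code. This is a sketch for illustration, not the C++ snippet of Fig. 8; the function names are assumptions, and because Python floats are double precision the input is rounded from 64 bits rather than from FP32.

```python
import struct


def to_fp16_and_back(x):
    """Round x to IEEE 754 half precision, then widen it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_bits(x):
    """Return the 16-bit pattern used to store x in half precision."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


weight = 0.1
restored = to_fp16_and_back(weight)

print(hex(fp16_bits(1.0)))            # 0x3c00: sign 0, exponent 01111, mantissa 0
print(restored)                       # 0.0999755859375
print(struct.calcsize("<e"), struct.calcsize("<f"))   # 2 bytes versus 4 bytes
```

The stored value takes half the space, while the small rounding error remains after the cast back to a wider format, which is why the model shrinks but the arithmetic cost at run time does not.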

IV. RESULTS AND DISCUSSION
When introducing edge AI systems, it is crucial to assess two complementary aspects, i.e., the classification accuracy and the efficiency (e.g., inference time, power consumption, energy precision ratio, etc.).

A. Classification Accuracy
In this section, the results in terms of accuracy are presented. First, the accuracy of the multiclass classification problem is assessed. Second, some performance metrics are evaluated in a binary classification problem. In this case, the classes Fall and Bed-Fall are considered fall actions (i.e., harmful), and the other classes are considered nonfall actions. The performance is computed on the multiclass classification results, by grouping the predicted labels into harmful and nonharmful classes.

Multiclass Classification
Table IV shows the average accuracy on the test set computed over the K folds for each of the three DNNs. The first column indicates the datasets and the other three columns report the average accuracy with the standard deviation. The best accuracy for each DNN is in bold. Concerning DNN1, which contains the lowest number of parameters according to Table III, $D_{Canny}$ presents the highest accuracy. For both DNN2 and DNN3, the best accuracies are achieved on the $D_{Gray}$ dataset. The overall best accuracy (i.e., 93.2%) is obtained by DNN3, which contains the largest number of parameters (Table III) and therefore has the highest memory occupation and power consumption on an edge device.
Fig. 9 shows the confusion matrices for the best DNNs, in bold in Table IV, i.e., DNN1 trained with $D_{Canny}$, and DNN2 and DNN3 trained with $D_{Gray}$. The predicted labels of the five folds have been merged, resulting in a hundred test samples for each class. The most reliable DNN for detecting falls, which correspond to the classes named Fall and Bed-Fall, as described in Section III-B, is DNN3. DNN2 shows a lower accuracy in detecting the Fall class. As one can notice from the first row of the DNN3 matrix, the misclassified samples are mostly classified as Sit. This is possibly due to the similarity between the Fall and Sit actions.

TABLE V METRICS COMPUTED ON BINARY CLASSIFICATION

Binary Classification
The following notation is used: true positives (TP) are fall actions correctly classified, false positives (FP) are nonfall actions incorrectly classified as Fall, true negatives (TN) are nonfall actions correctly classified, and false negatives (FN) are fall actions incorrectly classified as NonFall. The following metrics are then adopted.
1) Precision (PR), $PR = \frac{TP}{TP+FP}$, indicates how many predicted positive labels are actually positive.
2) Recall or sensitivity (SE), $SE = \frac{TP}{TP+FN}$, indicates how accurately a model predicts the positive class.
3) Specificity (SP), $SP = \frac{TN}{TN+FP}$, indicates how accurately a model predicts the negative class.
4) False positive rate (FPR), $FPR = \frac{FP}{FP+TN}$, indicates the fraction of nonfall actions classified as fall.
5) False negative rate (FNR), $FNR = \frac{FN}{FN+TP}$, indicates the fraction of fall actions classified as nonfall.
6) F1-score, $F1 = \frac{2 \cdot PR \cdot SE}{PR+SE}$, balances between PR and SE.
The results presented in Table V show that the proposed system is capable of distinguishing harmful from nonharmful actions. In particular, the FNR, which represents the percentage of harmful actions that do not activate the alarm since they are classified as nonharmful, is low, especially in DNN3. In general, DNN3 presents the best performance for all the metrics with respect to DNN1 and DNN2.
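The grouping of the five predicted classes into fall and nonfall, followed by the computation of the binary metrics, can be sketched as follows. The label lists are made-up toy data, not results from the article, and the function names are assumptions; the sketch does not guard against empty classes (zero denominators).

```python
FALL_CLASSES = {"Fall", "Bed-Fall"}


def is_fall(label):
    """Map a five-class label to the binary harmful / nonharmful problem."""
    return label in FALL_CLASSES


def binary_metrics(y_true, y_pred):
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        t, p = is_fall(t), is_fall(p)
        if t and p:
            tp += 1
        elif t and not p:
            fn += 1
        elif not t and p:
            fp += 1
        else:
            tn += 1
    pr = tp / (tp + fp)
    se = tp / (tp + fn)
    return {
        "PR": pr,
        "SE": se,
        "SP": tn / (tn + fp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "F1": 2 * pr * se / (pr + se),
    }


# Toy labels: a Bed-Fall predicted as Fall still counts as a detected fall.
y_true = ["Fall", "Bed-Fall", "Fall", "Sit", "Stand", "Walk", "Sit", "Walk"]
y_pred = ["Fall", "Fall", "Sit", "Sit", "Stand", "Walk", "Fall", "Walk"]
metrics = binary_metrics(y_true, y_pred)
```

On this toy data there are 2 TP, 1 FN, 1 FP, and 4 TN, so the specificity is 0.8 and one fall in three is missed.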
To further investigate the performance of the three best DNNs (in terms of accuracy), the receiver operating characteristic (ROC) curves and the areas under the curves (AUC) are computed on each fold. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), varying the threshold for the score (i.e., the probability) based on which the output neurons assign the label.
The AUC is the area under the ROC curve and represents the degree of separability between classes. The AUC can be considered as an indicator of the performance of a classifier: the higher the value, the higher the prediction accuracy. Fig. 10 reports the ROC and the AUC for the three best DNNs. Each plot presents seven lines: five colored lines refer to the ROC in each fold, the blue dotted line represents the average ROC curve, and the black dashed line corresponds to the baseline classifier, where FPR is equal to TPR. The plot legends report the AUC values corresponding to each fold and the average.
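A minimal sketch of how a ROC curve and its AUC can be obtained by sweeping the decision threshold is given below. The scores and labels are made-up toy values, the function names are assumptions, and the AUC is integrated with the trapezoidal rule; this is not the evaluation code used for Fig. 10.

```python
def roc_curve(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold.

    scores : predicted probability of the positive (fall) class
    labels : True for positive samples, False for negative ones
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and not l)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

points = roc_curve(scores, labels)
area = auc(points)     # 8/9: eight of nine positive/negative pairs are ranked correctly
```

A baseline classifier with TPR equal to FPR would give an area of 0.5, which is the reference line in such plots.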
DNN3 presents the highest average AUC value. In general, DNN3 trained on $D_{Gray}$ is a reliable model for detecting fall actions, achieving the highest accuracy in the multiclass classification problem, the lowest false negative rate in the binary classification problem, and the highest AUC value. DNN2 could also be a valuable option: although it presents slightly lower generalization performance with respect to DNN3, it contains fewer than one-third of the parameters of DNN3 (Table III). Thus, it is expected that, when deploying the models on the edge device, the computational cost of DNN2 is lower than that of DNN3. In fact, during the deployment, not only the generalization performance but also the computational cost is important. In this article, the computational cost is measured taking into consideration all the stages of data elaboration (Fig. 1).

B. Edge System Assessment
The computational cost is evaluated as the inference time, power consumption, size of the model, and energy-precision ratio. Power consumption is estimated using a USB multimeter that is attached to the power supply of the edge device while running the inference. The energy-precision ratio (EPR) can be computed as EPR = Error × EPI, where Error represents the classification error and EPI is the energy consumption per classified data item (Energy Per Item). According to Section III-E, two TF-Lite optimization methods are applied during the conversion of the three best DNNs into TF-Lite models. These classifiers and the previous stages are deployed on the Raspberry Pi 4.
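The EPR formula can be turned into a short worked example. Two assumptions are made here and are not stated in the text: the error is expressed as a fraction (1 − accuracy), and the EPI is obtained as the mean power multiplied by the inference time. The numerical values are hypothetical and are not measurements from Table VI.

```python
def energy_per_item(mean_power_w, inference_time_s):
    """Energy spent to classify one data item, in joules (assumed definition)."""
    return mean_power_w * inference_time_s


def energy_precision_ratio(accuracy, epi_j):
    """EPR = Error x EPI, with Error taken as 1 - accuracy (assumed convention)."""
    return (1.0 - accuracy) * epi_j


# Hypothetical device readings, for illustration only.
epi = energy_per_item(mean_power_w=5.0, inference_time_s=2.0)   # 10 J
epr = energy_precision_ratio(accuracy=0.90, epi_j=epi)          # about 1.0
```

The ratio decreases when either the error or the energy per item decreases, which is why a value close to zero is the ideal.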
Table VI shows the computational cost and the classification accuracy of the quantized DNNs. All the results are averaged on the 500 test data used to evaluate the classification performance in the previous section. The first column lists the models, the second column reports the quantization applied to each model for the deployment, and the remaining columns report the computational cost metrics and the classification accuracy. As expected, the first model (i.e., DNN1) has the lowest inference time and energy consumption, because it has the lowest number of parameters, which determines the model size. Accordingly, DNN3 presents the highest inference time and energy consumption. As can be noticed, the quantization to FP16 in DNN1 and DNN2 slightly improves the classification accuracy, possibly because it acts as a filter removing bits related to noise. The best tradeoff between generalization accuracy and inference time is achieved by DNN2, according to the EPR column of the table. The inference time of all the stages, with DNN2 as a classifier, is 2.4 s, thus guaranteeing the generation of an alarm in less than 6 s, also considering 3 s for data acquisition.

C. Model Interpretability
Model interpretability is a crucial aspect in the analysis of DL models, especially in safety-critical applications where the prediction's correctness is fundamental. In this work, we employ GradCam++ to visualize the input regions that have the most significant impact on the network's decision. Figs. 11 and 12 show the results of GradCam++ [31] computed on the feature maps extracted from an average pooling layer in DNN3, specifically after the last residual block. GradCam++ produces heatmaps that highlight the parts of the input image that are most important in explaining the predicted class. In Fig. 11, the first row displays four RD maps of an input sequence consisting of 15 maps, in which a fall action occurred and was correctly classified by the network. Maps 6 and 7 in the first row capture the fall, while map 1 refers to walking and map 10 to lying on the mattress after the fall. The second row shows the corresponding GradCam++ heatmaps, highlighting the most relevant parts of the feature maps, which coincide with the most salient parts of the input. The third row depicts the pointwise multiplication of the GradCam++ heatmap with the input, clearly showing the regions of the input image that contributed the most to the classification. Similarly, in Fig. 12, the first row displays four RD maps of an input sequence consisting of 15 maps, in which a fall action occurred, specifically in maps 10 and 11. However, this sequence was misclassified as sitting by DNN3. The second and third rows show the corresponding GradCam++ heatmaps and pointwise multiplications, respectively. The heatmaps look more spread, especially in map 11, with respect to the former example because of the higher level of noise that affected the measurements. Thus, even though GradCam++ seems to highlight salient regions of the input, the noise collected by the radar probably influences the final decision of the classifier.

D. Comparison With Advanced Image Recognition Models
Table VII presents a comparison of the achieved results with state-of-the-art (SoA) algorithms for image recognition that utilize CNNs pretrained on the ImageNet dataset. Specifically, we considered MobileNetV2 (MNV2) [32], Xception (Xcep) [33], and ResNet50V2 (RN50V2) [34], all of which were encapsulated in the TDL (Fig. 4). We trained these deep architectures using two strategies: first, we adopted the transfer learning paradigm and only tuned the LSTM and dense layers while keeping the CNNs frozen; second, we unfroze the last layers of the CNNs and fine-tuned them together with the LSTM and dense layers. Table VII displays the total and trainable number of parameters, as well as the average accuracy on the five classes using the five-fold cross-validation technique. The subscript "FT" denotes CNNs that were partially tuned by unfreezing the last layers. We used dataset $D_{Orig}$ to train all networks since these three pretrained CNNs require RD maps represented with three channels. The results indicate that the DNN2 and DNN3 models achieve higher accuracy compared to all other networks, despite having a lower number of parameters. It is noteworthy that the partially fine-tuned models exhibit higher accuracy compared to their corresponding frozen models.
Table VIII presents a comparison of the metrics computed for the three deep networks based on image recognition models and the proposed approach. Only the partially fine-tuned models were included in the analysis, as they achieved better accuracy in the five-class classification problem. Once again, DNN2 and DNN3 exhibit superior performance across all the metrics.
Table IX shows the assessment of the three deep networks and the proposed models on the Raspberry Pi. We report only the results obtained with the FP16 quantization, as we have previously demonstrated that quantization has no significant impact on classification accuracy. The MobileNetV2-based network presents a lower inference time compared to our models, owing to the use of depthwise convolutions, leading to a lower energy consumption per predicted datum. However, our models offer a lower EPR as they achieve an accuracy improvement of over 6% compared to the MobileNetV2-based network.

E. Generalization Performance Assessment
TABLE XI COMPARING THE PROPOSED SYSTEM WITH RADAR-BASED HUMAN ACTION RECOGNITION AND FALL DETECTION IN SOA

A dataset comprising 100 samples (20 per class) was collected in the kitchen and bedroom of a private house to evaluate the generalization performance of DNN2 and DNN3 on two different subjects aged 26 and 32 years old. Based on the previous results, the networks were tested on the dataset using no transformation and grayscale. The datasets can be represented as $T_d = \{(X, y)_i;\ X_i \in \mathbb{N}^{224 \times 224 \times 1 \times 15};\ y_i \in \{\text{Fall}, \text{Bed-Fall}, \text{Sit}, \text{Stand}, \text{Walk}\};\ i = 1, \ldots, 100\}$ with $d \in \{\text{Orig}, \text{Gray}\}$. Table X summarizes the classification performance of the two networks on $T_d$. The first six columns show the metrics computed in the binary classification, while the last column presents the accuracy achieved in the five-class problem. The results indicate that both DNN2 and DNN3 generalize well on $T_d$ with both transformations, with DNN3 achieving higher accuracy in both the binary and five-class classification. In particular, DNN3 tested on $T_{\mathrm{Gray}}$ achieves a false-negative rate of 0%, recognizing all the fall actions while keeping a low rate of false positives. Fig. 13 visualizes the confusion matrixes.
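As an illustration of how binary fall-versus-nonfall metrics can be derived from a five-class confusion matrix such as those in Fig. 13, the following sketch collapses the classes into two groups. It assumes that both Fall and Bed-Fall count as the positive (fall) class, which the text does not state explicitly, and the matrix used in the example is made up for illustration; it is not taken from the paper.

```python
CLASSES = ["Fall", "Bed-Fall", "Sit", "Stand", "Walk"]
# Assumption: both fall variants form the positive class of the binary problem.
FALL_CLASSES = {"Fall", "Bed-Fall"}


def binary_metrics(confusion, classes=CLASSES, positives=FALL_CLASSES):
    """Collapse a multi-class confusion matrix into fall vs. nonfall metrics.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    tp = fn = fp = tn = 0
    for i, true_label in enumerate(classes):
        for j, pred_label in enumerate(classes):
            count = confusion[i][j]
            true_is_fall = true_label in positives
            pred_is_fall = pred_label in positives
            if true_is_fall and pred_is_fall:
                tp += count
            elif true_is_fall:
                fn += count  # missed fall
            elif pred_is_fall:
                fp += count  # false alarm
            else:
                tn += count
    total = tp + fn + fp + tn
    correct = sum(confusion[i][i] for i in range(len(classes)))
    return {
        "binary_accuracy": (tp + tn) / total,
        "false_negative_rate": fn / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "five_class_accuracy": correct / total,
    }


# Made-up example with 20 samples per class (rows: true class, columns: prediction).
example = [
    [19, 1, 0, 0, 0],   # Fall
    [2, 17, 1, 0, 0],   # Bed-Fall
    [0, 1, 18, 1, 0],   # Sit
    [0, 0, 1, 19, 0],   # Stand
    [0, 0, 0, 0, 20],   # Walk
]
metrics = binary_metrics(example)
```

Under this grouping, a Fall predicted as Bed-Fall is an error in the five-class problem but a correct detection in the binary one, which is one reason the binary accuracy can exceed the five-class accuracy.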

F. Comparison With Radar-Based
Table XI presents a comparison of the proposed system with the most relevant SoA systems in radar-based human action classification using ML. The table provides detailed information on the number of actions classified, the type of input processed by each model, the acquisition time for each data point, the classification model, the strengths and weaknesses of each approach, and the inference time if available. The last row of the table presents the proposal and includes an analysis of the two most accurate models, DNN2 and DNN3, which were trained using $D_{\mathrm{Gray}}$. All of the works achieved high accuracy in radar-based human action recognition, but some did not specifically focus on fall detection, such as [35], [36], [37], [38], and [39]. Several studies have focused on binary classification for fall detection among common daily activities, such as walking, sitting, standing, and other nonfalling actions [16], [18], [22], [40]. Most of the works used raw data or computed spectrograms to train the classifiers, while only one [41] collected sequences of RD maps to continuously monitor human actions. Only two works addressed the implementation of classifiers on embedded devices, namely, [35] and [40]. Although the authors designed tiny models with a very low inference time on the edge, they did not deploy the data preprocessing stages on the edge; these stages (e.g., the 2-D FFT, image transformation, and image sequence collection blocks in Fig. 1) were performed offline. To the best of our knowledge, this is the first work on radar-based human action recognition with fall detection in which the entire pipeline has been deployed on the edge.

G. Comparison With Wearable and Camera-Based Systems
In the SoA, many works addressed human action recognition and fall detection with a multitude of sensors, such as wearables, cameras, and radars. A recent review can be found in [45]. Table XII summarizes most of the highly cited research on the topic. Each sensor category has its own strengths and weaknesses, which are highlighted in the table. Overall, the studies achieve high levels of accuracy, comparable to those obtained with radar-based approaches (see Table XI). The choice of which type of sensor to use depends on the specific application. Wearable-based systems are ideal when users can consistently wear the sensors, as they can monitor people at all times, are relatively low-cost, may require less complex processing, and guarantee privacy. Conversely, when wearable sensors are not an option, radars or cameras can be chosen instead. Radars guarantee privacy, can monitor people in cluttered environments, and are not affected by changes in lighting conditions. Camera-based systems are more accurate at detecting multiple people in the same room, but they do not guarantee privacy, and clutter can interfere with monitoring.

V. CONCLUSION
This article presents an on-the-edge radar-based human action recognition system using DL. The system uses sequences of range-Doppler maps extracted from a low-cost FMCW radar. A time-distributed layer processes the sequence of range-Doppler maps. The results showed that the model with the highest number of parameters (i.e., DNN3) achieves the best accuracy (93.2%) in the five-class classification using the grayscale data transformation. Moreover, the same model distinguished harmful from nonharmful actions with an accuracy of 96.8% and a false-negative rate of 4%. Using a radar with higher performance would certainly help reduce the classification error and the false-negative rate. The proposed system was deployed on a Raspberry Pi4. The results showed that the system that uses DNN2 achieves the best tradeoff, with a slight drop in accuracy (lower than 1%) with respect to DNN3 and an inference time lower than 2.5 s.
In future works, we will tackle the multitarget classification problem by using radars that can overcome the limitations of range and speed resolution. To address this scenario, we will also explore techniques for separating targets and subsequently classifying their actions. In addition, we will focus on implementing a continuous monitoring system to reduce the acquisition and classification time, and we will test the system on a sample of elderly subjects. Finally, we will leverage semisupervised and unsupervised techniques with the aim of detecting a broader variety of human actions.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

r
and $F_D^{N_D}$ are the range-FFT and Doppler-FFT, outputting sequences of length $N_R$ and $N_D$, respectively. After the processing, the RD map has dimension $N_R \times N_D$. In this work, $N_R = N_D = 256$.
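The two-stage FFT described here can be sketched as follows. This is a minimal illustration and not the authors' processing chain: the frame layout (chirps along the first axis, samples within a chirp along the second), the Hann windows, the zero-Doppler centering, and the dB scaling are all assumptions; only the FFT lengths of 256 come from the text.

```python
import numpy as np

N_R = 256  # range-FFT length stated in the text
N_D = 256  # Doppler-FFT length stated in the text


def range_doppler_map(frame: np.ndarray) -> np.ndarray:
    """Compute an (N_R, N_D) range-Doppler magnitude map in dB.

    frame: (num_chirps, samples_per_chirp) beat signal, real or complex.
    """
    num_chirps, num_samples = frame.shape

    # Hann windows on both axes (assumption) to limit spectral leakage.
    fast_window = np.hanning(num_samples)[None, :]
    slow_window = np.hanning(num_chirps)[:, None]
    windowed = frame * fast_window * slow_window

    # Range FFT along fast time (samples within each chirp).
    range_fft = np.fft.fft(windowed, n=N_R, axis=1)

    # Doppler FFT along slow time (across chirps), shifted so that
    # zero velocity sits in the middle of the Doppler axis.
    doppler_fft = np.fft.fftshift(np.fft.fft(range_fft, n=N_D, axis=0), axes=0)

    # Rows index range bins, columns index Doppler bins.
    magnitude = np.abs(doppler_fft).T
    return 20.0 * np.log10(magnitude + 1e-12)
```

For a synthetic single target whose beat signal is a complex exponential at range bin 40 and Doppler bin 10, the peak of the returned map lies at row 40 and column 138, since the shift moves Doppler bin 0 to column 128.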

Fig. 4 shows the proposed DNN (Classifier in Fig. 1) used to classify the 4-D tensor. Human actions are dynamic and unfold over time, so they are captured in multiple RD maps, each constructed in approximately 30 ms, as described in Section III-A. To address the classification problem, we propose a hybrid model that uses a CNN to extract image features and an LSTM to capture the dependencies between consecutive maps. A TDL applies the same layer(s) to every time step of the input [29]. In this article, the TDL uses a CNN to extract T feature maps from the T images of the 4-D input tensor, and the output of the TDL is a sequence of feature maps. An LSTM layer learns the dependencies within the sequence of feature maps. Finally, two dense layers are the output layers of the DNN. The first one is a fully connected layer with a ReLU activation function. The second is made of Neu neurons, i.e., the number of classes, with a Softmax activation function to assign the action label.

Three CNNs have been designed. The use of CNNs aims to automatically extract features by learning the kernels (i.e., filters) that convolve with the data. This practice replaces domain-specific feature extraction, which is usually handcrafted by experts. The CNNs have been designed as a proof of concept for the feasibility of the action recognition system. The model's number of parameters is correlated with the size/performance ratio. Therefore, three CNN architectures of different sizes are adopted to investigate the effect of the number of layers on the performance of the overall system. Fig. 5 shows the CNN reference architecture (called CNN2). CNN2 comprises six main blocks: the first one, B1, consists of six 2-D convolutional layers taking as input a 3-D tensor of dimensions N × M × C, where N and M represent the dimensions of the map and C is the number of channels. In this article, the range-Doppler maps [Fig. 2(a)] have C = 3, while all the transformed images [Fig. 2(b)-(f)] have C = 1. Each of the other five blocks (i.e., B2, B3, B4, B5, and B6) consists of a batch normalization layer followed by a 2-D average pooling layer and a varying number of 2-D convolutional layers (the number of convolutional layers decreases by one from each block to the next). The last block, B6, contains only one convolutional layer. Following this block, batch normalization and global average pooling layers are added. In particular, the global pooling flattens the last feature map. Two other instances of the CNN2 architecture have also been taken into account. CNN1 has an architecture similar to CNN2 but with a lower number of layers: it contains five blocks instead of six, each block has a convolutional layer less than

Fig. 7. Example of range-Doppler maps for the five actions.
C. DNNs Parameters
According to Fig. 5, the CNN input has dimensions N × M × C. In this article, we set N = M = 224 and C = 3 in the case of an RGB image [i.e., Fig. 2(a)] and C = 1 for all the other transformations [i.e., Fig. 2(b)-(f)]. As mentioned in Section II-B, since the original RD maps (i.e., RD in Fig. 1) have dimensions 256 × 256 × 3, all the images have been resized to 224 × 224 × 3 before applying the transformations. In the B1 block of all the models, the number of filters F is set to 8, while in each subsequent block F is doubled and N and M are halved. As a result, the output of the last block of CNN2 in Fig. 5 (i.e., B6) has dimensions 7 × 7 × 256. Consequently, the output of CNN1 has dimensions 14 × 14 × 128, while the output of CNN3 has dimensions 3 × 3 × 512. The LSTM layer that follows the TDL layer in Fig. 4 has 128 neurons for all the DNNs. The first dense layer has 64 neurons in all the DNNs. The output layer has five neurons, corresponding to the five classes. Table III presents the total number of parameters for each DNN. DNN1 refers to the architecture of Fig. 4 using CNN1. Likewise, DNN2 and DNN3 use CNN2 and CNN3, respectively. Each of the three DNNs is trained using the six datasets (3).
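The output dimensions quoted above follow directly from the halving and doubling rule, as the short sketch below shows. The helper names are hypothetical. The block count of seven for CNN3 is an assumption chosen because it reproduces the stated 3 × 3 × 512 output, and floor division is assumed for the pooling of odd sizes. The parameter counts for the LSTM and dense layers use the standard formulas and assume that each time step feeds the LSTM a globally pooled feature vector; they are not the values of Table III.

```python
def cnn_output_shape(num_blocks, size=224, first_filters=8):
    """Return (N, M, F) at the output of the last convolutional block.

    B1 keeps the input resolution with `first_filters` filters; every later
    block halves N and M (floor division) and doubles the number of filters.
    """
    filters = first_filters
    for _ in range(num_blocks - 1):
        size //= 2
        filters *= 2
    return size, size, filters


def lstm_params(input_dim, units):
    """Standard LSTM: four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Fully connected layer: one weight per input-unit pair plus one bias per unit."""
    return input_dim * units + units


# Block counts: five (CNN1) and six (CNN2) are stated in the text;
# seven for CNN3 is an assumption consistent with the 3 x 3 x 512 output.
shapes = {name: cnn_output_shape(blocks)
          for name, blocks in {"CNN1": 5, "CNN2": 6, "CNN3": 7}.items()}

# Recurrent and dense layers on top of CNN2, assuming a 256-element pooled
# feature vector per time step: LSTM(128) -> Dense(64) -> Dense(5).
head_cnn2 = lstm_params(256, 128) + dense_params(128, 64) + dense_params(64, 5)
```

Here `shapes` evaluates to (14, 14, 128), (7, 7, 256), and (3, 3, 512) for CNN1, CNN2, and CNN3, matching the dimensions given in the text, and `head_cnn2` evaluates to 205701 under the stated assumptions.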

Fig. 9. Confusion matrixes of the three best models computed over the five folds. (a) DNN1 trained on $D_{\mathrm{Canny}}$. (b) DNN2 trained on $D_{\mathrm{Gray}}$. (c) DNN3 trained on $D_{\mathrm{Gray}}$.

Fig. 10. ROC and AUC computed over the folds of the best performing models. (a) DNN1 trained on $D_{\mathrm{Canny}}$. (b) DNN2 trained on $D_{\mathrm{Gray}}$. (c) DNN3 trained on $D_{\mathrm{Gray}}$.

Fig. 13. Confusion matrixes of DNN2 and DNN3 computed on the new dataset with no transformation and in grayscale. (a) DNN2 tested on $T_{\mathrm{Orig}}$. (b) DNN2 tested on $T_{\mathrm{Gray}}$. (c) DNN3 tested on $T_{\mathrm{Orig}}$. (d) DNN3 tested on $T_{\mathrm{Gray}}$.

TABLE III NUMBER OF PARAMETERS OF THE THREE DNNS

TABLE IV AVERAGE ACCURACY ON THE 5-FOLDS

TABLE VI SYSTEM ASSESSMENT ON RASPBERRY PI4

TABLE VII COMPARISON WITH ADVANCED IMAGE RECOGNITION MODELS ON FIVE-CLASS CLASSIFICATION

TABLE IX COMPARISON ON RASPBERRY PI4 WITH IMAGE RECOGNITION MODELS

TABLE X GENERALIZATION PERFORMANCE IN ANOTHER ENVIRONMENT

TABLE XII COMPARING RADAR WITH WEARABLE AND CAMERA-BASED SYSTEMS