On the Edge Recurrent Neural Network Approach for Ground Moving FMCW Radar Target Classification

In this paper, an approach for ground-moving target classification with an FMCW radar is proposed. In particular, data are collected using a low-cost 24 GHz off-the-shelf FMCW radar, combined with an embedded Raspberry Pi device for data acquisition and processing. An FFT-based processing scheme is then applied to obtain a sequence of range-Doppler maps, which are provided as input to different convolutional neural network (CNN) architectures for classifying the targets (cars, motorcycles, or pedestrians) possibly passing in front of the radar. Specifically, two approaches have been followed and compared. In the first one, single range-Doppler maps are processed individually by a convolutional neural network, and a voting mechanism is then applied to select the target class. In the second approach, a sequence of range-Doppler maps is processed through a time-distributed layer feeding a recurrent neural network. The networks are deployed on the Raspberry Pi, providing the target classification on a low-cost embedded device. The obtained results show that the proposed approaches allow the different types of targets to be effectively classified on an embedded device in less than one second.

On the other hand, DNNs, accompanied by the use of micro-Doppler signatures [35], [36], are also an alternative approach for the classification of ground targets [37]. This approach has proven effective in several applications targeting human recognition and activity classification [38], [39], [40], [41], as well as human-robot classification [42]. Nevertheless, the extraction of a micro-Doppler signature is not a simple process, as it usually requires long illumination periods of the targets. This fact makes it hard to extract such features with low-cost radar devices, especially in the presence of relatively fast targets like cars and motorcycles [43].
© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

At the same time, it is necessary to keep in mind that the adoption of DNNs usually raises issues because of the large amount of training samples that are needed [44]. This indeed creates a problem, especially when working with low-cost radars that suffer from hardware limitations that hinder their capability to collect a sufficient amount of data in a limited time. However, this problem can be mitigated with the adoption of recurrent neural networks (RNNs), as they involve the time-varying information of the target in the final decision [45], [46]. This could compensate for the low amount of available data and greatly improve the accuracy of the final decision. In fact, RNNs are known for their ability to extract temporal features from the available sequence of time-varying data. This indeed is an interesting approach to be considered in radar applications since radar data is usually a collection of consecutive time-varying data frames. This temporal variability is usually caused by the movement of the target during the illumination period. Some examples of the use of RNNs in radar applications have been presented in the literature. RNNs were adopted by the authors in [31], [47], [48], [49] for human recognition and target classification in radar security applications. In addition, synthetic aperture radar (SAR) applications are an interesting field of study for RNNs, as presented by the authors in [50], [51], [52].
This paper proposes a low-cost system for ground-moving target classification based on a DNN designed from scratch. An edge device collects the signals from an FMCW radar and transforms them into range-Doppler maps. The DNN, deployed on the edge, receives as input a sequence of range-Doppler maps treated as a series of images. The DNN consists of a convolutional neural network (CNN) that automatically extracts the features from each map and a recurrent neural network (RNN) that provides the classification of the moving targets. The performance of the system is evaluated in terms of accuracy and computational cost measured on the edge device. The computational cost is represented by the inference time and the energy consumption. The evaluation is assessed on a three-class classification problem, i.e., pedestrians, cars, and motorcycles. For the sake of comparison, three other classifiers are designed and implemented on the edge device: a DNN built upon a pre-trained CNN that classifies the sequence of range-Doppler maps, and two CNNs that classify a single range-Doppler map. One single-map classifier has the same architecture as the CNN employed in the DNN designed from scratch, while the other one has the same architecture as the pre-trained CNN. To increase the generalization accuracy of the single range-Doppler map classifiers, a voting mechanism is also applied. It consists of
assigning the label to a sequence of maps classified as single inputs, based on the most frequent class. The contributions of this paper are summarized as follows:
• A low-cost system, based on an FMCW radar and an edge device, is adopted for three-class moving target classification.
• A DNN designed from scratch and deployed on the edge classifies a sequence of range-Doppler maps, achieving high accuracy while providing real-time inference measured on the edge with restrained energy consumption.
The remainder of the paper is organized as follows. Section II presents the methodology behind using a sequence of range-Doppler maps as an input for the DNN. Section III illustrates the collected datasets. Section IV shows the adopted network architectures, and the training procedure is detailed in Section V. The obtained results are discussed in Sections VI and VII. Finally, conclusions are presented in Section VIII.

II. METHODOLOGY
This paper proposes a system for multi-class moving target classification, consisting of a low-cost FMCW radar and an edge device powered by an external battery. The system is shown in [27].
The edge device, a Raspberry Pi4 (RP), hosts two stages: the pre-processing stage for the extraction of the range-Doppler (RD) maps from the signals received from the FMCW radar and the classification stage in which the moving target label is predicted.In the following, the two stages are detailed.

A. Pre-Processing Stage
In an FMCW radar, an up-chirp, namely a sinusoidal signal having constantly increasing frequency, is emitted. More in detail, a burst of N_c up-chirps is transmitted using a dedicated (TX) antenna [53], [54]. The irradiated electromagnetic wave bounces off a target present in the monitored area and the related echo is gathered by a receiving (RX) antenna. The resulting signal at the input of the receiving chain is a copy of the transmitted burst delayed by a time τ = 2R/c, where R is the target range and c is the speed of light. Short-range FMCW radars usually adopt an I/Q demodulator in the receiver chain. The signal at the output of this demodulator, called the Intermediate Frequency (IF) signal, can be represented as [55], [56], [57]:

s_IF(t_c, n_c) = A_b exp{ j2π [ (2BR/(cT_c)) t_c + f_D (t_c + n_c T_crp) + 2R/λ_0 ] }

where t_c ∈ [0, T_c] is the time variable (also known as fast-time) inside a single up-chirp, A_b is the amplitude of the IF signal, B is the sweep bandwidth (namely, the amount of change in frequency of an up-chirp), T_c is the up-chirp signal duration, v_r is the target's radial velocity (v_r > 0 for departing targets), n_c is the chirp index among the N_c transmitted chirps (also known as slow-time), f_D = −2v_r/λ_0 is the Doppler shift (with λ_0 the free-space wavelength), and T_crp is the chirp repetition period (which usually includes T_c and a pause time before the firing of the following up-chirp).
For each received chirp, a number N_s of samples is considered, thus forming a 2D data matrix of dimensions N_c × N_s. Such a matrix undergoes a 2D FFT [58], performed along the fast-time and slow-time dimensions, to obtain an RD map; a hypothetical point-like target having a given radial velocity v_r and range R will appear as a peak of intensity in the RD map, at coordinates strictly related to v_r and R. Beyond the indication of the range and radial velocity of a target, the importance of the RD map is that particular features of the geometry and/or movement of a target appear on this map as time-varying patterns [43], [59], [60]. This encourages the adoption of the machine learning algorithms mentioned in Section I.
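As an illustrative sketch of this processing chain, the 2D FFT can be applied to a synthetic IF data matrix containing a single point-like target; the chirp counts and bin positions below are arbitrary choices for the example, not the radar settings of Table I:

```python
import numpy as np

# Synthetic IF data: Nc chirps (slow-time) x Ns samples per chirp (fast-time).
Nc, Ns = 64, 128
fb_bin, fD_bin = 20, 10   # beat-frequency and Doppler bins of a synthetic point target

tc = np.arange(Ns) / Ns   # normalized fast-time
nc = np.arange(Nc) / Nc   # normalized slow-time
# IF signal of one point target: a beat tone along fast-time plus a
# Doppler tone along slow-time (cf. the IF-signal model above).
data = np.cos(2 * np.pi * (fb_bin * tc[None, :] + fD_bin * nc[:, None]))

# Range FFT along fast-time, then Doppler FFT along slow-time.
rd = np.fft.fft(data, axis=1)        # -> range bins
rd = np.fft.fft(rd, axis=0)          # -> Doppler bins
rd_map = np.abs(rd[:, :Ns // 2])     # keep positive range frequencies

# The point target appears as a single peak at (Doppler bin, range bin).
peak = np.unravel_index(np.argmax(rd_map), rd_map.shape)
```

In a real acquisition the peak coordinates map back to v_r and R through the chirp parameters (B, T_c, T_crp, λ_0).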
To adopt standard deep networks for image processing and classification, which assume that the input is a monochromatic or color image, 2D images are obtained by considering the amplitude of the RD maps. Consequently, such images can be represented as 3D digital tensors, i.e., RD ∈ ℕ^(N_R × N_D × C), where N_R and N_D are the numbers of samples outputted from the range and Doppler FFTs, and C represents the number of channels used for representing the RD map as an image.
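The amplitude-to-image conversion can be sketched as follows; `rd_to_image` is a hypothetical helper name, and since the paper does not specify the scaling or resizing method, min-max scaling and nearest-neighbour resizing are assumed here:

```python
import numpy as np

def rd_to_image(rd_map, out_hw=(224, 224)):
    """Hypothetical helper: turn an amplitude RD map into a 3-channel
    uint8 image tensor of the size expected by the networks."""
    # scale amplitudes to [0, 255]
    a = rd_map - rd_map.min()
    a = (255 * a / max(a.max(), 1e-12)).astype(np.uint8)
    # nearest-neighbour resize to the target grid (assumed method)
    ri = np.linspace(0, a.shape[0] - 1, out_hw[0]).astype(int)
    di = np.linspace(0, a.shape[1] - 1, out_hw[1]).astype(int)
    a = a[ri][:, di]
    # replicate the single channel to C = 3
    return np.repeat(a[:, :, None], 3, axis=2)

img = rd_to_image(np.random.rand(256, 256))
```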

B. Classification Stage
Besides the stage for the extraction of the RD maps, the RP hosts a DNN designed from scratch to classify a moving target from a sequence of RD maps. This sequence is time-dependent and can be formalized as a 4D tensor X ∈ ℕ^(N_R × N_D × C × T), where T represents the number of RD maps in the sequence. The aim of collecting T RD maps is to increase the classification accuracy with respect to a single-RD-map classifier, as shown in [31]. The proposed DNN consists of a CNN that automatically extracts the features from each RD map by means of a time-distributed layer (TDL). The TDL applies the same layers or architecture to every time step of the input. In this work, the TDL wraps the CNN to extract features from the sequence of RD maps, producing a sequence of feature maps. A recurrent neural network (RNN) learns the time dependency between the outputs of the TDL. Finally, a Dense layer provides the moving target label. The DNN architecture is shown in Fig. 1.
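The TDL-plus-RNN arrangement can be expressed in Keras as a minimal sketch; the layer sizes below are illustrative placeholders, not the paper's exact S-CNN architecture (which is given in Table V):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

T, H, W, C = 5, 224, 224, 3   # sequence length and RD map image size

def build_cnn(h=H, w=W, c=C):
    # Small CNN feature extractor (illustrative sizes, not the paper's S-CNN).
    inp = layers.Input(shape=(h, w, c))
    x = layers.Conv2D(8, 3, activation="relu")(inp)
    x = layers.MaxPooling2D(4)(x)
    x = layers.Conv2D(16, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    return models.Model(inp, x)

seq_in = layers.Input(shape=(T, H, W, C))            # sequence of T RD maps
feats = layers.TimeDistributed(build_cnn())(seq_in)  # same CNN applied to every map
x = layers.LSTM(32)(feats)                           # temporal dependencies
x = layers.Dense(16, activation="relu")(x)
out = layers.Dense(3, activation="softmax")(x)       # car / motorcycle / pedestrian
model = models.Model(seq_in, out)
```

The `TimeDistributed` wrapper is what makes the convolutional weights shared across the T time steps, so the CNN sees each map independently while the LSTM sees the resulting feature sequence.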
To demonstrate the effectiveness of our design, three other DNNs have been employed for classifying the moving target. The first DNN classifies the sequence of RD maps and is built upon the pre-trained MobileNetV2 CNN architecture [61]. The other two networks predict the class of the moving target receiving as input only a single RD map as a 3D tensor. The first network is a 2D CNN having the same architecture as the one wrapped by the TDL in the proposed DNN, while the second corresponds to the MobileNetV2 CNN. To increase the generalization accuracy of the two single RD map classifiers, a voting mechanism is also applied: the final label is assigned based on the most frequent class in a sequence of classified data. In summary, six DNNs are compared both in terms of generalization accuracy and computational cost measured on the edge device: 1) a DNN designed from scratch receiving a 4D tensor as input; 2) a DNN encapsulating the MobileNetV2 CNN receiving a 4D tensor as input; 3) a CNN having the same architecture as the one employed in point 1) receiving a 3D tensor as input; 4) a MobileNetV2 CNN receiving a 3D tensor as input; 5) the same network as in point 3) with the voting mechanism applied; 6) the same network as in point 4) with the voting mechanism applied. In the following, the DNNs will be named S-DNN, MN-DNN, S-CNN, MN-CNN, S-CNN_Vote, and MN-CNN_Vote, respectively.

III. DATASETS COLLECTION
The Distance2Go radar module [62], developed by Infineon, is used to collect the datasets described in [27], [28]. Table I shows the complete set of radar sensor parameters. The number of points for the range and Doppler FFTs was chosen as N_R = N_D = 256. The outcome of the pre-processing stage on a moving target, described in Section II-A, is an RD map represented as a 3D tensor RD ∈ ℕ^(256 × 256 × 3). To be compliant with the input size of the DNNs described in the following sections, the RD maps were resized as RD ∈ ℕ^(224 × 224 × 3).
The RD maps were collected by the FMCW radar-based system in two cluttered environments on three kinds of moving targets (i.e., pedestrians, cars, and motorcycles). The two environments, described in previous works [27], [28], involved different environmental conditions for clutter interference, side obstacles, and electromagnetic wave scattering. The number of RD maps of each target motion depended on the speed of the motion itself in the FoV of the radar. In this work, 12 datasets were extracted from the original data of [27], [28]: six concern the single RD maps, and six the sequences of RD maps. The datasets are detailed in the two following subsections.

A. Single RD Maps Datasets
Two datasets were derived from RD maps collected by the system mounted on a pole at a 1.5 m height, near an internal road of the University of Genoa, Italy [27], [28]. Compared with the previous works, more samples for each target were considered in this proposal. A set of 300 moving targets was gathered, equally divided among the three classes, i.e., cars, pedestrians, and motorcycles. In [27], 60 cars (of which 30 were trucks), 30 pedestrians, and 30 motorcycles were adopted, while, in [28], 95 cars (of which 7 were trucks), 31 pedestrians, and 56 motorcycles were considered.
The first dataset contains 3 RD maps for each target motion, while the second includes 5 RD maps. Motions with more than three or five maps were subsampled. The sub-sampling consisted of taking the first and the last maps and extracting the remaining maps randomly. As a result, the related datasets, namely SNG1_T with T ∈ {3, 5}, contain 900 and 1500 data (i.e., 300 and 500 RD maps per class), respectively. The datasets can be formalized as:

SNG1_T = { (RD_i, y_i) | RD_i ∈ ℕ^(224 × 224 × 3), y_i ∈ {car, pedestrian, motorcycle}, i = 1, ..., N_sng_1,T }

where N_sng_1,T ∈ {900, 1500} represents the number of data in the first scenario depending on the choice of T.

TABLE II SINGLE RD MAPS DATASETS
In the second scenario [28], the radar was mounted on a pole near a different road in the city of Genoa. The radar was placed at a higher position (3 m) than the previous one. This set comprises 115 targets (i.e., 60 pedestrians, 43 cars, and 12 motorcycles). Also in this case, more samples were included: in [28], only 10 cars (of which 5 were trucks), 5 pedestrians, and 5 motorcycles were tested. Two other datasets were built, namely SNG2_T with T ∈ {3, 5}, containing 3 and 5 RD maps for each target, respectively. The datasets can be formalized as:

SNG2_T = { (RD_i, y_i) | RD_i ∈ ℕ^(224 × 224 × 3), y_i ∈ {car, pedestrian, motorcycle}, i = 1, ..., N_sng_2,T }

where N_sng_2,T ∈ {345, 575} represents the number of data in the second scenario depending on the choice of T.
A third couple of datasets stems from the merge of SNG1_T with SNG2_T. The resulting datasets are named SNGM_T with T ∈ {3, 5}. The two datasets include 160 pedestrians, 143 cars, and 112 motorcycles, and each target motion is a collection of 3 and 5 RD maps, respectively. Hence, they can be formalized as:

SNGM_T = SNG1_T ∪ SNG2_T = { (RD_i, y_i) | RD_i ∈ ℕ^(224 × 224 × 3), y_i ∈ {car, pedestrian, motorcycle}, i = 1, ..., N_sng_M,T }

where N_sng_M,T ∈ {1245, 2075} represents the number of data in the merged scenarios depending on the choice of T.
Table II summarizes the six single RD maps-based datasets. The first column reports the dataset names, the second the labels of the three targets, the third the number of motions for each class, and the last column the total number of RD maps collected per class based on the value of T.
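The sub-sampling used to cap each motion at T maps (first map, last map, random maps in between, as described above) can be sketched as follows; `subsample_maps` is a hypothetical helper name:

```python
import random

def subsample_maps(maps, T):
    """Keep the first and last RD maps of a motion and draw the
    remaining T-2 maps at random from the middle of the motion."""
    if len(maps) <= T:
        return list(maps)
    middle = random.sample(maps[1:-1], T - 2)
    # preserve the temporal order of the selected maps
    middle.sort(key=maps.index)
    return [maps[0]] + middle + [maps[-1]]

# e.g. reduce a 10-map motion to a T = 5 subsequence
seq = subsample_maps(list(range(10)), 5)
```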

B. Sequence RD Maps Datasets
Following a similar approach as in [31], to assess the effect of including the time-varying components of the radar signals on the moving target classification, the RD maps of each target are stacked into a 4D tensor which contains all the information related to the moving target dynamics. As a result, six datasets containing sequences of RD maps for each target were obtained from the single RD maps datasets. Hence, the RD maps of each SNG dataset were stacked, generating a sequence of maps for each target. In particular, in the datasets SEQ1_T with T ∈ {3, 5}, the three and five RD maps, respectively, collected for each target in the first scenario were stacked, generating a sequence of RD maps. The datasets can be formalized as:

SEQ1_T = { (X_i, y_i) | X_i ∈ ℕ^(224 × 224 × 3 × T), y_i ∈ {car, pedestrian, motorcycle}, i = 1, ..., N_seq_1 }

where N_seq_1 = 300. Straightforwardly, the other datasets, SEQ2_T and SEQM_T, can be represented in the same way, with N_seq_2 = 115 and N_seq_M = 415 sequences, respectively. Table III summarizes the six sequence RD maps-based datasets. The first column reports the dataset names, the second the labels of the three targets, and the third the number of motions for each class, corresponding to the number of RD map sequences.
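The stacking step itself is a simple tensor operation, sketched below with dummy data. Note that the formalization above places the time axis last, while frameworks such as Keras expect a leading time axis for time-distributed processing; the sketch uses the leading-axis convention:

```python
import numpy as np

T = 3
# a motion described by T single RD maps (dummy 224x224x3 tensors here)
maps = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(T)]

# stack along a new leading time axis to obtain the 4D sequence tensor
x_seq = np.stack(maps, axis=0)   # shape (T, 224, 224, 3)
```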
Figure 2 shows some examples of the RD maps for the three classes, car, motorcycle, and pedestrian, respectively. The first row of each class corresponds to the pictures captured in three instants by a camera mounted next to the radar, while the second row contains the RD maps corresponding to the pictures. The x-axis and y-axis of the RD maps represent the range and the target radial velocity, respectively. Since the maximum unambiguous radial velocity measured by the radar is limited to 5.4 km/h, as reported in Table I, an aliasing phenomenon is present in the Doppler spectrum, resulting in Doppler peaks around 0 km/h for vehicles. By comparison, the vehicle's RD maps can change depending on the position of the moving vehicle with respect to the radar, whose position is fixed. When the vehicle shows its larger side to the radar, the contact surface between the radar beam and the vehicle is higher, corresponding to more reflections towards the radar. Several of these reflections are received at different range and Doppler bins, resulting in multiple peaks in the RD maps. The greater the contact surface, the higher the possibility of RD maps presenting multiple peaks; in fact, in the figure the car presents multiple peaks at t = t_2 and t = t_3.
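The aliasing effect can be reproduced numerically: any radial velocity outside the unambiguous interval folds back into [−v_max, v_max), with v_max = 5.4 km/h from Table I. A minimal sketch, where `alias_velocity` is a hypothetical helper and the vehicle speeds are purely illustrative:

```python
def alias_velocity(v, v_max=5.4):
    """Fold a radial velocity (km/h) into the unambiguous interval
    [-v_max, v_max) measured by the radar (v_max = 5.4 km/h, Table I)."""
    span = 2 * v_max
    return (v + v_max) % span - v_max

# A slow pedestrian stays unaliased, while a car at 54 km/h folds
# exactly onto 0 km/h, consistent with vehicle peaks near zero Doppler.
v_pedestrian = alias_velocity(3.0)
v_car = alias_velocity(54.0)
```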
A different pattern can be observed in the pedestrians' RD maps: their motions include the movements of different body parts. This implies that each part generates its own Doppler frequency, causing a Doppler spread. The proposed approach based on recurrent neural networks employs the sequence of RD maps to learn the time-varying components of consecutive maps, notably to differentiate between cars and motorcycles, which can present more similarities in a single map. The results will prove that, by taking into account the RD map sequences, the classification accuracy outperforms the one achieved by predicting moving targets based on a single map.

C. Statistical Analysis on the RD maps
To justify the importance of merging the datasets collected from the two distinct environments, we conducted the following statistical analysis. For each scenario, we randomly selected 50 RD maps from each class. For each RD map, we computed the horizontal and vertical histogram projections (a common approach in computer vision, e.g., [63]), related to the range and Doppler dimensions, respectively. From these, we extracted the standard deviations along the range and the Doppler dimensions, indicated as σ_r and σ_D, respectively. Thereafter, within each class in each scenario, we computed the mean and the standard deviation of both σ_r and σ_D, indicated as E(σ_r), E(σ_D), SD(σ_r), and SD(σ_D), where E(·) and SD(·) are the mean and standard deviation operators, respectively. Table IV illustrates the computed values, which indicate significant statistical differences between the two scenarios and between the classes within a scenario. Consequently, a training set built with observations coming only from a single scenario will lead to a classifier that probably generalizes poorly to test data coming from the other scenario.
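One possible reading of this statistic can be sketched as follows (this is an assumption about the exact computation, not the authors' code): each projection is treated as a histogram over bins, and σ is the standard deviation of the bin index under that normalized histogram, so a point-like return gives a small σ and a spread return a large one:

```python
import numpy as np

def _sigma(projection):
    # std of the bin index under the normalized projection "histogram"
    p = projection / projection.sum()
    idx = np.arange(p.size)
    mu = (idx * p).sum()
    return float(np.sqrt(((idx - mu) ** 2 * p).sum()))

def projection_sigmas(rd_map):
    """sigma_r from the projection onto the range axis (rows),
    sigma_D from the projection onto the Doppler axis (columns)."""
    sigma_r = _sigma(rd_map.sum(axis=1))   # collapse Doppler -> range profile
    sigma_d = _sigma(rd_map.sum(axis=0))   # collapse range -> Doppler profile
    return sigma_r, sigma_d

# a point-like return is maximally concentrated ...
point = np.zeros((8, 8)); point[3, 4] = 1.0
# ... while a uniform map is maximally spread
spread = np.ones((8, 8))
```

E(σ) and SD(σ) for each class/scenario then follow from `np.mean` and `np.std` over the 50 sampled maps.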

IV. DEEP NEURAL NETWORKS FOR TARGETS CLASSIFICATION
This section describes the architectures of the six networks designed for the moving target classification. All the networks were designed in Python, using the Keras library provided by TensorFlow.

A. Single RD Map Classifiers
Two CNNs were employed to classify the single RD maps, the first designed from scratch and the second based on the MobileNetV2 architecture [61]. As mentioned, they are named S-CNN and MN-CNN, respectively. Table V presents the S-CNN architecture, while Table VI presents the MN-CNN architecture. In both tables, the first column represents the size of the input tensor for each layer, while the second shows the operation applied to that tensor. The voting mechanism assigns the final label ŷ_vote according to the majority of the labels assigned to the RD maps of the sequence. Procedure 1 depicts the voting mechanism. In particular, in the case of two or more most frequent classes, the label is retrieved by summing the T probabilities of the most frequent classes from the softmax layer and taking the highest one.
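Procedure 1 as described (majority vote with a softmax-sum tie-break) can be sketched as:

```python
import numpy as np

def vote(probs):
    """probs: (T, n_classes) array of softmax outputs for the T maps of
    a sequence. Majority vote over per-map labels; ties among the most
    frequent classes are broken by the summed softmax probabilities."""
    probs = np.asarray(probs)
    labels = probs.argmax(axis=1)
    counts = np.bincount(labels, minlength=probs.shape[1])
    best = np.flatnonzero(counts == counts.max())   # most frequent class(es)
    if best.size == 1:
        return int(best[0])
    scores = probs[:, best].sum(axis=0)             # tie-break: summed probabilities
    return int(best[scores.argmax()])

y = vote([[0.7, 0.2, 0.1],
          [0.1, 0.8, 0.1],
          [0.6, 0.3, 0.1]])   # two votes for class 0
```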

B. Sequence RD Maps Classifiers
Two DNNs, i.e., S-DNN and MN-DNN, were designed to classify the sequences of RD maps as 4D tensors. A time-distributed layer (TDL) wraps a CNN to extract the features from each RD map of the input sequence, thus obtaining a sequence of feature tensors. A recurrent neural network (RNN) learns the dependencies between the feature tensors. Figure 3 shows an example of a CNN wrapped by a TDL and applied over the T maps of the same sequence to feed an RNN.
The two CNNs wrapped by the TDL are the S-CNN and MN-CNN architectures previously presented, excluding the two dense layers at the bottom of the networks. The Long Short-Term Memory (LSTM) layer is chosen as RNN because it has proved to be suitable in many applications based on time series [64]. Two dense layers represent the output for both networks. The first one is a fully connected layer with the ReLU activation function, while the second presents 3 neurons, i.e., the number of types of moving targets, with the Softmax activation to assign the label. Tables VII and VIII summarize the two architectures.

V. EXPERIMENTAL SETUP
This section provides an extensive description of the training setup for the classifiers described in Section IV.

A. Single RD Map Classifiers Training Setup
The S-CNN and MN-CNN classifiers were trained on the SNG1_T and SNGM_T datasets, with T ∈ {3, 5}. The MN-CNN networks were tuned by unfreezing the last four layers of the pre-trained network, corresponding to 412,800 tunable parameters. When the networks were trained on SNG1_T, the datasets SNG2_T were used as test sets. The first part of Table IX shows the hyper-parameters for SNGM_3 and SNGM_5. A '/' symbol separates, if any, the differences in the hyper-parameter configurations between the grouped datasets.
In addition, the four training datasets were randomly split into training/validation/test sets. The validation set was used for an early stopping criterion on the validation loss, setting the patience value to five for all the configurations. All the networks were trained by using the Adam optimizer [65]. The number of data of the splits for each class and each model/dataset pair is shown in Table X. Specifically, the first half of the table shows the number of samples for each split and class of the datasets SNG1_T and SNG2_T, when adopting SNG1_T and SNGM_T as training sets. The first column represents the split set, the second column the target labels, and, from the third to the last, the table displays the number of samples for each class in each split of the datasets. The training/validation sets are grouped and separated by the '/' symbol. For the merged datasets, 30 and 50 random samples for each class were extracted for the validation splits when T = 3 and T = 5, respectively. As can be noticed, each class of the merged datasets, considering data from both scenarios, contains the same number of data for the training/validation splits, avoiding biased training due to class unbalancing.

B. Sequence RD Maps Classifiers Training Setup
Similarly to the single RD map classifiers, the S-DNN and MN-DNN architectures were trained on the four datasets containing the sequences of RD maps (i.e., SEQ1_T and SEQM_T, with T ∈ {3, 5}). As previously, the last four layers of the pre-trained networks were unfrozen.
The second half of Table IX reports the hyper-parameters of the RD sequence classifiers. The RD map sequence classifiers were trained by using the Adam optimizer as well. The second half of Table X, similarly to the first part, shows the number of samples of the datasets SEQ1_T and SEQ2_T, for the classes and the splits, when adopting SEQ1_T and SEQM_T as training sets. As before, each class of the merged datasets contains the same number of data for the training/validation splits.

VI. GENERALIZATION PERFORMANCE RESULTS
When introducing edge AI systems, it is crucial to assess two complementary aspects, i.e., the classification performance in terms of accuracy and F1-score, and the efficiency of the edge device. The F1-score is a metric that balances precision and recall, making it a robust choice for addressing class imbalance. In this study, the weighted metric was calculated by inversely weighting the class frequencies, thereby assigning greater significance to the classes with fewer instances, such as motorcycles and cars.
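The inverse-frequency weighting described above can be sketched as follows. This is our reading of the text, not the authors' exact implementation; note that scikit-learn's built-in 'weighted' average uses direct, not inverse, class frequencies, so a custom helper such as the hypothetical `weighted_f1` below would be needed:

```python
import numpy as np

def weighted_f1(y_true, y_pred, n_classes=3):
    """Per-class F1 scores combined with weights inversely proportional
    to the class frequencies, so rarer classes count more."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s, weights = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1s.append(2 * tp / max(2 * tp + fp + fn, 1))   # per-class F1
        weights.append(1.0 / max(np.sum(y_true == c), 1))
    w = np.array(weights) / np.sum(weights)             # normalize the weights
    return float(np.sum(w * np.array(f1s)))
```

With this weighting, a single error on a rare class moves the score more than the same error on a frequent class.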
In this section, the results in terms of generalization accuracy and F1-score are presented.

A. Generalization Performance on S − CNN and MN − CNN Networks
The first experiment regards the evaluation of the generalization performance in terms of accuracy and F1-score of the S − CNN and MN − CNN classifiers trained with SN G1 T and SN GM T datasets, according to Sections III-A and V-A.

The first part of Table XI reports the scores of the networks trained with SNG1_T and tested over the test split of SNG1_T and the whole SNG2_T. The second part shows the scores of the networks trained with SNGM_T and tested over the test splits of the merged datasets, SNG1_T, and SNG2_T. The third and fourth parts present the results of the networks trained and tested on the datasets of the first two parts when adopting the voting mechanism.
The results highlight that the network designed from scratch, i.e., S-CNN, exhibits a higher accuracy and F1-score than the MN-CNN architecture in all the cases. This is likely related to the larger number of parameters in the latter networks, which can increase the risk of overfitting. Additionally, it is possible that the frozen layers may not adapt optimally to the new task, resulting in slower convergence compared to networks built from scratch. Ultimately, the architectural design of the pre-trained networks may not be the best fit for the new task, while a network designed from scratch can be tailored to the specific requirements of the task, potentially leading to faster convergence. Additionally, when the networks are trained with SNG1_T, they generalize poorly on SNG2_T, even when adopting the voting mechanism. This can be explained considering that the two scenarios suffer from different distributions of the clutter. Moreover, the relative positions between the targets and the radar differ between the scenarios, and this affects the patterns on the RD maps. On the other hand, by including a small amount of data from SNG2_T during training, the architectures achieve good performance across all the datasets. This performance is further improved by applying the voting mechanism. The S-CNN networks always exceed the accuracies and F1-scores achieved by the MN-CNN models. When the voting mechanism is not adopted, the S-CNN architecture trained with SNGM_3 exhibits the best generalization performance, even though the F1-score on SNG2_3 is lower than that achieved with SNG2_5. This difference in F1-scores can be attributed to the varying class weights resulting from the larger sample size of SNG2_5. When the mechanism is used, the best generalization performance is achieved by the S-CNN_Vote network trained with SNGM_5, meaning that, with a higher number of RD maps collected for each moving target, the voting mechanism is more effective.

B. Generalization Performance on S − DNN and MN − DNN Networks
The second experiment regards the evaluation of the generalization performance in terms of accuracy and F1-score of the S − DNN and MN − DNN classifiers trained with SEQ1 T and SEQM T datasets, according to Sections III-B and V-B.
The results show a similar trend as in Table XI: the S-DNN network achieves better accuracy and F1-score than the MN-DNN model across all the tested datasets. Both architectures, when trained with SEQ1_T, generalize poorly on SEQ2_T. On the other hand, by training the networks with the SEQM_T datasets, the classifiers generalize well across all the datasets. In particular, S-DNN achieves the best performance with respect to the MN-DNN network and the single RD map classifiers, i.e., S-CNN trained with the SNGM_T datasets. Comparing the S-DNN classifiers with S-CNN_Vote trained with the merged datasets, the outcomes highlight that, when collecting 3 RD maps for each target, the voting mechanism is more effective than adopting the sequence-based classifier. Instead, when employing 5 maps, the S-DNN network outperforms all the other classifiers in terms of accuracy. However, the F1-score on SEQM_5 is lower than that achieved with the voting mechanism, even though the accuracy remains the same. As mentioned earlier, this difference can be attributed to the variation in class weights. Specifically, the S-DNN misclassified one motorcycle (the less numerous class) as a car, three cars as motorcycles, and one car as a pedestrian. On the other hand, the S-CNN_Vote misclassified two cars as motorcycles, one car as a pedestrian, and two pedestrians as motorcycles. Importantly, S-CNN_Vote correctly predicted all the motorcycles, which is the class with the highest weight in the computation of the F1-score.
In general, except in one case, considering the time dependency between a sufficient number of RD maps extracted from a moving target leads to a higher generalization performance than that of the single-map classifiers, even though the voting mechanism increases their performance.

C. Training and Validation Curves
The training curves are shown for the best models trained with SEQM_3 and SEQM_5, respectively, due to their highest performance in the previous analysis. In each plot, accuracy is represented by dashed lines, while loss is indicated by continuous lines. The training set data is depicted in blue and the validation set data in red, with the number of epochs shown on the x-axis. As outlined in Section V, we implemented an early stopping criterion with a patience of five epochs based on the validation loss.

D. Comparison With Other Approaches
Recently, other techniques have been proposed to classify moving targets in similar operational scenarios. In [66], the authors proposed an SVM to classify the single targets, achieving 95% accuracy in the three-class classification problem in the first scenario and misclassifying a motorcycle as a pedestrian. In [28], the authors obtained an overall accuracy slightly lower than 92% in the merged scenarios adopting a K-NN, misclassifying two cars and a pedestrian as motorcycles, and one motorcycle as a pedestrian. For both approaches, the trucks, which would represent a fourth class, have been considered cars. Table XIII compares the performance of [28], [66] with the proposal in the first scenario and the merged ones. For the proposed approach, the best model has been chosen, i.e., S-DNN trained with SEQM_5. The columns report the number of correctly classified samples per class over the total number of data in the class. The last column reports the accuracies and the F1-scores computed over the three classes. The proposed approach, in the first scenario, achieved 100% accuracy, outperforming the SVM. In the merged scenarios, the DNN presents a slight deterioration with respect to the K-NN in the classification of the cars, as it misclassifies a motorcycle as a car, but it correctly predicts all the pedestrians, outperforming the K-NN overall accuracy. Finally, it is worth remarking that the proposed approach does not require defining and extracting a set of features to be used for classification; instead, the DNNs compute them automatically through the input convolutional layers.

E. Generalization in a New Environment
We employed the proposed network architectures for classifying RD maps collected in another environment [67]. Specifically, RD maps from drones, cars, and people have been acquired in real outdoor scenarios using an FMCW radar. From the original dataset, which contains more than 17000 samples, we selected 1500 maps from each class, maintaining the time correlation between the data. Subsequently, we created new datasets tailored to our network architectures. To adapt the S − CNN and S − DNN models, which were originally trained on SN GM 3/5 and SEQM 3/5 , to the new data, we performed fine-tuning. The new datasets will be denoted as SN GN 3/5 and SEQN 3/5 . For the fine-tuning process, we partitioned the new datasets into training and testing subsets, allocating 70% of the data for training and the remaining 30% for testing. Overall, the proposed network achieves strong generalization on new problems while still maintaining a low computational cost (i.e., 696K parameters).
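A 70/30 partition that preserves the time correlation inside each sequence can be sketched as follows (a NumPy toy example; function names, shapes, and the seed are illustrative assumptions, not the paper's actual pipeline):

```python
import numpy as np

def train_test_split_sequences(maps, labels, train_frac=0.7, seed=0):
    """Split RD-map sequences into train/test subsets (70/30 here).

    `maps` has shape (num_sequences, T, H, W): splitting at the sequence
    level keeps the T consecutive RD maps of each target together, so the
    time correlation inside every sequence is preserved.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(maps))
    n_train = int(train_frac * len(maps))
    tr, te = idx[:n_train], idx[n_train:]
    return maps[tr], labels[tr], maps[te], labels[te]

# Toy data: 10 sequences of 5 RD maps of size 32x32, 3 classes.
X = np.zeros((10, 5, 32, 32))
y = np.arange(10) % 3
Xtr, ytr, Xte, yte = train_test_split_sequences(X, y)
print(len(Xtr), len(Xte))  # 7 3
```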

VII. EDGE DEPLOYMENT RESULTS
For a real-time application, the classifier deployed on the edge must not only achieve the highest possible accuracy but rather present a trade-off between classification accuracy and computational cost. In this paper, the computational cost is measured as inference time and energy consumption. The models were deployed on a Raspberry Pi 4 through the TFLite toolbox provided by TensorFlow. The energy consumption was estimated using a USB multimeter plugged into the power supply of the edge device while running the inference, and it was averaged over the number of tested data. According to the generalization accuracy results presented in the previous section, only the networks designed from scratch and trained with the merged datasets were taken into consideration, i.e., S − CNN and S − CNN Vote trained with SN GM 3/5 , and S − DNN trained with SEQM 3/5 . Table XVI shows the results. The first column represents the classifiers, the second whether the voting mechanism is applied, the third the names of the tested datasets, the fourth the accuracies achieved by the classifiers on the tested datasets, the fifth the number of parameters of the networks, the sixth the inference time, and the last the energy consumption.
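The two cost metrics can be sketched as follows (a stdlib-only illustration: `dummy_predict` stands in for the TFLite interpreter invocation, and the power/time figures in the example are only assumed values chosen to be consistent with the S − DNN numbers reported later in the text):

```python
import time

def average_inference_time(predict, samples):
    """Average per-sample wall-clock inference time over a test set."""
    t0 = time.perf_counter()
    for x in samples:
        predict(x)
    return (time.perf_counter() - t0) / len(samples)

def energy_per_inference(mean_power_w, mean_time_s):
    """E = P * t: average energy drawn per inference, in joules."""
    return mean_power_w * mean_time_s

# Stand-in for invoking the deployed TFLite model on one RD map.
dummy_predict = lambda x: x * 2
t_avg = average_inference_time(dummy_predict, list(range(100)))

# An assumed ~5.38 W draw over 0.678 s per inference gives ~3.65 J.
print(round(energy_per_inference(5.38, 0.678), 2))  # 3.65
```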
As a result, if the computational cost is a hard constraint, the best choice relies on the S − CNN model trained with the SN GM 3 dataset, which achieves higher accuracy than the S − CNN trained with SN GM 5 . On the contrary, if the accuracy is more relevant for the application, the S − DNN network trained with the SEQM 5 dataset is the best option. Two solutions present a valuable trade-off between accuracy and computational cost: S − CNN Vote trained with the SN GM 3 dataset, and S−DNN trained with the SEQM 3 dataset. As a last consideration, S − CNN Vote trained with SN GM 5 presents the highest computational cost both in terms of inference time and energy consumption, and only the second-best accuracy, thus not representing a suitable solution for the classification of moving targets on the edge. The results are also visualized in Fig. 5, considering only the test split of the merged datasets. The figure shows the accuracy vs. energy consumption of the models deployed on the edge. All the considerations made for the table hold also for the figure.
Finally, Fig. 6 presents the confusion matrices of the four models that achieve, respectively: the lowest computational cost (i.e., S − CNN trained with the SN GM 3 dataset, represented by a red circle in Fig. 5), the highest accuracy (i.e., S − DNN trained with SEQM 5 , represented by a blue square in Fig. 5), and the best trade-off between accuracy and computational cost (i.e., S − CNN Vote trained with the SN GM 3 dataset, represented by a red '+' in Fig. 5, and S − DNN trained with the SEQM 3 dataset, represented by a red square in Fig. 5).
Based on the results of Fig. 6, it is worth highlighting that the car and motorcycle classes have, in general, a higher misclassification rate compared to pedestrians. This is reasonable, as both classes present a similar motion model with comparable speeds (at least in this study) and a larger metallic reflective cross-sectional area. On the contrary, pedestrians' gait is different [68], because the movement of the legs and arms produces different Doppler frequencies in the RD maps [35]. This phenomenon strongly affects the temporal signature resulting from the movement of pedestrians and consequently makes it much different from that of rigid bodies.
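As a rough order-of-magnitude check (assuming the 24 GHz carrier of the radar used in this work), the two-way Doppler shift produced by a limb moving at walking speed is an order of magnitude below that of a car, which is why the pedestrian micro-motions spread over a distinctly different Doppler region of the RD map:

```python
def doppler_shift_hz(v_mps, fc_hz=24e9, c=3e8):
    """Two-way Doppler shift f_d = 2 * v * f_c / c for a monostatic radar."""
    return 2 * v_mps * fc_hz / c

print(doppler_shift_hz(1.5))   # 240.0 Hz  (limb at ~1.5 m/s)
print(doppler_shift_hz(14.0))  # 2240.0 Hz (car at ~50 km/h)
```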

VIII. CONCLUSION
In this paper, a range-Doppler (RD) maps sequence-based DNN architecture for the classification of radar ground-moving targets on the edge has been proposed. The classifier, designed from scratch, combines a convolutional neural network (CNN) with a recurrent neural network (RNN) to classify moving targets depending on their time-varying signatures. Two datasets were used to train the DNN module, representing two real and diverse cluttered environments. In particular, the radar range-Doppler maps collected on three kinds of moving targets, i.e., pedestrians, motorcycles, and cars, were used as inputs to the DNN. The data have been collected by a low-cost FMCW radar plugged into a Raspberry Pi powered by an external battery. The device performed the data preprocessing to transform the radar raw data into RD maps. The generalization accuracy of the proposed DNN was computed and compared with three different DNN models: an RD maps sequence-based classifier enclosing a pre-trained CNN, and two single-RD map classifiers with the same structure as the CNN architectures used in the RD maps sequence-based classifiers. A voting mechanism was also proposed to enhance the performance of the single-map classifiers. The single-map (with and without voting) and maps sequence classifiers were tested on 3 and 5 RD maps collected from the same target. Finally, the architectures were deployed on the Raspberry Pi to find the model that achieves the best trade-off between accuracy and computational cost, measured as inference time and energy consumption. The results showed that the DNN designed from scratch and trained on the sequences of 5 RD maps per target achieved the best accuracy (96.6%) but demanded 678 ms of inference time and 3.65 J of energy consumption. On the other hand, the CNN designed from scratch and trained on single maps (3 maps per target) presented the lowest computational cost (175 ms of inference and 0.94 J of energy). A good trade-off between accuracy and computational cost has been achieved by the CNN designed from scratch and trained on single maps (3 maps per target) applying the voting mechanism (523 ms and 2.86 J), and by the DNN designed from scratch and trained on the sequences of 3 RD maps per target (415 ms and 2.25 J). The developed approach based on DNN classification is able to provide higher accuracies than other techniques based on classic ML algorithms, at least when dealing with single-target identification. The next step will be implementing the proposed methodology to solve the multi-target recognition problem, enhancing the classification results achieved in [27], [28].

Fig. 2 .
Fig. 2. Example of time-varying radar range-Doppler maps of a moving car, motorcycle, and pedestrian.
2D-Convolutional layers with 3 × 3 kernels are applied to a 224 × 224 × 8 tensor that is the output of the first 2D-Convolutional layer. The input dimension of each row represents the output of the previous one. All the convolutional layers use ReLU as the activation function. The dimension of the CNNs' output, not reported in the table, is equal to the number of predicted classes (i.e., three in this paper). The last row of the tables reports the total number of parameters of the networks. The S − CNN Vote and MN − CNN Vote networks, which exploit the voting mechanism, present the same architectures as S − CNN and MN − CNN, respectively. With the voting mechanism, the label of the moving target is estimated on a sequence of RD maps, elaborated by the classifiers as single inputs (i.e., as 3D tensors): the moving target label is then assigned by a vote over the T single-map predictions.
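Such a voting mechanism can be sketched as follows (a NumPy illustration assuming, as in Procedure 1, the T per-map predicted labels and their softmax probabilities; the tie-breaking rule shown here, on the mean softmax score, is an assumption for the sketch):

```python
import numpy as np

def vote(probs):
    """Majority vote over T single-map softmax outputs.

    probs: array of shape (T, num_classes). A possible tie between
    classes is broken by the highest mean predicted probability.
    """
    labels = probs.argmax(axis=1)                       # per-map labels
    counts = np.bincount(labels, minlength=probs.shape[1])
    winners = np.flatnonzero(counts == counts.max())    # most-voted classes
    if len(winners) == 1:
        return int(winners[0])
    mean_p = probs.mean(axis=0)                         # tie-break score
    return int(winners[np.argmax(mean_p[winners])])

# T = 3 maps, 3 classes: two maps favor class 1, one favors class 0.
p = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.8, 0.1]])
print(vote(p))  # 1
```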

Fig. 3 .
Fig. 3. Example of CNN wrapped by the TDL and applied to the first (a) and second (b) frames of the input sequence. Each output feeds the RNN.
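The idea behind the time-distributed layer can be illustrated as follows (a NumPy toy, not the paper's actual network: a single shared feature extractor stands in for the CNN, it is applied to every frame of the sequence, and the per-frame feature vectors are consumed step by step by a minimal recurrent unit; all shapes and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def features(frame, W):
    """Stand-in for the shared CNN: a linear projection plus ReLU."""
    return np.maximum(W @ frame.ravel(), 0.0)

def rnn(feature_seq, Wx, Wh):
    """Minimal Elman-style recurrence over the per-frame features."""
    h = np.zeros(Wh.shape[0])
    for f in feature_seq:
        h = np.tanh(Wx @ f + Wh @ h)   # hidden state carries time information
    return h

T, H, W_ = 5, 8, 8                          # 5 frames of 8x8 "RD maps"
seq = rng.normal(size=(T, H, W_))
Wf = rng.normal(size=(16, H * W_)) * 0.1    # shared across frames (the TDL idea)
Wx = rng.normal(size=(4, 16)) * 0.1
Wh = rng.normal(size=(4, 4)) * 0.1

feats = [features(frame, Wf) for frame in seq]   # "time-distributed" extractor
h_final = rnn(feats, Wx, Wh)                     # would feed a classifier head
print(h_final.shape)  # (4,)
```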

Figure 4
Figure 4 displays the training and validation accuracies and losses for the S − CNN and S − DNN networks, which were trained on SN GM T and SEQM T (with T = {3, 5}), respectively, due to their highest performance in the previous analysis.

Fig. 4 .
Fig. 4. Train/Validation losses and accuracy of S−CNN and S−DNN trained on SN GM T and SEQM T , respectively.

Fig. 5 .
Fig. 5. Accuracy vs. Energy Consumption of the models trained on the merged dataset.

Fig. 6 .
Fig. 6. Confusion matrices of a) S − CNN architecture trained on SN GM 3 , b) S − CNN Vote architecture trained on SN GM 3 , c) S − DNN architecture trained on SEQM 3 , and d) S − DNN architecture trained on SEQM 5 .

TABLE I. RADAR SENSOR PARAMETERS

TABLE IV. STATISTICAL ANALYSIS OF A RANDOM SAMPLE COMING FROM THE TWO SCENARIOS

TABLE V. S − CNN ARCHITECTURE
TABLE VI. MN − CNN ARCHITECTURE
Procedure 1: Procedure of the Voting Mechanism. Input: ŷ ← T predicted labels of an RD maps sequence, P ← predicted probabilities (Softmax layer output) of the T

TABLE VII. S − DNN ARCHITECTURE

TABLE IX. HYPER-PARAMETERS OF S − CNN, MN − CNN, S − DNN AND MN − DNN WITH RESPECT TO THE EMPLOYED TRAINING DATASET
Table IX reports the hyper-parameters adopted during the training of the two models for the training datasets. In particular, the first column represents the hyper-parameters (i.e., batch size Bs, number of epochs Ep, and learning rate Lr), and from the second to the last, the table lists their values. For the sake of good visualization, the datasets are grouped as follows: the SN G1 3/5 column contains the hyper-parameter values of SN G1 3 and SN G1 5 , and SN GM 3/5

TABLE X. NUMBER OF SAMPLES IN EACH CLASS FOR THE TRAINING, VALIDATION, AND TEST SPLITS
The merged datasets contain the same number of data for the training/validation splits.
Table XI shows the results. The table is divided into four parts based on the training sets and whether the voting mechanism has been applied. The first part reports the scores of the two networks trained with SN G1 T and tested on the test split of

TABLE XI. GENERALIZATION PERFORMANCES OF S − CNN AND MN − CNN

TABLE XII. GENERALIZATION PERFORMANCE OF S − DNN AND MN − DNN
Table XIV provides a breakdown of the number of data points in each split within the new datasets.

TABLE XIV. NUMBER OF SAMPLES IN EACH CLASS FOR THE TRAINING, VALIDATION, AND TEST SPLITS FOR THE NEW DATASETS
TABLE XV. GENERALIZATION PERFORMANCE OF S − CNN, S − CNN Vote , AND S − DNN ON THE NEW DATASET
Table XV presents the results, where the highest performance is attained by the S − DNN model in conjunction with SEQN 3 , showing an accuracy and F1-score of 96.9%. It is worth noting that, in [67], the authors achieved a remarkable 99.5% accuracy and F1-score by employing a network with over 3.8 million parameters.