Deep Neural Network-Based In-Vehicle People Localization Using Ultra-Wideband Radar

In this paper, we present an in-vehicle people localization technique using a deep neural network (DNN) model that is trained on experimental data. First, an impulse radio ultra-wideband (IR-UWB) radar is installed inside the vehicle, and received signals are acquired while varying the seating arrangement of the occupants. Then, we apply the DNN to the acquired data to train a classifier that predicts whether or not a person is sitting in each seat. To design a network suitable for our system, we evaluate the performance by changing the type of activation function, the number of layers, and the number of nodes in each hidden layer of the DNN. In addition, we compare the performance of the proposed method with conventional machine learning algorithms such as support vector machine (SVM) and decision tree-based methods. On our measured signals, the proposed DNN-based method classifies all possible cases of occupant location and number with an accuracy of 99%. Moreover, an advantage of our proposed method is that there is no need to extract features from a given radar signal.


I. INTRODUCTION
In recent years, there has been a growing interest in enhancing transportation safety and providing convenience to the general public. In particular, the use of sensors inside vehicles has been actively studied to improve the safety of drivers and passengers. For instance, the authors in [1]-[4] proposed methods for detecting driver drowsiness through sensors. The methods included using cameras to detect abnormal behaviors, or using electrodes to detect physiological signals such as eye movement or heart rate. In addition, hand-based gesture recognition techniques inside a vehicle were suggested in [5]-[7] to help the driver perform various tasks while driving. In these works, sensors such as radio-frequency identification (RFID), radar, and data-gloves were used to identify specific hand gestures.
According to EuroNCAP 2025, detecting the position of passengers inside a vehicle is considered essential because accidents occur when a baby or pet is left inside a locked car [8]. To solve this problem, a thermal camera-based method was proposed in [9] to detect seat occupancy inside a vehicle. The thermal image was used as an input to a convolutional neural network, and the number of people was estimated with high accuracy. However, the main drawback of this approach is that the performance of temperature-based sensors can be degraded when the temperature inside the vehicle is too high. In addition, there are privacy issues when vision sensors such as cameras are used.
The associate editor coordinating the review of this manuscript and approving it for publication was Chengpeng Hao.
Unlike camera sensors, radar shows robust performance in a lightless environment and does not invade the privacy of individuals [10]. Also, radars using high frequency bands can be miniaturized and have a high range resolution, making them suitable for indoor detection purposes. In particular, the impulse radio ultra-wideband (IR-UWB) radar has been in the spotlight due to its excellent range resolution and immunity to external noise [11]. Several studies have been conducted to detect people indoors using an IR-UWB radar [12]-[14]. However, few studies have focused on the in-vehicle environment. The authors in [15] estimated the location of people sitting inside the vehicle by using an IR-UWB radar sensor, but the method required extracting features from the received data.
In this paper, we propose a deep learning-based method for estimating the location of people inside a vehicle using a single IR-UWB radar sensor. First, an IR-UWB radar is mounted at the rear-view mirror position inside the vehicle. Then, we conduct radar signal measurements by changing the subject, the number of subjects, the location of subjects, and the type of car. In the case of a five-seater vehicle, there are a total of 32 possible cases depending on the number and location of people. To distinguish these 32 different experimental cases, we use a deep neural network (DNN) structure to design a classifier. The application of deep learning to the classification of radar sensor data has been reported in [16]-[18]. In our study, pre-processed time-sampled radar signals are used as the input to the DNN. That is, the pre-processed signal itself becomes the input to the classifier without extracting any features from it. Then, to find an appropriate network structure for classification, we evaluate the performance by changing the number of hidden layers, the number of nodes in each hidden layer, and the type of activation function. We use 70%, 15%, and 15% of the total data as training, validation, and test sets to verify the performance of the proposed method. Our proposed method recognizes the arrangement of people sitting inside the vehicle with an accuracy of 99%. In addition, we compare the performance of our proposed method with conventional machine learning algorithms such as support vector machine (SVM) and decision tree-based methods. The classification results show that our proposed method outperforms the conventional methods. Unlike the method of [15], our proposed method is not based on feature extraction, so it does not require a deep understanding of radar signals. Also, it has the advantage of being able to monitor people inside the vehicle without compromising their privacy.
The remainder of this paper is organized as follows. In Section II, the basic principles of the IR-UWB radar system are discussed along with some signal processing techniques. Then, in Section III, we introduce our experiment environment for measuring the radar signals. Next, our proposed DNN-based people localization method and its results are presented in Section IV, where we also compare the classification performance with other machine learning algorithms. Finally, we conclude our work in Section V.

II. ANALYSIS OF RECEIVED SIGNALS IN IR-UWB RADAR SYSTEM
A. BASIC PRINCIPLES OF IR-UWB RADAR
In an IR-UWB radar system, the radar transmits a sequence of narrow pulse signals that occupy a wide frequency spectrum. The main advantages of this system are the improved range resolution obtained by using short pulses and the immunity to external narrowband noise obtained by using a wide frequency spectrum [19]. The transmitted signal can be written as

$$s(t) = \mathrm{Re}\left\{ x(t)\, e^{j 2\pi f_c t} \right\} \tag{1}$$

where $x(t)$ is the complex envelope of the pulse signal and $f_c$ is the carrier frequency. Among the various types of pulse waveforms, the Gaussian pulse is most widely used due to its relative ease of implementation and excellent time-frequency product [20]. A modulated Gaussian pulse signal is shown in Fig. 1 in both the time domain and the frequency domain. When the transmitted signal is incident on the target, the signal is backscattered and then received at the receiver. The received signal can be expressed as

$$r(t) = \sum_{m=1}^{M} a_m\, s(t - \tau_m) + n(t) \tag{2}$$

where $a_m$ and $\tau_m$ are the attenuation coefficient and time delay of the $m$-th path, respectively, $M$ is the number of paths, and $n(t)$ is the noise added at the receiving antenna. Then, the received signal is digitized through a sampling process, which can be expressed as

$$r[n] = r(nT_s), \quad n = 1, 2, \ldots, N \tag{3}$$

where $T_s$ is the sampling period and $N$ is the number of samples. Since the duration of a pulse is very short, sampling needs to be performed very quickly in an IR-UWB radar system. For example, if the duration of a signal is 1 ns, sampling needs to be performed every 1 ns / 256 ≈ 3.9 ps to obtain 256 samples per signal, which is extremely difficult to implement in real time.
To solve this issue, parallel sampling is performed, which means that multiple samplers are used to sample the signal in parallel. The block diagram of parallel sampling is shown in Fig. 2. In Fig. 2, each sampler samples the signal every $N_s T_s$ seconds, where $N_s$ is the number of samplers, while the sampling offset between adjacent samplers is $T_s$ seconds. Combining the results from all samplers is equivalent to sampling every $T_s$ seconds, so fast sampling can be achieved with slow samplers.
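The received-signal model and the parallel sampling scheme described above can be illustrated with a short sketch. It is only a toy simulation under stated assumptions: the carrier frequency, pulse width, path gains, path delays, and the number of samplers are illustrative values that do not come from the paper, the noise term n(t) is omitted, and the function names are hypothetical.

```python
import math

F_C = 7.0e9      # carrier frequency in Hz (illustrative assumption)
SIGMA = 0.15e-9  # Gaussian envelope width in seconds (illustrative assumption)
T_S = 26e-12     # equivalent sampling period; 26 ps is quoted in Section III


def tx_pulse(t):
    """Modulated Gaussian pulse s(t), centered at t = 0."""
    return math.exp(-t * t / (2.0 * SIGMA ** 2)) * math.cos(2.0 * math.pi * F_C * t)


def rx_signal(t, paths):
    """Noise-free received signal: sum of a_m * s(t - tau_m) over all paths."""
    return sum(a_m * tx_pulse(t - tau_m) for a_m, tau_m in paths)


def direct_sampling(signal, t_s, n_samples):
    """Reference: one ideal fast sampler taking a sample every t_s seconds."""
    return [signal(n * t_s) for n in range(n_samples)]


def parallel_sampling(signal, t_s, n_samples, n_samplers):
    """Emulate n_samplers slow samplers, each with period n_samplers * t_s.

    Sampler k starts with an offset of k * t_s, so interleaving the outputs
    of all samplers is equivalent to sampling every t_s seconds.
    """
    out = [0.0] * n_samples
    slow_period = n_samplers * t_s
    for k in range(n_samplers):
        offset = k * t_s
        n_slow = len(range(k, n_samples, n_samplers))
        for i in range(n_slow):
            out[k + i * n_samplers] = signal(offset + i * slow_period)
    return out


# Two hypothetical propagation paths: (attenuation a_m, delay tau_m in seconds).
paths = [(1.0, 3.0e-9), (0.4, 6.5e-9)]


def r(t):
    return rx_signal(t, paths)


fast = direct_sampling(r, T_S, 512)
interleaved = parallel_sampling(r, T_S, 512, n_samplers=8)
max_diff = max(abs(a - b) for a, b in zip(fast, interleaved))
```

In this idealized setting the interleaved output matches the single fast sampler up to floating-point rounding; a real front end would additionally have to deal with timing mismatch between samplers.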

B. PRE-PROCESSING OF IR-UWB RADAR SIGNALS
The time-sampled signal r[n] in (3) is the raw signal that contains undesired components such as DC bias, low-frequency noise, and clutter. To suppress these undesired components and to extract the signals from desired targets, signal pre-processing has to be performed. The block diagram of the overall signal processing procedure is shown in Fig. 3. First, the DC bias of the signal is removed by subtracting its mean value, which results in

$$\tilde{r}[n] = r[n] - \frac{1}{N} \sum_{k=1}^{N} r[k] \tag{4}$$

Then, the signal is passed through a bandpass filter to remove the undesired frequency components. The bandpass filter is designed by calculating the half-power bandwidth of the transmitted signal from Fig. 1, and then passing only those frequency components. The bandpass filtering operation can be expressed as

$$\hat{r}[n] = \mathcal{F}^{-1}\left\{ H[z]\, \mathcal{F}\left\{ \tilde{r}[n] \right\} \right\} \tag{5}$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier transform and inverse Fourier transform operations, and $H[z]$ denotes the bandpass filter in the frequency domain. Finally, a matched filter is applied to the resulting signal to extract the desired target signal. This is done by convolving the received signal with a conjugated, time-reversed version of the transmitted signal, which can be expressed as

$$p[n] = \hat{r}[n] * s^{*}[-n] \tag{6}$$

where $*$ denotes the convolution operation, $s[n]$ is the time-sampled transmitted signal, and the superscript $*$ denotes complex conjugation. The effect of applying the matched filter is shown in Fig. 4. As can be seen from the figure, the amplitude of the target signal becomes much higher, making it easily distinguishable from noise. Therefore, we used the processed radar signal p[n] for in-vehicle people localization.
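The three pre-processing steps described above (mean subtraction, frequency-domain bandpass filtering, and matched filtering) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it uses a naive DFT from the standard library so that it stays self-contained, an ideal brick-wall passband given as DFT bin indices, and a synthetic real-valued pulse. The signal length, pulse shape, passband, and delay are all assumed values.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (adequate for a short demo)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def idft(spec):
    """Naive inverse DFT."""
    n_pts = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * n / n_pts) for k in range(n_pts)) / n_pts
            for n in range(n_pts)]


def remove_dc(r):
    """Subtract the mean value of the signal."""
    mean = sum(r) / len(r)
    return [v - mean for v in r]


def bandpass(r, k_lo, k_hi):
    """Zero every DFT bin outside [k_lo, k_hi], then transform back."""
    n_pts = len(r)
    spec = dft(r)
    for k in range(n_pts):
        k_abs = min(k, n_pts - k)  # fold negative-frequency bins onto positive ones
        if not k_lo <= k_abs <= k_hi:
            spec[k] = 0.0
    return [v.real for v in idft(spec)]


def matched_filter(r, template):
    """Convolve r with the conjugated, time-reversed template.

    This equals the cross-correlation sum_k r[n + k] * conj(template[k]);
    the template here is real, so the conjugate is a no-op.
    """
    n_pts = len(r)
    return [sum(r[n + k] * template[k] for k in range(len(template)) if n + k < n_pts)
            for n in range(n_pts)]


# Synthetic demo: a weak echo at sample 40 on top of a DC offset.
N = 128
template = [math.exp(-((k - 8) ** 2) / 18.0) * math.cos(0.5 * math.pi * (k - 8))
            for k in range(16)]
delay = 40
raw = [0.5] * N
for k, v in enumerate(template):
    raw[delay + k] += 0.2 * v

p = matched_filter(bandpass(remove_dc(raw), 16, 48), template)
peak = max(range(N), key=lambda n: abs(p[n]))  # index of the strongest response
```

For signals of realistic length, an FFT routine from a numerical library would replace the naive transform; the structure of the pipeline stays the same.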

III. EXPERIMENT ENVIRONMENT
In our measurement, we use the NVA-R661 radar module manufactured by Novelda, which is shown in Fig. 5. The radar is connected to a laptop via an SPI-to-USB cable (C232H) to store the data. The radar has one transmit antenna and one receive antenna, which are 15 cm apart. In addition, each antenna has a dielectric lens mounted on it to narrow the beam pattern and increase the antenna gain. The specifications of the radar system are summarized in Table 1. The pulse repetition frequency is set to 100 MHz, which means that the radar transmits a pulse every 10 ns. However, the actual data capture rate, denoted as the frame rate, is only 120 Hz, which is much lower than the pulse repetition frequency. This is due to coherent processing, where multiple pulses are combined and averaged to increase the processing gain. In addition, the time it takes to transfer the data over the SPI-to-USB link further limits the frame rate. Moreover, the sampling period is set to 26 ps, so the time difference between adjacent samples is 26 ps. Since the time delay τ can be expressed as 2R/c, where R is the distance and c is the velocity of light, the distance difference between adjacent samples is cτ/2 ≈ 4 mm for τ = 26 ps. Therefore, by using a total of 512 samples, we can observe a distance of 4 mm × 512 ≈ 2 m.
We placed the radar at the rear-view mirror position to simultaneously monitor all seats inside the car, as shown in Fig. 6. The distance between the radar and the leg room of the car is approximately 1 m, and the distance between the radar and the side rear seat is about 1.6 m. In addition, the seats are labeled from 1 to 5, as shown in Fig. 6. Since each seat has two possible outcomes (occupied or not occupied), there are a total of 2^5 = 32 different scenarios. Therefore, we conducted measurements by changing the arrangement of people for 32 different scenarios. These 32 measurement scenarios are labeled from class E1 to E32. For example, class E1 indicates the case when no one is inside the vehicle, and class E32 indicates the case when 5 people are sitting inside the vehicle.
Furthermore, we conducted measurements by switching the experiment participants and the order of sitting because each subject has a different body shape and composition. In addition, we performed an additional experiment with a different type of car to strengthen the generality of our proposed method. Therefore, measurements were conducted on two different vehicles and the total number of measurements was 434. Since one measurement contains more than 500 raw radar signals, we obtained more than 220,000 radar signals.
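The range arithmetic and the count of seating scenarios can be checked with a few lines. The sketch below uses the sampling period and sample count given in the text. The mapping between class labels and seat patterns is an assumption made for illustration: the text only fixes E1 (empty vehicle) and E32 (all five seats occupied), so the binary ordering of the intermediate classes is hypothetical.

```python
from itertools import product

C = 299_792_458.0  # speed of light in m/s
T_S = 26e-12       # sampling period in seconds (from the text)
N_SAMPLES = 512    # number of samples per radar signal (from the text)

# Two-way propagation: adjacent samples are c * T_s / 2 apart in range.
range_bin = C * T_S / 2.0          # about 3.9 mm
max_range = range_bin * N_SAMPLES  # about 2 m

# Every seat is either empty (0) or occupied (1): 2^5 = 32 patterns.
# Each tuple is (seat 1, ..., seat 5).
scenarios = list(product((0, 1), repeat=5))

# Hypothetical labeling: plain binary counting order from E1 to E32.
labels = {f"E{i + 1}": occupancy for i, occupancy in enumerate(scenarios)}
```

With this ordering, `labels["E1"]` is the empty vehicle and `labels["E32"]` is the fully occupied one, which matches the two classes that the text defines explicitly.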

IV. PROPOSED DNN-BASED PEOPLE LOCALIZATION
A. INPUT AND STRUCTURE FOR DATA LEARNING
The multi-layer perceptron (MLP) is one of the simplest classes of DNN, in which each layer is fully connected to its neighboring layers [21]. The general structure of the MLP network is shown in Fig. 7. It consists of an input layer, multiple hidden layers, and an output layer. In addition, each layer comprises multiple nodes, and nodes are connected to each other through edges. The network is trained through a repeated process of forward propagation and backward propagation. In the forward propagation stage, each layer passes its value to the following layer by using weights and an activation function. Let $x^{(k)}$ and $y^{(k)}$ denote the input and output vectors at layer $k$, and let $W^{(k)}$ denote the weight matrix between layers $k$ and $k+1$. Then, the input vector at layer $k+1$ can be expressed as

$$x^{(k+1)} = W^{(k)} y^{(k)} = W^{(k)} f\left(x^{(k)}\right) \tag{7}$$

where $f$ denotes the activation function that gives nonlinearity to the network. In the backward propagation stage, the weight values are updated by computing the gradient of the loss function with respect to each weight. If the weight value before backward propagation is $W_{\text{before}}$, the updated value after backward propagation is

$$W_{\text{after}} = W_{\text{before}} - \alpha \frac{\partial J}{\partial W_{\text{before}}} \tag{8}$$

where $\alpha$ is the learning rate that determines the speed of the learning process and $J$ is the loss function that indicates the error between the estimated and true values. In this paper, we used the cross entropy as the loss function, which is commonly used in classification problems [22]. This process of forward and backward propagation, denoted as an epoch, is repeated multiple times to train the weight parameters properly.
In our system, we used the processed radar signal p[n] in (6) as the input to the network. Thus, each time-sampled point of p[n] becomes an input to our network, and the number of nodes in the input layer is 512. Also, the number of nodes in the output layer was set to 32 to classify the 32 different scenarios. We used the one-hot encoding method, which means that class E1 corresponds to [1000 · · · 0], class E2 corresponds to [0100 · · · 0], and so on. The important parameters that determine the performance of the network are the number of hidden layers, the number of nodes in each hidden layer, and the type of activation function. Therefore, we compared the classification accuracy by changing the above-mentioned parameters to find an appropriate network structure for our system.
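To make the forward and backward propagation steps concrete, the following sketch trains a tiny MLP with one hidden layer using plain gradient descent. It is a toy illustration under several assumptions that the text does not state: a softmax output is paired with the cross-entropy loss, biases are omitted, the weights are updated after every sample, and the layer sizes, learning rate, and three-sample data set are made up (the paper's network has 512 inputs and 32 outputs).

```python
import math
import random


def matvec(w, v):
    """Multiply weight matrix w (list of rows) by vector v."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def forward(w1, w2, x):
    """Forward propagation: hidden output tanh(W1 x), class probabilities."""
    h = [math.tanh(v) for v in matvec(w1, x)]
    return h, softmax(matvec(w2, h))


def train_step(w1, w2, x, target, alpha):
    """One forward/backward pass; updates w1 and w2 in place, returns the loss."""
    h, p = forward(w1, w2, x)
    loss = -math.log(p[target])  # cross entropy for a one-hot target
    # Gradient of the loss with respect to the output pre-activations.
    dz = [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]
    # Back-propagate through w2 and the tanh nonlinearity (before updating w2).
    dh = [sum(dz[i] * w2[i][j] for i in range(len(w2))) * (1.0 - h[j] ** 2)
          for j in range(len(h))]
    # Gradient descent update: W_after = W_before - alpha * dJ/dW.
    for i, row in enumerate(w2):
        for j in range(len(row)):
            row[j] -= alpha * dz[i] * h[j]
    for j, row in enumerate(w1):
        for k in range(len(row)):
            row[k] -= alpha * dh[j] * x[k]
    return loss


def init_weights(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
w1 = init_weights(8, 4, rng)  # 4 inputs -> 8 hidden nodes (toy sizes)
w2 = init_weights(3, 8, rng)  # 8 hidden nodes -> 3 classes (toy sizes)
data = [([1.0, 0.0, 0.0, 0.0], 0), ([0.0, 1.0, 0.0, 0.0], 1), ([0.0, 0.0, 1.0, 1.0], 2)]

losses = []
for epoch in range(200):
    epoch_loss = sum(train_step(w1, w2, x, t, alpha=0.1) for x, t in data)
    losses.append(epoch_loss / len(data))
```

The average loss recorded in `losses` should fall as training proceeds, which is the behavior that the repeated forward and backward passes are meant to produce.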
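The one-hot labeling described above can be written as a small helper. This is a sketch with hypothetical function names; class indices follow the text's 1-based numbering (E1 to E32).

```python
def one_hot(class_index, n_classes=32):
    """Return the one-hot vector for class E<class_index> (1-based)."""
    if not 1 <= class_index <= n_classes:
        raise ValueError("class_index must be between 1 and n_classes")
    vec = [0] * n_classes
    vec[class_index - 1] = 1
    return vec


def decode(output_vector):
    """Map a network output vector back to a 1-based class index (argmax)."""
    return output_vector.index(max(output_vector)) + 1


e1 = one_hot(1)  # [1, 0, 0, ..., 0]
e2 = one_hot(2)  # [0, 1, 0, ..., 0]
```

The `decode` helper takes the position of the largest output value, which is the usual way to turn a 32-element output vector back into a predicted class.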

B. LOCALIZATION RESULTS USING DNN
In this paper, we randomly selected 15% of the total data to avoid the data being biased. This resulted in 35,981 processed signals among a total of more than 220,000 signals. Then, we used 70% of the data as the training set, 15% of the data as the validation set, and 15% of the data as the test set. The input is a 512 × 1 vector and the output is a 32 × 1 vector, as mentioned in the previous section. The number of epochs was set to 1000, and the learning rate was set to 0.01. In addition, we considered two types of activation functions: the sigmoid function and the hyperbolic tangent function. The sigmoid function can be expressed as 1/(1 + exp(−x)) and the hyperbolic tangent function can be expressed as (exp(x) − exp(−x))/(exp(x) + exp(−x)).
Fig. 8 shows the classification accuracy as a function of the number of nodes in a hidden layer. The number of nodes in a hidden layer was increased from 10 to 100 in intervals of 10, while the number of hidden layers was fixed at 1. As can be seen from the figure, the classification accuracy generally increases with an increase in the number of nodes. However, when the number of nodes is higher than 50, there is no prominent increase in classification accuracy regardless of the type of activation function. Therefore, we set the number of nodes in a hidden layer to 50. Moreover, since the classification accuracy is generally higher when the hyperbolic tangent function is used, we used this type of activation function for our network. Next, we examined how the classification accuracy changes according to the number of hidden layers, while keeping the number of nodes in a hidden layer at 50 and using the hyperbolic tangent activation function. As shown in Fig. 9, the classification accuracy was highest when the number of hidden layers was 3. Therefore, we set the number of hidden layers to 3, and this resulted in a high classification accuracy of 99.5%.
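The two activation functions and the 70/15/15 data split mentioned above can be sketched as follows. The activation formulas are those given in the text; the split helper, its name, its seed, and the way fractional counts are rounded are assumptions, since the text does not state the exact size of each set.

```python
import math
import random


def sigmoid(x):
    """Sigmoid activation: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent written out as in the text.

    math.tanh is preferable in practice because this form overflows for large |x|.
    """
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def split_indices(n_items, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle item indices and cut them into training, validation, and test parts."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    n_train = int(fractions[0] * n_items)
    n_val = int(fractions[1] * n_items)
    # Whatever remains after the first two cuts becomes the test set.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(35981)
```

The sigmoid maps its input to the interval (0, 1), whereas the hyperbolic tangent maps it to (−1, 1) and is centered at zero.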
Furthermore, we investigated the performance of the network by changing both parameters, the number of hidden layers and the number of nodes, since fixing one parameter might lead to inaccurate results. The number of hidden layers was changed from 1 to 10 and the number of nodes in each hidden layer was changed from 10 to 100 in intervals of 10, resulting in a 10×10 combination of the network structure. The results are shown in Fig. 10. In deriving the classification accuracy, a Monte Carlo technique was used to average the results for multiple iterations. In other words, a pseudorandom generator was used to extract 35,981 processed signals randomly and this process was repeated 10 times, which results in 10 data sets. Then, each data set was used to train the network in parallel, and the classification accuracy was derived by averaging the results. From the figure, we confirmed that the classification accuracy showed a similar trend regardless of the number of hidden layers. Since the computational complexity increases when more nodes and layers are used, we concluded that setting the number of hidden layers as 3 and the number of nodes as 50 is appropriate for our network.
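The Monte Carlo averaging described above amounts to repeating a random draw of the data, scoring each draw, and averaging the scores. The sketch below shows only that outer loop. The function names are hypothetical, `score_fn` stands in for training and evaluating a network on the subset, and the runs are executed sequentially here rather than in parallel as in the text.

```python
import random


def monte_carlo_average(score_fn, data, subset_size, n_runs=10, seed=0):
    """Average score_fn over n_runs random subsets drawn without replacement."""
    rng = random.Random(seed)  # pseudorandom generator, seeded for repeatability
    scores = []
    for _ in range(n_runs):
        subset = rng.sample(data, subset_size)
        scores.append(score_fn(subset))
    return sum(scores) / n_runs


def stand_in_score(subset):
    """Placeholder metric: fraction of even items. It is not a classifier accuracy."""
    return sum(1 for item in subset if item % 2 == 0) / len(subset)


toy_data = list(range(1000))
average_score = monte_carlo_average(stand_in_score, toy_data, subset_size=150)
```

In the setting of the text, this loop would run once for each of the 10 × 10 network structures, with the subset size set to 35,981 and the score taken as the test accuracy of the trained network.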
Moreover, we compared our proposed method with other machine learning algorithms such as decision tree, support vector machine (SVM), and bagging with decision tree. The decision tree is a simple classification algorithm that uses a set of hierarchical rules [23]. The model is trained by dividing the data into smaller subsets based on the features of the data. The SVM is a method that divides the input data by finding the maximum margin hyperplane [24]. It can be applied to nonlinear classification problems by using the kernel trick, in which the input data is mapped into a high-dimensional feature space. Bagging with decision tree combines multiple decision trees and makes a decision by aggregating the prediction results from each decision tree [25]. The bagging algorithm reduces the variance of the model, making it robust to over-fitting. These algorithms have a similar property in that they all require extracting meaningful features from the data.
The comparison results are summarized in Table 2. A single decision tree was not effective in classifying the data due to its simple structure. The SVM with a Gaussian kernel function also showed poor performance because the dimension of the data is high. Bagging with decision tree resulted in a relatively high classification accuracy of 91.3%, but our proposed DNN method showed the best performance of all algorithms. Our proposed DNN method also has an advantage over the other machine learning algorithms because it does not require extracting features from the data.
Furthermore, we investigated how the classification accuracy changes when the size of the input is reduced. To reduce the input size by half, we collected only the odd-indexed data from the processed radar signal p[n] in (6). In other words, we used the signal [p[1], p[3], · · · , p[511]] as the input to our network, which has the same effect as doubling the sampling period of the radar system. The other parameters and settings were the same as before; only the input size was changed from 512 × 1 to 256 × 1. As a consequence of reducing the input size, the classification accuracy decreased from 99.5% to 99.2%. Since sampling is performed very quickly in a UWB radar system (26 ps), doubling the sampling period did not seriously affect the performance of the proposed network model.
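The input reduction described above is a decimation by two. A minimal sketch, assuming the processed signal is stored in a Python list whose element 0 holds p[1] in the text's 1-based indexing; the function name and the stand-in signal are hypothetical:

```python
def keep_odd_indexed(p):
    """Keep p[1], p[3], ..., in 1-based indexing (Python indices 0, 2, ...)."""
    return p[0::2]


# Stand-in signal in which the value of each sample equals its 1-based index.
processed = [float(n) for n in range(1, 513)]
halved = keep_odd_indexed(processed)  # 256 samples: 1.0, 3.0, ..., 511.0
```

Keeping every second sample leaves the observation window unchanged while doubling the spacing between samples, which is why it mimics a radar with twice the sampling period.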

V. CONCLUSION
In this paper, we proposed a deep learning-based method to estimate the location and number of people inside a vehicle. First, we accumulated the received IR-UWB radar signals in 32 different measurement scenarios by changing the subjects, the number of subjects, the location of subjects, and the type of vehicle. Then, we trained a DNN classifier with the received radar signals. Here, pre-processed radar signals were used as the input to the classifier. Unlike feature-based machine learning techniques, our proposed method does not require a feature extraction stage. To design a suitable DNN structure for classification, we evaluated the performance of the network by changing the number of hidden layers, the number of nodes in each layer, and the activation function. The classification results showed that our proposed method can be used as an effective in-vehicle localization technique. In addition, the classification performance was improved compared to the conventional machine learning techniques.