Multi-View CNN-LSTM Architecture for Radar-Based Human Activity Recognition

In this paper, we propose a Multi-View Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) network which fuses multiple “views” of the time-range-Doppler radar data-cube for human activity recognition. It adopts the structure of convolutional neural networks to extract optimal frame-based features from the time-range, time-Doppler and range-Doppler projections of the radar data-cube. The CNN models are trained using an unsupervised Convolutional Auto-Encoder (CAE) topology. Afterwards, the pre-trained parameters of the encoder are fine-tuned to extract intermediate frame-based representations, which are subsequently aggregated via LSTM networks for sequence classification. The temporal correlation among the views is explicitly learned by sharing the LSTM network weights across the different views. Moreover, we propose range and Doppler energy dispersion and temporal difference based features as inputs to the CNN-LSTM models. Furthermore, we investigate the use of target tracking features as auxiliary side information. The proposed model is trained on datasets collected in both cluttered and uncluttered environments. For validation, an independent test dataset, with unseen participants in a cluttered environment, was collected. Fusion with the auxiliary features improves the generalization by 5%, yielding an overall Macro F1-score of 74.7%.


I. INTRODUCTION
Human Activity Recognition (HAR) has been an active area of research for decades and a key enabler of various emerging technologies, such as smart-homes, smart-health, smart-security and smart-offices. Activity recognition is essential to these application domains as it allows computer systems to monitor and analyze human behavior and assist us in our daily lives. A reliable HAR is still a challenging problem and is faced with many technical issues. On the one hand, the privacy issue can be solved by employing methods that involve ambient-sensors instead of camera based methods, but on the other, a reliable and robust feature extraction in the presence of heterogeneous sensory data is still quite challenging and requires significant research efforts.
Most of the work in radar-HAR assumes human-centric uncluttered background scenes, where participants are free to perform actions. For an automated indoor HAR system to work reliably and be able to classify actions with a low error-rate, it is vital for the system to take into account the heterogeneity in the sensor data, which may arise due to various external factors, such as the users, the sensors and the environment. Firstly, each user or participant is different, hence their actions may also differ depending on their habits and morphology. Secondly, these habits may change over time, making this phenomenon time-variant. Moreover, variations in the sensor placement may result in variations in the aspect angle, relative to the radar sensor, which can greatly affect the feature extraction. Finally, the environment layout might vary from room to room, hence resulting in a different background clutter, and partial occlusion of the human body by nearby objects, thus giving rise to a shadowing effect leading to a wide range of intra-class variations in the observed features. A reliable radar-HAR system should work regardless of these external factors. For example, when we move from an ideal uncluttered scenario to a more realistic cluttered scenario, a robust radar-HAR system should be able to generalize over the heterogeneous sensory data without much loss of performance. A complex human activity can be decomposed into motion sequences of walking, stand-to-sit or sit-to-stand actions [1], [6]. Hence, it is vital for radar-HAR systems to be able to map an observed human motion to one of these actions.
Therefore, in this work, we focus on classifying the following four action classes: 1) Walking-Towards (WT) and 2) Walking-Away (WA) from the radar-sensor, 3) Stand-to-Sit or Sitting-Down (SD) and 4) Sit-to-Stand or Standing-Up (SU). We pose the radar-HAR problem as a 4-class sequence classification problem.
In this paper we investigate a radar-sensor based approach to HAR. Our main goal is to investigate the generalization capability of the proposed learning methods and models in the presence of different persons, different aspect angles, a cluttered environment and multi-path. The major contributions of this paper include:
• A Multi-View Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) network which fuses multiple ''views'' of the time-range-Doppler radar data-cube for human activity recognition. The views are obtained by projecting the 3D data-cube in the range, Doppler and slow-time dimensions;
• A kind of soft-attention in the radar data-cube via range and Doppler energy dispersion based pre-processing;
• The use of target tracking features as auxiliary side information for the CNN-LSTM architecture;
• A multi-view fusion approach, achieved by sharing the weights across the views in each LSTM layer to learn the correlation between the views and the auxiliary tracking features;
• A model training procedure combining an unsupervised feature extraction step, followed by a fine-tuning step, to make the feature extraction class agnostic and more robust to initializations.
The rest of the paper is organized as follows. In Section II we give a brief overview of the current state of the art. Section III briefly describes the sensor setup and the input from the radar-sensor. In Sections IV and V we describe the proposed methods and the models used in this research, followed by a discussion of the evaluation setup and the results of the proposed methods.
Finally, Section VI concludes the paper with the future work.

II. RELATED WORK
A large and growing body of literature has investigated radar-based HAR classification. An exhaustive review of the previous methods lies beyond the scope of this paper; we therefore briefly give an overview of the related works. For an in-depth review of the research area, the interested reader is referred to [11], [16].
Different representations of the radar signal have been considered, including time-frequency domain based time-Doppler images and integrated slow-time based range-Doppler images. The time-frequency based features contain information about the rate of change of motion of human body parts over time, while range-Doppler images provide both velocity and range information.
Over the years, several studies have adopted hand-crafted feature based representations extracted from time-range, range-Doppler and cadence velocity maps [3], [25], along with machine learning models such as support vector machines and random forests. In [2], the authors estimated the phase, velocity, rate of change, mean, standard deviation and range of the I and Q signals, and applied a random forest classifier for HAR classification. The authors of [7] proposed a dynamic range-Doppler trajectory (DRDT) method to recognize various human motions. First, range-Doppler frames consisting of a series of range-Doppler images are obtained from the backscattered signals. Next, the DRDT is extracted from these frames to monitor human motion in the time, range, and Doppler domains in real time. Then, a peak search method is applied to locate and separate each human motion from the DRDT map. Finally, range, Doppler, radar cross section, and dispersion features are extracted and combined in a multi-domain fusion approach as inputs to a machine learning classifier. In [10], the authors trained a random forest classifier with time-Doppler and range-Doppler based region-of-interest features, such as velocity centroid, dispersion, and instantaneous energy based features, as proposed in [17], together with tracking based features (such as target location, velocity, acceleration, range and azimuth) and point cloud features.
Recent radar-HAR efforts have applied deep neural networks. In [27], a CNN with three convolution layers has been applied on range-Doppler images for multi-class HAR classification. To take into consideration the temporal characteristics of radar signals, the authors of [32] applied a 1D-CNN to extract spatial features from the spectrograms, followed by an LSTM network to learn time-dependent information. The authors of [20] applied a similar CNN architecture, containing three convolutional layers, two max pooling layers and two fully connected layers, for classifying kitchen activities. As an input, the CNN model takes an image with two spectrograms from two radar sensors. A multi-scale residual attention network for joint activity recognition and person identification has been proposed in [12]. The architecture consists of a CNN with a residual attention mechanism, which extracts features from the time-Doppler images. The embeddings are fed to a fully connected layer performing the classification task. In contrast with the above works using time-Doppler or range-Doppler images as an input, the authors of [19] applied a LeNet-5 based CNN on features extracted from an autocorrelation function. The authors of [31] proposed an end-to-end deep learning based framework called the Fourier Convolutional Neural Network (F-ConvNet). The input of an F-ConvNet consists of raw frames of radar data. Next, multi-scale features are extracted using three convolutional layers. The results are sent to a so-called Fourier layer, learning the real and imaginary parts separately. Compared to the above CNN based approaches, the authors of [28] proposed a stacked Bidirectional LSTM (Bi-LSTM) network on spectrograms to perform radar-HAR. Bi-LSTMs can capture both forward and backward correlated temporal information within the radar data-cube.
Considering that deep learning based models require a large amount of training data, some researchers propose transfer learning based methods. In [8], a ResNet-based CNN, pre-trained on the ImageNet database, is fine-tuned on time-Doppler spectrograms. The authors of [29] proposed a generative adversarial based image-to-image translation approach to transform time-Doppler signatures into a pseudo-audio representation, and fine-tuned a pre-trained VGGish CNN to classify the obtained representations.
Unsupervised feature learning based methods were also investigated. The authors of [26] used a three-layer CAE with unsupervised pre-training to alleviate the demand for training data, followed by a supervised fine-tuning of the CNN to extract spatially localized features for classification. The authors of [4] extended the CAE based unsupervised feature extraction, and proposed an attention-augmented CAE, wherein the convolutional maps are concatenated with a multi-headed attention output [30]. The CAE model is first pre-trained in an unsupervised fashion, and then both the convolution and attention parts of the encoder are fine-tuned separately through supervised training. Next, the convolution and attention parts are trained jointly to learn the final configuration for classification. Different from the above image-based models, the authors of [15] employed a generalized point cloud model to simultaneously represent the time-range-Doppler signature. First, the radar echoes are transformed via range-Doppler processing along the time axis. Then, the target information is gathered by a constant false alarm rate detection algorithm. The point cloud features are aggregated by motion sculpture construction and iterative farthest point sampling. Finally, the resulting point cloud is fed to a hierarchical Point-Net module [5] to recognize human activities.
Recently, researchers have considered information fusion by merging complementary radar information at different abstraction levels: signal, feature and/or decision. In [13], a combination of time-range, spectrogram and integrated range-Doppler information is processed with a sparse autoencoder to extract features that are then classified by a softmax layer for each of the three inputs. A voting principle is then used for decision fusion. In [9], the radar data-cube is pre-processed with an extended CLEAN algorithm to eliminate unwanted noise/distortions. Then, a multi-dimensional Principal Component Analysis (PCA) approach is applied for feature extraction, followed by a linear discriminant analysis, implemented as a shallow neural network, for classification.

III. FMCW RADAR SENSOR SETUP
In this study we use a single 3 × 4 multiple-input multiple-output Frequency Modulated Continuous Wave (FMCW) radar sensor setup. The sensor operates in an indoor environment at a center-frequency (f_0) of 60 GHz with a bandwidth (B) of 2.3 GHz. The radar sensor setup and the waveform parameters are described in more detail in [10].
In the following, we give an overview of the used 3D radar data-cube (x_raw) and describe its related 2D feature ''views'' and the auxiliary features used in this study. The reader is referred to [10] for the details of the detection and tracking steps, together with the radar signal processing and the 3D radar data-cube creation.

A. 3D RADAR DATA-CUBE STRUCTURE AND CHARACTERISTICS
We follow a conventional FMCW signal processing pipeline (as described in [10]). The targets are detected and tracked in the radar's Field of View (FoV), and an estimate of the target's centroid in the radial-range and the azimuth-angle dimensions is used to create a 3D radar data-cube.
We extract 32 range bins and 2 angle bins around the target's centroid. The STFT based micro-Doppler processing yields a 32 × 32 (T × D) time-Doppler or micro-Doppler spectrum, which is observed in 32 × 2 (R × A) range and angle bins, thus resulting in a uniform-sized radar data-cube of cardinality (T × R × D × A), where (R × D × A) are the three spatial dimensions evolving in slow-time (T). Furthermore, we keep 1 sec worth of activity by maintaining a buffer of the N = 12 most recent frames of radar data-cubes. This results in an ordered sequence of 3D radar data-cubes (x_raw^(1...N)).
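The buffering step can be sketched in a few lines (a minimal NumPy illustration; the stream length and the cube contents are placeholders):

```python
import numpy as np
from collections import deque

T, R, D, A = 32, 32, 32, 2   # time, range, Doppler and angle bins
N = 12                       # ~1 s of activity at the assumed frame rate

rng = np.random.default_rng(0)
buffer = deque(maxlen=N)     # keeps only the N most recent data-cubes

# Each incoming frame is one (T, R, D, A) data-cube around the target centroid
for _ in range(20):          # simulate a stream longer than the buffer
    buffer.append(rng.random((T, R, D, A)))

x_raw_seq = np.stack(buffer)  # ordered sequence x_raw^(1...N)
print(x_raw_seq.shape)        # (12, 32, 32, 32, 2)
```

The `deque(maxlen=N)` automatically discards the oldest frame, so the stacked sequence always covers the most recent second of activity.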

B. 3D FEATURE STREAMS AND AUXILIARY FEATURES FOR RADAR HAR
After extracting the 3D radar data-cube (also known as the 3D time-range-Doppler data-cube [16]), the range, Doppler and slow-time features are pre-processed before being used for activity classification. The pre-processing step involves projecting the 3D data-cube in the range, Doppler and slow-time dimensions, scaling the features with the range and Doppler energy dispersion based profiles, and calculating the difference features based on the temporal differences of features. We consider four categories of features: 1) raw features, 2) energy dispersion based features, 3) temporal difference based features, and 4) auxiliary features.
• Raw features: the raw 3D data-cube encodes the reflections from the limbs and torso as they move in the radar's FoV, and is therefore a function of the range (r), Doppler (d), angle (a) and slow-time (t) dimensions. A 2D range-Doppler map (or view) is obtained by projecting the 3D time-range-Doppler data-cube in the range-Doppler space (x_rd,raw^n); in the same way, we obtain the 2D time-range view (x_tr,raw^n) and the 2D time-Doppler view (x_td,raw^n) by projecting the 3D time-range-Doppler data-cube in the time-range and time-Doppler space, respectively, as illustrated in Figure 1.
• Energy Dispersion based features: are derived from the 3D time-range-Doppler data-cube x_raw^n. The idea is inspired by [17]; however, unlike [17], we use the range and Doppler profile energy dispersion to create a kind of soft-attention in the slow-time dimension of the 3D data-cube. We first estimate the instantaneous range (r_cent^n) and Doppler (d_cent^n) profile energies from the 2D time-range and 2D time-Doppler views, respectively (Equations 1 and 2). The energy profiles are then used to create a soft-attention in the slow-time dimension of the 3D time-range-Doppler data-cube (Equation 3).
• Temporal Difference based features: are derived from the raw 3D time-range-Doppler data-cubes, with the idea of putting more emphasis on the most recent events (compared to the previous data-cube) occurring in the raw radar data-cube (Equation 4). These recent events can be further highlighted by estimating a dispersion-difference based data-cube (Equation 5). From the above 3D energy based data-cubes we extract 2D time-Doppler views x_td,l^n, time-range views x_tr,l^n and range-Doppler views x_rd,l^n for l ∈ {disp, diff, disp-diff} (Figure 1).
• Auxiliary features: are estimated using the target tracking results, encoded as the target location (x, y), range (r), azimuth (θ) and Doppler d^n = argmax_d Σ_{t=1}^{T} Σ_{a=1}^{A} x_td^n(t, d, a), as follows:

x_aux^n = [x^n − x^{n−1}, y^n − y^{n−1}, r^n − r^{n−1}, d^n − d^{n−1}, θ^n − θ^{n−1}, ẋ^n, ẏ^n, ẍ^n, ÿ^n, ḋ^n, θ̇^n]^T  (6)
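The view projections and the dispersion based soft-attention can be sketched as follows (a minimal NumPy illustration; since the bodies of Equations 1-3 are not reproduced here, the weighting of each slow-time bin by the Doppler-to-range dispersion ratio is an assumed form consistent with the qualitative description in Section V-B):

```python
import numpy as np

rng = np.random.default_rng(0)
# One (T, R, D, A) = (32, 32, 32, 2) data-cube standing in for x_raw^n
cube = rng.random((32, 32, 32, 2))

# 2D "views": project the cube onto pairs of axes by summing out the others
x_td = cube.sum(axis=(1, 3))   # time-Doppler view   (32, 32)
x_tr = cube.sum(axis=(2, 3))   # time-range view     (32, 32)
x_rd = cube.sum(axis=(0, 3))   # range-Doppler view  (32, 32)

# Soft-attention over slow-time (assumed form): emphasise slow-time bins whose
# Doppler-profile dispersion dominates the range-profile dispersion, i.e.
# frames where genuine limb/torso motion spreads energy across Doppler
r_disp = x_tr.std(axis=1)            # range-profile dispersion per time bin
d_disp = x_td.std(axis=1)            # Doppler-profile dispersion per time bin
w = d_disp / (r_disp + 1e-9)         # per-slow-time attention weight
x_disp = cube * w[:, None, None, None]
```

A first-order temporal difference of consecutive cubes (`cube_n - cube_prev`) would give the temporal-difference stream in the same fashion.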

IV. PROPOSED MODEL
We propose a multi-view fusion approach, where we fuse multiple 2D views and auxiliary features from a single radar sensor. For data-fusion and sequential feature learning, an LSTM layer with shared parameters is proposed. As a result, the proposed model, denoted as Multi-View RAnge Doppler Activity Recognition Network (MV RADAR-Net), is composed of space-time frame based models, E_s(·) with s ∈ {rd, td, tr}, and a sequential model H(·). The space-time frame based models E_s(·) are pre-trained using CAE models, as illustrated in Figure 2a and discussed in Section IV-A. The pre-trained encoders (E_s(·)) are then used with the sequential model (H(·)), illustrated in Figure 2b and discussed in Section IV-B, where we optionally fuse the auxiliary features.

A. SPACE-TIME CAE MODEL
The CAE models for each view, with an encoder and a decoder, can be formally defined as follows:
• For each 2D view s ∈ {rd, td, tr}, we define a space-time feature stream from l ∈ {raw, disp, diff, disp-diff}. For a given view-stream pair, a space-time frame based encoder model E_s(·) and a decoder model D_s(·) are defined (Figure 2a).
• Each space-time encoder (E_s(·)) consists of two CNN layers, where each layer has two convolution operations. Prior to applying the convolutions, the features are symmetrically padded. The activations, after both convolution operations, are normalized and rectified using batch-normalization and ReLU-activation functions, respectively. The first CNN layer implements a 2D convolution operator with a stride length of 2 and a spatial filter of size (3 × 3 × 8). The second CNN layer uses the same configuration (i.e. the same stride length and spatial-filter size), however it extracts 16 features instead of 8. This results in a compressed space-time feature sequence (q_{s,l}^(1...N)) of cardinality (N × 8 × 8 × 16). Similarly, the space-time decoder (D_s(·)) consists of two CNN layers, where in each layer the deconvolution operation is realized by an upsampling and a convolution operation. Prior to applying the convolutions, the features are upsampled and symmetrically padded in both CNN layers. The activations are normalized and rectified, resulting in a reconstructed input sequence of cardinality (N × 32 × 32 × 2).
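A shape-level sketch of the encoder path (one convolution per layer shown for brevity; the weights are random placeholders and batch-normalization is omitted):

```python
import numpy as np

def conv2d(x, w, stride=2):
    """Naive strided 2D convolution with symmetric padding, as in E_s."""
    k, _, cin, cout = w.shape
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), mode="symmetric")
    ho = (xp.shape[0] - k) // stride + 1
    wo = (xp.shape[1] - k) // stride + 1
    out = np.zeros((ho, wo, cout))
    for i in range(ho):
        for j in range(wo):
            patch = xp[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.tensordot(patch, w, axes=3)
    return out

rng = np.random.default_rng(0)
frame = rng.random((32, 32, 2))             # one 32x32 view with 2 channels
w1 = rng.standard_normal((3, 3, 2, 8)) * 0.1
w2 = rng.standard_normal((3, 3, 8, 16)) * 0.1

h1 = np.maximum(0.0, conv2d(frame, w1))     # ReLU; downsampled to (16, 16, 8)
code = np.maximum(0.0, conv2d(h1, w2))      # compressed code of (8, 8, 16)
```

With stride 2 and symmetric padding, each layer halves the spatial resolution (32 → 16 → 8), matching the (8 × 8 × 16) compressed code stated above.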

B. SEQUENTIAL MODEL
After training the CAE for each view, the encoders are used for training the sequential model (H(·)), which is formally defined as follows:
• For each 2D view s ∈ {rd, td, tr}, we define a sequential model (h_s), consisting of a spatial-pyramid pooling layer, a fully connected layer and an LSTM network sharing its parameters with the LSTM networks of the other views. h_s starts by transforming the compressed space-time codes (q_{s,l}^(1...N)) into a sequential feature vector. This is accomplished by a spatial-pyramid pooling operator, where each (8 × 8) 16-dimensional code map in N is max-pooled using three spatial grids of sizes (2 × 2), (4 × 4) and (6 × 6). This yields a sequence of 56-element, 16-dimensional feature vectors. The pooled feature vector sequence is transformed via a fully connected layer into a 64-dimensional vector sequence. The activations are normalized and rectified before being fed to an LSTM network.
• For each view s, we define an LSTM network with 64-dimensional hidden- and cell-states, which is unrolled for N = 12 frames, to learn from 1 sec worth of activity. The LSTM network performs two tasks: 1) vector-sequence to sequential-feature conversion, and 2) space-time feature fusion. The standard LSTM update is used:

i_s^n = σ(W_i x_s^n + U_i z_s^{n−1} + b_i)  (7)
f_s^n = σ(W_f x_s^n + U_f z_s^{n−1} + b_f)  (8)
o_s^n = σ(W_o x_s^n + U_o z_s^{n−1} + b_o)  (9)
c̃_s^n = tanh(W_c x_s^n + U_c z_s^{n−1} + b_c)  (10)
c_s^n = f_s^n ⊙ c_s^{n−1} + i_s^n ⊙ c̃_s^n  (11)
z_s^n = o_s^n ⊙ tanh(c_s^n)  (12)

where z_s^{n−1} is the hidden-state vector, c_s^n is the cell-state, i_s^n is the input control gate, f_s^n is the forget control gate, o_s^n is the output control gate, c̃_s^n is the intermediate cell-state vector and ⊙ represents the element-wise product. Once the N-th cell-state vector (c_s^N) is available, the N-th hidden-state vector (z_s^N, Equation 12) from each view is made available for feature concatenation to the subsequent model.
• For the auxiliary features view, we define a sequential model h_ax (Figure 2b), consisting of a fully connected layer and an LSTM network whose parameters are shared across the views (s). The auxiliary feature sequence (x_ax^(1...N)) is transformed via the fully connected layer into a 64-dimensional vector sequence, followed by normalization and ReLU-activation operations, before being fed to the LSTM network. The hidden-state weight matrix of the auxiliary LSTM network is shared with the LSTM networks of the other views s ∈ {rd, td, tr}, thus facilitating mid-level data-fusion of the auxiliary features with the radar data views. The output of the LSTM is mapped to a 64-dimensional context vector w_ax via a fully connected layer.
• The sequential feature vectors (z_s^N) from each view (s) are concatenated in a concatenation layer, and mapped to a 64-dimensional feature vector (z_cnt) via a fully connected layer. This feature concatenation and transformation is realized in h_cnt of Figure 2b. The concatenated and transformed sequential feature vector z_cnt is further adapted with the auxiliary context vector as follows:

z_w = z_cnt ⊙ w_ax

where ⊙ is an element-wise product [24]. Finally, z_w is passed to a fully connected layer (h_C in Figure 2b) and a softmax activation function for a multi-class classification task.
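The spatial-pyramid pooling step of h_s can be sketched as follows (a minimal NumPy version; the bin-boundary rounding is an assumption, following the usual floor/ceil SPP convention):

```python
import numpy as np

def spp_max(fmap, grids=(2, 4, 6)):
    """Max-pool an (H, W, C) map over pyramid grids; returns (cells, C)."""
    h, w, _ = fmap.shape
    feats = []
    for n in grids:
        for i in range(n):
            for j in range(n):
                r0, r1 = (i * h) // n, -((-(i + 1) * h) // n)   # floor, ceil
                c0, c1 = (j * w) // n, -((-(j + 1) * w) // n)
                feats.append(fmap[r0:r1, c0:c1].max(axis=(0, 1)))
    return np.stack(feats)

rng = np.random.default_rng(0)
code = rng.random((8, 8, 16))   # one compressed (8 x 8) 16-dimensional code map
pooled = spp_max(code)
print(pooled.shape)             # (56, 16): 2*2 + 4*4 + 6*6 = 56 cells
```

The 2 × 2, 4 × 4 and 6 × 6 grids contribute 4 + 16 + 36 = 56 cells, which matches the 56-element, 16-dimensional vectors described above.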
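The shared-weight LSTM fusion and the gated classification head can be sketched as follows (a NumPy sketch; the inputs, parameter scales and the sigmoid form of the auxiliary context vector are illustrative assumptions, and weight sharing is realized simply by reusing one parameter set across the views):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, z_prev, c_prev, p):
    """One standard LSTM update; `p` holds the (shared) parameter matrices."""
    g = lambda n: p["W_" + n] @ x + p["U_" + n] @ z_prev + p["b_" + n]
    i, f, o = sigmoid(g("i")), sigmoid(g("f")), sigmoid(g("o"))
    c = f * c_prev + i * np.tanh(g("c"))     # cell-state update
    z = o * np.tanh(c)                       # hidden-state update
    return z, c

rng = np.random.default_rng(0)
dim, n_frames, views = 64, 12, ("rd", "td", "tr")

# A single parameter set reused for every view realises the weight sharing
p = {}
for name in "ifoc":
    p["W_" + name] = rng.standard_normal((dim, dim)) * 0.1
    p["U_" + name] = rng.standard_normal((dim, dim)) * 0.1
    p["b_" + name] = np.zeros(dim)

z_final = {}
for s in views:                              # same weights for each view
    z = c = np.zeros(dim)
    for n in range(n_frames):                # unrolled for N = 12 frames
        x = rng.random(dim)                  # stand-in for the FC output
        z, c = lstm_step(x, z, c, p)
    z_final[s] = z

# Fusion head: concatenate the view features, map to 64-D, gate with the
# auxiliary context vector w_ax, then classify into the 4 action classes
w_cnt = rng.standard_normal((dim, dim * len(views))) * 0.05
z_cnt = w_cnt @ np.concatenate([z_final[s] for s in views])
w_ax = sigmoid(rng.standard_normal(dim))     # assumed gating form
z_w = z_cnt * w_ax                           # element-wise product
logits = (rng.standard_normal((4, dim)) * 0.05) @ z_w
probs = np.exp(logits - logits.max()); probs /= probs.sum()
```

In practice each gate's input weights would be learned jointly with the encoders; the sketch only demonstrates how one parameter set serves every view while the per-view hidden states stay separate.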

V. EXPERIMENTAL RESULTS
A reliable radar-HAR system requires an efficient use of the range, Doppler and slow-time features in the radar data-cube. The generalizability and robustness of an indoor radar-HAR system for an unseen room and a cluttered environment is largely based on the training methodology, the feature selection and the use of externally available auxiliary information from the tracker. Furthermore, environment clutter can greatly affect the feature extraction step, which directly affects the model's performance. In this study we focus on the generalizability and reliability of the DL based models and methods proposed for an indoor radar-HAR system. As described in detail in [10], the goal is to be able to generalize for an unseen room and participants, with an unseen environment and cluttered background, i.e. the model's ability to perform in the context of layout-generalization and person-generalization.
We collect data in three rooms, in both cluttered and uncluttered environments. Section V-A details the data collection campaign carried out during this study, while Section V-B explains the feature selection approach and Section V-C the model training. Furthermore, we present an ablation study of the auxiliary feature fusion in Section V-D, followed by a detailed discussion of the results in Section V-E.

A. DATASET
This study uses 4 datasets from [10] containing radar data related to three activities of daily living in three different rooms, with two different layouts (as shown in Figure 4).
We briefly describe the characteristics of the four databases; for more details refer to [10]:
• Action Primitive Oriented (DB1): is layout free and focused entirely on the primitive actions. Participants were asked to perform the SD, SU, WA and WT actions with a single chair placed in the center of the room (as shown in Figure 4a). The actions were performed without varying the aspect angle of the participants with respect to the radar sensor. Fifteen subjects participated, which resulted in over 2000 3D radar data-cube sequences (of 1 sec, N = 12) for the SD and SU classes and over 5000 samples for the WA and WT classes (as shown in Table 1).
• Aspect Angle Oriented (DB2): Unlike DB1, here we asked the subjects to perform the actions with four variations in the aspect angle, i.e. the participants were asked to follow paths A-D (as shown in Figure 4a). This resulted in another 5000 samples for WA and WT classes (as shown in Table 1) and approximately 2000 3D radar data-cube sequences for SD and SU classes.
• Multi-path, Aspect-angle, and Shadowing Aspect Oriented (DB3) for validation: This database was recorded in the same room as DB1 and DB2, however it now included background clutter from the furniture (as shown in Figure 4b). We asked five new participants, who were not included in DB1 and DB2, to perform the actions. This resulted in 5000 samples for the WA and WT classes (as shown in Table 1) and approximately 2000 3D radar data-cube sequences for all classes.
• Multi-path, Aspect-angle, and Shadowing Aspect Oriented (DB5): Unlike DB3, this database was recorded in a new room with a cluttered background (as shown in Figure 4c) and with twenty-one new participants (not included in DB1, DB2 or DB3). This resulted in over 9000 samples across all the classes (as shown in Table 1).

B. 2D SPACE-TIME FEATURE STREAM SELECTION
Single view/stream CNN-LSTM models are trained in an End-to-End (E2E) learning fashion, to select the best feature stream from l ∈ {raw, disp, diff, disp-diff}, for each 2D view s ∈ {rd, td, tr}. Figure 3 illustrates the architecture, in which the CNN has the same configuration as the space-time encoder E_s(·) defined in Section IV-A, while the LSTM part is equivalent to the sequential model h_s of Section IV-B, followed by the classification head h_C. DB1, DB2 and DB5 are used for training, while DB3 is used for validation. The results are ranked and the best view-stream pair is chosen for the next steps.
[Figure 4: The room layouts used for collecting data. The room in a) is an uncluttered room where we collected DB1 and DB2; the room in b) is a cluttered room where we collected DB3; DB5 was collected in the cluttered room shown in c), which has a different layout than DB3 (adapted from [10]).]
As can be seen from Table 2, the energy dispersion based stream resulted in the best performance for the 2D range-Doppler (x_rd,disp^(1...N)) and time-Doppler (x_td,disp^(1...N)) views, yielding average Macro-F1 scores of 0.70 and 0.67, respectively. For the 2D time-range view, the original raw feature based stream (x_tr,raw^(1...N)) provided the best Macro-F1 score compared to the other space-time feature based streams ({disp, diff, disp-diff}).
In summary, the energy dispersion based stream was found to be more robust for the 2D range-Doppler and time-Doppler views, while the raw feature stream is preferred for the 2D time-range view. The energy-dispersion data-cubes create focusing in the slow-time dimension when the energy dispersion in the Doppler dimension is higher (Equation 3). This, by definition, creates soft-attention in the radar data-cube when participants are in motion, whereas less attention is created when the energy dispersion in the range dimension is higher than that in the Doppler dimension (Equation 3), thus filtering out events where multipath from the floor or nearby objects is strong, or where the resulting energy in the data-cube is not due to the participant's limbs, thorax or head. Soft-attention in the context of a WA action, in time-range (Figure 1 (c,l)) and time-Doppler (Figure 1 (c,m)), leads to more signal preservation compared to an SD action, in time-range (Figure 1 (h,l)) and time-Doppler (Figure 1 (h,m)). The temporal-difference based data-cube highlights the most recent events compared to the previous data-cube, thus creating a data-cube encoding high-frequency features (Equation 4), increasing the noise level in the transformed data-cube and leading to poorly discriminative features.
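The Macro-F1 score used to rank the streams can be reproduced in a few lines (a minimal sketch; the class encoding and the toy labels are illustrative):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes=4):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for k in range(n_classes):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(scores))

# Toy example with the four classes WT, WA, SD, SU encoded as 0-3
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 1, 2, 3, 3, 2])
print(macro_f1(y_true, y_pred))  # 0.75: SD/SU confusions pull the mean down
```

Because each class contributes equally regardless of support, confusions in the rarer SD/SU classes lower the Macro-F1 more than the weighted F1.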

C. MV RADAR-NET TRAINING
Having selected the space-time feature streams for each view, we proceed with the proposed unsupervised training of the space-time models, and the supervised training of the sequential model. The model training is composed of two phases.
During the first phase the space-time feature encoders (E_s(·)) are trained in an unsupervised manner. For this step we utilize the encoder-decoder architecture (as shown in Figure 2a). The data from both cluttered and uncluttered domains (i.e. DB1, DB2 and DB5) is utilized for training. For validation, the reconstruction on an independent test dataset from a cluttered domain (i.e. DB3) is used. We use a batch size of 16 and a learning rate of 1E-6. The models are trained for 70 epochs using an ADAM optimizer [14]. The following L1-L2 reconstruction loss is minimized as an objective:

L_rec = ||x − x̂||_1 + ||x − x̂||_2^2

In the second phase, the sequential model (H(·)) is trained with the space-time feature extractors (E_s(·)) (as shown in Figure 2b). The parameters of the feature extractors are initialized using the pre-trained encoders from the previous unsupervised training phase and are frozen during the supervised training of the sequential model. Next, the full architecture is fine-tuned with a reduced learning rate of 1E-7, while the batch size, number of epochs and optimizer are kept the same as in phase one. We train the models using DB1, DB2 and DB5, while DB3 is used for validation, and minimize the following cross-entropy based focal loss [18]:

L_FL = − Σ_i (1 − p_i)^γ y_i log(p_i)

where γ is set to 2.0, p_i is the predicted probability and y_i is an element in the class-label vector (Y) representing the i-th class.
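The focal loss with γ = 2.0 can be sketched as follows (a minimal NumPy version for a single sample; `probs` is assumed to be the softmax output of the classification head):

```python
import numpy as np

def focal_loss(probs, y_onehot, gamma=2.0):
    """Cross-entropy based focal loss; down-weights well-classified samples."""
    probs = np.clip(probs, 1e-12, 1.0)   # numerical safety for log(0)
    return float(-np.sum(y_onehot * (1.0 - probs) ** gamma * np.log(probs)))

# A confident correct prediction is down-weighted relative to plain CE,
# so hard, misclassified samples dominate the gradient
y = np.array([0.0, 1.0, 0.0, 0.0])             # true class: WA (illustrative)
p_good = np.array([0.05, 0.85, 0.05, 0.05])    # confident, correct
p_bad = np.array([0.40, 0.10, 0.30, 0.20])     # confused prediction
print(focal_loss(p_good, y), focal_loss(p_bad, y))
```

With γ = 0 the expression reduces to the ordinary cross-entropy; γ = 2.0 shrinks the contribution of easy samples by the factor (1 − p)².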

D. ABLATION STUDY
To evaluate the benefit of the proposed approach, an ablation study was carried out. We compare the proposed two-step training (with fine-tuning) of the MV RADAR-Net, using the view-stream pairs defined by {{rd, disp}, {td, disp}, {tr, raw}} with auxiliary feature fusion, denoted as U2SF-p3/wAux, to the following models:
• The MV RADAR-Net of Figure 2b without auxiliary features trained in an End-to-End fashion (with parameters trained from scratch), denoted as E2E/woAux.
• The MV RADAR-Net of Figure 2b without auxiliary features, with E_s(·) frozen and initialized from the pre-trained CAE models of Figure 2a, denoted as U2SF-p2/woAux. Here E_s(·) is used for feature inference.
• The MV RADAR-Net of Figure 2b without auxiliary features trained using the two-step training with fine-tuning approach of Section V-C, denoted as U2SF-p3/woAux.
• The MV RADAR-Net of Figure 2b with auxiliary features trained in an End-to-end fashion (with parameters trained from scratch), denoted as E2E/wAux.
• The MV RADAR-Net of Figure 2b with auxiliary features, with E_s(·) frozen and initialized using the pre-trained CAE models of Figure 2a, denoted as U2SF-p2/wAux.
The above models were trained on DB1, DB2 and DB5 and evaluated on DB3. Table 3 lists the recognition accuracy on the validation set DB3.

1) WITHOUT AUXILIARY FEATURE FUSION
In the context of E2E/woAux, the precision, recall and F1-scores of the WA and WT classes remain above 0.80, while the model struggles around 0.50 for the SD and SU classes. This results in baseline average accuracy, average macro-F1 and weighted average F1 scores of 0.71, 0.70 and 0.71, respectively. Furthermore, U2SF-p2/woAux and U2SF-p3/woAux did not improve the performance compared to E2E/woAux, except for adding 1% to the overall average accuracy score during the fine-tuning stage. As a result, the average accuracy, average macro-F1 and weighted average F1 scores without auxiliary feature fusion were 0.72, 0.70 and 0.71, respectively (as shown in Table 3).

2) WITH AUXILIARY FEATURE FUSION
Auxiliary feature fusion, in E2E/wAux, significantly improved the performance for the WA and WT classes. With auxiliary feature fusion, the recall, precision and F1-score for the WA and WT classes exceeded 0.85; however, the performance for the SD and SU classes still hovered around 0.50. In contrast, with the U2SF-p2/wAux approach, the scores of the SD and SU classes improved, resulting in an average recall, precision and F1-score above 0.55. Overall, with auxiliary feature fusion in U2SF-p3/wAux, the baseline average accuracy, average macro-F1 and weighted average F1 scores improved by 5%, as shown in Table 3.

E. DISCUSSION
The proposed 3D raw radar data-cube and its derived representations encode the local state of the targets, while the auxiliary tracking features encode their global context. This allows the model to infer, with high accuracy, whether the targets are in motion or static. This is evident from the ablation study conducted in Section V-D. As shown in Table 3, auxiliary feature fusion yields an overall improvement of 3% in the average macro-F1 score for E2E/wAux compared to E2E/woAux. This is further improved by another 2% when the proposed two-step learning approach is used, as in U2SF-p3/wAux. These results are also aligned with the projections of the auxiliary features in the UMAP manifold [23], as depicted in Figure 5.
As shown in Table 3, most of the performance degradation stems from the SD and SU classes. This is because the participants were asked to perform the SD and SU actions on two chairs with different orientations, as shown in Figure 4b. For a chair facing the radar, during an SD action the participant first moves away from the radar while sitting down, and then moves towards the radar upon completing the action. For an SU action on the same chair, the order of these events is reversed. The problem becomes complicated when SD and SU are performed on a second chair with a different orientation, i.e. facing away from the radar, resulting in confusion between the SD and SU classes. This confusion is visible in the UMAP [23] and t-SNE [22] manifolds in Figure 6. Furthermore, apart from the variation in the orientation of the actions, most of the participants' actions are occluded behind the tables, especially the SD and SU actions, and hence suffer from a shadowing effect due to partial occlusion.
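The geometric source of this confusion can be made concrete by reducing each action to the sign sequence of the target's radial motion relative to the radar. This is a deliberate toy simplification of the real micro-Doppler signatures (the function name and encoding are ours, not the paper's), but it captures why the two chair orientations collapse the classes onto each other.

```python
# Toy model of the SD/SU ambiguity: each action is reduced to the sign
# sequence of the target's radial motion (+1 = towards radar, -1 = away).
# The sequences follow the geometric argument in the text and are a
# deliberate simplification of real micro-Doppler signatures.

def radial_signature(action, chair_facing_radar):
    # Sitting down on a chair facing the radar: first away, then towards.
    sd_facing = [-1, +1]
    if action == "SD":
        base = sd_facing
    elif action == "SU":
        base = sd_facing[::-1]  # standing up reverses the event order
    else:
        raise ValueError(action)
    # A chair facing away from the radar flips every radial component.
    return base if chair_facing_radar else [-s for s in base]
```

In this representation, SD on a chair facing away from the radar produces exactly the signature of SU on a chair facing it (and vice versa), so radial motion alone cannot separate the two classes.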
The generalization capability of learning methods with a single radar sensor is still a challenging problem. In this paper we address it by learning from both uncluttered and cluttered domains, aiming to generalize to an unseen cluttered environment with unseen participants. The experimental results show that, on one hand, fusing the auxiliary features as side information improves performance; on the other hand, it is evident that for some actions, such as SD and SU, a single radar sensor is fundamentally limited in terms of discriminative features.
One potential solution to this problem is to merge the SD and SU classes, i.e. solving a multi-class classification problem with Walking Away (WA), Walking Towards (WT), and Sitting Down and Standing Up (SDU) classes. The results of this approach with MV RADAR-Net, with and without auxiliary feature fusion, using an E2E learning approach, are shown in Table 4. It achieves an average accuracy score of 0.84 and 0.92 without and with auxiliary feature fusion, respectively.
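The class merge amounts to a simple relabeling of the dataset before training; a minimal sketch using the abbreviations from the text (the mapping variable and helper are illustrative):

```python
# Remap the original four labels to three by merging SD and SU into SDU.
# Label strings are the class abbreviations used in the text.
MERGE_MAP = {"WA": "WA", "WT": "WT", "SD": "SDU", "SU": "SDU"}

def merge_labels(labels):
    """Apply the SD/SU -> SDU merge to a sequence of class labels."""
    return [MERGE_MAP[label] for label in labels]
```

After this remapping the classifier is trained on three classes, sidestepping the SD/SU ambiguity at the cost of a coarser label set.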

VI. CONCLUSION
We study the generalization capability of learning methods for an unseen cluttered environment and unseen participants. The contributions of this research are as follows:
• We propose energy-dispersion based features for a reliable radar-HAR problem. These features were shown to be more robust than the raw features.
• Apart from the range-Doppler based features, we investigate the utility of the dynamic auxiliary features from the tracker, which allows us to encode the global context of the targets.
• We propose MV RADAR-Net, which utilizes the full radar data-cube, and performs the mid-level multi-view data-fusion using a novel LSTM layer based approach employing a shared hidden-state matrix across multiple views.
• We propose a two-step learning framework, which learns class-label-agnostic features in an unsupervised manner and allows us to reuse pre-trained feature extractors, thus making the overall learning procedure less dependent on the initialization process.

The generalization capability of learning methods with a single radar sensor remains a challenging problem. A potential solution could be to utilize the absolute context of the targets; however, this requires a robust online continual learning framework to generalize to an unseen environment, and will be considered in future work.