Few-Shot User-definable Radar-based Hand Gesture Recognition at the Edge

Technological advances and scalability are leading Human-Computer Interaction (HCI) to evolve towards intuitive forms, such as gesture recognition. Among the various interaction strategies, radar-based recognition is emerging as a touchless, privacy-secure, and versatile solution in different environmental conditions. Classical radar-based gesture HCI solutions involve deep learning but require training on large and varied datasets to achieve robust prediction. Innovative self-learning algorithms can help tackle this problem by recognizing patterns and adapting from similar contexts. Yet, such approaches are often computationally expensive and hard to integrate into hardware-constrained solutions. In this paper, we present a gesture recognition algorithm that is easily adaptable to new users and contexts. We exploit an optimization-based meta-learning approach to enable gesture recognition in learning sequences. This method aims to learn the best possible initialization of the model parameters, simplifying training on new contexts when only small amounts of data are available. The reduction in computational cost is achieved by processing the radar-sensed gesture data in the form of time maps, to minimize the input data size. This approach enables the adaptation of a simple convolutional neural network (CNN) to new hand poses, thus easing the integration of the model into a hardware-constrained platform. Moreover, the use of a Variational Autoencoder (VAE) to reduce the gestures' dimensionality decreases the model size by an order of magnitude and halves the required adaptation time. The proposed framework, deployed on the Intel® Neural Compute Stick 2 (NCS 2), leads to an average accuracy of around 84% for unseen gestures when only one example per class is utilized at training time. The accuracy increases up to 92.6% and 94.2% when three and five samples per class are used.


I. INTRODUCTION
HCI represents a primary field of study to enable the communication between humans and systems [1]. A classic and widely used HCI method exploits the conductivity of a user's finger or skin touch with a capacitive surface [2], [3]. Although precise, this approach requires direct contact with the user and may not be versatile in specific contexts [4]. In recent years, the development of technologies such as optics or radio frequency has radically increased the interfacing capability in all application areas [5]. Many advances in the field focus on vision-based interfacing, i.e. the use of camera sensors such as Red Green Blue (RGB) and Time of Flight (ToF) [6]-[9]. In fact, camera sensors bring the advantage of touchless communication. Nevertheless, camera-based solutions lead to potential privacy issues and to failures in poor light conditions. In comparison, radio-based methods are not directly affected by light and can also be used to estimate user actions through walls or barriers [10]. Wi-Fi-based systems can be robustly deployed in the HCI context even when the usage environment or the user orientation changes considerably [11]-[13]. Yet, Wi-Fi technology often requires the generation of high output power and a continuously running module to ensure operation. In contrast, radar technology, thanks to a more adaptable system power mode management, is finding increasing interest in the field of HCI applications [14]. Among the various radar modulation techniques, Frequency Modulated Continuous Wave (FMCW) is particularly suitable in the context of action recognition, because it simultaneously provides accurate information on the range and the velocity of targets [15], [16].
Among the various interfacing approaches, hand gestures represent a natural and easily interpretable means of communication [17], [18]. For this particular purpose, radars find wide use and can even be miniaturized and integrated into smartphones or other portable devices, such as the Google Soli [19]. State-of-the-art technology allows hand movement sensing with high spatial resolution but must be coupled with an action recognition algorithm to enable HCI communication. Camera-based systems can rely on computer vision techniques, such as skin color, skeleton, or motion recognition [20]. For radar applications, however, given the difficulty of recognizing the shape and contours of the hands, Deep Learning solutions are often adopted [21].
Machine learning finds applications in the most varied research areas, both for direct task solving and as a powerful computational tool for speeding up and modeling processes. Multiple topologies such as VGGNet [22], ResNet [23] and Inception [24] have been developed in recent years to solve complex tasks with very high accuracy. Such networks, however, require a fair amount of computing power and resources to be trained and adapted, which is not suitable for deployment on most edge devices [25]. Appropriate models for edge devices require specific topologies and learning processes, often leading to a trade-off between performance and adaptability. Research in the edge domain focuses mainly on two areas, namely, topology optimization for deployment and post-training adaptation [26]. Effective methods for reducing the size of models and the computation parameters include the use of information compression methods such as SqueezeNet [27] and depth-wise separable filters like the MobileNets [28]. Post-training model optimization can instead be achieved without important loss of performance, by employing techniques like quantization [29], factorization [30], distillation [31] and pruning [32]. The development of edge-efficient models has recently led to an industry movement toward such a framework. Indeed, devices with embedded deep learning components account for a large portion of state-of-the-art HCI and Internet of Things (IoT) solutions [33]. In most of today's industrial applications of deep learning, however, models and related learning algorithms are tailor-made for specific tasks [34], [35]. While application-tuned models can achieve tremendous performance in complex and multidimensional problems, they also imply visible adaptability and interpretability weaknesses [36], [37]. The target algorithms often employ a lot of data to achieve high and robust performance.
In addition, data labeling can be expensive because it may require experts, or might be sparse and depending on real-time applications [38].
A relatively new branch of machine learning, called Meta-Learning [39], has emerged to find proper solutions to problems where adaptability on few data is essential. The idea behind Meta-Learning is to use contextual information, so-called meta-knowledge, to build a more robust model, easily adaptable to new tasks with little data. A specific subclass of meta-models called optimization-based [40] allows the transfer of meta-information between tasks via gradient methods or parameter averaging. The general optimization-based approach is to learn, for a set of tasks, the best possible initialization of the parameters of a model, to make it easily adaptable in new contexts. The optimization is usually performed in the form of an episodic adaptation within two iterative steps. In base learning (inner loop), a model learns how to solve an N-ways task, where N is the number of classes randomly sampled from the large set of training classes (in the case of classification). In the outer loop, called meta-learning, an algorithm adapts the model following a generalization learning objective. The examples (shots) used in the inner loop are called support samples, while the data used with the objective of generalization are called query samples. While many meta-learning techniques rely on complex topologies and forms of gradient transmission to achieve high performance [41]-[43], optimization-based techniques, given their generality, can enable the deployment of optimized models on current edge technologies.
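The episodic N-ways K-shots sampling described above can be illustrated with a minimal sketch. The dataset layout (a dictionary mapping each class name to a list of examples), the function name `sample_task`, and the number of query examples per class are assumptions made for illustration only and are not taken from the paper.

```python
import random


def sample_task(dataset, n_way, k_shot, n_query, rng):
    """Sample one N-ways task: K support and n_query query examples per class.

    `dataset` is a hypothetical mapping {class_name: [example, ...]}.
    Classes are relabeled 0..N-1 inside the task, as is common in few-shot setups.
    """
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for task_label, cls in enumerate(classes):
        # Draw disjoint support and query examples for this class.
        examples = rng.sample(dataset[cls], k_shot + n_query)
        support += [(x, task_label) for x in examples[:k_shot]]
        query += [(x, task_label) for x in examples[k_shot:]]
    return support, query


# Toy meta-training set: 12 classes with 10 placeholder examples each.
toy = {f"gesture_{c}": [f"g{c}_s{i}" for i in range(10)] for c in range(12)}
rng = random.Random(0)
support, query = sample_task(toy, n_way=5, k_shot=1, n_query=3, rng=rng)
print(len(support), "support examples,", len(query), "query examples")
```

The support list would feed the inner loop, while the query list would be used for the generalization objective in the outer loop.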
In this paper, we propose an optimization-based meta-learning approach that enables fast model adaptation to new gestures, also at the edge. We first design a radar-based setup and a preprocessing pipeline suitable for the meta-learning context. Using the BGT60TR13C FMCW sensor [44], we gather data for a total of twenty hand gestures performed by five users in three different environments. The collected raw data undergo a defined frequency preprocessing and are then elaborated on the time axis for dimensionality reduction without relevant information loss. Then, employing Model Agnostic Meta-Learning (MAML) [45] as the base algorithm, we introduce some methods to increase the generalization capabilities of the model over new tasks, namely dynamic meta-class weighting (DMCW), task-specific gradient clipping (TSGC), and evaluation-based Gaussian noise summation (EGNS). We then describe how, by using part of a pretrained Convolutional Variational Autoencoder (Conv-VAE) in the classifier, we can greatly reduce the size of the meta-model without a major loss in generalization performance. The block diagram of the proposed model is depicted in Fig. 1.
We then compare the achieved results with other state-of-the-art meta-learning algorithms, showing how our solution leads to an optimal trade-off between network size, accuracy, and latency time. Finally, we perform an offline adaptation of the base model on Raspberry ® Pi4 with Intel ® Neural Compute Stick 2 (NCS 2), to enable the embedded application and the fine-tuning on eight defined test gestures. In this context, the training time required to tune the model to a new task on Raspberry ® Pi4 and the inference time per single prediction on NCS 2 are provided.
FIGURE 1. Block diagram of the proposed model. For each gesture, the sequence of raw radar frames is initially processed in frequency. It is then elaborated and concatenated in the time domain to obtain the range, velocity, and azimuth angle of arrival information of the targets. A VAE, pre-trained on 12 training gesture classes, compresses the three-channel image into a constrained multivariate latent distribution of dimension 15. The meta-algorithm training is done on a sequence of randomly sampled tasks, exploiting the support and query data in an N-ways K-shots approach. As the meta-iterations progress, the adaptability performance is assessed on tasks sampled from the 8 test classes.
The main contributions of this paper are as follows:
1) Implementation of a proof-of-concept user-definable radar-based hand gesture recognition system at the edge. To the best of our knowledge, this is the first implementation at the edge in the field of radar-based user-definable gesture recognition.
2) Use of a specific preprocessing aimed at simplifying both time domain dependency and computational complexity.
3) Conceptualization of some techniques aimed at increasing the generalization capability of the algorithm on unseen gestures.
4) Design of a dimensionality reduction method, through a Conv-VAE, suitable for optimization-based meta-learning at the edge.

II. RELATED WORKS
In this section, we first analyze both general and radar-based methods for hand gesture recognition. We then focus on the specific works that involve the use of little training data, such as Meta-Learning. A large part of the literature focuses on the use of vision-based techniques for gesture recognition [46]. Sagayam et al. [47] proposed a method for interpreting and classifying RGB camera-based hand gestures using a 1-D hidden Markov Model (1-D HMM). Instead of complex dynamic programming methods, a heuristic method called Artificial Bee Colony (ABC) is used for the 1-D HMM optimization. The presented algorithm leads to accurate and fast models compared to other state-based methods. The state-based approach, however, can be too slow and unsuitable for adaptation in new contexts. De Smedt et al. [48] presented a method for classifying dynamic hand skeletal data using a linear Support Vector Machine (SVM). Kinematic descriptors of gestures are extracted from the input data and then statistically and temporally coded. The pre-segmented data are then fed to the SVM for recognition. The method leads to a very low computational latency in all experiments and high performance on various datasets, but it is highly dependent on the time encoding. Liao et al. [49] illustrated a system for hand gesture-based alphabet recognition using both RGB and depth information. The Hough transform applied to the depth information is used to remove the background from the color images. The feature extraction is done through a Double-Channel Convolution Neural Network (DC-CNN). The method achieves robust performance on a large dataset, but the multi-channel approach makes it unsuitable for recognition based on other classes of sensors. Tran et al. [50] proposed a method that uses an RGB-D camera and a 3D Convolution Neural Network (3DCNN) ensemble to accurately and robustly recognize both gestures and fingertip position in real-time.
Recognition is achieved through the hand skeleton joints extracted from the depth information of the recordings. The model leads to a satisfactory accuracy of 97.12% on the test data. Despite the accuracy, the method is computationally expensive and complex to adapt to new gestures, such as those featuring fingertip oscillations. Azad et al. [51] presented a method for classifying sequences of hand depth maps by analyzing and sampling temporal information at various levels. Gesture features in the form of spatiotemporal information are derived using Weighted Depth Motion Maps (WDMM). The extracted information is further reduced by Principal Component Analysis (PCA) and classified by a single hidden layer feed-forward neural network (SLFN) with an Extreme Learning Machine (ELM). Their proposed method achieves satisfactory results on three different datasets, outperforming the results obtained by deep learning methods. Although this algorithm is less computationally complex than most deep models, its architecture is also closely related to the nature of the data and difficult to generalize to other types of input.
Other types of sensors used for touchless gesture recognition solutions involve ultrasonic sensors and Wi-Fi technology. Das et al. [52] explored the use of ultrasonic sensors for gesture recognition as a low-power and low-cost alternative to optical sensors. The classification is achieved by combining a CNN and a Long Short-Term Memory (LSTM) for both spatial and temporal feature extraction. Ultrasonic sensors can represent an alternative approach to radars but, compared to the latter, can be subject to interference phenomena and are not always application-adaptable. Zheng et al. [53] presented a system for gesture recognition via Wi-Fi that enables adaptability in various domains (i.e. orientation of people, locations, and environments). The method exhibits zero-effort cross-domain adaptability by employing a domain-independent body-coordinate velocity profile (BVP) estimation method. A Deep Neural Network (DNN) trained on a set of BVPs thus allows for robust recognition of as many as 15 hand gestures across domains without the need for re-training. Despite the versatility of the approach, the method still requires 5,000 samples for training and is not easily adaptable to new types of gestures.
The literature on recognition using radar sensors mainly focuses on Doppler or FMCW modulated radars.
Skaria et al. [54] illustrated a method for classifying 14 types of gestures captured by a Doppler radar via a deep CNN. The radar device employed is a miniaturized, low-cost dual-channel receiver model. To successfully differentiate among Doppler radar sensed gestures, the phase difference between the two antennas is exploited to infer the angle of arrival (AoA). The method shows a classification accuracy of 95% on the test set and a clear differentiation between classes. However, Doppler radars, due to their limitation in spatial resolution, find limited use for the gesture recognition commonly employed for HCI. Lee et al. [55] presented a method to improve the prediction accuracy in hand gesture recognition with the BGT60TR13C FMCW radar using deep learning. The algorithm uses domain adaptation to address the problem of gesture misrecognition due to performance differences as users vary. The information extracted from the FMCW radar is frequency processed to obtain Range-Doppler Maps (RDMs). A 3D-CNN with an Inception structure processes the spatio-temporal sequence of the RDMs for classification. In parallel, an adversarial domain discriminator is used to minimize the differences between gestures performed by different users. With this method, an accuracy of 98.8% is achieved on seven gestures performed by ten users. Domain adaptation represents a powerful generalization tool in the presence of few data but requires a related source domain rich in labels to succeed. Chmurski et al. [56] showed how a neural network with depthwise separable convolutions can lead to high accuracy values for FMCW radar-based gesture recognition while operating in a low-power and resource-constrained environment. The model, built on eight hand gestures, is optimized and deployed on the Coral Edge TPU Board. This approach, although efficient, is hardly adaptable to new actions.
In recent years, HCI research has been evolving towards the adaptability of systems in new contexts and with little data. Rahimian et al. [57] presented a class of few-shot learning architectures for gesture recognition via electromyography. The designed approach succeeds in generalizing with only a few examples per gesture by combining temporal convolution with an attention mechanism using a meta-learning approach. The contextual information acquired with experience allows the model to adapt quickly even to new gestures that have never been observed in the training phase. Lu et al. [58] illustrated a one-shot method for gesture recognition using a 3D-CNN, exploiting transfer learning from models trained on big datasets to strengthen the one-shot predictor. The approach is tested on several vision benchmark datasets, leading to good classification and latency results. Madapana et al. [59] explored Hard Zero-Shot Learning (HZSL) on vision-based datasets for dynamic gesture recognition. The work tries to solve the classification problem by exploiting only limited training information in the form of semantic descriptions. Although the performance is far from that of direct data classification, this paper shows that even minimal information can lead a model to learn to generalize.
Some work focused directly on the use of self-learning techniques for radar-based gestures. Fan et al. [60] showed how a Meta-learning approach can bring high generalization benefits for radar-based gesture recognition using FMCW modulation. The information obtained by radar for a set of seven gestures is preprocessed in the form of time maps to extract the information of range, velocity, and angle of arrival of the hands. The data is then fed in the form of tasks to an LGM-Net-based architecture [61]. The method leads to an accuracy of 97.3% on the 2-ways task employing 5 test samples per class. However, the multi-branch structure and the elaborate learning process make it computationally complex. Zent et al. [62] have recently presented a work that focuses on gesture recognition using a Doppler sensor. The information is processed as micro-Doppler spectrograms to map over time the change in frequency caused by the hand displacement atop the sensor. Rather than learning a direct mapping between gestures and labels, the presented method, called Weighting Network, based on Relation Networks [41], learns to compare the test spectrograms with those used for training. The presented solution has the great benefits of not requiring adaptation training for new gesture types and a relatively small number of parameters. However, the architecture needs inherently to learn the direct relationship between the support and query examples in the comparison module. This characteristic, intrinsic to Relation Net-based models, can lead as exposed in [63] to lack of adaptation in the testing phase compared to other methods. Further, in [64], it has been shown how an optimization-based method can be effectively employed for HCI via FMCW radar by exploiting simplified interfacing based on hand gesture sequences and a classical CNN for classification.

III. SYSTEM DESCRIPTION AND RADAR PREPROCESSING
In this section, we present the various components of the system (i.e., hardware details, operating parameters, and recording setup) and the proposed preprocessing of the data collected via radar.

A. GENERAL OVERVIEW OF THE PROPOSED FRAMEWORK
The proposed framework is shown in Fig. 2. First of all, the raw radar signals are preprocessed to extract both frequency and time information. The data obtained for each gesture in the shape of range, Doppler, and AoA temporal maps are then used as the Meta-Dataset for the optimization-based meta-learning approach. Twelve types of gestures are used to train the classifier, while the other eight are utilized for testing. After the training process, the model is deployed through the Raspberry ® Pi4 on the NCS 2 and adapted on new test gestures to demonstrate the proof-of-concept for adaptability.
FIGURE 2. Data acquisition through FMCW radar, signal preprocessing, meta-dataset generation, and training and testing process for the proposed meta-learning-based hand gesture classifier. The orange-colored parts involve the use of hardware. In yellow is the data processing, while in green is the classifier part. The frequency analysis is enabled by Fast Fourier Transform (FFT).

B. RADAR BOARD
In this work, gesture sensing is performed by the BGT60TR13C FMCW radar sensor [44], manufactured by Infineon Technologies AG. The sensor is equipped with one Transmit (TX) and three Receive (RX) channels, with the antennas integrated into the package. The information is processed channel-wise in several steps through the board to which the sensor is connected (Fig. 3). The operating principle of the sensor relies on linear frequency modulation of continuous waves. The TX transmits periodic signals called chirps, and the RXs receive the signals reflected from the targets located in front of the sensor. During operation, the instantaneous local oscillations are mixed with the reflected signals and result in an output signal called the Intermediate Frequency (IF). The IF signal is then passed to a baseband chain and digitized through an analog-to-digital converter (ADC) with 12-bit resolution. The BGT60TR13C is a miniaturized solution with a center frequency $f_0$ of 60 GHz and a bandwidth of about 6 GHz that enables a high range resolution (≈ 2 cm). The phase analysis of the IF signal, exploiting the micro-Doppler effect [65], can also enable the discrimination of displacements with millimeter accuracy. Thanks to the 3 RX channels orthogonal to each other, the radar enables the estimation of both azimuth (between 65 and -65 degrees) and elevation (between 45 and -45 degrees) AoA of targets. This system also features power mode management and an operation-optimized duty cycle to reduce power consumption to only 5 mW for applications in the 5 m range. The BGT60TR13C thus represents a low-power and miniaturized solution for short-range sensing applications.

C. RADAR PARAMETERS CONFIGURATION
The BGT60TR13 system transmits, for each so-called radar frame, a sequence of $N_c$ chirps with a single signal duration time $t_c$ along the slow-time dimension. Each chirp also consists of a number $n_s$ of samples along the fast-time dimension. The transmitted signals use the saw-tooth wave function modulation to enable a linear behavior during the chirp rise phase. For an FMCW radar, the range resolution $\Delta r$ and the maximum detection range $R_{max}$ can be derived through these formulas:
$$\Delta r = \frac{c}{2 B_w} \qquad (1)$$
$$R_{max} = \frac{n_s}{2}\,\Delta r \qquad (2)$$
where $c$ is the speed of light and $B_w$ represents the frequency bandwidth around the central frequency $f_0$. A bandwidth of 6 GHz, between 57 GHz and 63 GHz, has been chosen to enable a high range resolution of about 2.5 cm. The number of samples per chirp has been set to 32 to enable the detection of targets up to a range of 40 cm. Further, an ADC sampling frequency $F_s$ of 2 MHz has been chosen so that signal conversion does not limit $R_{max}$. The velocity resolution $\Delta v$ and the maximum detectable velocity in a given direction $V_{max}$ can be computed by the following formulas:
$$\Delta v = \frac{c}{2 f_0 N_c t_c} \qquad (3)$$
$$V_{max} = \frac{c}{4 f_0 t_c} \qquad (4)$$
A number of chirps per frame $N_c$ of 64, with a single signal duration time $t_c$ of 390.4 µs, has been chosen to allow a $V_{max}$ of about 3.14 m/s and a $\Delta v$ of about 9.8 cm/s, respectively. The parameters used for radar configuration in the hand gesture sensing application are in Table 1.
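As a quick plausibility check of the configuration values quoted in this subsection, the following sketch evaluates the textbook FMCW relations for range and velocity with the stated parameters. The use of these particular relations and of a 60 GHz carrier for the wavelength are assumptions on our part; with them, the range figures reproduce the quoted 2.5 cm and 40 cm, while the velocity figures come out close to, though not exactly at, the quoted 9.8 cm/s and 3.14 m/s.

```python
# Standard FMCW relations evaluated with the parameters quoted in the text.
C = 299_792_458.0          # speed of light in m/s
bandwidth_hz = 6e9         # B_w
center_freq_hz = 60e9      # f_0 (assumed carrier for the wavelength)
samples_per_chirp = 32     # n_s
chirps_per_frame = 64      # N_c
chirp_time_s = 390.4e-6    # t_c

# Range: resolution from the bandwidth, maximum range from half the samples
# (real-valued sampling keeps n_s / 2 unambiguous range bins).
range_res_m = C / (2 * bandwidth_hz)
max_range_m = samples_per_chirp / 2 * range_res_m

# Velocity: resolution from the frame observation time, maximum from the chirp time.
wavelength_m = C / center_freq_hz
velocity_res_mps = wavelength_m / (2 * chirps_per_frame * chirp_time_s)
max_velocity_mps = wavelength_m / (4 * chirp_time_s)

print(f"range resolution    : {range_res_m * 100:.2f} cm")
print(f"maximum range       : {max_range_m * 100:.1f} cm")
print(f"velocity resolution : {velocity_res_mps * 100:.2f} cm/s")
print(f"maximum velocity    : {max_velocity_mps:.2f} m/s")
```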

D. RADAR SIGNAL PREPROCESSING
The raw sensed radar data are not easily interpretable due to spatial resolution constraints and the influence of noise and environment surrounding the targets. While it may be possible to develop an application based on raw data as input, this would involve the training on a large amount of data that only partially contains the target information. In this work, we propose to process the signals first in frequency to extract and separate the shifts in range and velocity caused by the hands located in front of the sensor. For each detected gesture, the information is processed frame-wise and then concatenated in time to project the range and velocity contents in the 2-D plane. In such a way, Range Time Maps (RTM) and Doppler Time Maps (DTM) are generated. Exploiting the signal sensed by two RX channels, the AoA azimuth is also estimated via Capon beamformer algorithm [66]. The azimuth information is then processed and projected on the temporal plane for each frame to form Angle Time Maps (ATM).

1) Single Frame Preprocessing
For this application, the IF signal $S_{IF}(n)$ for each of the three available RX channels $n \in N_{RX}$ is employed to build a frame. For each channel $n$, the data are arranged in a 2-D matrix with slow time for the x-axis (rows) and fast time for the y-axis (columns). For each frame, by frequency analysis using Fast Fourier Transform (FFT), the Range-Doppler Image (RDI) is first calculated. The AoA azimuth is then estimated using the Capon algorithm to build the Range Angle Image (RAI). Fig. 4 to discriminate targets against unwanted background information, aka clutter (5).
where α is a parameter in the range [0 -1] set to 0.9, and S IF (n) the updated moving average for each frame. 7) A Constant False Alarm Rate (CFAR) algorithm is used for each channel n to filter the frequency peaks and increase the Signal-to-Noise Ratio (SNR).
To further increase the SNR for the RDI computation, the absolute value of the average of $S_{IF}(n)$ over the $N_{RX}$ channels is computed, as shown in (6).
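The core of the RDI computation (FFT along fast time for range, FFT along slow time for Doppler, then the absolute value of the channel average) can be sketched as follows. This is only a minimal illustration: windowing, clutter removal and CFAR are omitted, the per-chirp mean removal is an assumption, and the function name and array layout `(n_rx, n_chirps, n_samples)` are hypothetical.

```python
import numpy as np


def range_doppler_image(frame):
    """Compute a Range-Doppler Image from one real-valued radar frame.

    frame: array of shape (n_rx, n_chirps, n_samples), i.e. RX channel,
    slow time, fast time. Returns an array of shape (n_chirps, n_samples // 2).
    """
    # Remove the per-chirp mean (DC offset); an assumed, simplified step.
    frame = frame - frame.mean(axis=-1, keepdims=True)
    # FFT along fast time gives the range profile of every chirp.
    range_fft = np.fft.fft(frame, axis=-1)
    # For real input only the first half of the range bins is unambiguous.
    range_fft = range_fft[..., : frame.shape[-1] // 2]
    # FFT along slow time gives Doppler; shift zero velocity to the center row.
    doppler_fft = np.fft.fftshift(np.fft.fft(range_fft, axis=1), axes=1)
    # Average over the RX channels, then take the absolute value.
    return np.abs(doppler_fft.mean(axis=0))


# Synthetic frame: one target at range bin 5 and Doppler bin +8, seen by 3 RX channels.
n_rx, n_chirps, n_samples = 3, 64, 32
chirp_idx = np.arange(n_chirps)[:, None]
sample_idx = np.arange(n_samples)[None, :]
single = np.cos(2 * np.pi * (5 * sample_idx / n_samples + 8 * chirp_idx / n_chirps))
frame = np.repeat(single[None, :, :], n_rx, axis=0)

rdi = range_doppler_image(frame)
peak = tuple(int(i) for i in np.unravel_index(rdi.argmax(), rdi.shape))
print("RDI shape:", rdi.shape, "peak (doppler row, range bin):", peak)
```

With the zero-velocity row in the middle of the image, a positive Doppler bin of 8 lands 8 rows below the center row of a 64-row image.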
After using CFAR, the S IF (n) associated with the two RX channels placed in the horizontal plane is processed by Capon beamforming for the AoA computation. The absolute value is then calculated and the RAI is generated. Gesture sensing begins when an average S IF for the three RX channels is higher than a defined threshold, which is computed every time the sensor is turned on for a new recording session (i.e. new environment or new user). The threshold is determined as the average value of the last 20 collected frames (2 s) and it is used for comparison at every timestamp during operation. A gesture is considered gathered when the threshold is not exceeded for 5 consecutive frames.
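The threshold-based gesture segmentation described above can be sketched as a small state machine. The per-frame scalar `frame_energy` (standing in for the average $S_{IF}$ over the three RX channels), the function names, and the choice to drop the trailing quiet frames from the returned segment are assumptions for illustration.

```python
def calibrate_threshold(idle_energy, n_frames=20):
    """Threshold = average energy of the last `n_frames` collected frames."""
    recent = idle_energy[-n_frames:]
    return sum(recent) / len(recent)


def segment_gesture(frame_energy, threshold, end_patience=5, max_frames=32):
    """Return (start, end) frame indices of the first gesture, end exclusive.

    Recording starts when the energy exceeds the threshold and stops after
    `end_patience` consecutive frames at or below it, or once `max_frames`
    frames have been recorded. Returns None if no gesture starts.
    """
    start = None
    quiet = 0
    for i, energy in enumerate(frame_energy):
        if start is None:
            if energy > threshold:
                start = i
            continue
        quiet = quiet + 1 if energy <= threshold else 0
        if quiet >= end_patience:
            return start, i + 1 - quiet   # drop the trailing quiet frames
        if i - start + 1 >= max_frames:
            return start, i + 1           # recording window is full
    return (start, len(frame_energy)) if start is not None else None


threshold = calibrate_threshold([1.0] * 20)
stream = [1.0, 0.9, 3.0, 4.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
print(segment_gesture(stream, threshold))
```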
The recording window has a length of 3.1 s and therefore contains up to 32 frames for every performed action. The stored frames are then preprocessed in the form of RDI and RAI and mapped into a lower-dimensional space to compute the RTM, DTM and ATM. For each RDI or RAI, belonging to a sequence of matrices definable as A : where m × n represents the range and Doppler dimensions for the RDI and range and angle dimensions for the RAI. The information, corresponding to the distance and velocity of the target from the sensor, is extracted by taking from the RDI the

E. RECORDING SETUP
In this work, we sensed gestures via radar for a total of twenty classes. The recording setup for data collection consists of a Raspberry ® Pi4 and the BGT60TR13 board. The radar board is mounted on a tripod through a 3D-printed case. The setup in its components is shown in Fig. 6. The actions have been performed by a total of five users and in three different environments (office, hall, and outdoor). The consent has been obtained from users prior to data collection and as much anonymity and privacy as possible were maintained during the data collection and processing phases. The data have not been saved in online archives and/or published.  phase, to demonstrate the offline proof-of-concept of the system's adaptability to new gestures, the developed model is deployed on Raspberry ® Pi4 and NCS 2. This setup is shown in Fig. 7.
FIGURE 7. Recording setup for the offline proof-of-concept of the system generalization capability at the edge. The Raspberry ® Pi4 is used for data preprocessing and script running. The NCS 2 enables the deployment of the developed meta-learning model for a specific setup.

F. GESTURES DATASET
For the meta-learning approach, the gestures are split by classes between a meta-training set $D_{m-train}$ and a meta-test set $D_{m-test}$. Fig. 8 shows the twelve training and eight test gestures, respectively. The division of gestures has been performed randomly, with the only constraint of keeping together, within each of the two datasets, the sets of gestures that are opposite to each other. A t-distributed Stochastic Neighbor Embedding (t-SNE) representation of the gestures in two components is shown in Fig. 9. Extracting both range and azimuth information is crucial for correctly distinguishing some gestures from others. Examples where RTM and ATM clearly allow a distinction between two classes are shown in Fig. 10 and Fig. 11, respectively. Velocity information can improve the separation between classes, especially concerning the spatial plane in which they are performed. In addition, such information can help distinguish actions characterized by local finger oscillations, such as rubbing and tickling (Fig. 12).

IV. PROPOSED METHOD
In this section, we propose our approach, which belongs to the class of optimization-based meta-learning algorithms. We first introduce some methods to increase the model's generalization capability in comparison to the state-of-the-art. We then present the adopted CNN topology and the benefits of using a pre-trained Conv-VAE as a backbone in the meta-learning phase to reduce the number of parameters.

A. OPTIMIZATION-BASED META-LEARNING
In a conventional optimization-based meta-learning approach for deep learning, the optimization consists of two iterative steps performed over the distribution of tasks $p(T)$, to train a model represented by a parametric function $f_\theta$ with parameters $\theta$. The two optimization steps are the following:
1) In base-learning, for a batch of $N$ tasks, an inner learning model $f_{\theta'_n}$ with parameters $\theta'_n$ tries to solve each task $T_n$, given a dataset $D_{T_n}$ and a task-related loss function $L_{T_n}(f_\theta)$ to minimize.
2) In meta-learning, an outer algorithm makes use of the information obtained through back-propagation of the gradient in the inner learning phase to update the internal algorithm. The model trained during base learning also minimizes an outer loss function $L_{ext}(f_{\theta'_n})$.
If the loss function defined for the task is differentiable, the internal optimization is often performed by Stochastic Gradient Descent (SGD) in $K$ batches of training examples belonging to $D_{T_n}$. The $\theta'_n$ parameters are computed as:
$$\theta'_n = \theta - \gamma \nabla_\theta L_{T_n}(f_\theta) \tag{7}$$
where $\gamma$ is the inner loop learning rate of the meta-algorithm.
In [45], Finn et al. present a very general method called Model-Agnostic Meta-Learning (MAML), where the meta-optimization across tasks is also performed via SGD, by minimizing the loss of $f_{\theta'_n}$ with respect to $\theta$, for each single task or for $N$ tasks sampled from $p(T)$:
$$\min_\theta \sum_{T_n \sim p(T)} L_{T_n}(f_{\theta'_n}) = \sum_{T_n \sim p(T)} L_{T_n}\big(f_{\theta - \gamma \nabla_\theta L_{T_n}(f_\theta)}\big) \tag{8}$$
$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{T_n \sim p(T)} L_{T_n}(f_{\theta'_n}) \tag{9}$$
where $\beta$ in (9) is the outer loop learning rate. In MAML for few-shot supervised learning, two different datasets are defined for each task $T_n$: support samples $D_n$ for base learning and query samples $D'_n$ for the inter-task generalization step in the meta-learning phase. As can be seen in (8), the meta-gradient involves a gradient through a gradient, which can lead to instability during training and is computationally expensive. Antoniou et al. [67] present various modifications to MAML to enhance the learning stability and the generalization capability.
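The two nested updates can be illustrated on a toy problem. The following is a minimal sketch, not the implementation used in this work: it assumes a scalar parameter and a quadratic task loss, so that the derivative through the inner update can be written by hand. The function names and the `first_order` flag are hypothetical.

```python
def loss(theta, target):
    """Quadratic task loss L(theta) = 0.5 * (theta - target)^2 (toy assumption)."""
    return 0.5 * (theta - target) ** 2


def grad(theta, target):
    """Derivative of the quadratic loss with respect to theta."""
    return theta - target


def maml_step(theta, support_target, query_target, gamma, beta, first_order=False):
    """One meta-iteration with a task batch of size 1.

    gamma is the inner loop learning rate, beta the outer loop learning rate.
    Returns the updated initialization and the meta-gradient that was applied.
    """
    # Inner (base-learning) update on the support data.
    theta_prime = theta - gamma * grad(theta, support_target)
    # Gradient of the query loss, evaluated at the adapted parameter.
    meta_grad = grad(theta_prime, query_target)
    if not first_order:
        # Gradient through a gradient: d(theta')/d(theta) = 1 - gamma * L''(theta),
        # and L'' = 1 for the quadratic loss assumed here.
        meta_grad *= 1.0 - gamma
    # Outer (meta-learning) update of the initialization.
    return theta - beta * meta_grad, meta_grad
```

With `first_order=True` the factor coming from the inner update is dropped, which is the cheaper approximation that avoids second-order derivatives.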
In our work, we adopt MAML as the base algorithm, with a task batch size $N$ of 1, and we exploit some of the methods presented in [67] to improve the training stability. Specifically, we leverage the following contributions:
• Multi-Step Loss Optimization (MSL): instead of minimizing the outer loss function after the completion of all base-learning steps for the support set $D_n$, we perform an update after each inner-epoch $i \in I$, composed of $K$ batches, using $D'_n$. Specifically, we exploit a set of importance weights $v_i$ that enables a higher loss contribution for the latest $i$ in $I$. In addition, as the meta-iterations performed on the distribution of tasks $p(T)$ progress, the relative weights of the early epochs are decreased and those of the late epochs are increased. This strengthens the ability to learn from every individual task $T_n$ without potentially destabilizing learning. In comparison to the method proposed in [67], where the update of the outer loss is performed after each step on the support set, we suggest an update after each inner-epoch. This leads to a trade-off between intra-task learning steps and computational complexity.

• Derivative-Order Annealing (DA): the use of the second-order gradient is computationally expensive and can make the optimizer inefficient and unstable during the early training phase of MAML. To overcome these problems, we anneal the derivative order by exploiting only the first-order gradient information in the first 50 meta-iterations.
• Cosine Annealing of Meta-Optimizer Learning Rate (CA): to fine-tune the optimization via the outer algorithm as the meta-iterations progress, we apply a cosine annealing schedule to the optimizer. This yields an increase in generalization performance without impacting the computation per task $T_n$.
We additionally propose some methods that can increase the generalization capability of MAML without any increase in computational complexity in evaluation and testing. For this purpose, we present the Dynamic Meta Class Weighting (DMCW), the Task-Specific Gradient Clipping (TSGC), and the Evaluation-based Gaussian Noise Summation (EGNS).
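The DA and CA schedules listed above can be sketched as two small functions. This is an illustrative sketch under stated assumptions: the 50 first-order meta-iterations, the initial meta-learning rate of 7e-4 and the decay step of 2,000 are the values reported in this paper, whereas the zero-based iteration index, the final learning rate `min_lr` and the function names are hypothetical choices.

```python
import math


def use_second_order(meta_iteration, first_order_iterations=50):
    """Derivative-order annealing: first-order gradients only in the early iterations.

    meta_iteration is assumed to be zero-based.
    """
    return meta_iteration >= first_order_iterations


def cosine_annealed_lr(meta_iteration, initial_lr=7e-4, min_lr=0.0, decay_steps=2000):
    """Cosine decay from initial_lr to min_lr over decay_steps, constant afterwards.

    min_lr = 0.0 is an assumption; the text does not state the final value.
    """
    progress = min(meta_iteration, decay_steps) / decay_steps
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Both functions depend only on the meta-iteration counter, so they add no per-task cost.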

1) Dynamic Meta Class Weighting
In a task learning approach with only a few data, a model can easily overfit the training instances, leading to weak classification performance on the testing instances. Few examples per class may not be informative enough to describe the class and may lead to significant misclassifications in testing. One way to counter this is to use in the inner loop, for each task $T_n$, a set of class weights $\forall c \in C$, where $C$ represents the number of ways. Specifically, we propose to compute after each inner-epoch, for each $c \in C$, a weight $v_c$ which is inversely proportional to the number of correct predictions. The idea is to sample for each task
where $y_m$ represents each instance-associated label, $\hat{y}_m$ the predicted label after every inner-epoch, and $v_c$ the computed weights before normalization. The resulting $v_c$ weights are used both in the base-learning and the meta-learning updates after each batch $k$ in $K$. Respectively:
and for the meta-learning update, through MSL:
Each inner update improves intra-task classification performance by bringing more attention to minimizing $L_{T_n}$ on classes whose examples have been poorly classified. In addition, the outer update allows inter-task propagation of the information obtained with the weights $v_c$ to improve generalization performance.
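The class-weighting idea can be illustrated with a short sketch. It is not the formula used in this work: the text only states that each weight is inversely proportional to the number of correct predictions, so the specific form 1 / (1 + correct), the normalization to a sum equal to the number of classes, and the function names are all assumptions made for illustration.

```python
def class_weights(true_labels, predicted_labels, num_classes):
    """Weights that grow as a class is classified correctly less often."""
    correct = [0] * num_classes
    for y, y_hat in zip(true_labels, predicted_labels):
        if y == y_hat:
            correct[y] += 1
    # Assumed form: inverse of (1 + correct count); the +1 avoids division by zero.
    raw = [1.0 / (1.0 + c) for c in correct]
    # Assumed normalization: the weights sum to the number of classes.
    scale = num_classes / sum(raw)
    return [w * scale for w in raw]


def weighted_mean_loss(per_example_losses, true_labels, weights):
    """Mean loss in which each example is scaled by the weight of its class."""
    total = sum(weights[y] * l for l, y in zip(per_example_losses, true_labels))
    return total / len(per_example_losses)
```

In this sketch, a class whose sampled examples are all misclassified receives the largest weight, so its examples dominate the next update.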

2) Task-Specific Gradient Clipping
Task training performed with little data for a given number of epochs $I$ brings benefits in some cases but can also lead to gradient explosion and instability in others. The model can thus overfit on a given task, making generalization to others less effective. One solution is to perform gradient clipping for the intra-task updates when the gradient exceeds a threshold, as presented by Pascanu et al. [68]. In our case, we suggest using clipping in the intra-task phase, for each batch $k$ in $K$, when the gradient $g$ computed for $L_{T_n}$ exceeds a certain threshold $h$:
$$g \leftarrow \frac{h}{\|g\|}\, g \quad \text{if } \|g\| > h$$
where $\|g\|$ represents the L2 norm computed on the gradients. We further propose not to use gradient clipping for the intra-task update on queries via $L_{ext}$. By doing so, the query update grants a higher contribution to the whole optimization-based procedure.
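Clipping by the L2 norm, in the form described by Pascanu et al. [68], can be sketched as follows. The gradient is represented as a flat list of floats for simplicity; the function name is hypothetical, and the threshold of 0.5 used in the test below is the value reported in the experimental setup.

```python
import math


def clip_by_l2_norm(gradients, threshold):
    """Rescale the gradient so that its L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in gradients]
    # Below the threshold the gradient is returned unchanged.
    return list(gradients)
```

In the scheme described above, this function would be applied to the gradients of the support batches only, while the gradients of the query update are left unclipped.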

3) Evaluation-based Gaussian Noise Summation
Training on a sequence of tasks for a large number of meta-iterations can make the algorithm too specific to $D_{\text{m-train}}$, thus decreasing the generalization capability on $D_{\text{m-test}}$ and leading to so-called meta-overfitting. One way to counteract such behavior on $D_{\text{m-train}}$ is to increase the complexity of the task when the performance becomes very high. A task $n$ can be made more complex by adding Gaussian noise to the examples $x_n$ in $D_n$ or to their embedded representations, such as the outputs of the hidden layers of the model. Specifically, we propose to sum to the output of various depths of the model random Gaussian noise in the interval $[-\sigma; \sigma]$ from the distribution $N(\mu, \sigma^2)$, generated for each batch $k$ in $K$. This Gaussian noise is activated for a new training task only when the validation accuracy, computed on a sequence of tasks sampled from $D_{\text{m-train}}$, exceeds a defined threshold.
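The gating mechanism can be sketched as follows. This is an illustration under assumptions, not the implementation used in this work: the buffer length of 5 and the 89% threshold are the 1-shot values reported in the experimental setup, while averaging the buffer, requiring it to be full, clipping the noise to $[-\sigma; \sigma]$ (instead of resampling), and all names are hypothetical choices.

```python
import random
from collections import deque


class NoiseGate:
    """Tracks recent validation accuracies and decides whether noise is enabled."""

    def __init__(self, threshold=0.89, buffer_length=5):
        self.threshold = threshold
        self.accuracies = deque(maxlen=buffer_length)

    def record(self, validation_accuracy):
        self.accuracies.append(validation_accuracy)

    def active(self):
        # Assumption: the gate opens when the mean over a full buffer exceeds the threshold.
        full = len(self.accuracies) == self.accuracies.maxlen
        return full and sum(self.accuracies) / len(self.accuracies) > self.threshold


def add_bounded_gaussian_noise(activations, sigma, rng, mu=0.0):
    """Add N(mu, sigma^2) noise, clipped to [-sigma, sigma], to each activation."""
    noisy = []
    for a in activations:
        noise = max(-sigma, min(sigma, rng.gauss(mu, sigma)))
        noisy.append(a + noise)
    return noisy
```

A training loop would call `record` after each validation and apply `add_bounded_gaussian_noise` to the chosen layer outputs of the next task only while `active()` returns true.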

B. PROPOSED TOPOLOGIES
For the optimization-based meta-learning approach, we propose the use of two topologies: first, a traditional one consisting of sets of convolutional layers for feature extraction; then, a structure that uses part of a Conv-VAE as a backbone to considerably reduce the number of parameters in the overall topology. For both neural networks, the goal is, given a task $T_n$, to map the sequence of RTMs, DTMs, and ATMs belonging to a gesture to the respective class.

1) Convolutional Neural Network
The first topology consists of three convolutional layers followed by a final dense layer. The convolutional layers use 128, 256, and 512 filters, respectively, with a kernel size of 3 × 3 and a stride of 2. Each of these layers is followed by batch normalization, to increase the training stability for each batch $k$, and by the ReLU activation function. A Flatten layer and a Dense layer are attached to the last of the three convolutional blocks. The number of output neurons of the Dense layer corresponds to the number of classes in the experiment. The classification is enabled through the Softmax activation function, which maps the output vector into a class probability distribution. The topology is depicted in Fig. 13.
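The layer sizes above are enough to check the parameter count reported for this topology in the results (1,562,629 trainable parameters in the 5-ways setup). The sketch below is only an illustration: it assumes three input channels, bias terms in every layer, two trainable parameters per batch-normalization channel, and a flattened feature vector of 16,384 values (for instance 4 × 8 positions × 512 filters). That last size is chosen here because it reproduces the reported total; it is not taken from the text. All function names are hypothetical.

```python
def conv_params(in_channels, filters, kernel=3):
    """Weights plus biases of a 2-D convolution with a square kernel."""
    return kernel * kernel * in_channels * filters + filters


def batch_norm_params(filters):
    """Trainable scale and shift of batch normalization."""
    return 2 * filters


def dense_params(inputs, outputs):
    """Weights plus biases of a fully connected layer."""
    return inputs * outputs + outputs


def cnn_trainable_params(flattened_features, num_classes,
                         in_channels=3, filters=(128, 256, 512)):
    """Total trainable parameters of the three conv blocks and the dense layer."""
    total = 0
    channels = in_channels
    for f in filters:
        total += conv_params(channels, f) + batch_norm_params(f)
        channels = f
    return total + dense_params(flattened_features, num_classes)
```

Under these assumptions, about 1.48 million of the parameters sit in the convolutional blocks, which is why the number of ways has little influence on the model size.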

2) Conv-VAE and Dense
The second topology exploits part of a Conv-VAE, pre-trained on $D_{\text{m-train}}$, to significantly squeeze the input size. The Conv-VAE compresses the three-channel information (RTM, DTM, and ATM) into a constrained multivariate latent distribution of dimension 15. The Encoder part of the Conv-VAE model is then extracted and concatenated to a sequence of Dense layers for task training. The Dense layers consist of 256, 128, and N output neurons, respectively, where N corresponds to the number of ways of the experiment. For this topology as well, the outputs of the last layer are mapped to a class probability distribution through Softmax. The layers extracted from the Conv-VAE are also trained during optimization-based meta-learning for the N-ways classification objective. The topology is shown in Fig. 14.
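As a rough indication of how the parameters of this topology are distributed, the size of the Dense head can be computed from the figures above. The sketch assumes that the head receives a 15-dimensional latent vector and that every Dense layer has bias terms; neither detail is stated explicitly, and the function name is hypothetical.

```python
def dense_head_params(latent_dim=15, hidden=(256, 128), num_ways=5):
    """Weights plus biases of the Dense layers attached to the encoder."""
    sizes = (latent_dim, *hidden, num_ways)
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))
```

For the 5-ways setup this gives 37,637 parameters. Under the same assumptions, the remainder of the 118,851 trainable parameters reported in the results for this topology (81,214) would belong to the encoder layers.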

V. EXPERIMENTAL SETUP
In this section, we present and analyze the performed optimization-based meta-learning experiments. Specifically, we conducted 1-shot 2-ways, 1-shot 5-ways, 3-shots 5-ways, and 5-shots 5-ways experiments. The presented algorithm and methods are mainly analyzed in the 5-ways setup, to depict their advantages. The algorithm has been developed in Python with the TensorFlow® module. The performance tests for the state-of-the-art comparison have been performed on a 5-core CPU. At the edge side, the Raspberry® Pi 4 and the NCS 2 have been employed, with Raspbian OS as the operating system. To run the model on the NCS 2 and optimize the inference process, we used the OpenVINO module in Python.
Copyright © Infineon Technologies AG 2021. All rights reserved.
For the 5-ways experiments, the task training is performed through 4 inner epochs with an inner batch of size 2 for 1-shot, and of size 3 otherwise. For both the base-learning and meta-learning phases, the Adam optimizer is used with $\beta_1$ and $\beta_2$ equal to 0 and 0.5, respectively. The inner learning rate is set to 8e-4, whereas the meta-learning rate has an initial value of 7e-4 with a decay step of 2,000. The chosen number of meta-iterations is 2,200, while the classes for each task are randomly sampled from $D_{\text{m-train}}$. The loss function chosen for the classification is the categorical cross-entropy. In the evaluation phase, accuracy statistics are saved and processed every 220 iterations in the form of box plots. For the experiments with EGNS, a task buffer of length 5 has been chosen, with a $D_{\text{m-train}}$ validation accuracy threshold of 89%, 95%, and 98% for 1-shot, 3-shots, and 5-shots, respectively. For the TSGC experiments, the gradient is clipped when the L2 norm exceeds 0.5. For the DMCW, a total of 10 samples per class is used for the computation of the weights. The generated models are finally tested on 1,000 tasks sampled from $D_{\text{m-test}}$. For DMCW and EGNS, the final task training is performed as a traditional single-task optimization approach. For TSGC, gradient clipping is also executed on the training batches.
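The random construction of a task from the meta-training classes can be sketched as follows. This is a hypothetical helper, not the sampling code used in this work: the dictionary layout, the number of query examples per class and the relabeling of the sampled classes to 0..N-1 are assumptions made for illustration.

```python
import random


def sample_task(examples_by_class, num_ways, num_shots, num_queries, rng):
    """Draw an N-ways K-shots task with disjoint support and query sets.

    examples_by_class maps a gesture name to the list of its examples.
    """
    classes = rng.sample(sorted(examples_by_class), num_ways)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Sample without replacement so that support and query do not overlap.
        picked = rng.sample(examples_by_class[cls], num_shots + num_queries)
        support += [(x, label) for x in picked[:num_shots]]
        query += [(x, label) for x in picked[num_shots:]]
    return support, query
```

For a 1-shot 5-ways experiment, `sample_task(data, 5, 1, num_queries, rng)` returns five labeled support examples, one per sampled gesture.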

B. PERFORMANCE EVALUATION
We first present the results obtained on a single experiment, showing the benefits achievable on unseen classes thanks to an optimization-based meta-learning approach. Then, we conduct an ablation study by analyzing the contributions of the individual proposed methods, for both proposed topologies. Next, we compare the results obtained on a 5-core CPU with those of some existing techniques in terms of neural network size, prediction accuracy, and latency. Finally, regarding adaptation at the edge, we report the adaptation time to new tasks and the model deployment on the NCS 2.

1) Experiment Analysis
The metric used to evaluate the training performance of each model is the validation accuracy. This metric is estimated after each meta-iteration by evaluating the model on newly sampled tasks. For each validation, two tasks are sampled, from $D_{\text{m-train}}$ and $D_{\text{m-test}}$, respectively. A box plot of task statistics is built every 220 meta-iterations. Generalization ability can be assessed by observing the variation in the box plots as meta-iterations progress. In a successful experiment, we observe an increase of the median accuracy over the sequence of box plots, as well as a reduction of the percentile intervals and whiskers. The trend of the box plots for the experiment with EGNS for the CNN topology is shown in Fig. 16. The contribution of EGNS is combined with the basic MAML + MSL + DA + CA algorithm, which we term + MAML.
Another possible way of assessing the generalization capability is to observe the distribution of the validation accuracy as meta-iterations increase. Usually, for the first training tasks, the accuracy tends to assume a multimodal shape due to the different complexity of the tasks. During training, the model learns to better solve new tasks thanks to the improved parameter initialization. This leads the accuracy distribution to become negatively skewed, with a tendency toward 100% correct classification. The accuracy density histograms, generated for the first and last 220 meta-iteration box plots, are shown in Fig. 17 for the CNN-EGNS experiment. The quartiles and range percentages are noted in the middle plot on a Gaussian distribution that could be associated with the box plot. By definition, 50% of the values are contained between the first and third quartiles of the box plots. The actual accuracy distribution, however, as can be seen, does not assume a Gaussian shape.
The generalization outcome can also be observed on the individual classes by generating a cumulative confusion matrix for sets of meta-iterations. Fig. 18 depicts the confusion matrices of the first and last 550 meta-iterations for the experiment with EGNS and the CNN topology. As can be seen from the matrices, as the iterations progress, the model learns to solve new tasks more quickly thanks to the updated initialization. This also applies to the unseen classes belonging to $D_{\text{m-test}}$.
Some actions are harder to distinguish from each other because of similarities in their patterns, thus leading to specific prediction errors. Examples are the misclassification between right swipe and diagonal nw-se in the confusion matrix on $D_{\text{m-train}}$ and, symmetrically, that between left swipe and diagonal nw-sw for $D_{\text{m-test}}$.
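The box-plot statistics used in this analysis can be computed with the Python standard library. The following sketch is illustrative only: the window of 220 meta-iterations is the value used in this work, whereas the quartile interpolation method and the function names are assumptions.

```python
import statistics


def box_plot_summary(accuracies):
    """Quartiles, median and interquartile range of a list of task accuracies."""
    q1, median, q3 = statistics.quantiles(accuracies, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


def summarize_windows(accuracies, window=220):
    """One summary per complete window of consecutive meta-iterations."""
    return [box_plot_summary(accuracies[i:i + window])
            for i in range(0, len(accuracies) - window + 1, window)]
```

A rising `median` together with a shrinking `iqr` across consecutive windows corresponds to the trend described above for a successful experiment.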

2) Results Analysis
All the experiments have been performed for both the proposed topologies, analyzing the combination of the presented methods against the base algorithm + MAML. Each experiment, tested on 1,000 final test tasks, has been repeated three times. The average accuracy results for the 5-ways experiments are presented in Table 2 and Table 3, respectively.
For the CNN topology experiments, the total number of trainable parameters in the model is 1,562,629. This large number of parameters, as can be seen from the accuracy results in Table 2, allows the model to generalize well, guaranteeing, with a fast adaptation, excellent results on unseen classes.
For the simulations with the Conv-VAE+Dense topology, the total number of trainable parameters in the model drops to only 118,851. Because the input information is mapped to a small size, individual tasks may be more affected by overfitting. In this case, the DMCW introduces benefits compared to the basic version of the algorithm. This contribution is also beneficial with 3 and 5 shots, probably because it supports the classification of the compressed information squeezed by the backbone. For this topology, the best results are achieved by combining the DMCW and TSGC methods. The classification of low-dimensional representations is aided by class weighting for individual tasks. The TSGC, on the other hand, avoids exploding gradients and gives more importance to the outer loop update at the end of each inner epoch. The combination of the three methods brings equal or less satisfactory results than combining two of them for the 1-shot and 3-shots experiments. With 5 shots, the contributions of the single techniques do not lead to better results than + MAML. This is probably due to the higher amount of data available, which leads to smoother task training.
The accuracy results for both topologies in the 1-shot 2-ways approach are presented in Table 4.
For the 1-shot 2-ways experiments, the greatest benefits are achieved through TSGC for both topologies. For a 2-ways application, the DMCW contributions are counterproductive or not significant. Class weighting with only two categories can easily skew the learning towards one of them, especially with small input sizes as for the second topology. For similar reasons, the model can learn to over-depend on the noise-augmented inputs of EGNS and perform worse on the test data. For the CNN, the combined use of EGNS and TSGC brings some benefits, mainly by preventing overfitting in the base-learning phase, given the greater simplicity of the tasks. The accuracy reached with the three techniques combined shows how preventing over-dependence on the individual tasks can favor generalization.
The average adaptation times to new tasks on a 5-core CPU for the two topologies are listed in Table 5.
As can be seen from the table, the model size of the Conv-VAE topology, which is an order of magnitude smaller than that of the CNN, allows a reduction of the adaptation time by half for the 1-shot experiments. The time required to adapt to a new task is further reduced for the Conv-VAE when more than one example per class is employed. Regardless of the method used, the inference time on CPU to predict the class of a single example is on average 64 ms for both topologies in the 5-ways approach.

3) Comparison with Existing Techniques
The best results obtained through the various experiments and topologies are compared with both state-of-the-art meta-learning algorithms and classical optimization-based algorithms, trained on $D_{\text{m-train}}$ and tested on $D_{\text{m-test}}$. Specifically, the Reptile [69] and MAML algorithms for the optimization-based class, and the Weighting Net and LGM-Net, employed in [61] and [62], are trained on the gesture data. For the comparison, similar evaluation conditions are used. The Reptile and MAML (2nd order) algorithms are utilized to train the proposed CNN topology for 2,200 meta-iterations. The topology presented in [61], adapted to 3-channel gesture information, has been employed for the LGM-Net. The Weighting Net, with a feature dimension of 64, has been adapted to the shape of the gestures, and the relative embedding module has been trained to extract features only from the support instances. The accuracy comparison results, averaged over three repetitions, are presented in Table 6.
As can be seen from Table 6, the proposed method with the CNN topology performs best in the 1-shot 5-ways experiment, leading to better results than the Weighting Net by around 3%. In all the other experiments, the proposed method is slightly less accurate only than the Weighting Net. With more than one shot, the Weighting Net has the advantage of being able to average the predictions obtained from a sequence of comparisons of the query image with the support ones. However, with only one example per class available, it lacks this feature and loses robustness. The proposed methods, though, lead in all the experiments to better results than all the other optimization-based methods. For simple experiments (2-ways) or a higher number of shots, the difference in accuracy between the methods gets narrower. In such conditions, even the simplest algorithms can extract highly informative features from the samples. The resolution of the tasks thus becomes less dependent on the initialization, making the employed generalization techniques less effective.
The comparison in terms of model size is presented in Table 7.
In terms of the number of parameters, the Conv-VAE+Dense approach enables the generation of a model that is an order of magnitude smaller than the CNN. Even though the second topology performs a few percentage points worse than the Weighting Net in terms of accuracy, it requires about half as many parameters for task resolution. Furthermore, among the compared methods, the Conv-VAE topology is the one requiring the fewest parameters.
Table 8 presents the time required for adaptation to a new task (Ta) and for the single-sample inference (Ti) for the considered algorithms. Reptile and MAML are tested using the same methodology as the proposed optimization-based models. As they are applied to the CNN topology, they lead to results very similar to those of the proposed methods and are, therefore, excluded from this table. For LGM-Net, the two values of Ta and Ti are summed, given the high degree of interdependence between the modules (Embedding, MetaNet, and TargetNet) in its structure. For the Weighting Net, Ta is estimated as the time required to map the support examples to a reduced size via the EmbeddingNet. In this case, Ti is computed as the time needed to process and classify a query example through the entire model pipeline after the support adaptation. In terms of adaptation time (Ta), the proposed models take longer than the Weighting Net. On the other hand, the optimization-based models enable the instance classification in a significantly shorter time (Ti), which is independent of the number of training shots. In the 5-shots experiment, the proposed topologies require only a quarter of the time needed by the Weighting Net for prediction. This brings a considerable advantage in real-time applications or implementations at the edge.

4) Edge Implementation
The topologies presented in this paper use only NCS 2 compatible layers and procedures. All models, pre-trained with the optimization-based approach on 5-core CPUs, are adapted to single tasks generated from $D_{\text{m-test}}$ on the Raspberry® Pi 4. The models are then deployed on the NCS 2, and the prediction inference for each test sample is conducted at the device level. For the various experiments, the achieved results in terms of task adaptation time on the Raspberry® Pi 4 and deployment on the NCS 2 are presented in Table 9.
As can be seen from the table, as the number of samples per class increases, the time to adapt to a new task increases significantly for the CNN topology, requiring up to more than 21 s for an adaptation. On the contrary, the Conv-VAE+Dense, given the much smaller number of parameters, requires less than 7 s for a 5-shots adaptation. The Conv-VAE+Dense can therefore lead to a saving of up to about two-thirds of the time. The time needed for a single prediction through the NCS 2 after model adaptation is topology dependent. For both 2-ways and 5-ways experiments, the model requires an average of 5 ms and 4 ms for CNN and Conv-VAE+Dense, respectively. Because the adaptation is performed offline, the single inference time (Ti) does not include the time required for gesture sampling (Ts) and the preprocessing time (Tp). These times depend on the type of gesture performed, its intrinsic duration, and the number of recorded frames before applying zero padding. Table 10 presents the computed Ts and Tp times for one example per class of each of the 20 gestures. The Tp values are obtained as an average over 10 preprocessing repetitions of the same example performed on the Raspberry® Pi 4. Thus, the total (end-to-end) time consists of the sum Ts+Tp+Ti.
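As a rough check of the saving stated above, the rounded bounds of 21 s (CNN) and 7 s (Conv-VAE+Dense) for a 5-shots adaptation can be used in place of the exact entries of Table 9. Writing $T_a$ for the adaptation time, as in Table 8:

$$1 - \frac{T_a^{\text{Conv-VAE+Dense}}}{T_a^{\text{CNN}}} \approx 1 - \frac{7\ \text{s}}{21\ \text{s}} = \frac{2}{3}$$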

VI. CONCLUSION
In this paper, we present a complete pipeline based on hand gestures sensed by an FMCW radar, as a proof of concept of user adaptability to novel, unseen hand poses. The system solution, based on data collected for twenty different types of gestures from five users in three different environments, allows not only the extraction of useful features of the performed actions but also a fast adaptation to new gestures. The pipeline is composed of a first preprocessing phase, then a meta-learning approach to generate the best possible model initialization, and finally an edge-suitable adaptation to new tasks and classes never faced in the training phase. The specific preprocessing employed, thanks to the combination of techniques in both the frequency and time domains, allows processing the main information of the gestures, thus significantly reducing the size of the raw data collected by the radar. The information constructed for each gesture, in the form of 3 channels, represents the hand distance from the radar, the action velocity, and the azimuth angle of arrival. A meta-learning optimization-based approach, trained on twelve of the processed gestures, shows how new, never faced tasks can be solved better thanks to the context information extracted in the training phase. Three techniques are presented to increase the generalization ability of the model in comparison to the state of the art: dynamic meta-class weighting, task-specific gradient clipping, and evaluation-based Gaussian noise summation. The introduced methods have the advantage of improving the initialization of the model's parameters in the training phase without directly affecting the final adaptation setup on the eight test classes. This enables both a more versatile implementation at the edge and a very fast prediction on new samples, remarkably reducing the computation latency.
Further, compared to other state-of-the-art techniques, the optimization-based approach does not involve the comparison of the query samples with the support ones in the test phase, thus bringing an additional latency reduction. Two different topologies for task resolution are presented. A first topology, based on a series of convolutional layers, enables feature extraction for each sample thanks to a large number of parameters. A second topology instead employs the encoding part of a Conv-VAE as a backbone to efficiently extract features, thus greatly reducing the number of model parameters. For this topology, a greater effect of the presented optimization techniques is visible, thanks to the various contributions that counteract the effects of overfitting, exploding gradients, and meta-overfitting. Thanks to these features, this topology enables the generation of models that perform very well in terms of accuracy but with half the parameters required in comparison to the state of the art. Moreover, the results obtained at the edge show how these algorithms can also be used for real-time applications, aiding the adaptation to new users, gestures, and situations.
To the best of our knowledge, this is the first user-adaptable model implemented at the edge for radar-based HCI.
On the other hand, the generated models lead to an accuracy that is lower than the state-of-the-art in several experiments. Other meta-learning algorithms, based on the classification of relations among examples, have the inherent advantage of bringing more robust predictions. Future work will explore the application at the edge of relational algorithms and potential methods of reducing the model size without harming generalization capabilities. Experiments with a broader set of gestures and examples will also be conducted, examining the generalization ability of the models across various splits of users and environments.