Gesture-ProxylessNAS: A Lightweight Network for Mid-Air Gesture Recognition Based on UWB Radar

Hand gesture recognition with radar sensors is essential because they can detect gestures despite environmental factors like lighting, dust, and complex backgrounds. Considering the complexity of such a system, it is challenging to design CNNs for CPU devices and realize portable (carry-on) mid-air gesture recognition. We propose a mid-air gesture recognition method based on a novel discriminant feature, which can be used as part of a measurement system of hand movements using an ultrawideband (UWB) radar. The Gesture-ProxylessNAS (GPNAS) is presented to enhance the adaptability of model search and overcome the challenge of the network's computational complexity. In order to fully extract local spatial discriminant features and prevent information loss, local binary pattern (LBP) encoders are utilized. In the meantime, multilayer ShuffleNet with depthwise separable convolution is used to gradually leverage high-level spatial features. The GPNAS module revisits the multilayer ShuffleNet's design spaces as an optimization problem, greatly reducing the network's parameters and computational complexity. According to experimental verification on real UWB hand gestures, the proposed framework provides more satisfactory recognition performance and efficiency with a deeper network structure and fewer parameters. The proposed hand gesture recognition system recognizes gestures with a promising accuracy of 96.52% on the UWB-gestures public dataset.

Lihong Qiao, Zhixin Li, Bin Xiao, Yucheng Shu, Weisheng Li, Member, IEEE, and Xinbo Gao, Senior Member, IEEE

I. INTRODUCTION
Mobile devices, including tablets and phones, have become essential communication and entertainment tools in daily life as a result of advancements in electronic technology and artificial intelligence. Generally, users can only interact with these devices through touch or keyboards. One of the present methods of interaction for mobile devices is gesture interaction, which makes human-machine contact more convenient. Having comparable hand gestures for several frequently used actions is thought to be useful. Additionally, the operation times of some complex instructions can be greatly shortened. Therefore, hand gesture recognition is becoming a promising technique for human-machine interaction, with numerous applications in a wide range of industries.
Many contemporary studies on gesture recognition employ a variety of sensors, including cameras [2], [3], audio, WiFi, radio frequency identification (RFID), Bluetooth, Doppler radar [4], and ultrawideband (UWB) radar [5], [6], [7]. For vision-based systems, variations in brightness, contrast, and exposure affect the accuracy of gesture detection. UWB radar, on the other hand, can reduce the negative impact of background disturbances during mid-air gesture sensing, which makes UWB radar-based mid-air gesture interaction the most promising area for study. UWB technology sends and receives impulse signals to locate moving objects. It benefits from strong antimultipath capability, excellent precision, and powerful penetration. It also eliminates the privacy issues that have long plagued conventional antenna technology.
The key step is selecting highly discriminative features, since gestures are short and their spectra are noise-sensitive. As model parameters and FLOPs increase, it becomes more challenging to achieve fast inference on mobile devices. To extract and recognize hand motions, conventional machine learning [8], [9], [10], [11], [12] requires predetermined features. The features that matter for recognition are typically unidentified [13], since the majority of features are arbitrarily defined [14]. In contrast, deep learning algorithms, particularly convolutional neural networks (CNNs), do not require predetermined features but discover, via training, the key features that have a significant impact on gesture recognition precision [15]. Creating and training a convolutional neural network is therefore a promising method for hand gesture recognition, as it enhances gesture categorization effectiveness and does not require a predetermined set of features.
In particular, we concentrate on the following essential questions: How can network visibility be increased without increasing latency, to ensure more effective feature presentation? What features improve the precision of CPU-based lightweight models? How can a network architecture be designed to achieve both high recognition performance and lightweight performance?
Based on these problems, we focus on finely designing a lightweight network architecture within the framework. We introduce the multilayer ShuffleNet with depthwise separable convolution module to improve the recognition performance, and the Gesture-ProxylessNAS training approach to compress the model and improve its real-time performance. Specifically, we utilize the efficient network architecture design of ShuffleNet [16], [17], which achieves an efficient and lightweight network architecture through group convolution and channel shuffling. We adopt this design to form the basic network architecture for gesture recognition.
However, in the later design of the framework, as the number of convolution layers increases, the accuracy of gesture recognition increases, but this leads to larger model sizes. How to compress the model while minimizing the loss of accuracy has become a pressing challenge. Finally, we specially designed a neural architecture search method for the gesture recognition algorithm, Gesture-ProxylessNAS, which builds on ProxylessNAS [21].
Moreover, after visualizing the UWB data, we find that each piece of data (in image format) has obvious textural features, and through the ablation experiment, we find that a local binary pattern (LBP) encoder is very helpful for improving the performance of the gesture recognition network.
We propose a simple gesture recognition neural network called GPNAS, based on Gesture-ProxylessNAS. The input signals for the GPNAS under consideration are gesture signals extracted by the UWB radar. For the purpose of obtaining local spatial discriminant features, we employ LBP encoders. The Gesture-ProxylessNAS module reevaluates the design spaces of multilayer ShuffleNet and combines them with a proxyless neural architecture search unit, which substitutes the traditional convolution and significantly lowers the network's parameters and computational complexity. It encapsulates an optimization network by figuring out the best path to take. In order to extract the spectral-spatial feature, the proposed GPNAS framework uses lightweight multiscale attention structures. This results in fewer parameters, lower computing costs, and a deeper network structure. The experimental results further illustrate that GPNAS is capable of producing a more satisfactory recognition result while using fewer parameters and less processing time. This article's overall summary and scope are presented in Fig. 1.
The contributions of this article are summarized as follows:
1) We present a UWB radar-based system for recognizing mid-air hand gestures, built on local binary pattern (LBP) encoders that extract local spatial discriminant features.
2) The multilayer ShuffleNet with depthwise separable convolution is presented with low computational cost.
3) The Gesture-ProxylessNAS module employs a ProxylessNAS network to select the optimal network, reducing the time cost by approximately one-third while maintaining high precision.
The rest of this article is organized as follows. In Section II, we provide a brief overview of the research relevant to our work. In Section III, we introduce the UWB-gestures dataset. We present the proposed method in Section IV. In Section V, we perform the recognition experiments on the UWB gesture signal. Finally, Section VI is our discussion of the work and Section VII is the conclusion.

II. RELATED WORK
Radar-based gesture recognition has garnered a great deal of interest recently. Seong et al. [22] developed a CNN model to recognize digits written in mid-air using hand gestures after collecting the gesture signals using three impulse radio ultra-wideband (IR-UWB) radar sensors. Using a two-antenna Doppler radar, which can generate the in-phase and quadrature Doppler components of the beat signals, Skaria et al. [13] recorded gestures and mapped the two beat signals into the three input channels of a DCNN for gesture recognition. Hendy et al. [23] introduced five more neural network models, including FNN, 2D-CNN, 3D-CNN, 2D-CNN-LSTM, and 3D-CNN-LSTM, utilizing two alternative data representations for the gesture signals obtained with a single UWB radar. Skaria et al. [24] examined four recognition methods, including FCNN, k-NN, SVM, and LSTM, using a 3-D tensor made up of a range-Doppler frame sequence to describe the gesture signature. However, the majority of these efforts primarily focused on hardware-based signal decoding and gathering. For the subsequent recognition step, these works employ machine learning techniques or CNN models built by simply stacking convolution layers, so their recognition algorithms still have great potential for further development. Furthermore, using a commercial frequency-modulated continuous wave (FMCW) multi-input-multi-output (MIMO) millimeter-wave radar, Xia et al. [25] developed a lightweight multi-channel convolutional neural network to represent and learn the range-Doppler (RD). Additionally, RD is a multidimensional angular property of the learned gesture. Li et al. [26] proposed a unique approach for the identification of sign language and hand gestures based on the cumulative distribution density feature retrieved from the spectrogram of UWB radar signals. In order to achieve hand gesture recognition (HGR), Wang et al. [24] introduced a novel model named CMFF-HGR, which fuses multidimensional features of gesture signals.
Franceschini et al. [27] proposed an ultrasound system for person identification that exploits hand gestures. The system works as a sonar, measuring the ultrasonic pressure waves scattered by the subject's hand and analyzing their Doppler information.
However, these methods must carefully take into account multispatial aspects to avoid any loss of information. Additionally, the present deep learning-based algorithms have a drawback: the model parameters are excessively large, resulting in high time consumption and poor recognition accuracy.

III. UWB-GESTURES DATASET

A. Introduction of Dataset
The data used in this article are mainly derived from a public UWB-gestures dataset [1]. To construct the UWB-Gestures dataset, the authors used three UWB radars with eight participants to capture 12 different hand gestures. Each participant was asked to perform 12 gestures, with each gesture repeated 30 times. Each kind of gesture has 2400 pieces of data. During the capture process, participants stood in an area 1.5 m away from the three UWB radars facing them.

B. Data Preprocessing
In the UWB-Gestures dataset, the authors conducted preprocessing to mitigate the impact of noise and clutter. Specifically, they used a method based on loopback filtering to estimate and remove clutter. This method estimates the current clutter term from the previously estimated clutter and the currently received radar signal, and uses a weighting factor to control the learning rate of the filter. The authors applied this method to the raw data, resulting in clutter-removed data.
To input the UWB-Gestures dataset into our UWB gesture recognition method, we record the UWB signals of each gesture using MATLAB software and store them as MAT files. After that, we convert these MAT files into CSV format and transform the raw radar data matrix into images.
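The clutter-removal step described above can be sketched as a simple recursive filter. This is only an illustration of the general idea, not the dataset authors' code: the function name `remove_clutter`, the default weighting factor `alpha=0.9`, and the choice to initialise the clutter estimate with the first frame are all assumptions made for the example.

```python
def remove_clutter(frames, alpha=0.9):
    """Subtract a running clutter estimate from each radar frame.

    frames: list of frames, each a list of range-bin samples (floats).
    alpha:  weighting factor; values near 1 make the clutter estimate adapt slowly.
    """
    if not frames:
        return []
    clutter = list(frames[0])  # assumed initialisation: first frame is pure clutter
    cleaned = []
    for frame in frames:
        # New clutter = weighted mix of the previous clutter and the current frame.
        clutter = [alpha * c + (1.0 - alpha) * r for c, r in zip(clutter, frame)]
        # Clutter-removed signal = received signal minus the estimated clutter.
        cleaned.append([r - c for r, c in zip(frame, clutter)])
    return cleaned


static_scene = [[1.0, 2.0, 3.0]] * 5
moving_hand = [[1.0, 2.0, 3.0]] * 4 + [[1.0, 6.0, 3.0]]
print(remove_clutter(static_scene)[-1])  # static reflections cancel out
print(remove_clutter(moving_hand)[-1])   # only the changed range bin stands out
```

A larger `alpha` keeps more of the old estimate, so slow background changes are absorbed into the clutter while fast hand motion survives the subtraction.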

IV. METHODOLOGY
Gestures have distinctive features, as their period is short and their spectrum is noise-sensitive. To enhance the gesture recognition results, selecting strongly discriminant features is the key procedure. Besides, as a result of advancements in GPU hardware, our focus has switched from a manually constructed architecture to an architecture that adaptively conducts a systematic search. Taking these qualities into account, we design a lightweight architecture named Gesture-ProxylessNAS, which is shown in Fig. 1. In the proposed GPNAS, we use the local binary pattern encoders to extract discriminant spatial knowledge and suppress the noise. In the Gesture-ProxylessNAS module, we reexamine the design spaces of the multilayer ShuffleNet and combine it with a proxyless neural architecture search unit. This unit replaces the standard convolution and dramatically reduces the parameters and computational complexity of the network. The overall network structure is shown in Fig. 1.

A. Local Binary Encoder
By carefully comparing the features of different classes of images in the UWB-gestures dataset, we find that the significant differences between classes of images are mainly due to the different detected gesture movement trends. Namely, the textural features of different classes of images are very distinguishable. Instead of the standard input of the gesture features, we utilize the local binary encoder of the gesture image through a random binary convolution layer to extract the textural feature of the detected images.
Local binary pattern (LBP) is a simple yet very powerful hand-designed descriptor used to describe the regional textural feature of an image [28]. The traditional LBP operator operates on image patches of different sizes, such as 3 × 3, 5 × 5, etc. Each operation takes a pixel as the central point and compares the pixel value of the center point with the pixel values of its surrounding neighbors; each neighbor whose pixel value is not lower than that of the center is assigned a value of 1, and every other neighbor is assigned 0. The neighbor assignments are then read in a fixed order and mapped to a decimal number, which serves as the eigenvalue of the center pixel. This process can be expressed by the following formula [28]:

$$\mathrm{LBP}(x_c, y_c) = \sum_{n=0}^{N-1} s(P_n - P_c)\, 2^{n}, \qquad s(z) = \begin{cases} 1, & z \ge 0 \\ 0, & z < 0 \end{cases} \tag{1}$$

where $(x_c, y_c)$ denotes the coordinates of the center pixel, $P_c$ denotes the pixel value of the center pixel, $P_n$ denotes the pixel value of the $n$th neighboring pixel, and $N$ is the number of neighbors (8 for a 3 × 3 patch).
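As a minimal sketch of the traditional LBP operator described above, the following pure-Python example computes the code of the center pixel of a 3 × 3 patch. The clockwise reading order starting at the top-left corner and the convention that a neighbor equal to the center counts as 1 are assumptions; the helper names `lbp_code` and `lbp_image` are hypothetical.

```python
def lbp_code(patch):
    """Return the LBP code of the centre pixel of a 3x3 patch (list of 3 rows)."""
    center = patch[1][1]
    # Neighbours read clockwise from the top-left corner (an assumed fixed order).
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for n, (r, c) in enumerate(order):
        bit = 1 if patch[r][c] >= center else 0  # threshold against the centre
        code += bit << n                          # weight the n-th bit by 2^n
    return code


def lbp_image(image):
    """Apply lbp_code to every interior pixel of a 2-D list of intensities."""
    h, w = len(image), len(image[0])
    return [[lbp_code([row[x - 1:x + 2] for row in image[y - 1:y + 2]])
             for x in range(1, w - 1)]
            for y in range(1, h - 1)]


patch = [[9, 1, 9],
         [1, 5, 1],
         [9, 1, 9]]
print(lbp_code(patch))  # -> 85 (bits 0, 2, 4, 6 set: the four bright corners)
```

Because the code depends only on the sign of each neighbor-minus-center difference, it is unchanged by any monotonic brightness shift of the patch, which is one reason LBP is a robust texture descriptor.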
To get the local binary encoder of the gesture image, we utilize the random binary convolution layer [29]. This method combines the convolution operation with the extraction of LBP features, which means that the extraction of LBP features is realized with convolution operations. In this method, eight sparse convolution kernels are used to convolve the input image separately to obtain eight difference maps, and these difference maps are then passed through a nonlinear activation function to generate eight bit maps. To implement backpropagation with the random binary convolution layer, we can use the sigmoid function or the ReLU function as the nonlinear activation function. Finally, a linear combination with learnable weights produces the response output of the random binary convolution layer. It is expressed as follows [28]:

$$X_{l+1}^{t} = \sum_{i} \sigma\!\left(\sum_{s} b_{i}^{s} * X_{l}^{s}\right) V_{l,i}^{t}$$

where $X_{l}^{s}$ denotes the input image of the convolution layer, $X_{l+1}^{t}$ denotes the output image of the convolution layer, $b_{i}^{s}$ are the sparse convolutional kernels, $\sigma$ is a nonlinear activation function (e.g., sigmoid or ReLU), and $V_{l,i}^{t}$ are the learnable weights of the linear combination.
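The random binary convolution layer can be illustrated with a small pure-Python sketch for a single input channel and a single output channel. It is a simplification under stated assumptions: the kernels are drawn once with entries in {-1, 0, +1} and never trained, the sparsity level of 0.5 is arbitrary, ReLU is used as the activation, and the constant `weights` list merely stands in for the learnable combination weights. All function names are hypothetical.

```python
import random


def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def random_sparse_kernel(rng, size=3, sparsity=0.5):
    """Fixed kernel with entries in {-1, 0, +1}; it is never trained."""
    return [[rng.choice([-1, 1]) if rng.random() < sparsity else 0
             for _ in range(size)]
            for _ in range(size)]


def relu(v):
    return v if v > 0 else 0.0


def lbc_layer(image, kernels, weights):
    """Weighted sum of ReLU-activated difference maps for one output channel."""
    maps = [conv2d_valid(image, k) for k in kernels]   # one difference map per kernel
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(wt * relu(m[y][x]) for wt, m in zip(weights, maps))
             for x in range(w)]
            for y in range(h)]


rng = random.Random(0)
kernels = [random_sparse_kernel(rng) for _ in range(8)]  # eight fixed sparse kernels
weights = [1.0 / 8] * 8                                   # stand-in for learnable weights
image = [[float((x * y) % 7) for x in range(6)] for y in range(6)]
out = lbc_layer(image, kernels, weights)
print(len(out), len(out[0]))  # spatial size of the output map
```

In this arrangement only the combination weights would need gradients, which is why the layer has far fewer trainable parameters than a standard convolution with the same number of filters.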
In detail, we first use two random binary convolution layers to extract the LBP features of the hand-gesture image. At the same time, we use a structure similar to residual connections to extract features from gesture images of varying classes and dimensions. As shown in Fig. 2, the radar data in the form of images are fed into the convolutional neural network, which has one ordinary convolution layer and two local binary coded convolution layers. The features of the two output parts are then combined.

B. Gesture-ProxylessNAS Module
After applying the local binary encoder to the UWB-gestures dataset, we introduce the Gesture-ProxylessNAS module. We first utilize the multilayer ShuffleNet with depthwise separable convolution. To improve the network structure further, we apply the Gesture-ProxylessNAS module to the multilayer ShuffleNet module. The module can directly learn architectures with lower memory consumption and facilitates usage on target hardware platforms. The overall structure of the Gesture-ProxylessNAS module is shown in Fig. 3. Before the backbone network, we use the residual structure of the random binary convolutions as the feature extraction module. After the backbone network, we add depthwise separable convolution layers to further enhance the representation capability of the network.
1) Multilayer ShuffleNet With Depthwise Separable Convolution: ShuffleNet v1 utilizes group convolutions on 1 × 1 convolution (also called pointwise convolution in [18]) rather than 3 × 3 convolution, which is conducive to improving representation capability and reducing computational cost simultaneously [16]. In addition, it employs a channel shuffle operation to solve the problem of blocked information flow between channel groups.
Motivated by ShuffleNet, we adopt the ShuffleNet unit [16] as our network's backbone unit. It is a residual block that applies pointwise group convolutions, 3 × 3 depthwise convolution, and a channel shuffle operation in its residual branch. Using the pointwise group convolution and channel shuffle operation of a ShuffleNet unit reduces the channel numbers and the computational cost. After the channel shuffle, which lets information be exchanged between channel groups, the 3 × 3 depthwise convolution is applied at low computational cost, and the second pointwise group convolution is used to restore the original channel dimension.
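The channel shuffle operation mentioned above can be sketched on a plain list of channels. This is a minimal illustration of the reshape-transpose-flatten idea from ShuffleNet, operating on channel labels rather than real feature maps; the function name `channel_shuffle` is hypothetical.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so each group feeds every group of the next layer.

    channels: list of per-channel items (feature maps, or just labels here).
    groups:   number of groups used by the preceding group convolution.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = n // groups
    # View the list as a (groups x per_group) grid, transpose it, then flatten.
    grid = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    return [grid[g][k] for k in range(per_group) for g in range(groups)]


print(channel_shuffle(["a0", "a1", "a2", "b0", "b1", "b2"], groups=2))
# -> ['a0', 'b0', 'a1', 'b1', 'a2', 'b2']
```

After the shuffle, each contiguous block of channels contains members of every original group, so the next group convolution sees information from all groups.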
Four layers of depthwise separable convolution are added to improve the representation capability of the structure described above. The differences between these four layers of depthwise separable convolution are mainly in the input and output dimensions, and in whether SE blocks are used. Compared with ordinary convolution, depthwise separable convolution is divided into two steps, depthwise convolution and pointwise convolution, which makes the number of parameters and the operation cost relatively low. Intuitively, this involves splitting the ordinary convolution into two convolutions. The first convolution of (h, w, input_channel) is used to reduce the model parameters, and the second convolution of (h, w, output_channel) is used to increase the model's channel depth. The whole structure of multilayer ShuffleNet with depthwise separable convolution is shown in Fig. 4.
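The parameter saving of depthwise separable convolution can be checked with a short calculation. The sketch below counts weights only (no bias), and the layer sizes are illustrative values chosen for the example, not dimensions taken from the paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """One k x k depthwise filter per input channel plus a 1 x 1 pointwise mix."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


k, c_in, c_out = 3, 64, 128   # illustrative sizes, not taken from the paper
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(sep / std, 3))  # -> 73728 8768 0.119
```

The ratio equals 1/c_out + 1/k², so with 3 × 3 kernels the separable form needs roughly one-ninth of the weights once the output width is large.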
2) Gesture-ProxylessNAS Training Approach: To optimize the efficiency of the designed neural network structure, we utilize ProxylessNAS to further design effective neural network structures of multilayer ShuffleNet with depthwise separable convolution.
Neural architecture search (NAS) is an optimization algorithm for neural networks that seeks to optimize both the network's structural parameters and weight parameters [31]. Structural parameters determine the neural network's architecture, while weight parameters determine the contribution of each structure to the output. NAS needs to filter out a relatively efficient set of parameters from a certain number of structural parameters and weight parameters through a series of selections and optimizations so that the neural network can perform better.
The NAS algorithm begins by constructing a reasonable search space composed of all structural parameters and weight parameters, forming a parameterized neural network architecture. The design of the initialized search space is essential. Too large a search space will complicate the subsequent optimization process, leading to high NAS training costs or even convergence issues. On the other hand, if the search space is too small, it may contain no optimal set of parameters, and the network architecture will not improve significantly even after NAS training. The initialized search space must be optimized and evaluated by testing the different groups of parameters against a suitable dataset. After a certain number of searches, theoretically, we can obtain a better set of parameters. However, naively traversing the search space causes GPU memory explosion. Therefore, algorithms such as reinforcement learning [30], [31], evolutionary learning [32], and Bayesian optimization [33] can be employed to accelerate the optimization process. Nevertheless, there is still room for optimization and improvement in the NAS search described above. Unlike traditional NAS algorithms [34], [35], which incur prohibitive computational requirements when searching directly on large-scale tasks, ProxylessNAS [21] addresses the issues of high memory consumption and computational cost, enabling it to directly learn architectures for large-scale image tasks.
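To see why naive traversal of the search space is impractical, it helps to count the candidate architectures. The sketch below assumes each searchable layer independently picks one candidate operation; the numbers (6 candidates, 10 layers) are hypothetical and chosen only for illustration.

```python
from math import prod


def search_space_size(choices_per_layer):
    """Number of distinct architectures when each layer picks one candidate."""
    return prod(choices_per_layer)


# Hypothetical setting: 6 candidate operations in each of 10 searchable layers.
print(search_space_size([6] * 10))  # -> 60466176
```

Even this modest setting yields tens of millions of architectures, which is why the search is guided by learning-based strategies rather than exhaustive evaluation.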
To enable a direct tradeoff between width and depth, we initiate an overparameterized network and allow ProxylessNAS to automatically search for a more optimal network structure and the relevant parameters after training. The visualization of the Gesture-ProxylessNAS module process is shown in Fig. 3.
We define the NAS training process as a path optimization selection problem. Each set of structural parameters and weight parameters represents a path in the search space, and the goal of NAS is to identify the optimal path. As shown in Fig. 3, for each convolution layer, the structure parameters α represent the candidate structures in that convolution layer (each circle node in the figure); these structures include 3 × 3 convolution, 5 × 5 convolution, pooling, etc. The weight parameters β represent the weight coefficients of the operations in each convolution layer.
However, finding the optimal path among the N candidate paths of the initialized overparameterized network is complex: most NAS methods achieve higher performance only at the price of enormous training time, so this complexity needs to be lowered.
To reduce memory usage and computational costs, we break down the problem of choosing a path from N candidate paths into multiple binary choice problems. Thus, we can transform the real-valued path into binary gates

$$g = \mathrm{binarize}(p_1, \ldots, p_N) = \begin{cases} [1, 0, \ldots, 0] & \text{with probability } p_1 \\ \quad\vdots \\ [0, 0, \ldots, 1] & \text{with probability } p_N. \end{cases}$$
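The binary-gate idea can be sketched as sampling a one-hot vector and evaluating only the selected path. The example assumes, as in ProxylessNAS, that the path probabilities come from a softmax over the structure parameters; the candidate operations are trivial stand-ins for convolution or pooling layers, and all function names are hypothetical.

```python
import math
import random


def softmax(alphas):
    """Turn structure parameters into path probabilities."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def sample_binary_gate(probs, rng):
    """Return a one-hot gate g with g[i] = 1 with probability probs[i]."""
    active = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1 if i == active else 0 for i in range(len(probs))]


def mixed_output(x, candidate_ops, gate):
    """Evaluate only the sampled path, so only one op's activations are kept."""
    active = gate.index(1)
    return candidate_ops[active](x)


rng = random.Random(0)
ops = [lambda x: x, lambda x: 2 * x, lambda x: x * x]  # stand-ins for conv/pool candidates
probs = softmax([0.0, 0.0, 0.0])
gate = sample_binary_gate(probs, rng)
print(gate, mixed_output(3.0, ops, gate))
```

Because exactly one gate entry is 1, memory use during a training step matches that of a single-path network regardless of how many candidates the layer offers.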
Based on the binary gate $g$, the output of the parameter training process is given as

$$m_{\mathcal{O}}^{\mathrm{binary}}(x) = \sum_{i=1}^{N} g_i\, o_i(x) = \begin{cases} o_1(x) & \text{with probability } p_1 \\ \quad\vdots \\ o_N(x) & \text{with probability } p_N \end{cases}$$

where $o_i$ denotes the $i$th candidate operation (path) applied to the input $x$.

As illustrated in [31] and Fig. 3, the path selection problem is defined as a binary selection problem to reduce the algorithmic complexity of network optimization. This approach ensures that only one of the paths is active during the parameter update, which simplifies the optimization process. The optimization of the overparameterized network mainly involves two training processes, one for the weight parameters and one for the structure parameters. In each path optimization step, we first freeze the structure parameters and sample binary gates as defined above for each batch of input data. The weight parameters of the active path can then be updated directly by the gradient descent method. We further use gradient-based optimization to learn the structure parameters. The specific derivation of the gradient-based optimization method [31] is as follows. The path probabilities are obtained from the structure parameters by a softmax,

$$p_i = \frac{\exp(\alpha_i)}{\sum_{j=1}^{N} \exp(\alpha_j)},$$

so that, by the chain rule,

$$\frac{\partial L}{\partial \alpha_i} = \sum_{j=1}^{N} \frac{\partial L}{\partial p_j} \frac{\partial p_j}{\partial \alpha_i}.$$

The gradient w.r.t. the structure parameters can then be approximately estimated by using $\partial L/\partial g_j$ in place of $\partial L/\partial p_j$:

$$\frac{\partial L}{\partial \alpha_i} \approx \sum_{j=1}^{N} \frac{\partial L}{\partial g_j} \frac{\partial p_j}{\partial \alpha_i} = \sum_{j=1}^{N} \frac{\partial L}{\partial g_j}\, p_j \left(\delta_{ij} - p_i\right)$$

where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise.

Simultaneously, during each update step, one sampled path is enhanced (its path weight increases) and another sampled path is attenuated (its path weight decreases), while all other paths remain unchanged. After each update, the algorithm prunes the paths, keeping the paths with good performance and cutting out the ones with poor performance.

TABLE I PERFORMANCE ON THE UWB-GESTURES PUBLIC DATASET

We optimize the network structure through the continuous update of the weight parameters and structure parameters. This approach does not increase computational complexity. Thereby, the memory requirement is reduced to the same level as that of training a compact model.
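The structure-parameter update can be sketched numerically. The example below follows the ProxylessNAS-style estimate, in which the gradient with respect to each structure parameter is approximated by summing dL/dg_j · p_j · (δ_ij − p_i) over all paths; the values of dL/dg_j, the learning rate of 0.1, and the function names are made up for illustration.

```python
import math


def softmax(alphas):
    """Path probabilities from structure parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def structure_gradient(alphas, grad_wrt_gates):
    """Approximate dL/d(alpha_i) as sum_j dL/dg_j * p_j * (delta_ij - p_i)."""
    p = softmax(alphas)
    n = len(alphas)
    return [sum(grad_wrt_gates[j] * p[j] * ((1.0 if i == j else 0.0) - p[i])
                for j in range(n))
            for i in range(n)]


alphas = [0.2, -0.1, 0.5]
grad_g = [0.3, -0.7, 0.1]        # made-up values of dL/dg_j for illustration
grad_alpha = structure_gradient(alphas, grad_g)
updated = [a - 0.1 * g for a, g in zip(alphas, grad_alpha)]  # one descent step
print(grad_alpha)
print(updated)
```

The components of the estimated gradient always sum to zero, so a descent step raises the logits of some paths exactly as much as it lowers others, which matches the "one path enhanced, another attenuated" behavior described above.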

V. EXPERIMENTS AND RESULTS
We mainly evaluate the proposed recognition method of hand gestures on the UWB-gestures public dataset [1]. We follow most of the standard training settings and hyperparameters.

A. Settings
The UWB-gestures public dataset consists of 12 gesture classes with 28 K training images [1]. We randomly sample 20% of the images from the training set as a validation set, with updates performed using the Adam optimizer with an initial learning rate of 0.01. The overparameterized network is trained on the remaining training images with batch size 128. We train the model for 300 epochs in each part of the experiment. The experimental equipment is an NVIDIA GeForce RTX 2080 Ti with 1 GPU (12 GB) and 4 CPUs. The input image size is 224 × 224. The batch size is 32 for GPU. The FLOPs and FPS figures report the complexity at 224 × 224 input size.

1) Experiment 1: The UWB-gestures public dataset [1] is taken into consideration for evaluation in this experiment. The quantitative comparison of the proposed method with cutting-edge UWB gesture classification methods is shown in Table I. With comparable numbers of FLOPs, parameters, and FPS as MobileNeXt, our strategy increases the recognition accuracy by roughly 4%.
2) Experiment 2: For a comparable accuracy level (about 96%), GhostNet and EfficientNet must incur significant computing expenses. Specifically, GhostNet's FLOPs and parameter volume are 1.3 times higher than ours, and EfficientNet's FLOPs and parameters are more than four times ours. More importantly, our method achieves significantly higher resource efficiency, with an FPS 1.5 times higher than that of GhostNet and 10 times higher than that of EfficientNet.
To evaluate the performance of the proposed method, the average recognition confusion matrix based on GPNAS is shown in Table II. The first column of each row represents the original class, while the first row represents the predicted class of gestures. The diagonal values indicate the correct recognition rate of each kind of gesture, while the off-diagonal values indicate the misclassification rate. For example, the value at (i, j) is the probability that an image of the ith kind of gesture is misclassified as the jth kind of gesture. The confusion matrix shows that this model has a relatively balanced recognition accuracy across the gesture classes (basically reaching 95%), and there is no problem with unbalanced class recognition. GPNAS therefore not only achieves good recognition performance but also exhibits strong robustness.
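Reading a confusion matrix in the way described above amounts to row-normalising a matrix of counts. The following sketch uses a toy 3-class matrix whose counts are invented for illustration; they are not results from Table II, and the function names are hypothetical.

```python
def per_class_recognition_rate(confusion):
    """Row-normalise a count matrix: entry (i, j) becomes P(predicted j | true i)."""
    rates = []
    for row in confusion:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return rates


def overall_accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)


# Toy 3-class counts, invented for illustration (not results from Table II).
counts = [[48, 1, 1],
          [2, 47, 1],
          [0, 3, 47]]
rates = per_class_recognition_rate(counts)
print([round(rates[i][i], 2) for i in range(3)])  # -> [0.96, 0.94, 0.94]
print(round(overall_accuracy(counts), 3))         # -> 0.947
```

Comparing the diagonal entries across rows is a quick check for the class imbalance issue mentioned in the text: a balanced recognizer has diagonal rates that are close to one another.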
To further investigate the rationality of the network framework proposed in this article and the efficiency of the Gesture-ProxylessNAS module, we conducted two series of verification experiments. The specific experimental settings are as follows.
First, we only use the method's backbone network (combining the local binary encoder with multilayer ShuffleNet with depthwise separable convolution), without the Gesture-ProxylessNAS module. We conduct three experiments in which the convolution kernel size in the Shuffle module is 3 × 3 (the original size), 5 × 5, and 7 × 7. Subsequently, the Gesture-ProxylessNAS module is added to these three experiments to obtain six experimental results, as illustrated in Fig. 5(a).
Following the above experimental method, we carried out several more experiments. The specific experimental settings are as follows. First, we use the backbone network of the method, without the Gesture-ProxylessNAS module. Experiments are conducted with the number of depthwise separable convolution layers varied from 2 to 6. Subsequently, the Gesture-ProxylessNAS module is added to each of these experiments, and the results are shown in Fig. 5(b).
Setting the convolution kernel sizes of ShuffleNet to 3 × 3, 5 × 5, and 7 × 7, respectively, yields the results shown in Fig. 5(a). Among them, the experimental results obtained with the 3 × 3 convolution kernel are the best (both for the backbone alone and with the Gesture-ProxylessNAS module added). The accuracy obtained with the 5 × 5 kernel ranks second, and that obtained with the 7 × 7 kernel ranks last. This demonstrates that too large a convolution kernel is not conducive to gesture recognition on gesture images. The horizontal comparison in the figure makes it clear that the Gesture-ProxylessNAS module performs exceptionally well. Whether the convolution kernel is 3 × 3, 5 × 5, or 7 × 7, the model's accuracy is unchanged after adding the Gesture-ProxylessNAS module, and its parameter amount is significantly reduced.
Similarly, setting the number of depthwise separable convolution layers to 2, 3, 4, 5, and 6 yields the results shown in Fig. 5(b). The chart shows that as the number of depthwise separable convolution layers rises, so do the number of parameters and the recognition accuracy. However, when the number of convolution layers is increased to 5 and 6, the model's price/performance ratio (the ratio of recognition accuracy to parameter quantity) is much smaller than that of 2, 3, and 4 layers. From this perspective, the model performs at its best when using four convolution layers. This also verifies the rationality and high performance of the method proposed in this article.
In the horizontal comparison, it is again clear from the figure that the performance of the Gesture-ProxylessNAS module is excellent. Regardless of the number of depthwise separable convolution layers (ranging from 2 to 6), the module consistently maintains the model's precision while significantly reducing the parameter count.

C. Ablation Study
The proposed hand gesture recognition method consists of three parts: LBP feature extraction, recognition networks, and the Gesture-ProxylessNAS training method. In this section, we evaluate each of them. We perform our ablation experiments on the UWB-gestures public dataset [1]. As shown in Table III, relative to the backbone, introducing depthwise separable convolution layers substantially improves the gesture recognition accuracy of the network (by around 3%) but also introduces many parameters and FLOPs.

VI. DISCUSSION
UWB gesture recognition is often used in scenarios with high real-time requirements and weak hardware performance, such as in-vehicle interaction and smart homes. Despite this, the existing literature still primarily uses recognition accuracy as the sole metric for evaluating algorithm performance. To address this, we propose combining recognition accuracy with additional metrics, namely parameter quantity, FLOPs, and FPS, to measure model lightness and real-time performance. Based on these metrics, we propose a lightweight network for mid-air gesture recognition based on UWB radar. Built on the lightweight multilayer ShuffleNet architecture with depthwise separable convolution, it employs LBP encoders and adopts a model compression method named Gesture-ProxylessNAS, which improves the accuracy and efficiency of gesture recognition. Our experiments demonstrate that our method outperforms previous gesture recognition algorithms [1] and lightweight neural networks [17], [18], [19], [20], [36], [37]. Although this method currently outperforms the other methods in our experiments, it still has limitations. First, owing to the limited availability of datasets, its performance has been verified on only one open dataset [1]. Second, after the Gesture-ProxylessNAS module is introduced, the recognition accuracy of this method still experiences a small reduction. Thus, future work on UWB gesture data collection and model compression methods may offer further improvement.
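For the real-time metric, a minimal timing harness can make the FPS definition concrete. The sketch below is an assumption about how FPS might be measured (frames processed divided by wall-clock time, after a short warm-up); it is not the measurement protocol used in this article. The name `measure_fps` and the stand-in `infer` callable are hypothetical.

```python
import time
from typing import Any, Callable, Sequence


def measure_fps(infer: Callable[[Any], Any],
                frames: Sequence[Any],
                warmup: int = 5) -> float:
    """Return frames processed per second by `infer` over `frames`.

    A few warm-up calls run first so that one-time setup costs
    (caches, lazy initialization) do not distort the timing.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    for frame in frames[:warmup]:
        infer(frame)

    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start

    # Guard against a zero reading on very coarse clocks.
    return len(frames) / max(elapsed, 1e-12)


if __name__ == "__main__":
    # Stand-in for a model forward pass: sums a dummy 1-D "frame".
    dummy_frames = [list(range(1000)) for _ in range(200)]
    print(f"{measure_fps(sum, dummy_frames):.1f} frames per second")
```

In a real evaluation the `infer` callable would wrap the trained network's forward pass on the target CPU device, and the reported number would depend on that hardware.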

VII. CONCLUSION
This article proposes a lightweight deep-learning network framework for UWB gesture recognition named GPNAS. By combining the advantages of multilayer ShuffleNet and the Gesture-ProxylessNAS module, GPNAS learns spatial information through a lightweight structure. To improve the embedding of the textural features of the UWB image, spatial features are encoded with the LBP module before multiscale features are extracted with ShuffleNet. To make full use of the feature information across different layers and enhance recognition performance, multilayer ShuffleNet is used in the recognition stage to extract the strong complementary information between the various hierarchical structures. Additionally, the Gesture-ProxylessNAS module reexamines the design spaces of the multilayer ShuffleNet as an optimization problem, which dramatically reduces the parameters and computational complexity of the network. Overall, the proposed GPNAS framework is an efficient, lightweight deep network architecture that provides satisfactory recognition accuracy and performance with fewer parameters. The recognition results on the real UWB-gestures dataset confirm the feature generalization ability of the proposed GPNAS. Future studies will focus on improving network recognition performance by creating a lighter, more effective network framework using an automatic architecture search algorithm.

Fig. 1. Structure of our gesture recognition method. The method is divided into three main parts: a public UWB-gestures dataset (the source of the dataset), data preprocessing, and GPNAS, the most crucial gesture recognition network. The LBP feature extraction module and the Gesture-ProxylessNAS Module are the two main components of GPNAS. Additionally, the Gesture-ProxylessNAS training approach updates the Gesture-ProxylessNAS Module's parameters.

Fig. 3. Structure of the Gesture-ProxylessNAS module. The upper part is the network structure of the Gesture-ProxylessNAS module. The bottom part is a schematic diagram of the Gesture-ProxylessNAS training approach in the Gesture-ProxylessNAS module.

Fig. 4. Overall structure of multilayer ShuffleNet with depthwise separable convolution. The network mainly consists of three shuffle modules and one depthwise module (DW module). The shuffle modules differ only in the size of the convolution kernel. The depthwise module contains two depthwise separable convolution layers (DS Conv) and depthwise separable convolution layers with a squeeze-and-excitation module (DS Conv + SE).

Fig. 5. Visualization of comparative experiment results under different experimental settings.

TABLE II AVERAGE RECOGNITION CONFUSION MATRIX OF 12 GESTURES BASED ON GPNAS

TABLE III PERFORMANCE OF ABLATION STUDY ON UWB-GESTURES PUBLIC DATASET. MOST OF THE SETTINGS FOR THIS EXPERIMENT ARE THE SAME AS THOSE DESCRIBED ABOVE