An Ultra-Efficient Approach for High-Resolution MIMO Radar Imaging of Human Hand Poses

The capturing of hands, including their poses, shapes and motions, has numerous potential applications, such as human-machine interfaces and medical use cases. However, in the radar context, most existing methods only allow for the recognition of dynamic hand gestures based on Doppler evaluations due to the respective systems’ limited lateral resolution. Radar-based high-resolution three-dimensional (3D) imaging using multiple-input multiple-output (MIMO) radars is currently the state-of-the-art in personnel security scanning. However, the associated imaging techniques suffer from computationally burdensome reconstruction algorithms that sample the entire 3D space of interest, thereby making them less suitable for real-time applications. Moreover, their application in hand motion tracking scenarios is limited by low frame rates that result from a high number of transmit frequencies. Hence, we present an efficient and powerful approach for the radar-based 3D reconstruction of hand poses. The method extends the frequency shift keying continuous wave radar principle and reconstructs the hand surface using only two carrier frequencies. Instead of reconstructing an entire 3D volume, only two single-tone radar images are computed. Depth information is derived from phase differences between corresponding pixels in the images. The approach significantly reduces computational load by three orders of magnitude compared with the state-of-the-art and enables higher frame rates. Within this paper, this novel reconstruction principle is analyzed and compared to a state-of-the-art radar imaging approach using a MIMO radar system with 94 transmitting and 94 receiving antennas. Detailed simulations of point targets and comprehensive measurements demonstrate the excellent imaging performance of our approach.

Human-computer interaction is becoming increasingly important [1], [2]. In this regard, two popular research fields have emerged [3]. The first focuses on gesture classification based on the evaluation and extraction of high-level abstract motion or pose information. The second builds on precise hand pose estimation, i.e., it aims at capturing the correct three-dimensional (3D) information of hand poses and motions. This is also of great interest in medical contexts, as the analysis of hand function can provide information regarding the current health of patients [4], [5], [6].
In the computer vision domain, markerless methods based on RGB cameras, depth cameras, or RGB-D cameras (i.e., a combination of the two) are used for exact hand pose estimation as well as static and dynamic automated gesture recognition [7], [8], [9]. Computer vision algorithms have a rich history and have been extensively researched. However, the performance of optical sensors depends strongly on the lighting conditions of the illuminated scene [10].
For this reason, radar-based methods for capturing hand poses or hand motion offer a promising alternative, as they do not suffer from this disadvantage on account of the wavelengths used; therefore, such methods enable a robust and precise measurement of distances. Furthermore, unlike computer vision approaches, they are able to directly measure motion by analyzing the Doppler effect, which is beneficial when it comes to the evaluation of dynamic features. The majority of the work published in the radar context focuses on dynamic gesture classification. An overview of the state of the art of hand gesture recognition approaches using radar sensors can be found in [11]. In this context, Google launched a project called Soli, which utilizes a frequency-modulated continuous wave (FMCW) radar in combination with machine learning algorithms for robust dynamic hand gesture recognition and gesture tracking with sub-millimeter accuracy [12]. Further research can be found in [13], [14], [15], [16], [17], [18], and [19], in which micro-Doppler signatures or range-Doppler maps are used to enable automatic hand gesture recognition in combination with machine learning approaches. In [20], multiple scattering points of the hand are extracted from the measurement data of an FMCW multiple-input multiple-output (MIMO) radar, which are subsequently analyzed in the spectrum, spatial, and time domains. In [21], an interferometric continuous wave (CW) radar is used to measure the angular velocity of dynamic hand gestures. What all these methods have in common is that the radar used offers only low lateral resolution, which is why they focus on dynamic rather than static features. In [22], a convolutional neural network is trained and applied to enable improved static hand gesture recognition based on radar data. The data acquisition is done using a mechanical MIMO FMCW radar scanner consisting of two transmitting (Tx) and receiving (Rx) antennas, which enables a multiperspective view and the generation of sufficient training samples. However, in this work as well as in the previously mentioned articles, the authors do not aim for precise 3D imaging of different hand poses, as is the case in most computer vision approaches.
High-resolution 3D imaging of the human body has been the state of the art for personnel security screening for many years, as radar waves in the millimeter-wave (mmWave) range can penetrate clothes and, therefore, enable the detection of forbidden items, such as weapons [23], [24], [25]. To ensure this, the radar systems used require a high 3D resolution, which can be achieved by large antenna apertures, comparably high carrier frequencies, and high signal bandwidths. As radar imaging requires measurements from many positions and viewing angles, MIMO radars are currently considered the state of the art for building adequate imaging systems. However, a high number of antennas and the use of high signal bandwidths lead to demanding hardware requirements [23]. In addition, these radar systems commonly use a backprojection approach to precisely reconstruct a 3D object scene. This procedure is computationally expensive, as it requires, first, subdividing the 3D space into smaller subvolumes (voxels) and, subsequently, reconstructing each of them individually (see Section III). Consequently, traditional imaging radar systems suffer from low measurement rates, computationally burdensome reconstruction algorithms, and high hardware requirements. All these limitations make them unsuitable for use in real-time capable and consumer-friendly gesture sensing systems or hand-tracking applications.
A radar-based method that can be used in static and dynamic hand gesture classification tasks and is also suitable for precise hand pose tracking applications has not been researched yet. All previous studies either lack the ability to precisely reconstruct hand poses, or their computational complexity makes them unattractive for real-time applications. Additionally, the low frame rates associated with the latter present challenges for tasks involving direct motion evaluations. Therefore, we propose a new algorithm and measurement principle that is based on the frequency shift keying (FSK) continuous wave (CW) radar principle [26]. Our approach requires only two closely neighboring frequency steps (2FSK) and, thus, only a small signal bandwidth, thereby reducing hardware requirements. Furthermore, we incorporate a novel 3D reconstruction approach that lowers computational complexity, as it is based on the computation of two single-tone images at an estimated target distance instead of sampling and reconstructing a full 3D space. In addition, measurement acquisition and data transfer times can be significantly decreased, as only two frequency steps are sent out by the antennas. By achieving a significant reduction in measurement time and computational complexity, this novel approach delivers high measurement rates and opens the way toward real-time hand-tracking applications. Furthermore, the precise 3D reconstruction of hands using radar technology allows for the application of thoroughly researched computer vision algorithms.
This manuscript makes the following contribution to extant research. We present the theory underlying our fast, precise, and efficient technique for the 3D reconstruction of the human hand. To evaluate the performance of the novel 2FSK-based imaging principle, we compare its theoretical accuracy, 3D reconstruction results, and computational efficiency to a state-of-the-art radar imaging approach using a stepped frequency continuous wave (SFCW) signal form. Since automatic sign language recognition is an intensely researched application of static hand pose classification [27], [28], [29], we selected three exemplary hand poses of the American Sign Language (ASL) alphabet to compare the 3D imaging results of both approaches. To guarantee a static object scene for the comparison, all hand poses were 3D printed. In addition, we also included measurements of a real human hand to further demonstrate the applicability of the novel approach.
The manuscript is structured as follows: The measurement setup is described in Section II. In Section III and Section IV, we explain the theoretical background of the SFCW approach as well as the theory behind our novel 2FSK-based algorithm. In Section V, we define the signal and reconstruction parameters relevant for our setup. Thereafter, we analyze the performance of the proposed approach in Section VI. In Section VII, we discuss the results, and in Section VIII, we present a summary.

II. MEASUREMENT SETUP
Fig. 1 shows an overview of the experimental setup that we used to capture different hand poses. To accurately position the hand in front of the radar, a Styrodur structure, including a reference coordinate system of the x-y plane drawn on a sheet of paper, was placed on the table. Two absorber walls were placed behind the table. The hand was positioned at a distance of roughly 30 cm. To reliably resolve the fingers of the hand in the lateral direction, an image resolution significantly finer than 1 cm is necessary. Current state-of-the-art 3D radar imaging systems for the human body are typically broadband MIMO systems equipped with a two-dimensional (2D) antenna array. To achieve the same resolution in both lateral dimensions, a square aperture is a viable choice. The MIMO array used for our setup is a submodule of a commercially available automotive radome tester [30] and is depicted in Fig. 2(a). It has 94 Tx and 94 Rx antennas, which are arranged on a square frame, as illustrated in Fig. 2(b). As a result, we acquire a square, uniform, and fully populated virtual array of 8836 antenna pairs. The spacing between adjacent antenna elements is 3 mm, resulting in an array with a physical size of approximately 14 cm × 14 cm. The size of the resulting virtual MIMO aperture is 28 cm × 28 cm (assuming a far-field approximation). Hence, the lateral resolution at a distance of 30 cm is approximately 4 mm, as shown in the next section. The radar signal modulation used is SFCW. The bandwidth and number of frequency steps are configurable within a range from 72 GHz to 82 GHz. Hence, the system is well suited to compare the existing approaches and our novel 2FSK approach. The transmit power per antenna and transmit frequency corresponds to an effective isotropic radiated power of approximately 10 dBm.
Fig. 3. State-of-the-art MIMO millimeter-wave imaging. The 3D shape of an object within the volume of interest $O$, with $\vec{r}_{\tilde{v}}$ pointing to all relevant voxels, is obtained by reconstructing a 3D volume $\hat{O}$ and projecting the voxel containing the highest intensity along the z-axis onto the x-y plane while extracting its z-index (maximum projection).
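The array figures quoted above can be cross-checked with a few lines of arithmetic. The following is only a sketch: the function name is hypothetical, and the lateral resolution uses the common approximation δ ≈ λL/(2D) for equal Tx and Rx apertures of side length D, which is an assumption here and not necessarily the exact expression used in Section III. A carrier of 80 GHz is assumed for the wavelength.

```python
C = 299_792_458.0  # speed of light in m/s


def lateral_resolution(freq_hz, distance_m, aperture_m):
    """Approximate cross-range resolution, ASSUMING delta = lambda * L / (2 * D)."""
    wavelength = C / freq_hz
    return wavelength * distance_m / (2.0 * aperture_m)


n_tx, n_rx = 94, 94
virtual_pairs = n_tx * n_rx  # every Tx-Rx combination forms one virtual element

# Physical aperture D = 0.14 m, hand distance L = 0.30 m (values from Section II)
res_30cm = lateral_resolution(80e9, 0.30, 0.14)
res_50cm = lateral_resolution(80e9, 0.50, 0.14)

print(virtual_pairs, f"{res_30cm * 1e3:.1f} mm", f"{res_50cm * 1e3:.1f} mm")
```

Under this approximation the resolution degrades linearly with distance, which is the reason a fixed aperture limits the usable working range for finger separation.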

III. MILLIMETER-WAVE IMAGING WITH MIMO RADARS
In the following, the state-of-the-art 3D mmWave imaging approach is discussed in more detail using the example of an SFCW signal modulation. The procedure is illustrated in Fig. 3. Let us assume a volume of interest $O$ containing an object whose 3D dimensions are to be determined. This volume is divided into smaller subvolumes, called voxels. Their size is determined by the sampling dimensions of the applied reconstruction. The vector $\vec{r}_{\tilde{v}}$ points to the different voxel positions $O(x, y, z)$. Within this volume, there are $N_v$ voxels located at $\vec{r}_v$ that actually contain a scatterer. The scene is scanned with a MIMO radar consisting of $N_{Tx}$ transmitting and $N_{Rx}$ receiving antennas, sending out $N_f$ frequency steps. To obtain a reconstructed voxel $\hat{O}(\vec{r}_{\tilde{v}})$ within the 3D object scene of interest, the baseband signals $s_B$ need to be correlated with complex weights $w$ to correct the respective signal delays before they are coherently summed up across all transmit-receive combinations [24]. This signal processing algorithm is also known as back propagation, backprojection, or digital beamforming [23]. Mathematically, this procedure can be described by [31]

$$\hat{O}(\vec{r}_{\tilde{v}}) = \sum_{f} \sum_{Tx} \sum_{Rx} s_B(f, \vec{r}_{Tx}, \vec{r}_{Rx}) \, w(f, \vec{r}_{Tx}, \vec{r}_{Rx}, \vec{r}_{\tilde{v}}) \qquad (1)$$

where

$$s_B(f, \vec{r}_{Tx}, \vec{r}_{Rx}) = \sum_{v=1}^{N_v} \exp\left( j \left[ \frac{2\pi f}{c} \left( |\vec{r}_{Tx} - \vec{r}_v| + |\vec{r}_v - \vec{r}_{Rx}| \right) + \varphi_{Rx} \right] \right) \qquad (2)$$

and

$$w(f, \vec{r}_{Tx}, \vec{r}_{Rx}, \vec{r}_{\tilde{v}}) = \exp\left( -j \frac{2\pi f}{c} \left( |\vec{r}_{Tx} - \vec{r}_{\tilde{v}}| + |\vec{r}_{\tilde{v}} - \vec{r}_{Rx}| \right) \right) \qquad (3)$$

All frequency steps that are transmitted by the Tx antennas are represented by $f$, and the speed of light is described by $c$.
The vectors pointing from the origin, i.e., the center of the MIMO array, to any Tx or Rx antenna position are denoted as $\vec{r}_{Tx}$ and $\vec{r}_{Rx}$, respectively. A constant reflection phase shift is described by $\varphi_{Rx}$. By performing the backprojection algorithm for all relevant voxels $\vec{r}_{\tilde{v}}$, a 3D volume describing the object scene is obtained. The resolution of this volume depends on several parameters.
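The backprojection described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the 8 + 8 antenna layout, the 16 frequency steps, the unit-amplitude point scatterer, and the sign convention (positive phase in the simulated signal, negative phase in the weight) are all chosen for the example. It reconstructs only a line of voxels along z in front of the array and locates the peak.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s


def path_lengths(tx, rx, points):
    """Tx -> point -> Rx path length, shape (n_points, n_tx, n_rx)."""
    d_tx = np.linalg.norm(points[:, None, :] - tx[None, :, :], axis=-1)
    d_rx = np.linalg.norm(points[:, None, :] - rx[None, :, :], axis=-1)
    return d_tx[:, :, None] + d_rx[:, None, :]


def simulate_baseband(freqs, tx, rx, scatterers):
    """Baseband samples s_B[f, tx, rx] for ideal unit point scatterers."""
    d = path_lengths(tx, rx, scatterers)                      # (n_v, n_tx, n_rx)
    phase = 2j * np.pi * freqs[:, None, None, None] * d[None] / C
    return np.exp(phase).sum(axis=1)                          # sum over scatterers


def backproject(s_b, freqs, tx, rx, voxels):
    """Coherent sum of delay-compensated samples for every voxel."""
    d = path_lengths(tx, rx, voxels)                          # (n_vox, n_tx, n_rx)
    image = np.zeros(len(voxels), dtype=complex)
    for k, f in enumerate(freqs):                             # loop over frequency steps
        weights = np.exp(-2j * np.pi * f * d / C)
        image += (weights * s_b[k][None]).sum(axis=(1, 2))    # sum over Tx and Rx
    return image


# Toy setup (hypothetical, far smaller than the 94 x 94 array of Section II)
pos = (np.arange(8) - 3.5) * 0.02
tx = np.stack([pos, np.full(8, -0.07), np.zeros(8)], axis=1)  # row of Tx antennas
rx = np.stack([np.full(8, -0.07), pos, np.zeros(8)], axis=1)  # column of Rx antennas
freqs = np.linspace(72e9, 82e9, 16)
target = np.array([[0.0, 0.0, 0.30]])

z_axis = np.linspace(0.26, 0.34, 81)
voxels = np.stack([np.zeros(81), np.zeros(81), z_axis], axis=1)

s_b = simulate_baseband(freqs, tx, rx, target)
profile = np.abs(backproject(s_b, freqs, tx, rx, voxels))
z_peak = z_axis[np.argmax(profile)]
print(f"peak at z = {z_peak:.3f} m")
```

The nested sums over frequencies, Tx antennas, Rx antennas, and voxels are what make the full-volume reconstruction expensive: the cost grows with the product of all four counts.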
According to [24], the spatial resolution in front of the center of a square-shaped MIMO array with equal Tx and Rx apertures and side length $D$ is given by (4)-(6), where $\delta_x$ and $\delta_y$ describe the cross-range (lateral) resolutions and $\delta_z$ denotes the range resolution. The distance between the focused spot and the antenna array is denoted by $L$. The minimum and maximum operating frequency as well as the signal bandwidth are indicated by $f_{min}$, $f_{max}$, and $\Delta f$. As high signal bandwidths lead to high range resolutions, one of the main applications of broadband mmWave imaging is personnel security screening, as this type of 3D imaging is able to resolve clothes, body surfaces, and possibly hidden items underneath the clothes. One disadvantage of this brute-force algorithm is the extremely high computational cost associated with the reconstruction of each voxel within the volume of interest.
If this approach is applied to the imaging of hands, it can be assumed that the strongest reflections are caused by the skin surface, as human skin mostly consists of water and, therefore, strongly reflects millimeter waves [24], [32]. To extract the physical dimensions of the hand from the reconstructed volume $\hat{O}$, a so-called maximum projection needs to be performed [33]. Hence, for each pixel position defined by $x$ and $y$, the voxel $\hat{O}(x, y, z)$ with the strongest intensity along the z-axis within the reconstructed volume is extracted. This voxel can then be used to locate the hand and its surface in 3D space and allows the projection of the 3D volume $\hat{O}(\vec{r}_{\tilde{v}})$ onto a 2D image (see Fig. 3).
This procedure has two main disadvantages. On the one hand, the frame rate depends on the number of transmitted frequencies, as these influence the pure measurement duration and the time required for the data transfer. For the given hardware depicted in Fig. 2, the frame rate for $N_f = 128$ is approximately 70 Hz. To enable precise radar-based tracking of hand motion, frame rates above 1 kHz are desirable; therefore, the frame rate would have to increase by a factor of about 14. On the other hand, brute-force reconstruction of the complete volume is computationally expensive, as all iterations depicted in (1) have to be run for every voxel of interest. This is particularly a problem with regard to real-time applications. However, when it comes to imaging of the hand, reconstructing an entire volume is unnecessary, as only the visible parts of the hand surface create a radar response. Hence, there is no need to resolve multiple targets along the z-axis. Therefore, instead of reconstructing a volume and obtaining the contour of the hand by performing a maximum projection, new approaches should be identified that focus on an efficient extraction of the hand surface from the radar data.
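The maximum projection mentioned above reduces to an argmax along the depth axis. A minimal sketch follows, assuming the reconstructed volume is stored as a complex array indexed as `[x, y, z]`; the function name and the synthetic values are hypothetical.

```python
import numpy as np


def maximum_projection(volume, z_axis):
    """Collapse |volume| along z; return the 2D image and the depth of each maximum."""
    magnitude = np.abs(volume)
    z_index = magnitude.argmax(axis=2)                        # strongest voxel per (x, y)
    image = np.take_along_axis(magnitude, z_index[..., None], axis=2)[..., 0]
    return image, z_axis[z_index]


# Tiny synthetic volume (hypothetical values): weak background plus two strong voxels
z_axis = np.linspace(0.26, 0.34, 81)
volume = np.full((3, 3, 81), 0.1, dtype=complex)
volume[1, 1, 40] = 5.0
volume[0, 2, 10] = 2.0j

image, depth = maximum_projection(volume, z_axis)
print(image[1, 1], depth[1, 1])
```

Note that the projection itself is cheap compared with the reconstruction: the entire volume has to be computed before a single depth value can be read off.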
The aim of this paper is to present a radar measurement concept and image reconstruction approach that drastically reduces the measurement and computational effort while still providing high-quality results of the reconstructed hand shell in the form of an image. The results of this study reveal that the proposed concept achieves an efficient extraction of the body shell and a reduction of the computational complexity by three orders of magnitude compared to classical radar imaging methods.

IV. 2FSK MIMO RADAR IMAGING PRINCIPLE
In this section, we describe our novel signal processing approach that enables efficient radar imaging of the hand. This approach combines the theory of the 2FSK radar concept [26] with the previously described principle of mmWave imaging. Compared to CW radar, 2FSK radar can significantly increase the unambiguous target range measured [34]. This can be achieved by subsequently sending out two neighboring CW frequencies $f_1$ and $f_2$ and evaluating the phase difference $\Delta\varphi = \varphi_2 - \varphi_1$ between both baseband signals. As a result, the distance $d$ to the target can be calculated by [26]

$$d = \frac{c \, \Delta\varphi}{4\pi (f_2 - f_1)} \qquad (7)$$

The maximum unambiguous range is

$$d_{max} = \frac{c}{2 (f_2 - f_1)} \qquad (8)$$

This formula can also be used to calculate the maximum unambiguous range achievable with the SFCW signal form. In this case, $f_2 - f_1 = \Delta f$ describes the step size between two frequencies. When applied to radar imaging, the presented 2FSK approach reconstructs two single-tone images ($\hat{O}^{1/2}_{z_e}$) of one slice of the volume of interest $O$ at an estimated depth $z_e$ and performs the phase evaluation described in (7) for every pixel of that slice. This is an iterative process, as we first approximate the distance to our target and then adjust it based on an initial 2FSK evaluation. Thereafter, a final phase evaluation is performed to estimate the contour of the respective object. Based on this approach, only two frequencies need to be transmitted by every Tx antenna of the MIMO array. The operating principle of this approach is depicted in Fig. 4 and is described in more detail in the following.
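As a small worked example of the 2FSK principle, the sketch below evaluates the standard FSK relations d = cΔφ/(4π(f2 − f1)) and d_max = c/(2(f2 − f1)) for the two tones used later in Section V. The function names are hypothetical, and the phase difference is synthesized for an ideal, noise-free target.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def fsk_distance(delta_phi, f1, f2):
    """One-way target distance from the phase difference of two CW tones."""
    return C * delta_phi / (4.0 * math.pi * (f2 - f1))


def unambiguous_range(f1, f2):
    """Largest distance before the phase difference wraps past 2*pi."""
    return C / (2.0 * (f2 - f1))


f1, f2 = 79.8e9, 80.0e9
d_true = 0.30
# Phase difference an ideal target at d_true would cause between the two tones
delta_phi = 4.0 * math.pi * (f2 - f1) * d_true / C

d_est = fsk_distance(delta_phi, f1, f2)
d_max = unambiguous_range(f1, f2)
print(f"d = {d_est:.3f} m, d_max = {d_max:.3f} m")
```

A smaller frequency spacing widens the unambiguous range but makes the distance estimate more sensitive to phase noise, since the same phase error maps to a larger distance error.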
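If we assume that there is a target at position $\vec{r}^{\,e}_p = (x, y, z_e)$, the estimated distance $d^e_{im}$ from the $m$th Tx antenna ($m \in [1; N_{Tx}]$) to the respective target and back to the $i$th Rx antenna ($i \in [1; N_{Rx}]$) can be calculated by

$$d^e_{im} = |\vec{r}_{Tx,m} - \vec{r}^{\,e}_p| + |\vec{r}^{\,e}_p - \vec{r}_{Rx,i}| \qquad (9)$$

whereas the scatterer is actually located at $\vec{r}_p = (x, y, z)$. Therefore, the correct distance to the point scatterer is

$$d_{im} = |\vec{r}_{Tx,m} - \vec{r}_p| + |\vec{r}_p - \vec{r}_{Rx,i}| \qquad (10)$$

and

$$d_{im} = d^e_{im} + \Delta d_{im} \qquad (11)$$

where $\Delta d_{im}$ describes the error of the distance estimate. For one Tx-Rx antenna pair, the two baseband signals generated from this point target are

$$s^1_{im} = A_1 \exp\left( j \left[ \frac{2\pi f_1 d_{im}}{c} + \varphi_{Rx} \right] \right) \qquad (12)$$

$$s^2_{im} = A_2 \exp\left( j \left[ \frac{2\pi f_2 d_{im}}{c} + \varphi_{Rx} \right] \right) \qquad (13)$$

For reasons of simplicity, the amplitudes of the baseband signals $A_1$ and $A_2$ are set to one. Furthermore, we note that the baseband signals are influenced by the reflections of multiple point scatterers. The impact of all scatterers in the volume of interest that are located at $x$ and $y$ coordinates different from the pixel being reconstructed can be expressed as an extra unknown phase term in addition to $\varphi_{Rx}$. For simplicity, the additional phase term and the unknown reflection phase are summarized by $\varphi_{Rx}$ below. The complex weights $w^1_{im}$ and $w^2_{im}$ corresponding to (12) and (13), which are needed to reconstruct a 2D image for both carrier frequencies, are based on the first estimate $d^e_{im}$ and described by

$$w^1_{im} = \exp\left( -j \frac{2\pi f_1 d^e_{im}}{c} \right) \qquad (14)$$

$$w^2_{im} = \exp\left( -j \frac{2\pi f_2 d^e_{im}}{c} \right) \qquad (15)$$

Correlating the baseband signals from (12) and (13) with their individual weights, with respect to (11), yields

$$s^1_{im} w^1_{im} = \exp\left( j \left[ \frac{2\pi f_1 \Delta d_{im}}{c} + \varphi_{Rx} \right] \right) \qquad (16)$$

$$s^2_{im} w^2_{im} = \exp\left( j \left[ \frac{2\pi f_2 \Delta d_{im}}{c} + \varphi_{Rx} \right] \right) \qquad (17)$$

To obtain pixel $\hat{O}^{1/2}_{z_e}(x, y) = \hat{O}^{1/2}(\vec{r}^{\,e}_p)$ of the image of slice $z_e$, these results are added up over all possible Tx-Rx combinations:

$$\hat{O}^1_{z_e}(x, y) = \sum_{m=1}^{N_{Tx}} \sum_{i=1}^{N_{Rx}} \exp\left( j \left[ \frac{2\pi f_1 \Delta d_{im}}{c} + \varphi_{Rx} \right] \right) \qquad (18)$$

$$\hat{O}^2_{z_e}(x, y) = \sum_{m=1}^{N_{Tx}} \sum_{i=1}^{N_{Rx}} \exp\left( j \left[ \frac{2\pi f_2 \Delta d_{im}}{c} + \varphi_{Rx} \right] \right) \qquad (19)$$

To evaluate $\Delta d$, the 2FSK principle is followed. Hence, the phase difference between corresponding pixels in the two images is measured by multiplying $\hat{O}^2_{z_e}(x, y)$ with the complex conjugate of $\hat{O}^1_{z_e}(x, y)$. Therefore, the differential complex pixel information is obtained by

$$\Delta\hat{O}_{z_e}(x, y) = \hat{O}^2_{z_e}(x, y) \cdot \left( \hat{O}^1_{z_e}(x, y) \right)^{*} \qquad (20)$$

The phase $\Delta\varphi$ of this differential complex pixel $\Delta\hat{O}_{z_e}(x, y)$ is then used to calculate $\Delta d$ by applying (7), which results in

$$\Delta d = \frac{c \, \Delta\varphi}{2\pi \Delta f} \qquad (21)$$

In (21), there is only a division by two, as $\Delta d$ is defined as the complete distance error related to the path from the Tx antenna to the point scatterer and from the point scatterer to the Rx antenna, which is illustrated in Fig. 5 using the example of a monostatic antenna and a point target $P(x, y, z)$. In contrast, the distance $d$ in (7) only covers the forward distance from a monostatic antenna to the target. To estimate $\Delta z$, we simplify

$$\Delta z \approx \frac{\Delta d}{2} \qquad (22)$$

which can then be used to correct the first estimated guess $z_e$. In Fig. 5, it also becomes evident that the difference between $\Delta d/2$ and $\Delta z$ is reduced for increasing values of $z$, as the angle $\alpha$ becomes smaller. The theoretical maximum unambiguous value for $\Delta d/2$ is defined by (8), thereby resulting in $\pm c/(\Delta f \cdot 2)$.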
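The effect of approximating the depth error by half the path error can be made concrete for the monostatic case of Fig. 5. The sketch below is an illustration under assumptions: a single monostatic antenna with a lateral offset `rho` from the target (10 cm here, a hypothetical value), a depth error of 3 cm, and a hypothetical helper name.

```python
import math


def half_path_error(rho, z_true, z_est):
    """Delta d / 2 for a monostatic antenna at lateral offset rho from the target."""
    return math.hypot(rho, z_true) - math.hypot(rho, z_est)


delta_z = 0.03  # true depth error in m
for z_true in (0.20, 0.30, 0.50):
    half_dd = half_path_error(0.10, z_true, z_true - delta_z)
    mismatch = delta_z - half_dd
    print(f"z = {z_true:.2f} m: delta_d/2 = {half_dd * 1e3:.2f} mm, "
          f"mismatch to delta_z = {mismatch * 1e3:.2f} mm")
```

For an antenna directly in front of the target (`rho = 0`), the approximation is exact; for off-axis antennas, the mismatch shrinks as the target moves away from the array, in line with the decreasing angle α.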
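To further illustrate the mechanism of this 2FSK imaging algorithm, (20) can be rewritten, for the example of two Tx-Rx combinations with the distance errors $\Delta d_{11}$ and $\Delta d_{12}$, as

$$\Delta\hat{O}_{z_e}(x, y) = \left[ e^{j(2\pi f_2 \Delta d_{11}/c + \varphi_{Rx})} + e^{j(2\pi f_2 \Delta d_{12}/c + \varphi_{Rx})} \right] \cdot \left[ e^{-j(2\pi f_1 \Delta d_{11}/c + \varphi_{Rx})} + e^{-j(2\pi f_1 \Delta d_{12}/c + \varphi_{Rx})} \right] \qquad (23)$$

Multiplying out the parentheses results in

$$\Delta\hat{O}_{z_e}(x, y) = e^{j 2\pi (f_2 - f_1) \Delta d_{11}/c} + e^{j 2\pi (f_2 - f_1) \Delta d_{12}/c} + e^{j 2\pi (f_2 \Delta d_{11} - f_1 \Delta d_{12})/c} + e^{j 2\pi (f_2 \Delta d_{12} - f_1 \Delta d_{11})/c} \qquad (24)$$

which can be simplified to

$$\begin{aligned} \Delta\hat{O}_{z_e}(x, y) = {} & \exp\left( j 2\pi \frac{\Delta f \, \Delta d_{11}}{c} \right) \\ & + \exp\left( j 2\pi \frac{\Delta f \, \Delta d_{12}}{c} \right) \\ & + \exp\left( j 2\pi \frac{f_2 \Delta d_{11} - f_1 \Delta d_{12}}{c} \right) \\ & + \exp\left( j 2\pi \frac{f_2 \Delta d_{12} - f_1 \Delta d_{11}}{c} \right) \end{aligned} \qquad (25)$$

This evaluation leads to the cancelling out of the unknown $\varphi_{Rx}$. If all $\Delta d$ values were equal, this formula would result in an in-phase summation of the individual antenna contributions. For most imaging radars, $\Delta d$ differs between Tx-Rx combinations. However, as $\Delta f$ is set to comparably small values for 2FSK applications and, hence, small variations in $\Delta d$ cause only small phase shifts, we can assume that lines 1 and 2 of (25) cause the same phase shift and lead to an in-phase summation of these summands. Moreover, the cross terms of (25) (i.e., lines 3 and 4) remain. To evaluate their influence, line 3 of (25) can be rewritten as

$$\exp\left( j 2\pi \frac{\Delta f \, \Delta d_{11} - f_1 \, \delta d}{c} \right) \qquad (26)$$

with $\Delta d_{12} = \Delta d_{11} + \delta d$ and $\delta d$ being an offset between the two $\Delta d$ values. Likewise, line 4 of (25) can be rewritten as

$$\exp\left( j 2\pi \frac{\Delta f \, \Delta d_{11} + f_2 \, \delta d}{c} \right) \qquad (27)$$

The term $\Delta f \, \Delta d_{11}/c$ of (26) and (27) causes an additional in-phase summand with respect to (25). In the following, we assume that

$$f_1 \, \delta d \approx f_2 \, \delta d \qquad (28)$$

which implies that the phase shifts caused by these terms in (26) and (27) cancel each other out, so that only the in-phase summations of the complex pointers of (25) remain.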
In other words, the residual pixel phase of the multiplication described in (20) is influenced by the average $\Delta d$ over all possible Tx-Rx combinations. The phase error that is made by this assumption is approximately $\exp\left( j 2\pi \Delta f \, \delta d / c \right)$.
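The statement that the residual pixel phase follows the average path error can be checked numerically. The sketch below is illustrative only: the 25 path errors, their spread of ±0.5 mm around 60 mm, the value of the unknown phase, and the sign convention (positive phase for positive path error) are assumptions made for the example.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s
f1, f2 = 79.8e9, 80.0e9
phi_rx = 0.7  # arbitrary unknown reflection phase (hypothetical value)

# Hypothetical path-length errors of 25 Tx-Rx pairs, spread by +/- 0.5 mm around 60 mm
delta_d = 0.060 + np.linspace(-0.5e-3, 0.5e-3, 25)

# Focused single-tone pixels: sum of the delay-compensated contributions
pixel_1 = np.exp(1j * (2 * np.pi * f1 * delta_d / C + phi_rx)).sum()
pixel_2 = np.exp(1j * (2 * np.pi * f2 * delta_d / C + phi_rx)).sum()

# Differential pixel: phi_rx cancels in the conjugate product
delta_phi = np.angle(pixel_2 * np.conj(pixel_1))
delta_d_est = C * delta_phi / (2 * np.pi * (f2 - f1))

print(f"mean path error: {delta_d.mean() * 1e3:.3f} mm, "
      f"estimate: {delta_d_est * 1e3:.3f} mm")
```

Changing `phi_rx` leaves the estimate unchanged, which is the practical benefit of evaluating the phase difference between the two images rather than the phase of a single image.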
The overall workflow of the 2FSK MIMO imaging principle is depicted in Fig. 6. The phase difference between the corresponding pixels of the two complex images is evaluated for each pixel $(x, y)$ (see Fig. 4) where a target can be assumed after applying a suitable threshold. To obtain the best possible focused image, the imaging is divided into two steps. First, two images are reconstructed at $z_e$, and the pixel containing the strongest scatterer is evaluated. After calculating the corresponding $\Delta z$, $z_e$ is corrected by $\Delta z$, and two complex images are reconstructed at the adjusted $z_e$ coordinate. Finally, the phase difference of all pixels with an amplitude above a certain threshold is evaluated by applying (21) and (22), and the 3D information is smoothed by an averaging filter.
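The two-step workflow can be sketched end to end for a single simulated point target. This is a toy example under assumptions, not the authors' implementation: a small cross-shaped array of 16 Tx and 16 Rx antennas, a coarse 21 × 21 pixel grid, a unit-amplitude scatterer, and a positive-phase sign convention are chosen for illustration, and thresholding and smoothing are omitted.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s
F1, F2 = 79.8e9, 80.0e9


def path_lengths(tx, rx, points):
    """Tx -> point -> Rx path length, shape (n_points, n_tx, n_rx)."""
    d_tx = np.linalg.norm(points[:, None, :] - tx[None, :, :], axis=-1)
    d_rx = np.linalg.norm(points[:, None, :] - rx[None, :, :], axis=-1)
    return d_tx[:, :, None] + d_rx[:, None, :]


def single_tone_image(s_b, freq, tx, rx, pixels):
    """Backproject one CW tone onto the pixels of a single z slice."""
    weights = np.exp(-2j * np.pi * freq * path_lengths(tx, rx, pixels) / C)
    return (weights * s_b[None]).sum(axis=(1, 2))


def two_fsk_depth(s1, s2, tx, rx, xy, z_e):
    """Per-pixel depth from the phase difference of two single-tone images."""
    pixels = np.column_stack([xy, np.full(len(xy), z_e)])
    img1 = single_tone_image(s1, F1, tx, rx, pixels)
    img2 = single_tone_image(s2, F2, tx, rx, pixels)
    delta_phi = np.angle(img2 * np.conj(img1))
    delta_d = C * delta_phi / (2 * np.pi * (F2 - F1))    # round-trip path error
    return np.abs(img1), z_e + delta_d / 2               # delta_z ~ delta_d / 2


# Toy scene (hypothetical): cross array with 3 mm spacing, one target at z = 0.30 m
pos = (np.arange(16) - 7.5) * 0.003
tx = np.stack([pos, np.zeros(16), np.zeros(16)], axis=1)
rx = np.stack([np.zeros(16), pos, np.zeros(16)], axis=1)
target = np.array([[0.0, 0.0, 0.30]])
d_true = path_lengths(tx, rx, target)[0]
s1 = np.exp(2j * np.pi * F1 * d_true / C)
s2 = np.exp(2j * np.pi * F2 * d_true / C)

grid = np.linspace(-0.05, 0.05, 21)
xy = np.array([(x, y) for x in grid for y in grid])

z_e = 0.27                                               # initial depth guess
for _ in range(2):                                       # initial pass, then one refinement
    magnitude, z_map = two_fsk_depth(s1, s2, tx, rx, xy, z_e)
    z_e = z_map[np.argmax(magnitude)]                    # depth at the strongest pixel

print(f"estimated target depth: {z_e:.4f} m")
```

Only four single-tone slices are computed in total (two tones, two passes), regardless of how deep the volume of interest is.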

V. SIGNAL AND RECONSTRUCTION DESIGN
When choosing the optimum signal parameters for the 2FSK approach, the value for $\Delta d_{max}/2$ (see (8)) is one design criterion. In the following, we set $\Delta f$ to 200 MHz; in this manner, we acquire an unambiguous range for $\Delta d/2$ of approximately 0.75 m and ensure that $\Delta f$ is still sufficiently small to enable the use of all simplifications described in Section IV. Further, the CW frequencies are set to $f_1 = 79.8$ GHz and $f_2 = 80$ GHz, thereby enabling high angular resolution. With regard to (5) and (6), the cross-range and range resolution for a CW signal at $f_1 = 79.8$ GHz at a distance $L = 0.3$ m are approximately $\delta_x = \delta_y \approx 4$ mm and $\delta_z \approx 4$ cm, respectively. With regard to the backprojection, two complex images for $x, y \in [-0.1\,\text{m}, 0.1\,\text{m}]$ are reconstructed. $N_x = N_y$ are set to 101 sampling points, thereby leading to a pixel dimension of 1 mm × 1 mm.
When applying the SFCW approach, in our case, a signal form that covers a bandwidth of 10 GHz from 72 GHz to 82 GHz with $N_f = 128$ is used. This enables a cross-range resolution at distance $L = 0.3$ m of approximately $\delta_x \approx \delta_y \approx 4$ mm as well as a range resolution of $\delta_z \approx 1$ cm. With respect to the backprojection of the volume, $x$ and $y$ also lie within $[-0.1\,\text{m}, 0.1\,\text{m}]$, and $N_x = N_y$ are also set to 101 sampling points. The depth $z$ is sampled from $[0.26\,\text{m}, 0.34\,\text{m}]$ in $N_z = 81$ steps, thereby yielding a voxel dimension of 1 mm × 1 mm × 1 mm.

VI. PERFORMANCE OF THE IMAGING APPROACH
Now that the theory behind the novel approach has been explained and the signal and reconstruction parameters have been established, the following section evaluates its reconstruction performance.

A. Simulation of Point Scatterers
Within this section, the theoretical accuracy of both techniques is evaluated by simulating the reflected signal of one point scatterer. First, the theoretical accuracy of the novel 2FSK MIMO imaging approach is analyzed for different point scatterer positions $P(x, y, z)$, as seen in Fig. 5, and for the MIMO geometry presented in Fig. 1. To estimate the point target position, two complex images are reconstructed at an estimated distance $z_e$ using the two different carrier frequencies. At first, $z_e$ is estimated at 30 cm. To adjust $z_e$, the phase difference of the pixel containing the strongest scatterer is evaluated (see Fig. 6). Afterward, two complex images are again reconstructed at the adjusted value for $z_e$, and the phase evaluation of the strongest scatterer is repeated.
Within the first simulation set, the $x$ and $y$ positions of the point target are varied, whereas the true $z$ coordinate and $z_e$ are kept constant, with $z = 0.3$ m and $z_e = 0.27$ m leading to $\Delta z = 0.03$ m. Table I lists the true position $P$ of the point target and the absolute deviations per coordinate ($\delta x$, $\delta y$, $\delta z$) of the value detected using the 2FSK imaging principle before and after adjusting $z_e$. Furthermore, the 2FSK algorithm was evaluated for different $z$ values, which can be seen in Table II. Here, $\Delta z$ was kept constant at 0.03 m. In addition, various values for $\Delta z$ were evaluated while $x$, $y$, and $z$ were kept constant at $(0, 0, 0.3)$. The results are presented in Table III. These results indicate that the accuracy of the 2FSK algorithm decreases the closer the target lies to the radar. This can be illustrated by Fig. 5. The closer the point target is located with respect to the radar, the greater the difference between $\Delta d/2$ and $\Delta z$ becomes. It is also evident that the readjustment of $z_e$ significantly improves the overall accuracy of the detection of all coordinates.
Second, the theoretical accuracy of the SFCW approach is analyzed. For all point target positions listed in Table I and Table II, the state-of-the-art approach yielded 100 % accuracy (results were rounded to the millimeter). The remarkable theoretical accuracy and the widespread use of this algorithm provide strong motivation to utilize its reconstruction results as a benchmark for evaluating the performance of the 2FSK approach in the upcoming measurement section.

B. Measurements Using 3D Printed Hand Poses
This section presents measurement results regarding the 3D reconstruction of different hand poses of the ASL alphabet to further evaluate the performance of the novel 2FSK 3D imaging approach. In this regard, the results are compared to the state-of-the-art SFCW approach. To allow this comparison, three hand poses were 3D printed and coated with an electromagnetic interference shielding lacquer. In this manner, a static object scene can be ensured. The hand poses describe the ASL letters B, F, and U and are depicted in Fig. 7(a).
The MIMO radar array used for the measurements corresponds to the geometry described in Section II. The measurement setup can be seen in Fig. 7(b). The 3D printed hands were positioned at a z coordinate of roughly 30 cm in front of the radar. We placed two absorber walls behind the hands. To position the hands, we used a table with a Styrodur plate and one absorber mat on top. For both the 2FSK and SFCW measurements, the measurement data were averaged across three successive measurements.
For the 3D reconstruction of the hand poses with the novel 2FSK approach, we followed the steps depicted in Fig. 6. To suppress sidelobes, a threshold 13 dB below the maximum amplitude is applied; hence, every pixel within the image reconstructed at $f = 79.8$ GHz with a magnitude below this threshold is removed from both reconstructed images. The remaining pixels then define the region of interest (ROI) in the $x$ and $y$ dimensions.
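The thresholding step can be expressed as a boolean mask over the magnitude image. The following is a minimal sketch, assuming the threshold refers to amplitude (20 log10) relative to the image maximum; the function name and the pixel values are hypothetical.

```python
import numpy as np


def roi_mask(image, threshold_db=-13.0):
    """True where the pixel magnitude lies within |threshold_db| of the image maximum."""
    magnitude = np.abs(image)
    # Small floor avoids log10(0) for empty pixels
    level_db = 20.0 * np.log10(np.maximum(magnitude, 1e-30) / magnitude.max())
    return level_db >= threshold_db


image = np.array([[1.0, 0.5], [0.2, 0.05]], dtype=complex)  # hypothetical pixel values
mask = roi_mask(image)
print(mask)
```

The same mask is then applied to both single-tone images, so that the phase comparison is only carried out where a target can be assumed.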
To perform a 3D reconstruction of the hand poses using the SFCW algorithm, first, the volume of interest is reconstructed. Then, a 2D image is generated via the maximum projection, and the corresponding z coordinates of the skin surface are extracted as described in Section III. Thereafter, a threshold 15 dB below the maximum amplitude is applied to suppress sidelobes.
The result of the image reconstruction, normalized to the corresponding maximum magnitude and with thresholding applied, is presented for both algorithms in Fig. 8(a). At this point, it is noted that the absolute maximum magnitude for the SFCW maximum projection image is higher than that of the reconstructed 2FSK images, as the SFCW algorithm runs a backprojection for multiple frequencies.
To determine the z coordinate of the skin surface for each pixel using the novel 2FSK approach, the phase of both complex images is compared for each pixel within the extracted ROI. With respect to (21) and (22), $\Delta z$ as well as $z = z_e + \Delta z$ are calculated. Afterward, the values for $z$ are smoothed with a 2D running mean filter with a kernel size of 15. The z coordinates extracted during the maximum projection in the case of the SFCW evaluation are also smoothed with a 2D running mean filter with a kernel size of 15.
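A 2D running mean restricted to the ROI can be sketched as follows. This is one possible realization under assumptions: pixels outside the ROI are marked as NaN and excluded from the average, and the image border is treated as empty; how the original evaluation handles the ROI boundary is not stated in the text. The function name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def running_mean_2d(z_map, kernel=15):
    """Running mean over a kernel x kernel window, ignoring NaN (non-ROI) pixels."""
    pad = kernel // 2
    valid = ~np.isnan(z_map)
    filled = np.where(valid, z_map, 0.0)
    # Zero padding: cells outside the image contribute neither value nor count
    sums = sliding_window_view(np.pad(filled, pad), (kernel, kernel)).sum(axis=(-2, -1))
    counts = sliding_window_view(np.pad(valid.astype(float), pad),
                                 (kernel, kernel)).sum(axis=(-2, -1))
    out = np.full(z_map.shape, np.nan)
    out[valid] = sums[valid] / counts[valid]
    return out


# Hypothetical depth map: flat surface at 0.30 m with one pixel outside the ROI
z_map = np.full((20, 20), 0.30)
z_map[0, 0] = np.nan
smoothed = running_mean_2d(z_map)
```

With a pixel size of 1 mm, a kernel of 15 averages over roughly 1.5 cm in each lateral direction, which trades fine surface detail for a less noisy depth estimate.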
The results for the detection of the skin surface of the hand for both algorithms can be seen in Fig. 8(b)-(h). The 3D reconstruction of the hand surface yields rather similar results when both approaches are compared. The estimated z coordinate fluctuates slightly more in the 2FSK case, whereas the surface estimate using the SFCW approach appears smoother. The small fluctuations of the depth estimate in the 2FSK algorithm may be caused by clutter, which can have a stronger effect compared to the SFCW case, as only one z-slice is reconstructed. However, the depth profiles of the respective hand poses are clearly recognizable in the 2FSK case. By means of these figures, the different hand poses can clearly be distinguished from each other with both approaches and assigned to the corresponding letter. This remains the case if some parts of the hand do not return a signal due to their orientation to the radar, for example, the thumb saddle joint in the case of the letter F.
Table IV presents the corresponding quantitative comparison of both approaches. To further evaluate the performance of the 2FSK approach, the reconstruction results for letter F are compared to the state of the art for increased distances to the radar. The comparison of the 3D reconstructed hand surface is depicted in Fig. 9. The average and maximum deviation in the case of letter F for increasing distances to the radar are compared in Table V. At 40 cm and 50 cm, the 2FSK algorithm still yields good results that are comparable to the SFCW outcome. However, with increased distances to the radar, the lateral resolution degrades for both approaches. With regard to the 2FSK algorithm, this implies that the depth information within one lateral resolution cell is superimposed. In the case of the SFCW algorithm, two targets at differing depths that lie in the same lateral resolution cell might still be distinguishable due to the high radial resolution. Nevertheless, when applying the maximum projection, only the depth information with the higher amplitude is retained. Hence, as individual fingers need to be separated in the respective application, a lateral resolution below 1 cm is required. At a distance of 50 cm, the reduction in lateral resolution is clearly recognizable, which is why this distance should not be exceeded.
In Table V, the deviation between SFCW- and 2FSK-based results evidently increases with the distance to the radar. There are two main reasons for this behavior.
First, the received signal amplitude decreases, which affects the 2FSK approach earlier than SFCW signal generation, as only two transmit frequencies are sent, thereby making it more difficult to separate the main lobe from sidelobes. Second, a decrease in lateral resolution affects the two algorithms differently, which can increase the deviation between them. In summary, based on the prior point target simulations and the comparison of both approaches using real measurements for distances up to 50 cm, it can be stated that the novel 2FSK approach has a performance comparable to the state-of-the-art radar imaging algorithm, with a theoretical accuracy in the sub-millimeter to millimeter range.

C. Efficiency of the 2FSK Imaging Approach
Having shown in the previous section that the results of the novel 2FSK 3D reconstruction are comparable to those of the state-of-the-art approach, we now address the aspect of efficiency. First, the new algorithm presented here requires much less bandwidth than the broadband state-of-the-art reconstruction approaches. This reduces the hardware requirements and increases bandwidth efficiency. Further, the presented reconstruction algorithm is computationally more efficient. To quantify this, the computational complexity of the novel approach, expressed as the number of calculation steps, is compared to that of the exemplary SFCW state-of-the-art algorithm in Table VI and Table VII. We assume the scenario as depicted in Section VI, apart from setting $N_f = 10\,\mathrm{GHz}/200\,\mathrm{MHz} = 50$ to ensure the same unambiguous range for both approaches. The advantage of the novel 2FSK algorithm lies in the reconstruction of two single-tone images at one z coordinate, whereas the brute-force state-of-the-art approach reconstructs a full volume by evaluating a high number of frequency steps paired with a computationally expensive maximum projection. The results reveal a reduction in computational effort by three orders of magnitude, thereby opening the way toward real-time reconstruction of hand poses. It should be noted that the SFCW-based algorithm requires the approximate position of the hand to be known; otherwise, huge 3D volumes have to be reconstructed, which strongly increases the computational burden. With the iterative localization of the hand proposed as part of the novel algorithm, this problem does not occur. In addition, the novel approach offers the possibility of considerably increasing the frame rate, as only two frequency steps are required. With the hardware used here, the frame rate can theoretically be increased to $70 \cdot 128/2\,\mathrm{Hz} \approx 4.5\,\mathrm{kHz}$, allowing radar-based hand motion tracking even for fast motions.
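The arithmetic in this section can be retraced with a short sketch. The frequency-step count and the frame-rate expression are taken from the text, under the reading that 70 Hz is the frame rate with 128 frequency steps. The operation-count model (`backprojection_ops`) and the pixel and slice counts are hypothetical simplifications for illustration only; the actual step counts are those listed in Table VI and Table VII.

```python
# Hedged sketch: arithmetic from the text plus an illustrative cost model.

# Number of SFCW frequency steps stated in the text: 10 GHz / 200 MHz.
N_F_SFCW = round(10e9 / 200e6)


def frame_rate_2fsk(base_rate_hz=70.0, base_steps=128, steps_2fsk=2):
    """Scale the frame rate inversely with the frequency steps per frame.

    Assumption: 70 Hz is the frame rate achieved with 128 frequency steps.
    """
    return base_rate_hz * base_steps / steps_2fsk


def backprojection_ops(n_pixels, n_channels, n_freqs, n_slices):
    """Hypothetical cost model: one complex multiply-accumulate per
    pixel, Tx/Rx channel, frequency, and z-slice. It ignores the maximum
    projection (SFCW) and the iterative localization (2FSK)."""
    return n_pixels * n_channels * n_freqs * n_slices


n_channels = 94 * 94      # 94 Tx and 94 Rx antennas, as stated in the abstract
n_pixels = 100 * 100      # hypothetical lateral grid
n_slices_sfcw = 40        # hypothetical number of z-slices in the SFCW volume

ops_sfcw = backprojection_ops(n_pixels, n_channels, N_F_SFCW, n_slices_sfcw)
ops_2fsk = backprojection_ops(n_pixels, n_channels, 2, 1)

print(f"SFCW frequency steps: {N_F_SFCW}")
print(f"2FSK frame rate: {frame_rate_2fsk():.0f} Hz")
print(f"operation ratio SFCW/2FSK: {ops_sfcw / ops_2fsk:.0f}")
```

In this simplified model, the ratio reduces to the number of frequency steps times the number of z-slices, divided by two, so the saving grows with the depth extent of the volume that the SFCW approach has to reconstruct.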

D. Measurements Including a Real Human Hand
To further support our results, we also include an exemplary evaluation of a measurement setup in which a real hand was positioned in front of the sensor (as depicted in Fig. 1 and described in Section II), and the 2FSK signal and reconstruction approach was applied. Again, we roughly positioned the hand at a distance of 30 cm and imitated the three letters from the ASL alphabet. For these measurements, we obtained the best results with the measurement averaging set to a factor of 1, as a real human hand cannot be held as still as the 3D printed versions. The reconstructed images as well as the results of the 3D reconstruction are depicted in Fig. 10. The threshold was further adjusted to −10 dB. The results of the 3D reconstruction of a real human hand provide further evidence that the 2FSK-based approach is valid for 3D imaging of the human hand and yields promising results. The specific depth profiles can be clearly assigned to the respective hand poses. In addition, the radar image as well as the depth profile strongly resemble the measurement results obtained with the 3D printed hands.
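The effect of such an amplitude threshold can be illustrated with a minimal sketch. It assumes that the threshold is applied to the pixel amplitude relative to the image maximum; the text does not specify the reference level, and the function name `mask_by_threshold` is hypothetical.

```python
import numpy as np


def mask_by_threshold(image, threshold_db=-10.0):
    """Return a boolean mask of pixels whose amplitude lies at or above
    threshold_db relative to the strongest pixel (assumed reference)."""
    magnitude = np.abs(np.asarray(image))
    peak = magnitude.max()
    if peak == 0.0:
        # An all-zero image contains no valid pixels.
        return np.zeros(magnitude.shape, dtype=bool)
    with np.errstate(divide="ignore"):
        # Zero-amplitude pixels map to -inf dB and are rejected.
        level_db = 20.0 * np.log10(magnitude / peak)
    return level_db >= threshold_db


# Toy single-tone image: amplitudes at 0 dB, about -6 dB, -20 dB, and empty.
toy_image = np.array([1.0 + 0.0j, 0.0 + 0.5j, 0.1 + 0.0j, 0.0 + 0.0j])
print(mask_by_threshold(toy_image))
```

Only pixels that pass the mask would then be used for the depth estimate, which keeps low-amplitude sidelobe and clutter pixels out of the reconstructed surface.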

VII. DISCUSSION
Our findings have clearly revealed that the proposed fast and efficient 2FSK-based imaging approach yields promising results for 3D hand pose reconstruction that are comparable to the outcome of the state-of-the-art SFCW-based radar imaging approach. At the same time, our novel approach reduces computational complexity by three orders of magnitude and significantly increases possible frame rates. However, the proposed method also has a few drawbacks that appear to cause the small differences between the 3D reconstruction outcomes of both methods. One limitation is that the proposed method reconstructs only one z-slice of the object scene. As a result, clutter has a stronger impact than in the backprojection of an entire volume followed by a maximum projection. In the future, it would be interesting to investigate how the robustness against clutter could be improved. Empty space measurements could be a viable approach in this context. Furthermore, the reconstruction of only one z-slice also makes it more difficult to separate the side lobes from the main lobe. This is particularly important for hand areas that reflect only a small signal amplitude. An interesting approach to reduce this effect would be a stepwise subtraction of the point spread function. Moreover, the approach would benefit from one additional 2FSK evaluation round. That is, the 3D coordinates are first estimated by our proposed approach; thereafter, the images are reconstructed on the depth profile obtained by this prior evaluation, instead of on one z-slice, to increase reconstruction accuracy. In addition, given that only two transmit frequencies are evaluated, the impact of phase errors must also be considered. In general, radar-based approaches, unlike optical methods, suffer from numerous specular reflections, which limit the hand areas where a 3D reconstruction is possible. In the future, this could be addressed by multi-perspective imaging, which would also reduce the effect of occlusion.

VIII. CONCLUSION
In this manuscript, we present a novel algorithm for the 3D reconstruction of hand poses that incorporates the theory of frequency shift keying into the concept of millimeter-wave imaging. This approach requires only little bandwidth, as only two closely spaced transmit frequencies are needed. The concept of frequency shift keying is first used to iteratively locate the hand within a 3D space. Thereafter, the phase difference of corresponding pixels between two single-tone images is evaluated according to the 2FSK principle to precisely estimate the contour of the hand. The major advantage of this approach is that it enables a direct reconstruction of the body shell, i.e., a 2D depth profile, and avoids the backprojection of a 3D volume, as is currently the case with millimeter-wave imaging. The presented approach is therefore characterized by high efficiency in terms of computational effort and hardware requirements. Furthermore, image reconstruction complexity can be reduced by a factor of 1000, which offers the potential to reduce reconstruction time, while frame rates can be significantly increased. In this manner, we were able to not only raise the efficiency of radar-based static hand pose estimation but also open the way to precise real-time radar-based hand motion tracking. In this study, simulations revealed that the algorithm yields a theoretical accuracy in the sub-millimeter to millimeter range. In real measurement setups, the novel 2FSK approach yields results comparable to the state-of-the-art 3D radar imaging principle, with deviations lying in the millimeter range. Finally, it can be concluded that the novel 2FSK imaging approach has great potential to improve radar-based methods for capturing hand poses, including static and dynamic gesture recognition and hand-tracking applications. In the future, this methodology should be implemented by leveraging the advantages of radar-based methods and the advancements made in computer vision algorithms to enhance their performance.

Fig. 1. Measurement setup to capture hand pose and surface. The hand is positioned in front of the MIMO radar using a styrodur structure placed on a reference coordinate system of the x-y plane.

Fig. 2. MIMO radar used within the measurement setup. (a) Photo of hardware. (b) Antenna distribution of the respective MIMO radar.

Fig. 4. The 2FSK MIMO radar imaging principle. The 3D shape of the object within the volume of interest $O$ is obtained by reconstructing two 2D images $\hat{O}^{1}_{z_e}$ and $\hat{O}^{2}_{z_e}$ of one z-slice at an estimated depth $z_e$ based on two neighboring frequencies. $\vec{r}_p$ indicates the different pixel positions of interest. The actual depth of the target is calculated by evaluating the phase difference $\varphi$ between respective pixels of both images and correcting the initially estimated depth $z_e$.
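The depth correction described in this caption can be sketched for a single pixel and a single monostatic antenna. This is a simplified illustration, not the authors' MIMO implementation: it works with the radial distance d rather than z (cf. Fig. 5), the 77 GHz carrier is a hypothetical value, and the 200 MHz frequency spacing is an assumption inferred from the frequency step used in the efficiency comparison. All function names are hypothetical.

```python
import numpy as np

C0 = 299_792_458.0  # speed of light in m/s


def unambiguous_range(delta_f):
    """Range over which the two-tone phase difference stays unambiguous."""
    return C0 / (2.0 * delta_f)


def single_tone_pixel(freq, d_true, d_est, amplitude=1.0):
    """Matched-filter output for one pixel: the two-way echo phase of a
    target at d_true, compensated with the hypothesis d_est."""
    echo = amplitude * np.exp(-1j * 4.0 * np.pi * freq * d_true / C0)
    hypothesis = np.exp(-1j * 4.0 * np.pi * freq * d_est / C0)
    return echo * np.conj(hypothesis)


def depth_from_phase_difference(pix1, pix2, delta_f, d_est):
    """Correct d_est with the phase difference between the pixel values
    at f1 and f2 = f1 + delta_f."""
    dphi = np.angle(pix2 * np.conj(pix1))  # wrapped to (-pi, pi]
    return d_est - C0 * dphi / (4.0 * np.pi * delta_f)


f1 = 77e9          # hypothetical carrier frequency
delta_f = 200e6    # assumed spacing between the two tones
d_true, d_est = 0.312, 0.300   # true and initially estimated distance in m

p1 = single_tone_pixel(f1, d_true, d_est)
p2 = single_tone_pixel(f1 + delta_f, d_true, d_est)
d_hat = depth_from_phase_difference(p1, p2, delta_f, d_est)

print(f"unambiguous range: {unambiguous_range(delta_f):.3f} m")
print(f"corrected distance: {d_hat:.4f} m")
```

Because the phase difference is wrapped, the correction in this sketch is only valid while the initial estimate lies within a quarter of $c/\Delta f$ of the true distance, which is why a coarse localization step is needed before the per-pixel evaluation.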

Fig. 5. Difference between d and z in case of a monostatic antenna.

Fig. 8. 3D imaging results of three printed hand poses from the ASL alphabet. Radar image of 3D printed hand pose B (a) reconstructed at $f_1$ based on 2FSK approach and (b) based on SFCW approach after maximum projection. Estimated hand surface coordinates based on the novel 2FSK approach for hand poses (c) B, (e) F, and (g) U. Estimated hand surface coordinates based on the state-of-the-art SFCW approach for hand poses (d) B, (f) F, and (h) U.
Table IV lists the absolute maximum ($|\delta z_{\mathrm{SFCW-FSK}}^{\max}|$) and mean value ($|\delta z_{\mathrm{SFCW-FSK}}|$) of $\delta z_{\mathrm{SFCW-FSK}}$, which describes the deviation between SFCW- and 2FSK-based z coordinate estimation. The maximum deviations $|\delta z_{\mathrm{SFCW-FSK}}^{\max}|$ lie in the range of 1 cm to 2 cm, whereas the mean value of the deviation is approximately 3 mm to 5 mm. It becomes evident that the algorithms generate similar results for all analyzed sign language letters.
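Deviation figures of this kind can be computed from two depth maps as sketched below. The sketch assumes that the mean is taken over the absolute per-pixel deviations and that pixels without a valid surface estimate in either map are marked as NaN; both points are assumptions, and the function name and toy values are hypothetical rather than measured data.

```python
import numpy as np


def deviation_stats(z_sfcw, z_fsk):
    """Return (max |dz|, mean |dz|) over all pixels where both depth maps
    contain a finite estimate."""
    dz = np.asarray(z_sfcw, dtype=float) - np.asarray(z_fsk, dtype=float)
    valid = np.isfinite(dz)
    if not valid.any():
        raise ValueError("no pixel with a valid estimate in both depth maps")
    abs_dz = np.abs(dz[valid])
    return abs_dz.max(), abs_dz.mean()


# Toy depth maps in meters; NaN marks a pixel below the amplitude threshold.
z_sfcw = np.array([[0.300, 0.305], [np.nan, 0.310]])
z_fsk = np.array([[0.302, 0.301], [0.300, 0.310]])

dz_max, dz_mean = deviation_stats(z_sfcw, z_fsk)
print(f"max |dz| = {dz_max * 1e3:.1f} mm, mean |dz| = {dz_mean * 1e3:.1f} mm")
```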

Fig. 9. Comparison of 3D imaging results of printed hand pose F for increased distances. Estimated hand surface coordinates at a distance of 40 cm to the radar based on the (a) 2FSK and (b) SFCW approach. (c) 2FSK and (d) SFCW based 3D reconstruction results at a further increased distance of 50 cm.

Fig. 10. 3D imaging results of a real human hand based on the 2FSK approach for three exemplary hand poses from the ASL alphabet (left: B; middle: F; right: U). Images reconstructed at $f_1$ (top) and the respective estimated hand surface coordinates of the hand poses based on the 2FSK algorithm (bottom). The extracted 3D information accurately represents the corresponding hand pose and allows a clear assignment.

TABLE I
DEVIATIONS PER COORDINATE (δx, δy, δz) COMPARED TO TRUE POINT TARGET POSITION P FOR DIFFERING VALUES OF x AND y IN METERS. z IS KEPT CONSTANT AT 0.03 m

TABLE II
DEVIATIONS PER COORDINATE (δx, δy, δz) COMPARED TO TRUE POINT TARGET POSITION P FOR DIFFERING VALUES OF z IN METERS. z IS KEPT CONSTANT AT 0.03 m

TABLE III
DEVIATIONS PER COORDINATE (δx, δy, δz) COMPARED TO TRUE POINT TARGET POSITION P FOR DIFFERING VALUES OF z IN METERS. P IS KEPT CONSTANT AT (0.00, 0.00, 0.30) METER

TABLE IV
DEVIATION $\delta z_{\mathrm{SFCW-FSK}}$ OF ESTIMATED z COORDINATE BASED ON SFCW COMPARED TO 2FSK ALGORITHM IN MILLIMETERS

TABLE V
DEVIATION $\delta z_{\mathrm{SFCW-FSK}}$ OF ESTIMATED z COORDINATE IN CASE OF LETTER F AND INCREASED DISTANCES TO THE RADAR

TABLE VI
COMPUTATIONAL COMPLEXITY OF PROPOSED 2FSK ALGORITHM

TABLE VII
COMPUTATIONAL COMPLEXITY OF SFCW (STATE-OF-THE-ART) ALGORITHM