Explorations of Autonomous Prosthetic Grasping via Proximity Vision and Deep Learning

The traumatic loss of a hand is usually followed by significant psychological, functional and rehabilitation challenges. Even though much progress has been made in the past decades, the prosthetic challenge of restoring the human hand functionality is still far from being achieved. Autonomous prosthetic hands showed promising results and wide potential benefit, a benefit that must still be explored and deployed. Here, we hypothesized that a combination of a radar sensor and a low-resolution time-of-flight camera can be sufficient for object recognition in both static and dynamic scenarios. To test this hypothesis, we analyzed via deep learning algorithms HANDdata, a human-object interaction dataset with particular focus on reach-to-grasp actions. Inference testing was also performed on unseen data purposely acquired. The analyses reported here, broken down to gradually increasing levels of complexity, showed a great potential of using such proximity sensors as an alternative or complementary solution to standard camera-based systems. In particular, integrated and low-power radar can be a potential key technology for next generation intelligent and autonomous prostheses.

myoelectrically operated, the human-machine interface (HMI) relies on myoelectric (or electromyographic) sensors placed on the surface of the residual limb. Signals picked up from these sensors can be used to drive the prosthesis in a simplistic one-muscle-one-movement approach (e.g., flex biceps to close the hand). Despite several major and well-known challenges related to surface myoelectric signal acquisition (noise, motion artifacts, interface impedance changes due to humidity, temperature and pressure) [2], [3] whose inconsistencies make the grasp of an object hardly reliable and repeatable, this simple control approach proved to be somewhat functional with basic prosthetic grippers, but failed to translate efficiently when modern multi-articulated (or multi-grasp) prosthetic hands reached the market (e.g., iLimb Ultra, Össur, Iceland and BeBionic, Ottobock, Germany). When operating multi-articulated prostheses, muscular contraction patterns are often used as the switching mechanism between different hand grasps (e.g., flex wrist three times to enable pinch grasp), undermining the intuitiveness of the control. That is why in the past decades, researchers spent extensive efforts trying to relieve the amputee users from the burden of a nonintuitive HMI, moving the learning to the machine instead via artificial intelligence algorithms [4], [5], [6], [7], [8], [9], [10], [11]. Thanks to those efforts, it is now widely accepted that machine learning algorithms applied to myoelectric signals can indeed facilitate the user when operating prosthetic hands with more than one degree-of-freedom, and this is confirmed also by the commercial interests behind this solution (e.g., Complete Control, COAPT, USA, and Myo Plus, Ottobock, Germany). However, the aforementioned challenges related to surface myoelectric signal acquisition still remain, latently contributing to disrupt the control experience.
Even though acceptance rate surveys surely need a more frequent update, they seem to unanimously capture strong trends of prosthesis rejection, sometimes as high as 40% [12], [13]. Obviously, the prosthetic challenge of restoring the human hand functionality and dexterity is still far from being achieved. Arguably, this is in part due to major challenges related to the acquisition of myoelectric signals from the surface of the skin, or to the total lack of tactile sensory feedback, challenges that upcoming implanted solutions are successfully overcoming [14], [15], [16]. However, these implanted solutions are still under clinical investigation and will not reach the masses for at least a decade, and most importantly, they cannot provide a full answer to the complex problem of restoring the human hand functionality. Such an articulated problem must be addressed in parallel from different directions: more intelligent hardware must be developed for the HMI as much as for the robotic prostheses. Unfortunately, the efforts spent so far on more intelligent and autonomous robotic hardware are far from being satisfactory, both from an engineering and a clinical perspective. The idea of semi-autonomous prosthetic hands was proposed decades ago and previous work already attempted to address this need [17], [18]. Shape recognition for automatic grasp adaptation via cameras and image processing techniques was proven possible for industrial automation [19] and prosthetic purposes [20], [21], [22]. Došen et al. explored object size and shape recognition via an RGB camera and a laser depth sensor located on a prosthetic hand [23]. Markovic, Mouchoux, et al. pursued a similar purpose (i.e., object recognition for the identification of the most adequate prosthetic grasp) as well as automatic wrist orientation, via augmented reality glasses, depth and inertial sensors [24], [25], [26]. Automatic grasp selection, pre-shaping and wrist orientation were also recently pursued by Nobre Castro and Došen via an infrared depth camera placed on the dorsal side of a prosthetic hand [27]. Lastly, Starke et al. recently presented a semi-autonomous control strategy for object recognition and grasp selection via RGB camera, inertial and distance sensors, directly interfaced with the user via a display and a single myoelectric channel, completely self-contained within the KIT robotic hand [28], [29].
Unfortunately, despite the promising results, none of the proposed prosthetic solutions has yet reached clinical implementation. Arguably, this might be due to the difficult portability of computer vision solutions into a self-contained wearable system (i.e., too demanding hardware and data processing) and to the uncertainties of unconstrained environments. Therefore, there remains a wide potential benefit in exploring other integrating and concurrent solutions pushing towards prosthetic hands which are semi (or potentially even completely) autonomous from the conventional myoelectric HMI. To this goal, a variety of proximity sensors can be further explored [30], [31], [32]. For instance, radars proved high potential for an accurate recognition of materials and body parts [33], and they are progressively becoming essential in fields such as automotive [34], civil engineering [35], contactless vital sign monitoring systems [36], [37] and novel human-machine interfaces [38], [39].
Here, in an attempt to contribute to the field, we hypothesized that a combination of a radar sensor and a low-resolution time-of-flight camera can provide sufficient data for autonomous control approaches, in particular for shapes and materials recognition in both static and dynamic scenarios. To test this hypothesis, we analyzed via deep learning algorithms a human-object interaction dataset (HANDdata [40]) comprising first-person data recorded from a variety of sensors, including proximity (i.e., state-of-the-art radar and time-of-flight sensors), inertial, load cells and fingers stretch. This dataset includes almost 6000 human-object interactions focused on the reach-to-grasp action performed by 29 different able-bodied individuals, with 10 standardized objects of 5 different shapes and 2 kinds of materials. Moreover, further inference testing was performed on unseen data that was purposely collected following the original HANDdata protocol. The analyses reported here, broken down to gradually increasing levels of complexity, showed a great potential of using such proximity sensors as an alternative or complementary solution to standard camera-based systems. In particular, integrated and low-power radar can be a potential key technology for next generation intelligent and autonomous prostheses, and perhaps even for other applications in healthcare, social (e.g., mobile servant) and industrial (e.g., warehouse robots) robotics.

II. MATERIALS AND METHODS
This study aims to investigate the feasibility of using state-of-the-art proximity sensors, such as a radar and a time-of-flight camera, for object recognition in the context of autonomous prosthetic hands. To this aim, a dataset of human-object interactions was analyzed via deep learning. In order to investigate which sensor can be better suited for the aimed goal of recognizing the target object, all analyses were performed in 3 conditions: 1) radar data alone (RAD), 2) time-of-flight data alone (TOF), and 3) a combination of radar and time-of-flight data (RADTOF). Additionally, to further investigate the grasping scenarios, a fourth data condition was defined with the addition of the inertial sensor (RADTOFIMU). This last condition was deemed helpful to explore effects of fusion of data from both the environment and the participant's behavior.

A. Human-Object Interactions Dataset
Most of the analyses reported in the following were performed on the HANDdata dataset [40], a data collection specifically tailored for autonomous grasping of a robotic hand and with particular attention to the reaching phase. This dataset included almost 6000 human-object interactions recorded via radar and time-of-flight sensors mounted on the forearm of able-bodied participants. These interactions focused on one of the most representative actions of daily life object manipulation, namely the reach to grasp. All details about the dataset are included in its dedicated report, but we briefly summarize the most important ones here for readers' convenience.
Figure 1. The dataset analyzed in this study includes human-object interaction data acquired from 29 participants manipulating 10 different objects while wearing an instrumented glove. The instrumented glove (left) was based on a CyberGlove to which proximity and inertial sensors were added via a custom circuit board mounted on a customizable wrist band. Ten different target objects were used (right), each linked to a particular hand grasp. Abstract objects like a sphere, a cylinder, a triangular prism, a cuboid with a thin rectangular prism 'handle' and a thin rectangular prism were meant to trigger the spherical, power, tri-digit, lateral and pinch grip pattern, respectively.

Twenty-nine healthy adults participated in the data collection. Participants were asked to reach-grasp-lift-and-replace different objects while wearing an instrumented glove (Figure 1, left). The glove was instrumented so to track fingers stretching, kinematics of the forearm, and proximity measurements of the target objects. Forearm kinematics was tracked via the 3-axes accelerometer and gyroscope of the BMI160 (Bosch, Germany) inertial sensor acquiring at 120 samples per second. The proximity sensors used were a pulsed coherent radar (A121, Acconeer, Sweden) and a time-of-flight sensor (VL53L5CX, STMicroelectronics, France-Italy). The A121 radar was set to emit trains of 60 GHz pulses with known starting phase in pulsing/silent cycles alternating at 13 MHz. Knowing the velocity of the emitted pulses, the radar then tried to reconstruct reflections at certain discrete distances by analyzing the time taken to receive the echoes. In the configuration used in this study, the radar received pulses at times related to the distances of 35 range points, equally distributed from 6 cm to 41 cm (i.e., each range point is separated by 1 cm). Then, a complete reading of all range points, also defined as a sweep, was repeated 40 times with a rate of 800 Hz. Lastly, these 40 sweeps were collected in frames (i.e., frame size 40x35) and acquired with a rate of 15 frames per second. The radar field of view covered a 3D region of approximately 65x53 degrees at 50% of beam power. The VL53L5CX time-of-flight sensor tried to estimate the target's distance by illuminating the region of interest with 940 nm photons (i.e., invisible light). It was set for 64 pixels resolution at 15 frames per second (i.e., frame size 8x8). Its field of view covered a 3D region of approximately 45x45 degrees at 75% of beam power.
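As a quick consistency check of the acquisition settings quoted above, the following minimal sketch derives the per-frame data shapes and timing. The variable names are illustrative only, and the assumption that the 40 sweeps of a frame are acquired back-to-back at the 800 Hz sweep rate is ours, not stated in the dataset report.

```python
# Acquisition settings quoted in the text; names are illustrative assumptions.
SWEEPS_PER_FRAME = 40      # radar sweeps collected in one frame
RANGE_POINTS = 35          # radar range points per sweep
SWEEP_RATE_HZ = 800.0      # radar sweep rate
FRAME_RATE_HZ = 15.0       # radar and time-of-flight frame rate
TOF_ROWS, TOF_COLS = 8, 8  # time-of-flight resolution (64 pixels)
IMU_RATE_HZ = 120.0        # inertial sensor sampling rate

radar_frame_shape = (SWEEPS_PER_FRAME, RANGE_POINTS)
tof_frame_shape = (TOF_ROWS, TOF_COLS)

# Assumption: sweeps within a frame are contiguous at the sweep rate.
burst_s = SWEEPS_PER_FRAME / SWEEP_RATE_HZ
frame_period_s = 1.0 / FRAME_RATE_HZ
burst_fraction = burst_s / frame_period_s

# Number of inertial samples available per proximity frame.
imu_samples_per_frame = IMU_RATE_HZ / FRAME_RATE_HZ

print(f"radar frame {radar_frame_shape}, ToF frame {tof_frame_shape}")
print(f"sweep burst {burst_s * 1e3:.1f} ms of a {frame_period_s * 1e3:.1f} ms frame period")
print(f"{imu_samples_per_frame:.0f} inertial samples per proximity frame")
```

Under these assumptions the sweep burst occupies 50 ms of each 66.7 ms frame period, and eight inertial samples fall within each proximity frame.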
Ten different target objects were used in the data collection protocol and each object was meant to trigger a certain grasp pattern (Figure 1, right). Specifically, abstract objects like a sphere, a cylinder, a triangular prism, a cuboid with a thin rectangular prism 'handle' and a thin rectangular prism were meant to trigger the spherical, power, tri-digit, lateral and pinch grip pattern, respectively. Each object was available in two materials, wood or aluminium. Objects dimensions, materials and weights are standardized within the Southampton Hand Assessment Procedure [41], a clinically validated functional assessment, well-established in the field of prosthetics.
The dataset included human-object interactions from four different scenarios with gradually increasing complexity (for explicative figures and further details check HANDdata dedicated publication [40]), namely:
1) Bench-static, ideal scenario in which the proximity sensors were fixed in a clamp and steadily facing the target objects at a fixed distance and position.
2) User-static, static scenario in which the participants were steadily standing in front of the target objects with the proximity sensors partially facing the targets.
3) Pick-and-Lift, dynamic scenario in which the participants were reaching the target, to pick it, lift it about 10 cm, and then reposition it on the same start-area.
4) Pick-Lift-and-Move, dynamic scenario in which participants were reaching the target, to pick it, lift it about 10 cm while transporting it to a land-area different from the start-area, 40 cm away. Start- and land-area were alternated at each trial ultimately providing data for pick-and-lift from two different approach directions.
For user-static and grasping scenarios, the participants were asked to start and end each trial with the arm in rest position. The rest position was defined as the upper arm adjacent to the body trunk, with the elbow joint bent at 90 degrees, and with the hand palm perpendicular to the floor. Moreover, the subject's alignment with respect to the target object changed depending on the scenario. For the user-static and pick-and-lift scenarios, the subject's arm in rest position was aligned on the single area of interest where the target objects were located (i.e., left instrumented platform). Instead, for the pick-lift-and-move scenario, the subject's arm in rest position was aligned with the center of the two start and land areas (i.e., in between the left and right instrumented platforms) causing the target objects to be poorly or not in view at movement start.
All target objects were acquired in all scenarios. Moreover, "no-object" acquisitions were included for the static scenarios. These were achieved for the bench-static scenario by having no target under the proximity sensors, and for the user-static by having no hit within 1 meter range. Unfortunately, HANDdata did not include "reaching to no target" trials for the grasping scenarios.

B. Deep Learning Models
Deep learning models were trained and tested so to assess the feasibility of shapes and materials recognition via proximity sensors. The models consisted of a convolutional neural network (CNN) for the RAD and TOF conditions, and a mixed-input neural network for the RADTOF condition. All models were implemented with the TensorFlow framework, and trained and tested via GPU (T4, Nvidia, USA). The models' architectures and hyperparameters were found via preliminary testing which also involved a two-step random search via Keras Tuner for number of layers and units, dropout, activations, and learning rate, over 100 trials with 3 executions each.
1) CNN: The CNNs consisted of sequential CPR blocks (Figure 2) including four layers (convolution 3x3 kernel + max-pooling + batch-normalization + leaky-ReLU activation). A regularization dropout layer with rate 0.25 was attached after the activation function in each CPR block. The number of CPR blocks was defined as three for the RAD condition and two for the TOF condition. For the RAD condition, the CNN's input was defined as 64x35 2-channel images composed of the real and imaginary parts of the 64-point Fast Fourier Transform of each radar frame along the sweeps dimension (i.e., doppler map). For the TOF condition, the CNN's input was defined as 8x8 1-channel images composed of the sensor's raw frame. For the CNNs' output, a final dense layer adapted each model to each particular classification problem.
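The radar pre-processing described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes each radar frame is available as a complex array of shape 40x35 (sweeps x range points), and that the 64-point transform is obtained by zero-padding the 40 sweeps, which is how `numpy.fft.fft` behaves when `n` exceeds the input length. The function name `doppler_map` is hypothetical.

```python
import numpy as np


def doppler_map(frame: np.ndarray, n_fft: int = 64) -> np.ndarray:
    """Turn one complex radar frame (sweeps, range_points) into a
    (n_fft, range_points, 2) real-valued image: real and imaginary
    parts of the FFT taken along the sweeps dimension.

    Assumption: the sweeps axis is zero-padded up to n_fft points.
    """
    frame = np.asarray(frame)
    if frame.ndim != 2:
        raise ValueError("expected a 2-D frame of shape (sweeps, range_points)")
    # FFT along axis 0 (sweeps); n > frame.shape[0] pads with zeros.
    spectrum = np.fft.fft(frame, n=n_fft, axis=0)
    # Stack real and imaginary parts as the two image channels.
    return np.stack([spectrum.real, spectrum.imag], axis=-1).astype(np.float32)


# Usage with a synthetic frame (not real sensor data).
rng = np.random.default_rng(0)
synthetic = rng.standard_normal((40, 35)) + 1j * rng.standard_normal((40, 35))
image = doppler_map(synthetic)
print(image.shape)
```

Windowing, frequency shifting, or background removal are deliberately left out, since the text does not specify them.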
2) Mixed-Input NN: For the RADTOF condition (i.e., the combination of radar and time-of-flight data), a mixed-input neural network was used (Figure 2) composed of two CNN input branches, respectively for RAD and TOF data, defined as above, followed by four linear layers with leaky-ReLU activation functions. Two dropout layers with rate 0.5 were attached to the second and to the fourth linear layer. For the output, a final dense layer adapted the model to each particular classification problem.
In order to further investigate the grasping scenarios, another data condition was defined with the addition of the inertial sensor. For this condition, namely RADTOFIMU, a third branch was added to the mixed-input neural network (Figure 2) composed of two linear layers with leaky-ReLU activation functions and dropout layers with rate 0.1. The input of this branch was defined as 6x1 (3-axes accelerometer and gyroscope) and fed with the samples of the inertial sensor which were closest in time to the radar and time-of-flight frames.
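Selecting the inertial sample closest in time to each proximity frame can be sketched as below. This is an illustrative implementation under our own assumptions: timestamps are expressed in seconds on a common clock, inertial timestamps are sorted in ascending order, and ties are resolved towards the earlier sample. The function name `nearest_imu_samples` is hypothetical.

```python
import numpy as np


def nearest_imu_samples(frame_times, imu_times, imu_data):
    """For each frame timestamp, return the inertial sample (row of
    imu_data, 6 values: 3-axes accelerometer + 3-axes gyroscope)
    whose timestamp is closest in time.

    Assumptions: imu_times is sorted and has at least two entries.
    """
    frame_times = np.asarray(frame_times, dtype=float)
    imu_times = np.asarray(imu_times, dtype=float)
    imu_data = np.asarray(imu_data, dtype=float)
    # Index of the first inertial sample at or after each frame time.
    right = np.searchsorted(imu_times, frame_times)
    right = np.clip(right, 1, len(imu_times) - 1)
    left = right - 1
    # Keep whichever neighbour is closer (earlier one on ties).
    pick_left = (frame_times - imu_times[left]) <= (imu_times[right] - frame_times)
    idx = np.where(pick_left, left, right)
    return imu_data[idx]


# Usage with synthetic timestamps: 120 Hz inertial data, 15 Hz frames.
imu_t = np.arange(120) / 120.0
imu_x = np.repeat(np.arange(120, dtype=float)[:, None], 6, axis=1)
frames_t = np.arange(15) / 15.0
aligned = nearest_imu_samples(frames_t, imu_t, imu_x)
print(aligned.shape)
```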

C. Training and Offline Testing
For each of the deep learning models, a similar procedure was followed for the training and testing. Each dataset was shuffled and split between train, validation and test data sets taking respectively 80%, 10% and 10% of the data, ensuring balance among the different classes' samples. Networks were trained with the Adam stochastic optimizer up to 200 epochs, with batch size of 32 and learning rate 0.001. Then, at the end of the training epochs the network was further tested on the test set data, from which the accuracy was computed. Test accuracy is reported as raw, thus no post-classification filtering was applied.
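A class-balanced 80/10/10 split of the kind described above can be sketched with NumPy alone. The exact shuffling and balancing procedure is not detailed in the text, so the per-class splitting, the rounding rule, and the name `stratified_split` are assumptions made for illustration.

```python
import numpy as np


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train_idx, val_idx, test_idx) index arrays such that
    every class contributes to each subset in the given proportions.

    Assumption: each class is shuffled and split independently, and
    any rounding remainder goes to the test subset.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    # Shuffle each subset so classes are interleaved.
    return tuple(rng.permutation(np.array(part, dtype=int))
                 for part in (train, val, test))


# Usage with synthetic labels: 10 classes, 100 samples each.
y = np.repeat(np.arange(10), 100)
train_idx, val_idx, test_idx = stratified_split(y)
print(len(train_idx), len(val_idx), len(test_idx))
```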
The data considered for the training and offline testing was:
- All available frames for bench- and user-static scenarios,
- Frames relevant to the reaching motion for the pick-and-lift grasping scenario, thus all frames included in the time range from −0.6 to −0.2 seconds before contact with the object (i.e., from −9 to −3 frames).
Importantly, no data from the pick-lift-and-move scenario was used for networks training nor for offline testing. We deemed it more relevant to the explorative narrative of this study to leave that scenario for the inference testing part to showcase worst-case performance.
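The correspondence between the time window and the frame window follows from the 15 frames-per-second rate (−0.6 s x 15 = −9 frames, −0.2 s x 15 = −3 frames). A minimal sketch of the frame selection is given below; the function name `reaching_window`, the inclusive bounds, and the clipping at the start of the recording are assumptions for illustration.

```python
def reaching_window(contact_frame, fps=15.0, t_start=-0.6, t_end=-0.2):
    """Return the frame indices between t_start and t_end seconds
    relative to the contact frame (both bounds inclusive).

    Assumption: indices before the start of the recording are dropped.
    """
    first = contact_frame + round(t_start * fps)  # -9 frames at 15 fps
    last = contact_frame + round(t_end * fps)     # -3 frames at 15 fps
    return list(range(max(first, 0), last + 1))


# Usage: contact detected at frame 30 of a trial.
print(reaching_window(30))
```

With these bounds each trial contributes up to seven frames to the reaching-phase data.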
The instant of contact with the object was approximated from the inertial sensor data as the time instant of the first maximum in the z-axis (i.e., the axis parallel to the reaching direction), thus the instant of zero in the acceleration towards the target object. A simple sanity check was performed on pick-and-lift data discarding trials that were too fast (1.6% discarded data, 45 out of 2900 trials).
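Locating the first maximum in a one-dimensional inertial trace can be sketched as follows. The text does not specify any smoothing or thresholding, so the optional `min_height` parameter (used here to skip small noise peaks) and the function name `first_local_maximum` are hypothetical additions.

```python
import numpy as np


def first_local_maximum(signal, min_height=None):
    """Return the index of the first local maximum of a 1-D signal,
    or None if there is none.

    Assumption: a sample is a local maximum when it is strictly
    greater than its predecessor and not smaller than its successor.
    min_height (hypothetical) discards peaks below a given value.
    """
    x = np.asarray(signal, dtype=float)
    for i in range(1, len(x) - 1):
        if x[i] > x[i - 1] and x[i] >= x[i + 1]:
            if min_height is None or x[i] >= min_height:
                return i
    return None


# Usage with a synthetic trace sampled at 120 samples per second.
trace = [0.0, 1.0, 3.0, 2.0, 5.0, 1.0]
peak = first_local_maximum(trace)
contact_time_s = peak / 120.0
print(peak, contact_time_s)
```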
For the user-static and grasping scenarios, the training and testing was performed considering:
- Aggregated data, thus from the 29 participants together,
- Individual data, thus only relevant to each participant.
Individual results are reported with the format MED:IQR (median: interquartile-range). Even though aggregating the participants' data was our main strategy (see inference testing), we intended to include here also subject-specific explorations so to provide an interesting glance on alternative approaches tailored on each end-user.
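The MED:IQR summary of per-participant accuracies can be computed as in the sketch below. The quartile convention (NumPy's default linear interpolation) is an assumption, since the text does not state which one was used, and the accuracy values are synthetic placeholders rather than results from the study.

```python
import numpy as np


def med_iqr(values):
    """Format a list of accuracies (%) as 'median:interquartile-range%'.

    Assumption: quartiles use NumPy's default linear interpolation.
    """
    v = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    return f"{med:.2f}:{q3 - q1:.2f}%"


# Usage with synthetic per-participant accuracies (not study results).
print(med_iqr([80, 85, 90, 95, 100]))
```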
The classification problems were defined according to the scenario of interest so to showcase the potential for
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Materials recognition was explicitly defined only for bench-static; however, it was implicitly reported also for the other scenarios by the "which object?" classification problems. We considered it more relevant to the explorative narrative to provide results about this worst-case. It is important to note that some explorations (i.e., "which grasp?" classification problems) ultimately aimed to demonstrate potential for the autonomous selection of the most adequate hand prosthesis grasp for manipulating a certain object. To this aim, we intended to keep these explorations tied to a realistic scenario of a modern multi-functional prosthetic hand with a limited selection of grasps, namely power and pinch grasps, a common clinical setup for prosthesis users.
A last "no-object" class was always added to classification problems related to the static scenarios of bench- and user-static, aiming to explore the capability to recognize the no-target situation in static conditions. Such ability could arguably stabilize the autonomous control and avoid misclassifications before any reaching motion would take place. Moreover, considering the preliminary investigation and demonstrative purposes of this study, some analyses regarding the user-static and grasping scenarios were performed considering only the objects made of metal (TABLE I). This was deemed necessary due to the limited permittivity of the wood (i.e., high transparency to the radar) so to limit inherent obvious advantages between different data conditions and focus the exploration on object recognition during motion.

D. Inference: Testing With New Collected Data
All models trained with the aggregated data from all 29 participants were further explored via inference testing, thus by measuring classification accuracy on unseen data not part of the original dataset. For this, new data was collected according to the particular scenario of interest trying to replicate as closely as possible the original data acquisition protocol [40]. In particular, for inference testing on the bench-static scenario, new data was acquired for each object for two different orientations (1st and 2nd from the bench-static protocol), for about 6 seconds each. Then, for inference testing on user-static, grasping pick-and-lift and pick-lift-and-move scenarios, a new data collection was performed with a participant that was already part of the original dataset. However, CyberGlove and platforms data were not acquired as in the original protocol.
The participant signed informed consent for data and media acquisition and public release. The ethical approval was provided by the Ethical Committee of the Scuola Superiore Sant'Anna (ref. 12/2022).
The actual inference testing was performed on MATLAB (Mathworks, USA) by evaluating the TensorFlow nets exported in *.h5 format on the freshly acquired data. Specifically, the data considered for the inference testing was:
- Last 60 acquired frames (about 4 seconds) for each object for bench- and user-static scenarios,
- All frames included in the time range from −0.6 to −0.2 seconds before contact with the object (i.e., from −9 to −3 frames) for pick-and-lift and pick-lift-and-move grasping scenarios.
The classification problems were as defined in TABLE I. Only for the grasping pick-and-lift and pick-lift-and-move scenarios a mode post-classification filter was applied, therefore considering as final prediction the most frequent predicted class from the analyzed frames related to each different trial.
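The mode post-classification filter described above reduces the per-frame predictions of a trial to a single label. A minimal Python sketch is given below (the study itself ran inference in MATLAB). The tie-breaking rule, in favour of the class encountered first, follows from `collections.Counter.most_common` and is an assumption, as the text does not specify one; the function name `mode_filter` is hypothetical.

```python
from collections import Counter


def mode_filter(frame_predictions):
    """Return the most frequent predicted class among the frames of
    one trial.

    Assumption: ties are resolved in favour of the class that appears
    first in the sequence.
    """
    if not frame_predictions:
        raise ValueError("at least one frame prediction is required")
    return Counter(frame_predictions).most_common(1)[0][0]


# Usage: seven frames of one reaching trial (synthetic predictions).
trial = ["power", "pinch", "power", "power", "pinch", "power", "power"]
print(mode_filter(trial))
```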

E. Statistical Analysis
A correlation analysis was performed between participants' height and resulting accuracies in the user-static scenario (corrcoef function available on MATLAB). No further statistical analysis was performed on the data.
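For readers working outside MATLAB, the Pearson correlation coefficient can be obtained with `numpy.corrcoef`, as sketched below. The height and accuracy values are synthetic placeholders chosen only to illustrate a negative trend; they are not data from the study. Note that `numpy.corrcoef` returns only the coefficient matrix, so a p-value would require a separate statistical routine.

```python
import numpy as np

# Synthetic placeholder values (not study data).
heights_cm = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
accuracies = np.array([95.0, 93.0, 90.0, 88.0, 85.0])

# Off-diagonal entry of the 2x2 correlation matrix is Pearson's r.
r = np.corrcoef(heights_cm, accuracies)[0, 1]
print(f"Pearson r = {r:.3f}")
```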

III. RESULTS
A. Offline Analysis
1) Bench-Static: Can We Recognize the Object on the Bench?: An illustrative sample of the proximity sensors images provided to the CNNs for the bench-static scenario is shown in Figure 3.
Results from the bench-static scenario (Figure 4) showed perfect accuracies for all objects (100%), shapes (100%) and materials (100%) recognition with the radar data. Accuracies were lower with the TOF data for all objects (81.92%), shapes (69.43%) and materials (92.86%) recognition. Interestingly, the combination of radar and TOF data allowed top accuracies as the ones achieved with the radar data, resulting in 100% accuracy for all objects, shapes and materials recognition.
2) User-Static: Can We Recognize the Object From the User?: High recognition accuracies were found also in the user-static scenario when considering all participants together and RAD and RADTOF conditions (Figure 4). Indeed, when using radar data alone the offline accuracies were 95.42%, 98.62% and 97.34% for all objects, shapes and grasps recognition, respectively. For RADTOF, accuracies were 95.12%, 99.17% and 99.36% for objects, shapes and grasps. Lastly for TOF, accuracies were 59.05%, 75.41% and 90.22% for all objects, shapes and grasps recognition, respectively.
In general, the TOF sensor proved to be more sensitive to the less optimal centering of the object in its field of view, demonstrated also by the significant negative correlation between participants' height and resulting accuracies (p=0.03).
3) Pick-and-Lift: Can We Recognize the Object While Moving Towards It?: Accuracies certainly dropped when analyzing the dynamic scenario of the user reaching to grasp the object. Even though results presented quite some variability, they show a trend of TOF outperforming the other data conditions in both aggregated and individual analyses (Figure 4). Specifically about the aggregated analysis, when using time-of-flight data alone the accuracies were 74.81%, 92.10% and 96.97% for all objects, shapes and grasps recognition, respectively. When using radar data alone the accuracies were considerably lower, at 36.21%, 62.58% and 66.67% for all objects, shapes and grasps. Recognition accuracies benefitted from considering more sensors together. Indeed, for RADTOF accuracies were 77.14%, 62.15% and 92.90% for all objects, shapes and grasps. Lastly, the addition of inertial sensor data (i.e., RADTOFIMU) allowed accuracies of 57.87%, 81.91% and 95.84%.
Similar findings were reached also for the individual analysis, with TOF overall leading the recognition performances, with RAD quite unreliable, and with the accuracies from all data conditions converging to high values when considering a simple 2-classes problem of power grasp versus precision pinch grasp. Specifically, for TOF average accuracies were 86.76:9.04%, 93.55:14.71% and 91.89:10.91% for all objects, shapes and grasps recognition, respectively. For RAD data condition, accuracies were 41.18:33.17%, 63.64:52.59% and 71.79:20.99% for objects, shapes and grasps. When it comes to the mixed-input neural networks, RADTOF accuracies were 82.81:15.80%, 45.71:47.22% and 91.43:27.68%, while RADTOFIMU accuracies were 85.48:18.01%, 57.58:34.44% and 85.37:14.74% for all objects, shapes and grasps recognition, respectively.

B. Inference Analysis on New Data
All models trained with the aggregated data from all 29 participants were further explored via inference testing, thus by measuring classification accuracy on unseen data not part of the original dataset. Results for all data conditions, scenarios and classification problems are reported in TABLE II. A summary of the most promising results for each scenario and classification problem is depicted in Figure 5 via confusion matrices. As expected, the bench-static scenario proved to be the easiest to solve, with RADTOF being the condition which most closely followed the accuracies seen during offline testing. Here, besides showing misclassifications between the wooden sphere and cylinder and between the wooden prisms, resulting accuracies were relatively high even when dealing with all objects available placed with different orientations (Video 1). Instead, user-static proved to be the most challenging scenario, in which a poorly controlled alignment with the target object can considerably deteriorate recognition. Indeed, classifications were quite random when trying to recognize all objects, and misclassifications were mostly related to the small prisms and the large cuboid and sphere when trying to recognize the shapes. Nevertheless, the RAD condition allowed fair performance when the classification problem was simplified to choosing among only two hand grasps, power or precision grasp. The inference accuracies for the grasping pick-and-lift scenario were quite unexpected. Here, results showed promising performance for the recognition of shapes and grasps even when sensors were moving towards the target, reaching classification accuracies as high as 94% with the RADTOF condition on the 2-classes problem. The inclusion of inertial sensor data proved to be beneficial when matching the classification problem to the number of different reaching trajectories of each shape, with misclassifications mostly related to the small prisms.
1) Pick-Lift-and-Move: What Happens If We Include More Directions of Approach?: The pick-lift-and-move trials included an increasing level of complexity, namely a different and more disadvantageous starting position as well as two more directions of approach towards the target object, rightwards with the target poorly in view at start and leftwards with the target not in view at start. The same models trained with the aggregated data from all 29 participants were further inference tested on this unseen and diverse data. Results are reported in TABLE III. Overall and as expected, accuracies dropped from the pick-and-lift inference tests, particularly for all neural networks which depended on time-of-flight sensor data. Highest accuracy was 68% and reached by RAD CNN in the 2-classes problem. Moreover, solving the recognition problem for ten different objects with two shapes and two materials seemed just unfeasible for all data conditions.

IV. DISCUSSION
The feasibility analyses reported in this study seem to confirm the original hypothesis of state-of-the-art proximity sensors being a possible alternative for shapes and materials recognition of objects. Indeed, data from a modern integrated radar and a time-of-flight 8x8 pixels camera proved to be sufficient for the reliable and repeatable recognition of ten different objects in an ideal static scenario (Video 1), thus with the targets steady and located within the sensors' field of view. Moreover, it was investigated how gradually increasing the complexity of this ideal scenario would change the recognition accuracy. Specifically, promising results were found also in a less ideal but still static scenario of subjects standing in front of the target object with limited control on the sensors' field of view centering and alignment with respect to the targets. Here, data from a single radar sensor was sufficient to achieve fair performance in recognising the most indicated hand grasp to use before starting the reaching motion towards the object. Against expectations, such radar sensor (and its combination with other sensors) also showed promising results when facing dynamic scenarios, thus showing the potential to recognize the target object even during reaching motion.
Interestingly, the bench-static setup unveiled the potential of using a radar sensor for the recognition of materials. Even if assessed briefly and with two very different materials, these results are in harmony with previous literature [33], showing a great potential for radar sensors to characterize complex objects based on their permittivity. Such considerations are to be taken into account for the design of modern intelligent prosthetic hands, likely based on a mixed-sensors system where, hypothetically, materials can be recognized in a later stage of the grasping action so to allow minor grip adjustments and prevent object slippage.
As seen in the inference tests, the trained networks responded differently to unseen data in different scenarios and classification problems. Due to the limited testing it is therefore hard to make any claim on the data generalizability. While aggregating the data from multiple participants seems to be the way to go, it might be also beneficial, from an actual clinical translation point of view, to better tailor the data collection and network training to the individual characteristics of the participant of interest (i.e., the end-user), probably also including different object reaching directions. The negative correlation found between participants' height and resulting accuracies would suggest that further sanity checks on the data might be needed in order to reduce unhelpful data variability.

TABLE III INFERENCE ACCURACIES (%) FOR PICK-LIFT-AND-MOVE
This study aimed to compare two state-of-the-art proximity sensors in order to understand their respective potential for the given problem. We also intended to verify the original hypothesis that a combination of these sensors can be of value. Overall, the data from the radar sensor was better suited to the different problems in all scenarios, sometimes even outperforming the combination with inertial and TOF sensors. However, combining radar and time-of-flight data proved successful in an ideal bench-static setup. Conversely, the time-of-flight sensor suffered heavily from misalignments with the target objects, in line with its narrower field of view compared to the radar (45x45° vs 65x53°), but its data became increasingly relevant as the sensor approached the target. Additionally, the radar performance might be improved with different data acquisition settings (e.g., different range points and sweep settings) and different processing (e.g., alternatives to common Doppler maps, advanced background removal techniques). Nevertheless, a combination of the two proximity sensors might still be the way forward, perhaps by optimizing the deep learning models so as to reduce the impact of a given sensor on the final prediction according to the scenario of interest.
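One simple way to reduce the impact of a sensor according to the scenario is a weighted late fusion of per-sensor class probabilities. The sketch below is only an illustration of that idea and is not the mixed-input network used in this study; the function names, the sensor keys and the example weights are hypothetical.

```python
def fuse_predictions(probs_by_sensor, weights):
    """Weighted average of per-sensor class probabilities (late fusion).

    probs_by_sensor: dict mapping a sensor name (e.g. "RAD", "TOF") to a
        list of class probabilities; all lists must share the same length.
    weights: dict mapping the same sensor names to non-negative weights.
        These are hypothetical, scenario-dependent values.
    """
    sensors = list(probs_by_sensor)
    n_classes = len(probs_by_sensor[sensors[0]])
    total_weight = sum(weights[s] for s in sensors)
    if total_weight <= 0:
        raise ValueError("at least one sensor weight must be positive")
    fused = [0.0] * n_classes
    for s in sensors:
        if len(probs_by_sensor[s]) != n_classes:
            raise ValueError("all sensors must report the same classes")
        for i, p in enumerate(probs_by_sensor[s]):
            fused[i] += weights[s] * p / total_weight
    return fused


def predict_class(probs_by_sensor, weights, classes):
    """Return the most likely class and its fused probability."""
    fused = fuse_predictions(probs_by_sensor, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return classes[best], fused[best]
```

Under this assumption, a user-static scenario could weight the radar more heavily, while the time-of-flight weight could grow as the sensor approaches the target, in line with the behaviour discussed above.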
While it is clear that introducing a less controlled alignment within the sensors' field of view, as well as diverse data from dynamic situations (such as reaching and grasping the objects from different directions), can considerably affect the final accuracy, the results are still remarkable considering the simplicity of these sensors compared to those normally used in similar research, such as RGB-D cameras. Such considerations can have implications for the development of future autonomous robotic grippers, especially prosthetic hands. First, proximity sensors can be a valid alternative to more power- and computation-demanding camera-based systems for recognizing objects or their basic grasp-relevant characteristics. Moreover, such recognition seems feasible both before and after starting the approach movement towards the target, provided that the objects are within the sensors' field of view. The latter finding opens up the possibility of performing and/or updating the intended grasp on the robotic hand while the amputee user moves towards the target, an achievement that is still unprecedented in the field. Ultimately, we envision a system in which complex target recognition is achieved in three phases: 1) during the static phase, to allow a correct preshape of the prosthesis; 2) during the reaching phase, to allow corrections of the grasp or, alternatively, to alert the user of a potential misclassification; and 3) during the target-is-grasped phase, to allow recognition of further details of the object and consequent grip adjustments. It remains of primary interest to understand how each of these phases would affect the naturalness of prosthesis use and thus the user's final acceptance.
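The three-phase concept can be sketched as a small decision function. This is a minimal illustration under stated assumptions and does not describe an implemented controller: the phase names follow the text, whereas the action labels, the confidence input and the 0.8 threshold are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    STATIC = "static"
    REACHING = "reaching"
    GRASPED = "target-is-grasped"


def next_action(phase, predicted_grasp, current_grasp, confidence, threshold=0.8):
    """Return an (action, grasp) pair for one control step.

    The threshold is a hypothetical value separating a grasp correction
    from a simple alert to the user during the reaching phase.
    """
    if phase is Phase.STATIC:
        # Phase 1: preshape the prosthesis before the reach starts.
        return ("preshape", predicted_grasp)
    if phase is Phase.REACHING:
        # Phase 2: keep, correct, or flag a potential misclassification.
        if predicted_grasp == current_grasp:
            return ("keep", current_grasp)
        if confidence >= threshold:
            return ("correct", predicted_grasp)
        return ("alert_user", current_grasp)
    # Phase 3: the object is held; only grip adjustments are allowed.
    return ("adjust_grip", current_grasp)
```

For example, a confident disagreement during reaching would trigger a correction of the grasp, whereas an uncertain one would only alert the user and leave the current grasp unchanged.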

A. Limitations
To the best of our knowledge, this study offers the first exploration of radar and time-of-flight proximity sensors for object property recognition and grasp facilitation in autonomous prosthetic hands. However, we report here only offline analyses. Even though the inference tests included here are of great help in assessing the networks' translational potential, such results are limited and should be regarded as a feasibility check; as already seen with other technologies, system performance can change drastically during real-time tests. For this reason, we intend to continue optimizing the deep learning models, port them from the computer to a wearable format, and then test the approach in real time with an actual robotic prosthetic hand.
This study analyzed a limited set of objects and materials, so it is still unclear how the results transfer to a realistic scenario of use with a prosthetic hand. Nevertheless, we argue that the target objects used here are already fairly representative of certain daily-life tasks, because they were purposely selected for the Southampton Hand Assessment Procedure, a well-known, clinically validated functional assessment for robotic hands intended for prosthetic purposes.
The "no-object" class poses further challenges in the user-static scenario (i.e., when trying to recognize an object before actually starting the reaching motion). Here, the no-object class for user-static was simply represented by ideal data acquired by the proximity sensors while pointing into empty space, with no hit within a 1-meter range. However, such ideal data might be hard to acquire in a real implementation, with the amputee hovering the prosthesis over a desk or towards a group of objects. The risk of frustrating misclassifications is therefore higher in a realistic scenario.
Even though shape recognition proved feasible while approaching the target object, no hand prosthesis currently on the market can change its finger posture into a grasp in less than half a second. Thus, with current robotic technology, amputee users would need to reach the target more slowly than able-bodied individuals. Nevertheless, the idea and goal remain of interest because they would allow the hand prosthesis to better mimic its biological counterpart.
Architecture and hyperparameter optimization of the deep learning models used here was limited. Preliminary exploratory iterations were conducted to find a basic network architecture and a hyperparameter set that could suit all data conditions, after which the networks were not further optimized during the analyses. However, there is large potential for improvement through more extensive, systematic optimization tailored to the situation of interest (i.e., which sensor, static or dynamic, which classification problem). This route will be explored further in the next steps of the project.

V. CONCLUSION
In this study, we explored for the first time the use of state-of-the-art proximity sensors for object grasp facilitation in autonomous prosthetic hands. To this aim, an extensive human-object interaction dataset was analyzed via deep learning models trained to classify the different target objects. The results showed promising potential for grasp classification (i.e., selecting the adequate hand grasp to be used) before and also during the reach-to-grasp motion. They suggest modern, low-power radar as a potential key technology for next-generation intelligent and autonomous robotic hands. In particular, these results seek to contribute to the development of alternative control approaches for prosthetic hands that are less dependent on the conventional but notoriously unreliable electromyographic human-machine interface. Autonomous prosthetic hands can be a game changer: we envision a complex system that can interact intelligently with the user and any object.

Fig. 1. Instrumented glove (left) and target objects (right). The dataset analyzed in this study includes human-object interaction data acquired from 29 participants manipulating 10 different objects while wearing an instrumented glove. The instrumented glove (left) was based on a CyberGlove to which proximity and inertial sensors were added via a custom circuit board mounted on a customizable wrist band. Ten different target objects were used (right), each linked to a particular hand grasp. Abstract objects like a sphere, a cylinder, a triangular prism, a cuboid with a thin rectangular prism 'handle' and a thin rectangular prism were meant to trigger the spherical, power, tri-digit, lateral and pinch grip patterns, respectively.

Fig. 2. Illustration of the architecture of the mixed-input neural network used for the combination of radar (RAD), time-of-flight (TOF) and inertial sensors (i.e., RADTOFIMU). The single RAD and TOF CNNs are depicted in their respective branch of the mixed-input net.

Fig. 3. Images from the proximity sensors data. Visualization of the data acquired from the radar and time-of-flight sensors while steadily facing the different target objects. For illustration purposes, radar images were plotted as RGB images with blue channel set to zero, and time-of-flight images were plotted as pseudo-colour images with "viridis" colour mapping. Lastly, all images were smoothed via Gaussian interpolation.

Fig. 4. Offline test accuracies for the different scenarios, data conditions and classification problems. The data conditions were defined as RAD when considering radar data alone, as TOF when considering time-of-flight data alone, as RADTOF when considering a combination of radar and time-of-flight, and as RADTOFIMU when considering also the inertial sensor. The scenarios were bench-static (top-left), user-static (top-right) and pick-and-lift (bottom). The different classification problems are explained in the legend of the spider plots.

Fig. 5. Confusion matrices of the most promising results from the inference test. The matrices visualize classifications for the three scenarios (columns) and for the different classification problems (rows). The different problems are indicated with different colours (legend as in Fig. 4), while classes are indicated with initials. For the shapes, sphere = S, cylinder = C, triangular prism = T, cuboid with a thin rectangular prism 'handle' = B, thin rectangular prism = W, no-object = NO. For the different materials, metal = M and wood = W. Lastly, for the grasps, power = PO, precision pinch = PI.
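For readers less familiar with confusion matrices, the sketch below shows how such a matrix and the corresponding accuracy can be computed from true and predicted class labels. It is a minimal illustration only: the shape initials come from the caption above, while the function names and the row/column convention (rows as true classes, columns as predicted classes) are assumptions and may differ from the convention used in the figure.

```python
# Shape class initials as listed in the caption of Fig. 5.
SHAPE_CLASSES = ["S", "C", "T", "B", "W", "NO"]


def confusion_matrix(true_labels, predicted_labels, classes):
    """Count matrix with rows = true class and columns = predicted class.

    The orientation is an assumption made for this sketch.
    """
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct classifications)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0
```

Off-diagonal entries then indicate which classes are mistaken for one another, which is the information that the matrices in Fig. 5 visualize.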

TABLE I CLASSIFICATION PROBLEMS DEFINED FOR TRAINING, OFFLINE AND INFERENCE TESTING
recognizing different objects and object properties in static situations or even during the reaching motion (TABLE I).