A Dexterous Hand-Arm Teleoperation System Based on Hand Pose Estimation and Active Vision

Markerless vision-based teleoperation that leverages innovations in computer vision offers the advantage of natural and noninvasive finger motions for multifingered robot hands. However, current pose estimation methods still suffer from inaccuracy due to the self-occlusion of the fingers. Herein, we develop a novel vision-based hand-arm teleoperation system that captures the human hand from the best viewpoint and at a suitable distance. This teleoperation system consists of an end-to-end hand pose regression network and a controlled active vision system. The end-to-end pose regression network, Transteleop, trained with an auxiliary reconstruction loss, takes depth images of the human hand from a low-cost depth camera and predicts joint commands of the robot based on an image-to-image translation method. To obtain the optimal observation of the human hand, an active vision system, implemented with a robot arm at the local site, ensures high estimation accuracy for the proposed neural network. Human arm motions are simultaneously mapped to the slave robot arm under relative control. Quantitative network evaluation and a variety of complex manipulation tasks, for example, tower building, pouring, and multitable cup stacking, demonstrate the practicality and stability of the proposed teleoperation system.

Thanks to advances in hand pose estimation and gesture classification in computer vision, markerless vision-based teleoperation has made great progress in recent years [4], [5]. It provides a natural and efficient way to teleoperate robots, especially anthropomorphic hands.
In prior markerless vision-based teleoperation methods, human hand pose estimation is usually followed by a kinematic retargeting process. In contrast to these works, an end-to-end regression model that takes human hand images as inputs and predicts the robot joint commands bypasses the intermediate process and directly targets the robot system. The end-to-end model is also more intuitive for novice demonstrators and saves post-processing time in practice. A primary concern of the end-to-end approach is exploring the rich common features between the two image domains (human and robot hands) to learn the kinematic mapping from the human hand to the robot hand. Recently, image-to-image translation has become a prevalent method for discovering hidden mapping features between two representations in robotic imitation learning and style transfer [6], [7]. Thus, it is promising to improve cross-domain prediction accuracy by introducing an image-to-image translation mechanism into hand pose estimation models.
However, these vision-based pose estimation algorithms still suffer from inaccuracy due to the self-occlusion of the fingers, especially when the visual data are provided by a single, fixed camera. On average, the mean errors of state-of-the-art hand pose estimation algorithms using a single camera are less than 10 mm, but only when the angle between the camera direction and the human hand is less than 30° [8], [9]. A well-calibrated multicamera system is commonly used to perceive more information, but it cannot avoid extreme viewpoints of the target objects and always encounters issues such as time synchronization and long processing times [10]. Another option is to ensure that the camera always captures the human hand from the best viewpoint and at an optimal distance, capitalizing on the simpler pose estimation enabled by the disocclusion of the fingers. Active vision systems update the camera viewpoint to gain better information [11]. To thoroughly solve the limited-viewpoint issue in vision-based teleoperation, developing an active vision system at the local site would therefore be beneficial.
In this article, we devise a markerless vision-based hand-arm teleoperation system which consists of a hand pose estimation method, Transteleop, and a real-time active vision system (see Fig. 1).
The initial results of the hand pose estimation method have been partly reported in [12]. This work explains the Transteleop model in more detail and builds a new pairwise human-robot dataset in which both hands are in the same orientation. Furthermore, owing to the active vision setup, the human-in-the-loop teleoperation system is capable of accomplishing more manipulation tasks by keeping the hand in the best field of view at all times. To summarize, our primary contributions are as follows.
1) We develop an end-to-end robot hand pose estimation model (Transteleop) based on the image-to-image translation method. Trained on a self-built human-robot pairwise dataset, Transteleop predicts the joint commands of the robot and translates human hand images into synthesized robot images, promoting a better perception of pose features.
2) We set up an active vision system to thoroughly address the restricted field-of-view issue in vision-based teleoperation. This system ensures that the camera always captures the human hand from the best viewpoint and at an optimal distance.
3) We prove the reliability and practicality of the proposed teleoperation system by network evaluation, trajectory analysis, and nontrivial robot experiments, including pick and place, tower building, pouring, sweeping, midi mixer fader sliding, and multitable cup stacking, across two trained demonstrators.

II. RELATED WORK

A. Robotic Teleoperation
Robotic teleoperation systems conceptually consist of two sites: 1) the local site and 2) the remote site. The local site hosts the teleoperator and multiple devices used to measure or support the human's movements and to display the real-time status of the remote site. The remote site comprises the robots, supporting sensors, and manipulated objects. Receiving the commands from the human and the sensor perception, the robots then perform the manipulation tasks.
The measuring devices at the local site fall into two main categories: 1) contacting/wearable devices and 2) markerless devices. Contacting/wearable devices, such as joysticks, marker-based, inertial measurement unit (IMU)-based, or electromyography (EMG)-based data gloves or wearable suits, and haptic devices [13], have long been used in robotic teleoperation. From simple joysticks to force-reflection joysticks, most are used for controlling robots with limited motion types, such as unmanned aerial vehicles and mobile bases [14]. Marker-, IMU-, and EMG-based wearable suits, such as gloves and clothes, usually require exact calibration and customization to achieve accurate control, except for band-type devices such as MYO armbands [15]. In practice, IMU- [16] and EMG-based [17] devices are convenient to set up and efficient in controlling manipulators with multiple degrees of freedom (DoF). Zhang et al. [18] presented an intuitive teleoperation system using an EMG armband for controlling a prosthetic hand while using IMU sensors for operating a Universal Robots UR10 arm. However, the forearm's EMG signals are decoded to classify only two hand motions (open and grasp), so the multifingered hand is merely capable of simple grasping tasks. In terms of dexterous manipulation, the other obvious drawback of wearable devices, especially glove-based sensors, is the obstruction of natural human motions. With the rapid development of virtual/mixed/augmented reality (VR/MR/AR) devices, VR-, MR-, and AR-based teleoperation has been gaining considerable attention in robotics [19], [20], [21], owing to the benefits of immersive interaction and enhanced perceptual information.
Compared to wearable devices, contactless devices have the advantages of allowing natural, unrestricted body motions and different teleoperators, and of being less invasive [22], [23]. The most common contactless devices in teleoperation are low-cost RGB cameras. Making use of human body tracking and hand pose estimation algorithms, markerless vision-based teleoperation has been studied for controlling humanoid robots or dexterous robotic hands [24]. Many works separately investigate the visual perception of human bodies (e.g., human gesture classification or human hand pose estimation) and robot control (e.g., specific motions or kinematic retargeting) [25], [26]. Kinematic retargeting takes the body detection results from visual perception algorithms and generates robot commands in joint space [27]. Recently, Handa et al. [5] built a PointNet++-inspired hand pose estimation model and a fingertip-prioritized kinematic retargeting method for a 23 DoF hand-arm teleoperation system. This system achieved impressive results in dexterous manipulation, for example, block stacking and cup insertion, but the human workspace fully determines the robot workspace. To improve the efficiency and intuitiveness of the teleoperation system, instead of two-stage visual teleoperation, Fang et al. [28] proposed a human-robot posture-consistent end-to-end neural network for teleoperating a 7 DoF Baxter arm. The network comprises skeleton point estimation, robot arm posture estimation, and robot joint angle generation. However, only some arm imitation experiments were demonstrated in simulation. In our previous work, Li et al. [29] presented an end-to-end network, called TeachNet, which exploits the geometrical resemblance between human hands and the robot hand through a consistency loss, but only simplistic grasping experiments on the real robot were demonstrated. Later, we further discovered shared features between human hands and the robot hand based on the image-to-image translation method and built a mobile hand-arm robotic teleoperation system by combining an IMU device and a 3-D-printed camera holder [12]. Nevertheless, the 248-g, 35-cm camera holder has to be worn on the human's forearm, undoubtedly placing an extra physical burden on the teleoperator.

B. Active Vision
Active vision has been widely used in object tracking [30], robotic grasping [31], human-robot interaction [32], and simultaneous localization and mapping (SLAM) [33]. It aims to select the attention and image viewpoint by moving the vision sensor to an optimal pose that facilitates the associated application. In a common form of active vision system, the vision sensor is either mounted on the end effector of a manipulator as a hand-eye system or on a pan-tilt robot. Calli et al. [34] utilized the curvature information from the silhouette of unknown objects to update the robot pose with active vision for obtaining favorable grasping configurations. Recently, some works optimized the camera viewpoint based on reinforcement learning techniques for grasping pose generation and robotic pushing tasks [35], [36]. In the human-robot interaction scenario, active vision usually strengthens the robot's ability to detect a human's presence and interpret their motion or emotions. Latif et al. [37] proposed the eye-gaze tracking interface TeleGaze to teleoperate mobile robots based on the visual information from two pan-tilt-zoom cameras on the robots. To improve the operational performance and increase the immersive feeling, Huang et al. [38] established an active vision system based on a pan-tilt-zoom video camera to automatically track the target in a space robot teleoperation task. Instead of applying active vision to observe the remote site, in this article we investigate how to build a controlled active vision setup at the operator site to capture the human hand from favorable views.

III. HARDWARE SETUP
Our goal is to build an agile vision-based teleoperation system in which the teleoperator performs natural finger motions and unrestricted arm actions for a series of manipulation tasks that can be performed in an unlimited workspace. The hardware setup is shown in Fig. 2.
The local site setup [see Fig. 2(b)] consists of a 6 DoF UR5 collaborative robot arm with a RealSense SR300 depth sensor, a PhaseSpace motion tracking system, a 3-D-printed lightweight LED wrist marker, two monitors, and two deadman's switches. The human teleoperator stands in front of the UR5 robot at a safe distance, while the UR5 robot arm, which possesses a certified safety system, autonomously tracks the human's right hand. For finger tracking, the depth sensor is mounted on the end-effector of the robot arm to capture depth images of the human hand. For wrist tracking, the PhaseSpace motion tracking system (320 Hz) estimates the 6-D pose of the right human hand based on the wrist marker attached to the back of the hand.
Two monitors are used to visualize the real-time status of the remote site and the depth images of the human hand. In addition to the two visual streams of the robot state from the top view and the right view, real-time force feedback on the five robot fingertips is represented by five cylinders whose heights change with the magnitude of the measured force.
At the remote site, the slave robot is a PR2 robot [39] with a 19 DoF Shadow robot hand [40] mounted on its 5 DoF right arm. A Kinect2 RGBD camera mounted on the PR2 head and a webcam located to the right of the PR2 capture the robot's performance from the top and side viewpoints. In addition, five Syntouch Biotac tactile sensors [41] are retrofitted at the fingertips of the Shadow hand. In our setup, the human stands in front of the UR5 robot and can only see the PR2 on the visual displays.
This hardware setup works across three computers on the same local-area network, and the data are communicated between these computers via the robot operating system (ROS). The primary ROS topics on each computer and the data communication are depicted in Fig. 3. PC1 and PC2 belong to the local site, while PC3 is at the remote site. PC2 and PC3 control the UR5 robot and the PR2 robot, respectively. One ROS master runs at each site. PC2 publishes the 6-D global hand poses, and PC3 generates real-time trajectory commands for the right PR2 arm based on the global hand poses. Moreover, PC3 generates the Shadow hand's joint commands based on the human hand images, and the Shadow hand then imitates the human hand gestures at the remote site. PC1 is used for feedback visualization and for controlling the SR300 depth camera. Thanks to the master_discovery_fkie ROS package, selected essential ROS messages, such as the 6-D human hand pose, the robot hand commands, and the sensing feedback, are synchronized between both sites.
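As a minimal illustration of how the local-site pose stream could be published for the remote master to consume, the sketch below uses rospy; the topic and frame names are hypothetical and not taken from the article, and the PhaseSpace driver call is only indicated by a comment.

```python
#!/usr/bin/env python
# Minimal sketch (not the authors' code): publish the tracked 6-D wrist pose on PC2
# so that PC3 can generate arm trajectories. In the real system, master_discovery_fkie
# synchronizes the two ROS masters so this topic is visible at the remote site.
import rospy
from geometry_msgs.msg import PoseStamped


def publish_wrist_pose():
    rospy.init_node("wrist_pose_publisher")
    pub = rospy.Publisher("/teleop/wrist_pose", PoseStamped, queue_size=1)  # hypothetical topic
    rate = rospy.Rate(320)  # PhaseSpace runs at 320 Hz
    while not rospy.is_shutdown():
        msg = PoseStamped()
        msg.header.stamp = rospy.Time.now()
        msg.header.frame_id = "phasespace_world"  # hypothetical frame name
        # msg.pose would be filled here from the PhaseSpace marker driver.
        pub.publish(msg)
        rate.sleep()


if __name__ == "__main__":
    publish_wrist_pose()
```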

A. Transteleop
We aim to estimate the joint angles of the robot hand from the human hand depth image I_H captured by the tracking system. Although the Shadow hand is designed to match the mechanisms of an adult hand, dexterous teleoperation requires an accurate mapping from the operator's hand to the robot. Because of the cross-domain gap between the robot hand and the human hand, how to acquire instructive and shared hand features H_share, such as the skeletal shape and the entire silhouette, from these two domains dominates this regression problem. We believe it is favorable to predict from the shared pose feature H_share rather than from the bare image I_H. In order to attain an instructive feature representation Z_pose, we adopt a generative structure that maps the human hand image I_H to the robot hand image I_R and retrieves the pose feature H_share from the bottleneck layer. Although conditional GANs have led to a substantial boost in the quality of image generation, the discriminator only pursues high realism of the reconstructed images and does not fully concentrate on the pose features of the input. Alternatively, autoencoders are known to learn efficient data codings in an unsupervised manner and are widely used in image-to-image translation applications as well. Therefore, we propose an encoder-decoder style image-to-image translation method (Transteleop) for extracting the shared hand feature H_share. This learning scheme is defined in (1).

The deep network architecture of Transteleop is shown in Fig. 4. Transteleop boils down to four modules: 1) an encoder module; 2) an embedding module; 3) a decoder module; and 4) a joint module. The encoder-embedding-decoder chain takes a depth image of a human hand I_H and reconstructs a depth image of the robot hand Î_R at the same hand pose. The embedding layers connect the encoder and decoder submodules and embody the shared pose feature H_share. Note that all layers in the embedding module are fully connected, because a fully connected layer connects each unit to every activation of the previous layer, whereas a convolutional layer has only a limited receptive field. In the image-to-image translation field, the L1 loss is found to produce a rough outline of the predicted image while keeping high-resolution details, whereas the L2 loss tends to estimate the mean of the distribution, leading to blurry images [42].
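Before turning to the loss design, the four-module structure described above can be sketched in PyTorch as follows. This is a minimal sketch: the 96 × 96 input size and the 17-D joint output follow the text, but the channel widths, layer counts, and kernel sizes are assumptions rather than the exact Transteleop architecture.

```python
# Minimal PyTorch sketch of the encoder-embedding-decoder-joint structure of Transteleop.
# Channel counts and kernel sizes are illustrative assumptions, not the paper's exact values.
import torch
import torch.nn as nn


class Transteleop(nn.Module):
    def __init__(self, n_joints=17, z_dim=256):
        super().__init__()
        # Encoder: human hand depth image (1 x 96 x 96) -> convolutional features.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),    # 48 x 48
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 24 x 24
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 12 x 12
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU()  # 6 x 6
        )
        # Embedding: fully connected layers holding the shared pose feature H_share.
        self.embedding = nn.Sequential(
            nn.Flatten(), nn.Linear(256 * 6 * 6, z_dim), nn.ReLU()
        )
        # Decoder: reconstructs the robot hand depth image from H_share (unused at inference).
        self.decoder_fc = nn.Linear(z_dim, 256 * 6 * 6)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Tanh()
        )
        # Joint module: regresses the 17-D Shadow hand joint angles from H_share.
        self.joint_head = nn.Sequential(
            nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, n_joints)
        )

    def forward(self, human_depth):
        h_share = self.embedding(self.encoder(human_depth))
        robot_depth = self.decoder(self.decoder_fc(h_share).view(-1, 256, 6, 6))
        joints = self.joint_head(h_share)
        return robot_depth, joints
```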
In our case, the local features of the hand, such as the positions of the hand keypoints, are more important than the overall image resolution. Transteleop tackles this problem by introducing a keypoint-based L2 reconstruction loss [see (3)], where M is the number of pixels, α_ij is the weighting factor of the pixel at [i, j], and Î_R is compared against the ground-truth robot hand image. Here, P_ij is the location of the pixel at [i, j] in image coordinates, A is the location array of all 15 keypoints and their eight neighboring pixels, α_ij is the scaling factor of the pixel at [i, j] (visualized in Fig. 5), and D_max is the maximum value of the distance array D.
The joint module of Transteleop employs fully connected layers to regress the 17-D joint angles from the latent feature embedding H_share. The joint loss L_joint is a mean squared error over the joint angles, L_joint = (1/N) Σ_n (θ_n − θ_n^GT)², where N is the number of joints and GT denotes the ground-truth joint angles.
During training, the complete objective L_hand is the weighted sum of the reconstruction loss and the joint-angle regression loss, L_hand = λ_recon L_recon + λ_joint L_joint, with λ_recon = 1 and λ_joint = 10, and the network is trained on a paired human-robot dataset (see Section IV-B). At inference time, the decoder module is not used; Transteleop takes a depth image of a human hand as input and outputs joint commands for the robot.
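The sketch below illustrates this training objective: a keypoint-weighted L2 reconstruction term plus an MSE joint term combined with λ_recon = 1 and λ_joint = 10. The exact per-pixel weighting α used in the article is not reproduced here; the distance-based fall-off toward the 15 keypoints is an illustrative assumption.

```python
# Sketch of the Transteleop training objective; the alpha weighting scheme below is an
# assumption (weights peak at the hand keypoints and decay with distance), not the paper's
# exact definition.
import torch
import torch.nn.functional as F


def keypoint_weight_map(keypoint_uv, size=96):
    """Weight map alpha: pixels near the 15 hand keypoints count more.
    keypoint_uv: (K, 2) tensor of keypoint pixel coordinates."""
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).float().view(-1, 2)        # (size*size, 2)
    d = torch.cdist(pix, keypoint_uv.float())                      # distance to each keypoint
    d_min = d.min(dim=1).values.view(size, size)
    return 1.0 + (1.0 - d_min / d_min.max())                       # in [1, 2], peaks at keypoints


def transteleop_loss(pred_img, gt_img, pred_joints, gt_joints, alpha,
                     lambda_recon=1.0, lambda_joint=10.0):
    recon = (alpha * (pred_img - gt_img) ** 2).mean()              # weighted L2 reconstruction
    joint = F.mse_loss(pred_joints, gt_joints)                     # MSE over the 17 joint angles
    return lambda_recon * recon + lambda_joint * joint
```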

B. Dataset
Considering that inconsistent orientations and positions of the input and reconstructed images would admittedly make training Transteleop harder, it is preferable to capture the images of the human hand and the robot hand from the same viewpoint and at the same wrist pose. In [12] and [29], the robot images were recorded in the Gazebo simulator with the camera and the robot both at a fixed global pose. In this article, we instead collect a pairwise human-robot dataset from the same viewpoint. Given the human hand depth images and keypoint positions from Bighand2.0 [43], the robot images are collected through an OpenGL interface, and the robot joint angles are optimized by the bio-ik solver [44].
The Bighand2.0 dataset provides 960K human hand depth images in which the global poses of the human hand vary considerably. The positions of the 21 hand keypoints with respect to the camera frame are shown in Fig. 6(a). In order to obtain the wrist orientation, we build a local hand frame for each hand. The vectors first finger (FF)-palm, middle finger (MF)-palm, ring finger (RF)-palm, and little finger (LF)-palm denote the vectors from the wrist pointing to the metacarpal joints of the FF, MF, RF, and LF. The z-axis is the mean of the FF-palm, MF-palm, RF-palm, and LF-palm vectors. The y-axis is the cross product of the MF-palm and RF-palm vectors. The x-axis then completes a right-handed coordinate system. Once we have the wrist orientation, we obtain the transformation from the camera to the wrist for the human hand dataset. Next, we place a camera in OpenGL at the same orientation and position with respect to the robot wrist. Taking advantage of the bio-ik solver, we developed an optimized retargeting method integrating a position mapping and an orientation mapping from the human hand keypoints to the corresponding robot hand keypoints. The kinematic chain of the Shadow hand is visualized in Fig. 6(b). Subsequently, the joint angles of the robot, namely the ground truth for Transteleop, are acquired. Finally, given the transformation between the camera and the robot wrist, we render the robot model and capture depth images of the robot hand in OpenGL. The pairwise dataset consists of 400K synchronized human-robot depth images and the corresponding robot joint angles. As demonstrated in Fig. 6(c), the robot hand imitates the human hand at a similar wrist pose.
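A minimal sketch of this hand-frame construction is given below; it assumes the wrist/palm point and the four metacarpal keypoints are already extracted from the 21 annotated keypoints, and the final re-orthogonalization step is an addition for numerical robustness rather than part of the article's description.

```python
# Sketch of the local hand frame: z = mean palm-to-metacarpal direction,
# y = normal of the MF/RF palm vectors, x completes a right-handed frame.
import numpy as np


def hand_frame(palm, ff_mcp, mf_mcp, rf_mcp, lf_mcp):
    """Return a 3x3 rotation matrix with columns [x, y, z] of the hand frame.
    All inputs are 3-D points of the wrist/palm and metacarpal keypoints."""
    vecs = [ff_mcp - palm, mf_mcp - palm, rf_mcp - palm, lf_mcp - palm]
    z = np.mean(vecs, axis=0)                    # z-axis: mean of the four palm vectors
    z /= np.linalg.norm(z)
    y = np.cross(mf_mcp - palm, rf_mcp - palm)   # y-axis: cross product of MF-palm and RF-palm
    y /= np.linalg.norm(y)
    x = np.cross(y, z)                           # x-axis: completes a right-handed frame
    x /= np.linalg.norm(x)
    y = np.cross(z, x)                           # re-orthogonalize y (added for robustness)
    return np.column_stack([x, y, z])
```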

V. ACTIVE VISION SYSTEM
Our real-time active vision system allows the camera to capture the right human hand from optimal viewpoints by introducing a moving vision sensor. The vision sensor is mounted on the end-effector of the robot arm. In such a tracking system, three crucial issues must be considered: 1) whether the robot can smoothly follow the human hand in real time; 2) whether the robot keeps a safe distance from the human; and 3) whether the UR5 robot arm can cover the workspace required by the manipulation tasks.
Regarding the first issue, the high frequency of the PhaseSpace motion tracking system ensures fast and reliable identification of the human hand. The goal pose of the UR5 robot, which carries the SR300 depth camera, is continuously updated to a position from which the camera can optimally observe the human fingers, namely 40 cm in front of the palm. In our hand coordinate system, this position is simply a 40-cm translation along the negative y-axis of the human hand; compare Fig. 6(a). Second, 30 Hz joint-space trajectory generation is achieved with the inverse kinematics solver bio-ik. The real-time 6-D poses of the end-effector are translated online into joint-space robot commands, which are required to be as close as possible to the current robot configuration. In Cartesian space, the translational and angular motions are constrained by velocity and acceleration limits; in addition, a maximum velocity constraint in joint space is also employed.
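As a minimal sketch of this goal computation, the snippet below places the camera 40 cm along the negative y-axis of the hand frame and clamps each 30 Hz tracking step to a Cartesian velocity limit; the helper names and the 0.2 m/s limit reused from Section VIII are illustrative assumptions.

```python
# Sketch of the active-vision goal: 40 cm in front of the palm along the hand's negative y-axis,
# with a per-cycle step clamp that respects a Cartesian velocity limit.
import numpy as np


def camera_goal_position(wrist_position, hand_rotation, offset=0.40):
    """wrist_position: (3,); hand_rotation: 3x3 with columns [x, y, z] of the hand frame."""
    y_axis = hand_rotation[:, 1]
    return wrist_position - offset * y_axis       # 40 cm along the negative y-axis


def clamp_step(current, goal, max_lin_vel=0.2, dt=1.0 / 30.0):
    """Limit the per-cycle translation so the 30 Hz tracking stays within velocity limits."""
    step = goal - current
    dist = np.linalg.norm(step)
    max_step = max_lin_vel * dt
    if dist > max_step:
        step *= max_step / dist
    return current + step
```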
On top of the certified safety system of the UR5, the 40-cm distance between the human hand and the end-effector and the trajectory constraints provide a strong safety guarantee. Furthermore, we add a collision object whose volume covers the area of the human hand to the planning scene and update its pose in real time. During the experiments, we check the target poses for collisions before execution. Moreover, the human can always press the deadman switch (left foot pedal) to stop the UR5 robot immediately.
To identify the overall workspace of the system, we constructed reachability maps of the UR5 and the right PR2 arm by creating grid poses in the environment and computing valid IK solutions for these poses. In our setup, the UR5 robot is mounted on the wall near a corner and the PR2 stands in an unconstrained space. The blue, green, yellow, and red spheres in Fig. 7 indicate that the robot end-effector can reach that position with more than 50, more than 20, more than 10, and exactly one orientation(s), respectively. Compared to the UR5 workspace, only a few blue spheres are scattered in the PR2 workspace owing to its mechanical limitations. To utilize most of the PR2 workspace, we implement relative control for the right PR2 arm and absolute control for the UR5 robot. Therefore, the right PR2 arm only performs the incremental motion of the human arm after the demonstrator presses the right foot pedal, while the UR5 robot tracks the human arm motion online. In this way, the human demonstrator can always move their arm within a comfortable motion range. A block diagram of the overall hand pose tracking is shown in Fig. 8.
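The reachability-map construction can be summarized by the sketch below: sample grid positions, try a set of candidate orientations at each position, and count how many admit a valid IK solution. The `ik_solver.solve` call stands in for the bio-ik query and is a placeholder, not a real API.

```python
# Illustrative sketch of a reachability map; ik_solver and the orientation set are placeholders.
import numpy as np


def reachability_map(ik_solver, bounds, orientations, step=0.1):
    """bounds: ((xmin, xmax), (ymin, ymax), (zmin, zmax)) in meters;
    orientations: list of candidate end-effector orientations (e.g., quaternions)."""
    (xmin, xmax), (ymin, ymax), (zmin, zmax) = bounds
    reach = {}
    for x in np.arange(xmin, xmax, step):
        for y in np.arange(ymin, ymax, step):
            for z in np.arange(zmin, zmax, step):
                count = sum(
                    ik_solver.solve(position=(x, y, z), orientation=q) is not None
                    for q in orientations
                )
                if count > 0:
                    reach[(x, y, z)] = count   # visualized with thresholds 50 / 20 / 10 / 1
    return reach
```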

VI. SLAVE ROBOT MOTION GENERATION
As discussed in Section V, because of the limited workspace of the PR2 robot, we employ relative control for the slave robot to achieve fine coordination between the two robots. Therefore, an initial registration of the human wrist pose with the slave wrist is not necessary, and the workspace of the slave robot is not constrained by the local site. The slave robot only moves while the human presses the foot pedal. This not only secures the robot but also facilitates adjustments at the local site, for example, to avoid possible self-collisions of the UR5 robot or when it is close to its workspace boundary. Similar to the motion of the UR5 robot, given the human wrist pose acquired through the PhaseSpace system, the joint angles of the slave arm are computed by the bio-ik solver under velocity and acceleration constraints in Cartesian space and in joint space. A minor difference is that, in joint space, we consider the feedforward as well as the feedback joint-angle difference to compute the joint velocities.
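A minimal sketch of this clutched, relative control is given below: while the pedal is engaged, only the incremental wrist motion since engagement is added to the slave end-effector pose, and joint velocities combine a feedforward term from consecutive IK solutions with a feedback term on the tracking error. The gains and helper names are assumptions.

```python
# Sketch of relative (pedal-clutched) control of the slave arm; gains are illustrative.
import numpy as np


class RelativeController:
    def __init__(self, k_ff=1.0, k_fb=0.5, dt=1.0 / 30.0):
        self.k_ff, self.k_fb, self.dt = k_ff, k_fb, dt
        self.wrist_ref = None      # human wrist position when the pedal was pressed
        self.slave_ref = None      # slave end-effector position at the same instant
        self.prev_ik = None        # previous joint-space IK solution

    def engage(self, wrist_pos, slave_pos):
        """Called when the operator presses the foot pedal."""
        self.wrist_ref, self.slave_ref = wrist_pos.copy(), slave_pos.copy()

    def goal_position(self, wrist_pos):
        """Map only the incremental human wrist motion onto the slave end-effector."""
        return self.slave_ref + (wrist_pos - self.wrist_ref)

    def joint_velocity(self, ik_goal, joints_measured):
        """Feedforward from consecutive IK goals plus feedback on the joint-angle error."""
        if self.prev_ik is None:
            self.prev_ik = ik_goal
        feedforward = (ik_goal - self.prev_ik) / self.dt
        feedback = (ik_goal - joints_measured) / self.dt
        self.prev_ik = ik_goal
        return self.k_ff * feedforward + self.k_fb * feedback
```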

VII. NETWORK EVALUATION
In the preprocessing phase, we perform erosion followed by dilation to remove noise from the raw depth images and then extract a fixed-size cube around the hand. Then, the image is resized to 96 × 96 and normalized to [−1, 1]. During training, random in-plane rotations and random Gaussian noise are added for data augmentation. The average inference time of Transteleop is 0.027 s, tested on a computer with an Intel i9-7900X CPU at 3.30 GHz, 128 GB of RAM, and a GeForce GTX 1080 Ti GPU.

This section examines the regression accuracy of Transteleop and four baseline models trained on the proposed paired human-robot dataset. The four baselines are as follows.
1) Human-Only: A model that removes the decoder module in Transteleop; it is used to evaluate the effect of the reconstruction loss in Transteleop.
2) Robot-Only: A model that removes the decoder module in Transteleop and is fed images of the robot hand; it serves as the expert model with the matched domain.
3) TeachNet: A state-of-the-art end-to-end robot hand pose estimation model with an auxiliary consistency loss [29].
4) GANteleop: A model that adds a PatchGAN discriminator to Transteleop and an adversarial loss based on the "pix2pix" framework [45]. This baseline is used to compare the conditional GAN-based structure with the encoder-decoder structure chosen in Transteleop.
To find out whether the proposed human-robot dataset improves posture learning, we also train a model called TransteleopFix on the same joint labels, but with the robot wrist fixed in the robot depth images; see [12]. We compare the fraction of frames whose maximum angle errors are below high-precision thresholds, as well as the average angle and distance errors over all joints. The comparison results are shown in Fig. 9 and Table I.
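The preprocessing described at the start of this section can be sketched as below. The cube size is an assumption, and the pixel-space crop is a simplification; a metrically correct crop would project the cube around the hand center using the camera intrinsics.

```python
# Sketch of the depth preprocessing: morphological opening (erosion then dilation),
# crop around the hand, resize to 96 x 96, and normalize to [-1, 1].
import cv2
import numpy as np


def preprocess_depth(depth_mm, hand_center_uvz, cube_mm=300.0, out_size=96):
    kernel = np.ones((3, 3), np.uint8)
    depth = cv2.morphologyEx(depth_mm.astype(np.float32), cv2.MORPH_OPEN, kernel)

    u, v, z = hand_center_uvz                 # hand center: pixel coordinates and depth (mm)
    u, v = int(u), int(v)
    half = int(cube_mm / 2)                   # simplified pixel-space crop (see note above)
    crop = depth[max(v - half, 0): v + half, max(u - half, 0): u + half]

    crop = cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_NEAREST)
    crop = np.clip(crop, z - cube_mm / 2, z + cube_mm / 2)
    return (crop - z) / (cube_mm / 2)         # normalized to [-1, 1]
```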
We can observe that the Robot-Only model significantly outperforms the other baselines on all evaluation metrics because of the matched domain. Meanwhile, Transteleop shows an average 2.8% improvement in accuracy over Human-Only under the high-precision condition, which suggests the effectiveness of the reconstruction loss. Transteleop also performs better than TransteleopFix, indicating that the new pairwise human-robot dataset with consistent wrist orientations enables the model to learn more indicative shared pose features. We note that the performance of GANteleop is much worse than that of Transteleop, presumably because its discriminator pursues image realism rather than pose features. Compared to Transteleop, the TeachNet model obtains at least 15% lower accuracy below a maximum joint-angle error threshold. We infer that the image translation structure captures more valuable pose features than the layer-wise alignment mechanism in TeachNet.
In addition, we analyze the influence of the camera viewpoint. We divide the test dataset into 12 portions based on the angle between the camera direction and the y-axis of the hand. Fig. 10(a) shows the average absolute angle error of the individual joints (except for joints LF5, LF4, MF4, RF4, and TH3, which have relatively smaller mean errors than the other joints) on the 12 subdatasets. The mean errors of all joints rise noticeably when the viewpoints are in the [75°, 150°] range because of the amount of self-occlusion. In particular, the mean error of thumb joint 5, which is one of the essential joints for manipulation, is 0.096 rad in the [165°, 180°] range. Surprisingly, most joints perform well in the [150°, 180°] range. To explain this phenomenon, we analyze the average number of occluded joints in each subdataset, which indicates the posture complexity. A joint is considered occluded when the distance between its depth annotation and its reprojected depth value exceeds a threshold. The average number of occluded joints is shown in Table II. Apparently, there are some straightforward human hand images with lower posture complexity in the [150°, 180°] viewpoint range of our test dataset, which explains the unexpectedly good performance in this range.
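The occlusion test can be sketched as follows; the threshold value and the intrinsics handling are assumptions, and the test simply counts joints whose projected depth-map value lies in front of the annotated joint depth by more than the threshold.

```python
# Sketch of the joint-occlusion count used to estimate posture complexity.
import numpy as np


def count_occluded_joints(depth_img, joints_xyz, fx, fy, cx, cy, thresh_mm=20.0):
    occluded = 0
    for x, y, z in joints_xyz:                     # joint positions in the camera frame (mm)
        u = int(round(fx * x / z + cx))            # project the joint into the image
        v = int(round(fy * y / z + cy))
        if 0 <= v < depth_img.shape[0] and 0 <= u < depth_img.shape[1]:
            if z - depth_img[v, u] > thresh_mm:    # observed surface lies in front of the joint
                occluded += 1
    return occluded
```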

VIII. ROBOT EXPERIMENTS
In this section, we rigorously examine the proposed teleoperation system by precision analysis of the robot trajectories and five elaborate experiments (pick and place, tower building, pouring, sweeping, and pushing) that test precision and power grasps as well as prehensile and nonprehensile manipulation. In Cartesian space, the maximum linear velocity, angular velocity, and linear acceleration are 0.2 m/s, 2 rad/s, and 2.0 m/s², respectively, for both robots. The maximum joint-space velocity of the UR5 is 3 rad/s. The velocity limits of the five joints on the right PR2 arm are the default values from the PR2 manual [39]. The control frequencies of the UR5 robot, the Shadow hand, and the slave arm are all 30 Hz.

A. Trajectory Analysis
To check whether the robot can track human hands in real time, it is essential to evaluate the precision of the teleoperation system quantitatively. We recorded the end-effector trajectories of both the UR5 and the PR2 while the right human hand performed specific motions. Fig. 11(a) and (b) presents example end-effector trajectories of the UR5 and the PR2 while a human moves the hand at the local site. The coordinate frame of these trajectories is parallel to the base frame of the UR5, and the goal wrist positions in both figures are the smoothed goals after applying the Cartesian constraints. From Fig. 11(a) and (c), we can see that in most cases the UR5 follows the human hand well. At around 3-12, 41-43, and 60-62 s, where the human hand moves through a sharp corner and the UR5 is stretched out, the tracking error exceeds 3 cm. The probable reasons are: 1) the regularization goal in our trajectory generation method, which keeps the joint-space solutions as close as possible to the current robot configuration, and the servoj parameters set in the UR5 driver act together to smooth the trajectory; and 2) the closer the robot is to its workspace boundary along the x-axis, the greater the trajectory error. Note that the 3-cm position error does not affect the camera tracking of the human hand at all, because the camera remains near its optimal working pose (around 40 cm from the human hand, near a 0° view angle). The right PR2 arm starts from the center of its workspace and precisely executes the motion commands, as depicted in Fig. 11(b) and (d). The average error of the right PR2 arm is 1.8 cm. These trajectory analyses indicate that the two robots are capable of following human motions and carrying out most manipulation tasks, such as pick and place and pouring.

B. Teleoperation Experiments
To verify the reliability of our method, real-world experiments across six types of physical tasks were performed by a female and a male adult. The operators first became familiar with the active tracking system and then took a warm-up phase for each task with ten nonconsecutive attempts before the real testing. One of the primary concerns regarding vision-based teleoperation systems is their lack of haptic feedback. Since visual or auditory channels can be a low-cost alternative to haptic feedback [46], we visualize the pressure values on each fingertip from the Biotac sensors during the manipulation process as a clue for force feedback.
1) Pick and Place: In this experiment, the robot grasps a Pringles can with a radius of 4 cm and then places it on a blue cylinder with a radius of 2.5 cm. We teleoperated the robot to grasp the Pringles can from the top and from the right side with a power grasp, as shown in Fig. 12(a).
2) Tower Building: This experiment requires the robot to stack three different objects on top of each other; see Fig. 12(b). In this experiment, the robot takes a precision grasp for the small green block and a power grasp for the orange irregular block. Since larger objects have to be placed on top of a smaller cylinder in experiments 1 and 2, these two tasks strictly inspect the grasping ability and stable release.
3) Pouring: In this task, the robot grasps a cup filled with rice, pours the rice into a bowl, and then places the empty cup on a box. Fig. 13 visualizes the teleoperation process of this task. The human is supposed to turn the right hand 90° clockwise, move along the y-axis of the UR5 robot, update the hand pose to a grasping pose, and then slowly rotate the right wrist to the pouring pose. To fulfill this task, the UR5 robot needs to track the human hand simultaneously. Overall, this pouring task mainly examines the stability of the tracking system.
4) Sweeping: The robot grasps the brush and sweeps three small blocks to a specific place; see Fig. 14(a). The contact force between the brush and the table surface should be mild. This task contains the challenges of pushing, sliding, and precision grasping.
5) Midi Mixer Fader Sliding: Figs. 1 and 14(b) and (c) illustrate the fader sliding task performed with the FF of the Shadow hand. Fig. 14(c) displays the top visual scene of the remote site during the experiments and the visual haptic feedback of the five robot fingertips. Obviously, in the top scene, the critical experimental area is easily occluded by the robot hand itself. Therefore, the remote manipulation states depend heavily on the side webcam and the haptic feedback. Besides that, humans can hardly move their hands along an exact straight line, so the robot usually cannot slide a fader from left to right in one attempt. To improve the success rate, we downscaled the human motions by a factor of three.
6) Multitable Cup Stacking: Fig. 15 illustrates the teleoperation process of this task. The robot first picks a blue cup from the table in front of it; then the robot rotates 90° via a manual teleoperation command from the operator; then it inserts the blue cup into a green cup on the other table to its right. Afterward, the robot rotates back to its original pose and repeats the above procedure to insert another cup. In this task, the webcam is located right in front of the PR2 robot to provide a complete view of the two tables [see Fig. 15(a)]. This task shows the potential of our current setup and the benefits of the mobile robot.
Table III shows the average completion time and success rate of each task. Each task was conducted five times by each operator. In the tower building task, the human needs to place the objects on a smaller surface with appropriate force, and the robot can accidentally ruin the tower. Therefore, this task took the longest time (except for the multitable task, which contains three rotations of the mobile base) and achieved a relatively low success rate. For the fader sliding task, the average time refers to the time it takes to continuously slide the fader from left to right three times; in fact, finding the next fader takes up half of the completion time. The high success rate of most tasks indicates that our system can execute precision grasps, power grasps, placing, sliding, and robust tracking. These results verify the feasibility of the proposed dexterous hand-arm teleoperation system. Refer to the experimental video for details of the manipulation tasks.

IX. CONCLUSION AND DISCUSSION
We developed a novel hand-arm teleoperation system incorporating the vision-based joint estimation approach, Transteleop, and a real-time active vision system. Transteleop predicts the robot joint angles from a depth image of the human hand in an end-to-end manner based on a human-to-robot translation model. Comparison and ablation studies on the test dataset show that Transteleop performs better than the other baselines, and the viewpoint analysis verifies the necessity of active perception. Trajectory analysis and systematic robot experiments, including pick and place, tower building, pouring, sweeping, fader sliding, and multitable cup stacking, demonstrate the efficacy of the proposed system. In practice, the UR5 robot used in the active vision system can be replaced by any 6-DoF robot arm.
In the future, we would like to design a deep-learning-based 6-D global hand pose estimation algorithm to replace the motion tracking system, thus simplifying the hardware setup. In this system, we fully harness human cognition and employ a direct control strategy for the robot. Even though direct control achieves seamless and continuous control, it places the entire control burden on the user. Shared control, which combines direct user commands with remote feedback or autonomy, is a better choice for long-term teleoperation tasks, for example, through user intention detection and task autocorrection. Furthermore, integrating adaptive force control strategies into the current vision-based teleoperation system could allow the robot to deal compliantly with physical and dynamic interactions with the environment [47], [48].
A limitation of the current vision-based teleoperation system is the robot-motion feedback. The operator cannot directly see the robot and has to rely on the camera images and the (visual) haptic feedback from the robot. This is, of course, a more realistic setting for remote teleoperation, but it is not intuitive for untrained humans and therefore incurs a much higher mental load and much slower motions. Also, the robot hand often partially occludes the scene and objects during the manipulation tasks, so the camera images become uninformative. A recent study [49] confirms that a VR interface allows users to operate the robot more smoothly, resulting in a shorter robot trajectory to the target object and faster task completion compared with a camera-based teleoperation approach. Therefore, we propose to use virtual reality technology for immersive teleoperation in future work. Although visual force feedback is a cheap and convenient substitute, it has been shown to be less efficient than haptic force feedback, especially when interacting with soft objects [46]. Slip detection during the manipulation tasks or ultrasound-based haptic devices [22] would also be helpful in vision-based teleoperation. In addition, it is somewhat exhausting to focus on multiple visual displays simultaneously, so using audio channels, such as auditory force feedback or a workspace boundary alarm, would be useful.

Fig. 1. Pipeline of our proposed vision-based hand-arm teleoperation system. The human demonstrator teleoperates the slave hand through an end-to-end hand pose estimation network, Transteleop, and controls the slave robot arm by relative trajectory control based on the operator's wrist motions. To solve the inaccuracy of hand pose estimation caused by self-occlusion of the human fingers, we introduce a controlled active vision system that seeks the optimal observation of the hand. The active vision system consists of a depth camera mounted on a robot arm and a real-time trajectory generation method. This teleoperation system enables the slave robot to finish different types of manipulation tasks, such as pouring, sweeping, and multitable cup stacking.

Fig. 2. Hardware setup of the proposed teleoperation system. (a) Front view of the local site. (b) Teleoperation hardware.

Fig. 3. Primary ROS nodes on each computer and data communication among the three computers. The colored lines represent the data transfer between two nodes. The types of transferred data are listed in the corresponding colored text. The two robots run on two independent ROS masters but communicate via the master_discovery_fkie ROS package.

Fig. 4. Architecture of Transteleop. Given an input depth image from the human hand domain, Transteleop reconstructs a robot hand image and predicts the joint commands of the robot hand in the robot domain.

Fig. 5. Heatmap of the weight factor α. The darker the color, the more important the pixel.

Fig. 6. (a) Human hand model in the Bighand2.0 dataset. The colored points show the 21 keypoints of a human hand in the dataset. The four black arrows represent the vectors from the wrist pointing to the metacarpal joints of the FF, MF, RF, and LF. TH refers to the thumb. (b) Kinematic chain of the Shadow hand. The standard Shadow hand has 24 joints, but all joints 1 in our model are fixed at 20° because of the installation of the Biotac tactile sensors. WR1 and WR2 refer to wrist joints 1 and 2. The lengths of the four fingers are the same. (c) Examples of paired human-robot depth images at the same wrist pose in our dataset. The left and right images in each pair are the human hand and the robot hand, respectively.

Fig. 7. Visualization of (a) the UR5 workspace and (b) the PR2 workspace in our setup from a third-person viewpoint. The blue, green, yellow, and red spheres indicate that the robot end-effector can reach that position with more than 50, more than 20, more than 10, and exactly one orientation(s), respectively.

Fig. 8. Block diagram of the human hand pose tracking and the generation of arm motions for the local and remote robots. Separate emergency-stop functions are used to protect operators and robots.

Fig. 9. Fraction of frames whose (a) maximum absolute angle error and (b) distance error over all joints are below a threshold, for Transteleop and the four baselines on our test dataset.

TABLE I. Angle/distance accuracy under high-precision conditions and average angle/distance error.

Fig. 10. (a) Absolute average angle error of the individual joints tested on twelve viewpoint intervals. (b) β indicates the camera viewpoint, i.e., the angle between the camera direction and the y-axis of the hand.

TABLE II. Average number of occluded joints.

Fig. 11. Trajectory analysis of the UR5 and the right PR2 arm. The overlapping numbers on the trajectories in (a) and (b) denote the corresponding execution time (s). (a) UR5 tracking trajectories. (b) PR2 trajectories. (c) Position error along the x-, y-, and z-axes of the UR5 tracking. (d) Position error along the x-, y-, and z-axes of the PR2 motion.

Fig. 12. (a) PR2 robot picks a Pringles can and places it on a blue cylinder. (b) PR2 robot stacks three different objects on top of each other.

Fig. 13. Screenshots of the pouring task. From left to right, the three images in the upper row show the PR2 robot grasping a cup filled with rice and pouring the rice into a bowl. The images in the lower row visualize the real-time human status at the local site.

Fig. 14. (a) PR2 robot grasps a brush and sweeps three blocks to the orange block on the right side of the image. (b) PR2 robot slides the mixer fader from left to right using its FF. (c) Top scene from the Kinect2 on the head of the PR2 and the visual haptic feedback. The five red cylinders from left to right qualitatively illustrate the fingertip pressure of the thumb, FF, MF, RF, and LF.

Fig. 15. Screenshots of the multitable cup stacking task. (a) Side scene from the webcam located right in front of the PR2 robot. (b) Teleoperation process shown from left to right.

TABLE III. Average completion time and success rate of each task.