Vision-Based Assistance for Myoelectric Hand Control

Conventional control systems for prosthetic hands use myoelectric signals as an interface, but it is impossible to realize complex and flexible human hand movements with myoelectric signals alone. A promising control scheme for prosthetic hands uses computer vision to assist in grasping objects. It features an imaging sensor, and the control system is capable of recognizing an object placed in the environment; a gripping pattern can then be selected from predefined candidates according to the recognized object. However, previous studies assumed that only one object exists in the environment. If there are multiple objects in the environment, the control system cannot determine which object is the intended target. This study addresses this problem and proposes a method to determine the target object from among multiple objects. The proposed method determines the target object by estimating the positional relationship between the artificial hand and the objects, as well as the motion of the hand. To verify its validity and effectiveness, we implemented the proposed method in a vision-based prosthetic hand control system and conducted pick-and-place experiments. The experiments confirm that the proposed method can accurately estimate the target object in accordance with the user's intention.


I. INTRODUCTION
Myoelectric hands have been developed to help amputees who have lost their hands due to accidents or diseases. As the name suggests, myoelectric hands use muscle electric potentials measured from the residual arm as a control interface. It has been expected that dexterous hand movements can be estimated from myoelectric signals, but the robustness of current pattern recognition methods can support no more than about 10 movements. Meanwhile, multiarticulated myoelectric hands have been developed in recent years and are very advanced from a mechanical perspective: individual motors in each finger enable precise motion for delicate work. However, the control of precise motion at the finger level through myoelectric signals has never been achieved. Thus, a new approach for controlling dexterous hand movements is necessary.
Vision plays an extremely important role in human grasping. When we grasp an object, we use our eyes to recognize the target object and coordinate the actions of the hands. Vision has been used in the control system of myoelectric hands as an alternative input source. Figure 1 illustrates a conceptual diagram of vision-based control for myoelectric hands. Vision-based control features an imaging sensor to recognize the target object and selects a grip pattern from predefined candidates according to the recognized object category.
(The associate editor coordinating the review of this manuscript and approving it for publication was Pedro Neto.)
Vision-based control of myoelectric hands is inspired by human visuomotor function, that is, the ability to synchronize visual information with physical movement. A schematic diagram comparing the components and interactions involved in the human visuomotor system and the vision-based control system is shown in Fig. 2. The human visuomotor system for reach-to-grasp movement is a simple model with only two components: the human and the target object. In the case of controlling a vision-based myoelectric hand, the target object, the myoelectric hands and the user are involved in a grasping movement. The interactions between the three components include cognition, action, interface and feedback.
Cognition is the core module of vision-based control. With a cognition module, the control system can recognize the object category, object orientation, object position, and other contextual information related to a full grasping movement. Theoretically, the more information that can be recognized by the cognition module, the more dexterous the hand movement can be. The recognized information and the user intention (conveyed through the interface) can be combined to guide the action of the myoelectric hand. Since the myoelectric hand is an extension of human capability, the system works optimally only when the cognition and action of the myoelectric hand are in accordance with those of the human. It is therefore necessary to use an interface module to convey the user's intention and a feedback module to make the user aware of the hand's cognition. Myoelectric signals are widely used as an interface of the vision-based system. The synergy of the four modules controls the joint angles, speed, and force of the myoelectric hand.
In previous studies, one object placed in an environment is recognized, and the corresponding grip pattern is triggered. However, in practical use scenarios, it is often necessary to select one target object among several objects. In this article, we propose the use of an imaging sensor as an additional interface to myoelectric signals to convey user intentions. The imaging sensor was introduced to recognize objects, but in fact, the images from the imaging sensor can also be used to estimate the motion of the hands. For example, if the hand moves to the right, the object captured by the imaging sensor will move in the opposite direction. The object in the image will become larger as the hand moves closer to the target object and smaller as the hand moves farther away.
The proposed control scheme uses the captured images from imaging sensors to estimate the motion of the hand without using an IMU, and the estimated motion can be further used as an interface to convey the user's will to the control system.
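The bounding-box cues described above (opposite lateral shift, size change with approach) can be sketched as follows; a minimal example, where the box format (cx, cy, w, h) and the function itself are illustrative assumptions rather than the paper's implementation:

```python
import math

def estimate_hand_motion(prev_box, curr_box):
    """Infer apparent hand motion from how a tracked bounding box
    changes between two consecutive frames.

    Boxes are (cx, cy, w, h) in normalized image coordinates.
    Lateral hand motion is opposite to the box's shift in the image;
    a growing box area indicates the hand is approaching the object.
    Returns (dx, dy, dz), where dz > 0 means moving closer.
    """
    dx = -(curr_box[0] - prev_box[0])  # object shifts left -> hand moved right
    dy = -(curr_box[1] - prev_box[1])
    prev_area = prev_box[2] * prev_box[3]
    curr_area = curr_box[2] * curr_box[3]
    dz = math.log(curr_area / prev_area)  # log area ratio as a depth-change proxy
    return dx, dy, dz
```

When neither the center nor the area shows a clear change, the hand can be treated as stationary, which is the condition used later for accumulating gaze time.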

II. RELATED WORKS
The widely adopted interface for controlling prosthetic hands is through EMG signals, where the user's motion intention is estimated based on pattern recognition of the EMG signals. Previous studies have extensively investigated handcrafted or learned features with advanced classifiers to improve the accuracy of pattern recognition [1]-[4]. Though the capability to discriminate motion patterns from EMG signals has improved substantially, the robustness provided by current pattern recognition algorithms is still insufficient for daily activities; moreover, natural control of multiarticulated hands remains out of reach [5]-[7]. To overcome the limitations of EMG-based control, recent studies attempt to measure other kinds of sensory information besides EMG signals, such as vision, voice, and the orientation of the residual arm, to offer additional cues for controlling a prosthetic hand [8]-[10]. Vision-based control systems [11], [12] have been considered a promising solution due to the rich contextual information related to grasping (e.g., object size, object shape, and distance) included in an image.
On the other hand, deep learning has revolutionized the field of computer vision in recent years [13], and techniques to recognize and localize objects in an image have improved considerably. These techniques have been applied to the control systems of prosthetic hands to determine the grasp pattern and grasp timing [14]-[18]. For example, in [19], Shima et al. attempted to recognize the shape of an object using an RGB-D sensor to improve classification accuracy. To control the prosthetic hand throughout the approaching phase, He et al. introduced a real-time object detection system and implemented it in a server-client mode to improve the computation capability. Fukuda et al. attempted to use IoT technology to collect contextual information about the grasping scene from both the objects and the prosthetic hands [20].
Since prosthetic hands are expected to be controlled semiautonomously and autonomously, it is necessary that the user and the prosthetic hand understand each other's intentions. Therefore, we consider expanding the interface channel between the user and the prosthetic hand. EMG signals have been used in previous studies as the only interface to convey the user's intentions [11], [15]; in this article, we propose imaging sensors as a new interface. There have been many attempts to use imaging sensors as an interface. For example, in [21]-[23], an imaging sensor is attached to the human body as an interface to recognize gestures and other actions. The research of [24], [25] uses hand gestures and head tilt as an intuitive interface to control an electric wheelchair and a mobile robot. In [26] and [27], an imaging sensor attached to the wrist captures the detailed movements of the five fingers. Though hand gestures can be an effective means of conveying intentions to the system, a different approach is needed for amputees, who may have difficulty performing hand gestures.
In [28], a camera is attached to the wrist, and the hand motion is recognized from the captured video. In addition, if we note that the measurement direction of the imaging sensor corresponds to the line of sight of the prosthetic hand, the conventional technology of the gaze input interface [29] can also be applied.

III. VISION-BASED MYOELECTRIC PROSTHETIC HAND
The control diagram of the proposed system is shown in Fig. 3. First, image capture is performed with an imaging sensor attached to the prosthetic hand. A deep convolutional neural network is used to detect objects from the captured images. Then, determination of the target object among several objects is performed based on the distance of the object, the position of the object in the image, and the gaze time on the object. At the same time, EMG signal decoding is performed to estimate the user's desire to grasp or halt an already started grasp session. Motion decisions are made based on the determined target object and estimated user intention. If the user is willing to grasp an object, motor commands will be generated to control the prosthetic hand to preshape a specific grip pattern according to the target object.
A. TARGET OBJECT SELECTION
Figure 4 shows an example of target object selection by the proposed method. The distance from the prosthetic hand and a ratio parameter representing how close an object is to the center of the image are shown next to each object. The objects surrounded by rectangles are considered target candidates, and the object surrounded by the red rectangle is the target object.
The object detection module is the basis of target object selection. It accepts an image as input and outputs the object category and object position in the image. The architecture of the neural network is depicted in Fig. 5; it is the same architecture as used in [17]. The network first extracts features from the entire image using convolutional layers (the backbone); then, the features are used for coordinate regression and class classification [30]. The object detection module finds all the possible target objects in the grasping scene. The detection speed is highly dependent on the performance of the processing unit; in our experiment, the captured images are processed at approximately 30 FPS with a GeForce GTX 1080. The results of object detection are used to determine the target. We adopt a three-step approach. The first step is preliminary screening of the target candidates by removing distant objects that the hand cannot reach. The second step is to find the object located nearest to the center of the image. After these two steps, the target object can be preliminarily selected. If the user wants to further improve the accuracy and stability of the control system, they can optionally increase the gaze time on the target object. The longer the hand gazes at an object, the more likely it is that the object is the target. The gaze time can be short or long, depending on the user's habits and the reliability of the control system. The details of each step are explained below.
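The three steps can be summarized in a short sketch; the field names (`distance`, `center`, `label`) and default values are illustrative assumptions, not the paper's data structures:

```python
def select_target(detections, max_reach_m=0.5, gaze_threshold_s=None, gaze_time=None):
    """Three-step target selection: (1) drop objects beyond reach,
    (2) pick the candidate nearest the image center, and (3) optionally
    require an accumulated gaze time before confirming the target."""
    # Step 1: preliminary screening by estimated hand-object distance
    candidates = [d for d in detections if d["distance"] <= max_reach_m]
    if not candidates:
        return None

    # Step 2: the object nearest the image center (0.5, 0.5) is preferred
    def center_dist(d):
        x, y = d["center"]
        return ((x - 0.5) ** 2 + (y - 0.5) ** 2) ** 0.5

    target = min(candidates, key=center_dist)
    # Step 3: optional confirmation once enough gaze time has accumulated
    if gaze_threshold_s is not None and gaze_time is not None:
        target = dict(target, confirmed=gaze_time >= gaze_threshold_s)
    return target
```

Returning `None` when no candidate is within reach leaves the hand idle, which matches the behavior of excluding unreachable objects before any selection is made.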

1) NARROWING DOWN TARGET CANDIDATES BY OBJECT DISTANCE
Target candidates are narrowed down by excluding objects far from the prosthetic hand. When estimating the distance between an object and the hand, the area of the bounding box estimated during object detection is used. The correspondence between the bounding box area and the object distance is measured in advance for each type of object, and the approximate distance between the object and the prosthetic hand can then be calculated from the area of the detected bounding box using an interpolation method. The estimated distance is used to exclude objects that are farther than a certain distance from the prosthetic hand. Objects in Fig. 4, such as the scissors, compass, and stapler, are excluded from the candidates because their distances from the prosthetic hand exceed a specified threshold.
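This lookup can be sketched with linear interpolation over a per-category calibration table; the numbers below are illustrative placeholders, not the values measured in the paper:

```python
import numpy as np

# Calibration measured in advance for each object category:
# normalized bounding-box area observed at known hand-object distances.
CALIBRATION = {
    "tape": {"dist_m": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
             "area":   [0.120, 0.060, 0.034, 0.022, 0.015, 0.011]},
}

def estimate_distance(category, box_area):
    """Interpolate the hand-object distance from a detected box area.
    np.interp expects ascending x values, so both arrays are reversed,
    because the area shrinks as the distance grows."""
    cal = CALIBRATION[category]
    return float(np.interp(box_area, cal["area"][::-1], cal["dist_m"][::-1]))
```

Note that `np.interp` clamps at the endpoints, so areas larger than the nearest calibration point simply return the minimum calibrated distance.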

2) TARGET OBJECT DETERMINATION BY OBJECT POSITION
The distances from the center of the image to the centers of the bounding boxes of the target candidates are then calculated, and the object closest to the center of the image is selected as the target object. When the user reaches to grasp an object, that object comes close to the center of the image. Therefore, the closer an object is to the center of the image, the more likely it is that the user intends to grasp it. Since the dimensions of the image are normalized to a range of 0 to 1, the center of the image is (0.5, 0.5). If the center point of the detected bounding box is (X, Y), the distance d between the two points is calculated as follows:

d = sqrt((X - 0.5)^2 + (Y - 0.5)^2)  (1)

The probability that a candidate object becomes the target is defined based on this distance d. The selection probability of each candidate is calculated by the following equation:

p_i = (1 / d_i) / Σ_j (1 / d_j)  (2)

The probability p_i indicates the likelihood that the i-th candidate becomes the target. The probability is normalized over all target candidates, so the probabilities of the candidates sum to 1.
Among all the detected candidates that are surrounded by a bounding box in Fig. 4, the object with the highest selection probability is considered the target. At that time, the color of the bounding box turns red to provide a visual prompt to the user.
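The distance-based selection probability can be computed as follows; the inverse-distance weighting is one plausible normalization consistent with the description above (closer to the center means more probable, and the probabilities sum to 1), not necessarily the paper's exact form:

```python
import math

def selection_probabilities(centers):
    """Return a normalized selection probability for each candidate,
    higher for bounding-box centers nearer the image center (0.5, 0.5).

    centers: list of (x, y) bounding-box centers in normalized
    image coordinates.
    """
    eps = 1e-6  # avoids division by zero for a box exactly at the center
    d = [math.hypot(x - 0.5, y - 0.5) for x, y in centers]
    w = [1.0 / (di + eps) for di in d]
    total = sum(w)
    return [wi / total for wi in w]
```

The candidate with the highest probability would then be highlighted (the red bounding box in Fig. 4).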

3) TARGET OBJECT DETERMINATION BY GAZE TIME
In addition, gaze time is considered as another factor for determining the target object. The longer the gaze time on a target object, the stronger the user's desire to grasp it. We monitor the changes of the center point and area of the bounding box to estimate the motion of the hand and to calculate the time that the hand gazes at the object. Specifically, if the center point of the bounding box moves downward, the hand is moving upward; if the area of the bounding box becomes larger, the hand is moving forward, approaching the object. If neither the center point nor the area of the bounding box shows a clear change, the hand is not moving but simply gazing at the target object. The accumulated gaze time t is input into the sigmoid function shown below, and the output P of the function is regarded as the level of the user's intention to grasp the object.

P(t) = 1 / (1 + exp(-a(t - b)))  (3)

where a and b are parameters that control the shape of the sigmoid function. The relationship between the cumulative gaze time and the level of the user intention is shown in Fig. 6. In this study, a and b are set to 2.5 and 2, respectively, by trial and error, but these values can be adjusted by the user. When the probability P approaches 1, the target object is confirmed.
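The mapping from accumulated gaze time to intention level can be sketched directly; the parameterization below (a scales the steepness, b shifts the midpoint) is one common form of the sigmoid, assumed here with the paper's trial-and-error values a = 2.5 and b = 2:

```python
import math

def grasp_intention(gaze_time_s, a=2.5, b=2.0):
    """Logistic mapping from accumulated gaze time (seconds) to an
    intention level in (0, 1). With this form, the level crosses 0.5
    at gaze_time_s == b and approaches 1 for longer gazes."""
    return 1.0 / (1.0 + math.exp(-a * (gaze_time_s - b)))
```

With the default parameters, a gaze of about 2 s yields an intention level of 0.5, and the level saturates toward 1 a second or two later.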

B. MOTION DECISION AND MOTOR CONTROL
The target object is determined using information from the imaging sensor, but the intention to grasp or not and the timing to trigger movement are determined by the user. The intention of the user is sent to the control system through the myoelectric interface. Figure 7 shows the collection and processing of the myoelectric signals. Two electrodes are attached to the forearm flexor muscle group and extensor muscle group, respectively. The signals are measured with an amplifier (Bagnoli Desktop, Delsys Inc.) and sampled at 1000 Hz. The EMG signals are then high-pass filtered (cutoff 10 Hz), full-wave rectified, and low-pass filtered (cutoff 1 Hz). Then, the envelope of the signal is extracted within a 100 ms window that slides across the signal. The left side of Fig. 7 shows the myoelectric signal of the flexor muscle group and the extracted envelope during flexion. The envelope is shown as a blue line for the flexor group and as a red line for the extensor group, indicating the activity level of these muscles. A threshold value is set in advance for the signal envelope; once the envelope exceeds the threshold, a specific activity is triggered. According to which muscle group exceeds the threshold, we set three patterns of triggering activities: no muscle activity, flexor group activity, and extensor group activity. The flexor group and extensor group are never active at the same time.
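The envelope extraction and threshold trigger can be sketched as follows; the first-order IIR filters stand in for the paper's unspecified filter design, so the cutoff behavior is only approximate:

```python
import numpy as np

def first_order_filter(x, fc_hz, fs_hz, kind="low"):
    """Minimal first-order IIR low-pass or high-pass filter."""
    dt = 1.0 / fs_hz
    rc = 1.0 / (2.0 * np.pi * fc_hz)
    y = np.zeros(len(x))
    if kind == "low":
        alpha = dt / (rc + dt)
        for i in range(1, len(x)):
            y[i] = y[i - 1] + alpha * (x[i] - y[i - 1])
    else:  # high-pass
        alpha = rc / (rc + dt)
        y[0] = x[0]
        for i in range(1, len(x)):
            y[i] = alpha * (y[i - 1] + x[i] - x[i - 1])
    return y

def emg_envelope(raw, fs_hz=1000):
    """High-pass at 10 Hz, full-wave rectify, then low-pass at 1 Hz,
    following the processing chain described in the text."""
    x = first_order_filter(np.asarray(raw, dtype=float), 10.0, fs_hz, "high")
    return first_order_filter(np.abs(x), 1.0, fs_hz, "low")

def trigger(env_flexor, env_extensor, threshold):
    """Three-state trigger from the two envelopes: flexor activity
    closes the hand, extensor activity opens it, otherwise no motion."""
    if env_flexor > threshold:
        return "close"
    if env_extensor > threshold:
        return "open"
    return "none"
```

Checking the flexor envelope before the extensor one is an arbitrary tie-break; per the text, the two groups are never active simultaneously, so the order does not matter in practice.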
When the hand has already been preshaped to the grip pattern corresponding to the category of the target object and a movement is triggered through muscle activity, the motor is driven to control the prosthetic hand. At that time, the activity of the flexor muscle group executes the hand close motion, and the activity of the extensor group executes the hand open motion. The prosthetic hand does not move if there is no trigger signal. Figure 8 shows the hand gripper used in the experiment and some examples of its grip patterns. The prosthetic hand has 3 degrees of freedom, and its grip patterns (joint angles) are configured in advance. Sometimes the joint angle of the hand may not be comfortable for grasping the object, but the user can orient the prosthetic hand with the residual arm.

IV. EXPERIMENTS
We conducted experiments to verify the validity of the proposed method. Table 1 lists the 10 kinds of objects used in the experiment. The picture of each object shown in the table was taken with the distance between the hand and the object kept at 0.3 m. The actual dimensions of these objects and the areas of their corresponding bounding boxes are reported in the table. We then varied the distance between the hand and the object from 0.2 m to 0.7 m in 0.1 m intervals. For each distance, the area of the bounding box was calculated and averaged over five pictures. The correspondence between the distance and the area of the bounding box was then fitted with an equation and plotted in Fig. 9. The approximate distance can be calculated from the size of a detected bounding box using the graph or the fitting equation.
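One way to obtain such a fitting equation: under a pinhole-camera assumption the box area scales roughly as 1/distance², so a power law can be fitted by linear regression in log-log space. The calibration numbers below are illustrative placeholders, not the measurements of Table 1:

```python
import numpy as np

# Averaged bounding-box area (normalized) at each measured distance.
dist_m = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
area = np.array([0.118, 0.058, 0.033, 0.021, 0.015, 0.011])

# Fit area = k * distance^s as a straight line in log-log space.
s, log_k = np.polyfit(np.log(dist_m), np.log(area), 1)

def distance_from_area(a):
    """Invert the fitted power law to recover the hand-object distance."""
    return float(np.exp((np.log(a) - log_k) / s))
```

For data following an inverse-square law, the fitted exponent s comes out near -2, and the inversion reproduces the calibration distances to within a few percent.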

A. VERIFICATION OF TARGET DETERMINATION
We first verified the algorithm for target object determination based on object position. In the experiment, five kinds of objects (scissors, glue, pen, tape, and stapler) were placed on a table, as shown in Fig. 10. We held the hand gripper and moved it from left to right at a relatively constant speed. At the same time, the target object probability was calculated in real time using Eqs. 1-3. The probability is plotted in Fig. 11. When an object appears in the middle of the image, the probability that it is the target object reaches a peak, as expected. Sometimes the target object probability of an object has two or three peaks in the graph; this happens when the number of detected objects in an image decreases or increases. Figure 12 shows how the target object is determined. The subfigures from top to bottom show the visual field of the prosthetic hand, the gaze time on the objects together with the hand movement, and the estimated target object probability. The visual field of the hand is displayed on a monitor, which the user can view to understand the system state. We can see from the figure that the gaze time accumulates when the hand is kept at rest (the value of the hand movement is near 0) and is reset when the hand moves again.
The target object probability also changes appropriately with the change of the object position in the image. If an object appears in the center of the visual field of the hand, its target probability reaches a peak in the graph (see label 1). At around the 200th frame (label 2), the gaze time is reset because of a sudden hand movement. The target object probability at label 3 is near 100% because the scissors are the only object in the visual field of the hand at that time (part of the paste is not recognized as an object at label 3). The target probability of the scissors remained high from the 350th to the 450th frame, but the scissors were not considered the target object because the hand was moving and the gaze time was not accumulated during that period. After the 450th frame, the gaze time accumulates again when the hand is in a stable state with no movement. At that time, the scissors are identified as the target object.

B. CONTROL EXAMPLE WITH VISION-BASED INTERFACE AND EMG SIGNALS
Finally, all the processing, including the determination of the target object, the hand control with myoelectric signals, and the hand motion, is verified through a case study. The results of the experiment are shown in Fig. 13. The subfigures from top to bottom are the feedback screen showing the cognition of the hand, the target object probability, the hand movement, the target object probability considering the gaze time, the trigger signal (control signal) extracted from the myoelectric signals, and the joint angle of the prosthetic hand. In the lower-right corner of the feedback screen, there is a vector (inside a circle) indicating the movement direction and speed of the prosthetic hand. The direction and speed are estimated from two adjacent frames in the sequence of captured images. If the hand has no motion, the gaze time accumulates and a green bar appears showing the length of the gaze time. If the gaze time exceeds a predefined threshold, the word "selected" is displayed near the estimated target object.
Similar to the previous experiment, the target object probability varies appropriately with the change of the object position in the image. Additionally, the determination probability (the fourth subfigure from the top) increases with longer gaze time. The target object probability is intentionally suppressed after the 170th frame by moving the prosthetic hand. The myoelectric signal successfully triggers the hand to open and close without any erroneous operation. Myoelectric signals are intentionally generated near the 230th frame; since the determination probability is near 0, no motion is triggered.

V. CONCLUSION
In this study, we improved the control system of a vision-based prosthetic hand; the developed system helps the vision-based prosthetic hand localize the target object, select a proper grip pattern, and trigger a hand open/close motion. In particular, we proposed a method to estimate the target object from multiple objects placed in the environment. The method estimates the positional relationship between the prosthetic hand and the objects to propose the target candidates. Then, the gaze time is taken into consideration to finally determine the target object. To verify the validity and effectiveness of the proposed method, we conducted experiments in a prototype system of a vision-based prosthetic hand. The experiments confirm that the proposed method can accurately estimate the target object in accordance with the user's intention.
In the future, we would like to extract more contextual information from the images to improve the accuracy of target object determination and to conduct automatic grip pattern generation based on the attributes (shape and weight) of the object. Moreover, we would like to deepen the potential application of vision-based myoelectric hands.
NOBUHIKO YAMAGUCHI received the Ph.D. degree in intelligence and computer science from the Nagoya Institute of Technology, Japan, in 2003.
He is currently an Associate Professor of the Faculty of Science and Engineering, Saga University. His research interest includes neural networks. He is a member of the Japan Society for Fuzzy Theory and Intelligent Informatics.
HIROSHI OKUMURA received the B.E. and M.E. degrees from Hosei University, Tokyo, Japan, in 1988 and 1990, respectively, and the Ph.D. degree from Chiba University, Chiba, Japan, in 1993.
He is currently a Full Professor of the Graduate School of Science and Engineering, Saga University, Japan. His current research interests include remote sensing and image processing. He is a member of the International Society for Optics and Photonics (SPIE), the Institute of Electronics, Information and Communication Engineers (IEICE), and the Society of Instrument and Control Engineers (SICE).
VOLUME 8, 2020