Fast Prediction of a Worker’s Reaching Motion Without a Skeleton Model (F-PREMO)

,


I. INTRODUCTION
There is an increasing need for cooperative work between humans and robots in fields such as object transportation [1], [2], nursing care [3], [4], and product assembly [5], [6]. Particularly in the industrial field, many collaborative industrial robots have been developed and operated without any separation between the workers and industrial robots, primarily due to the following reasons: 1) The safety fences can be removed after a risk assessment has been performed, as defined in ISO10218-1:2011 and ISO10218-2:2011 (revised in 2011); 2) Cooperative work has great potential for improving production efficiency by decreasing production time, shortening the production line, and decreasing the number of workers.
The associate editor coordinating the review of this manuscript and approving it for publication was Tao Wang .
For these two reasons, a number of robots have been developed for cooperative work, including CR-35iA [7], duAro [8], NEXTAGE [9], YuMi [10], LBRiiWa [11], and Baxter [12]. For example, the FANUC corporation developed an industrial robot for cooperative work, called CR-35iA. This robot has the following three safety functions [13]: 1) Contact stop function, which is activated on the physical impact between the robot and worker; 2) Push to escape function, under which the robot may be shoved in any direction by a human operator; and 3) Shock absorption function, which is achieved by covering the robot's surface with cushioned materials. ABB has also produced a collaborative industrial robot called YuMi [10]. YuMi can execute emergency stops within less than 10 [ms] after detecting an unexpected impact, and the surface of the robot is covered with soft pads for absorbing the physical impacts. In addition, the positions of YuMi's joints and the length of each joint are carefully designed such that there is no gap where the worker's fingers can get stuck.
These robots ensure worker safety by utilizing emergency stop functions when collisions are detected and by computing trajectories for collision avoidance based on the distance between the robot and the worker [14]- [16]. In other words, these safety functions are based on the current state of the robot and the worker. In addition to these functions, predicting the worker's movement can greatly improve production efficiency by decreasing the number of emergency stops, increasing the running speed of the robot, and computing more efficient trajectories for collision avoidance [17].
The prediction of worker movements at production sites is categorized into the following two types. The first is the prediction of the movement of the worker's center of mass [18], [19], while the second is the prediction of the motion of the worker's arm, which is referred to as the reaching motion [29]- [31]. The former is mainly treated as a problem of predicting the movement of a mass point. In contrast, predicting the reaching motion is more difficult since the dynamics of motion of a human arm are highly nonlinear [32]. However, in many assembly tasks, the worker's arm moves with nearly no change in the worker's center of mass.
Therefore, this paper focuses on predicting the reaching motion of a worker. A number of prediction methods for reaching motion have been proposed. A prediction method that uses a Bayes classifier and assumes that the positions of each joint are recognized and measured is reported in [31]. In their method, the reaching motion is considered to be divided into two layers: the upper layer defines the type of task for the reaching motion and the lower layer provides the motion. The method proposed herein predicts the type of task and motion, and subsequently, integrates each prediction. Their paper reports that a prediction accuracy of greater than 70% is achieved based on 1/3 of the trajectory of the reaching motion.
With technological advances on three-dimensional measurement [20], [21], a lot of robotic tasks have been achieved with 3D sensors, such as, positioning [22], [23], pose estimation [24]- [26], picking [27], [28], and measurement of worker's movement. Researches [29] and [30] also employed motion-capture systems to detect the arms and utilized the Gaussian Mixture Model (GMM) algorithms to predict the reaching motion of the arm. Their studies focused on learning models that describe how humans cooperate, which can be applied to cooperative human-robot work. Other similar techniques used to investigate human motion have been studied by using Hidden Markov Models (HMMs) [33].
In these models, the positions of the worker's joints, particularly the arm joints, are measured using a motion-capture system, which requires (1) attaching markers to the worker and (2) placing at least three cameras. These requirements increase the cost of installing a cooperative production system and place limitations on the working environment.
Alternative methods for predicting the reaching motion without a motion-capture system have been proposed using a skeleton model of the worker.
For example, a prediction method for the reaching motion based on gesture recognition was proposed in [34]. In this method, the various types of worker movements are registered in a database in an offline process. The movements are then categorized into classes, and one GMM is assigned to each class. The volume swept by the worker is also computed for each class. Next, in an online process, the current motion is first categorized based on the recognized gesture, and the time series of the volume swept by the worker is predicted. This method achieved prediction accuracies of 50% and 92% based on measured data of 43% and 80% initial trajectories, respectively.
Ravichandar et al. proposed a prediction method for the trajectory of reaching motion and reaching point based on the dynamics of a human arm model using a neural network (NN) [32]. In their experiment, a skeleton model of the worker was tracked in real-time with a Kinect sensor. The prediction accuracy and prediction time of this algorithm were 87% (seven correct predictions out of eight trials) and 50 [ms], respectively.
Jiang et al. [35] proposed a method that identifies the skeleton model and predicts the reaching motion by using the identified skeleton model. They represented the skeleton model as a mapping between high-and low-dimensional representations in the skeleton model by using latent conditional random fields to model the spatial and temporal contexts of human activities. The method of Jiang et al. took 11.2 [s] to compute 10 [s] of the reaching motions.
The above methods estimate the joint positions of the worker by using the estimated skeleton model in real-time. However, to develop the skeleton model, it is necessary to place a 3D sensor in front of the worker in order to capture the upper body. In other words, these methods place limitations on the placement of the 3D sensor.
To overcome these limitations, we propose a fast and highly accurate prediction method, called fast prediction of reaching motion (F-PREMO) for the prediction of reaching motion without the need to attach markers to the worker or place a 3D sensor in front of the worker. The proposed method divides the common working area of the worker and robot into different regions and predicts which region the worker's hand will reach into. The proposed method mainly comprises the three following processes. First, the method extracts a set of 3D points comprising of the worker, which is called the ''worker's cluster'' in this paper, from the obtained 3D point cloud at each time step. Next, the method calculates a feature vector of the reaching motion from the time series composed of the worker's clusters. Finally, the proposed method predicts the region into which the worker's hand will reach by using a random forest. We confirmed the performance of the proposed method using an assumed assembling task. We conducted an experiment using a Microsoft Kinect v2 [36] as the 3D sensor; the sensor was mounted above the worker rather than in front of the worker. The experimental results showed that the proposed method could accurately predict the reaching motion based on 50% initial data for the reaching motion. In addition, the average computational time for the prediction was 5.2 [ms]. These results show that VOLUME 8, 2020 FIGURE 1. Situation for cooperative work considered in this paper. The working space consists of three types of areas: the robot's working area, common working area, and the worker's working area. In the robot's and worker's working areas, only a robot and worker can be located, respectively. The common working area is divided into several regions, each of which is assigned a label. A 3D sensor is located at a position where it can observe the common working area and movements of the worker.
the proposed method can quickly and accurately predict the reaching motion without requiring marker attachments or 3D sensors in front of the workers.
The remainder of the paper is organized as follows. Section II describes the problem considered in this paper: predicting the reaching motion of a human doing cooperative work with a robot. Section III describes the processes for computing the feature vector showing the worker's reaching motion from 3D point cloud data obtained by a 3D sensor. Section IV describes the semi-automatic process for creating training data and the process to construct a predictor using a random forest. Section V experimentally examines the performance of F-PREMO, including prediction accuracy and computation time. Section VI concludes this paper.
The following notations are used in this paper. A sphere with its center located at p ∈ R 3 and radius r ∈ R is denoted by S(p, r). Elements of set A are represented by a i (i = 1, 2, · · · ). The symbol |A| is defined as the number of elements of a set A. Using the above notations, we can write (1)

II. PROBLEM STATEMENT
As described in the previous section, this paper aims to predict a human's reaching motion during cooperative work between the human and the robot. In cooperative work, there is a chance of collision in the common working area where the robot and worker co-exist. To address the above problem, we consider the working situation shown in Fig.1. This paper assumes three types of areas: the robot working area, common working area, and worker's working area. The common working area is divided into several regions, each of which is given a label. We will FIGURE 2. Prediction process from 3D point cloud data to prediction in the proposed method. Based on the 3D point cloud data, the method categorizes the point cloud and identifies 3D points representing a worker, as shown in Sec. III-B. We refer to the 3D points as a worker's cluster. The proposed method then computes a feature vector using the time-series data of the worker's clusters, as shown in Sec. III-C. Finally, the method predicts the reaching motion from the current time t to t + T by using a random forest, as shown in Sec. IV.
propose a method that predicts the area that the worker's arm is reaching toward.
In addition, the proposed method does not require attaching markers to the worker or placing the 3D sensor in front of the worker. The only requirement concerning the placement of the 3D sensor is to mount it such that it can observe the worker movements in the common working area and in the worker's working area.
We can summarize the problem described above as follows: Problem 1: There exists a common working area for a worker and a robot, which is divided into n regions represented by 1 , 2 , · · · , n . The time-series data of the 3D point cloud in the common working and worker's working areas are assumed to be given from the initial time 0 to the current time t, and are denoted by P(0), P(1), · · · , P(t). The region located closest to the robot's working area S robot and where the worker's hand will reach from the current time t to t + T is determined as where d(X , Y ) denotes the distance function for the regions X and Y .Î(t + 1 : t + T ) is the predicted index set of regions toward which the worker's hand will reach from time t + 1 to t + T , and is defined by where E(J 1 , J 2 , · · · , J n ) is a mapping from the multi-set {J 1 , J 2 , · · · , J n } to a unique set in which each element has a multiplicity of 1. The symbol V th represents a threshold for volume. The symbolV worker represents the predicted area occupied by the worker at time τ and X ∩ Y volume denotes the volume of the intersection between regions X and Y.
To solve Problems 1 and 2, we first propose a method to extract the worker's cluster from the obtained 3D point cloud data; this method is a solution to Problem 1, shown in Sec. III-B. We then propose a process to compute a feature vector that represents the reaching motion, as described in Sec. III-C. Finally, we present a method to estimate the region i * that satisfies (2) by using the time-series data of the feature vectors; This method is a solution to Problem 2. . Process for computing a feature vector for the worker's reaching motion. This computation consists of four steps. In the first step, the worker's cluster is determined from point clouds obtained by a 3D sensor. In the second step, the 3D space is divided into voxels, and each voxel is binarized such that whether the number of points for the worker's cluster located in each voxel exceeds a threshold or not. After this process, the binary vector f (t ) is obtained in the third step. In the fourth step, the feature vector of the worker's reaching motion F (t ) is computed from the time series of the binary vectors f (t ), f (t − 1), · · · .

III. COMPUTATION OF FEATURE VECTOR
This section describes the process of computing a feature vector of the worker's reaching motion. The process mainly consists of four steps, as shown in Fig.3.
In the first step, the worker's cluster is extracted from 3D point cloud data. In the second step, the working area is divided into voxels and each voxel is binarized to indicate the presence of the worker. Subsequently, the binary vector f (t) is obtained in the third step. Finally, in the fourth step, a feature vector F(t) is computed based on the time series of binary vectors f (t), f (t − 1), · · · . Section III-A will explain preparation needed to be performed for the process of computing the feature vector. The first step is described in sec. III-B and the rest of the steps are shown in sec. III-C.

A. PREPARATION
First, it is necessary to install a 3D sensor in a place where it can observe the common working area and the worker's working area. It is not necessary for the 3D sensor's observation area to cover the entirety of the worker's working area. In addition, the 3D sensor is assumed to be able to obtain depth images of the scene. Thus, each 3D point obtained by the 3D sensor will have a corresponding 2D representation (u, v) in the camera's view, which can be represented by where (x, y, z) and s are the obtained 3D point's coordinates and a scale parameter, respectively. A number of models have been proposed for the function g [37]. The simplest one is the pin-hole camera model, which can be represented by where K is the intrinsic matrix and includes the focal length of the camera. Second, a rigid body conversion matrix T S M to transform the manipulator's coordinate system to the 3D sensor's coordinate system can be obtained by offline calibration. Next, the 3D shape of the manipulator is approximated by a primitive shape χ (e.g., a cuboid or a sphere) [14], and a function is identified, where V M and p joint i , i = 1, 2, · · · , J represent the approximated occupied space of the manipulator and each joint position of the manipulator, respectively. As a final phase in the preparation, the 3D sensor obtains a point cloud data for background where the manipulator is set to be in an initial pose, and the worker is not present.

B. RECOGNITION OF WORKER BASED ON 3D POINT CLOUD DATA
This section describes the process for extracting the worker's cluster of 3D point cloud data from the obtained 3D point cloud data at each time point.
First, the 3D points consisting of the manipulator are recognized in the 3D point cloud data as follows. Here, the 3D point cloud data are denoted by P(t) := {p 1 (t), p 2 (t), · · · , p n t (t)}. For the extraction, each joint position of the manipulator q M i , i = 1, 2, · · · , J in the robot coordinate system is computed by encoders from the angles of the joints. The joint positions are converted into the 3D sensor's coordinate system by From the computed joint positions, we can specify the space occupied by the manipulator as The 3D points corresponding to the manipulator are recognized by Next, we perform the background subtraction for the point cloud and extract the changed points to satisfy where p (u,v) (t) ∈ R 3 represents the 3D point corresponding to the 2D representation (u, v) and ψ z (p) is a function that returns the z coordinate of a 3D point p. Here, the set of points satisfying (11) is defined by P bs (t) := {p (u,v) (t) | satisfies eq.(11)}.

VOLUME 8, 2020
The manipulator's points are then removed from P bs (t); that is,P bs (t) := P bs (t)\P M (t).
The clustering is performed for pointsP bs (t) using either the 3D representation of the points [38] or the projected 2D representation of the points [39]: The clustered points are then obtained. 1 The clusters inP bs (t) with more than worker points are considered to be the worker; that is, where worker is the user-defined threshold, which should be set depending on the worker's volume.
We write the above process for extracting the worker's cluster of 3D point cloud data from P(t) as P worker (t) = ξ (P(t), P(0)), where P(0) represents the 3D point cloud obtained as the background. The function ξ in (15) is a solution to Problem 1.

C. FEATURE VECTOR OF THE WORKER's REACHING MOTION
This section describes the process of computing a feature vector of the worker's movement from the obtained worker's cluster derived in the previous section. The reaching motion is predicted using a random forest [40] with the feature vector as input.
First, we assume that the worker's cluster of 3D point cloud data is given from the initial to current time P worker (τ ), τ ∈ {0, 1, · · · , t} by using the method shown in sec. III-B.
The 3D space is then divided into voxels with length L voxel on each side. Each voxel is turned into a binary state to indicate the presence of the worker inside a voxel. If at least one point from P worker (t) is mapped to the voxel, the voxel takes the value 1; otherwise, the voxel takes the value 0. After this process, the 3D voxel space is mapped into a 1D space, creating a binary vector denoted by f (τ ) ∈ {0, 1} n f , τ = t, t − 1, · · · , as shown in Fig.3. Now, we are ready to define the feature vector as where κ 0 and κ 1 are user-defined constants for time delay. The function δ represents the time change for a worker's pose and is defined by 1 We have used an approach similar to that of [39] in sec. V.

IV. PREDICTION OF REACHING MOTION WITH FEATURE VECTOR
This section explains the prediction process by using random forest as a classifier. Before use, the random forest is trained on the data pairs of the feature vectors and region labels. These training data are semi-automatically created from time 0 to T train using the following process. The training data, which are constructed of pairs of the feature vector F(t) and the region label (t), are created in real-time without postprocessing. To create the training data, the worker only needs to indicate the starting time t i for the i-th reaching motion. The end time for each reaching motion is automatically identified as follows. Once the worker's hand is detected inside the region i in the common working area at time t reach i , a timer will run for the duration s timer . If the worker's hand has been inside the region i during the whole duration of s timer , the training dataset D i for the i-th reaching motion is constructed by (18) where T i represents prediction horizon, which is defined as t reach i − t i + s timer . The collection of n train reaching motion samples is combined as Not all feature vectors from time 0 to T train are included in D predict . The feature vectors that are not included in D predict (i.e., the worker is not performing a reaching motion toward the common working area) are denoted by F(t no i ) and collected in D no := (F(t no 1 ), −1), (F(t no 2 ), −1), · · · , (F(t no m no ), −1), (20) where the label -1 indicates no reaching motion toward the common working area. Here, note that {t no 1 , t no 2 , · · · , t no m no } = {0, 1, · · · , T train }\{t 1 , t 2 , · · · , t n train }.
Finally, we combine all training datasets generated by the above process into one training dataset: The random forest classifier consisting of n tree decision trees is then built using the training dataset. Each decision tree in the random forest is trained on randomly chosen data pairs (F(t), (t)) from D, where F(t) is the input vector to a decision tree and (t) is supervised data. In the online prediction process, the output of the random forest is the integrated output from each decision tree, as shown in Fig.4. The output of each tree is a probability prediction score for each region, including the special label -1. The output i predict region (t) from the random forest represents the region into which the worker's hand will reach with high probability from time t to t + T .
Representing the reaching motion with the feature vector F(t) allows us to store the current and previous positions of the worker's poses in a single vector without imposing any FIGURE 4. Process of predicting the region toward which the worker's hand will reach from current time t to t + T by using a random forest. First, a feature vector F (t ) of the worker's reaching motion is computed from the point cloud data P(t ). The feature F (t ) is then inputted to the random forest, which is obtained by the semi-automatic training process described in Sec. IV. A random forest is a set of binary decision trees, and each tree outputs the probability of the worker's hand reaching a certain region. The outputs of each tree are integrated, and the region with the highest probability is the prediction output from the random forest. kinematic model for the worker's body. Even though F(t) is a high-dimensional binary vector, the fast computation of the reaching motion is achievable. The random forest classifier can ignore unimportant elements inside F(t) (e.g., voxels that have never been occupied by the worker). Each decision tree is trained to identify and disregard unimportant feature values inside the first binary decision tree layers, thereby quickly reducing the feature space for the prediction process. Figure 5 shows a photo of the experimental system used for validation.

V. VALIDATION OF THE PREDICTION METHOD
The validation experiment involves a cooperative assembly task as follows.
1) A robot sequentially assembles six types of parts in the robot's working area. 2) The robot places an assembled part in the corresponding parts boxes (six boxes are located on a table).

3)
Step-by-step assembly instructions for a product are shown to a worker.  4) The worker selects and picks up parts that are necessary for the product from the parts boxes, and assembles the product on a conveyor belt.
In practice, the robot does not assemble parts in the robot's working area, and six types of parts are prepared in the common working area because this situation is sufficient to evaluate the performance of the proposed method and verify the extraction of the worker's cluster from the 3D point cloud, including the robot's cluster.
In the verification experiment, the common working area is coincident with the area where the parts boxes are located. This common working area is divided into six regions, each of VOLUME 8, 2020  Figs.7a and 7b indicates that the worker's hand will not reach the common working space from time t to t + T . The prediction result in Fig.7c illustrates that the proposed method outputs the correct prediction at 0.4 [s] after the initial pose of the worker. The results show that the proposed method generates correct predictions based on slight changes in the worker's posture.
which is represented by a box. In this situation, the proposed method predicts which box the worker's hand will reach.
The system configuration is shown in Fig.6. A Kinect v2 for Windows [36] is used as the 3D sensor and set at 1.2 [m] above the parts boxes. The Kinect v2 captures 512 × 424 depth images at a frequency of 30 Hz. The 3D point cloud data obtained by Kinect v2 are sent to a PC with an Intel Core i7-4720HQ CPU and 16 GB of DDR3 SDRAM, and processed for prediction. A point cloud library [41] is used for processing the 3D point cloud data.
To create the training dataset, the worker performs 300 cycles (50 cycles for each box). Here, starting from the  initial pose of the worker, we define one cycle process to include the worker picking up one part and inserting that part into a fixture located on a conveyor belt. The random forest is constructed using the created training datasets. A method for this machine-learning part is implemented with a machine-learning library called scikit-learn [42]. Table 1 shows the parameter settings for the proposed prediction method. In this experiment, Kinect v2 can measure a 3D space with a volume of 2.
Here, we set an ROI of 1.0[m] × 1.0[m] × 1.0[m] in the observed 3D space. In the ROI, we divide the 3D space by voxels with length = 0.1[m] in each dimension. Thus, the dimension of the binary vector f (t) is equal to 1,000. In addition, the dimension of the feature vector is 4,000 since k 0 is set to be two, as shown in Table 1.
To validate the proposed prediction method, we performed an experiment for 90 cycles (i.e., 15 cycles for each of the six boxes). The number of 3D points obtained by Kinect v2 is almost 217,000. Downsampling is performed at each time point, and the number of obtained 3D points is decreased to nearly 14,000 points. Figures 7 and 8 shows the obtained 3D point cloud, prediction results, and photos for the 65th cycle. In Figs. 7a-8h, the upper, middle, and lower figures show the 3D point cloud with prediction results, photos captured by the camera located above the conveyor belt, and photos captured by the camera located in front of the conveyer belt, respectively. In Fig. 7 and 8, the processes of stretching out, picking up the part, and putting it on the conveyer belt are shown in 7a-7d, 7i-7l, and 8a-8e, respectively. After picking up the part, the method predicted ''no box'' (i.e., the worker will not pick up any part in the current phase), as shown in Figs. 8a-8h. This result and VOLUME 8, 2020 FIGURE 9. Time-series prediction results of the random forest for the cycle shown in Fig.7. The horizontal axis shows time, with t = 0 being the beginning of the reaching motion. The vertical axis shows the output of the random forest. The worker can be considered to be in one out of seven states [i.e., reaching box i ∈ {1, · · · , 6} or not reaching any box (''no box'')]. For this cycle, the proposed method can predict the reaching motion correctly at time 0.4[s] since the output of the random forest changes from ''no box'' to ''Box 1'' at that time. Fig.7c show that the proposed method can produce the correct prediction based on slight changes in the worker's posture. Figure 9 illustrates the result of time-series prediction for the 65th cycle. In this experiment, a total of seven states of the worker can be considered, that is, reaching boxes i, i ∈ {1, 2, · · · , 6} and not reaching any box (''no box''). Thus, Fig.9 Figure 10 shows the statistical prediction result for the entire validation set of 90 reaching motions. Compared to Fig.9, which illustrates the prediction result over time, this graph shows the prediction result over normalized distance ratio, where 0 and 1 are the initial point of the trajectory of the reaching motion and the point where the worker's hand reached the box, respectively. Figure 10 shows that when the worker has performed 40% to 50% of the reaching motion from the worker's initial position, the proposed method can correctly predict the reaching motion. The placement of the boxes with regard to the initial pose of the worker indicated that the reaching motions for the right arm were longest for box 1, with motion's length diminishing for every box down to box 6. Longer arm reaching motion results in more visual features of worker's posture that can be used to distinguish the reaching motion toward different boxes. This would explain the differences in the early prediction results seen in Fig.10 for different targeted boxes. Figure 11 shows the computation time for the prediction process. Each prediction process, including the processing of the 3D point cloud, takes 5. 18 [ms]. This result indicates that real-time prediction can be performed by the proposed method.
The changes in prediction accuracy and computation time with the number of decision trees constructing the random forest are shown in Fig.12. Figure 12 illustrates ''Prediction using Random Forest'' corresponding to Fig.11 since the The horizontal axis shows the distance ratio of trajectories beginning from the initial position to the worker's hand reaching the boxes. The average prediction result for each reaching motion is illustrated in this graph. Based on the 40% to 50% initial trajectory of the reaching motion, the proposed method can correctly predict the reaching motion.  computation time for processing the 3D point cloud and feature vector do not depend on the number of decision trees. This figure shows that the prediction accuracy exceeds more than 85% when the computation time is more than 0.2 [ms], and the accuracy gradually increases up to 90%. This result indicates that the proposed method can perform high-accuracy prediction in real-time.

VI. CONCLUSION
This paper proposed a prediction method for reaching motions used to achieve efficient collaborative tasks by a worker and an industrial robot (F-PREMO). The proposed method has the following three advantages: First, the proposed method does not require the posture of each joint of the worker to be estimated or a skeleton model of the worker. In other words, the method does not impose any restrictions such as attaching markers to the worker or placing a 3D sensor in front of the worker. The second advantage relates to the computation time and prediction accuracy of the proposed method. The experimental results show that the performance of the proposed method is similar to or better than existing methods, without imposing any restrictions for the placement of the 3D sensor. Finally, the proposed method does not require much time and cost to generate training data. In general, the training data utilized for supervised learning methods are created manually, requiring considerable time and cost. To solve this problem, the training data are generated semi-automatically in the proposed method using a random forest.
In future work, we will confirm that the proposed prediction method can be applied to worker motions other than the reaching motion. It is also necessary to consider the method that generates the trajectory of the manipulator by using the prediction obtained by the proposed method.