Mastering the working sequence in human-robot collaborative assembly based on reinforcement learning

A long-standing goal of Human-Robot Collaboration (HRC) in manufacturing systems is to increase collaborative working efficiency. In line with the Industry 4.0 trend toward smart manufacturing, the cobot in an HRC system should be designed to be more self-organized and to attain superhuman proficiency through self-learning. Inspired by the impressive machine learning algorithms developed by Google DeepMind, such as AlphaGo Zero, this paper formats the human-robot collaborative assembly working process as a chessboard, where the selection of moves serves as an analogy for the decision making of both human and robot in the HRC assembly process. To obtain the optimal working-sequence policy that maximizes working efficiency, the robot is trained with a self-play algorithm based on reinforcement learning, without guidance or domain knowledge beyond the game rules. A neural network is also trained to predict the distribution of the priority of move selections and whether a working sequence is one that results in the maximum HRC efficiency. The assembly of an adjustable desk is used to demonstrate the proposed HRC assembly algorithm and its efficiency.


Introduction
Industrial robots have been successfully used to perform repetitive tasks with high precision. However, there are tasks, such as complicated assembly work, that are less structured and too complex to be fully automated, and thus cannot be performed entirely by robots. Moreover, evaluation of performance and flexible adjustment by humans are sometimes necessary, which makes it impossible to fully replace humans with robots. Therefore, human-robot collaboration (HRC) systems have been developed in industry to take advantage of the abilities of both humans and robots [1]. Unlike ordinary industrial robotics, where the environment is structured and known, in HRC systems the robots interact with humans who may have very different skills and capabilities [2]. Over the past two decades, a significant amount of research has been done on the design of HRC in manufacturing systems to improve the safety, quality, and efficiency of the system [3][4][5][6][7]. Since their introduction in the late 1990s [8], collaborative robots, or cobots, have been playing an increasingly significant role in HRC in manufacturing systems as assistants for humans [9,10]. However, the traditional viewpoint has mostly focused on the development of hardware [11,12], which results in machine-driven collaboration with little consideration of the functional flexibility of the cobots. These studies are human-centered, in that the cobots are limited to schemes designed by humans and programmed based on human domain knowledge, guidance or expert experience. Cobots have no self-learning capability, let alone the ability of self-organization and the superhuman decision making that may surpass existing models. Therefore, to explore a better method to accomplish a given task with minimum workload, it is desirable to develop human-robot systems in which the cobots can cooperate with humans in a more autonomous way.
Cobots should no longer be regarded as mere working tools; rather, they should be able to judge system states with corresponding decision-making capability and to adapt themselves to human operators of various skill levels.
Assembly work is critically important in manufacturing, as it integrates parts and components into final products. The application of HRC in complex, continually changing and variety-oriented assembly processes is still limited. One of the main challenges is that the HRC assembly environment is complex and dynamic. Although there may be a list of assembly tasks, the tasks are not necessarily in sequential order, many tasks can be performed in parallel, and some shared tasks can be taken by either human operators or robots. Given the uncertainties in human operators' performance levels and other random disruptions on the plant floor (e.g., random machine tool failures), the decision on which tasks are taken by which available resource, i.e., robots or human operators, and in what sequence, has a profound influence on the productivity of the assembly process. Current practice is mostly a manual process that depends heavily on human experience. The cobot is pre-programmed to perform repetitive tasks in a limited assembly process. When assembly tasks change, the whole system must be reprogrammed by robotics experts, while operators working on the floor usually do not possess the expertise to reprogram the system. This calls for the design of more intelligent cobots for HRC in manufacturing assembly processes.
A large portion of the latest HRC studies on the efficiency and safety of such systems are trajectory based, focusing on human plan recognition and the prediction of human motions [13][14][15]. Very little work has been done on finding a better decision-making policy or a better task scheduling method to improve the working efficiency of the whole system. Wilcox et al. [16] proposed an Adaptive Preferences Algorithm to find the optimal scheduling policy for a robot assistant to adapt to changes in human preference in the task sequence. However, their method is only demonstrated within a very small task space and with a simple task structure that can be solved with mathematical optimization. For larger task spaces and more complex task structures, applying mathematical optimization to the job dispatching problem becomes Non-deterministic Polynomial-time (NP) hard.
With the emergence of smart manufacturing and Industry 4.0 in recent years, many advanced approaches have been developed for system control and decision making [17], among which the application of machine learning (ML), with its increasingly powerful algorithms, is the most impressive. For example, in scheduling, a great deal of research has used reinforcement learning (RL). For the dynamic job shop scheduling (DJSS) problem, Shahrabi et al. [18] used RL with a Q-factor algorithm to enhance the performance of the scheduling method, considering random job arrivals and machine breakdowns. For the maintenance scheduling problem, many researchers have applied RL to find optimal maintenance policies in various manufacturing systems and obtained better system performance than traditional maintenance policies [19][20][21]. And in resource allocation, Ou et al. [22,23] have contributed substantially to the real-time scheduling of a gantry in a work cell. Although these studies are not on HRC manufacturing systems, their systematic way of formulating decision-making problems and solving them with RL has laid a solid foundation for studying human-machine interaction (as in maintenance scheduling) and robot-machine interaction (as in gantry scheduling).
In the past, traditional RL methods were commonly limited to low-dimensional problems. Recent RL algorithms have become increasingly powerful at solving extremely high-dimensional problems thanks to advances in deep learning. One of the most exciting lines of research on RL for decision making in extremely large state spaces in the past few years is probably Google DeepMind's AlphaGo and AlphaGo Zero [24,25]. AlphaGo Zero introduced an algorithm based on Monte Carlo Tree Search (MCTS) with a deep neural network trained to improve the strength of the tree search, which resulted in superhuman performance in mastering the complex game of Go. Inspired by the self-play algorithm of AlphaGo Zero [25], we formulate the manufacturing assembly process as a chessboard game whose specific assembly rules are determined by the required constraints. By integrating RL and a deep neural network, we take up the challenge of studying the decision making and workload scheduling of both humans and cobots in HRC manufacturing systems. The aim of this paper is to train the robot with a self-play algorithm based on RL, without guidance or domain knowledge, to find an optimal policy that maximizes assembly work efficiency and quickly adapts to environment and task changes.
The contributions of this work are: (1) A typical assembly process is formulated as an assembly-chessboard game with specific rules associated with the assembly constraints. This simplifies the optimization problem, as the constraints need not be considered explicitly. It also allows us to leverage powerful RL and deep neural network algorithms to deal with extremely complicated assembly jobs. (2) We develop a multi-agent self-play RL algorithm to perform task scheduling, starting from random play, without any supervision or use of human data. The algorithm is inspired by AlphaGo Zero, with the goal changed from competition to multi-agent collaboration. It also addresses specific challenges in manufacturing assembly that do not exist in competitive games.
The remainder of the paper is organized as follows: Section 2 states the problem assumptions and objectives. In Section 3, the HRC assembly problem is formulated as an RL problem by defining the states, actions and reward function. Section 4 introduces the MCTS-CNN algorithm used to obtain the optimal HRC policy. A demonstration case is presented in Section 5. Finally, conclusions and future research directions are provided in Section 6.
Problem description

HRC assembly problem description

The assembly job of a complex product can usually be decomposed into a number of sequential subtasks. There are two crucial matters, i.e., task allocation and task sequencing, that are directly related to the efficiency of the HRC assembly system. To give the problem some concreteness, we use the lift desk assembly job shown in Figure 1 to illustrate these two aspects.

Figure 1: Diagram of decomposing the product in the human-robot collaborative assembly into tasks for humans, robots or both, with a rough assembly sequence.
On the one hand, humans and robots each have distinct strengths and skills that can be exploited in the assembly job. To this end, we categorize all tasks into three types based on their physical properties and operational characteristics, such as geometrical stability, dexterity requirements, and tolerance limits [26]: type I represents tasks that can be done by humans only, type II by robots only, and type III by either humans or robots. For example, in Figure 1, to assemble the legs, humans can place the screws in the assembly holes quickly and precisely. However, it might be hard for a robot to accomplish this task, because it requires delicate adjustment to align the screws with the assembly holes. Thus, the screw-placing tasks are human-only. Once the screws are placed, the robot can locate them and do the screwing job even faster than a human. Thus, all screwing tasks are classified as robot-only. Other tasks, like flipping or rotating objects, can be done by either humans or robots. Task allocation refers not only to allocating human-only (or robot-only) tasks to the proper human (or robot) operators, but also, more interestingly, to deciding whether a human or a robot takes on a type III task.
On the other hand, given a product, its assembly sequence is constrained by the product design, quality considerations, etc., but these constraints are relatively loose, which means there are still many possible assembly sequences for some steps. For example, in Figure 1, to assemble the legs of the desk, the robot cannot screw S1 before the human places it into the assembly hole. Tasks of this kind, such as "Place S1 and Screw S1", have sequential constraints mandated by their physical properties. Other steps, such as which screw to screw first or whether the left or the right leg of the desk is assembled first, have no sequential constraints, such as "Place S1, Place S2 and Place S3". Consequently, for an HRC assembly system it is nontrivial to have an adaptive policy that determines the optimal task sequences based on real-time system states, so as to take full advantage of the flexibility of the HRC system.
To summarize, the problem we solve in this paper can be described as follows: in an HRC assembly process, given the subtasks and the relationships among them as input, find a method that optimizes the task sequence and task allocation for both humans and robots, so as to maximize the working efficiency of the human-robot collaborative assembly.

Assembly-chessboard game
To determine the optimal task sequence under the constraints discussed above, we format the structure of the HRC assembly tasks into a chessboard with three different kinds of stones representing the three types of tasks, as shown in Figure 2(a). The constraints of the assembly are represented as rules for playing the assembly-chessboard game.
To fully reflect the task constraints and the complexity of the task structure, the assembly chessboard is initialized and the game rules are set as follows:
• The chessboard has w × h grids, with the width w determined by the number of parallel tasks and the height h determined by the number of sequential tasks;
• Black stones denote type I tasks that can be done by humans only;
• White stones denote type II tasks that can be done by robots only;
• Grey stones denote type III tasks that can be done by either humans or robots;
• Tasks with sequential constraints are placed in the same column, with prior tasks set in the lower rows of the chessboard;
• Tasks with no sequential constraints are placed in the same row of the chessboard;
• One stone may occupy more than one grid cell, depending on the number of tasks needed in that step (represented by a row);
• Stones can only be picked from the bottom row of the chessboard;
• When a stone is picked from the bottom row, a stone in an upper row moves down one row if enough grid cells are available for it;
• The game starts with the initial pattern of the chessboard and ends when all stones have been taken by humans and/or robots.

Using the desk assembly as an example, the initial pattern of the assembly chessboard is shown in the first panel of Figure 2(b). If the first task is taken by a human operator, i.e., a black stone A1 is picked, the upper-row task B1 moves down to the bottom row, as shown in the second panel of Figure 2(b). This means task B1 can be done at the same time as, or at the same level as, the other black-stone tasks in the bottom row. Note that C1 cannot move down at this time, since it occupies two grid cells but only one grid cell is available. Next, if A2 is picked, B2 moves down one row to fill the bottom grid cell, and the two available grid cells then allow C1 to move down, as shown in the third panel of Figure 2(b).
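The picking-and-dropping mechanics above can be sketched in code. This is a minimal illustration only; the class and method names (`Stone`, `Chessboard`, `pick`) are ours, not the paper's, and it reproduces the Figure 2(b) scenario.

```python
from dataclasses import dataclass, field

@dataclass
class Stone:
    name: str
    kind: str   # 'H' human-only, 'R' robot-only, 'E' either
    width: int  # number of grid cells the stone occupies
    row: int    # 0 is the bottom row

@dataclass
class Chessboard:
    width: int                      # w: grid cells per row (parallel tasks)
    stones: list = field(default_factory=list)

    def legal_picks(self):
        """Only stones in the bottom row may be picked."""
        return [s for s in self.stones if s.row == 0]

    def _row_occupancy(self, row):
        return sum(s.width for s in self.stones if s.row == row)

    def pick(self, name):
        """Remove a bottom-row stone, then drop upper stones that now fit."""
        stone = next(s for s in self.stones if s.name == name and s.row == 0)
        self.stones.remove(stone)
        # Each remaining stone moves down one row if the row below has room.
        for s in sorted(self.stones, key=lambda s: s.row):
            if s.row > 0 and self._row_occupancy(s.row - 1) + s.width <= self.width:
                s.row -= 1

# Mimic Figure 2(b): picking A1 lets B1 drop, but the two-cell stone C1 must wait.
board = Chessboard(width=2, stones=[
    Stone("A1", "H", 1, 0), Stone("A2", "H", 1, 0),
    Stone("B1", "R", 1, 1), Stone("B2", "R", 1, 1),
    Stone("C1", "E", 2, 2),
])
board.pick("A1")
assert {s.name for s in board.legal_picks()} == {"A2", "B1"}
board.pick("A2")  # B2 drops, and the two freed cells let C1 drop as well
```

A single pass over the stones, lowest row first, implements the rule that stones move down at most one row per pick.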

Problem formulation
One key advantage of formatting the assembly tasks into an assembly chessboard is that it captures not only the task structure but also the hard constraints. It offers a clear view of the available agents and the task options at any moment, simply by taking stones from the bottom row and following the game rules. Compared with the traditional hierarchical task tree representation, which often has a complicated networked structure, this representation provides a simple and accurate description of an assembly process. In this way, we can track the allocation of both humans and robots to tasks and the sequence of their actions, as shown in Figure 2(c). Therefore, our problem can be simplified as: given the pattern of the assembly chessboard at any time t when one or several agents are available, find the best allocation of the agents (i.e., humans or robots) to the tasks in the bottom row of the chessboard, such that the accumulated time to finish the entire assembly chessboard by HRC is minimized. This problem can be conveniently solved within the framework and with the algorithms of RL.

Notations
For convenience of problem formulation, the following mathematical notations and assumptions are defined:
• TK_u, u = 1, 2, ..., U, denotes the u-th task in the assembly chessboard, with U the total number of tasks;
• A_i, i = 1, 2, ..., I, denotes the i-th task that can be done by humans only, with I the total number of such tasks. Each task A_i has an average completion time T_{A_i};
• B_j, j = 1, 2, ..., J, denotes the j-th task that can be done by robots only, with J the total number of such tasks. Each task B_j has an average completion time T_{B_j};
• C_k, k = 1, 2, ..., K, denotes the k-th task that can be done by either humans or robots, with K the total number of such tasks. Each task C_k has an average completion time T_{C_k};
• d^{h_j}_{A_i,C_k}(t) denotes the remaining time for human operator j to finish task A_i or C_k. For example, if T_{A_i} = 30 min and at time t the task has already been worked on by operator j for 10 minutes, then d^{h_j}_{A_i}(t) = 20 min;
• Similarly, d^{r_i}_{B_j,C_k}(t) denotes the remaining time for robot i to finish task B_j or C_k;
• H_i(t) ∈ [0, U], i = 1, 2, ..., M, denotes the availability of human operator i at time t: H_i(t) = u if human operator i is working on task TK_u (i.e., d^{h_i}_{TK_u}(t) > 0), and H_i(t) = 0 otherwise (d^{h_i}_{TK_u}(t) = 0);
• R_i(t) ∈ [0, U], i = 1, 2, ..., N, denotes the availability of robot i at time t: R_i(t) = u if robot i is working on task TK_u (i.e., d^{r_i}_{TK_u}(t) > 0), and R_i(t) = 0 otherwise (d^{r_i}_{TK_u}(t) = 0);
• If one human and one robot are available at the same time, the human has priority to choose his/her action.

Assembly-chessboard Game Problem Formulation
Formatting the HRC assembly process as the assembly-chessboard game allows the task scheduling problem to fit naturally into the Markov Decision Process (MDP) paradigm, which is the most common framework for RL. The state, action and reward function of the MDP are defined as follows.
In our task scheduling problem, at any time t the state s_t of the chess game consists of two parts: the task information on the chessboard, such as the pattern of the chessboard and the remaining times of the tasks on it, and the availability of the humans and robots. The action a_t consists of the actions of all agents, in which a^{h_i}_t ∈ {A(t) ∪ C(t), ∅}, where a^{h_i}_t = ∅ means human operator i takes no action at time t, and a^{r_i}_t ∈ {B(t) ∪ C(t) − a^h_t ∩ C(t), ∅}, where a^{r_i}_t = ∅ means robot i takes no action at time t. The ultimate goal of our problem is to minimize the job completion time. With this in mind, the reward function r_t should be directly related to the elapsed time between two consecutive states; we therefore define r_t as the negative of the elapsed time between states s_t and s_{t+1}. From the initial state s_0 to the terminal state s_end, the taken actions constitute an action route l denoted as a_l = {a(s_0), ..., a(s_t), ..., a(s_end)}, as shown in Figure 2(c). Our goal is to find the optimal HRC policy that maximizes the accumulated reward r_l = Σ_{t=0}^{end} r_t, so as to minimize the total job completion time. Note that for route l, there are O_l = Π_{t=0}^{end} O_{s_t} possible combinations of the humans' and robots' actions, where O_{s_t} is the number of such combinations in state s_t. For the desk assembly example, O_l can be a huge number; in the game of chess, for comparison, the number of possible move sequences is around 35^80. For more complicated assembly jobs, such as automotive assembly, this number can be even larger. Therefore, the assembly-chessboard game problem has an ultra-high dimension, and a proper algorithm has to be developed to solve it both effectively and efficiently.
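The state and reward structure described above can be sketched as follows. The field and function names are our assumptions, not the paper's exact encoding; the point is that a negative elapsed-time reward telescopes, so maximizing the return is exactly minimizing total completion time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    board_pattern: tuple  # flattened stone layout of the chessboard
    remaining: tuple      # remaining processing times of in-progress tasks
    human_busy: tuple     # H_i(t): task id, or 0 when idle
    robot_busy: tuple     # R_i(t): task id, or 0 when idle

def reward(t_prev, t_next):
    """r_t = -(elapsed time between two consecutive decision points)."""
    return -(t_next - t_prev)

# Summing rewards over a route telescopes to -(t_end - t_0):
times = [0, 5, 12, 20]  # hypothetical decision epochs for one route
ret = sum(reward(a, b) for a, b in zip(times, times[1:]))
assert ret == -20  # the return equals minus the total completion time
```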

Reinforcement learning for HRC assembly
We use a specially designed chessboard game to describe the HRC assembly process. Based on the problem formulation in the previous section, it has quite a few similarities to traditional board games, e.g., the game of Go. For example, it also has a finite number of moves, and one agent's action inevitably affects the other's situation. Therefore, it is possible to leverage the successful solutions to traditional board games to solve our assembly-chessboard problem. In this section, we present the MCTS-CNN algorithm used in AlphaGo Zero to obtain the optimal HRC policy.

MCTS in the HRC assembly-chessboard
In the MCTS algorithm, the information of the system state s_t defined above is saved in each node s_t of the search tree. For all legal actions a_t ∈ a(s_t), the corresponding edges (s_t, a_t) are connected to the node s_t. The information saved in each edge is the set

{N(s_t, a_t), W(s_t, a_t), P(s_t, a_t)},    (4)

in which N(s_t, a_t) is the visit count of the edge, W(s_t, a_t) is the total action value, and P(s_t, a_t) is the prior probability of searching the edge (s_t, a_t). The search always starts from the root node s_0 of the tree and ends when a leaf node s_L is reached. The action a_t is selected in state s_t based on a variant of the PUCT algorithm [25]:

a_t = argmax_a (Q(s_t, a) + U(s_t, a)),

in which Q(s_t, a_t) = W(s_t, a_t)/N(s_t, a_t) is the exploitation part, U(s_t, a_t) = c P(s_t, a_t) √(Σ_b N(s_t, b)) / (1 + N(s_t, a_t)) is the exploration part that balances it, and c is a constant that determines the level of exploration. When a leaf node is reached, it is expanded and evaluated. Then a reward v is backed up through each edge on the search route: the visit count of these edges is incremented as N(s_t, a_t) = N(s_t, a_t) + 1, and the action value is updated as W(s_t, a_t) = W(s_t, a_t) + v. After a number of searches from the root, the agent takes an action in the root state s_0 according to the policy π(a|s_0) = N(s_0, a) / Σ_b N(s_0, b), proportional to each edge's visit count. After the action is executed, the system transits to a new state s_1. This new state s_1 becomes the new root node, and the subtree below s_1 is retained as the new search tree; all other parts of the previous search tree are discarded. In this way, we can find the optimal policy for agent scheduling in each state.
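The selection and backup steps above can be sketched as follows. This is a minimal single-node illustration of the PUCT bookkeeping only; the class name `Edge`, the helper names and the toy priors are ours.

```python
import math

class Edge:
    """Statistics stored per edge (s, a): visit count N, total value W, prior P."""
    def __init__(self, prior):
        self.N, self.W, self.P = 0, 0.0, prior

def select_action(edges, c=1.0):
    """Pick the action maximizing Q(s,a) + U(s,a), as in the PUCT variant."""
    total_n = sum(e.N for e in edges.values())
    def score(e):
        q = e.W / e.N if e.N else 0.0                      # exploitation
        u = c * e.P * math.sqrt(total_n) / (1 + e.N)       # exploration
        return q + u
    return max(edges, key=lambda a: score(edges[a]))

def backup(path, v):
    """Propagate a leaf evaluation v through the edges of the search route."""
    for e in path:
        e.N += 1
        e.W += v

edges = {"a1": Edge(0.6), "a2": Edge(0.4)}
backup([edges["a1"]], -1.0)   # a1 tried once with a poor outcome
backup([edges["a2"]], +1.0)   # a2 tried once with a good outcome
assert select_action(edges) == "a2"  # Q now dominates the priors
```

In a full search, `select_action` is applied at every node from the root down to a leaf, and `backup` walks the same route in reverse.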

CNN in the HRC assembly chessboard
The assembly-chessboard game has an immense state space, and a deep neural network helps tackle this dimensionality issue with its powerful approximation ability. In addition, when the assembly job is represented as a chessboard, the spatial correlation within the board is of great significance. Therefore, a CNN is an ideal candidate to work with MCTS to solve the assembly-chessboard problem [25]. The CNN f_θ(·) takes all the information stored in the raw chessboard representation s as input, and outputs both move probabilities and a state value, i.e., (p, v) = f_θ(s). The vector of move probabilities p represents the probability of selecting each action. The value v is a scalar evaluation, estimating the task completion time starting from the current state s. This neural network combines the roles of the policy network and the value network in a single architecture.
The CNN is trained with data obtained from MCTS. In each state s_t, the MCTS search outputs policies π for taking actions. The actions selected by these policies are usually stronger than those selected from the CNN's raw probabilities. Therefore, MCTS can be regarded as a powerful policy improvement operator. The main idea of our reinforcement learning algorithm is to use this search operator repeatedly in a policy iteration procedure, in which the neural network's parameters are updated so that the move probabilities and the state value (p, v) = f_θ(s) more closely match the improved search probabilities and the actual task completion time; these new parameters are then used in the next iteration of self-play. Figure 3 illustrates the training pipeline.
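The construction of training targets from MCTS can be sketched as follows. The helper names and the value normalization are our assumptions; the policy target as normalized visit counts follows the AlphaGo Zero-style pipeline described above.

```python
def policy_target(visit_counts):
    """Policy target pi(a|s) proportional to the MCTS visit counts N(s, a)."""
    total = sum(visit_counts.values())
    return {a: n / total for a, n in visit_counts.items()}

def value_target(completion_time, baseline):
    """Hypothetical normalization: a shorter completion time gives a higher z."""
    return (baseline - completion_time) / baseline

# After 40 simulations at a root, the network is trained toward (pi, z):
pi = policy_target({"a1": 30, "a2": 10})
assert pi == {"a1": 0.75, "a2": 0.25}
z = value_target(completion_time=95, baseline=100)
assert abs(z - 0.05) < 1e-9
```

The pairs (s, pi, z) collected over many self-play games form the dataset on which the network parameters θ are fitted.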

Implementation and demonstration
With an understanding of the chessboard rules, we choose the assembly of a height-adjustable standing desk as an example. The job types are defined based on the following rationale:
• Matching and placing work can only be done by humans;
• Screwing, drilling and gluing work can only be done by robots;
• Flipping and rotating work can be done by either humans or robots.
There are 27 robot-only tasks, 19 human-only tasks, and 4 human-or-robot tasks. After fitting the tasks into the chessboard according to the rough assembly sequence, an assembly chessboard is obtained with height h = 15 and width w = 8. One robot operator and one human operator cooperate with each other to complete the assembly job. The neural network architecture used in this case is defined as follows:
• Input layer with size h × w × d = 15 × 8 × 3;
• Convolutional layer with 10 filters of kernel size 2 × 2, stride 1 and ReLU activation;
• Max-pooling layer with size 2 × 2;
• Convolutional layer with 10 filters of kernel size 2 × 2, stride 1 and ReLU activation.

For MCTS, the maximum search depth is set to 3 and the number of searches is limited to 30 per root node. The parameter of the PUCT algorithm is chosen as c = 100. We run the program on a PC with an Intel Core i5-8400, UHD Graphics 630 and 12 GB RAM. The training progress is shown in Figure 4. It can be seen that the proposed algorithm makes steady progress despite some fluctuations during the first few iterations. The shortest completion time is reached at the sixth iteration, after which the completion time remains steady at 95, indicating that convergence has been reached.
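One plausible way to build the 15 × 8 × 3 network input described above is one binary plane per stone type (human-only, robot-only, either). The text does not spell out the encoding, so this sketch is an assumption, and the `encode` helper is ours.

```python
def encode(board, h=15, w=8):
    """Encode a list of (row, col, type) chessboard cells as h x w x 3 planes."""
    planes = [[[0] * 3 for _ in range(w)] for _ in range(h)]
    channel = {"H": 0, "R": 1, "E": 2}  # human-only, robot-only, either
    for (row, col, kind) in board:
        planes[row][col][channel[kind]] = 1
    return planes

# Three occupied cells produce exactly three ones in the tensor:
x = encode([(0, 0, "H"), (0, 1, "R"), (1, 0, "E")])
assert x[0][0][0] == 1 and x[0][1][1] == 1 and x[1][0][2] == 1
assert sum(c for row in x for cell in row for c in cell) == 3
```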
For comparison purposes, two other approaches are considered in this case study. One is exhaustive search, in which we traverse all possible routes; the other randomly samples 1,000 trajectories to finish the assembly job. It turns out that exhaustive search is infeasible, since the number of possible trajectories explodes and eventually depletes the PC's memory. This situation is almost inevitable for a large-scale planning problem, since the state space grows exponentially with the number of decision steps. The exhaustive search program aborts at step 55, with 5 more steps yet to be traversed. We plot the total number of routes and the computing time against the decision steps in Figure 5. As for the other comparison approach, we limit the number of trajectories to 1,000 by random sampling to make it computationally feasible. This approach mimics the scenario in which both the human operator and the robot operator randomly choose tasks without considering the goal of minimizing job completion time. However, as shown in Figure 6, the shortest completion time is hardly ever achieved: only one out of the 1,000 routes gives the completion time of 95. The average job completion time is 100.58, far longer than the shortest completion time obtained by the proposed algorithm. Therefore, this approach cannot guarantee the optimality of the randomly sampled routes. In conclusion, the proposed algorithm is effective for the online scheduling of assembly tasks in HRC scenarios.
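The random-sampling baseline can be sketched as follows. Here `simulate_route` is a hypothetical stand-in for a simulator of the assembly chessboard, and the toy duration range is ours; only the sampling structure reflects the baseline described above.

```python
import random

def random_baseline(simulate_route, n_routes=1000, seed=0):
    """Sample n_routes random task sequences; report best and mean times."""
    rng = random.Random(seed)
    times = [simulate_route(rng) for _ in range(n_routes)]
    return min(times), sum(times) / len(times)

# Toy simulator: completion times spread between 95 (the optimum) and 110.
best, avg = random_baseline(lambda rng: rng.uniform(95, 110))
assert 95 <= best <= avg <= 110
```

With random play, the best sampled route is rarely optimal and the mean stays well above the optimum, which is the behavior reported in Figure 6.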

Conclusion
In this paper, we format the human-robot collaborative assembly process into an assembly chessboard with game rules. Benefiting from this formatting, we formulate the task scheduling problem in the human-robot collaborative assembly process as an RL problem. The MCTS-CNN algorithm based on AlphaGo Zero, which combines MCTS and a CNN, is used to solve the problem. A demonstration of assembling a desk with HRC shows the time-saving property of the approach in finding the optimal task scheduling policy and its effectiveness in improving the collaborative working efficiency of the HRC system. In future work, more complicated scenarios, such as random robot failures and human uncertainty, will be taken into consideration, and the safety of the HRC system will be another key point to investigate while improving its working efficiency.

Acknowledgement
This work is supported by U.S. National Science Foundation under Grant CMMI 1351160 and 1853454.