Learning-Based Automation of Robotic Assembly for Smart Manufacturing

For smart manufacturing, an automated robotic assembly system built upon an autoprogramming environment is necessary to reduce setup time and cost for robots that are engaged in frequent task reassignment. This article presents an approach to the autoprogramming of robotic assembly tasks with minimal human assistance. The approach integrates "robotic learning of assembly tasks from observation" and "robotic embodiment of learned assembly tasks in the form of skills." In the former, robots observe human assembly operations to learn a sequence of assembly tasks, which is formalized into a human assembly script. The latter transforms the human assembly script into a robot assembly script in which a sequence of robot-executable assembly tasks is defined based on action planning supported by workspace modeling and simulated retargeting. The assembly tasks, in the form of the robot assembly script, are then implemented via pretrained robot skills. These skills aim to enable robots to execute difficult tasks that involve inherent uncertainties and variations. We validate the proposed approach by building a prototype of the automated robotic assembly system for a power breaker and an electronic set-top box. The results verify that the proposed automated robotic assembly system is not only feasible but also viable, as it is associated with a dramatic reduction in the human effort required for automating robotic assembly.

I. INTRODUCTION
The variety and sophistication of the tasks assigned to robots engaged in a minimally structured smart cell are expected to grow exponentially. This challenge increases the demand for "leveling up" the automation used to set up the robots in smart cell-based production lines. Ultimately, this boils down to the problem of how to automate the programming of the robots in such a way that they can carry out a given task automatically with minimum human input.
For instance, although many collaborative robots have recently been developed and released to support the flexibility of work cells, there remains an immediate need to make teaching the robots to accomplish a task more direct and intuitive. This means that, in order to achieve simultaneous improvements in productivity and flexibility based on robotized smart work cells, it is critical to increase the level of automation in programming and robotic execution. As such, researchers in the field of smart manufacturing are increasingly interested in applying artificial intelligence (AI) to automating robot work planning and control.
The process of robotizing a target process by introducing or adding an industrial robot to a work cell consists of two steps. The first step comprises the preoperations, such as programming (P), teaching (T), and parameter tuning (PT), while the second step comprises the postoperations needed for adaptation to a real environment. The preliminary work (P, T, PT), which is performed by human workers, takes about three to seven months depending on the complexity of the target process. Thus, there exists an urgent need for improvement in the efficiency of this prework. In particular, the current practice of manual (P, T, PT) work by automation specialists should be replaced by a new paradigm of data-based automation, in which AI is expected to play a major role.
Note, however, that, for practical purposes, it may be necessary to consider potential obstacles to the use of AI-based automated systems in real manufacturing practices. For instance, major robot makers may be reluctant to allow their robots to be programmed and controlled by external sources. Therefore, this article proposes an AI-based automated assembly system that guides the robots in an assembly smart work cell to automatically create work plans by observing human demonstrations of the work process and content. The work plan includes the assembly order, the sequence of unit assembly operations, the target pose of assembly objects, and the skills required to complete the unit assembly operations. Although challenging, the development of the proposed AI-based automated assembly system is supported by the recent emergence of two significant technological advancements: 1) deep learning (DL)-based real-time segmentation, modeling, and understanding of static and dynamic scenes and 2) the increasing power of reinforcement learning (RL) for implementing robot skills.

A. Related Work
As a means of improving the prework (P, T, PT) for robotic assembly, the "BAXTER" robot [1] simplified P and T by eliminating P altogether and making task deployment easier through direct teaching. Nemec et al. [4] developed a new type of autonomous robot work cell, the "ReconCell," in which the robot learns skills for polishing and grinding processes from human demonstrations; "ReconCell" was designed for both large and small production lines [3]. Niekum et al. [6] developed a robot that assembles IKEA desks, for which they define the unit tasks that the robot can perform with the skills it learns. The furniture is assembled through a combination of these unit tasks represented as a designated sequence.
Diankov proposed an AI-based platform, "OpenRAVE" [5], for the development of robot programs. "OpenRAVE" carries out automatic task programming using the minimum available information on tasks and robots, including CAD files, sensor data, movement constraints, grip/contact constraints, an obstacle map, and robot-related information, such as robot kinematics, grippers, bases, sensors, controllers, and H/W interfaces. Dömel et al. [30] presented a modular software system that autonomously solves industrial manipulation tasks. A number of modules have been developed to represent knowledge on the state of the world, the objects to be handled, and the assembly processes such that, given the task and environmental requirements, they are flexibly organized through hierarchical flow control to perform autonomous mobile manipulation. Meanwhile, Syddansk Universitet, in the IntellAct project [2], developed a technology that allows robots to automatically learn semantic tasks based on an understanding of scenes and human actions.
Automatic manipulation of industrial objects requires object identification, segmentation, and pose estimation. "Kinema pick" [31] can recognize the shape and pose of various-sized boxes and determine, in real time, the robotic motions necessary to pick them up and place them on a conveyor belt. Deep Sliding Shapes [32] was proposed as a method of simultaneously performing object detection and object pose estimation based on an RGB-D camera. This method consists of DL networks that estimate a 3-D region, identify the objects in that region, and compute their poses. MaskTrack ConvNet [33] was presented as a method of obtaining a highly accurate representation of object segmentation from videos. In it, a DL network performs segmentation for the current frame using the segmentation result of the previous frame and the current frame image as input.

B. Problem Statement and Proposed Approach
In this article, we address the following problem: how to reduce the setup time and cost of reconfiguring and reprogramming robots engaged in a minimally structured smart assembly cell, so as to effectively deal with the frequent reassignment of assembly tasks of increasing sophistication. To deal with this problem, we recognize the need to upgrade the level of automation in the assembly line by endowing robots with the capacity for autoprogramming under minimal human input. To this end, we propose a robot autoprogramming environment that integrates robotic learning of assembly tasks from observing human assembly with robotic embodiment of the learned assembly tasks using pretrained skills. Furthermore, we demonstrate how recent advancements in DL and RL can impact the quality of implementation of the proposed approach.

II. SYSTEM OVERVIEW
The proposed approach to building an autoprogramming environment for robots engaged in a smart assembly cell consists of three parts: 1) robotic learning of assembly tasks through observation of the human assembly process, which results in a high-level description of learned assembly tasks as a formal human assembly script; 2) automatic transformation of the human assembly script into a robot assembly script, in which a sequence of robot-executable tasks is specified based on robot action planning supported by workspace modeling and simulated retargeting, referred to here as "robotic embodiment;" and 3) automatic execution of the sequence of robot-executable tasks defined in the robot assembly script with pretrained robot skills. The pretrained skills allow the robots to adapt to the uncertainties and variations present in real assembly environments. Refer to Fig. 1 for an illustration of how the three parts described above function to accomplish the assembly objective of the system.
To define the sequence of assembly tasks from observations, the first part performs DL-based recognition of a sequence of human assembly actions and of the grasping mode of the hand, in addition to the recognition of the objects involved in those assembly actions, from captured video images. To this end, we collect large-scale human assembly data sets to train and test the DL networks, including annotated data sets of human assembly actions with their sequences in the assembly process and the grasping modes involved, from the real assembly of various electromechanical products. The assembly sequences are then organized into finite state machines (FSMs) as generalized representations [34], [35]. An FSM filters a sequence of assembly actions recognized by the DL networks in such a way as to ensure the robustness of the identified assembly sequences and to describe the assembly states associated with those actions. This process of learning a sequence of assembly tasks from observation is supported by the recognition of the objects and the types of hand grasping motions that help to identify the assembly states associated with the actions.
The learned assembly tasks with their sequences are formalized into a human assembly script in terms of the assembly actions with their orders in the assembly sequence and the state transitions, grasping modes, and objects involved.
The second part, "robotic embodiment," aims to transform the human assembly script generated in the first part into a robot assembly script that can be executed by robots despite the differences between humans and robots in terms of their physical and perceptual capabilities. The robotic embodiment starts with planning a sequence of robot-executable actions with the Planning Domain Definition Language (PDDL) [11], so as to accomplish the intended assembly tasks defined in the human assembly script. To plan the robot actions, we predefine a set of robot actions to be used in the PDDL domain. The planned robot actions are subject to simulation-based verification prior to being entered into a robot assembly script. The verification process assesses whether the planned action sequences are executable based on simulation aided by 3-D workspace modeling. If they fail, the planned action sequences are modified through retargeting and replanning. Retargeting sets up novel assembly states that may allow the robots to accomplish the given assembly tasks even though these states are not specified in the human assembly script. The additional retargeted states are then fed back into PDDL for replanning the necessary robot actions. For example, when the robot cannot reach its target position due to obstacles, contrary to what is expected from the initial plan, retargeting and replanning become necessary. In this case, retargeting and replanning should be automatically invoked to generate, for instance, an action for obstacle removal so that the robot can reach the target position.
The third part implements the robot assembly script from the second part in real-world assembly cells. The implementation integrates the planned robot assembly actions with real-time recognition of the 3-D positions of the objects and tools involved. It should be noted, however, that simulation-based retargeting alone may not be sufficient for robots to accomplish the planned assembly tasks. This is especially true when the uncertainties and variations involved in assembly tasks are too great to be specified in the assembly script, as is often the case when an assembly task requires sophisticated motion and force trajectories, as well as interactions with objects and environments. To handle this issue, we pretrain the robots on a list of skills from which they can select to carry out real-world assembly operations. Here, such robot skills are represented by deep convolutional neural networks (DCNNs) [10], [20], in which the supervisor is capable of self-improvement through imitation learning (IL) [21]-[23] and RL [18].

A. Assembly1.0: Data Sets Collected for Implementation
For the implementation of the proposed learning-based automated assembly system, large-scale training and testing data sets are required. We have collected and annotated the required data sets into a database referred to here as Assembly1.0. Assembly1.0 consists of a video data set of human assembly action sequences for learning human assembly operations, a video data set of human object grasping actions during assembly operations, and a 3-D point cloud data set of the industrial objects involved in the assembly and, in part, their CAD files. These data sets are collected to represent typical assemblies of industrial products in real manufacturing settings. Specifically, the data set for human assembly action sequences includes such unit actions as "approach," "reach," "pickup," "put," "flip," "plug," "screw," and "release," together with the objects involved in the unit actions. On the other hand, the data set for human object grasping actions covers the right and left hands with 33 grasping types in association with eight object categories: "bolt," "upper body," "cable," "copper," "hexagon wrench," "iron," "bottom body," and "busbar." Table 1 summarizes the specifications of the data sets included in Assembly1.0. We plan to make Assembly1.0 publicly accessible for research purposes through the AI-Hub Korea: https://www.aihub.or.kr/.

III. LEARNING ASSEMBLY TASKS FROM OBSERVATION: HUMAN ASSEMBLY SCRIPT
In the first part of the proposed automated assembly process, the robots learn assembly tasks by observing human workers. The objective is to extract high-level descriptions of assembly action sequences, including the action classes, the order of actions in the sequence, the state transitions associated with the actions, the hand grasp types, and the target objects involved. As shown in Fig. 1, this part entails DL-based recognition of assembly action sequences, which is supported by the recognition of the hand grasp types and the objects involved. The assembly action sequences thus recognized are filtered with an FSM for robustness. The results are formally represented by a human assembly script, which is then transferred to the second part, robotic embodiment.

A. Recognition of Hand Grasp Type for Assembly Intent
Recognition of hand grasp type plays an important role in identifying assembly states, as grasp types indicate how to manage objects and tools by hand [12], [13] or, in short, the assembly intent. However, it is difficult to correctly recognize the different grasp types by visually observing human hands as the hand motions involved in the assembly are quite diverse and can be obscured when holding an object. When manipulating objects, humans either grasp tools and parts or lift objects. Here, we focus on the grasp types associated with one hand to identify human assembly intent. Feix et al. [14] proposed a taxonomy for human grasp types in everyday life by integrating and modifying various existing grasp taxonomies. We adopt their taxonomy of 33 grasp types, of which ten are selected for use based on their strong relevance to assembly: cylindrical, spherical, palmar, tip, lateral, hook, tripod, prismatic finger, disk, and index finger extension. We also select eight categories of objects associated with the ten grasp types. Fig. 2 illustrates the data flow of the proposed grasp type recognition, while Fig. 3 shows the detailed architecture of the grasp recognition system.
The proposed grasp type recognition system is based on two stages of processing. First, we detect all of the hands and objects in the image. Then, every possible combination of hand and object that could be linked to a hand grasp is examined. The detection of hands and objects is implemented by RetinaNet [15], as shown in Fig. 3, while the recognition of grasp types from the detected hands is done by a two-layer convolutional neural network (CNN) [8] with an average pooling layer and a fully connected layer. Note that RetinaNet consists of a backbone network of ResNet-101 [16], a feature pyramid network [17], and subnetworks for classification and box regression.
More specifically, to identify a possible combination of hand and object as a candidate for grasp type recognition, we first consider the physical distance between the two, since the hand and object should be near each other. To measure this physical distance, we compute the overlap between the respective bounding boxes of the hand and object. To classify the grasp type of each candidate, we apply region of interest (RoI) pooling [18] to the respective hand and object bounding boxes to generate input for the grasp classifier. Fig. 4 illustrates the two stages of the proposed grasp type recognition system. Fig. 4(a) shows an input image to be processed for grasp type recognition. Fig. 4(b) represents the result of the first stage, where all hands and objects present in the scene are detected with their respective bounding boxes. Note that the bounding boxes are colored to represent the types of both hand and object, along with a confidence score for each. Fig. 4(c) shows the result of the second stage, demonstrating the grasp type classification for the selected hand-object candidate. In Fig. 4(c), right and left hands are differentiated using colors, and the classification probabilities of both the grasp type and the object type are marked.

Fig. 4. Illustration of the proposed method of grasping mode recognition: (a) input image, (b) output of the hand/object detector, and (c) output of the grasping mode classifier.
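The hand-object pairing step can be made concrete in a few lines of code. The following is a minimal sketch, assuming axis-aligned boxes and an illustrative IoU threshold (the paper does not state its exact overlap measure or threshold):

```python
# Minimal sketch of hand-object candidate pairing by bounding-box
# overlap (IoU); the 0.1 threshold is an illustrative assumption.
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2); returns intersection-over-union."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area(box_a) + area(box_b) - inter)

def candidate_pairs(hand_boxes, object_boxes, threshold=0.1):
    """Every hand-object pair whose boxes overlap enough becomes a
    grasp candidate passed to the RoI-pooled grasp classifier."""
    return [(h, o) for h in hand_boxes for o in object_boxes
            if iou(h, o) > threshold]
```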
To evaluate the performance of the proposed grasp type recognition algorithm, we collect a hand grasp video data set from real assembly environments, consisting of videos of power breaker/air circuit breaker assembly by human workers. The video data have a resolution of 1920 × 1080 at 30 frames/s. The videos depict 15 workers repeating their assembly tasks, five times each, with each assembly task taking approximately four minutes to complete. Refer to Fig. 4 and Table 1 for samples of the collected data and for the specification of the grasping action video data set, respectively. This process results in 7,768 image frames, which are divided into 5,827 frames for training, 388 for validation, and 1,549 for testing. The results of the performance evaluation indicate that the proposed system achieves an average accuracy of 98.26% over ten grasp types and eight object types.

B. Assembly Action Sequence Recognition
Assembly action sequence recognition predicts human assembly actions in the form of an action sequence from video input, as shown in Fig. 5. To improve the adaptability of our system to various action recognition scenarios, we adopt a well-established DL model, a 3-D CNN with the VGG-M architecture, as the target model. Note that, to make our system as light as possible, we configure the model with a feedforward CNN architecture without a recurrent loop and use only RGB images for action sequence recognition. This provides us with near real-time prediction performance. Furthermore, to improve the robustness of the learning process, we augment the training data by applying a random 2-D projective transformation, or homography, to the collected video frames. Note that the action recognition system takes the hand grasp types and the recognized objects of interest as additional inputs to heighten accuracy. Accordingly, the resulting action recognition system is highly robust to various scenes from different viewpoints.
It is imperative that our DL model be both effective and efficient. One way to improve the power of a DL model is to use a deeper architecture; however, this comes at the expense of computational cost. To achieve satisfactory performance while keeping forward processing as close to real time as possible, we leverage a machine learning paradigm called "knowledge distillation," by which the knowledge of a larger model is transferred to a lighter model. In this scenario, the larger model is referred to as the teacher network, while the lighter model is the student network. We use ResNet152 [9] pretrained on the ImageNet data set as the teacher network and the 3-D CNN with the VGG-M architecture as the student network. The student network gains knowledge from the teacher network by mimicking the input-output mapping function of the teacher network. In particular, the student network mimics the intermediate representations of the teacher network by backpropagating the L2 distance between the corresponding intermediate outputs as an additional training loss, as illustrated in Fig. 6. As a result, we are able to train the student network to achieve high prediction accuracy while sustaining high speed. Specifically, the accuracy of action sequence prediction improves from 91.67% without knowledge distillation to 94.43% with knowledge distillation when testing on the human assembly action sequence data set shown in Table 1. In addition, the FSM helps further improve the accuracy from 94.43% to 94.72%. Knowledge distillation offers not only an improvement in accuracy but also efficiency in computation. For instance, the student network takes about 0.1 s, i.e., ten clips/s, to process a video clip consisting of ten consecutive frames. In comparison, it takes about 0.3 s, or 3.33 clips/s, for the teacher network, ResNet152, to process the same video clip.
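The distillation objective described above can be sketched roughly as follows. This assumes PyTorch; the weighting factor lam and the availability of matched intermediate features are assumptions, since the exact formulation is not given here:

```python
# Sketch of the knowledge-distillation loss: task loss plus the L2
# distance between intermediate features of student and teacher.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, student_feat,
                      teacher_feat, labels, lam=0.5):
    task = F.cross_entropy(student_logits, labels)        # action labels
    # L2 distance between intermediate representations (teacher frozen).
    distill = F.mse_loss(student_feat, teacher_feat.detach())
    return task + lam * distill
```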
As stated in the system overview section, human assembly sequences can be organized into an FSM as a generalized representation. An FSM representation of an assembly sequence is useful for filtering the predicted action sequences in a postprocessing step. In an FSM, the current state changes to another state based on the action predicted by the action recognition system, as illustrated in Fig. 7. An FSM can be constructed from assembly sequences through learning, in the form of either probabilistic or deterministic automata. The FSM is represented by a transition matrix, each element of which indicates, either probabilistically or deterministically, what the next state should be, given the current state and the action predicted by the DL model. For the sake of simplicity, we predefine a deterministic FSM by grammatically structuring the collected assembly sequences. The FSM offers a concise form of our prior knowledge about the possible order of the action sequences. For example, we know that, if the current state is "reaching bus-bar," the next state cannot be "pickup upper part." Fig. 7 shows an example of FSM filtering of the predicted action sequence, where the state changes only when the action prediction is "put plate," while the prediction of "other actions" maintains the present state. In particular, to prevent an "illegal transition" from happening, we set the elements of the transition matrix that correspond to illegal transitions to 0 so that the FSM does not predict a wrong action sequence. In summary, FSM-based filtering of the action predictions and the associated state transitions helps to sustain temporal consistency, alleviating the temporal flickering problem in action sequence prediction.
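Such a deterministic transition table can be implemented directly. The sketch below uses illustrative states and actions, not the full assembly grammar; an omitted entry plays the role of a 0 element in the transition matrix:

```python
# Sketch of deterministic FSM filtering of predicted actions; states
# and actions are illustrative, not the paper's full assembly grammar.
TRANSITIONS = {
    ("idle", "reach bus-bar"): "reaching bus-bar",
    ("reaching bus-bar", "pickup bus-bar"): "holding bus-bar",
    ("holding bus-bar", "put bus-bar"): "bus-bar placed",
    # ("reaching bus-bar", "pickup upper part") is illegal: omitted,
    # which corresponds to a 0 entry in the transition matrix.
}

def fsm_filter(state, predicted_action):
    """Advance only on legal transitions; otherwise keep the state,
    suppressing temporal flickering in the predicted sequence."""
    return TRANSITIONS.get((state, predicted_action), state)

state = "idle"
for action in ["reach bus-bar", "pickup upper part", "pickup bus-bar"]:
    state = fsm_filter(state, action)   # illegal prediction is ignored
```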

C. Human Assembly Script
Once the system recognizes the intent and action sequence of human assembly and the objects involved (from the demonstration), the results are formalized into the human assembly script, as illustrated in Fig. 8.

IV. ROBOTIC EMBODIMENT OF LEARNED ASSEMBLY TASKS: ROBOT ASSEMBLY SCRIPT
The second part of the proposed automated assembly involves the transformation of the human assembly script into a robot assembly script that is executable by robots in real assembly environments. This so-called robotic embodiment process aims to generate robot-executable action plans, based on a set of robot actions predefined in the PDDL domain, that accomplish the assembly tasks defined in the human assembly script. Robot action planning also relies on 3-D modeling of the assembly workspace. When planning robot-executable actions, simulation-based verification and retargeting of the initially planned robot actions play a key role. If the planned robot actions are found not to be executable, robot action planning retargets the assembly states to generate compensatory robot actions that make the assembly task robot-executable.

A. 3-D Modeling of the Assembly Workspace
The robotic embodiment of learned assembly tasks requires modeling of the 3-D assembly workspace. Modeling enables the robots to simulate or execute the given assembly task based on the 3-D geometric information of the objects and tools involved in the assembly process. We propose the 3-D workspace modeling system illustrated in Fig. 9. The proposed system detects and recognizes objects of interest, estimates their 3-D poses, and overlays the detected objects with the applicable CAD models based on the 2-D images and 3-D point clouds of the workspace captured by a 3-D camera. As shown in Fig. 9, the system first detects the objects of interest and the featured parts of those objects based on a cascaded object detector formed by YOLO Ver. 3 [7], a serial connection of YOLO 1 and YOLO 2. In addition, we devise an object classification net, Part Net, which takes the object labels from the object detector and the featured parts of the detected objects as input to finalize object labeling. Part Net aims to correct labeling errors made by YOLO 1 to meet the high recognition rate required by the industry. We adopt an engineering approach to 3-D pose estimation of objects for the sake of accuracy. That is, we extract and localize the geometric features of objects in 3-D based on the captured 2-D image and 3-D point cloud. The extracted and localized geometric features are then matched with those of the corresponding CAD model for registration. Then, the iterative closest point (ICP) procedure is applied between the 3-D point cloud and the registered CAD model to refine the object's 3-D pose with high accuracy. The input and output parameters and the performance associated with individual modules are illustrated in Fig. 10.
Note that the objective of detecting featured object parts is twofold: 1) to provide high accuracy in object recognition by correcting errors in the initial object labeling by the object detector and 2) to provide efficiency and accuracy in the extraction of the geometric features of objects that are essential for 3-D pose estimation. For instance, geometric features, such as points, line segments, and circles, tend to be associated with the featured parts of objects, as shown on the right-hand side of Fig. 10. As such, the geometric feature extractor in Fig. 10 can extract the geometric features of individual objects simply by applying well-established point and line-segment extraction methods to the bounding boxes of the detected parts [38], [39]. These detected features are then transformed into 3-D geometric features by incorporating the 3-D point cloud captured by the 3-D camera, as illustrated in Fig. 11.
The extracted 3-D geometric features are used to estimate the 3-D pose of individual objects. Fig. 12 illustrates the process of matching the extracted 3-D geometric features against the ground-truth features predefined in the object CAD models for the initial 3-D pose estimation. Note that, since industrial objects are often configured with the same geometric features on different subparts, for instance, f2_1/f2_2 and f3_1/f3_2 in Fig. 12, matching the extracted 3-D geometric features to the ground-truth features requires the use of geometric context among the features, such as inter-feature distances and angles, as additional matching cues. If the object-to-CAD feature matching, that is, matching the detected part features with those of the predefined CAD models, is insufficiently precise for robotic assembly, ICP is applied to the two point clouds, one from the 3-D camera and the other from the CAD model posed by the feature matching, to refine the 3-D poses of individual objects more precisely.
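For concreteness, once feature correspondences are established using the geometric context above, the initial rigid pose can be computed in closed form with the standard Kabsch/Procrustes method before ICP refinement. The following is a minimal sketch under the assumption that the correspondences are already one-to-one:

```python
# Sketch: closed-form rigid pose from matched 3-D features (Kabsch),
# used as the initial pose that ICP subsequently refines.
import numpy as np

def initial_pose(cad_feats, scene_feats):
    """cad_feats, scene_feats: (N, 3) arrays of corresponding 3-D
    features (points, circle centers, line-segment endpoints)."""
    mu_c, mu_s = cad_feats.mean(0), scene_feats.mean(0)
    H = (cad_feats - mu_c).T @ (scene_feats - mu_s)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T    # rotation CAD -> scene
    t = mu_s - R @ mu_c                        # translation CAD -> scene
    return R, t
```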
We implement the proposed 3-D workspace modeling system for experimental verification. For this purpose, we collect real industrial objects: 20 categories and 100 objects are used as the training and testing data sets. In general, we obtain an object classification success rate of over 99%, with less than 1 mm and 1° of pose error. To illustrate the effectiveness of the proposed 3-D workspace modeling process, Fig. 13 shows the final 3-D poses of individual objects estimated after the object-CAD model feature matching and the ICP-based fine-tuning.

B. Robot Action Planning With Retargeting
For robot action planning, we employ PDDL to generate a sequence of robot actions [29] that can accomplish the assembly tasks defined in the human assembly script. PDDL is widely used because it can generate primitive action sequences at a task level of abstraction. The grammatical structure of PDDL consists of two types of files: the domain file and the problem file. The domain file defines actions in terms of their conditions and effects as state transitions. The problem file specifies the initial and terminal states associated with the objects to be handled. In our system, the sequence of assembly states for the robot to accomplish is automatically extracted from the human assembly script and is defined in the PDDL problem files, as illustrated in Fig. 14. The initial and terminal states of each problem file include the initial and terminal poses of the objects and of the grippers of the robot arms, together with the physical relationships between them. On the other hand, the PDDL domain files are selected from a library of domain files containing a set of predefined robot actions.
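To make the script-to-PDDL step concrete, the sketch below generates a PDDL problem file from states extracted from a human assembly script. The domain name, predicates, and objects are illustrative assumptions; the paper's actual PDDL vocabulary is not reproduced here:

```python
# Sketch: generating a PDDL problem file from states extracted from
# the human assembly script; predicates and objects are illustrative.
def make_pddl_problem(name, objects, init_facts, goal_facts):
    objs = " ".join(objects)
    init = "\n    ".join(f"({f})" for f in init_facts)
    goal = " ".join(f"({f})" for f in goal_facts)
    return (f"(define (problem {name}) (:domain assembly)\n"
            f"  (:objects {objs})\n"
            f"  (:init {init})\n"
            f"  (:goal (and {goal})))\n")

print(make_pddl_problem(
    "breaker-step1",
    ["busbar", "lower-frame", "gripper1"],
    ["on-table busbar", "empty gripper1"],
    ["assembled busbar lower-frame"]))
```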
As shown in Fig. 14, the nominal action plans generated by the PDDL engine are subject to simulation-based verification to assess their executability by robots. For example, suppose that the nominal action plan for a robot is to plug a USB memory stick into a set-top box, where the nominal plan includes "approach USB memory stick," "pickup USB memory stick," and "insert USB memory stick into set-top box" by directly inheriting the sequence of states from the human assembly script. However, the simulated plan verification process reveals that the nominal plan is not robot-executable because the slot in the set-top box is already connected to another device. In this case, the robot has to remove the device from the slot before inserting the USB memory stick, which we refer to here as robot retargeting. In general, robot retargeting aims to find a solution that addresses errors caused by unexpected situational variation and by the difference in physical and perceptual capabilities between humans and robots, such as the robot kinematic constraints associated with joint limits and degrees of freedom. Robot retargeting is done by adding states for the robot to accomplish. For instance, "removing another device from the USB slot of the set-top box" is the complementary robot task assigned as the result of robot retargeting.
Robot retargeting also involves the control of waypoints in the assembly workspace. Waypoint control is necessary when the initial waypoints fail to produce the required degree of precision or to avoid collisions during assembly. For example, consider assembling a bus bar into a power breaker, where several positions and postures are defined as the initial waypoints. The initial waypoints through which the end-effector of the robot must pass become subject to retargeting, for instance, when the work cell configuration is altered, e.g., when the initial position of the bus bar changes. As another example, due to the limitations of the sensor systems, the 3-D poses of part features, such as holes, that are important for the assembly operation may not be recognized as accurately as required. In this case, we need to incorporate the accurate part geometry from CAD to precisely define the 3-D poses of the part features and automatically modify the waypoints for retargeting. Algorithm 1 shows how to assess waypoint reachability and generate trajectories to reach the waypoints. If a waypoint is unreachable, the system analyzes the CAD files of the parts in the workspace to automatically modify the waypoint with the help of 3-D workspace modeling.
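The listing of Algorithm 1 is not reproduced in this text, so the following is only a hedged sketch of its reported logic; reachable() and nearest_cad_feature_pose() are assumed stand-ins for the IK/collision test and the CAD-based waypoint correction:

```python
# Sketch of the waypoint-reachability loop attributed to Algorithm 1;
# the two function arguments are assumed stand-ins, not the authors'
# actual reachability test or CAD-based correction routine.
def plan_through_waypoints(waypoints, reachable, nearest_cad_feature_pose):
    trajectory = []
    for wp in waypoints:
        if not reachable(wp):
            # Modify the waypoint from the part's CAD geometry, as
            # supported by the 3-D workspace model.
            wp = nearest_cad_feature_pose(wp)
            if not reachable(wp):
                return None            # trigger retargeting/replanning
        trajectory.append(wp)
    return trajectory
```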

C. Robot Assembly Script
The application of robot action planning to the assembly tasks defined in a human assembly script, with the support of plan retargeting and 3-D workspace modeling, leads to the creation of a robot assembly script as a formal representation of a robot-executable task, as illustrated in Fig. 16.

D. Case Study: Power Breaker Assembly
In this section, we present a case study using a power breaker assembly in a manufacturing setting to show how a sequence of unit robot assembly tasks specified in the robot assembly script is executed in adaptation to a real-world assembly environment. The adaptation to a real-world assembly environment is aided by vision-based real-time recognition of the 3-D poses of the objects and tools to be manipulated during assembly. As described in Section IV-A, the real-time 3-D pose estimation of objects and tools involves segmentation of objects and tools from a workspace image, feature extraction and matching with their CAD counterparts, and 3-D pose refinement by aligning their CAD representations with their 3-D point cloud representations obtained from a 3-D camera.
For the assembly of a power breaker, or an air circuit breaker, we set up a collaborative robot system consisting of a single-arm robot with a suction gripper and a dual-arm robot with sliding grippers, as illustrated in Fig. 17. Power breaker assembly requires the assembly of two frames and five parts: the upper and lower frames and the fixed contact, fixed contact cover, moving contact, CT case, and busbar. The single-arm robot picks up unaligned fixed contact cover parts and transfers them to the dual-arm robot. The dual-arm robot performs the assembly of the received parts and the self-picked parts on the lower frame.
The sequence of robot unit tasks for the assembly of the power breaker is automatically generated as the robot observes a human assembly process consisting of 11 unit human assembly actions. Each human assembly action is then converted to the corresponding robot unit task, or unit robot operation, that matches the human assembly action. The sequence of unit robot operations thus generated, illustrated in Fig. 18, has the same order as that of the unit human assembly actions. Note that some unit robot operations, such as delivering parts between the two robots, are added to the sequence of unit robot operations generated directly from that of the unit human assembly actions.
The 3-D pose of a part, for instance, a fixed contact cover randomly stacked in a bin, is recognized through a process consisting of part detection and feature extraction by the cascaded object detector, 3-D point cloud representation of the part from the 3-D bin image captured by the 3-D camera, matching between the CAD features of the part and the extracted part features, and 3-D point cloud registration between the CAD model and the captured point cloud of the part, as described in detail in Section IV-A. The order in which the randomly placed parts are picked up follows the order of their height. Refer to Fig. 19 for an illustration of this process.
The path the robot takes when picking up and delivering a part, while avoiding collisions with objects in the environment, is generated based on the geometric shape of the part, the kinematics of the robot, and the 3-D model of the environment, especially for bolting/inspection operations, as shown in Fig. 20. For instance, to pick up the fixed contact cover with a vacuum gripper, the part reference vector, the suction reference point, the normal vector of the suction plane, and the orientation angle of the part must be designated. Note that, when the direction of the normal vector is incorrect, the suction plate may not be in close contact with the surface of the part, and thus, the pickup or delivery operation may fail.

V. TASK EXECUTION WITH LEARNED SKILLS
When assembly environments can be precisely modeled by sensors such that the target state involved in a robot unit task is well-defined, sensor-guided robot actions may be sufficient to complete the unit task, as described in Section IV. However, when the robot unit tasks defined in the robot assembly script are subject to uncertainties and variations that are too difficult to model and control, the robot needs to resort to skills to overcome such hurdles. To this end, we predefine a set of robot assembly skills that are pretrained for the robot to exercise on the unit tasks that require them. Refer to Table 2 for an exemplary list of robot skills associated with robot unit tasks.
We propose that the set of robot skills be pretrained by integrating RL with DL and IL. More specifically, we represent each learned robot skill by a DCNN trained by a supervisor capable of self-improvement with RL and IL. The skill-embedding DCNNs provide both skill classification and motion generation, so it is possible to use DCNNs to select an appropriate skill from multiple choices. The learned skills can be improved upon (motion paths can be optimized and execution time can be reduced) based on policy learning by weighting exploration with the returns (PoWER). Note that a robot skill can be modeled either by a dynamic movement primitive (DMP) for a task with milder uncertainties and variations, e.g., an insertion task, or by a DCNN for a task with higher uncertainties and variations, e.g., a grasping task.

A. Learning Skills With DCNN
Unlike DL approaches that require a large amount of training data [17]-[20] to reach high performance levels, IL approaches rely on a small amount of data from human demonstrations [21]-[23] but come at the potential cost of reduced performance. Here, we integrate IL, DL, and RL in such a way as to learn and improve upon skills by compensating for their shortcomings and maximizing their strengths. To this end, we introduce a supervisor that generates a sufficient amount of training data for the skill-embedding DCNN [25]. The supervisor first learns from human demonstrations based on an IL process and then carries out self-improvement, or self-optimization, through an RL process. DCNNs are then trained on the skill using the training data generated by the supervisor. Fig. 21(a) illustrates the learning process, which combines the IL, RL, and DL processes, as described above.

Fig. 21. Entire learning and execution processes: (a) learning process, involving a mixture of IL, RL, and DL, and (b) execution process that is done after the learning process.

Here, the supervisor uses the DMP [36], [37] to learn from human demonstrations and to augment skill-related data. A DMP is defined as

τV̇ = K(X_g − X) − DV + (X_g − X_0)ξ(s)    (1)
τẊ = V    (2)

where X, V, X_0, and X_g represent the position, velocity, initial position, and target position vectors, respectively. Similar to a spring-damper system, the DMP ensures convergence to the final goal or target [24], depending upon the external force term ξ. Note that τ, K, and D indicate the constants that are used to adjust the time scale, spring, and damping terms, respectively. The external force term, which is learned from the human demonstration data set, is defined as

ξ(s) = (Σ_{i=1}^{L} ψ_i(s) ω_i s) / (Σ_{i=1}^{L} ψ_i(s))    (3)

where ψ_i(s) = exp(−h_i(s − c_i)²) is a Gaussian basis function, with c_i and h_i, respectively, representing the center and the variance. The parameters L and ω_i indicate, respectively, the number of Gaussian basis functions and their weighting values. The term ξ is directly dependent on the phase variable s, which monotonically decreases from 1 to 0, independent of time, and is obtained from the following canonical system:

τṡ = −αs    (4)

where α is a predefined constant.

A DMP is learned from the average path of several demonstrations. First, the average path X(t) is recorded, and its derivatives, V(t) and V̇(t), are computed for each time step t = 0, . . . , T. Then, the canonical system s(t) is computed for an appropriately adjusted temporal scaling parameter τ, which is predefined. Based on (1), ξ_target(s) is computed according to

ξ_target(s) = (τV̇ + DV − K(X_g − X)) / (X_g − X_0)    (5)

where X_0 and X_g are set to X(0) and X(T), respectively. With ξ_target(s), ξ(s) can be estimated for motion generation by regressing ω_i in (3).

Now, we define a skill by the triple Θ as follows:

Θ = {Ω, X_g, T}    (6)

where Ω indicates the parameters of the external force term of a DMP. Note that the target X_g and the total length of the policy T are added to Θ in order to optimize the skill through an RL process. The DMP generates a motion trajectory that satisfies the target X_g within the length of the policy T. We use a DCNN to implement the skill due to its proven strength in generalization with supervised learning, given a sufficient number of training data points [25], [26]. The skill representation by DCNNs deals with the case in which the accurate target pose is not available, so a DMP alone cannot represent the skill; the target pose is available only with uncertainties, which can be handled by the generalization capability of the DCNN.
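As an illustration of (1)-(4), a one-dimensional DMP can be rolled out with simple Euler integration. The sketch below uses assumed gains and basis parameters and omits the weight regression of (5), evaluating an already-learned forcing term:

```python
# Sketch: 1-D DMP rollout implementing Eqs. (1)-(4) with Euler
# integration; K, D, alpha, and the basis parameters are assumed values.
import numpy as np

def dmp_rollout(w, x0, xg, tau=1.0, K=100.0, D=20.0, alpha=4.0,
                dt=0.01, T=1.0):
    L = len(w)
    c = np.exp(-alpha * np.linspace(0, 1, L))   # basis centers in s
    h = L / (c ** 2)                            # assumed basis widths
    x, v, s, path = x0, 0.0, 1.0, [x0]
    for _ in range(int(T / dt)):
        psi = np.exp(-h * (s - c) ** 2)
        xi = (psi * w).sum() / (psi.sum() + 1e-10) * s           # Eq. (3)
        v += dt / tau * (K * (xg - x) - D * v + (xg - x0) * xi)  # Eq. (1)
        x += dt / tau * v                                        # Eq. (2)
        s += dt / tau * (-alpha * s)                             # Eq. (4)
        path.append(x)
    return np.array(path)

path = dmp_rollout(np.zeros(10), x0=0.0, xg=0.5)  # zero forcing: converges
```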
Furthermore, here, we allow robots to carry out self-supervised learning of DCNNs based on the DMPs explored by RL, starting with the initial DMP obtained from human demonstrations. To this end, the PoWER algorithm is used to improve the policy parameters of the DMPs and to perform the self-supervised learning process for the DCNNs, as presented in detail in Section V-B. As shown in Fig. 21(a), an RL-updated DMP generates robot motion trajectories, during which a number of training data points, including images, F/T sensor readings, and the joint and end-effector configurations of the robot, are collected. We let the robot collect a sufficient amount of data by repeating this process in various situations. DCNNs are trained by minimizing the loss between the reference motion trajectory generated by the DMP and the output of the DCNN, with the current image and joint configuration data given as inputs. After the learning process is completed, the robots can perform the tasks by generating robot control signals in the appropriate situations based only on the DCNNs, as shown in Fig. 21(b).

Fig. 22 illustrates the structure of the DCNN designed to represent a skill, which is similar to the one proposed in [27]. The DCNN consists of three convolutional layers, a spatial softmax layer, and three fully connected layers. The DCNN takes images (848 × 480 pixels) as input and outputs the 6-D position/orientation of an end-effector. The three convolutional layers contain 64, 32, and 32 filters, respectively. Unlike a DCNN for object recognition, the pooling process is excluded from the DCNN for motion generation because the accuracy of the target position in the image is important; ReLU is used as the activation function in every layer. Here, the spatial softmax computes expected positions to convert the pixelwise representations estimated in the convolutional layers into spatial coordinate representations, which can be manipulated by the fully connected layers. That is, the spatial softmax helps to estimate 3-D positions or motor torques that the robot can act upon [26]. In the fully connected layers, the feature vectors have 64, 32, and six dimensions.
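The described architecture can be sketched as follows, assuming PyTorch; the kernel sizes and strides are assumptions (the paper specifies only the filter counts, the spatial softmax, the fully connected widths, and the absence of pooling):

```python
# A minimal sketch of the skill-embedding DCNN; kernel sizes/strides
# are assumed, while filter counts (64/32/32), the spatial softmax,
# the FC widths (64/32/6), and the lack of pooling follow the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    """Converts per-channel activation maps into expected (x, y)
    image coordinates, yielding spatial feature points."""
    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        probs = F.softmax(feat.view(b, c, h * w), dim=-1)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
            indexing="ij")
        xs = xs.reshape(1, 1, h * w).to(feat.device)
        ys = ys.reshape(1, 1, h * w).to(feat.device)
        ex = (probs * xs).sum(-1)                 # expected x per channel
        ey = (probs * ys).sum(-1)                 # expected y per channel
        return torch.cat([ex, ey], dim=1)         # (B, 2C) coordinates

class SkillDCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # No pooling: positional accuracy matters for motion generation.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(),
            nn.Conv2d(64, 32, 5), nn.ReLU(),
            nn.Conv2d(32, 32, 5), nn.ReLU())
        self.ssmax = SpatialSoftmax()
        self.fc = nn.Sequential(
            nn.Linear(64, 64), nn.ReLU(),   # 2 x 32 channels -> 64 coords
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 6))               # 6-D end-effector pose

    def forward(self, img):
        return self.fc(self.ssmax(self.conv(img)))
```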

B. Improving Skills Through RL
We apply RL to the aforementioned skills trained on demonstrations to improve performance. As shown in Algorithm 2, the skill parameters of DMPs are optimized into Θ * based on the RL process, iPoWER, as a means of improving the DCNN supervisor. The iPoWER algorithm shown in Algorithm 2 represents a slightly modified version of the original PoWER algorithm [28] with reduced execution time.
The iPoWER algorithm is based on a deterministic policy a = Ω^T Ψ(X, t) with the weighting parameters Ω and the basis functions Ψ of a DMP [24]. When optimizing a DMP, however, this policy is turned into a stochastic policy using additive exploration ε(X, t) for model-free RL. That is, the policy π(a_t | X_t, t) is represented in the form a = Ω^T Ψ(X, t) + ε(Ψ(X, t)) with ε_{t,ij} ∼ N(0, σ_ij²), where σ_ij is a metaparameter of the exploration that is also subject to optimization in this algorithm. In the iPoWER algorithm, the length of a corresponding DMP can be reduced by the stop signal t_stop when X_t = X_g and t < T_k. This means that, during the RL process, the robot arrives at the target more quickly than in the human demonstrations. The stop signal t_stop is generated when the robot reaches the target within an extremely small margin of X_g.
To calculate the expected return values for the improvement process, the reward function should be defined.

Algorithm 2 iPoWER Algorithm for Improving the Parameters of Skills Considering Execution Time Step and Path Optimization
1: Input: a set of initial parameters Θ = {Θ_1, Θ_2, . . . , Θ_N} of all skills.
2: Using the initial parameters Θ_i = {Ω_i, X_g_i, T_i} of the skill selected by the CNN with the maximum likelihood,
3: set the initial parameters Θ_k = Θ_0 = {Ω_0, X_g, T_0} of a motor skill.
4: while true do
5:   Sampling: using Ω_k, X_g, and T_k, generate a rollout (X) from a = (Ω_k + ε_t)^T Ψ(X, t) based on Eq. (1), with exploration ε_{t,ij} ∼ N(0, σ_ij²) as a stochastic policy.
6:   if X_t = X_g and t < T_k then
7:     set T̄ = t and collect all information (t, X_t, a_t, X_{t+1}, ε_t, r_{t+1}) for t = {1, 2, . . . , T̄ + 1}.

The reward r is generated according to

r(t) = exp(−[α(X^g − X(t)) + β(1/(Y^s − Y(t)))])    (7)

where X and Y indicate robot state values, such as camera images, F/T sensor data, and tool orientations, which can be measured. The superscripts g and s denote the target and starting values of each variable, depending on the given task. Here, the term (X^g − X(t)) is used to obtain a high return value when the robot configuration is close to the target values, and the term 1/(Y^s − Y(t)) is used to obtain a high return value when it is far from the starting value. The parameters α and β are constants that adjust the degree of reflection of each term. Equation (7) is designed to take the form exp(−x); therefore, a lower value of each term provides a higher return value.
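The reward-weighted update at the core of PoWER-style algorithms can be sketched as follows; this is a simplified, assumed form (scalar returns, a single weight vector) rather than the authors' iPoWER implementation:

```python
# A minimal sketch of a PoWER-style reward-weighted update of DMP
# weights; assumed simplification of iPoWER with scalar returns.
import numpy as np

def power_update(omega, sigma, rollout_fn, n_rollouts=10, n_best=5):
    """One update of DMP weights `omega`; rollout_fn(omega_sampled)
    executes the policy and returns its return, e.g., summed rewards
    r(t) as in Eq. (7)."""
    samples = []
    for _ in range(n_rollouts):
        eps = np.random.normal(0.0, sigma, size=omega.shape)  # exploration
        samples.append((rollout_fn(omega + eps), eps))
    # Importance sampling: keep the best rollouts, weight by return.
    samples.sort(key=lambda s: s[0], reverse=True)
    returns = np.array([r for r, _ in samples[:n_best]])
    noise = np.stack([e for _, e in samples[:n_best]])
    # Reward-weighted average of the exploration noise.
    return omega + (returns[:, None] * noise).sum(0) / (returns.sum() + 1e-9)
```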

C. Case Study: Set-Top-Box Assembly Using Skills
To show how the execution of pretrained skill-based primitive tasks helps cope with the uncertainties and variations present in a real assembly environment, we present a case study using a set-top-box assembly currently performed at a local manufacturing company.
Here, the pretrained robot skills associated with unit assembly operations play a key role. We pretrain the robot to carry out the following eight unit assembly operations with skills: 1) "grasping-set-top-box (GS)," using the skill of grasping the set-top box after estimating its position and posture; 2) "inserting-set-top-box (IS)," using the skill of inserting the set-top box into a fixed jig; 3) "grasping-HDMI-cable-connector (GH)," using the skill of grasping the HDMI cable connector after estimating its position and posture; 4) "regrasping-HDMI-cable-connector (RH)," using the skill of regrasping the HDMI cable connector so that the F/T values can be measured during the insertion motion; 5) "inserting-HDMI-cable-connector (IH)," using the skill of inserting the HDMI cable connector into the hole of the set-top box; 6) "grasping-power-cable-connector (GP)," using the skill of grasping the power cable connector after estimating its position and posture; 7) "regrasping-power-cable-connector (RP)," using the skill of regrasping the power cable connector so that the F/T values can be measured during the insertion maneuver; and 8) "inserting-power-cable-connector (IP)," using the skill of inserting the power cable connector into the set-top box. Fig. 23 illustrates the eight robot skills introduced above, while Table 2 shows how each skill is implemented, either by DCNN or by DMP. A skill is implemented by a DCNN when the degree of uncertainty associated with the goal pose or the target condition is high and by a DMP when it is lower.
Note that skills 1), 3), 5), 6), and 8) are represented by DCNNs trained through the supervised learning process described in Section V. In contrast, skills 2), 4), and 7) are represented by DMPs. This is because 2), 4), and 7) are used in situations that involve little change in the position and posture of the objects they deal with and can be performed in various environments. For 1), 3), and 6), it is necessary for the robot to consider the relative position and orientation between the robot and the target object. In contrast, for 5) and 8), it is important for the robot to generate appropriate motion trajectories that account for the relative force and torque between the female and male objects. Therefore, we use DCNNs for 1), 3), 5), 6), and 8) to connect motion generation to the perception of object poses and interaction forces.

Fig. 23. Illustrations of eight skills in the set-top-box assembly task: (a) GS, (b) IS, (c) GH, (d) RH, (e) IH, (f) GP, (g) RP, and (h) IP.
We set up the experimental testbed for the set-top-box assembly with the support of the pretrained robot skills for the unit robot assembly operations, as illustrated in Fig. 24. The testbed is equipped with UR3 and UR5 robotic arms from Universal Robots, Denmark, two FT300 F/T sensors and two- and three-finger grippers from Robotiq, Canada, and two cameras from Intel, USA. Note that the positions and orientations of the set-top box, HDMI cable connector, and power cable connector change randomly across the experiments. The robot can localize the objects (connectors and holes) using the camera on its wrist. The clearance between the jig and the set-top box is approximately 300 μm, and the clearance between the cable connectors and the holes of the set-top box is approximately 10 μm.
To calculate the expected return values for the robot skills, the following reward function, r_g, is assigned to skills 1), 3), and 6):

r_g(t) = exp(−I_d(t))    (8)

where I_d(t) indicates the dissimilarity between the target image I_g and the current image I(t), e.g., the difference between the images of the target and current states. The reward value increases as the current image becomes more similar to the target image. On the other hand, the reward function r_i for skills 5) and 8) is defined as

r_i(t) = exp(−[αF(t) + βM(t) + γP(t) + δI(t)])    (9)

where F, M, and P indicate, respectively, the force, moment, and distance components of the robot. In particular, P represents the deviation from the reference axis of the tool coordinate system. F, M, and P are calculated using absolute error equations of the following form, illustrated for F only:

F(t) = |F^g − F(t)|    (10)

where the superscript g indicates the target value. The last term, I(t), represents the dissimilarity between the target and the current images, as in (8), while α, β, γ, and δ are constants that adjust the contribution of the individual F, M, P, and I values to the reward function r_i.

Fig. 24. Experimental setup of the set-top-box assembly task.

First, the DMP and the target image I_g are extracted from a human demonstration, serving, respectively, as the supervisory exemplar and as the basis for computing the return value in RL. Since the target image is obtained from the camera attached to the robot's wrist, which defines the relative position and orientation between the robot and the target object, the absolute pose of the robot is irrelevant to the skill description. For the "inserting" skill, the human demonstration defines the target force F^g_{x,y,z}, torque M^g_{x,y,z}, and insertion depth P^g_z. The motion trajectories of the robot were extracted at 50 Hz using the kinesthetic teaching method; training data were then acquired through self-reproduction.
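For illustration, the insertion reward of (9) and (10) might be computed as in the sketch below; the weights and the image-dissimilarity input are assumptions:

```python
# A minimal sketch of the insertion-skill reward r_i from Eq. (9),
# assuming F, M, P are absolute errors against demonstrated targets
# and img_dissim is an image-dissimilarity score; weights are assumed.
import numpy as np

def insertion_reward(f, m, p, img_dissim, f_g, m_g, p_g,
                     alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Reward for the inserting skills (IH, IP)."""
    F = np.abs(f_g - f).sum()        # force error, as in Eq. (10)
    M = np.abs(m_g - m).sum()        # moment error
    P = np.abs(p_g - p)              # insertion-depth deviation
    return np.exp(-(alpha * F + beta * M + gamma * P + delta * img_dissim))
```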
To perform the "set-top-box-assembly" task, eight DMPs are modeled as the supervisors that train the DCNNs using the generated training data. Individual DMPs improve by themselves, as do the DCNNs, through the processes of self-exploration, self-reproduction, and self-improvement based on Algorithm 2 and the reward functions shown in (8) and (9). The three "grasp" and three "insert" type skills undergo this process of self-improvement of the supervisors and DCNNs through 5,000-10,000 repetitions and 200-300 repetitions, respectively. We observe that the return values increase, while the execution time is reduced, as the number of iterations increases. Fig. 25 shows the performance of iPoWER in comparison with the original PoWER algorithm. The iPoWER algorithm is better at generating optimal paths and at reducing the number of execution time steps, as shown in Fig. 25(a) and (b).
Having pretrained the robot on several skills, we next carry out experiments to evaluate the performance of the learned skills. The first experiment evaluates the grasping skill when it is used to pick up three different objects, the "set-top box," "HDMI-cable-connector," and "power-cable-connector," which are randomly placed on a worktable. Specifically, we test whether the DCNNs trained on the grasping skill can generate appropriate control signals to pick up the objects using images of the scene as input. The experiment results in 98 successes out of 100 trials. The two failures occur when the cables are slightly tilted; they occur because we exclude from the learning process the flipping or standing motion necessary to grasp tilted objects.

The second experiment deals with the insertion skills and uses the same three objects as the first experiment. Two UR robots insert the "set-top box" into the jig and the "cables" into the "set-top box," with respective clearances of 300 and 10 μm. The UR robots complete the insertion process using the DCNNs, based on the input images and the reaction force/torque. We evaluate whether the DCNNs can generate appropriate force/torque and pose control signals based on the images and the robot configurations given as input. As a result, we achieve 97 successes out of 100 trials. The three failures occur when the connectors slip off the edge of the hole on the "set-top box."

Finally, we evaluate the performance of the task planning and DCNN skill arrangement based on the PDDL engine. We carry out four experimental cases, illustrated, respectively, by Fig. 26(a)-(d), as follows. (a) The "set-top box," "HDMI-cable-connector," and "power-cable-connector" are randomly placed on the floor. (b) The "set-top box" is inserted into the "jig," while the "HDMI-cable-connector" and "power-cable-connector" are randomly placed on the floor; the "set-top box" is already in its target state. (c) The "set-top box" and "HDMI-cable-connector" are inserted into the "jig" and the "set-top box," respectively; two target objects are in their target states, with the exception of the "power-cable-connector." (d) The "set-top box" and "power-cable-connector" are inserted into the "jig" and the "set-top box," respectively; two target objects are in their target states, with the exception of the "HDMI-cable-connector." In case (a), the URs perform the "set-top-box-assembly" task according to their nominal plan. In cases (b)-(d), the URs perform only the unit tasks whose objects are not yet in their target states, based on the proposed task planning and DCNN arrangement method.

The experiments described in Sections IV and V successfully validate the proposed automated assembly system for its applicability to power breaker assembly and set-top-box assembly in real-world manufacturing settings. Although successfully validated, the current prototype system is by no means without limitations and failures. In general, the success of the proposed system depends on the novelty of the assembly environment, that is, the deviation from what is learned, and on the capability of the system to cope with that novelty through sensor-based workspace modeling, robotic task replanning, and pretrained robot skills.
For instance, we occasionally observed the following failure modes during the experiments: 1) the robot fails to reach the target because obstacles hinder it from reaching the target position or because the vision system fails to identify the 3-D pose of the target object and 2) the robot fails to insert a screw into a hole because it either fails to grasp or incorrectly grasps the screw, or because its sensing and control errors are excessive relative to the insertion tolerances. By further extending the capability of replanning and the robot skills, we plan to strengthen recovery from failures caused by excessive variations and uncertainties.

VI. CONCLUSION
In this article, we presented an automated robotic assembly system built upon an autoprogramming environment that can reduce the setup time and cost of reconfiguring and reprogramming robots when robot tasks must be reassigned frequently, as in smart manufacturing plants. A three-part approach was implemented: learning by observation, robotic embodiment with action planning and simulated retargeting, and execution with pretrained skills. The approach was shown to be effective and viable through implementation and experimentation. We demonstrated that DL-based real-time recognition of human assembly action sequences and grasping types allows the robot to effectively learn a given assembly task from observation. Furthermore, we showed that PDDL-based robot action planning from the learned human assembly, together with simulation-based verification and retargeting fed back into action planning, represents an effective means of robotic embodiment. In particular, for task execution, we verified that pretraining robotic skills through DL and RL is crucial for the robot to adapt to the uncertainties and variations that are often seen in the assembly process and that would be difficult to handle otherwise. We successfully validated the proposed system by developing a prototype and applying it to two real-world manufacturing scenarios, power breaker assembly and set-top-box assembly, using commercially available robots. In addition, we showed how recent advancements in DL and RL can impact the next generation of automated assembly for the smart manufacturing of the future.
In the future, we plan to continue improving this automated assembly system, especially its ability to deal with unexpected failures that may happen during assembly. We are also interested in applying the proposed system to smart workbench-based human-machine collaboration, in which the system determines the method and order of robot operations in collaboration with human tasks while analyzing human work behaviors and methods, so as to provide guidance that improves productivity and safety. Moreover, the capacity to understand the ways in which humans work may be applicable to the development of an autonomous system that allows proficiency- or skill-based optimal task assignment to workers when planning production for smart manufacturing.