Reinforcement Learning-Based Path Generation Using Sequential Pattern Reduction and Self-Directed Curriculum Learning

Recent advancements in robots and deep learning have led to active research in human-robot interaction. However, non-physical interaction using visual devices such as laser pointers has gained less attention than physical interaction using complex robots such as humanoids. Such vision-based interaction has high potential for use in recent human-robot collaboration environments such as assembly guidance, even with a minimum amount of configuration. In this paper, we introduce a simple robotic laser pointer device that follows an arbitrary planar path and is designed to be a visual instructional aid. We also propose an image-based automatic path generation method using reinforcement learning and a sequential pattern reduction technique. However, such vision-based human-robot interaction is generally performed in a dynamic environment, and it can frequently be necessary to calibrate the devices more than once. In this paper, we avoid the need for this re-calibration process through episodic randomization learning and improved learning efficiency. In particular, contrary to previous approaches, the agent controls the curriculum difficulty in a self-directed manner to determine the optimal curriculum. To our knowledge, this is the first study of curriculum learning that incorporates an explicit learning environment control signal initiated by the agent itself. Through quantitative and qualitative analyses, we show that the proposed self-directed curriculum learning method outperforms ordinary episodic randomization and curriculum learning. We hope that the proposed method can be extended to a general reinforcement learning framework.


I. INTRODUCTION
A. BACKGROUND
With the advancements in machine learning, research on deep learning-based human-robot interaction has attracted much attention recently [1]-[3]. Humans can interact with robots in various forms, and these methods can be categorized into physical interactions and non-physical (untact) interactions. For physical interactions (e.g., hand shaking, high fives, and manipulating objects), complex and high-cost robots such as humanoid robots are required. In contrast, non-physical interaction requires only a relatively simple and low-cost robot, such as a device that provides visual information using a beam projector [4]-[6] or an artificial intelligence speaker [7], [8]. In particular, unlike the recently commercialized artificial intelligence speaker, a visual-based interactive robot such as a beam projector-based robot is expected to be highly useful in the future and should be studied and developed.
The associate editor coordinating the review of this manuscript and approving it for publication was Joanna Kołodziej.
Such a visual-based interactive robot can be configured at a relatively lower cost and with a simpler structure than a humanoid robot or articulated manipulator. For human-robot visual interaction in three-dimensional (3D) space, the robot should have at least three degrees of freedom, and its simplest incarnation is a robotic laser pointer (RLP) device that consists of pan-tilt joints and a laser pointer. Although a laser pointer cannot provide high-dimensional visual information like a beam projector, effective provision of visual information and interaction with humans is possible when it is equipped with motorized laser control [9]. For example, the RLP can provide visual information such as a warning signal or emotional expressions to users by generating repetitive motor trajectories for specific patterns. Moreover, when combined with a camera sensor, it can be utilized in guidance for assembly, path planning, and object finding applications. For these applications, it is essential to equip the robot with image-based path generation skills, which generate laser trajectories from target patterns such as the contour of a target object.
In previous approaches, image-based path generation basically requires complex manual processes such as image processing, curve fitting [10], path planning [11], and calibration [12]. In a dynamic environment in particular, frequent recalibration is expected because the positions and orientations of the robot, camera sensor, and target objects are more likely to change. Therefore, there is a high need for a learning-based method so that the robot can perform path generation with much less effort than would be needed for manual programming in an uncalibrated environment.
In this paper, we propose an image-based path generation method that is able to generate paths in an uncalibrated environment using reinforcement learning. The path generation process can be formulated as a sequential visiting problem for each region of a given pattern, where the visitor (agent) and the path are regarded as the pixels of laser point and the pattern region respectively. We modeled this problem as a sequential pattern reduction (SPR) problem in which the laser pointer sequentially reduces the pattern (path). In addition, we designed a Markov decision process (MDP) that includes an SPR reward function to employ RL in our learning system.
In the image-based path generation problem, the agent should be able to generate paths even in an environment that is not calibrated. To this end, we applied an episodic randomization learning approach that randomizes the environment parameters in every episode [13], [14]. However, this method converges slowly because difficult tasks are given from the initial state. Moreover, local minima can be a problem. To solve these problems, this paper proposes a self-directed curriculum (SDC) based learning method. Through the proposed SDC and reward function, the agent is able to choose an appropriate level of task difficulty for itself considering its capability, and it learns the target task effectively during training. We experimentally show that the proposed SDC outperforms other existing learning methods.

B. RELATED WORK
There have been many attempts to automatically generate paths from the visual data of target patterns. In robotic path generation based on 3D point cloud data [15]-[19], the target object's shape is scanned by a 3D scanner, the data are converted to point cloud data, and these data are used to generate the object's surface. The surface is used to generate tool paths in automatic computer numerical control (CNC) machining [15]. These methods require precise 3D sensors and reverse engineering, which transforms the point cloud data into a CAD model. Automatic robot path generation from 2D images, which can be thought of as image-based visual servoing [20], has been studied intensively. Pachidis and Lygouras [21] studied a path generation method for robotic arc welding from 2D images. Given a stereo-vision image, the corresponding line segments are obtained by image processing and a correspondence algorithm. Then, from the detected edges, robot paths are generated by the path point calculation algorithm. Aritos et al. [22] proposed a robot path generation method from lines on a flat planar surface image using image processing and robot-camera calibration. Chang et al. [10] proposed an image-based motion planning method using Pythagorean-hodograph (PH) splines. Using this approach, they were able to follow the contours of an object with proper acceleration/deceleration using a PH quintic spline interpolator in an eye-to-hand camera structure [23].
Recently, there have been attempts to reduce the number of calculations needed for curve fitting and path generation during runtime based on a learning approach. Li et al. [24] utilized RL to generate a fast and smooth tool path for a target pattern in a calibrated environment for CNC applications. Using deep RL, they replaced most steps of the previous pre-planning smoothing method with a neural network-based method and obtained an algorithm involving fewer calculations than the usual solution. Jing et al. [25] proposed a computational framework for robot path generation for surface/shape inspection applications. They used an RL-based tree search algorithm to efficiently generate online paths based on the proposed MDP formulation to solve the coverage planning problem [11]. As in the research cases mentioned above, RL has shown remarkable performance in various domains, including computer games [26]-[28] and graphics [29], [30]. In the robotics field in particular, RL is used to learn visuomotor skills [31], [32], path planning [33], navigation [34], [35], and human-robot interaction [36]. In this paper, we also used RL as our main learning method.
The aforementioned methods typically adopt the following set of steps: point data extraction from the image, curve generation using curve fitting methods (using, e.g., B-splines [37] or Bezier curves [38]), and then path planning for the curves. This procedure has become the general approach of applications that require high precision such as CNC and computer aided manufacturing [39], [40]. Moreover, in the learning-based methods, deep learning is used as a subprocess of the main algorithm. However, because our aim is to develop an end-to-end learning-based method in the context of human-robot interaction while minimizing the effort needed for manual programming in uncalibrated environments, an evaluation of the precision is outside the scope of this paper.

C. CONTRIBUTIONS
In this paper, we propose a deep RL-based robotic path generation method that is efficient at learning and does not require mathematical modeling for the target patterns or calibration. Our main contributions can be summarized as follows:
• For given target patterns, we propose a deep RL-based SPR method for automatic path generation without any prior knowledge of mathematical and geometric models.
• We propose a system architecture characterized by a novel reward function and distributed learning framework for fast and effective learning.
• For learning unknown patterns in uncalibrated environments, we propose an SDC learning method that automatically controls the level of difficulty of the task considering the current capability of its own policy. We experimentally verified the superiority of the SDC learning method.
The remainder of the paper is organized as follows. Preliminaries for our algorithm are presented in Section II. In Section III, we describe the details of the proposed method, including the SPR algorithm, MDP design, distributed learning environment, and SDC learning. Experiments and an ablation study are presented in Sections IV and V, respectively, along with their results, and finally the paper is concluded in Section VI.

FIGURE 1. Kinematic diagram of the RLP. The kinematics of the RLP are described by two revolute joints and a prismatic joint, where $\theta_1$ and $\theta_2$ correspond to pan and tilt and $d_3$ is the laser pointer distance.

VOLUME 8, 2020

II. PRELIMINARIES

A. ROBOTIC LASER POINTER
We designed a simple RLP that consists of two revolute joints (for pan and tilt) and a laser pointer (see Figs. 1 and 2). The laser pointer's on/off action and the pan-tilt motors are controlled by an embedded processor board, and the main PC controls the device through serial communication. To control the laser pointer in Cartesian space, we derived the analytic inverse kinematics of the RLP using trigonometric functions based on its kinematic structure. The kinematics are described by the Denavit-Hartenberg parameters [41] of Table 1 and Fig. 1. In the inverse kinematics of (1), $x^d$ represents the desired position of the laser pointer in Cartesian space and $o^{tilt}_w$ is the origin of the tilt joint. Moreover, $R_z$ and $R_x$ are 3 × 3 rotation matrices, and their inverses are multiplied by the desired position to express that position in robot coordinates. We set the origin of the pan coordinate system to the origin of the world coordinates, and the pan and tilt joint angles are calculated by an arc tangent function.
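The arc-tangent step can be illustrated with a minimal sketch. It assumes a simplified geometry that is not taken from Table 1: the pan axis coincides with the world z-axis, the tilt joint sits at a hypothetical height `tilt_height` on that axis, and the laser leaves the tilt origin along the pointing direction. The function names and the default offset are illustrative only.

```python
import math

def pan_tilt_ik(target, tilt_height=0.1):
    """Return (pan, tilt, distance) that aims the laser at `target`.

    Assumed geometry (hypothetical, not the paper's D-H table):
    pan rotates about world z, tilt origin is at (0, 0, tilt_height).
    """
    x, y, z = target
    dz = z - tilt_height                   # target height relative to the tilt origin
    horizontal = math.hypot(x, y)          # distance in the pan plane
    pan = math.atan2(y, x)                 # theta_1
    tilt = math.atan2(dz, horizontal)      # theta_2
    distance = math.hypot(horizontal, dz)  # d_3, laser travel length
    return pan, tilt, distance

def pan_tilt_fk(pan, tilt, distance, tilt_height=0.1):
    """Forward kinematics for the same assumed geometry (used as a check)."""
    horizontal = distance * math.cos(tilt)
    return (horizontal * math.cos(pan),
            horizontal * math.sin(pan),
            tilt_height + distance * math.sin(tilt))
```

Feeding the output of `pan_tilt_ik` back into `pan_tilt_fk` should reproduce the target point, which is a convenient sanity check for any analytic inverse kinematics.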

B. REINFORCEMENT LEARNING
In this paper, the robot agent generates paths only from the observed target pattern images. Therefore, we model the robotic pattern generation problem as a partially observable MDP [42], [43] with a tuple $M = \{\mathcal{S}, \mathcal{O}, \mathcal{A}, T, r, \gamma, S\}$, whose elements represent the state space, the partial observation space, the action space, the state transition probability $T(s_{t+1} \mid s_t, a_t)$, the reward function $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, the discount factor $\gamma \in (0, 1]$, and the initial state distribution $S$, respectively. The agent learns a deterministic rule $\pi: \mathcal{O} \rightarrow \mathcal{A}$ that maximizes the expected discounted return $R_0$ from the initial state over a finite horizon:

$$\pi^{*} = \arg\max_{\pi} \, \mathbb{E}_{s_0 \sim S}\left[ R_0 \mid s_0 \right]$$

The return $R_t$ at time $t$ is defined by the sum of discounted rewards up to the horizon $T$:

$$R_t = \sum_{i=t}^{T} \gamma^{\,i-t} \, r(s_i, a_i)$$

where $r(s_i, a_i)$ returns a reward when the agent performs an action $a_i$ at state $s_i$. In recent RL-based studies, the actor-critic network [44] has become a popular method for building the agent's policy because its performance is more stable and better than when using the actor policy alone. Based on this, we previously introduced an asymmetric actor-critic network [45] to our policy for Q-function-based policy evaluation. The critic network $Q_\zeta$ estimates the Q-function, which describes the expected return from action $a_t$ at state $s_t$ as follows:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ R_t \mid s_t, a_t \right]$$

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ r(s_t, a_t) + \gamma \, Q^{\pi}(s_{t+1}, a_{t+1}) \right] \quad (6)$$

where (6) is the Bellman equation, which represents the expected action-value (Q-value) of the current state $s_t$ and action $a_t$ estimated by the discounted Q-function of the next state $s_{t+1}$ and action $a_{t+1}$. The input to the critic is a feature vector consisting of the current observation and state with the current action generated by the actor $\pi_\omega$. The experiences of the agent are gathered from the simulator and stored as a set of tuples $(o_t, s_t, a_t, q_t, r_t)$ in the rollout memory. Each element of the tuple represents the current observation, state, action, Q-value, and reward, which are described in a later section.
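As a small illustration of the discounted return and the one-step Bellman target described above, the following sketch computes both for a finite episode. The function names are hypothetical, and dropping the bootstrap term at the end of an episode is an assumption of this sketch rather than a detail stated in the text.

```python
def discounted_returns(rewards, gamma=0.99):
    """R_t = sum over i >= t of gamma^(i - t) * r_i, for every t of a finite episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def bellman_target(reward, next_q, gamma=0.99, done=False):
    """One-step target r + gamma * Q(s', a'); the bootstrap is dropped when done."""
    return reward + (0.0 if done else gamma * next_q)
```

For example, `discounted_returns([1, 1, 1], gamma=0.5)` gives `[1.75, 1.5, 1.0]`.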

C. PROXIMAL POLICY OPTIMIZATION USING THE Q-FUNCTION
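In recent RL-based studies, proximal policy optimization (PPO) [46] has shown superior performance, especially in robotic tasks [47], [48] and character animation [29]. Because our agent is a kind of robotic device, we also trained it using the PPO algorithm. PPO optimizes the actor-critic policy network based on the following conservative policy iteration objective $L^{CPI}$, clipped surrogate objective $L^{CLIP}$, and squared Bellman error loss $L^{QF}$:

$$L^{CPI}(\omega) = \hat{\mathbb{E}}_t\left[ \phi_t(\omega) \, \hat{A}_t \right], \qquad \phi_t(\omega) = \frac{\pi_{\omega}(a_t \mid o_t)}{\pi_{\omega_{old}}(a_t \mid o_t)}$$

$$L^{CLIP}(\omega) = \hat{\mathbb{E}}_t\left[ \min\left( \phi_t(\omega) \, \hat{A}_t, \; \mathrm{clip}\left(\phi_t(\omega), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t \right) \right]$$

$$L^{QF}(\zeta) = \hat{\mathbb{E}}_t\left[ \left( Q_\zeta(s_t, a_t) - \left( r(s_t, a_t) + \gamma \, Q_\zeta(s_{t+1}, a_{t+1}) \right) \right)^2 \right]$$

where $\phi_t(\omega)$ represents the ratio between the current and previous policy's action probability given an observation, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clipping parameter of [46]. Instead of the value function error used in [46], we define the Q-function error loss for evaluation by the critic network during training [45]. The final objective is defined by the weighted sum

$$L(\omega, \zeta) = \hat{\mathbb{E}}_t\left[ L^{CLIP}(\omega) - c_1 L^{QF}(\zeta) + c_2 \, S[\pi_\omega](o_t) \right]$$

where $S$ denotes an entropy bonus term, and $c_1$ and $c_2$ are weights for $L^{QF}$ and $S$, which are set to 1.0 and 0.01, respectively, in this paper.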
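To make the combination of the clipped surrogate, the Q-function loss, and the entropy bonus concrete, the sketch below evaluates them on a small batch with plain Python lists. It follows the standard PPO form of [46]; the clipping value `clip_eps = 0.2` is an assumption (it is not given in the text), while `c1 = 1.0` and `c2 = 0.01` are the weights stated above. The function name and argument layout are hypothetical.

```python
import math

def ppo_losses(logp_new, logp_old, advantages, q_pred, q_target,
               entropy, clip_eps=0.2, c1=1.0, c2=0.01):
    """Return (L_CLIP, L_QF, objective) averaged over a batch of samples."""
    n = len(advantages)
    clip_terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|o) / pi_old(a|o)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two surrogate terms.
        clip_terms.append(min(ratio * adv, clipped * adv))
    l_clip = sum(clip_terms) / n
    # Squared error between the critic output and its Bellman target.
    l_qf = sum((p - t) ** 2 for p, t in zip(q_pred, q_target)) / n
    mean_entropy = sum(entropy) / n
    objective = l_clip - c1 * l_qf + c2 * mean_entropy  # quantity to maximize
    return l_clip, l_qf, objective
```

In a gradient-based implementation the negated objective would be minimized; the sketch only shows how the three terms are weighted.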

D. RANDOM ENVIRONMENTS AND CURRICULUMS
The episodically randomized environment (ERE) is widely used in RL to handle task variance [13], [14]. We also adopted ERE in our solution to eliminate the need for re-calibration in an unknown environment. However, the ERE makes policy convergence difficult and slow because challenging tasks are given from the early stages of learning and rewards are sparse.
To solve this issue, curriculum learning has been used in many papers. However, most previous work manually determines the curriculum without considering the maturity of policy learning [49]-[52]. While some recent studies consider the policy capability [53]-[56], the update rules are still determined manually or based on temporal difference error [55].
Our proposed SDC learning, in contrast, focuses on self-directedness: the policy actively controls the curriculum difficulty by itself; it is not controlled by a human. In particular, our method is distinguished from others by the incorporation of an explicit control parameter τ for curriculum difficulty. To our knowledge, this is the first study of curriculum learning equipped with a self-directed control signal from an agent to the learning environment. Details are described in Section III-D. In our experiments, we compare four methods characterized by their handling of random environments and curriculum. Table 2 lists the properties of each learning method. Episodically invariant environment (EIE) learning is based on a fixed learning environment without randomness or a curriculum. ERE learning episodically randomizes the environment without a curriculum. Linearly increasing curriculum (LIC) learning has both randomness and a curriculum, but the level of randomness (curriculum) is linearly increased. In SDC, randomized learning is performed with a curriculum that is determined by the policy itself. The criteria for self-curriculum determination are described in Section III-D.
For the random environments listed in Table 2, we randomly reset the positions and orientations of the target plane, camera, target pattern texture, and initial position of the laser point with respect to their initial state during training. For the target pattern texture, only the x- and y-directional position $(x_p, y_p)$ and orientation with respect to the z-axis $(z_p)$ in the local coordinate of the target plane are considered in the randomized reset (see Fig. 4). The randomization is triggered at the beginning of every episode and the range of values for each object is described in Table 3. We also randomized the colors and pattern geometries of the background plane with a Perlin noise texture [57].

III. METHOD
In this section, we describe the overall robotic path generation method. Beginning with the target pattern generation, we describe the SPR, MDP design including network architecture, distributed learning framework, and SDC learning.

A. TARGET PATTERN GENERATION
To learn a robust path generation skill, various training datasets are required. We prepared 15 classes of patterns (Fig. 3) using custom equations and the superformula [58], in which $m$, $n_1$, and $n_2 = n_3$ are the parameters used to form a particular pattern and $f(\varphi)$ is represented as in [58]; the parameter details are given in the figure's caption.
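As a rough illustration of how superformula patterns can be sampled, the sketch below uses the commonly cited Gielis form of the superformula. It is not taken from the paper: the scale parameters `a = b = 1`, the omission of the modifier `f(φ)` (treated as 1), and all function names are assumptions of this sketch.

```python
import math

def superformula_radius(phi, m, n1, n2, n3, a=1.0, b=1.0):
    """Radius of the Gielis superformula at angle phi (a = b = 1 assumed)."""
    c = abs(math.cos(m * phi / 4.0) / a) ** n2
    s = abs(math.sin(m * phi / 4.0) / b) ** n3
    return (c + s) ** (-1.0 / n1)

def superformula_points(m, n1, n2, n3, samples=360):
    """Sample the closed curve as (x, y) points in order of increasing angle."""
    points = []
    for k in range(samples):
        phi = 2.0 * math.pi * k / samples
        r = superformula_radius(phi, m, n1, n2, n3)
        points.append((r * math.cos(phi), r * math.sin(phi)))
    return points
```

With `m = 4` and `n1 = n2 = n3 = 2` the curve reduces to the unit circle, which makes a simple check; other parameter choices produce star-like and polygon-like contours that could then be rasterized into pattern textures.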

B. SPR
As the name suggests, SPR is a method to reduce the target patterns sequentially so that all pattern pixels are completely removed and the corresponding full robotic paths can be generated from the given images. SPR consists of pattern reduction and repaint (PRR) and a reward function-based rollout process.

1) PRR
In the initial state, the agent observes an initial target pattern image $I_t$ at time $t$. Then, it performs an action and observes the next image frame $I_{t+1}$. At this stage, there are two important steps in PRR that the agent should perform, which we call pattern reduction and repaint, respectively (Algorithm 1 and Fig. 6). In the pattern reduction, the target pattern region that overlaps the region of the laser point is removed from image $I_t$. To determine the location of the laser point on the target plane, we first create a binary mask image $B^{plane}_t$ of the target plane using image processing techniques [59] such as Gaussian blurring, thresholding, and morphological operations (lines 1 to 4). The region of the laser pointer is represented by a binary mask image $B^{laser}_t$ and detected by color slicing [59], where the aim of this technique is to obtain a binary image of the target color distribution based on color thresholding.

FIGURE 5. Overall system architecture including the actor-critic based SPR policy network. It consists of an Actor-CNN (convolutional neural network), Actor-LSTM (long short-term memory), and Critic-LSTM, for encoding the image, performing an action, and evaluating the actor network, respectively. The agent observes the pattern images from the simulator and sequentially reduces the pattern using the proposed algorithm. Based on the processed observation, the agent performs the next action. In SDC, the frequency at which each element of action $\hat{a}_t = \{\partial x^d_t, \tau_t\}$ is utilized differs. The positional action $\partial x^d_t$ is applied to the environment at full frequency whereas the curriculum control action $\tau_t$ is used at an episodic frequency to determine the degree of randomization.

Algorithm 1 PRR
The pattern-reduced image $I^{-}_t$ is then obtained by making the pixel colors of the region in $I^{color}_t$ that overlaps with $B^{laser}_t$ equal to the background color (lines 5 to 8). The next step is repainting, which paints the laser point region of the next observation $I_{t+1}$ on the pattern-reduced image $I^{-}_t$. After repeating lines 5 to 7 on $I^{color}_{t+1}$ to acquire the masked laser point image at the next time step, we use it to repaint the laser point on $I^{-}_t$ (lines 9 and 10). The final repainted image is the pattern-reduced and repainted image $I^{reduce}_{t+1}$ at time $t + 1$.
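The reduction and repaint steps can be sketched on tiny images stored as nested lists of RGB tuples. This is a simplified illustration, not Algorithm 1 itself: the target-plane mask (lines 1 to 4) is omitted, the colors `BACKGROUND` and `LASER` and the tolerance `tol` are hypothetical, and color slicing is reduced to a per-channel threshold.

```python
BACKGROUND = (255, 255, 255)  # hypothetical background color
LASER = (255, 0, 0)           # hypothetical laser point color

def color_slice(image, target, tol=30):
    """Binary mask of pixels whose RGB lies within `tol` of `target` per channel."""
    return [[all(abs(c - t) <= tol for c, t in zip(px, target)) for px in row]
            for row in image]

def prr(image_t, image_t1):
    """Pattern reduction on image_t, then repaint of the laser point from image_t1."""
    mask_t = color_slice(image_t, LASER)
    # Reduction: everything under the current laser point becomes background.
    reduced = [[BACKGROUND if m else px for px, m in zip(row, mrow)]
               for row, mrow in zip(image_t, mask_t)]
    mask_t1 = color_slice(image_t1, LASER)
    # Repaint: draw the laser point of the next frame on the reduced image.
    return [[LASER if m else px for px, m in zip(row, mrow)]
            for row, mrow in zip(reduced, mask_t1)]
```

In a sequential rollout, the returned image would serve as the reduced observation for the following step, so that regions already visited stay removed.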

2) OVERALL ALGORITHM FOR SPR
From the previous step, we obtain the pattern-reduced and repainted image $I^{reduce}_{t+1}$. The SPR reward is then calculated based on $I^{reduce}_{t+1}$, where the reward consists of pattern reduction, combo, and miss rewards. Details of the reward calculation are described in Section III-C. In this step, the agent collects rollout experiences using PRR and then calculates the reward. The overall SPR process is described in Algorithm 2.

We incorporated an early termination technique [29] into our learning to avoid the collection of very poorly performing experiences (e.g., when the agent points to regions outside the target plane). Such experiences hinder the robot from learning the optimal policy and, consequently, the policy can become trapped in local optima. Early termination was implemented as follows: If the agent points to a location inside the region of the target plane, a zero penalty $r^{-}_{t+1} = 0$ is given. In contrast, a negative penalty $r^{-}_{t+1} = -1$ is given and the environment is reset when the agent points to a location outside the target plane.
Whenever the agent succeeds in reducing the target patterns, the combo and miss count variables are increased and reset, respectively. If the agent fails, the reverse operation is conducted (lines 14 to 17 in Algorithm 2). If the miss count is over 40 or the episode ends, early termination is implemented so that the environment and all the variables are reset to the initial state (lines 19 to 21).
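The counter bookkeeping and the termination test described above can be summarized in a few lines. The sketch below is illustrative: the function names are hypothetical, the miss limit of 40 and the out-of-plane reset come from the text, and the episode length of 256 frames is the value reported in the experimental details.

```python
MISS_LIMIT = 40  # terminate when the miss count is over 40

def update_counters(reduced, combo, miss):
    """Success: combo increases and miss resets; failure: the reverse."""
    if reduced:
        return combo + 1, 0
    return 0, miss + 1

def should_terminate(miss, step, frames_per_episode=256, inside_plane=True):
    """True when the episode must be reset to the initial state."""
    return (not inside_plane) or miss > MISS_LIMIT or step >= frames_per_episode
```

For instance, `update_counters(True, 3, 5)` returns `(4, 0)`, whereas `update_counters(False, 3, 5)` returns `(0, 6)`.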

3) SYNCHRONOUS DISTRIBUTED LEARNING FRAMEWORK
Recent RL studies train their policy network using a distributed learning environment [60]-[62] to reduce the sample correlations and learning time. Our learning framework is based on a distributed system that is similar to the A3C algorithm [62]. However, ours is a synchronous distributed system, in which the learning framework simultaneously collects the experience data from N distributed processes and updates the policy using the merged data, whereas A3C operates asynchronously. The authors of A3C claim that the asynchronous operation improves performance, but Wu et al. [63] disagreed and presented experimental results supporting this counter-argument. Considering their claim and the simplicity of synchronous implementation, we built a synchronous distributed learning framework (see Fig. 7) that can be regarded as a synchronously distributed version of A2C [64] using a message passing interface (MPI) and the V-REP simulator [65]. During training, our framework creates multiple processes, each of which runs a V-REP simulator. The training data and network parameters of the agent in each process are shared via the MPI.

Algorithm 2
Set combo = 0, miss = 0
Execute action $a_t$ in simulator and obtain observation $o_{t+1}$, state $s_{t+1}$, penalty reward $r^{-}_{t+1}$, done $d_{t+1}$
9 Convert tensors $o_t$, $o_{t+1}$ to images $I_t$, $I_{t+1}$
10 PRR with $I_t$, $I_{t+1}$ to obtain $I^{reduce}_{t+1}$

In total, $T/M$ iterations are performed to update the policy. In each rollout of $M$ frames, the experience data are simultaneously acquired from the $N$ distributed processes for rapid data collection; thus, the number of rollout frames collected by each process per iteration is $M/N$. From the initial observation and state, the agent performs an action in the environment and obtains the resulting observation, state, reward, and episode-end flag $d_{t+1}$ at the next time step.
When the episode reaches the end or early termination is activated, the episode finish flag d t+1 is set to true and at this moment, the current target pattern is randomly changed to another one. The current and next observations are then converted to images and input to the PRR process (Algorithm 1) to obtain I reduce t+1 . We defined the reward function for the SPR method so that it accepts images corresponding to the current and next observations as well as additional variables such as combo and miss for learning sequential behavior generation.
After finishing $M/N$ rollout steps, the rollout data collected by the $N$ processes are merged at the root node $P_0$ and used to update the root policy. Then, the weight parameters of the root policy are broadcast to all child nodes $(P_1, P_2, \cdots, P_{N-1})$ and used to update their policies through a soft target update [66], in which β (= 0.05) is a balancing constant between the previous parameters, $\omega_{old}$ and $\zeta_{old}$, and the new parameters, $\omega_{new}$ and $\zeta_{new}$, of the actor and critic networks, respectively.
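A soft update of this kind blends the old and new parameters with the constant β. The sketch below assumes the common convention in which β weights the new parameters; whether the paper applies β to the new or to the old parameters is not recoverable from the text, so this direction is an assumption. Parameters are represented as flat lists of floats, and the function name is hypothetical.

```python
def soft_update(old_params, new_params, beta=0.05):
    """Blend parameters as beta * new + (1 - beta) * old (assumed convention)."""
    return [beta * new + (1.0 - beta) * old
            for old, new in zip(old_params, new_params)]
```

The same function would be applied separately to the actor parameters ω and the critic parameters ζ on each child node after the broadcast.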

C. MDP DESIGN 1) POLICY NETWORK
As shown in Fig. 5, the policy network consists of Actor-CNN $\pi_\omega$, Actor-LSTM $\pi_{\omega^\dagger}$, and Critic-LSTM $Q_\zeta$ networks, where $\omega$, $\omega^\dagger$, and $\zeta$ represent the weight parameters of each network, respectively. The Actor-CNN takes an image frame $o_t = I^{(3 \times 96 \times 96)}$ of the target pattern at time $t$ and encodes it to a feature vector $v^f_t = I^{(1 \times 4608)}$ by flattening the last convolved image tensor $I^{(32 \times 12 \times 12)}$. This feature vector is fed to the Actor-LSTM network and an action is sampled. The input of the Critic-LSTM is a concatenated vector $v^c_t = I^{(1 \times 4637)}$ consisting of the feature vector, the action $a_t = I^{(1 \times 4)}$, and the agent's state $s_t = I^{(1 \times 25)}$ obtained from the simulator. Then, it outputs an estimated Q-value $Q_\zeta(a_t \mid s_t, o_t)$. All networks use the ReLU activation function.

2) STATE AND ACTION
The state of the agent consists of the information of the agent and other related objects. Specifically, the 25-dimensional state s t = I (1×25) includes the angular values and angular velocities of the pan and tilt joints, the position and orientation of the RLP device with respect to world coordinates, the position and orientation of the camera and the target plane, and the position of the laser point. This state information is only used in the training phase.
Given an observation, the Actor-LSTM samples an action $a_t = \{\partial x^d_t, \tau_t\}$, where the first element represents the desired positional differences $\partial x^d_t = \{\partial x_t, \partial y_t, \partial z_t\}$ of the laser pointer in world coordinates. To calculate the desired position, the sampled action, which is the output from the actor based on a Gaussian distribution, is scaled and added to the current position of the laser pointer as in (16), which includes a random noise constant to ensure exploration. During the test phase, the agent performs a deterministic action by considering µ only. The scale factor $\eta = 2.5 \times 10^{-3}$ was introduced to stabilize the learning. We found that without a proper scaling factor, the learning curve becomes very unstable and performance may degrade because of the distribution of actions specified by the initial weight parameters of the policy network. The desired relative position of (16) is input to the inverse kinematics of (1) to obtain the desired angles, which are then input to the PID controller of the RLP in the simulator. The last element of the action is the randomization control parameter $\tau_t$, which is used in the proposed SDC learning method. We describe the details of $\tau_t$ in Section III-D.
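The scaling step can be illustrated with a short sketch. The exact form of (16) is not reproduced here; the sketch assumes that the actor outputs a mean `mu` and a standard deviation `sigma` per axis, that a Gaussian sample is drawn during training, and that only `mu` is used at test time. The value of `ETA` is the scale factor stated above, and the function name is hypothetical.

```python
import random

ETA = 2.5e-3  # scale factor eta from the text

def desired_position(current, mu, sigma=None, rng=random):
    """Next desired laser position: current + ETA * action.

    Training (sigma given): action ~ N(mu, sigma) per axis (assumed form).
    Test (sigma is None): deterministic action equal to mu.
    """
    if sigma is None:
        action = list(mu)
    else:
        action = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    return [x + ETA * a for x, a in zip(current, action)]
```

With η of this size, a unit action moves the desired position by only 2.5 × 10⁻³ in world coordinates per step, which keeps early, poorly initialized policies from throwing the laser point far off the target plane.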

3) REWARD
For RL-based path generation, the reward function should be able to determine the SPR skill of the agent. We defined the reward function in terms of three sub-rewards. The first is the pattern reduction reward $r_p$, which is proportional to the area of the pattern reduction obtained by the laser pointer. It is defined from the binary images $B$ as the ratio of the area of the reduced pattern to the area of the laser point, where the two binary images involved represent the laser point at time $t + 1$ and the pattern at time $t$, respectively. The second sub-reward is the combo reward $r_c$, which reflects the sequential reduction skill. If the agent consecutively succeeds in reducing a pattern, the combo count increases, and this in turn leads to a high combo reward, which is expressed as shown in (20). The last sub-reward is the miss reward $r_m$, which gives the agent a higher penalty if it fails to reduce the pattern (represented in (21)). All sub-rewards are normalized by an exponential function. The weighted sum of these three sub-rewards is the next reward $r_{t+1}$ (line 13 in Algorithm 2). The final reward is calculated by adding the penalty term $r^{-}_{t+1}$ to the reward, as described in line 13 of Algorithm 2.
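The structure of the reward can be sketched as follows. Only the area ratio for $r_p$ follows directly from the description; the exponential forms used here for the combo and miss rewards, the rate `k`, and the weights are hypothetical placeholders, since the exact expressions of (20) and (21) are not reproduced in this section. Masks are nested lists of 0/1 values.

```python
import math

def pattern_reduction_reward(laser_mask, pattern_mask):
    """Area of the pattern covered by the laser point divided by the laser area."""
    laser_area = sum(sum(row) for row in laser_mask)
    if laser_area == 0:
        return 0.0
    overlap = sum(1 for lrow, prow in zip(laser_mask, pattern_mask)
                  for l, p in zip(lrow, prow) if l and p)
    return overlap / laser_area

def spr_reward(laser_mask, pattern_mask, combo, miss, penalty=0.0,
               weights=(1.0, 0.5, 0.5), k=0.1):
    """Weighted sum of the three sub-rewards plus the penalty term.

    The exponential normalizations and the weights are assumptions.
    """
    r_p = pattern_reduction_reward(laser_mask, pattern_mask)
    r_c = 1.0 - math.exp(-k * combo)    # grows toward 1 with consecutive successes
    r_m = -(1.0 - math.exp(-k * miss))  # approaches -1 with consecutive misses
    w_p, w_c, w_m = weights
    return w_p * r_p + w_c * r_c + w_m * r_m + penalty
```

A call with the laser covering two pixels, one of which lies on the pattern, and with zero combo and miss counts yields 0.5 under these placeholder weights.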

D. SDC LEARNING
In SDC learning, a modified reward function is used to learn the curriculum control skill, where τ is the last element of the action $a_t = \{\partial x^d_t, \tau_t\}$ and controls the degree of randomization. The curriculum reward $r_c$ is normalized between 0 and 1 using an exponential function, and the agent can obtain a higher reward if τ in (24) is close to 1. This means that if the agent succeeds in reducing the pattern in a more challenging environment, a high reward is given to take into account its level of difficulty. Therefore, the agent should be able to effectively learn the SPR policy in a relatively larger dynamic environment by controlling the curriculum for itself (see Fig. 5).
The value of τ is rescaled to a value between 0 and 1 by the sigmoid function when it is input to the environment. If this value is set to 1, the agent requests a full randomization of Table 3 for the environment. In contrast, a value of 0 indicates an initial fixed environment without randomization. Each element of action a t is utilized at different frequencies, where the positional action ∂x d t is applied to the environment at the full frequency (every frame) whereas the randomization action τ t is used at an episodic frequency (only at the beginning of every episode) because a randomized environment setup should be maintained for the duration of at least one episode. In short, for efficient learning, the agent learns the curriculum control skill by the curriculum reward r c , which is a feedback from the environment incorporating the task difficulty control signal τ .
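The episodic use of τ can be sketched as a reset routine that squashes the raw control value with a sigmoid and uses the result to scale the randomization ranges. The parameter names and ranges below are hypothetical stand-ins for the entries of Table 3, and scaling the sampled offset linearly by the level is an assumption of this sketch.

```python
import math
import random

def sigmoid(x):
    """Numerically safe logistic function mapping any real value into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def reset_environment(nominal, full_range, tau_raw, rng):
    """Sample episode parameters around `nominal`, scaled by the curriculum level.

    level near 0 keeps the initial fixed environment; level near 1 requests
    the full randomization range. Called once at the start of each episode.
    """
    level = sigmoid(tau_raw)
    return {name: nominal[name]
            + level * rng.uniform(-full_range[name], full_range[name])
            for name in nominal}
```

A usage example: `reset_environment({"camera_x": 0.5}, {"camera_x": 0.1}, tau_raw, random.Random(0))` returns a camera offset within ±0.1 of the nominal value, shrinking toward the nominal value as `tau_raw` becomes strongly negative.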

IV. EXPERIMENTS
In this section, we first describe the experimental details. Then, the SPR results in known and unknown environments are presented. We then describe the comparative analysis on learning strategy and quantitative analysis in terms of Hausdorff similarity and pattern reduction ratio.

A. EXPERIMENTAL DETAILS
For the experiments, experience data were simultaneously collected from eight distributed processes in a single machine equipped with an i7-8700K CPU, 64 GB RAM, and Titan Xp and 1080ti GPUs. Because of the memory capacity limitations of the GPUs, the rollout processes were conducted by splitting the tasks among the two GPUs and updating the network parameters using the merged rollout data.
For Algorithm 2, we set the rollout frames $M = 4{,}096$, total learning frames $T = 1.0 \times 10^{7}$, and frames per episode $S = 256$. The rest of the learning parameters were set as follows: the learning rate for the actor $\eta_a = 5.0 \times 10^{-5}$, the learning rate for the critic $\eta_c = 1.0 \times 10^{-3}$, mini-batch size = 256, the number of PPO epochs = 5, discount factor = 0.99, and entropy coefficient = 0.01. We also used generalized advantage estimation [67] for temporal difference-based policy optimization.
If the training is conducted using a single process, the overall learning time for each policy will take almost 11 days for an LSTM with a single recurrent layer and a total number of rollout frames of 10M. Owing to the eight distributed processes, in contrast, we are able to reduce the overall training time to about 1.5 days for each policy on a single machine.

B. PATH GENERATION IN A KNOWN ENVIRONMENT
To verify the SPR algorithm, we first evaluated the EIE policy in a fixed environment. Evaluations were conducted for 50 episodes in the simulator (see Fig. 8) and the results are shown in Fig. 9. For each pattern, the left image shows the whole path of the laser pointer during one episode and the right image shows the visual patterns reconstructed by the laser pointer. Through the proposed SPR method, we show that the SPR policy is able to perform image-based path generation for various patterns including convex (Circle) and concave (Starfish) shapes without using additional manual programming, geometrical modeling of the target patterns, or curve fitting algorithms.
The proposed SPR method, however, has limitations for patterns that involve multiple intersecting points, such as the modified rose (Figs. 9(n) and (o)), and for elaborate patterns. In the former case, the agent may lose direction when it revisits an intersecting point because the pattern region there was already removed during the previous visit. For this reason, the path generation results for those patterns are less accurate than those for the others. In the latter case, the problem is caused by the scale difference between the pixel region of the pattern and that of the laser point. In Starfish2 (Fig. 9(j)), only a single path along the bottom protrusion was generated because the relatively large pixel region of the laser point unintentionally reduced the neighboring pattern region, so the agent could not find a source for path generation when it revisited that place. We call this the over-reduction problem, and we plan to address it in future work.
During evaluation, we found that the agent accelerates the laser pointer from the initial state until it reaches the pattern region (see Fig. 12). This is because the policy is trained to maximize the expected reward by reducing the pattern as quickly as possible during an episode. Once the laser point contacts the pattern, the agent sequentially reduces it at a relatively slow speed because of the observation-action sampling frequency (20 Hz).

C. PATH GENERATION IN AN UNKNOWN ENVIRONMENT
We evaluated the SPR policy in an uncalibrated environment, where the positions and orientations of the camera, target plane, RLP, and initial laser position were initialized to random values in each episode within the ranges listed in Table 3. Figure 10 shows the evaluation results of the SPR policy, which was trained by the SDC learning. In most cases, the agent was able to generate the paths for target patterns even though the robot, camera, and target objects were uncalibrated. In terms of the pattern reduction ratio (described in Section IV-E), the average performance of the SDC greatly exceeds other methods (see Table 8 in the Appendix).
FIGURE 11. SPR results for unlearned patterns: (a) Line. The remaining patterns were created by the superformula with parameters (ρ(φ); m, n_1, n_2, n_3). (b) Eye (1; 2, 0.5, 0.5, 0.5); (c) ConcaveStar (1; 5, 1, 1, 1); (d) Bean (1; 2, 1, 4, 8); (e) Spark (1; 6, 1, 4, 8). The first row shows the unlearned target pattern images. In the second row, the left side image represents the paths of the laser pointer whereas the right side image is the visualization of the laser pointer traces during one episode.
To evaluate the generality of the path generation skill, we further tested the SDC policy on unlearned patterns. Figure 11 shows the evaluation results for unlearned patterns such as Line, Eye, ConcaveStar, Bean, and Spark. Although the generated paths are not perfect, the overall performance appears better than that of the other methods. Moreover, in Fig. 11(a), the agent attempted to begin the pattern reduction from the end of the line instead of the middle or somewhere close to the starting position. This indicates that the agent learns how to obtain higher rewards for a given pattern even if that pattern has never been observed before. From this experiment, we confirmed that the proposed SPR, reward design, and SDC learning method together are an effective way to learn a robust path generation skill for arbitrary target patterns in an uncalibrated environment.

D. COMPARATIVE ANALYSIS ON LEARNING STRATEGY
We compared the performances of the four models (EIE, ERE, LIC, and SDC) during the learning phase in a fully randomized environment. As shown at the top of Fig. 13, the LIC policy converges quickly in the early phase of training; however, the performance degenerates as learning progresses because of the discrepancy between the learning speed of the policy and the difficulty of the given task. As expected, ERE converges slowly whereas the proposed SDC outperforms others as learning progresses.
We analyzed the tendency of the randomization control τ to determine the relationship between the randomization constant and learning curve. Figure 14 shows the change in the randomization constant for each learning method for every 100 training epochs, where each epoch corresponds to 4,096 rollout frames. Because of its constant full randomization, the learning curve of ERE is the slowest to converge.
The learning curve of LIC shows better performance in the beginning and poor performance at the end because LIC merely increases the randomness without consideration for the learning capability of the policy network. In the SDC learning, in contrast, the agent appropriately controls the randomization constant for itself by taking into account its learning capability. Thus, we expected SDC to perform the best.
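The difference between the four strategies can be summarized by how the randomization constant is chosen during training. The sketch below is only a schematic under stated assumptions: the function name is ours, the linear ramp used for LIC is an assumed shape (the text says only that LIC increases the randomness regardless of the policy's learning capability), and for SDC the value is simply whatever the agent itself outputs after sigmoid rescaling.

```python
def randomization_constant(method: str, progress: float, agent_tau: float = 0.0) -> float:
    """Return the randomization constant in [0, 1] used for the next episode.

    method    : "EIE", "ERE", "LIC", or "SDC"
    progress  : fraction of training completed, in [0, 1]
    agent_tau : sigmoid-rescaled randomization action (used by SDC only)
    """
    clamp = lambda v: min(1.0, max(0.0, v))
    if method == "EIE":
        return 0.0              # fixed environment, never randomized
    if method == "ERE":
        return 1.0              # constant full randomization
    if method == "LIC":
        return clamp(progress)  # ASSUMPTION: linear ramp, blind to the policy
    if method == "SDC":
        return clamp(agent_tau) # chosen by the agent itself
    raise ValueError(f"unknown method: {method}")
```

Only the SDC branch depends on the agent, which is what allows the difficulty to track the learning capability of the policy rather than the training clock.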
We evaluated the four learning methods in the test phase to demonstrate the performance of the SDC in terms of reward. Evaluations were conducted using only pattern reduction reward r p during 50 episodes, where each episode consists of 256 frames. The results shown at the bottom of Fig. 13 indicate a tendency that is identical to the one above. The proposed SDC outperforms the other methods, and the EIE policy performs the worst because it has never observed varied scenes during training. From these evaluation results, we can conclude that the proposed SDC learning approach could be effective in a randomized learning environment.

E. QUANTITATIVE ANALYSIS
1) HAUSDORFF SIMILARITY
For a more objective evaluation of the proposed method, we quantitatively analyzed the four policies (EIE, ERE, LIC, and SDC) using the Hausdorff similarity measurement [68]. We first extracted the point set of each target pattern and then measured the Hausdorff distance, based on the Euclidean distance, between the extracted point set and the points of the generated path. Tables 6 and 7 in the Appendix show the Hausdorff distances between the point sets of the target patterns and the paths generated using the four learning methods, for the learned and unlearned patterns, respectively. In each column of Table 6, the baseline is the evaluation result of the EIE policy in the fixed environment; these distances are measured from the results in Fig. 9 and are given for reference. We marked the minimum distance values in bold in all columns (excluding the baseline). The number of minimum distances for each learning method is shown in Table 4, which shows that the proposed SDC outperforms the other methods on both the learned and unlearned patterns.
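For reference, the distance used in this comparison can be sketched in a few lines. This is a brute-force illustration with our own function names; it assumes the symmetric form of the Hausdorff distance, whereas the exact variant used in [68] and any point-set preprocessing are not specified here.

```python
import math

def directed_hausdorff(a, b):
    """Largest Euclidean distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty 2-D point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Toy example: the generated path overshoots the target pattern by one point.
target_points = [(0.0, 0.0), (1.0, 0.0)]
path_points = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
distance = hausdorff(target_points, path_points)
```

A smaller distance means that every point of the generated path lies close to the target pattern and vice versa, which is why the minimum values are the ones highlighted in the tables.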

2) PATTERN REDUCTION RATIO
In addition to the Hausdorff similarity, we present a quantitative analysis of the four learning methods based on the ratio between the reduced and original pattern regions, calculated as follows:

$$\lambda = \frac{\sum_{i} I^{\mathrm{reduce}}_{i}}{\sum_{j} I^{\mathrm{pattern}}_{j}}$$

where $I^{\mathrm{reduce}}_{i}$ and $I^{\mathrm{pattern}}_{j}$ represent each pixel of the reduced pattern and the original pattern, respectively. The pattern reduction ratio λ is thus the ratio of the pixel sums of the two images, so a smaller λ indicates that less of the pattern remains. Table 5 presents the number of minimum values of the pattern reduction ratio for each learning method; the detailed data are shown in Tables 8 and 9 of the Appendix. The statistical data demonstrate that for the learned patterns, the SDC obtains average pattern reduction ratios that are 81.6%, 13.7%, and 20.6% less than those of EIE, ERE, and LIC, respectively. Similarly, it obtains ratios that are 81.5%, 37.8%, and 29.2% less for the unlearned patterns. The corresponding images for the reduced patterns are shown in Figs. 17-20 of the Appendix.
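A small sketch of this measurement, assuming binary images stored as nested lists in which 1 marks a pattern pixel (the function name and the image representation are ours, and the ratio is taken as remaining pattern pixels over original pattern pixels):

```python
def pattern_reduction_ratio(reduced, original):
    """Ratio of remaining pattern pixels to original pattern pixels.

    Both arguments are 2-D lists of 0/1 values of the same shape.
    A lower value means that more of the pattern was removed.
    """
    remaining = sum(sum(row) for row in reduced)
    total = sum(sum(row) for row in original)
    if total == 0:
        raise ValueError("original pattern contains no pattern pixels")
    return remaining / total

# Toy 2x2 pattern: four pattern pixels originally, one left after the episode.
original_pattern = [[1, 1],
                    [1, 1]]
reduced_pattern = [[0, 1],
                   [0, 0]]
ratio = pattern_reduction_ratio(reduced_pattern, original_pattern)
```

In the toy case one of four pattern pixels survives, giving a ratio of 0.25; a policy that traces the whole pattern would drive the ratio toward 0.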

V. ABLATION STUDIES
In this section, we present the ablation studies for components of the network architecture and reward function to verify their effectiveness in learning.

A. NETWORK ARCHITECTURE
Although the agent performs a frame-by-frame action, the SPR task should be considered in terms of a series of actions. From this point of view, we make two assumptions. The first is that the LSTM network performs better than a fully connected network because of its recurrence property. However, it is not clear how many recurrent layers the LSTM should have for the best performance. This leads to the second assumption: increasing the number of recurrent layers as a stacked LSTM will improve the SPR performance because the multiple recurrent layers allow the agent to learn more abstract feature representations.
To verify these assumptions, we compared the training results obtained with fully connected, LSTM, and stacked LSTM layer architectures in a randomized environment. The first assumption was confirmed: the single-layer LSTM performs better than a fully connected network, as shown in Fig. 15. However, the result for the stacked LSTM disproves our second assumption. As the number of recurrent layers increases, the performance decreases and becomes even worse than that of the fully connected network. One interpretation is that using more than one recurrent layer aggravates overfitting and provides more capacity than our task requires. However, we still believe that a multi-layered LSTM may be advantageous for longer and more complex patterns. To sum up, a single-layered LSTM network is the best choice for our SPR task.

B. REWARD FUNCTION
In addition to the network architecture, we performed an ablation experiment on the reward function to determine the effect of each term. The reward in (18) is a weighted sum of three sub-rewards. As shown in Fig. 16, the full weighted sum (PR+CB+MS) performs better than the other combinations. Moreover, (PR), which uses only the pattern reduction reward, performs the worst. From this, we can infer that r_p alone is an insufficient reward for learning the SPR policy effectively because the rewards for consecutive pattern reduction and independent pattern reduction are not distinguishable. Moreover, no penalty is imposed on the agent when it misses a pattern; thus, nothing discourages the agent from missing. Even though the agent can learn appropriate consecutive actions from the discounted rewards, this is very inefficient. Although (PR+CB) yields a better result than (PR) owing to the combo reward, its performance is lower than that of (PR+CB+MS). From this result, we can verify the positive effect of r_m in SPR policy learning.

VI. CONCLUSION AND FUTURE WORK
In this paper, we proposed a deep RL-based SPR method using a simple RLP device to learn the path generation skill for arbitrary target patterns. We also proposed the SDC learning method for effective learning in a randomized environment. Using our method, we were able to learn a robust path generation policy that can generate paths for arbitrary patterns in an uncalibrated environment. In particular, we experimentally verified that SDC learning is an effective way to learn randomized tasks and outperforms other ordinary and curriculum-based methods. We believe that the proposed device and learning-based path generation method could be used for non-physical human-robot interaction applications such as instructional aids for education and assembly guidance. We expect that the SDC can be generally used to improve the performance of other RL problems.
We would like to note some limitations of our approach. The proposed SPR method still requires classic image processing. Moreover, the over-reduction problem still remains unsolved for some cases. These hinder the application of the proposed method to more complex patterns such as Starfish2 with multiple intersection points, which is the immediate direction of our future work. Encouraged by the simulation results, we plan to conduct experiments applying the proposed method to a real world robot and target patterns.

A. QUANTITATIVE EVALUATION OF GENERATED PATHS
In this section, we provide supplementary experimental results and analysis. Tables 6 and 7 show the quantitative evaluation results on the generated path using the Euclidean Hausdorff distance for learned and unlearned patterns, respectively. Tables 8 and 9 show the ratio values of the reduced patterns with respect to the original patterns in each learning method for learned and unlearned patterns, respectively. The data are visualized in Section B.

As an artist of code painting, he has held several personal and group exhibitions with images generated with his own algorithms. He is currently a Principal Research Scientist with ETRI and a Professor with the Korea University of Science and Technology (UST), South Korea. His research interests include geometric modeling, photorealistic rendering, augmented reality, human-robot interaction, deep learning, and artificial intelligence. He was a recipient of the Wolfram Innovator Award, in 2019.