Self-Correction for Eye-In-Hand Robotic Grasping using Action Learning

Robotic grasping of cluttered and heterogeneous targets is not yet handled satisfactorily by the deep learning methods developed in the last decade. The main problem lies in their intelligence, which is static: accuracy is high in ordinary environments, but a cluttered grasping environment is highly irregular. In this paper, action learning for robotic grasping with eye-in-hand coordination is developed to grasp a wide range of cluttered objects using a 6 degree-of-freedom (DOF) robotic manipulator equipped with a three-finger gripper. To build action learning into this system, k-Nearest Neighbor (kNN), Disparity Map (DM), and You Only Look Once (YOLO) are used. Once the learning cycle is formulated, an assessment instrument evaluates the robot's environment and performance with qualitative weightings. Experiments were conducted on measuring target depth, localizing varied targets, detecting targets, and the gripping process itself. The entire process is divided into plan, act, observe, and reflect stages in each action learning cycle. If the first cycle does not meet the minimum pass standard, the cycle is repeated until the robot succeeds in picking and placing. This study demonstrated that the action learning-based object manipulation system with stereo-like vision and eye-in-hand calibration can improve on its previous errors while keeping errors within acceptable limits. Thus, action learning might be applicable to other object manipulation systems without having to define the environment first.


I. INTRODUCTION
Mimicking human behavior for object manipulation means studying the inherent interaction between perception and action under fast feedback. A typical case is the complex task of extracting a single object from a pile of messy objects. Without prior planning, tactile feedback, or vision, such manipulation can hardly be done well [1]. In contrast, robotic manipulation tends to rely on initial analysis and planning, followed by trajectory feedback to ensure adherence during execution. In other words, it usually uses multiple sensors, sensor fusion, or tactile sensors, but these require a dedicated approach before they can serve as continuous feedback. Continuous feedback is required in the visual servoing technique, which in turn requires feature identification [2]. Both open-loop perception and feedback features require calibration to determine the accurate geometric relationship between the end-effector of the robot and the camera [3], and they also involve some deep learning processes.
In the last decade of deep learning, Krizhevsky et al. [5] introduced AlexNet, a Convolutional Neural Network (CNN). The Faster R-CNN (Region-based CNN) offers better precision and speed than AlexNet, CNN, R-CNN, and Fast R-CNN. Redmon et al. [6] provided another highly capable method, YOLO (You Only Look Once), later extended to YOLOv3 [7]. In YOLOv2, the detection speed is even higher than that of Faster R-CNN. The prowess of deep learning needs to be supported by other techniques to be applicable in robotics.
Previous work by Levine et al. [8] from Google Inc. employed hand-eye coordination for a grasping robot that successfully grasped new objects through continuous servoing. Hand-eye coordination makes object localization easier to estimate, but the camera field of view (FoV) can be blocked by the robot arm itself [8], [9]. In addition, study [8] used a very large dataset of about 800,000 grasp attempts collected with 6 to 14 robot manipulators running in parallel. This approach is not practical in terms of time and requires many robot units. The interesting aspect of that study is the grasping of new objects that had not been seen before.
Although work [8] involved a large dataset for deep learning, the number of successful grasps remains unpredictable. Generally speaking, the deep learning methods developed so far still have weaknesses when applied to dynamic environments, such as heterogeneous objects, a wide range of targets, and cluttered objects. Deep learning depends strongly on the learning rate chosen at the training stage. Moreover, the ability of deep learning to be both specific and general turns out to be a weakness when targets overlap or are only partially visible, and when decision making is involved. For that reason, deep learning generally needs to be combined with other systems, such as supervised or unsupervised learning [10]-[12].
Systems that incorporate deep learning are becoming prevalent and have been widely applied. For example, Shi et al. [13] and Tsai et al. [14] implemented Deep Reinforcement Learning (DRL) and Deep CNN (DCNN), respectively, for mobile robots. Similar work was done by Chen et al. [15] by combining DRL, RNN (Recurrent Neural Network), and LSTM (Long Short-Term Memory), but its reported performance was below 47%. Riviere et al. achieved an outstanding result with end-to-end learning using DCNN and Graph Neural Network (GNN) approaches that can run on low-end microcontrollers, but it is limited to six obstacles and neighbors [16]. Besides end-to-end techniques, deep learning is often combined with Reinforcement Learning (RL).
Recent works [4], [14], and [17] used RL followed by deep networks. Although RL is quite powerful when combined with other techniques, its intelligence does not improve, because the environment determines the value at each stage and agents are trained with static data, which is not suitable for a changing environment. We try to address this shortcoming of RL by offering a novel action learning method, which works without setting a value for each state. Action learning has long been used in education [10], [11], [18], [19], but its adaptation to robotics or artificial intelligence (AI) has not been reported.
The action learning principle imitates the human learning method: in addition to drawing on past learning, the robot evaluates itself and the environment using several assessment indicators. In this way, action learning follows a learning cycle that is repeated until a specific passing grade is met. Besides its primary intelligence, the robot thus also learns to improve its capabilities through action learning. In practice, we apply it to a pick and place task in a cluttered bin. We therefore expect that our robot system, powered by action learning in grasping, will be more effective.
Specifically, we propose a vision-based object manipulation system using a standard robotic manipulator that can pick and place objects from cluttered and overlapping positions, which are frequently encountered when picking with an eye-in-hand manipulator. The contributions of this paper are as follows:
• A stereo camera-like setup is employed to estimate the depth of targets that are variable, heterogeneous, and cluttered, and a localization method based on DM-kNN is proposed.
• We aim for recognition and detection that are as accurate as possible, using a modified YOLOv3 for basic detection and for self-correction validation.
• Learning independently from mistakes is addressed by developing action learning for target picking and placing tasks and for the depth collision problem of manipulators in layered environments. The environment is assessed as a reference value for making a decision in a single cycle.
• The proposed action learning system was applied to target picking with a six degree-of-freedom (DOF) robot manipulator with a three-finger gripper and evaluated; it might provide an alternative for similar robot cases.
The rest of this paper is organized as follows. Section II gives an overview of the system design. Section III introduces self-correction for robotic grasping, and Section IV details action learning in a cluttered environment. Section V describes the experimental results. Finally, Section VI concludes the work and offers ideas for possible future work.

A. PROPOSED SYSTEM DESIGN
The whole system is shown in Figure 1, where the dashed box refers to the action learning process, the green part provides the preprocessing inputs of action learning, and the red box is the robot goal. The goal of completing the moving and pick-and-place task requires the gripper to avoid ambiguous decisions; hence, the procedure saves time by assessing several indicators or inputs.
The inputs of the DM (Disparity Map), YOLOv3, kNN, orientation/edge detection, and β are RGB images with a resolution of 640×480 pixels. The output of the DM is a far/near distance in the range of 270-300 mm. The YOLOv3 output is a detection result with a confidence level in percent (%), and the kNN output takes the form of coordinates (X, Y, Z). The orientation output is an angle in degrees (°), and the environmental assessment value β and the passing grade are values from 0 to 100. All these values are fed into the plan stage of each cycle and then turned into a decision. Given the large number of inputs and the variety of targets, it is necessary to limit the scope of our proposed action learning.

B. THE SYSTEM LIMITATION
The developed action learning with eye-in-hand configuration is limited to picking and placing the 10 target classes that have been trained. The number of cycles in action learning cannot be predicted unless a limit is imposed, so in this paper we limit the number of cycles to two. We did this to minimize target dislocation caused by collisions between the robot's fingers and the tray, which would occur if retrieval attempts were unrestricted. Further explanations are given in Section IV.

A. STEREO CAMERA-LIKE WITH DEPTH ESTIMATION
The stereo camera-like setup was built from a single Logitech C920 mono webcam with a resolution of 640×480 pixels, as shown in Figure 2. The camera is first placed at the initial point (50, 450) and then shifted along the x axis to the point (150, 450). The two positions lie on an imaginary rigid surface, aligned in the y axis and 100 mm apart in the x axis. The two views must also be perfectly aligned to avoid a height offset in the resulting 3D image. To measure the misalignment between the two views, the positions of the blue dot of the object in the 2D image plane are computed, and the x and y values in the left and right images are compared, as illustrated in Figure 3. The difference between $y_l$ and $y_r$ should be zero, which indicates that the two views are aligned. In Figure 3, the blue dot appears on the image plane of the left camera at $(x_l, y_l)$ and on the right image plane at $(x_r, y_r)$. The distance between the optical centers of the left and right cameras is called the baseline (b). The distance between $x_l$ and $x_r$ is called the disparity (d), as shown in Eqs. (1)-(3), where Z is the depth of point P and f is the camera focal length.

$$d = x_l - x_r \qquad (1)$$

$$\frac{x_l}{f} = \frac{X}{Z} \qquad (2)$$

$$\frac{x_r}{f} = \frac{X - b}{Z} \qquad (3)$$

Substituting Eq. (2) into Eq. (3) gives the depth Z in Eq. (4). After obtaining Z, we can use Eqs. (5) and (6) to obtain the X and Y coordinates of point P, respectively,

$$Z = \frac{f\,b}{d} \qquad (4)$$

$$X = \frac{x_l\,Z}{f} \qquad (5)$$

$$Y = \frac{y_l\,Z}{f} \qquad (6)$$

where the pixel locations on the 2D image are $x_l$ and $y_l$, and the actual positions in 3D are X, Y, and Z.
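As a minimal sketch of the triangulation referred to by Eqs. (4)-(6), the following Python function recovers (X, Y, Z) from one matched pixel pair. It assumes the standard rectified pinhole model, pixel coordinates measured from the principal point, and a focal length expressed in pixels. The function name and the numeric focal length are illustrative assumptions and are not taken from the paper; only the 100 mm baseline comes from the text.

```python
def triangulate(x_l, y_l, x_r, focal_px, baseline_mm):
    """Recover the 3D point (X, Y, Z) in mm from a rectified stereo match.

    Assumptions (not from the paper): pixel coordinates are measured from
    the principal point, and focal_px is the focal length in pixels.
    """
    disparity = x_l - x_r  # horizontal shift between the two views
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_mm / disparity  # depth from similar triangles
    x = x_l * z / focal_px                  # back-project the left pixel
    y = y_l * z / focal_px
    return x, y, z


# Example: 100 mm baseline as in the text, hypothetical 600 px focal length.
X, Y, Z = triangulate(x_l=120.0, y_l=40.0, x_r=-80.0, focal_px=600.0, baseline_mm=100.0)
print(X, Y, Z)
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters more for far targets than for near ones, which is one reason a second depth check is useful.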
On the other hand, the camera's FoV can be used for depth verification by finding the dx value; see Eq. (7). If the diagonal FoV (DFoV) is given as in Figure 4, both the vertical FoV (VFoV) and the horizontal FoV (HFoV) of the C920 camera can be found. Because this camera uses a 16:9 CMOS sensor by default, it should be converted to the 4:3 aspect ratio using Eq. (7), where dx denotes the length between the camera pinhole cp and the frame center, and dh denotes the length between the camera pinhole and the frame vertex. The horizontal line is half the length of cx, the vertical line is half the length of cy, and the diagonal line is half the length of dy. As a result, the difference between the 4:3 and 16:9 aspect ratios is related to the length of dy.
With HFoV and VFoV, the depth of a position can be recognized by comparing the perimeter or volume of an object, since distance, object size, and camera have a linear relationship within the FoV. In computer vision, another method, the Disparity Map (DM), is quite popular. A disparity map refers to the difference in visible pixels or motion between a pair of stereo images. The baseline causes a shift of several pixels that depends on the baseline length. The disparity map shows a gradation of distance; although it is not expressed in units of length, it is quite helpful. Fang et al. [20] combined this method with a CNN to estimate depth from disparity map results. Thus, the disparity map can be used for depth estimation.

The disparity map represents the displacement of corresponding pixels between the left and right images. However, locating corresponding pixels is difficult. Several factors may cause problems in the non-occluded pixels, such as lack of texture, camera noise, homogeneity, and repeated texture. The disparity is calculated for all pixels using Block Matching (BM), and the validity of the disparity is defined in Eq. (8). The disparities between the left image and the right image are derived from Eq. (9) and Eq. (10), where ε is the normalized BM error at pixel (x, y) with horizontal disparity d, W is the BM window, the disparity is limited to a maximum permissible value, and u and v are the numbers of pixels along the x and y axes of the camera image plane, respectively. To check the observed disparity, Eq. (11) expresses the disparity from the right image frame to the left image frame. The Minimum Matching Error (MME) determines how closely the image values at (x, y) in the left image and at (x+d, y) in the right image correspond to the same point. The MME, given in Eq. (12), is well known for its effectiveness.
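To make the block matching step concrete, here is a minimal sketch that searches, for every pixel of the left image, the horizontal shift with the smallest sum of absolute differences (SAD) inside a square window. This is a generic SAD matcher under stated assumptions, not the paper's Eqs. (8)-(12): it assumes rectified grayscale images and the sign convention that `left[y, x]` corresponds to `right[y, x - d]` with `d >= 0`. The function name, window size, and synthetic test images are illustrative.

```python
import numpy as np


def block_matching_disparity(left, right, max_disp, half_window=2):
    """Dense disparity by SAD block matching, using the left image as reference.

    Assumes rectified images where left[y, x] matches right[y, x - d], d >= 0.
    Border pixels that cannot hold a full window are left at 0.
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    disparity = np.zeros((h, w), dtype=np.int32)
    for y in range(half_window, h - half_window):
        for x in range(half_window + max_disp, w - half_window):
            patch = left[y - half_window:y + half_window + 1,
                         x - half_window:x + half_window + 1]
            best_d, best_err = 0, np.inf
            for d in range(max_disp + 1):
                cand = right[y - half_window:y + half_window + 1,
                             x - d - half_window:x - d + half_window + 1]
                err = np.abs(patch - cand).sum()  # matching error for shift d
                if err < best_err:                # keep the minimum matching error
                    best_d, best_err = d, err
            disparity[y, x] = best_d
    return disparity


# Synthetic check: the right view is the left view shifted by 3 pixels.
rng = np.random.default_rng(0)
left_img = rng.random((20, 40))
right_img = np.roll(left_img, -3, axis=1)
disp = block_matching_disparity(left_img, right_img, max_disp=5)
print(np.unique(disp[2:-2, 7:-2]))
```

On real images the minimum is often ambiguous in textureless or repetitive regions, which is why a validity check such as the left-right consistency test described above is usually applied afterwards.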
Apart from the DM, other methods such as kNN are needed. The kNN principle [21] has been applied in assistive robotics, and another study combined the Kinect sensor with the kNN algorithm [22]. kNN classifies a sample based on its closest distance to reference samples. kNN is reported to have a weakness in distinguishing individual entities of each object, but it remains open to classification with multiple references.

B. TARGETS DETECTION AND LOCALIZATION
In this study, YOLOv3 was used as the basis for target localization. Its results take the form of a confidence level and a bounding box. The bounding box is then used as the basis for determining the grip point of the target, so detection with YOLOv3 is critical. Two aspects need further explanation: target detection with YOLOv3 and the orientation of the detected object.

1) TARGET DETECTION
The essential part of detection with deep learning is the choice of feature extractor. So far, no standard feature extractor has been established for an algorithm such as YOLO, which opens opportunities for customizing the algorithm for hardware [23], [24]. YOLOv3 can add more variation to the training data through data augmentation rather than by increasing the number of labelled training samples. YOLOv3 with SqueezeNet is shown in Figure 5. Data augmentation techniques include random horizontal flipping, random scaling by 10%, and color jitter in HSV space.
The cyan part is the feature extraction network using SqueezeNet, the purple part indicates the first detection head, and the gray part is the second detection head, each with its respective outputs. In this SqueezeNet, we use nine depth concatenation layers with an image input size of 227×227×3. There are 86 layers with 75 connections, and the output is a classification into 10 classes. The basic idea of the YOLO architecture is to employ two networks simultaneously so that certain parts of the process can be bypassed quickly. YOLOv3 uses logistic regression to estimate an objectness score (confidence) for each bounding box.
The original SqueezeNet settings for the activation function are preserved by using the Rectified Linear Unit (ReLU) function in the fire modules [23]. The leaky ReLU function is followed by the fully connected (FC) layers. Leaky ReLU is a modified version of ReLU with a slight slope in the function output for negative inputs [6]. The derivative is therefore never zero, which reduces the appearance of silent neurons and solves the problem of ReLU failing to learn when negative intervals are encountered. The leaky ReLU is defined in Eq. (13).
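The behavior described above can be illustrated with a short NumPy sketch of leaky ReLU and its derivative. The slope value of 0.1 for negative inputs is an assumption chosen for illustration; the value used in the paper's network is not stated in this section.

```python
import numpy as np


def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: identity for positive inputs, small slope otherwise.

    negative_slope=0.1 is an illustrative assumption, not the paper's value.
    """
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, negative_slope * x)


def leaky_relu_grad(x, negative_slope=0.1):
    """Derivative of leaky ReLU; it is never zero, so no neuron goes silent."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, negative_slope)


inputs = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(inputs))
print(leaky_relu_grad(inputs))
```

With plain ReLU the gradient for the first two inputs would be zero, so weights feeding those units would stop updating; the small negative slope keeps a gradient flowing.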
During training, our model is optimized using the categorical cross-entropy loss function:

$$\text{loss} = -\frac{1}{n}\sum_{i=1}^{n}\left(y_{i1}\log\hat{y}_{i1} + y_{i2}\log\hat{y}_{i2} + \cdots + y_{im}\log\hat{y}_{im}\right) \qquad (14)$$

where n and m represent the number of samples and the number of categories, respectively, $y$ represents the true value, and $\hat{y}$ represents the predicted value. In practice, it is necessary to make the loss function pay more attention to the categories with few samples, which helps solve the sample imbalance problem. To make model training run smoothly and avoid overfitting, we add loss factors to the loss function as in Eq. (15):

$$\text{loss} = -\frac{1}{n}\sum_{i=1}^{n}\left(\lambda_1 y_{i1}\log\hat{y}_{i1} + \lambda_2 y_{i2}\log\hat{y}_{i2} + \cdots + \lambda_m y_{im}\log\hat{y}_{im}\right) \qquad (15)$$

The values of the loss factor $\lambda$ for the different target categories are calculated as in Eq. (16), where $C_n$ represents the total number of samples, $N_i$ represents the number of samples of class i, and n is the number of target categories.
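A minimal NumPy sketch of a class-weighted categorical cross-entropy of the kind described by Eq. (15) is given below. The weighting rule in `class_weights` (total samples divided by the number of classes times the class count) is a common inverse-frequency choice and is an assumption here; it is not claimed to be the paper's Eq. (16). All function names are illustrative.

```python
import numpy as np


def class_weights(counts):
    """Inverse-frequency loss factors (assumed rule, not the paper's Eq. (16))."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)


def weighted_cross_entropy(y_true, y_pred, weights, eps=1e-12):
    """Mean over samples of -sum_j weights[j] * y_true[i, j] * log(y_pred[i, j])."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)  # avoid log(0)
    per_sample = -(np.asarray(weights) * y_true * np.log(y_pred)).sum(axis=1)
    return float(per_sample.mean())


# Two classes with 30 and 10 samples: the rarer class gets the larger factor.
lam = class_weights([30, 10])
y_true = np.array([[1, 0], [0, 1]])
y_pred = np.array([[0.9, 0.1], [0.4, 0.6]])
print(lam, weighted_cross_entropy(y_true, y_pred, lam))
```

With all weights set to 1 the function reduces to the plain categorical cross-entropy of Eq. (14), so the weights only rescale how much each class contributes.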

2) TARGET ORIENTATION
After the grasping point has been identified, the object orientation is needed for robot grasping. If the targets on the tray are cluttered, orientation recognition is required, because overlapping or overlapped objects can form new orientations. On the other hand, picking and placing objects such as circles and spheres, or picking with a vacuum (non-finger) gripper, does not require orientation. Broadly, the orientations are grouped into five types based on the ratio of the longest axis to the horizontal x axis.
The estimated region of the object provided by the MATLAB function is used to calculate the object orientation, which ranges from -90° to 90°. During the eye-in-hand adjustment process, these orientation data must be matched to the end-effector's orientation so that the object can be grasped properly. The angle formed by the x axis and the major axis of the ellipse, as shown in Figure 6, is known as the object orientation [25]. The relationship among the horizontal line x, vertical line y, width W, and height H of the object is given in Eq. (17). On the left side of the figure, the blue lines are the axes of the ellipse and the red dot is their center. The orientation is defined as the angle between the horizontal dashed line and the central axis. The image region and its ellipse are shown on the right side of the figure. Each map function is classified into four categories: (b) horizontal, (c) vertical, (d) left diagonal, or (e) right diagonal.
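The orientation of a region's equivalent ellipse can be sketched from second-order central moments, which is the idea behind the major-axis angle described above. This is an illustrative NumPy version and not the MATLAB routine used in the paper; it assumes a binary mask as input and measures the angle counterclockwise from the x axis with the image y axis flipped to point upward.

```python
import numpy as np


def region_orientation_deg(mask):
    """Angle in degrees, in (-90, 90], between the x axis and the major axis.

    Assumption: mask is a binary image of one object; image rows grow
    downward, so y is negated to obtain a counterclockwise angle.
    """
    ys, xs = np.nonzero(mask)
    if xs.size < 2:
        raise ValueError("need at least two foreground pixels")
    x = xs - xs.mean()
    y = -(ys - ys.mean())  # flip so that y points up
    mu20 = np.mean(x * x)  # spread along x
    mu02 = np.mean(y * y)  # spread along y
    mu11 = np.mean(x * y)  # covariance between x and y
    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    return float(np.degrees(theta))


horizontal = np.zeros((5, 9), dtype=bool)
horizontal[2, 1:8] = True
diagonal = np.eye(5, dtype=bool)  # runs from top-left to bottom-right
print(region_orientation_deg(horizontal), region_orientation_deg(diagonal))
```

The returned angle would still have to be mapped to the gripper's wrist rotation, which depends on how the camera is mounted on the end-effector.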

3) TARGET LOCALIZATION
The localization of the target combines the X and Y coordinates, the Z depth, and the orientation. Targets that YOLOv3 successfully recognizes are used as an external reference, in addition to the centroid of the target, which contains the XY coordinates. Meanwhile, the FoV and the disparity map in turn verify the Z depth obtained from the stereo camera-like setup. Both are combined with the orientation, so that 3D points are formed together with their orientations, as shown in Figure 7. Figure 7 shows the DM-kNN architecture combined with FoV. The two input images are processed with the popular histogram of oriented gradients (HOG) approach, and the centroid of each target is obtained [26].
The combined results produce 3D coordinates in image frame I. The kNN classification method is one of the most powerful classification methods, which supports our choice of it [21]. The problem of identifying the position of an object with respect to its nearest neighbors can be solved using the Euclidean distance.
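The nearest-neighbor step can be sketched as follows: given the 3D centroid of the target and the centroids of the other detected objects, return the k closest ones by Euclidean distance. The function name and the sample coordinates are illustrative assumptions; k = 5 follows the value shown in the Figure 13 caption.

```python
import numpy as np


def k_nearest(target, candidates, k=5):
    """Indices and distances of the k candidates closest to the target.

    target: (X, Y, Z) of the object to grasp.
    candidates: array of shape (N, 3) with the other objects' centroids.
    """
    candidates = np.asarray(candidates, dtype=float).reshape(-1, 3)
    dists = np.linalg.norm(candidates - np.asarray(target, dtype=float), axis=1)
    order = np.argsort(dists)[:k]  # closest first
    return order, dists[order]


# Hypothetical centroids in mm.
neighbors = [[3.0, 4.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
idx, dist = k_nearest((0.0, 0.0, 0.0), neighbors, k=2)
print(idx, dist)
```

Neighbors whose distance falls below the gripper's finger range are the ones most likely to obstruct the grasp and are natural candidates to be moved first.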

C. ACTION LEARNING FOR SELF-CORRECTION
With the rise of AI in the last decade, work on algorithms for robotic manipulators appears to be growing again. Learning for robots was originally modeled on human educational learning. Several learning theories have been adopted, and each has its own syntax, so it can be reduced to a procedure or algorithm.

1) APPROACH OF ACTION LEARNING IN ROBOTIC
Broadly, action learning was introduced by Altrichter et al. and Dick et al. [28], [30], [39], and it has undergone several modifications by Bell, Aldridge, Whitehead, McNiff, Norton, Stringer et al.; some have even called it classroom action research or action research. Details of action learning are discussed in the next subsection. To date, the development of action learning for robotic manipulators has not appeared in engineering publications, and it is still limited to the field of education [29], [35], [40], [41].

TABLE 1. Benchmarking of learning approaches ("-" marks a cell that could not be recovered).
Approach | Principle | Strength | Weakness | Syntax
- | - | Continuous improvement | Unpredictable when the learner will reach the maximum result | Plan, Act, Reflect, Learn
Active learning [31]-[33] | Proactively selects the subset of examples to be learned next from the pool of unknown data | Can query a user interactively | Iterative human-in-the-loop method; a sampling rate is needed | -
Collaborative learning [34], [35] | Apart from learning from the system, can also learn from other agents involved | Rich in learning resources | Double focus; takes time for the learner | Goal, Activity, Sequence, Distribution, Represent
Reinforcement learning [3], [13], [15], [36] | If a certain behavior is reinforced, it will most likely be repeated | - | - | -

To adopt action learning in the robotics field, it is necessary to understand the concepts of the existing general learning approaches, which are benchmarked in Table 1. It should be emphasized that action learning differs from other learning approaches. It shares some syntax, such as planning, acting, assessing, reflecting, evaluating, or reviewing, with active learning, reinforcement learning, metacognitive learning, and experimental learning, but the process as a whole is different.
The action learning architecture consists of cycles, with four stages in each cycle. The number of cycles cannot be determined or limited in advance; the cycle simply stops when a predetermined threshold is reached. The threshold value is obtained from an evaluation instrument, and a single instrument usually contains several items that serve as performance indicators. The performance value from this instrument is evaluated again in each cycle.

2) STEPPING IN EACH CYCLE
The four stages in one cycle are planning, acting, observing, and reflecting. The first stage is the plan, in which some of the inputs are analyzed using a particular approach. The second stage, act, executes the action that has been planned. For the robotic manipulator, the act is a series of motions that starts from the initial position, moves toward the target, performs the gripping process, and returns to the home position. In the third stage, observe, the system is observed through the assessment instrument after the act has been carried out. The last stage, reflect, evaluates the robot, in particular by verifying whether the attempt succeeded. If the target grasping process is not successful, the next cycle is started.
To verify the eye-in-hand configuration using action learning, the target pick and place task was performed as detailed below. Although we introduce action learning, reinforcement learning was an inspiration. The Bellman equation used by reinforcement learning gives a discounted value from the goal point, and the possible paths are trained to obtain the maximum value. In this way, the value of each state in reinforcement learning is defined beforehand. In contrast, in action learning the value is removed and replaced with a real-time assessment based on the instrument, which we also call a pass grade for learners. The pass grade value is denoted by . The assessment value comes from the eight assessment instruments in Table 2 and the illustration in Figure 8.
It should be declared that the output of YOLOv3 detection is and is the result of target localization, where , ∈ . The value obtained from the eight assessment indicators in Table 2 could be written as Eq. (18), So, the plan in the first cycle can be written as Eq. (19), where the plan is symbolized by , action by , observe by , and reflect by .
If the has fulfilled the conditions by , , , it will continue to the process, with conditions such as Eq. (20). 0 ⟹ From Eq. (20), we could write Eq. (21), and the value of the reflection result is dependent on with binary properties, If 1, then the cycle stopped; otherwise, if 0, it will scroll to the next cycle to evaluate and return its value. In Eq. (21), when observation o collects inputs from the value β ∧ δ, it means that in this position, YOLOv3 works for the second time to make sure if the target has been grasped or not and turns into second cycle.
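The plan, act, observe, and reflect cycle can be sketched as a small control loop. This is a hedged illustration of the cycle structure only: the four callables, the `CycleRecord` type, the scores, and the pass grade of 70 are hypothetical stand-ins, while the limit of two cycles follows the system limitation stated earlier.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class CycleRecord:
    cycle: int
    score: float
    passed: bool


def run_action_learning(
    plan: Callable[[List[CycleRecord]], Any],
    act: Callable[[Any], Any],
    observe: Callable[[Any], float],
    reflect: Callable[[float], bool],
    max_cycles: int = 2,
) -> List[CycleRecord]:
    """Repeat plan -> act -> observe -> reflect until reflect passes or the limit is hit."""
    history: List[CycleRecord] = []
    for cycle in range(1, max_cycles + 1):
        decision = plan(history)    # use earlier cycles to prepare the attempt
        outcome = act(decision)     # execute the motion and grasp
        score = observe(outcome)    # assess with the instrument
        passed = reflect(score)     # binary result: True stops the loop
        history.append(CycleRecord(cycle, score, passed))
        if passed:
            break
    return history


# Toy stand-ins: the first attempt scores 55, the retry scores 80, pass grade 70.
scores = iter([55.0, 80.0])
history = run_action_learning(
    plan=lambda hist: {"retry": len(hist) > 0},
    act=lambda decision: decision,
    observe=lambda outcome: next(scores),
    reflect=lambda score: score >= 70.0,
)
print([(r.cycle, r.score, r.passed) for r in history])
```

Unlike a reinforcement learning loop, no per-state value table is stored here; the only state carried between cycles is the history of assessments.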

3) ASSESSMENT OF THE ENVIRONMENT
To develop action learning in manipulator robots, the robot's perception of the target must be valid [34]. The robot's perception is assessed using an instrument of eight items. Every grasping attempt yields one assessment result in the range 1-100. The recorded data β are compared with the results obtained at that time and presented as a probability density curve. The normal distribution (also known as the Gaussian distribution) is a two-parameter family of curves. The central limit theorem states (roughly) that, as the sample size grows to infinity, the suitably normalized sum of independent samples from any distribution with finite mean and variance converges to the normal distribution. The normal distribution curve has been widely used to generalize, predict, and analyze decision making [42]-[47]. Furthermore, the basics of the normal distribution are commonly used in the development of deep learning.
[Table 2 fragment: "Compare the current of with the previous of (%)": ≥21, 6 ~ 20, ≤5; 3; "The value CDF in the next cycle".]
For action learning to work well, other instruments are still needed apart from the innate intelligence that the manipulator robot obtains through YOLOv3. The assessment instrument serves to assess environmental conditions in one cycle. This study uses eight initial data points to be generalized, plus the latest data to update the current environmental conditions based on eight indicators. Each datum has its own scale and weighting. These scales and weights are not standardized but are arranged as appropriate. Instrument details with all indicators are presented in Table 2, and the results of this assessment of the environment are denoted by β, which is separated into two parts. The results of this assessment drive the decisions that action learning makes for the robot.
The results of the assessment in Table 2 are presented in the form of a bell curve. The normal distribution is popular for modeling unbiased uncertainties and additive random errors, as well as symmetrical distributions of many natural processes and phenomena [44]. A commonly cited rationale for assuming normal distributions is the central limit theorem, which states that the sum of independent observations asymptotically approaches a normal distribution regardless of the shape of the underlying distribution:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \qquad (22)$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.
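Although the cumulative distribution function (CDF) of a normal distribution with mean $\mu$ and standard deviation $\sigma$ does not have a closed-form solution, it is frequently expressed using the complementary error function. It can also be expressed in terms of the standard normal CDF $\Phi$,

$$F(x) = \Phi\!\left(\frac{x-\mu}{\sigma}\right) = \frac{1}{2}\,\operatorname{erfc}\!\left(-\frac{x-\mu}{\sigma\sqrt{2}}\right) \qquad (23)$$

The probability coverage corresponding to a given interval around the mean is often used to describe the symmetrical nature of the distribution. For example, the interval $\mu \pm 1\sigma$ corresponds to P(A) = 0.683, the interval $\mu \pm 2\sigma$ corresponds to P(A) = 0.954, and the interval $\mu \pm 3\sigma$ corresponds to P(A) = 0.997.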
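The coverage values quoted above can be checked numerically with the complementary error function from the Python standard library. The function names are illustrative; the formula is the standard relation between the normal CDF and erfc.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2) written with the complementary error function."""
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2.0)))


def coverage(k, mu=0.0, sigma=1.0):
    """Probability that a sample falls within mu +/- k * sigma."""
    return normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)


for k in (1, 2, 3):
    print(k, round(coverage(k), 3))
```

The same `normal_cdf` could be used to place a new assessment score relative to the scores recorded in earlier cycles, given their mean and standard deviation.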

D. ROBOT COORDINATE TRANSFORMATION
Triangulation of methods for completing the camera-to-manipulator coordinate transformation has been reported [7], [25], [48]. Triangulation does not require prior knowledge, calibration, training, or a wide variety of methods. The use of kNN, the disparity map, HOG, and FoV is a potential approach. The advantages of each method complement the transformation from camera coordinates to manipulator coordinates, thereby overcoming the complexity of estimating the 3D position of the target in the world.
The camera is mounted on the end-effector in the eye-in-hand configuration and takes pictures of the cluttered targets, giving 2D target coordinates in camera frame C. It is essential to transform frame C to end-effector frame E [49], as illustrated in Figure 9. The target frame P is the object to be grasped; in the image it is indicated by a blue ball on the chessboard frame B. Suppose B is the location of cluttered target O in the frame. Let the position of the cluttered targets O be expressed in the robotic base frame R and in the camera frame C, respectively. Eq. (24) expresses the transformation of the target coordinates from camera frame C to robotic base frame R.
The transformation in Eq. (24) can be obtained from the structure of the MELFA RV-3SD robot manipulator shown in Table 3, including the joint j, the angle between two connecting rods θ, the link length l, the torsion angle of the connecting rod α, and the distance between two connecting rods d. The Denavit-Hartenberg (DH) parameters shown in Table 3 are the most common choice for the inverse kinematics used to control manipulators.
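The chain of frames described above can be sketched with homogeneous transforms built from DH parameters. This is an illustration under stated assumptions: it uses the standard (distal) DH convention, in which the argument `a` plays the role of the link length l, and the DH rows and the camera mounting transform in the example are hypothetical values, not the MELFA RV-3SD parameters of Table 3.

```python
import numpy as np


def dh_matrix(theta, d, a, alpha):
    """Homogeneous transform of one link in the standard DH convention."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])


def forward_kinematics(dh_rows):
    """Base-to-end-effector transform from a list of (theta, d, a, alpha) rows."""
    T = np.eye(4)
    for theta, d, a, alpha in dh_rows:
        T = T @ dh_matrix(theta, d, a, alpha)
    return T


def camera_to_base(p_cam, T_base_ee, T_ee_cam):
    """Map a 3D point from camera frame C to robot base frame R."""
    p_h = np.append(np.asarray(p_cam, dtype=float), 1.0)  # homogeneous point
    return (T_base_ee @ T_ee_cam @ p_h)[:3]


# Hypothetical planar two-link arm and a camera offset 50 mm along the tool z axis.
T_base_ee = forward_kinematics([(np.pi / 2, 0.0, 1.0, 0.0), (-np.pi / 2, 0.0, 1.0, 0.0)])
T_ee_cam = np.eye(4)
T_ee_cam[2, 3] = 50.0
print(camera_to_base([0.0, 0.0, 280.0], T_base_ee, T_ee_cam))
```

In an eye-in-hand setup the end-effector-to-camera transform is fixed and comes from hand-eye calibration, while the base-to-end-effector transform changes with every joint configuration.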

A. VARIOUS TARGETS ON CLUTTERED ENVIRONMENT
The targets laid on the tray are within the gripper's reach, so the centroid position of each detection result needs to be found. The simplest technique is to calculate the centroids from the bounding boxes, as expressed in Eq. (25). The bounding box matrix has four columns a …, and the number of rows depends on the number of detected targets; see Eq. (27). From Eq. (27), the centroid can be calculated and becomes the reference point for the gripper to pick the target. The centroid at this stage is still in the 2D image, so it is necessary to add the Z value obtained from the stereo camera-like setup, Eqs. (4)-(6), and verified by Eqs. (25)-(27).
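A minimal sketch of the centroid computation is shown below. It assumes that each row of the bounding box matrix is stored as [x, y, width, height] with (x, y) at the top-left corner, which is a common detector output layout but is an assumption about the column order here. The depth values appended at the end are hypothetical.

```python
import numpy as np


def bbox_centroids(bboxes):
    """Centroids (cx, cy) of boxes given as rows of [x, y, width, height]."""
    b = np.asarray(bboxes, dtype=float).reshape(-1, 4)  # one row per detected target
    cx = b[:, 0] + b[:, 2] / 2.0
    cy = b[:, 1] + b[:, 3] / 2.0
    return np.column_stack((cx, cy))


boxes = [[10, 20, 40, 60], [0, 0, 2, 4]]
centroids_2d = bbox_centroids(boxes)
depths = np.array([285.0, 290.0])  # hypothetical Z values in mm from the stereo step
centroids_3d = np.column_stack((centroids_2d, depths))
print(centroids_3d)
```

The 2D centroids are in pixels while Z is in millimeters, so the pixel coordinates still need to be back-projected before the point is handed to the robot.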

B. GRASPING THE LOCALIZED TARGET
Before target detection and localization, the parameter options for YOLOv3 need to be clarified, since different parameters affect the results of deep learning. The initial learning rate, the mini-batch size, and the maximum number of epochs significantly affect detection accuracy and training time. For example, if the learning rate is too low, training takes a long time; if it is too high, training might reach a suboptimal result or diverge. The training options applied in this paper are as follows: the SGDM (Stochastic Gradient Descent with Momentum) optimizer as the solver for network training, an initial learning rate of 0.001, verbose set to true, a mini-batch size of 16, a maximum of 30 epochs, shuffle set to never, and a verbose frequency of 30. The general performance of a trained detector can be seen from the training loss over the required number of iterations. Figure 10 shows the results of YOLOv3 detector training with different optimizers: SGDM, ADAM (Adaptive Moment Estimation), and RMSProp (Root Mean Square Propagation). Compared with the other two, SGDM performed best, as displayed in Figure 10.a. At the 100th iteration, the training loss is close to zero, and it tends to stagnate after more than 200 iterations in Figures 10.a,c,e, while the training RMSE can be seen in Figures 10.b,d,f. The precision of this detector is crucial for verifying the overall system. Ideally, the precision is one at all recall levels. Figure 10 is only a sample showing the detector's performance for the attractor and ring classes. We therefore summarize the training RMSE of all classes for a more careful reading. Out of the ten classes, the ring detector performs better than the others, as shown in Figure 11.
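For readers unfamiliar with SGDM, the update rule behind the optimizer named above can be sketched in a few lines. The learning rate of 0.001 follows the stated training options; the momentum value of 0.9 and the quadratic toy loss are assumptions for illustration and are not taken from the paper.

```python
def sgdm_step(theta, grad, velocity, learning_rate=0.001, momentum=0.9):
    """One SGDM update: the velocity blends the previous step with the new gradient.

    momentum=0.9 is an assumed value; the text above does not state it.
    """
    velocity = momentum * velocity - learning_rate * grad
    return theta + velocity, velocity


# Toy problem: minimize f(theta) = theta**2, whose gradient is 2 * theta.
theta, velocity = 1.0, 0.0
for _ in range(2):
    theta, velocity = sgdm_step(theta, 2.0 * theta, velocity)
print(theta, velocity)
```

The momentum term lets consecutive gradients that point the same way accumulate, which is why SGDM can make steady progress even with a small learning rate such as 0.001.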
We also compare detectors. In general, building a detector is preceded by a labeling process, and, as shown in our previous work [24], [50], modifications to the labeling have a significant effect on detection speed. In this paper, the ten objects shown in Figure 12 are used as targets. In typical detector training, all ten targets are labeled together so that a single detector covers all ten targets. In the second method, which we use, each of the ten targets has its own detector, so the number of detectors increases. We call this method a parallel detector; it shortens the detection time by a factor of 1.34, and we apply it in action learning.
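The paper does not describe how the per-class detectors are scheduled, so the following is only one plausible reading of the parallel detector idea: every single-class detector is run on the same image and the results are merged. The detector stub `make_dummy_detector` and the function `detect_parallel` are hypothetical, and the class list is shortened to four of the object names mentioned in the text.

```python
from concurrent.futures import ThreadPoolExecutor


def make_dummy_detector(class_name):
    """Return a stand-in for a single-class detector (hypothetical stub)."""
    def detect(image):
        # A real detector would return the boxes found in `image`; this stub
        # returns one fixed box so that the example is runnable.
        return [{"label": class_name, "box": (0, 0, 10, 10), "score": 0.9}]
    return detect


def detect_parallel(detectors, image):
    """Run every single-class detector on the same image and merge results."""
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        per_class = pool.map(lambda det: det(image), detectors)
        return [d for dets in per_class for d in dets]


class_names = ["attractor", "ring", "bottle", "scissors"]
detectors = [make_dummy_detector(name) for name in class_names]
detections = detect_parallel(detectors, image=None)
print([d["label"] for d in detections])
```

Whether threads give a real speed-up depends on the inference library; the factor of 1.34 is the paper's own measurement and is not reproduced by this sketch.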

FIGURE 13. A sequence of detection using YOLOv3: a) the original input image, b) detection results using the parallel YOLOv3 detector, c) the result of b) with orientation added, d) centroid of the bounding box adjusted to the gripper range, e) the result of d) with the closest points for kNN = 5 (green circles), and f) the final detection result with kNN = 5 and the obstacles that may be removed (blue circles).
Since SGDM is the best optimizer in this paper, we use it. Once detection and localization have succeeded, we determine the grasping point, as shown in Figure 13, which presents the sequence of recognizing the target and detecting interfering objects. Figure 13.a is the input image and Figure 13.b is the recognition result from YOLOv3. The orientation of each object is determined using traditional image processing techniques (Figure 13.c), and the target is then marked with its centroid (yellow ×) and a grip point (black +), as shown in Figure 13.d. The green circle shows the gripper finger range; the five green circles in Figure 13.e correspond to the target and the minimum number of obstacles assumed on its four sides. The green circles on the ring and the bottle clearly overlap the target (attractor). In the last step, Figure 13.f, a grasping point (blue circle) is assigned to each potential obstacle, so that the bottle, which has the largest overlapping area, is shifted first. Table 4 compares the performance of the three YOLOv3 optimizers: SGDM (S), ADAM (A), and RMSProp (R).
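The rule that the obstacle with the largest overlapping area is shifted first can be sketched with the standard circle-circle intersection formula. This is an illustration under stated assumptions, not the authors' implementation: the function names, the common gripper-range radius of 40, and all coordinates are hypothetical.

```python
import math


def circle_overlap_area(c1, r1, c2, r2):
    """Area of the intersection of two circles with centers c1, c2."""
    d = math.dist(c1, c2)
    if d >= r1 + r2:
        return 0.0  # circles do not overlap
    if d <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2  # one circle lies inside the other
    a1 = r1 ** 2 * math.acos((d ** 2 + r1 ** 2 - r2 ** 2) / (2 * d * r1))
    a2 = r2 ** 2 * math.acos((d ** 2 + r2 ** 2 - r1 ** 2) / (2 * d * r2))
    tri = 0.5 * math.sqrt(
        (-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2)
    )
    return a1 + a2 - tri


def rank_obstacles(target_center, obstacle_centers, radius):
    """Return (name, overlap area) pairs, largest overlap first."""
    overlaps = [
        (name, circle_overlap_area(target_center, radius, center, radius))
        for name, center in obstacle_centers.items()
    ]
    overlaps = [item for item in overlaps if item[1] > 0.0]
    return sorted(overlaps, key=lambda item: item[1], reverse=True)


# Hypothetical image coordinates (pixels) for the target and its neighbors.
obstacles = {"ring": (50.0, 10.0), "bottle": (30.0, 0.0), "scissors": (200.0, 0.0)}
shift_order = rank_obstacles((0.0, 0.0), obstacles, radius=40.0)
print([name for name, _ in shift_order])
```

With these made-up positions the bottle overlaps the target most and would be shifted first, while the distant scissors are not treated as an obstacle at all.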

A. DETECTION METHOD EVALUATION
The following is a separate test of YOLOv3 performance on the ten targets. The test covers the confidence level, accuracy, precision, recall, performance, average precision, and required computation time [26]. A confidence threshold of 0.85 was used to compute the detection metrics. The results are shown in Table 4, from which it can be seen that the targets had a high rate of detection precision.
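As a reminder of how precision and recall follow from a confidence threshold, the sketch below filters detections at 0.85 and counts true and false positives. The detection list is invented for the example, and the matching of detections to ground truth (for instance by an overlap criterion) is assumed to have been done beforehand; neither reflects the data behind Table 4.

```python
def precision_recall(detections, num_ground_truth, conf_threshold=0.85):
    """Compute precision and recall from (confidence, is_correct) pairs."""
    kept = [d for d in detections if d[0] >= conf_threshold]
    true_pos = sum(1 for _, is_correct in kept if is_correct)
    false_pos = len(kept) - true_pos
    false_neg = num_ground_truth - true_pos
    precision = true_pos / (true_pos + false_pos) if kept else 0.0
    recall = true_pos / (true_pos + false_neg) if num_ground_truth else 0.0
    return precision, recall


# Hypothetical detections: (confidence, matches a ground-truth object?)
sample = [(0.97, True), (0.91, True), (0.88, False), (0.80, True), (0.60, False)]
precision, recall = precision_recall(sample, num_ground_truth=4)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Raising the threshold usually trades recall for precision: here the correct detection at confidence 0.80 is discarded, which lowers recall.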
The annotation process clearly simplifies the targets, whereas obstacles, such as objects that are piled up or sticky, are harder to define. Such obstacles do not disturb the detection network, but they make pick and place on that target challenging. This experiment is therefore necessary to measure the distance between the target and the obstacles using kNN. The measurement assigns five closest points: one for the target, closest to the centroid of the bounding box, and the remaining four for obstacles to the right, left, front, and rear.
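The five-closest-points assignment can be sketched as a plain nearest-neighbor query with k = 5 around the target's bounding-box centroid. The function name `k_nearest` and all coordinates are hypothetical, and the Euclidean distance in image coordinates is an assumption, since the text does not name the distance measure.

```python
import math


def k_nearest(query, points, k=5):
    """Return (index, distance) for the k points closest to `query`."""
    ranked = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return [(i, math.dist(query, points[i])) for i in ranked[:k]]


# Hypothetical centroids (pixels); index 0 is the target itself.
centroids = [
    (100.0, 100.0),  # target
    (140.0, 100.0),  # neighbor to the right
    (60.0, 105.0),   # neighbor to the left
    (100.0, 150.0),  # neighbor in front
    (95.0, 40.0),    # neighbor at the rear
    (300.0, 300.0),  # far object
    (400.0, 50.0),   # far object
]
neighbors = k_nearest(centroids[0], centroids, k=5)
target_index, candidate_obstacles = neighbors[0][0], neighbors[1:]
print(target_index, [i for i, _ in candidate_obstacles])
```

The first neighbor is the target, and the remaining four are the candidate obstacles whose distances can then be compared with the gripper range.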

B. EXPERIMENTS OF SELF-CORRECTION FOR GRASPING
Regarding the stereo-like camera coordinates, the 3D position of the object is obtained using the verification method discussed in Section IV. The green circle indicates the width of the gripper; if it overlaps with other circles, other objects are interfering, and the overlap must be cleared before the target can be grasped. No training is performed for eye-in-hand coordination, because in this section the system performs calculations based on the estimation results of the eye-in-hand camera. We have also included a video at https://youtu.be/DJZ8oLop5E8 to provide a comprehensive understanding. Figure 14 shows the sequence of targeting without action learning. In Figure 14.a, the gripper approaches the object's position according to the estimated orientation. The 6D (XYZABC) object pose estimation succeeded in estimating the position and orientation of the target relative to the camera coordinates on the end effector, as follows: -10.1 mm (x axis), -474.3 mm (y axis), 268.0 mm (z axis), 178.0° (a axis), 1.0° (b axis), -51.6° (c axis/orientation). The correction pen (square-shaped) is the target, and β was set at 48. However, in Figure 14.b, the position on the y axis is slightly imprecise, which is not good enough for the robot to grip the target. As a result (Figure 14.c), the robotic manipulator failed, and it did not retry because action learning was not used; β therefore remained at 48.
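For orientation, the depth component of such a 3D position can be related to disparity by the standard relation for two rectified, parallel views, Z = f · B / d. The sketch below uses the 100 mm baseline stated in the conclusion; the focal length of 800 pixels and the disparity of 160 pixels are hypothetical, and the verification method of Section IV may differ from this textbook relation.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_mm=100.0):
    """Depth in mm from disparity, for rectified parallel views."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_mm / disparity_px


# Hypothetical focal length and disparity, both in pixels.
depth_mm = depth_from_disparity(disparity_px=160.0, focal_length_px=800.0)
print(f"estimated depth: {depth_mm:.1f} mm")
```

Because depth is inversely proportional to disparity, a one-pixel disparity error matters more for distant objects than for near ones.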
The last two grasp attempts use action learning. The gripper is positioned parallel to the estimated target orientation, as shown in Figure 15.a. The 6D target pose, which includes the orientation of the target object relative to the camera coordinates on the end effector, was successfully estimated as follows: 50 mm (x axis), 450 mm (y axis), 500 mm (z axis), -178.0° (a axis), 1.0° (b axis), -88° (c axis/orientation), with β = 70 and the attractor (cylindrical shape) as the target. The target is successfully grasped (Figure 15.b) and placed into the black tray (Figure 15.c). Figure 16 also shows a targeting sequence using action learning, but in Figure 16.a the robot evaluates the attractor's situation and recognizes overlap interference from the scissors. The system decides to shift the overlapping scissors with a gripper finger, as shown in Figure 16.b; it then re-identifies the target, re-grasps it (Figure 16.c), and places it into the black tray (Figure 16.d). The 6D object pose estimates are as follows: 101.8 mm (x axis), -453.5 mm (y axis), 380.0 mm (z axis), -178.0° (a axis), 1.0° (b axis), -88.6° (c axis/orientation). Based on the experimental results, object detection based on YOLOv3, the stereo-like camera, kNN, DM, and orientation estimation succeeded in distinguishing objects from the background and from other interfering objects. However, the average success rate using action learning is still only around 0.857 with a maximum cycle limit of 3.
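The retry behavior seen in Figure 16 can be summarized as a loop over the plan, act, observe, and reflect steps with a cycle limit. The sketch below is a simplified reading of that cycle, not the authors' controller: the function names, the stopping rule (grasp succeeded and β at or above the pass grade), and the stub behavior are assumptions, while the values 48, 70, 60, and the limit of 3 are taken from the text.

```python
def run_action_learning(attempt_grasp, assess, pass_grade=60, max_cycles=3):
    """Repeat plan/act/observe/reflect until success or the cycle limit."""
    history = []
    for cycle in range(1, max_cycles + 1):
        plan = {"cycle": cycle, "previous": list(history)}  # plan
        grasped = attempt_grasp(plan)                       # act
        beta = assess(plan, grasped)                        # observe
        history.append({"cycle": cycle, "grasped": grasped, "beta": beta})  # reflect
        if grasped and beta >= pass_grade:
            return True, history
    return False, history


def stub_attempt(plan):
    """Hypothetical robot: the first try is blocked, the retry succeeds."""
    return plan["cycle"] >= 2


def stub_assess(plan, grasped):
    """Hypothetical instrument: returns the beta values quoted in the text."""
    return 70 if grasped else 48


succeeded, log = run_action_learning(stub_attempt, stub_assess)
print(succeeded, [(entry["cycle"], entry["beta"]) for entry in log])
```

Storing every cycle in `history`, whether it fails or succeeds, is what lets the next plan step take earlier errors into account.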

C. EVALUATION OF ROBOTIC GRASPING USING ACTION LEARNING
We now turn to Table 5 above, which covers the 14 experiments that use action learning. Evaluating action learning performance means recording the four processes simultaneously. The success of action learning is judged by the robot's success in carrying out the gripping task. It is difficult to determine the value of β the first time the system runs; for this reason, we create dummy data with an initial value of 70. Initially, action learning does not limit the number of cycles.
Nevertheless, in this experiment the number of cycles has to be limited, initially to only two. This limit is based on the safety of the robot, because the possibility of objects changing position is very high. In total, 21 grasping experiments were performed in this paper: 14 employ action learning, as mentioned above, and the remaining seven do not. The results are listed in Table 6. The success rates without and with action learning (cycle limit = 2) are 0.142 and 0.642, respectively. After the cycle limit is increased to 3, the results are 0.285 and 0.857, respectively. In action learning, the robot always stores the data of every grasping process, whether it fails or succeeds. Re-reading these data is an essential part of observing and reflecting in a single cycle. The pass grades are set to 60 and 70, respectively, while the instrument's assessment result is β. The system keeps repeating the cycle until the pass grade is reached. An experiment of 14 grip attempts with five failures required 19 cycles in total. Plotted as PDFs, the values of β and the two pass grades look like Figure 17. The value first set at 70 subsequently depends strongly on the value of β once the robot is running. The last ten data points are accumulated to calculate the PDF.
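The reported rates can be checked with simple arithmetic. The success counts below (1 and 2 of 7 without action learning, 9 and 12 of 14 with it) are not stated in the text; they are inferred here as the counts whose ratios, truncated to three decimals, give the reported values, and should be read as an assumption.

```python
import math


def truncated_rate(successes, attempts, digits=3):
    """Success rate truncated (not rounded) to the given number of decimals."""
    scale = 10 ** digits
    return math.floor(successes * scale / attempts) / scale


# ASSUMPTION: success counts inferred from the reported rates.
rates = {
    "without, limit 2": truncated_rate(1, 7),
    "with, limit 2": truncated_rate(9, 14),
    "without, limit 3": truncated_rate(2, 7),
    "with, limit 3": truncated_rate(12, 14),
}
print(rates)

# 14 attempts plus one extra cycle for each of the five failures.
total_cycles = 14 + 5
print(total_cycles)
```

The count of 19 cycles is consistent with one additional cycle for each of the five failed attempts, which matches the cycle limit of two.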

VI. CONCLUSION
This study has successfully developed action learning for grasping objects with deep learning, using a standard robotic manipulator with a stereo-like camera in an eye-in-hand configuration. A robotic manipulator equipped with a gripper can pick up and place targeted objects at cluttered positions in the workspace. The stereo-like camera is created by shifting the camera from its initial position to a second position along the x axis by a baseline of 100 mm. The process of grasping targets in action learning consists of four steps (planning, acting, observing, and reflecting) and several prerequisites: DM, kNN, YOLOv3, and orientation estimation. The results show a success rate of around 0.857 for the grasping task with self-correction using action learning, while the separately tested components give an accuracy of 0.923 for YOLOv3 and a depth estimation error of around 0.341 mm. This evaluation was carried out with action learning limited to three cycles and with environmental pass grades of 60 and 70.