Robot Target Location Based on the Difference in Monocular Vision Projection

Visual guidance is widely used in industrial robots. Under monocular vision guidance, industrial robots usually rely on traditional template matching for positioning. However, for complex workpieces with height differences, if the angle and position of the workpiece are not consistent with the template, the projection differs and the positioning accuracy drops noticeably. To solve this problem, we propose a method that divides the whole contour of the workpiece into regions and uses weighted contour modules to correct and match the workpiece image. First, the contour of the template image is divided into regions according to the position of the grasping points. Then, a weight is assigned to each contour region according to its distance from the grasping points. Next, a fast feature point matching method is used to obtain an initial match for the image of the workpiece to be measured. Finally, accurate contour matching is carried out for each weighted contour region so that the robot can grasp the object accurately. Extensive experiments show that the method meets the design requirements of high stability and high accuracy.


I. INTRODUCTION
With the rapid development of robot technology, robots have been widely used in industrial production. Robots are now found in the automobile, electronics, metal and machinery industries; they will eventually replace manual production, which is an important development trend in the manufacturing industry and a foundation of industrial intelligence. The traditional industrial robot repeats taught actions, which solves the efficiency problem in mass production, but emerging manufacturing industries are moving toward more varieties and smaller batches. Therefore, traditional manually taught, fixed-action robots cannot meet the needs of the emerging manufacturing industry [1], [2]. Machine vision can not only replace human vision, but also analyze and interpret a scene through visual processing [3], [11]. Against this background, the cooperation between machine vision and robots has developed in applications such as the automatic grasping of mechanical parts and the automatic sorting of logistics.
The associate editor coordinating the review of this manuscript and approving it for publication was Shuai Liu.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Bin-picking systems generally use 2D or 3D sensors, each of which has advantages and disadvantages: 1) 2D sensors are much cheaper than 3D sensors. 2) 2D sensors are more robust than 3D sensors across workpiece materials. 3) 3D sensors can obtain the target height, but 2D sensors cannot. Because 2D sensors cannot obtain the target height, a projection difference occurs when they image complex three-dimensional industrial workpieces; this difference reduces the grasping precision of the robot, as shown in Fig. 1. The effect is especially pronounced when the grasping point lies on an inclined plane of the workpiece.
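The size of this projection difference can be estimated with a simple pinhole model. The following sketch is an illustration only: it assumes an ideal camera looking straight down at the worktable, and the function names and all numbers (camera height, lateral offset, workpiece height) are hypothetical rather than taken from the experiments.

```python
def projected_offset(x_mm, height_mm, camera_height_mm):
    """Apparent position, on the table plane, of a point raised above the table.

    Assumes an ideal pinhole camera looking straight down from
    camera_height_mm, with x_mm measured from the optical axis.
    """
    if not 0 <= height_mm < camera_height_mm:
        raise ValueError("point must lie between the table and the camera")
    # The ray from the camera through the raised point meets the table here.
    return x_mm * camera_height_mm / (camera_height_mm - height_mm)


def projection_difference(x_mm, height_mm, camera_height_mm):
    """Shift between where the raised point appears and where it really is."""
    return projected_offset(x_mm, height_mm, camera_height_mm) - x_mm


# Hypothetical setup: camera 1 m above the table, point 100 mm off-axis, 50 mm high.
shift = projection_difference(x_mm=100.0, height_mm=50.0, camera_height_mm=1000.0)
print(f"apparent shift: {shift:.2f} mm")
```

Under these assumptions the shift grows with both the height of the point and its distance from the optical axis, which is why a template taken at one position no longer fits the same workpiece at another.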
In earlier work, we proposed an ADK algorithm based on feature matching for robot grasping [5], which clusters the distance and angle differences of feature points to obtain accurate registration, together with a non-linear multi-parallel surface calibration method based on an adaptive k-segment master curve algorithm. However, that work did not consider the difference in monocular vision projection. To solve this problem, we propose a new method that builds on [5], combines feature matching with contour matching, and uses weighted contour modules to correct the image matching. Our contributions in this paper are as follows:
(i) The combination of feature point matching and contour matching is used for fast location.
(ii) The weighted decision-making module is constructed to correct the matching results of the grab point region.

II. RELATED WORK
Machine vision is widely used in various fields [6], [7], [8], [9]. Vision recognition and positioning technology is a typical application of machine vision in industrial automation, which is widely used in automatic assembly, moving objects on production lines and defect detection in products [10], [11], [12]. The use of visual sensors for object detection and attitude estimation is still an active research area, especially in the application of industrial robots.
Dekel et al. [13] proposed a template matching algorithm for the best similarity points, which can effectively solve the problem of image matching and positioning in the case of similar area interference and partial occlusion. However, the matching algorithm is complex and slow, and it performs poorly in real-time applications. Xu et al. [14] proposed an intelligent watermelon recognition and positioning method for natural scenes and designed a squint binocular positioning algorithm that takes the left camera as the coordinate origin to obtain the actual three-dimensional spatial coordinates of the watermelon. However, the recognition error was approximately 15%. In [15], Yang et al. proposed a new learning framework that automatically segments the learned skills into a sequence of subskills; each individual subskill is then encoded and regulated accordingly. However, skill representation, trajectory alignment, and skill segmentation could not be addressed satisfactorily using such skill transfer modeling. Gu et al. [16] developed a portable assembly demonstration (PAD) system using an RGB-D camera, which automatically identifies the involved objects (parts/tools), the actions performed and the assembly state representing the spatial relationship between parts. Fazeli et al. [17] proposed a method of simulating hierarchical reasoning and multisensory fusion in robots. This method used a camera and wrist strap to receive visual and tactile feedback and compared these measured values with the previous actions of the robot. Meanwhile, it learned in real time whether to keep pushing the wood block or to steer it to a new target to keep the tower from collapsing. Kalashnikov et al. [18] addressed the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. This study introduced QT-Opt and realized closed-loop vision-based control, whereby the robot continuously updates its grasp strategy based on the most recent observations to optimize long-horizon grasp success. Chen et al. [19] proposed a simple model-free framework that can learn to reorient objects with the robot hand facing either upward or downward and that shows strong zero-shot transfer performance. Pertsch et al. [20] proposed a deep latent variable model that jointly learns an embedding space of skills and a skill prior from offline agent experience. This method can efficiently transfer skills from rich datasets. Zeng et al. [21] proposed the Transporter Network, which can represent complex multimodal policy distributions and extends to multistep sequential tasks and 6DOF pick-and-place. Li et al. [22] proposed a skill learning method based on multimodal information description. Using these multimodal information parameters, the robot can acquire the skills of searching, position determination and attitude adjustment. Furthermore, the robot can achieve the assembly goal by analyzing the two-dimensional image without position constraints.
Although current research uses deep learning to enable the robot to grasp the target object, most algorithms have difficulty achieving accurate recognition and positioning across varied conditions at the precision required for robot grasping, especially in grasping and welding applications. Therefore, to address the difference in projection produced by monocular vision, we focus on the recognition and localization of multitype, multipose workpieces by a monocular vision system and an industrial robot, and on guiding the robot to complete the grasping of the workpiece. The rest of the paper is structured as follows. Section III elaborates on our proposed method. Section IV presents experimental results and a comparative evaluation to verify the proposed method. Finally, Section V summarizes the results of the current work and proposes future research directions.

III. PROPOSED METHOD
In this section, we describe the proposed target location algorithm in detail, as shown in Fig. 2. First, we segment the image of the workpiece to be measured by using binary processing. Then, we adopt the feature matching algorithm to quickly obtain the initial rotation angle and displacement. Finally, we use weighted modular contour matching for accurate positioning by using the initial rotation angle and displacement to avoid the influence of the projection error of the workpiece.

A. INITIAL TARGET MATCHING
Since multiple workpieces may appear in the image to be tested, binary preprocessing is used to segment the workpieces, as shown in Fig. 3. A feature point matching algorithm is then used to match each region. First, the ORB feature points [23] of the sample image and the image to be registered are extracted; then, the feature point matching algorithm [24], [25] is used to quickly match and screen the area to be measured. Finally, the rotation angle and displacement are calculated.
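The binary segmentation step can be illustrated with a minimal sketch. It assumes that workpieces are brighter than the background, uses a fixed threshold of 128 and 4-connectivity, and works on a tiny hand-made image; the function names and values are hypothetical, and the experiments themselves rely on OpenCV rather than on this code.

```python
from collections import deque


def threshold(gray, level):
    """Binarise a grayscale image given as a list of rows."""
    return [[1 if value >= level else 0 for value in row] for row in gray]


def label_regions(binary):
    """Return one bounding box (row_min, col_min, row_max, col_max) per
    4-connected foreground region, in scan order."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from the first unseen foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


# Hypothetical 4 x 6 image containing two bright workpieces.
gray = [
    [10, 10, 10, 10, 10, 10],
    [10, 200, 210, 10, 10, 10],
    [10, 205, 220, 10, 180, 10],
    [10, 10, 10, 10, 190, 10],
]
boxes = label_regions(threshold(gray, 128))
print(boxes)
```

Each bounding box can then be cropped and passed to the feature matching stage independently.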
ORB features combine FAST feature point detection with BRIEF feature descriptors in an improved and optimized manner. ORB uses FAST to detect candidate feature points and then selects the N points with the largest Harris corner response. The Harris corner response function is defined as

$$R = \det(M) - a \, (\operatorname{trace}(M))^2,$$

where $M$ is the covariance matrix of the gradient, $\det(M) = \lambda_1 \lambda_2$ is the determinant of the matrix, $\operatorname{trace}(M) = \lambda_1 + \lambda_2$ is its trace, $\lambda_1$ and $\lambda_2$ are the eigenvalues of $M$, and $a$ is an empirical constant in the range (0.04, 0.06). In addition, ORB takes the direction from the feature point to the intensity centroid of its neighborhood as the feature point direction, and improves the BRIEF descriptor by rotating the image patch according to this direction before computing the descriptor.
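As a small numerical illustration of the Harris ranking step, the sketch below evaluates the standard response det(M) − a·trace(M)² for a few hand-made 2 × 2 gradient matrices and keeps the strongest candidates. The function names, the candidate tuples and the choice a = 0.05 are assumptions made for this example.

```python
def harris_response(ixx, ixy, iyy, a=0.05):
    """Harris corner response for the symmetric matrix M = [[ixx, ixy], [ixy, iyy]]."""
    det = ixx * iyy - ixy * ixy
    trace = ixx + iyy
    return det - a * trace * trace


def top_n_by_response(candidates, n, a=0.05):
    """Keep the n candidates (x, y, ixx, ixy, iyy) with the largest response."""
    ranked = sorted(
        candidates,
        key=lambda c: harris_response(c[2], c[3], c[4], a),
        reverse=True,
    )
    return [(x, y) for x, y, *_ in ranked[:n]]


# Hypothetical candidates: a strong corner, an edge, and a weak corner.
candidates = [
    (0, 0, 10.0, 0.0, 10.0),  # both eigenvalues large: corner
    (5, 5, 10.0, 0.0, 0.0),   # one eigenvalue large: edge, negative response
    (9, 9, 1.0, 0.0, 1.0),    # both eigenvalues small: weak corner
]
best_two = top_n_by_response(candidates, n=2)
print(best_two)
```

The edge candidate receives a negative response and is dropped, which is the reason for ranking FAST detections by the Harris measure.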
After extracting the ORB feature points, we use the FUGC algorithm [25] for feature point matching. First, we use brute-force matching to construct a set of hypothetical matches for the ORB feature points. Considering local feature consistency, we segment the image features with a unilateral grid, introduce local clustering constraints to remove the mismatches contained in the set, and use a local linear transformation to rescreen the feature point pairs. The screening condition is the matching distance error, defined as

$$\lVert q_i - H p_i \rVert \le \tau,$$

where $q_i$ and $p_i$ are the coordinates of the matching feature points in the two images, $H p_i$ is the coordinate of $p_i$ after the transformation, and $\tau$ is the distance error threshold. This strategy can extract high-accuracy matches from coarse, low-accuracy baseline matches, screen out outliers, and quickly obtain accurate matching point pairs, as shown in Fig. 4. Finally, the rotation angle $\theta_{init}$ and displacement $d_{init}$ between the sample image and the image to be measured are calculated from these high-accuracy matching points.
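The distance-error screening and the final pose estimate can be sketched as follows. This is not the FUGC algorithm: as a simplifying assumption, the transformation is a single global rotation plus translation supplied by a coarse earlier step, and the pose is re-estimated from the surviving pairs with a closed-form least-squares rigid fit. All function names and the synthetic point pairs are hypothetical.

```python
import math


def apply_rigid(theta, tx, ty, p):
    """Rotate point p by theta (radians) and translate it by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)


def screen_matches(pairs, theta, tx, ty, tau):
    """Keep pairs (p, q) whose distance error |q - H p| is at most tau."""
    kept = []
    for p, q in pairs:
        hx, hy = apply_rigid(theta, tx, ty, p)
        if math.hypot(q[0] - hx, q[1] - hy) <= tau:
            kept.append((p, q))
    return kept


def estimate_rigid(pairs):
    """Least-squares rotation angle and translation mapping each p onto q."""
    n = len(pairs)
    pcx = sum(p[0] for p, _ in pairs) / n
    pcy = sum(p[1] for p, _ in pairs) / n
    qcx = sum(q[0] for _, q in pairs) / n
    qcy = sum(q[1] for _, q in pairs) / n
    cos_sum = sin_sum = 0.0
    for p, q in pairs:
        px, py = p[0] - pcx, p[1] - pcy
        qx, qy = q[0] - qcx, q[1] - qcy
        cos_sum += px * qx + py * qy
        sin_sum += px * qy - py * qx
    theta = math.atan2(sin_sum, cos_sum)
    c, s = math.cos(theta), math.sin(theta)
    return theta, qcx - (c * pcx - s * pcy), qcy - (s * pcx + c * pcy)


# Synthetic matches: true pose is 30 degrees and (5, -2), plus one mismatch.
true_theta = math.radians(30.0)
points = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
pairs = [(p, apply_rigid(true_theta, 5.0, -2.0, p)) for p in points]
pairs.append(((5.0, 5.0), (40.0, 40.0)))

# Screen with a coarse pose (28 degrees), then refine on the inliers.
inliers = screen_matches(pairs, math.radians(28.0), 5.0, -2.0, tau=3.0)
theta_init, dx_init, dy_init = estimate_rigid(inliers)
print(len(inliers), round(math.degrees(theta_init), 3), round(dx_init, 3), round(dy_init, 3))
```

Refitting on the screened pairs removes the bias that the mismatched pair would otherwise introduce into the initial rotation angle and displacement.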

B. ACCURATE MATCHING OF CONTOUR MODULAR WEIGHTED DECISIONS
Due to the projection difference of the monocular camera, the rotation angle and displacement of the grasping points may not be equal to the global rotation angle and displacement obtained from the feature points above. Put simply, the projection difference causes local deformation of the image, which degrades the accuracy of the overall matching. Therefore, the influence of the projection difference can be reduced by matching the local contours around the grasping points.
First, the gradient value and gradient direction of the sample workpiece image are calculated, and the regions with a significant gradient value are extracted as the target contour. Projection differences arise easily when the grasping points are not at the same height. As shown in Fig. 1, the contours of different heights need to be segmented according to the changes in the contour and the grasping height of the sample image. This division may require manual participation. Additionally, contours of the same height can be divided according to different shapes or positions, as shown in Fig. 5.
After the workpiece contour has been divided into regions, the matching weight of each region must be determined. The closer a region is to the grasping point, the smaller the influence of projection error on its matching angle and displacement, so regions nearer the grasping point receive larger weights. The weight of the $j$-th region contour is calculated from the average distance between each region and the grasping point, the number $m$ of divided regions and a height parameter. The average distance from the grasping point to the contour of the $i$-th region is

$$d_i = \frac{1}{n_i} \sum_{k=1}^{n_i} \sqrt{(x_k - x_g)^2 + (y_k - y_g)^2},$$

where $(x_g, y_g)$ is the grasping point coordinate, $(x_k, y_k)$ is the coordinate of the $k$-th contour point, and $n_i$ is the number of contour points in the $i$-th region. When the contour and the grasping point are at the same height, the height parameter equals 1; otherwise, its value is far less than 1.
Before contour matching between the workpiece sample image and the image to be tested, the previously calculated rotation angle and displacement are used to reduce the search range of the sample image, which improves the contour matching speed and reduces mismatching. Then, the similarity measurement method of the contour gradient is used to calculate the matching similarity $S$ between the sample contour and the contour to be measured after translation and rotation. In this measure, $T_{ix}$ and $T_{iy}$ are the X- and Y-direction gradients, respectively, of the $i$-th contour of the sample, and $B_{ix}$ and $B_{iy}$ are the X- and Y-direction gradients, respectively, of the $i$-th contour of the image to be tested. Starting from the rotation angle $\theta_{init}$ and displacement $d_{init}$, the sample contour is moved within the rotation angle range $(\theta_{init} - r, \theta_{init} + r)$ and the translation range $(d_{init} - l, d_{init} + l)$, where $r$ and $l$ are the radii of the rotation angle range and the translation range, respectively. The similarity $S_{\theta,d}$ with the contour to be measured is calculated at each pose, and the maximum similarity is obtained by comparing all $S_{\theta,d}$:

$$S_{max} = \max_{\theta \in (\theta_{init} - r,\, \theta_{init} + r),\; d \in (d_{init} - l,\, d_{init} + l)} S_{\theta,d}.$$

$S_{max}$ must be larger than the given threshold $\sigma$, which is set according to the number and weight of the contour points. Only when the similarity meets this condition is the match accepted as the final result.
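The region weighting and the gradient similarity can be illustrated with a short sketch. The average distance follows the definition in the text, but the weighting rule (height parameter divided by average distance, normalised to sum to 1) and the similarity measure (mean normalised dot product of paired gradient vectors) are assumptions chosen for illustration and are not claimed to be the exact formulas of the method. All names and data are hypothetical.

```python
import math


def average_distance(contour, grasp):
    """Mean Euclidean distance from the grasping point to the contour points."""
    gx, gy = grasp
    return sum(math.hypot(x - gx, y - gy) for x, y in contour) / len(contour)


def region_weights(regions, grasp, height_params):
    """Assumed rule: weight ~ height_param / average distance, normalised to 1."""
    raw = [
        h / max(average_distance(contour, grasp), 1e-9)
        for contour, h in zip(regions, height_params)
    ]
    total = sum(raw)
    return [value / total for value in raw]


def gradient_similarity(sample_grads, target_grads):
    """Assumed measure: mean normalised dot product of paired gradients."""
    total = 0.0
    for (tx, ty), (bx, by) in zip(sample_grads, target_grads):
        norm = math.hypot(tx, ty) * math.hypot(bx, by)
        if norm > 0.0:
            total += (tx * bx + ty * by) / norm
    return total / len(sample_grads)


def weighted_similarity(region_pairs, weights):
    """Combine the per-region similarities using the region weights."""
    return sum(w * gradient_similarity(s, t) for w, (s, t) in zip(weights, region_pairs))


# Hypothetical regions around a grasping point at the origin.
grasp = (0.0, 0.0)
regions = [[(3.0, 4.0), (-3.0, -4.0)], [(6.0, 8.0), (-6.0, -8.0)], [(3.0, 4.0)]]
weights = region_weights(regions, grasp, height_params=[1.0, 1.0, 0.1])

# Region 1 matches perfectly, region 2 is orthogonal, region 3 is reversed.
region_pairs = [
    ([(1.0, 0.0)], [(2.0, 0.0)]),
    ([(1.0, 0.0)], [(0.0, 3.0)]),
    ([(0.0, 1.0)], [(0.0, -1.0)]),
]
score = weighted_similarity(region_pairs, weights)
print([round(w, 4) for w in weights], round(score, 4))
```

In this toy case the near, same-height region dominates the score, while the region at a different height contributes very little even though its match is poor.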
When $S_{max}$ meets the threshold, we take the corresponding rotation angle $\theta_s$ and displacement $d_s$ as the rotation angle and displacement of the final grasping point. The result is shown in Fig. 6.
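The local search around the initial pose can be sketched as a small grid search. The step sizes, the treatment of the translation range as a square of half-width l around the initial displacement, the inclusion of the range endpoints, and the toy score function are all assumptions made for this example; in practice the score would be the contour similarity evaluated at each candidate pose.

```python
def search_pose(score, theta_init, d_init, r, l, sigma, theta_step=0.5, d_step=1.0):
    """Grid search around the initial pose.

    Returns (S_max, theta, (dx, dy)) if S_max exceeds sigma, otherwise None.
    """
    n_theta = int(round(2 * r / theta_step))
    n_d = int(round(2 * l / d_step))
    best = None
    for i in range(n_theta + 1):
        theta = theta_init - r + i * theta_step
        for j in range(n_d + 1):
            dx = d_init[0] - l + j * d_step
            for k in range(n_d + 1):
                dy = d_init[1] - l + k * d_step
                s = score(theta, (dx, dy))
                if best is None or s > best[0]:
                    best = (s, theta, (dx, dy))
    # Accept the best pose only if its similarity exceeds the threshold sigma.
    return best if best is not None and best[0] > sigma else None


def toy_score(theta, d):
    """Hypothetical similarity that peaks at 31.5 degrees and (12, -3)."""
    return 1.0 / (1.0 + (theta - 31.5) ** 2 + (d[0] - 12.0) ** 2 + (d[1] + 3.0) ** 2)


# The peak lies inside the wide search window, so the pose is accepted.
result = search_pose(toy_score, 30.0, (10.0, -2.0), r=5.0, l=3.0, sigma=0.8)
# With a narrow window the peak is out of reach and the match is rejected.
rejected = search_pose(toy_score, 30.0, (10.0, -2.0), r=1.0, l=1.0, sigma=0.8)
print(result, rejected)
```

Restricting the search to a window around the initial pose keeps the number of similarity evaluations small, and the threshold prevents a poor best candidate from being reported as a match.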

IV. EXPERIMENTAL RESULTS AND ANALYSIS
To verify the proposed robot target positioning method, we evaluate the algorithm in terms of accuracy and time consumption. We compare our algorithm with contour matching based on the Hu moment and with RANSAC feature point matching. The algorithm parameters are kept the same throughout the experiments. The experiments were carried out on a notebook with a 3.5-GHz Intel i7 CPU and 16 GB of memory, using C++ and the open-source toolkit OpenCV 4.
Fig. 7. Grasping error sequence of four kinds of workpieces. The first column is the workpiece type, the second column is the grasping distance error, and the third column is the grasping angle error.

A. DATASETS AND SETTINGS
To comprehensively evaluate our method, we selected four different workpieces with different heights, each with a total of 1000 pieces, as a continuous test dataset. The image size is 2448 × 2050. We use the positioning accuracy as the test index: the average offset error, in distance and in angle, of the target grasping point.

B. EXPERIMENTAL RESULTS
To compare the performance of each algorithm, we specify that the placement angle of the workpiece is in the range −60° to 60° and that the grasping point is set on the plane. Some parameters in the paper, such as $\tau$, the height parameter, $r$, $l$ and $\sigma$, need to be set according to the image size. Based on experimental tests on the data in this paper, we set $\tau = 3$, the height parameter to 1, $r = 5$, and $l$ to 0.2 times the radius of the maximum circumscribed circle of the outer contour.
We quantitatively compare our algorithm with a contour matching algorithm based on the Hu moment [26] and the feature matching algorithm [25] on the dataset. The grasping error of each algorithm is obtained by calculating the displacement distance and rotation angle of the grasping point of the manipulator. For each algorithm, the grasping errors over the dataset are sorted in ascending order and sampled at intervals. The results are shown in Fig. 7. The errors of the three algorithms in the first row of Fig. 7 are small because the workpiece is planar and there is no height projection error. Even so, because our algorithm matches in two stages, its error is smaller, and the large errors at the tail of the sorted sequence are essentially absent. The workpieces in the second, third and fourth rows of Fig. 7, however, have planes at different heights, which produces a height projection error. As a result, the contour matching algorithm based on the Hu moment and the matching algorithm based on feature points both show many large errors. The matching performance of our algorithm is better and more stable than that of these two algorithms.
We set the threshold of the distance error to 4 pixels and the threshold of the angle error to 0.5 degrees: a grasp is considered successful when the distance error is at most 4 pixels and the angle error is at most 0.5 degrees, and failed otherwise. We list the success rate of each algorithm in Fig. 8. Our algorithm achieves a higher success rate than the other two algorithms and is therefore better suited to grasp planning. For a monocular camera, the accuracy of both the offset distance and the angle is relatively high, which effectively supports automatic grasping by manipulators. Furthermore, the stability and positioning results of the system can meet the needs of actual production.
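The success criterion can be written down directly. In the sketch below the thresholds are those stated above, while the function names and the list of error values are hypothetical and are not experimental data.

```python
def grasp_succeeds(distance_error_px, angle_error_deg,
                   max_distance_px=4.0, max_angle_deg=0.5):
    """A grasp succeeds only if both error magnitudes are within their thresholds."""
    return distance_error_px <= max_distance_px and angle_error_deg <= max_angle_deg


def success_rate(errors):
    """Fraction of (distance error, angle error) pairs that count as successful."""
    return sum(grasp_succeeds(d, a) for d, a in errors) / len(errors)


# Hypothetical error magnitudes: (distance in pixels, angle in degrees).
errors = [(1.2, 0.1), (4.0, 0.5), (4.1, 0.2), (2.0, 0.7)]
rate = success_rate(errors)
print(f"success rate: {rate:.0%}")
```

Because both conditions must hold, a grasp with a small distance error but a large angle error still counts as a failure.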

V. CONCLUSION
In this paper, we presented a robust, flexible, low-cost robot grasping system that allows the detection and precise location of workpiece targets using only a single camera. To solve the positioning error caused by the projection difference of a monocular camera, we developed a method for workpiece image correction and matching that divides the full contour of the workpiece and applies contour module weighting, which improves the robustness, stability and accuracy of robot grasping. The experimental results show that the success rate of the proposed scheme is 92.1%, which is 10.9% higher than the method based on Hu moment matching and 12.6% higher than the RANSAC feature matching algorithm. The method proposed in this study can be extended to other fields, such as defect detection and visual navigation. In future research, we will study how to use the deformation feature for correction when the grasping point is on an inclined plane, and how to apply it to 3D image matching.
ZHAOHUI ZHENG received the Ph.D. degree from Wuhan University. He is currently a Lecturer with the School of Mathematics and Physics, Wuhan Institute of Technology. His research interests include machine vision, image processing, biometrics, intelligent systems, and robotic dynamics. He is currently an Associate Professor with the School of Artificial Intelligence, Hubei Business College.
QIANG ZHANG received the master's degree in signal and information processing from Nanchang Hangkong University. She is currently a Lecturer with the Department of Public Courses, Wuhan Railway Vocational College of Technology. Her research interests include image processing and pattern recognition.