SCARA Self Posture Recognition Using a Monocular Camera

Robotic manipulators rely on feedback obtained from rotary encoders for control purposes. This article introduces a vision-based feedback system that can be used in an agricultural context, where the shapes and sizes of fruits are uncertain. We aim to mimic a human, using vision and touch as manipulator control feedback. This work explores the use of a fish-eye lens camera to track a SCARA manipulator with coloured markers on its joints for position estimation, with the goal of reducing costs and increasing reliability. The Kalman Filter and the Particle Filter are compared and evaluated in terms of accuracy and their ability to track the markers' positions. The estimated image coordinates of the markers are converted to world coordinates using planar homography, as the SCARA manipulator has co-planar joints and the coloured markers share the same plane. Three laboratory experiments were conducted to evaluate the system's performance in estimating the joint angles of a manipulator. The obtained results are promising for the future development of cost-effective agricultural robotic arms. In addition, this work presents solutions and future directions to increase the joint position estimation accuracy.


I. INTRODUCTION
ROBOTIC manipulators usually rely on data obtained from rotary encoders to determine their joint angles and relay them to a closed-loop control system. However, rotary encoders have flaws [1], [2] due to the effects of elastic joints, joint friction, flexible links, gearbox backlash, etc. [1]. Furthermore, using an incremental rotary encoder requires an absolute position sensor, such as a Hall sensor, to obtain a reference point, and requires the current position to be saved in memory before the system shuts down. Therefore, on startup, the system will assume the manipulator's joints are in the same angular position as they were on shutdown, which is not necessarily true.
Several applications of vision systems on robotic manipulators, namely for joint pose estimation and calibration, have been proposed in the literature. Balanji et al. [3] proposed a vision-based calibration framework for industrial manipulators using ArUco markers. The authors concluded that their system is reliable in real-world applications, with millimetric and near-0° errors. Kuo et al. [1] proposed a single-camera vision-based system for estimating a manipulator's joint angles. The authors used visual markers and pointed a fixed camera at the manipulator. They concluded that the vision system could not compete with the joint encoders; however, it works perfectly as a backup if the encoders fail. Li et al. [4] proposed a vision-based system using a monocular camera to estimate planar joint angles. The camera was positioned above the manipulator, pointing down, and it tracked visual markers to determine the joint positions. The authors concluded that this system offered high precision and more usability than other methods, such as laser tracking. Hajiloo et al. [5] developed an image-based visual servoing control system for a six degree-of-freedom manipulator with an eye-in-hand camera. The authors successfully attenuated the displacements between the initial and desired positions, which are often large in conventional visual servoing controllers. They concluded that their controller increases success by keeping the system within the desired limits. Zhang et al. [6] proposed an inversion-free image-based visual servoing system for manipulators with an eye-in-hand camera configuration using neural networks. Their system was theoretically effective at converging feature errors to near-zero values while remaining within the manipulator's velocity and position limits; moreover, the authors propose implementing their system on a physical manipulator. Wang et al. [7] developed an adaptive visual servoing system for soft manipulators. Their control system is based on piecewise-constant curvature kinematics and does not require the true values of the manipulator link lengths and the target positions. The authors verified the adaptability of the soft manipulator to the environment in free space, constrained environments, and environments with the influence of gravity. They concluded that the manipulator could be applied to such environments with the developed controller. Xu et al. [8] proposed a vision-based method for cable-driven robots to simultaneously measure the manipulator configuration and the target position. The authors used "global cameras" pointing at the manipulator and at the scene. The vision-based system was able to complete a docking maneuver with a 98 % success rate and position errors below 2 mm. Xu et al. [9] developed a prototype vision-based control system for an excavator manipulator using a monocular camera and visual markers, with the purpose of increasing safety and production. The experimental results showed position errors of 22 mm and orientation errors of 8.5°. The authors concluded that their experimental results proved the effectiveness of the approach for practical applications.
For agricultural purposes, such as harvesting and pruning, a high-precision joint angle estimator is not as necessary as it would be for industrial purposes, where the shapes and sizes of objects are known and constant. In contrast, an agricultural manipulator picks fruits or other products with unknown properties (in some cases in bulk, such as grapes) that are not positioned in a specific place and do not have their properties stored in a database. Given this, and to attenuate the previously mentioned problems, a vision-based joint angle estimation method is proposed using a monocular camera on the manipulator base. This method, known as visual servoing [10], [11], offers a backup in case the encoders fail [1], providing absolute angular positioning. The sensor (camera) can be used by other systems simultaneously. The vision-based system can be considered a reference for incremental encoders to reduce the costs of buying an absolute position encoder, assuming the camera will also be used for other purposes in the system. Furthermore, with visual servoing, an agricultural manipulator can mimic a human being, as humans only use vision and touch to grab objects. The proposed system differs from the ones presented in the literature in that the camera is placed behind the manipulator with a fish-eye lens to obtain a greater field of view. This offers an advantage in an agricultural scenario, where the manipulator needs to be moved to different locations and an onboard camera with a full view of the manipulator at all times is crucial. Using a monocular camera will reduce the processing cost, as fewer frames will be processed, leaving the rest of the processing power for other applications. Moreover, the camera can detect fruits, like tomatoes, so the manipulator knows their position and, through inverse kinematics, knows which joint angles it should have to reach the fruit.
This document is organized in the following way: in Section II the experimental setup for this work is presented; in Section III the methodology for the angle estimation is shown; in Section IV the used marker tracking algorithms are presented; in Section V the experimental results and their analysis are shown; and, finally, in Section VI the conclusions to this work are drawn.

II. EXPERIMENTAL SETUP
The manipulator used in this work is a Selective Compliance Articulated Robot Arm (SCARA), presented in Figure 1 along with the camera ($R_C$) and world ($R_W$) coordinate references. It was modified from an existing igus manipulator and has an incremental encoder (with 500 steps) and a Hall sensor on each motor. It has co-planar joints, meaning a homography matrix can define the entire manipulator space. Furthermore, it has three links and three coloured co-planar spherical markers: two green markers on the first and second joints and one red marker on the end of the third link. The markers have a diameter of 25 mm, and the first and second links have lengths of 0.35 m and 0.26 m, respectively. Finally, a Raspberry Pi camera is positioned 26 cm behind the first marker and 19 cm above it, using a fish-eye lens to increase the camera field of view on the manipulator. The camera captures images with a resolution of 1920x1440 pixels. This setup will be used to validate whether a feedback system based solely on a monocular camera and markers is reliable for pruning and harvesting operations.

A. PLANAR HOMOGRAPHY
The homography matrix describes the relationship between the image plane and, in this case, the visual marker plane. This relation can be defined by Equation (1), where x is the point in the image plane, P is the projection matrix, and X is the point in the world frame.
Expanding the previous equation, Equation (2) is obtained.
The Z coordinate in the world frame is constant, as the marker plane is perpendicular to the Z axis. As such, the coordinate is assumed to be Z = 0, and thus the third column of the matrix P is multiplied by 0 and can be removed. Given this, Equation (1) can be reduced to Equation (3), where H is the homography matrix.
The image coordinates presented previously can be further converted into pixel coordinates using the camera matrix, shown in Equation (4), where u and v are the pixel coordinates along the x and y axes, respectively, and f and c are the camera intrinsic parameters.
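The equations referenced in this subsection are not reproduced in this text. A plausible reconstruction of Equations (1) to (4), written only from the surrounding definitions and the standard pinhole camera model, is sketched below. The scale factor $s$, the element names $p_{ij}$, and the split of $f$ and $c$ into $f_x$, $f_y$, $c_x$, $c_y$ are assumptions; the numbering $h_1, \dots, h_9$ follows the later use of $h_9$. With $Z = 0$, $H$ is formed by the first, second and fourth columns of $P$.

```latex
\begin{align}
s\,\mathbf{x} &= P\,\mathbf{X} \tag{1}\\[4pt]
s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
  &= \begin{bmatrix}
       p_{11} & p_{12} & p_{13} & p_{14}\\
       p_{21} & p_{22} & p_{23} & p_{24}\\
       p_{31} & p_{32} & p_{33} & p_{34}
     \end{bmatrix}
     \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{2}\\[4pt]
s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
  &= H \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix},
  \qquad
  H = \begin{bmatrix}
        h_1 & h_2 & h_3\\
        h_4 & h_5 & h_6\\
        h_7 & h_8 & h_9
      \end{bmatrix} \tag{3}\\[4pt]
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  &= \begin{bmatrix}
       f_x & 0 & c_x\\
       0 & f_y & c_y\\
       0 & 0 & 1
     \end{bmatrix}
     \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{4}
\end{align}
```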
This work's objective requires the translation of pixel coordinates to world coordinates. This can be achieved with linear algebra by multiplying both sides of the previous equation by the inverse of the product of the camera matrix and the homography matrix, and thus Equation (5) is obtained.
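The pixel-to-world conversion described above can be illustrated with a short NumPy sketch. It assumes a 3x3 camera matrix `K` and a 3x3 homography `H` are already available; the function name `pixel_to_world` and the numeric values are hypothetical and are not taken from the experiments. Because the relation holds in homogeneous coordinates, the result is divided by its third component.

```python
import numpy as np

def pixel_to_world(u, v, K, H):
    """Map a pixel (u, v) to planar world coordinates (X, Y).

    Applies the inverse of the product K @ H to the homogeneous pixel
    and normalises by the third (scale) component.
    """
    M = np.linalg.inv(K @ H)
    Xh = M @ np.array([u, v, 1.0])
    return Xh[:2] / Xh[2]

# Hypothetical intrinsics and homography, for illustration only.
K = np.array([[800.0, 0.0, 960.0],
              [0.0, 800.0, 720.0],
              [0.0, 0.0, 1.0]])
H = np.array([[1.0, 0.1, 0.2],
              [0.0, 1.2, 0.3],
              [0.0, 0.05, 1.0]])

# Round trip: project a world point to pixels, then map it back.
p = K @ H @ np.array([0.3, 0.1, 1.0])
u, v = p[0] / p[2], p[1] / p[2]
print(pixel_to_world(u, v, K, H))
```

The round trip at the end is only a consistency check: a world point projected with `K @ H` should map back to itself.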

B. OBTAINING THE HOMOGRAPHY MATRIX
To obtain the homography matrix, the Raspberry Pi camera was first calibrated using a ChArUco board, shown in Figure 2, using the calibration and ArUco classes in the OpenCV libraries (https://opencv.org/). The camera calibration process provided the camera matrix, which was used to calibrate each image frame and to calculate the planar homography by transforming the pixel coordinates into image coordinates. Furthermore, each frame was undistorted to attenuate its radial and tangential distortions.
As mentioned previously, the coloured markers share the same plane (co-planar). Given this, an image of the chessboard parallel to the marker plane, shown in Figure 3, was taken. The OpenCV functions were able to determine the chessboard corners in both the image frame and the pixel frame. Each corner is then defined by its pixel coordinates, (u, v), and by its image coordinates, (x, y). To calculate the homography matrix elements, h, Equation (3) can be rearranged into Equation (6).
Given at least four known corresponding points, and defining $h_9 = 1$ (since the last element of the H matrix is 1), the previous equation can be transformed into Equation (7).
For this work, functions from the OpenCV libraries were used to determine the homography matrix from the obtained chessboard corner points.
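As an illustration of the linear system described above, the sketch below solves for the eight unknown elements of H from point correspondences with $h_9$ fixed to 1. It is a plain NumPy least-squares formulation written for this explanation, not the OpenCV routine used in this work (OpenCV provides `cv2.findHomography` for the same task); the function name `homography_from_points` and the sample points are hypothetical.

```python
import numpy as np

def homography_from_points(world_pts, image_pts):
    """Estimate H (with h9 = 1) mapping world (X, Y) to image (x, y).

    Each correspondence contributes two linear equations in h1..h8:
      h1*X + h2*Y + h3 - x*(h7*X + h8*Y) = x
      h4*X + h5*Y + h6 - y*(h7*X + h8*Y) = y
    """
    A, b = [], []
    for (X, Y), (x, y) in zip(world_pts, image_pts):
        A.append([X, Y, 1, 0, 0, 0, -x * X, -x * Y])
        b.append(x)
        A.append([0, 0, 0, X, Y, 1, -y * X, -y * Y])
        b.append(y)
    h, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float),
                            rcond=None)
    return np.append(h, 1.0).reshape(3, 3)

# Hypothetical ground-truth homography and four chessboard-like corners.
H_true = np.array([[1.0, 0.1, 0.2],
                   [0.0, 1.2, 0.3],
                   [0.0, 0.05, 1.0]])
world = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
image = []
for X, Y in world:
    p = H_true @ np.array([X, Y, 1.0])
    image.append((p[0] / p[2], p[1] / p[2]))

H_est = homography_from_points(world, image)
```

Four non-collinear points give exactly eight equations; with more points, such as all chessboard corners, the same call returns the least-squares solution.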

C. ANGLE CALCULATION
Once the markers' pixel coordinates are known, their world coordinates can be obtained through planar homography, as they are co-planar. A diagram of the manipulator is presented in Figure 4. In this diagram, the first joint can be defined as the origin, and the two links can be defined as two vectors, $v_1$ and $v_2$. Both vectors have their starting point on the second green marker to simplify the following equations. Before performing any angle calculation, the first joint must be set as the origin. For this, the world coordinates of the first green marker (base marker) are subtracted from the world coordinates of the second green marker and of the red marker. After this translation, the first joint angle can be calculated using Equation (8), where $(x_{2nd\,green}, y_{2nd\,green})$ are the second green marker world coordinates.
To calculate the second joint angle, the angle θ between vectors $v_1$ and $v_2$, both vectors were calculated using Equation (9).
The end-effector's Cartesian coordinates, (x, y), can be obtained through forward kinematics. In this work, these coordinates are the red marker coordinates obtained through the vision-based feedback system. However, the forward kinematics are used to determine the Cartesian ground truth by calculating the end-effector's position from the encoder data. Since the manipulator has two links and moves in a two-dimensional frame, the forward kinematic model can be expressed by Equation (12), where $L_1$ and $L_2$ are the lengths of the first and second links, respectively.
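The angle calculation and the forward kinematic model can be sketched as follows. This is a minimal illustration under stated assumptions: the first joint angle is taken as the `atan2` of the translated second green marker, the second joint angle is taken as the signed angle from the first link direction to the second link direction (the sign convention and vector origins of Equations (9) to (11) may differ), and the standard two-link planar model is used for the forward kinematics. The function names are hypothetical.

```python
import math

def joint_angles(base, joint2, tip):
    """Return (theta1, theta2) in radians from three planar world points.

    base   : first green marker (taken as the origin after translation)
    joint2 : second green marker
    tip    : red marker
    """
    # Translate so the base marker is the origin.
    x2, y2 = joint2[0] - base[0], joint2[1] - base[1]
    xt, yt = tip[0] - base[0], tip[1] - base[1]
    theta1 = math.atan2(y2, x2)
    # Link directions (assumed convention): v1 along link 1, v2 along link 2.
    v1 = (x2, y2)
    v2 = (xt - x2, yt - y2)
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    theta2 = math.atan2(cross, dot)  # signed angle from v1 to v2
    return theta1, theta2

def forward_kinematics(theta1, theta2, L1, L2):
    """Standard two-link planar forward kinematics."""
    x = L1 * math.cos(theta1) + L2 * math.cos(theta1 + theta2)
    y = L1 * math.sin(theta1) + L2 * math.sin(theta1 + theta2)
    return x, y

# Link lengths from the experimental setup, taken here as metres (assumption).
L1, L2 = 0.35, 0.26
x, y = forward_kinematics(0.4, -0.7, L1, L2)
```

Feeding the encoder angles to `forward_kinematics` gives the Cartesian ground truth, while `joint_angles` is applied to the marker positions recovered by the vision system.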

IV. MARKER TRACKING
To track the coloured markers' positions robustly, two estimators, a Kalman Filter and a Particle Filter, were compared. These filters were chosen since they do not require much computational power to track coloured markers. In different scenarios, where coloured markers are not used, a more advanced method, such as neural networks, should be used. The first joint is always static in the Cartesian frame. Therefore, this specific marker's position does not need to be estimated with the filters, and thus only the second green marker and the red marker positions need to be estimated.

A. KALMAN FILTER
The Kalman Filter used has a state where $v$ and $w$ are the pixel coordinates, and $V_v$ and $V_w$ are the pixel velocities. The filter parameter settings consist of a state transition matrix ($A = I_4$), a measurement matrix ($H = I_2$), a process noise covariance matrix ($Q = I_4 \times 10^{-1}$) and a measurement noise covariance matrix ($R = I_2 \times 10^{-1}$). This filter is divided into two states: (i) Prediction and (ii) Innovation.
In the prediction state, the filter calculates the predicted state vector, $\hat{x}_k^-$, and the predicted error covariance, $P_k^-$. The predicted state vector is calculated in Equation (14), where $A$ is the state transition matrix, $B$ is the control input matrix, $a_{k-1}$ is the control vector and $w_{k-1}$ is the process noise.
In this work, no control input was considered, and thus, Equation (14) can be rewritten into Equation (15).
After each pixel measurement, the filter jumps to the innovation state. In this state, the filter outputs an estimate based on the prediction and the measurement, using a gain that increases or decreases depending on the predicted error covariance. If the gain is high, the filter gives more weight to the measurement; if the gain is low, the filter gives more weight to the prediction. The estimated state, $\hat{x}_k^+$, is calculated using Equation (16), where $K$ is the Kalman Gain, $z_k$ is the measurement vector and $H$ is the measurement matrix.
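The two states can be illustrated with a small NumPy sketch of the textbook Kalman Filter equations. It is not the OpenCV class used in this work. The values of A, Q and R follow the settings listed above; the measurement matrix is written here as a 2x4 matrix that selects the two pixel coordinates from the four-element state, which is an assumption made so that the dimensions agree. The class name `PixelKalman` and the initial covariance are also hypothetical.

```python
import numpy as np

class PixelKalman:
    """Kalman Filter over the state [v, w, V_v, V_w] (pixels, pixels/frame)."""

    def __init__(self, v0, w0):
        self.x = np.array([v0, w0, 0.0, 0.0])   # initial state estimate
        self.P = np.eye(4)                      # initial error covariance (assumed)
        self.A = np.eye(4)                      # state transition matrix
        # Assumed 2x4 measurement matrix: only v and w are measured.
        self.H = np.hstack([np.eye(2), np.zeros((2, 2))])
        self.Q = 0.1 * np.eye(4)                # process noise covariance
        self.R = 0.1 * np.eye(2)                # measurement noise covariance

    def predict(self):
        """Prediction state: no control input is considered."""
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x

    def correct(self, z):
        """Innovation state: blend the prediction with the measurement z."""
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)        # Kalman Gain
        self.x = self.x + K @ (np.asarray(z, float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x

# Feed a constant, hypothetical marker centroid for a few frames.
kf = PixelKalman(0.0, 0.0)
for _ in range(50):
    kf.predict()
    estimate = kf.correct([100.0, 50.0])
```

Starting far from the measurement, the estimate needs a few frames to converge, which matches the behaviour one would expect from a filter initialised away from the marker.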
The Kalman Filter was developed using the OpenCV libraries, which contain a Kalman Filter class. To obtain the measured pixels, each frame was divided into two separate frames. Each of these frames was filtered so that only the red and green colours, respectively, were available. The program then searched for contours. A contour was considered valid if it was circular and had a minimum area of 900 square pixels. The centroid of each valid contour was calculated, and the resulting pixel coordinate was used as the measurement for the filter.

VOLUME 4, 2016

B. PARTICLE FILTER
In contrast to the Kalman Filter, the Particle Filter only considers a single frame per cycle and does not filter out the colours apart from green and red. This filter initializes by placing several points, or particles, around a predefined pixel with a uniform distribution ranging from -70 to +70. The filter reads the hue, saturation and value (HSV) values from the pixel positioned at the particle location (the green and red markers have their HSV ranges predefined). After reading these values, the filter calculates the weight of the particle using a normal curve equation for each of the HSV values, as shown in Equation (17), where $H$, $S$ and $V$ are the hue, saturation and value of the pixel positioned at the particle, respectively; $\mu_H$, $\mu_S$ and $\mu_V$ are the predefined HSV values for the green and red markers; and $\sigma_H$, $\sigma_S$ and $\sigma_V$ are the standard deviations of the HSV values, defined by the HSV range of the two markers.
This equation outputs a weight value between 0 and 1 per particle. For this work, a weight below 0.1 is considered negligible. After this process, the particles with less weight are deleted and then resampled around the particles with a higher weight, and the process restarts. The fewer the particles with a high weight, the farther away from each other the particles resample, up to a limited distance. To determine the marker point, all the particles with weights above 0.1 have their positions multiplied by their weights and summed together, giving the marker pixel coordinates. The distribution of the particles depends on an uncertainty parameter of the filter: the lower this parameter, the less spread out the particles will be around the higher-weighted particles. In this work, the frames were taken with a time difference of seconds between consecutive images. Given this, the manipulator did not move smoothly between frames; instead, there were jumps of several degrees between frames. For this filter to work in this scenario, the uncertainty parameter was set high so that the particles are distributed over a large area around the higher-weighted particles. This means that if a marker jumps 10° between frames, it will still have particles on it. However, this affects the quality of the predicted points.
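The weighting and marker estimation steps can be sketched in NumPy as follows. Two assumptions are made for illustration: the per-channel normal curves are taken as unnormalised Gaussians multiplied together, so that each weight lies between 0 and 1, and the retained weights are normalised to sum to 1 before the weighted sum, so that the result is a pixel coordinate. The function names and HSV values are hypothetical, and hue wrap-around (relevant for red) is ignored.

```python
import numpy as np

def particle_weights(hsv, mu, sigma):
    """Weight each particle by how close its pixel HSV is to the marker HSV.

    hsv   : (N, 3) array of H, S, V values read at the particle positions
    mu    : (3,) predefined marker HSV values
    sigma : (3,) standard deviations derived from the marker HSV range
    """
    z = (np.asarray(hsv, float) - mu) / sigma
    # Product of three unnormalised Gaussians, one per channel.
    return np.exp(-0.5 * np.sum(z ** 2, axis=1))

def estimate_marker(positions, weights, threshold=0.1):
    """Weighted mean of the particles whose weight exceeds the threshold."""
    keep = weights > threshold
    if not np.any(keep):
        return None  # marker not found in this frame
    w = weights[keep] / weights[keep].sum()
    return w @ positions[keep]

# Hypothetical particles: two on a green marker, one on the background.
positions = np.array([[10.0, 10.0], [20.0, 10.0], [300.0, 300.0]])
hsv = np.array([[60.0, 200.0, 200.0],
                [60.0, 200.0, 200.0],
                [0.0, 0.0, 0.0]])
mu = np.array([60.0, 200.0, 200.0])
sigma = np.array([10.0, 40.0, 40.0])

weights = particle_weights(hsv, mu, sigma)
marker = estimate_marker(positions, weights)
```

In this example the background particle receives a negligible weight and is discarded, so the estimate falls midway between the two particles on the marker.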

V. RESULTS
Three experiments were performed on the manipulator: (i) with the first joint rotating from 0 rad to π/2 rad and to −π/2 rad, with the second joint static at 0 rad; and (ii-iii) with the second joint rotating from 0 rad to π/2 rad and to −π/2 rad, while the first joint is static at 0 rad and −π/4 rad, respectively. Incremental encoders on both joints were used to determine a ground truth. The re-projection error indicates how exact the found parameters are. The smaller the value, the more exact the measurements are.
After calibrating the camera, the re-projection error was obtained by calculating the difference between the ChArUco pattern points and their corresponding projected world points. This was done using the OpenCV functions. The average re-projection error was found to be 0.15 pixels.

A. JOINT ANGLE ESTIMATION WITH KNOWN PIXEL COORDINATES
For the first set of experiments, all of the marker pixel coordinates were manually annotated and used as inputs to the computer vision system. The results of experiments i-iii are presented in Figures 5, 6 and 7, respectively.
In Figure 5, there is a gap in the first and second joint estimated angles. This gap happens because the manipulator goes out of the camera field of view. Nevertheless, as soon as the markers re-entered the camera field of view, the system started detecting the angles again.
Although the calculated angles match the angles from the encoders, there is an error between the calculated data and the ground truth. The mean error and standard deviation for each experiment are presented in Table 1.
As shown in this table, the standard deviation of the first joint for experiments ii-iii is 0. This is because the joint remained static the whole time, and the same pixel coordinate was used for all frames. However, in experiment i, even though the second joint was static, there is a standard deviation on its error. This happened since, as the first joint rotated, the second joint moved with the first link, and the system had to recalculate its angle with different pixel coordinates. Overall, the average error and standard deviation were under 10°.
Using the forward kinematic model, presented previously in Equation (12), and the data presented in Table 1, the position error can be calculated. This error is presented in Table 2.

B. JOINT ANGLE ESTIMATION WITH MARKER TRACKING
In the second set of experiments, the markers were tracked using a Kalman Filter and a Particle Filter. Although there are two green markers, the first green marker (base marker) was assumed to always be at the same pixel coordinate for these experiments. This is because the marker is positioned on a joint that does not move in the Cartesian frame; thus, the marker only rotates in place.
The performed experiments are the same as the previous ones.

1) Kalman Filter
The results of the angle estimation with automatic marker tracking using a Kalman Filter for experiments i-iii are presented in Figures 8, 9 and 10, respectively. The Kalman Filter tracking performed similarly to the manual marker tracking. Nevertheless, there were moments where the filter only converged after a few frames and not right at the start, such as in Figures 9 and 10. In these figures, the first joint converged to a stable angle after a few frames. The mean error and standard deviation for each experiment are presented in Table 3. The position error was calculated and is presented in Table 4.
In this table, it is observable that in experiment iii the error was 107.83 mm. This error is high; however, in the other experiments, the error was lower. In experiment i, the mean error was 41.21 mm. This error value is more acceptable for agricultural purposes; however, it needs more refining to become suitable for pruning and harvesting.

2) Particle Filter
The results of the angle estimation with automatic marker tracking using a Particle Filter for experiments i-iii are presented in Figures 11, 12 and 13, respectively. The mean error and standard deviation for each experiment are presented in Table 5.
Compared with the Kalman Filter, the errors are very similar, with a low value on the first joint during experiments i and ii, a high mean error value in experiment iii, and a high mean error value on the second joint in experiment [...]. The position error was calculated and is presented in Table 6.
As the joint angle errors were similar to the ones of the Kalman Filter, the position errors were also similar with a slight deviation.

VI. CONCLUSION
This work developed a vision-based system to estimate joint angles on a SCARA manipulator. The system, composed of a SCARA manipulator, a Raspberry Pi camera, a fish-eye lens and coloured markers (painted spheres), was able to estimate the joint angles with some errors. The markers were tracked using a Kalman Filter and a Particle Filter, and their positions were subsequently transformed into world coordinates on a plane using planar homography. Given the markers' dimensions, the trackers could not precisely detect the centre of each marker. Furthermore, because the markers are painted, their HSV values changed slightly as the manipulator turned, affecting performance, and the homography matrix requires further calibration to obtain better results.
The results showed that a system like the proposed one can calculate the joint angles of a co-planar manipulator and thus reduce the number of sensors required. Moreover, the use of the camera makes it possible to obtain a feedback system and thus to develop applications where adjustment based on visual perception represents an added value. Furthermore, the vision-based feedback system can contribute to further advances in robot vision, bringing it closer to human vision, since in humans the feedback from the eyes is used to move the arms and reach objects. The use of this type of feedback can instruct the robot to move the manipulator slightly to the left, to the right, up or down to reach an object, thus not requiring encoders. This procedure can be used, for instance, to adjust the positioning of the robot arm to reach and pick fruits and vegetables randomly placed on a selection conveyor. In this work, there were distinct scenarios where the errors were around 10 cm, which could compromise the system in pruning and harvesting operations, as the branch dimensions and the dimensions of some fruits are smaller than these errors.
A marker system with high-power coloured light-emitting diodes (LEDs) is proposed for future work. The use of high-power LEDs will attenuate the change in colour due to the lighting environment. Moreover, the LED blinking can be synchronized with the camera shutter, reducing power draw and dimming the LEDs to the human eye. A self-posture calibration system and a relative position error attenuation system are also proposed to reduce errors derived from an ill-calibrated transformation matrix. Using a known reference point, the relative position error attenuation significantly reduces the positioning error and can eliminate the need for [...]

He is the author of more than 100 publications in international journals and conferences (https://www.researchgate.net/profile/Manuel-Silva-8) and has been involved in several R&D projects. He has also been actively involved in the organization of several international conferences, integrating the Management Team of the CLAWAR Association (https://clawar.org/) and the General Assembly Board of the Portuguese Robotics Society (of which he has been President of the Steering Committee). His research interests focus on modelling, simulation, robotics, bio-inspired robotics, control and education in robotics and control. He is coauthor of about 100 papers published in scientific journals, book chapters, and proceedings of peer-reviewed international scientific conferences. His current research interests include Computer Vision and the application of artificial intelligence in images.