Neuromorphic Eye-in-Hand Visual Servoing

Robotic vision plays a major role in applications ranging from factory automation to service robotics. However, traditional frame-based cameras limit continuous visual feedback because of their low sampling rate and the redundant data they produce for real-time image processing, especially in high-speed tasks. Event cameras provide human-like vision capabilities, such as observing dynamic changes asynchronously at a high temporal resolution ($1\mu s$) with low latency and wide dynamic range. In this paper, we present a visual servoing method that uses an event camera and a switching control strategy to explore, reach and grasp in order to achieve a manipulation task. We devise three surface layers of active events to directly process the stream of events generated by relative motion. A purely event-based approach is adopted to extract corner features, localize them robustly using heat maps and generate virtual features for tracking and alignment. Based on the visual feedback, the motion of the robot is controlled to make the upcoming event features converge to the desired event in spatio-temporal space. The controller switches its strategy based on the sequence of operation to establish a stable grasp. The event-based visual servoing (EBVS) method is validated experimentally using a commercial robot manipulator in an eye-in-hand configuration. Experiments demonstrate the effectiveness of the EBVS method in tracking and grasping objects of different shapes without the need for re-tuning.


I. INTRODUCTION
In robotics, visual servoing is a well-studied research topic [1], [2] and a well-known real-time technique for controlling the motion of a robot using continuous visual feedback. Such vision-based closed-loop control increases the accuracy, flexibility, functionality and efficiency of robotic automation and improves safety in collaborative environments, while reducing the need for complex fixtures. In conventional visual servoing, frame-based cameras are mainly used to extract, track and match visual features by processing images at consecutive frames, which delays visual processing and timely robot action.
In high-speed applications, visual information is expected to be fast, efficient, accurate and reliable in providing real-time information about dynamic surroundings. Recently developed neuromorphic vision sensors, which mimic the neuro-biological architecture of the human retina, encode illumination changes as evolving temporal spikes. They thus overcome the limitations of conventional cameras and open up a new paradigm for visual processing. An event camera attached to the robot end-effector to perform visual servoing is depicted in Fig. 1.
Unlike a conventional vision sensor, which is frame-based and clock-driven, a neuromorphic vision sensor [3], [4] is event-driven and has low latency, high temporal resolution and wide dynamic range. Moreover, the sensor pixels operate independently and asynchronously, responding to varying illumination in continuous time. We exploit this inherent property of the sensor to achieve more efficient and less resource-demanding visual servoing to facilitate robotic object manipulation.
In the literature, the robotic manipulation pipeline acts as a global framework for studying such servoing methods [5]. Visual servoing approaches differ in the camera placement, the type and number of cameras used, the 2D or 3D motion command generated, the vision algorithm utilized and the kinematic and dynamic control strategy deployed. Their development therefore draws on interdisciplinary efforts from fields such as computer vision, control theory, system integration and real-time computation. Classical approaches are mainly divided into position-based visual servoing (PBVS) and image-based visual servoing (IBVS). PBVS adopts an eye-on-hand configuration and employs the object pose estimated with respect to a calibrated camera as the control objective. Thus, it cannot control the image features directly, suffers from calibration and estimation errors and requires knowledge of the 3D object model. IBVS, on the other hand, adopts an eye-in-hand configuration and directly uses 2D image measurements as the control objective. It remains a popular scheme since it excludes the calibration and estimation process. Visual servoing has been extensively studied for manipulation applications involving rigid, flexible, soft and continuum robot manipulators [6], [7], [8].
Similar to the IBVS approach but in the line of event-based vision research, we present an event-based visual servoing method that adopts the traditional eye-in-hand configuration and processes the event stream generated by relative motion to control the motion of the robot. An event camera in such a configuration needs to act to perceive and perceive to act. We define event-based visual servoing as a way to control the motion of the robot using instantaneous spatio-temporal information as feedback. Our approach relies on the extraction, robust tracking and matching of event features such as points and lines to reach a desired pose of the event camera, starting from an arbitrary initial pose.
Visual servoing also assists grippers in the grasp alignment process when object shapes and targets change. It even enables a low-cost vacuum gripper to align over a range of positions and orientations to grasp the object reliably.

A. Contributions
A rich survey on event-based vision is available in [9], where several areas relating to robotic applications, such as pose tracking, object recognition and tracking, and SLAM, are reviewed. In the line of event-based vision research, we address a classic problem in robotic grasping and manipulation, namely visual servoing.
The primary contributions of this paper are summarized as follows. 1) We propose an event-based visual servoing (EBVS) method that operates on three layers of active event surfaces to detect, extract and track high-level features and uses a simple control law to dictate the robot motion. 2) We propose a switching strategy within EBVS that enables the robot to explore the workspace to detect key object features and to track those features to reach the object and align the gripper for grasping, such that an object manipulation task is facilitated. 3) By constraining the robot with an eye-in-hand configuration to a 2D plane, we demonstrate event-based visual servoing and gripper alignment to perform a top-down grasp using a vacuum gripper, which fits applications in smart manufacturing.

II. EVENT-BASED VISUAL SERVOING METHOD
An event-based visual control scheme for a robotic manipulator with an eye-in-hand configuration to achieve a manipulation task is illustrated in Fig. 2. Instead of a frame-based camera, an event camera is mounted on the robot's end flange, maintaining a fixed relative position with respect to the vacuum gripper. Such a setting offers flexibility in viewing the workspace and assistance in grasping. A double-loop structure is employed. In the first loop, the event stream that the neuromorphic vision sensor generates under dynamic motion is processed to extract high-level features. The switching strategy changes the mode of operation (explore, reach and align) in event-based visual servoing and regulates the feature stream accordingly. These features are then used to estimate an error signal between the goal event state and the current state of the feature events. A simple control law that minimizes the feature error outputs a control signal in the form of the velocity screw of the event camera. A second loop locally controls and stabilizes the joints of the robotic manipulator. The step-by-step processing of events, the control law and the switching strategy are detailed in the following.

A. Event Processing
Let us consider a moving event-based camera observing a rigid object placed in a workspace. The movement of the camera generates a stream of events on the sensor plane of the event camera. The standard pinhole model still applies to an event camera since it uses the same optics as a traditional perspective camera. The pinhole projection is shown in Fig. 3, mapping a 3D point $\chi = [x, y, z]$ into a 2D point $p = [u, v]$ on the camera's sensor plane, which is expressed in homogeneous coordinates as:
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \, [R \mid t] \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \quad (1)$$
where $s$ is a scale factor, $f$ denotes the focal length of the camera, $K$ accounts for the camera's intrinsic components (including $f$), and $R$ and $t$ refer to the extrinsic rotational and translational components.
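As a minimal sketch of the pinhole projection described above, the following Python snippet maps a 3D point to pixel coordinates. The principal point (`u0`, `v0`), the square-pixel and zero-skew form of `K`, and all numeric values are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def project_point(chi, f, u0, v0, R, t):
    """Project a 3D point chi = [x, y, z] to a pixel p = [u, v] (pinhole model)."""
    # Intrinsic matrix: square pixels and zero skew are assumed here.
    K = np.array([[f, 0.0, u0],
                  [0.0, f, v0],
                  [0.0, 0.0, 1.0]])
    Rt = np.hstack([R, t.reshape(3, 1)])   # 3x4 extrinsic matrix [R | t]
    chi_h = np.append(chi, 1.0)            # homogeneous 3D point
    s_p = K @ Rt @ chi_h                   # scaled homogeneous pixel s * [u, v, 1]
    return s_p[:2] / s_p[2]                # divide by the scale to obtain [u, v]

# Illustrative values only: the camera frame coincides with the world frame.
p = project_point(np.array([0.1, -0.05, 0.5]), f=200.0, u0=120.0, v0=90.0,
                  R=np.eye(3), t=np.zeros(3))
print(p)
```

With the camera frame aligned to the world frame, the scale reduces to the depth of the point, so the division in the last step is the usual perspective divide.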
Event cameras represent visual information in terms of time with respect to a spatial reference in the camera pixel array. Pixels in the dynamic vision sensor respond independently and asynchronously to logarithmic brightness changes in the scene. For a relative motion, a stream of events with microsecond (µs) temporal resolution and latency is generated, where an event $e = (p, t, Pol)$ is a compactly represented tuple that describes the point $p = (u, v)$ in sensor plane coordinates at time $t$, with the polarity $Pol$ indicating a brightness increase or decrease. However, analysing only the single latest event does not give much information at the operational level, and exploring all past events is not scalable.
In this work, we consider three sequential layers of surfaces of active events, shown in Fig. 4, for performing operations on the evolving temporal data in camera pixel space to achieve EBVS. The first layer is known as the surface of active events (SAE), where the surface represents the timestamp of the latest event at each pixel from the raw event stream. For each upcoming event, the function $\Sigma_{SAE} : \mathbb{N}^2 \to \mathbb{R}$ takes the pixel position of a triggered event and assigns it to its timestamp:
$$\Sigma_{SAE} : p \mapsto t$$
In the SAE, we apply feature-based algorithms to filter out insignificant events and extract highly informative events such as corners. The second layer is the surface of active corner events (SACE), which maps the pixel position of recent corner events to their timestamps, and where we extract the center of the object by robustly localizing the corner events. The object center is the extracted high-level feature used in visual servoing; it is a virtual event and not an actual event. Moreover, we introduce random and goal state events and consider them as virtual events. The third layer is the surface of active virtual events (SAVE), which maps the pixel positions of the extracted and artificially induced virtual events to their timestamps, and where the contiguity of the high-level feature is analysed for switching the control objectives. The EBVS modes of operation, such as exploration, reaching and grasping, are determined by the SAVE.
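The three layers share the same pixel-to-timestamp structure, which can be sketched as below. This is only an illustration under stated assumptions: the class name `ActiveEventSurface`, the `latest` helper and the sample events are hypothetical, and the 240 × 180 size is the DAVIS240C resolution quoted in the experimental setup.

```python
import numpy as np

WIDTH, HEIGHT = 240, 180   # DAVIS240C resolution quoted in the experimental setup

class ActiveEventSurface:
    """Maps each pixel to the timestamp of the latest event seen there."""

    def __init__(self, width=WIDTH, height=HEIGHT):
        self.t = np.full((height, width), -np.inf)   # -inf marks "no event yet"

    def update(self, u, v, t):
        self.t[v, u] = t                             # Sigma: (u, v) -> t

    def latest(self, n):
        """Pixel coordinates (u, v) of up to n most recent events, newest first."""
        flat = np.argsort(self.t, axis=None)[::-1][:n]
        v, u = np.unravel_index(flat, self.t.shape)
        keep = np.isfinite(self.t[v, u])             # drop pixels that never fired
        return list(zip(u[keep].tolist(), v[keep].tolist()))

# One surface per layer: raw events, corner events and virtual events.
sae, sace, save = ActiveEventSurface(), ActiveEventSurface(), ActiveEventSurface()
sae.update(u=10, v=20, t=1.0e-3)     # raw event
sae.update(u=11, v=20, t=1.2e-3)     # raw event
sace.update(u=11, v=20, t=1.2e-3)    # the same event, classified as a corner
save.update(u=120, v=90, t=1.3e-3)   # virtual event, e.g. an object center
print(sae.latest(2))
```

Keeping only the latest timestamp per pixel bounds the memory to the sensor size, which is what makes processing each incoming event individually tractable.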

B. Feature Detection
In conventional image processing, the Harris detector is one of the most widely used techniques for detecting features such as corners, edges and flat points based on strong intensity variation in a local neighborhood. This feature detector is known for its efficiency, simplicity, invariance to rotation and robustness to illumination changes. Unlike a conventional camera, which records a large amount of redundant data in a sequence of frames, the DVS records only the changes in the visual scene as a stream of events characterized by their pixel positions and timestamps, and does not include intensity measurements. Therefore, the frame-based Harris detector cannot be directly applied to the SAE. Event-based adaptations of the Harris detector are proposed in [10] and [11], where each upcoming event is processed directly. These methods binarize the SAE using the newest $N$ events, either for the whole image plane or locally around the current event. Let $\Sigma_b$ be a binary surface locally centered around the latest event, where 1 and 0 indicate the presence and absence of an event. The gradient is computed on the binary surface with a 5 × 5 Sobel operator as
$$\nabla \Sigma_b = \begin{bmatrix} \partial \Sigma_b / \partial x \\ \partial \Sigma_b / \partial y \end{bmatrix}$$
The Harris matrix is then computed over the local patch as
$$M = \sum_{(x, y)} g(x, y) \, \nabla \Sigma_b(x, y) \, \nabla \Sigma_b^T(x, y)$$
where $g$ is a weighting window, and the Harris score as
$$\det(M) - k \, \operatorname{trace}(M)^2$$
where $k$ is an empirical constant. The Harris feature detector mainly relies on the analysis of the eigenvalues of the auto-correlation matrix. If the Harris score is a large positive value, the event is classified as a corner, whereas a negative value indicates an edge. The remaining events, whose scores lie in between, are considered flat points. In our case, the adapted e-Harris detector [11] is used to detect corner events from locally perceived information that is independent of the scene and sensor size. A corner threshold of $HC_{th} = 5$, a buffer of the latest $N = 20$ events and a patch of 9 × 9 pixels gave the best performance over a wide variety of data-sets.
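The Harris test on a binary event patch can be illustrated with the short sketch below. It is not the e-Harris implementation of [11]: a uniform window replaces the weighting window for brevity, `k = 0.04` is a commonly used value assumed here, and the score is left unnormalized, so its scale is not comparable with the threshold quoted above. The two synthetic 9 × 9 patches are hypothetical test inputs.

```python
import numpy as np

SMOOTH = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
DERIV = np.array([-1.0, -2.0, 0.0, 2.0, 1.0])
SOBEL_X = np.outer(SMOOTH, DERIV)   # 5x5 Sobel kernel, derivative along x (columns)
SOBEL_Y = SOBEL_X.T                 # derivative along y (rows)

def correlate_valid(patch, kernel):
    """'Valid' cross-correlation written with NumPy only."""
    windows = np.lib.stride_tricks.sliding_window_view(patch, kernel.shape)
    return np.einsum("ijab,ab->ij", windows, kernel)

def harris_score(binary_patch, k=0.04):
    """Unnormalized Harris score of a binary patch (uniform window assumed)."""
    gx = correlate_valid(binary_patch, SOBEL_X)
    gy = correlate_valid(binary_patch, SOBEL_Y)
    a, b, c = np.sum(gx * gx), np.sum(gx * gy), np.sum(gy * gy)
    det, trace = a * c - b * b, a + c
    return det - k * trace ** 2

edge = np.zeros((9, 9))
edge[:, 4] = 1.0                    # events along a straight vertical edge
corner = np.zeros((9, 9))
corner[4, 4:] = 1.0                 # two event lines meeting at the patch center
corner[4:, 4] = 1.0

print(harris_score(corner) > 0, harris_score(edge) < 0)
```

On the straight edge the gradient has a single dominant direction, so the determinant vanishes and the score is negative, while the two arms of the corner patch produce gradients in both directions and a positive score.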
Whenever a corner event $p_{hc} : (x_{hc}, y_{hc})$ is detected, it is projected onto the SACE. To cluster these events into object corners and minimize the influence of noise events, consecutive corner points are accumulated to form a heat-map of corner locations. A real-valued heat-map matrix $H$ is introduced for this purpose. Whenever a new corner event is received, the elements of $H$ are updated using $x_c$ and $y_c$, the coordinates of the most recent corner event, a scaling factor $\alpha$, and $\sigma$, the standard deviation of the incoming corner event, which dictates the area of effect each event has on the heat-map.
To keep only the recent events relevant to the process of detecting the object corners, the heat-map is continuously updated with time as indicated by eq. 7, where $\tau$ is a time constant dictating the period of influence of each corner point and $t_c$ is the timestamp of the last received corner event.
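One plausible reading of the heat-map update and its temporal forgetting is sketched below. The Gaussian increment and the exponential decay are assumptions consistent with the roles of $\alpha$, $\sigma$, $\tau$ and $t_c$ described above, not the paper's exact equations; the class name `CornerHeatMap` and the default parameter values are hypothetical.

```python
import numpy as np

class CornerHeatMap:
    """Accumulates corner events into a decaying heat-map (assumed forms)."""

    def __init__(self, height, width, alpha=1.0, sigma=3.0, tau=0.5):
        self.h = np.zeros((height, width))
        self.alpha, self.sigma, self.tau = alpha, sigma, tau
        self.t_c = None                               # time of the last corner event
        self.yy, self.xx = np.mgrid[0:height, 0:width]

    def add_corner(self, x_c, y_c, t):
        if self.t_c is not None:
            # Assumed exponential forgetting since the previous corner event.
            self.h *= np.exp(-(t - self.t_c) / self.tau)
        d2 = (self.xx - x_c) ** 2 + (self.yy - y_c) ** 2
        # Assumed Gaussian increment: sigma sets the area of effect of the event.
        self.h += self.alpha * np.exp(-d2 / (2.0 * self.sigma ** 2))
        self.t_c = t

hm = CornerHeatMap(height=180, width=240)
hm.add_corner(x_c=50, y_c=60, t=0.0)
hm.add_corner(x_c=150, y_c=100, t=0.5)   # one time constant later
print(hm.h[60, 50], hm.h[100, 150])
```

Under these assumptions, a corner that keeps firing stays near the top of the map, whereas an isolated noise event fades within a few time constants.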
As such, the heat-map $H$ represents spatio-temporal patterns in the corner events. The corners of the object are then obtained from these patterns by detecting the local peaks that exceed a minimum threshold of 0.7. Local maxima are obtained by dilating the heat-map with a window size of 10 × 10 and extracting the locations where the original heat-map is equal to the dilated heat-map. Fig. 5 shows an example corner heat-map along with its local peaks for a sample object placed in the camera's field of view. Let $S = \{(x^0_c, y^0_c), ..., (x^n_c, y^n_c)\}$ be the set of local peaks obtained from the heat-map. We consider the centroid of the object, computed from the peaks in $S$, as the high-level feature projected onto the SAVE for tracking operations in visual servoing.
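The peak extraction by dilation can be sketched with NumPy alone, as below. The 10 × 10 window and the 0.7 threshold come from the text; taking the centroid as the plain mean of the peaks is an assumption, and the synthetic triangular heat-map and the function names are hypothetical.

```python
import numpy as np

def local_peaks(heat, window=10, threshold=0.7):
    """Locations (x_c, y_c) where the heat-map equals its grey-scale dilation."""
    before, after = window // 2, window - 1 - window // 2
    padded = np.pad(heat, ((before, after), (before, after)),
                    constant_values=-np.inf)
    views = np.lib.stride_tricks.sliding_window_view(padded, (window, window))
    dilated = views.max(axis=(2, 3))              # maximum over each window
    ys, xs = np.nonzero((heat == dilated) & (heat > threshold))
    return np.column_stack([xs, ys])

def centroid(peaks):
    """Object centroid, assumed here to be the unweighted mean of the peaks."""
    return peaks.mean(axis=0)

# Synthetic heat-map with three well-separated bumps (a triangular object).
yy, xx = np.mgrid[0:180, 0:240]
corners = [(40, 30), (100, 30), (70, 90)]
heat = sum(np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / 18.0) for x, y in corners)

peaks = local_peaks(heat)
p_v = centroid(peaks)
print(peaks.tolist(), p_v)
```

Comparing the map with its dilation keeps exactly one pixel per bump as long as the bumps are further apart than the window, and the threshold discards the flat background where the equality also holds.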

C. Feature tracking
Let $f^*_d$ denote the desired feature events triggered in the SAVE (for example, the center of the sensor plane) and $f$ the coordinates of the detected high-level feature events, such as the object center, both expressed in pixel units. The linear velocity $\upsilon$ and angular velocity $\omega$ of the camera are represented as $V_c = (\upsilon, \omega)$. The primary goal in EBVS is to compute the camera velocity $V_c$ such that the error $e = f - f^*_d$ is minimized.
The relationship between the velocity of the feature events and the camera velocity is given by
$$\dot{f} = L V_c$$
in which $L \in \mathbb{R}^{k \times 6}$ is the feature Jacobian. The Moore-Penrose pseudo-inverse $L^\dagger = (L^T L)^{-1} L^T$ is used when $L$ has full rank. To control 6 DOF, at least three feature points are necessary, and their Jacobians can be stacked into a composite $L$ to achieve this. $V_c$ is the input to the robot controller ensuring an exponential decrease of the feature error ($\dot{e} = -\lambda e$ with gain $\lambda > 0$), and the control law is expressed as
$$V_c = -\lambda L^\dagger e$$
As the end-effector moves towards the object, the locations of the object's corners and centroid in the sensor plane must be updated. For this purpose, a simple moving average approach is adopted. For every new $p_{hc}$ detected by the e-Harris algorithm, the closest object corner $p^i_c : (x^i_c, y^i_c) \in S$ is determined, and $p^i_c$ is then updated by a moving average that incorporates $p_{hc}$. Whenever the SACE is updated, the SAVE is also updated accordingly, leading to a refined estimate of the object's centroid.
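The velocity computation can be sketched as follows, using the classical interaction matrix of point features. The sketch assumes normalized image coordinates and known depths `Z` rather than raw pixel units, and the gain, feature positions and depths are illustrative values, not the paper's settings.

```python
import numpy as np

def point_interaction_matrix(x, y, Z):
    """Classical 2x6 interaction matrix of a normalized image point at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])

def camera_velocity(features, desired, depths, lam=0.5):
    """Return V_c = -lam * pinv(L) @ e together with the stacked L and error e."""
    e = (features - desired).ravel()
    L = np.vstack([point_interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(features, depths)])
    return -lam * np.linalg.pinv(L) @ e, L, e

# Four feature points, all shifted by 0.05 along x from their desired positions.
desired = np.array([[-0.1, -0.1], [0.1, -0.1], [0.1, 0.1], [-0.1, 0.1]])
features = desired + np.array([0.05, 0.0])
V_c, L, e = camera_velocity(features, desired, depths=[0.5] * 4)

print(V_c)               # six-component velocity screw (v, omega)
print(L @ V_c, -0.5 * e) # predicted feature velocity versus -lambda * e
```

Because the error in this example is reachable by a pure camera translation, the predicted feature velocity `L @ V_c` equals `-lam * e`, which is the exponential decay the control law targets.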
Due to its simplicity, tracking corners using the moving average approach is much faster than heat-map corner detection, making it more suitable for high-speed applications. However, it is prone to errors if the tracking of one corner is lost. To account for such cases, corner tracking is regularly checked against heat-map corner detection at an interval of 0.3 s. If considerable discrepancies are found over multiple timesteps, the system reverts to the heat-map-based corner detection mode.
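A minimal sketch of the corner tracking and its consistency check is given below. The exponentially weighted form of the moving average, the weight `beta` and the `discrepancy` measure are assumptions chosen for illustration; the paper does not specify them in the text above.

```python
import numpy as np

def track_corner(corners, p_hc, beta=0.2):
    """Pull the stored corner closest to the new corner event p_hc towards it."""
    corners = np.asarray(corners, dtype=float)
    p_hc = np.asarray(p_hc, dtype=float)
    i = np.argmin(np.linalg.norm(corners - p_hc, axis=1))   # nearest stored corner
    updated = corners.copy()
    updated[i] = (1.0 - beta) * corners[i] + beta * p_hc    # assumed weighted average
    return updated, updated.mean(axis=0)                    # corners and centroid

def discrepancy(tracked, detected):
    """Largest distance from a tracked corner to its nearest detected corner."""
    d = np.linalg.norm(tracked[:, None, :] - detected[None, :, :], axis=2)
    return d.min(axis=1).max()

corners = np.array([[40.0, 30.0], [100.0, 30.0], [70.0, 90.0]])
updated, p_v = track_corner(corners, (44, 30), beta=0.25)
drift = discrepancy(updated, corners)   # compare against a heat-map detection
print(updated[0], p_v, drift)
```

A periodic check could compare `drift` with a pixel tolerance and fall back to heat-map detection when the tolerance is exceeded on several consecutive checks.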

D. Gripper Alignment to Grasp
Once the robot tracks and reaches the object's center, the orientation of the grippers is adjusted to achieve a stable grasp. A target orientation $\theta$ is defined such that the two gripping points are aligned with a virtual line connecting the object centroid $p_v : (x_v, y_v)$ in the SAVE with a corner point $p^i_c$ in the SACE. To maximize the stability of the grip, $p^i_c$ is selected as the corner point furthest from the centroid, and $\theta$ is hence computed as the orientation of this line. Fig. 6 shows the alignment process, where the grippers are rotated at a constant angular velocity until $\theta$ is within an admissible range.
Fig. 6: Constraint set for gripper alignment after servoing.
Fig. 7: Illustration of the switching strategy that explores to detect, tracks to reach and aligns to grasp.
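The orientation of the line from the centroid to the furthest corner can be sketched with `atan2`, as below. The use of `atan2`, the angular rate `omega`, the admissible `tolerance` and the function names are assumptions for illustration.

```python
import math

def grasp_orientation(centroid, corners):
    """Angle of the line from the centroid to the corner furthest from it."""
    x_v, y_v = centroid
    x_c, y_c = max(corners, key=lambda c: math.hypot(c[0] - x_v, c[1] - y_v))
    return math.atan2(y_c - y_v, x_c - x_v)

def alignment_rate(theta, gripper_angle, omega=0.2, tolerance=0.02):
    """Constant-rate rotation command until the angle error is admissible."""
    diff = theta - gripper_angle
    err = math.atan2(math.sin(diff), math.cos(diff))   # wrap to (-pi, pi]
    return 0.0 if abs(err) <= tolerance else math.copysign(omega, err)

theta = grasp_orientation((70.0, 50.0), [(40, 30), (100, 30), (70, 90)])
print(theta, alignment_rate(theta, gripper_angle=0.0))
```

In this example the corner at (70, 90) is the furthest from the centroid, so the target orientation is a quarter turn and the command stays at the constant rate until the gripper angle is within the tolerance.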

E. Switching Strategy
The switching strategy enables the robot to explore, reach and grasp in the process of event-based visual servoing. In Fig. 7, the switching operation is illustrated in the surface of active virtual events. Let $p_{cc}$ be the artificially induced desired event representing the central pixel of the camera at the starting position, $p_{vr}$ a random feature event and $p_{voc}$ the most recently extracted feature event representing the center of the object. First, a virtual event $p_{vr}$ is triggered to motivate the robot to explore the workspace and detect the object feature $p_{voc}$. The path highlighted in yellow indicates the pathway chosen by the robot in the exploration phase. While tracking, the contiguity of $p_{voc}$ is analysed. Once the count of contiguous pixels crosses a threshold (e.g. 3), the robot changes its course of action and tracks $p_{voc}$ to minimize the error. The path highlighted in pink indicates the new pathway to reach the object center. Switching can happen even in the reaching phase due to detection issues and contiguity breakdown. However, the strategy gives the robot the capability to recover and reach the desired feature. Finally, the robot aligns the gripper to perform a stable grasp. The switching function selects the feature event to servo on according to the contiguity count, following the steps below.
6: Extract weighted corners in SACE using heat-maps.
7: Compute object centroid from weighted corner events in SACE.
8: Monitor and operate in SAVE.
9: if contiguity count < $C_{th}$ then
10: Initialize a random desired event in the SAVE.
11: Engage visual servoing to the random feature event.
12: Detect and track object feature events in SAVE.
16: Align gripper orientation for a stable grasp.
17: Move to the pre-grasp pose and execute grasp.
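The switching logic described above can be sketched as a small state selector. The mode names, the pixel tolerance `reach_tol` that separates reaching from alignment, and the function signature are hypothetical; only the comparison of the contiguity count against a threshold is taken from the text.

```python
import math
from enum import Enum

class Mode(Enum):
    EXPLORE = "explore"
    REACH = "reach"
    ALIGN = "align"

def switch(contiguity_count, p_vr, p_voc, p_cc, c_th=3, reach_tol=2.0):
    """Return (mode, feature), where feature is the event servoed towards p_cc."""
    if p_voc is None or contiguity_count < c_th:
        return Mode.EXPLORE, p_vr          # servo on the random virtual event
    err = math.hypot(p_voc[0] - p_cc[0], p_voc[1] - p_cc[1])
    if err > reach_tol:
        return Mode.REACH, p_voc           # servo on the object center
    return Mode.ALIGN, p_voc               # centered: rotate the gripper

p_cc, p_vr = (120, 90), (10, 10)
print(switch(1, p_vr, (100, 80), p_cc))    # object feature not yet contiguous
print(switch(3, p_vr, (100, 80), p_cc))    # contiguous but off-center
print(switch(5, p_vr, (120, 91), p_cc))    # contiguous and centered
```

Because the mode is re-evaluated on every update, a contiguity breakdown during reaching sends the robot back to exploration, which mirrors the recovery behaviour described in the text.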

III. EXPERIMENTAL VALIDATION OF EBVS
This section describes and discusses the results of the experiments conducted to validate the proposed EBVS approach.

A. Experimental Setup and Protocol
The proposed visual servoing method was incorporated in a top-down grasping paradigm to test its performance and applicability to real-world smart manufacturing applications. The experimental setup consists of a Universal Robots UR10 6-DOF arm, a custom-made vacuum gripper, and a dynamic and active pixel vision sensor (DAVIS240C) placed in an eye-in-hand configuration, as displayed in Fig. 1. The DAVIS240C provides a spatial resolution of 240 × 180, a minimum latency of 12 microseconds, a bandwidth of 12 MEvent/second and a dynamic range of 120 dB. To successfully pick and place an object, the end-effector is first driven into alignment with the target object using the process described in Section II. During the exploration and reaching phases, the end-effector's movement is constrained to a 2D plane perpendicular to the camera's optical axis. Once the end-effector is aligned with the target object, it translates along the camera's optical axis until contact with the object is achieved. Subsequently, the vacuum grippers are activated to grasp the target and relocate it to a desired location. Given the limitations of the UR10's reach and the camera's field of view, the workspace of the experiments was limited to a 1.2 × 1.0 m virtual rectangle in front of the robotic platform.
To evaluate EBVS performance against different geometries, experiments were carried out with three different objects: a triangular prism, a cuboid and a pentagonal prism. Fig. 8 shows the various stages of the proposed EBVS method for a visual servoing trial with a cuboid. For each stage, the robotic platform is displayed along with the corresponding heat-map of corner events and the SAVE. During the exploration phase, the end-effector first moves towards a random virtual event $p_{vr}$ to trigger events in the scene and update the heat-map. Based on the heat-map, the EBVS algorithm detects the object's high-level features. Once contiguity is achieved in these features, the robot switches to the reaching phase, where it moves towards the object's centroid $p_{voc}$. The robot then enters the alignment phase, where the grippers are rotated to achieve a stable grasp. Finally, the robot enters the grasping and manipulation phase to pick the object and place it in a desired location. Comparing the heat-maps and the SAVE with the top-view pictures demonstrates the accuracy of the corner detection and tracking approach. Consequently, the centroid of the object in the SAVE is correctly inferred. As such, the proposed EBVS approach successfully drives and aligns the end-effector with the object prior to initiating the grasp.

B. Experimental Results
The experiment in Fig. 8 was repeated five times, each with a different placement of the object in the workspace. Table I shows the results of these experiments in terms of the grasp error $e_{grasp}$ and $N_{switch}$, the number of times tracking was lost and the algorithm switched back to detection mode. The grasp error is defined as the distance between the center of the two gripping points and the object's true centroid, as illustrated in Fig. 9.
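The grasp error defined above reduces to a midpoint and a Euclidean distance, as in the short sketch below. The coordinates are hypothetical values in millimetres chosen for illustration and are not measurements from the experiments.

```python
import math

def grasp_error(grip_a, grip_b, true_centroid):
    """Distance between the midpoint of the gripping points and the centroid."""
    mid = ((grip_a[0] + grip_b[0]) / 2.0, (grip_a[1] + grip_b[1]) / 2.0)
    return math.dist(mid, true_centroid)

# Hypothetical planar coordinates in millimetres.
e_grasp = grasp_error((0.0, 0.0), (60.0, 0.0), (42.0, 16.0))
print(e_grasp)
```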
Experiments were also carried out with different object shapes. Table II shows experimental results across five trials for three different geometrical shapes. In all the experiments, the presented visual servoing approach was capable of successfully tracking and grasping the target object with both vacuum grippers adhering to the object. The average grasp error over all the experiments was 16.1 mm. These errors are mainly attributed to design imperfections, such as the misalignment of the camera's optical axis with the workspace plane and the skewed positioning of the camera with respect to the center of the vacuum grippers. Making the proposed method robust to such irregularities is one objective for future studies.
The conducted experiments show that the proposed algorithm loses track more often with the pentagon shape. This in turn affects the accuracy of grasping, as a larger deviation from the true object centroid was observed. As shown in Fig. 10, when the neuromorphic camera moves parallel to an edge, it is less likely to trigger events corresponding to this edge. As a result, the event-based Harris corner detection fails to detect the corners associated with edges parallel to the camera's movement, causing EBVS to lose track. As a pentagon has edges with more varied slopes than a rectangle or a triangle, EBVS is more likely to encounter this shortcoming with it. A possible solution would be a filtering mechanism that determines the most reliable corners for EBVS tracking based on the camera's velocity vector. Such modifications will be the focus of further development of EBVS and can be highly beneficial to other event-based visual tracking applications.

This study introduces a purely event-based visual servoing method that detects and tracks high-level features in a scene to perform a pick and place task suitable for smart manufacturing applications. A detailed explanation of the novel multistage servoing approach is presented, where three layers of active events are devised to process the incoming stream of events. Based on these layers, the gripper is accurately driven towards and aligned with the target object for grasping and placement.
Experiments validate the proposed EBVS method for use with objects of different geometrical features without the need for re-tuning or adaptation. The platform was able to grasp objects placed randomly in the workspace with a 100% success rate. For future work, we plan to improve the performance of the presented procedure by accounting for alignment uncertainties and incorporating an optimal motion planning scheme.