3-D Pose Estimation of Articulated Instruments in Robotic Minimally Invasive Surgery

Estimating the 3-D pose of instruments is an important part of robotic minimally invasive surgery, both for automation of basic procedures and for providing safety features such as virtual fixtures. Image-based methods of 3-D pose estimation provide a non-invasive, low-cost solution compared with methods that incorporate external tracking systems. In this paper, we extend our recent work in estimating rigid 3-D pose with silhouette and optical flow-based features to incorporate the articulated degrees-of-freedom (DOFs) of robotic instruments within a gradient-based optimization framework. Validation of the technique is provided with a calibrated ex-vivo study from the da Vinci Research Kit (DVRK) robotic system, where we perform quantitative analysis on the errors in each DOF of our tracker. Additionally, we perform several detailed comparisons with recently published techniques that combine visual methods with kinematic data acquired from the joint encoders. Our experiments demonstrate that our method is competitively accurate while relying solely on image data.


I. INTRODUCTION
Minimally invasive surgery (MIS) has provided surgeons with a less invasive method of accessing the surgical site, at the cost of having less control and information about the operation compared with open surgery. Laparoscopic instruments reduce the surgeon's dexterity and ability to sense force feedback from applied tissue pressure, and the limited field of view of the surgical camera makes self-localization challenging and increases the cognitive workload on the surgeon. In addition to this, the learning curve for MIS is steep, with surgeons taking significant periods of time to obtain mastery of the techniques [1]. In recent years, computer assisted surgery (CAS) and robotics have played a large role in reducing these complications through advanced instruments, control and visualization. Using the surgical console or laparoscope display, pre- and intra-operative imaging can be integrated into the surgical workflow, improving planning and understanding during the operation. In robotic systems, master manipulators are used to control articulated instruments which provide the surgeon with precision and dexterity which rival open surgery. However, significant challenges remain with achieving full integration of computer assistance and robotics within MIS. One such challenge is knowing the pose of the instruments that the surgeon is working with during the operation. This can be used to provide direct benefits such as dynamic motion constraints [2] or to detect tool-tissue interactions [3]. Alternatively, the motion data from tracked instruments can be used to help quantify the training process for junior surgeons, giving specific feedback on areas of weakness, or to provide metrics for surgical skill.
Early methods of instrument tracking involved attaching external electromagnetic or optical markers to the instruments and then estimating pose with a specialized tracking system [4], [5], and these methods remain popular today. However, the process of attaching markers to instruments as well as introducing tracking systems to the operating room (OR) complicates the surgical workflow and adds issues with sterilization and cost. In contrast, image-based solutions based on computer vision provide an alternative that can be realised entirely in software with no modification to the surgical setup. This is hugely advantageous as methods can be easily translated to clinical use without an extensive process of distributing markers to hospitals and training medical staff how to attach them correctly [6].
Estimating the pose of instruments using the images from a surgical camera involves extracting image features such as edges, points or regions and then solving alignment cost functions which measure the agreement between parameterized models of the target object and the extracted features. This has been achieved using pipelines of simple models [7], where manually specified thresholds are iteratively applied to estimate parameters, and from an information maximization perspective [8]. More recent methods achieve greater robustness and accuracy by building much more complex cost functions where parameterized models are iteratively fit to image data; however, optimization in the case of articulated instruments has proved challenging [9].
As an alternative to complex generative models, discriminative models have also shown strong performance, particularly when accompanied by larger training datasets. These usually take the form of 2D sliding window detectors [10], [11] but dealing with in-plane rotation of laparoscopic instruments is challenging. This can be achieved with rotated features [12]; however, online updates to the window orientation require an additional tracker. As more procedures are carried out with robotic instruments, interest in tracking these articulated joints has increased. Using deep neural networks to directly regress articulated joint locations has been demonstrated with excellent results [13]-[15]. However, for surgical instruments these methods are limited to 2D pose estimation. For 3D localization, mainstream computer vision methods [16], [17] have achieved success in learning pose distributions from vast datasets which are used to find plausible candidates. However, for robotic surgical instruments, training data in the quantities required to perform this type of modelling does not yet exist. In this case the most straightforward method of achieving 3D pose estimation is to use the kinematics of the robot, for which the several mm of absolute positioning error at the tip is corrected by 2D detections, for instance using learned texture features on the instrument head [18] and with rendered templates [19]. Although these methods achieve excellent accuracy, they are limited as they require real-time access to the robot API to read the joint data. Although this is feasible in controlled laboratory setups, in the operating room this access is uncommon. In addition to this, articulated laparoscopic instruments are unlikely to support joint access at any point, reducing the scope of this type of method.
In our recent work [20], we demonstrated a region-based tracking method which solved for 3D pose by aligning a rigid CAD model with image features and optical flow. In this work we have made several significant improvements. Firstly, the original work was limited as it could not track the articulated DOFs of robotic instruments, because the optimization was only performed over the parameter set of a single rigid Euclidean transform. Here we incorporate the articulated DOFs, which can be achieved naturally within the CAD model alignment system. This involves extending the Jacobians to take into account the rotation of the wrist and claspers of the robotic instruments. To the best of our knowledge, this is the first method of gradient-based optimization which is capable of tracking articulated robotic instruments in 3D without the need for external markers or kinematic data from the robot. This is a significant advantage of our method as it is applicable to both articulated laparoscopic instruments and robotic systems that generally do not give access to public APIs to read joint encoder data. Additionally, our method enables the tracking of flexible [21] and hydraulic [22] surgical robots, which typically provide very inaccurate encoder-based tracking. Our method also allows retrospective analysis of the numerous available datasets where only video data has been captured. A further improvement of our method is that we introduce an online learning system to dynamically update the color models used to generate segmentations. This enables our method to handle more complex appearance and lighting changes. A final contribution of our current work is the extensive comparative evaluation against 2 currently published 3D robotic instrument tracking methods. This is a meaningful contribution as very few published works make direct comparison to other methodologies.

A. 3D Tracking With Level Sets
3D instrument tracking attempts to estimate the parameters of the transform ${}^{c}T_{m}$ between the camera coordinate frame $F_c$ and a model-centric instrument coordinate system $F_m$ (see Figure 2a). When the target object is fully rigid, this transform is a 6 DOF Euclidean transform made up of a rigid rotation $R \in SO(3)$ and a translation $t \in \mathbb{R}^3$. However, for complex articulated and deforming objects, the standard rigid transformation ${}^{c}T_{m}$ is augmented with a separate transform ${}^{m}T_{\mathrm{warp}}$ which articulates the model relative to its base coordinate frame. The entire transform is parameterized by a vector $\theta$; however, we generally omit this for brevity and refer to ${}^{c}T_{m}\,{}^{m}T_{\mathrm{warp}}(\theta)$ as $T$. Region-based methods of estimating the parameters of $T$ involve using an estimate of this transform to position the vertices of a CAD model of the instrument in $F_c$ and generating one or more silhouette regions from the projection of these vertices onto the camera plane using the classic pinhole camera model (see Figure 2). Pose estimation is then formulated as finding the set of parameters such that the generated model silhouettes match data silhouettes obtained from a pixel-wise classification of the image pixels [23]-[27]. Many methods [27], [28] perform a 2-step estimation process whereby a full data silhouette is extracted from the image and backprojected to allow reverse engineering of the pose parameters in a separate step. However, [23], [29] proposed a direct method which bypasses obtaining a full data silhouette and instead assesses the model silhouette using local information from around the projection. This formulation is greatly simplified compared with a 2-step process as it does not require complex regularizations to maintain a suitable shape when finding the data silhouette, instead relying on a strict shape prior provided by the CAD model projection. Bayesian approaches using learned shape spaces have also been used to this end [30]-[32].
In typical 3D tracking frameworks, a single contour is used to model the entire shape [24], [33], [34]. This allows the problem to be cast as contour matching using silhouettes. This simplification affords a great deal of invariance with respect to the chosen object and typically works well when the appearance model between foreground and background is strong, resulting in a clean contour. However, for manufactured robotic instruments, this simplification ignores strong internal homogeneous regions which can be useful in generating strong delineating contours (see Figure 1b) between the plastic shaft and the metallic clevis. A particular advantage of this additional contour is that it constructs a fully visible single contour, which is not the case for a binary silhouette as this contour intersects the edge of the image. This can in principle provide information about foreshortening and additionally constrain the instrument when the clevis is occluded by tissue.
Estimating the optimal 3D pose using region-based methods involves defining an energy functional $E_r$ (r denotes region) which measures the alignment of K data silhouettes obtained from statistical models over the image data with K model silhouettes generated from projections of a surgical instrument CAD model. This functional (Equation 1) is composed of a sum over K binary alignments, where the form of each summed cost mirrors a standard region-based segmentation [35]. In this functional, the terms $f(I(x), \chi_i)$ and $f(I(x), \chi_{n(i)})$ are functions which return the probability that the pixel data $I(x)$ belongs to either the class $i$ or the set of all other classes $n(i)$. Each statistical model is dependent on appearance parameters $\chi_i$ for the $i$th region. The term $H(\cdot)$ represents the smoothed Heaviside function, which is commonly used in mathematical models to filter other functions by discrete membership and in this case is used to indicate if a pixel $x$ belongs to the silhouette $i$ or the background. This silhouette is described by a closed contour $C_i$ which is represented as a level set by embedding it in a signed distance function $\phi$. This representation is preferable to parametric competitors such as splines as it allows greater mathematical flexibility and does not suffer from numerical problems during optimization. This distance function is directly generated from the projection of the model and hence this function, and the contour, are parameterized by $\theta$.
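As an illustration of how a region-based cost of this kind can be evaluated, the following minimal sketch computes a pixel-wise energy for a single silhouette on one row of pixels. The arctan form of the smoothed Heaviside, the negative log-likelihood form of the energy, the sign convention (positive distance inside the silhouette) and all names and values are assumptions chosen for illustration; they are not taken from this work.

```python
import math

def smoothed_heaviside(phi, eps=1.0):
    """Smooth step: close to 1 well inside the contour (phi > 0), close to 0 well outside."""
    return 0.5 + math.atan(phi / eps) / math.pi

def region_energy(phi, p_fg, p_bg, eps=1.0):
    """Negative log-likelihood of a foreground/background partition.

    phi  : signed distance per pixel (assumed positive inside the silhouette)
    p_fg : per-pixel probability of the silhouette class
    p_bg : per-pixel probability of all other classes
    """
    energy = 0.0
    for d, pf, pb in zip(phi, p_fg, p_bg):
        h = smoothed_heaviside(d, eps)
        energy -= math.log(h * pf + (1.0 - h) * pb)
    return energy

# Hypothetical row of 6 pixels: the classifier says the instrument is on the right.
p_fg = [0.1, 0.1, 0.2, 0.8, 0.9, 0.9]
p_bg = [1.0 - p for p in p_fg]

phi_aligned = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]  # contour between pixels 2 and 3
phi_shifted = [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5]    # contour two pixels too far left

e_aligned = region_energy(phi_aligned, p_fg, p_bg)
e_shifted = region_energy(phi_shifted, p_fg, p_bg)
```

Under these assumptions the energy is lower when the projected contour coincides with the boundary suggested by the pixel probabilities, which is the property that a gradient-based pose search exploits.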
We use random forests (RFs) to provide the response $f(\cdot)$, allowing data silhouettes to be extracted from a single background region. RFs are popular for solving many challenging problems including pose estimation [36], semantic image segmentation [37] and camera relocalization [38]. They have been shown to be fast, parallelizable and accurate while providing simplicity to the user and an ability to handle even high dimensional data [39]. An RF is an ensemble learner where a collection of randomized decision trees vote on a hypothesis for an input $x$, and the votes are aggregated into a single output using an averaging scheme. The decision trees are constructed as a sequence of linear classifiers $y = wx$ which direct input samples to one of two child nodes depending on a thresholding of $y$. This parent-to-child splitting is applied recursively until $x$ reaches a leaf node where a posterior distribution is assigned.
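The traversal and averaging described above can be sketched as follows. The tree layout (tuples for split nodes, dictionaries for leaf posteriors), the hand-written trees and the feature values are hypothetical and are only meant to show the mechanics; training of the forest is not shown.

```python
def tree_predict(node, x):
    """Walk one tree. Split nodes are (w, threshold, left, right); leaves are posteriors."""
    while isinstance(node, tuple):
        w, threshold, left, right = node
        y = sum(wi * xi for wi, xi in zip(w, x))  # linear classifier y = w . x
        node = left if y < threshold else right
    return node  # dict mapping class label -> probability

def forest_predict(trees, x):
    """Average the leaf posteriors reached in every tree."""
    total = {}
    for tree in trees:
        for label, p in tree_predict(tree, x).items():
            total[label] = total.get(label, 0.0) + p
    return {label: p / len(trees) for label, p in total.items()}

# Two hypothetical single-split trees over a 2-D feature vector.
tree_a = ([1.0, 0.0], 0.5,
          {"background": 0.9, "instrument": 0.1},
          {"background": 0.2, "instrument": 0.8})
tree_b = ([0.0, 1.0], 0.3,
          {"background": 0.7, "instrument": 0.3},
          {"background": 0.1, "instrument": 0.9})

posterior = forest_predict([tree_a, tree_b], [0.8, 0.1])
```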
Rather than using RGB pixel intensities directly, we instead transform our training data into a feature space made up of the Opponent 1 channel, the Red channel, the a channel from the CIE Lab color space and a Gabor filter output. A small but important modification which we make to our training implementation compared with [20] is to use class balancing. In normal MIS images, background data are much more common than instrument data which, in the case of a 0-1 indicator loss function, leads to learning decision boundaries which favor selecting background labels over foreground labels in ambiguous cases. However, when working within our silhouette-based framework, correctly labelling foreground examples so that a complete silhouette is observed is more important than eliminating isolated regions of noise (effectively, false negatives are much more detrimental than false positives).
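One common way to balance classes is to weight each training sample inversely to the frequency of its class. The text does not state which balancing scheme is used, so the inverse-frequency weighting below, together with the function name and the label counts, is an assumption for illustration only.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}

# Hypothetical training set in which background pixels dominate 9 to 1.
labels = ["background"] * 90 + ["instrument"] * 10
weights = balanced_class_weights(labels)

# After weighting, both classes contribute the same total mass to the loss.
mass = {c: weights[c] * labels.count(c) for c in weights}
```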
To improve the quality of the segmentation used to drive the region-based pose estimation, we can make improvements to the RF. Firstly, as we only wish to classify the background and foreground in regions near the model contour, it makes sense to learn a highly specific model for the appearance using only pixels which sit close to this boundary. As we have a full 3D model of the instrument, we can generate automatic ground truth segmentations from the signed distance function $\phi$ and select training data from a 30-pixel-wide boundary; this value was chosen experimentally. After 5 frames, we retrain the forest. Preliminary experiments showed that the most effective strategy was to learn a constant foreground model from the first frame and update the background model data online by sampling from the first frame. This prevents model drift from affecting the training data significantly by incorrectly placing background pixels into the foreground class and vice-versa. This works as we use a bag-of-pixels model which is resilient to movements of the tissue that occur in normal operating interaction. However, upon camera motion the background model would have to be relearned. We could in principle detect this motion with optical flow and reinitialize the model from the segmentation boundary once the camera motion ceases. This technique was found to be much more effective than using the current frame to update the background model, as the latter leads to drift when tracking begins to fail.
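Selecting training pixels from a band around the projected contour can be sketched as below. The sketch assumes that the 30-pixel band is centred on the contour (15 pixels on each side) and that $\phi$ is positive inside the silhouette; both conventions, and the dictionary-based image representation, are illustrative assumptions.

```python
def select_boundary_samples(phi, band_width=30.0):
    """Split pixels within band_width / 2 of the contour into training sets.

    phi : dict mapping (row, col) -> signed distance to the projected contour,
          assumed positive inside the model silhouette.
    """
    half = band_width / 2.0
    foreground = [px for px, d in phi.items() if 0.0 <= d <= half]
    background = [px for px, d in phi.items() if -half <= d < 0.0]
    return foreground, background

# Hypothetical 1-row image, 100 pixels wide, with the contour at column 50.
phi = {(0, c): float(c - 50) + 0.5 for c in range(100)}
fg, bg = select_boundary_samples(phi)
```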

B. Optical Flow Tracking
When using a silhouette to estimate the pose of any object, a significant challenge arises because of ambiguities in the mapping between pose and silhouette. A simple example is a sphere: when it is rotated to any angle, its silhouette does not change. A similar problem occurs with the near-cylindrical shape of the instruments used in minimally invasive surgery which, when undergoing rotation around the roll axis, do not change their silhouette significantly.
To solve this problem, we propose to combine the silhouette-based features, which represent the surface appearance of the instrument as a bag of pixels, with multiple independent Lucas-Kanade optical flow features [40]. This retains enough surface spatial information to allow the ambiguous DOF to be estimated without the penalty of a highly non-convex cost function, which is common in full photo-consistency based object tracking. The idea of tracking 2D information on the instrument surface as an additional method of constraining the pose estimation is very simple and works on the principle that if we can match several 2D tracked image points to 3D points on the model surface, we can estimate the 3D transformation to the instrument by minimizing the reprojection error between the predicted 2D point locations $[x, y]^T$ and their correspondences $[\hat{x}, \hat{y}]^T$ in the image. This is defined by the objective energy function $E_p$ (Equation 2) where, similarly to Equation 1, p denotes the use of a point-based cost. In this function, $\|\cdot\|_2^2$ denotes the squared $L_2$ norm, although other distance metrics are commonly used [41]. $[\hat{x}_i^{t+1}, \hat{y}_i^{t+1}]^T$ denotes a corresponding point location in the frame at time $t+1$ which was matched with the point projected from the vertex location $X_i^t$ at $t$. $W^{t+1}$ is the set of matched points between frames at times $t$ and $t+1$. $K$ is the calibration matrix for the classic pinhole camera model.
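A point-based cost of this form can be sketched as a sum of squared reprojection errors under a pinhole model. In the sketch below the intrinsics, the 3D points (already expressed in the camera frame, in mm) and the matched pixel locations are hypothetical values, and lens distortion is ignored.

```python
def project(K, X):
    """Pinhole projection of a camera-frame point X = (X, Y, Z)."""
    fx, fy, cx, cy = K
    return (fx * X[0] / X[2] + cx, fy * X[1] / X[2] + cy)

def point_energy(K, points_cam, matched_pixels):
    """Sum of squared L2 distances between projected points and their matches."""
    energy = 0.0
    for X, (x_hat, y_hat) in zip(points_cam, matched_pixels):
        x, y = project(K, X)
        energy += (x - x_hat) ** 2 + (y - y_hat) ** 2
    return energy

K = (800.0, 800.0, 360.0, 288.0)                # fx, fy, cx, cy (hypothetical)
points_cam = [(10.0, 0.0, 80.0), (0.0, -20.0, 80.0)]
matched_pixels = [(461.0, 288.0), (360.0, 90.0)]  # flow-tracked locations at t+1

e_p = point_energy(K, points_cam, matched_pixels)
```

Here the two points project to (460, 288) and (360, 88), so the residuals are 1 and 2 pixels respectively.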

C. Modelling Articulation With Kinematic Chains
In MIS, manufactured robotic manipulators such as surgical instruments have a known set of possible transformations which constrain the vertices of each joint to rotate or translate around or along a single axis (see Figure 5). Hence, this allows the warping transform m T warp to be represented as a composition of several single axis transforms n−1 T n which are applied consecutively to different subsets of the model vertices.
A kinematic chain is the most common method of describing a robot manipulator. It divides the manipulator into an assembly of Γ links, or rigid bodies, each of which defines a coordinate frame F. These links are connected together at a shared axis known as a joint, where for a Γ-link chain there are at most Γ − 1 joints. The coordinate frames of consecutive links are related by a single 4 × 4 transform ${}^{n-1}T_{n}$ which is described with one or more DOFs, specifying how many parameters are required to fully locate the geometry of the connected $n$th link in the reference frame of the parent $(n-1)$th link [42]. The most common case for robotic manipulators is to use a single DOF joint where the transform is defined to rotate around 1 axis (rotary) or translate along 1 axis (prismatic), and in fact any K DOF joint can be modelled as a series of single DOF joints [42].
When combined together, the links and joints of a kinematic chain describe how a point $X$ defined in the local coordinate system $F_j$ of the $j$th link ($j \leq \Gamma$) can be transformed into the coordinate system of the base frame of the robot as:

$$X_{F_0} = {}^{0}T_{1}\,{}^{1}T_{2} \cdots {}^{j-1}T_{j}\, X_{F_j} \quad (3)$$

where ${}^{0}T_{1}\,{}^{1}T_{2} \cdots {}^{j-1}T_{j}$ can be compactly represented as ${}^{0}T_{j}$, $X_{F_j}$ is the representation of $X$ in $F_j$ and $X_{F_0}$ is the representation of $X$ in $F_0$.
There are several methods to define the transform between neighbouring links and, for general transforms, 6 DOFs are required to fully specify the relative orientation. However, for single DOF joints, the Denavit-Hartenberg (DH) representation [43] defines the $n$th joint to be parallel to the $x = 0$ plane of $F_{n-1}$, effectively cancelling out 2 degrees of freedom, 1 in rotation and 1 in translation, and reducing the number of parameters to 4: 2 distances and 2 angles [44]. 1 distance parameter is required to describe how far along the x axis of $F_{n-1}$ the plane defined by joints $n-1$ and $n$ lies and 1 angle parameter describes the rotation between the joints in this plane. These 2 parameters are denoted $a_{n-1}$ and $\alpha_{n-1}$ respectively. Describing how $F_n$ is attached to the z axis of $F_n$ and orientated relative to $F_{n-1}$ involves a further 2 parameters. Firstly, the distance along this common axis between where $a_{n-1}$ from link $n-1$ intersects the common axis and where $a_n$ from link $n$ intersects the common axis is defined as $d_n$ and describes the vertical shift between the two links. Additionally, the rotation around the z axis of $F_n$ between the 2 links is defined as $\theta_n$. When applied to a prismatic joint $i$, $a_i$, $\alpha_i$ and $\theta_i$ are fixed and $d_i$ is the DOF, whereas for a revolute joint $i$, $a_i$, $\alpha_i$ and $d_i$ are fixed and $\theta_i$ is the DOF. These 4 rotation and translation operations are applied consecutively to provide a single transform ${}^{n-1}T_{n}$ as:

$${}^{n-1}T_{n} = R_{x_{n-1}}(\alpha_{n-1}) \cdot T_{x_{n-1}}(a_{n-1}) \cdot R_{z_{n}}(\theta_{n}) \cdot T_{z_{n}}(d_{n}) \quad (4)$$

where $R_{x_{n-1}}$ refers to a 4 × 4 transform composing a rotation matrix around the x axis of frame $F_{n-1}$ with a zero translation and $R_{z_n}$ has the same meaning but the rotation component is defined around the z axis of frame $F_n$. $T_{x_{n-1}}$ and $T_{z_n}$ refer to the same concept, but the rotation part of the transform is the identity matrix and the translation part is a translation along the x and z axes of frames $F_{n-1}$ and $F_n$ respectively.
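The composition in Equation 4 and the chaining of link transforms can be sketched with plain 4 × 4 matrices as follows. The two-joint parameter set used in the example is hypothetical and is not the DH table of any da Vinci instrument.

```python
import math

def matmul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]]

def rot_z(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def trans_x(a):
    return [[1, 0, 0, a], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def trans_z(d):
    return [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, d], [0, 0, 0, 1]]

def dh_transform(alpha_prev, a_prev, theta, d):
    """Equation 4: R_x(alpha_{n-1}) . T_x(a_{n-1}) . R_z(theta_n) . T_z(d_n)."""
    T = rot_x(alpha_prev)
    for M in (trans_x(a_prev), rot_z(theta), trans_z(d)):
        T = matmul(T, M)
    return T

def chain(dh_params):
    """Compose consecutive link transforms into the base-to-link transform."""
    T = trans_x(0.0)  # identity
    for params in dh_params:
        T = matmul(T, dh_transform(*params))
    return T

def apply(T, X):
    """Transform a 3-D point by a 4x4 homogeneous transform."""
    x = [X[0], X[1], X[2], 1.0]
    return [sum(T[i][k] * x[k] for k in range(4)) for i in range(3)]

# Hypothetical chain: (alpha_{n-1}, a_{n-1}, theta_n, d_n) per joint.
dh_params = [(0.0, 0.0, math.pi / 2, 0.0),
             (math.pi / 2, 10.0, 0.0, 0.0)]
T_0_2 = chain(dh_params)

origin_in_base = apply(T_0_2, [0.0, 0.0, 0.0])
z_point_in_base = apply(T_0_2, [0.0, 0.0, 1.0])
```

With these values the origin of the second link frame lies 10 units along the y axis of the base frame, because the first joint rotates the link offset by 90 degrees about z.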

D. DH Parameters for da Vinci Robotic Instruments
In this work we focus solely on working with the instruments of the da Vinci robotic system, particularly the LND instrument which is commonly used in surgical procedures to control a suturing needle. However, the methods are easily applicable to any robotic instrument with the appropriate minor modifications. The LND, like any da Vinci instrument, has 3 DOFs on the wrist. The first is the wrist pitch (WP), which articulates the entire wrist to mimic the motion of a human wrist, enabling the mirroring of motions such as stitching to be captured more precisely. The second DOF is the wrist yaw (WY), which corresponds to a coordinated motion of two mechanical joints representing the claspers and enables the claspers to be oriented towards a target. The final DOF allows the clasper to open and close so that the instrument can grasp and hold objects. This results in the final parameterisation of our instrument being the 6 rigid DOFs of the model-to-camera rotation and translation and a further 3 DOFs which describe how the instrument wrist is oriented relative to the shaft of the instrument. The DH parameters are given in Figure 4 and the relationship of each frame to the instrument links can be seen in Figure 5.

E. Optimization
We jointly optimize over the region-based energy, referred to from here on as $E_r(\theta)$, and the point-based energy computed from optical flow, $E_p(\theta)$, using gradient descent and a weighting factor λ to allow both terms to have more equitable influence. In our experiments we set λ so that the Jacobians from the point estimates have 0.8 of the magnitude of the Jacobians from the region-based energy. In the derivative of the region-based term, $\partial\phi_k(x, \theta)/\partial x$ and $\partial\phi_k(x, \theta)/\partial y$ can be computed using finite differences and $\delta(\cdot)$ is the derivative of the smoothed Heaviside function. It corresponds to a smoothed Dirac delta function, which has the effect of weighting the derivative terms so that only the points around the contour contribute to the optimization.
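The balancing of the two terms can be sketched as below. The sketch assumes that the combined energy has the form $E_r + \lambda E_p$, that λ is recomputed from the current gradient norms at every step, and that a fixed step size is used; the text does not confirm any of these choices, and the gradient values are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def balance_weight(grad_region, grad_point, ratio=0.8):
    """Choose lambda so that ||lambda * grad_point|| = ratio * ||grad_region||."""
    n_point = norm(grad_point)
    if n_point == 0.0:
        return 0.0  # no matched points: fall back to the region term alone
    return ratio * norm(grad_region) / n_point

def descent_step(theta, grad_region, grad_point, step_size=0.1, ratio=0.8):
    """One gradient descent update on the assumed energy E_r + lambda * E_p."""
    lam = balance_weight(grad_region, grad_point, ratio)
    return [t - step_size * (gr + lam * gp)
            for t, gr, gp in zip(theta, grad_region, grad_point)]

# Hypothetical 2-parameter example.
grad_region = [3.0, 4.0]    # norm 5
grad_point = [0.0, 10.0]    # norm 10
lam = balance_weight(grad_region, grad_point)
theta_next = descent_step([0.0, 0.0], grad_region, grad_point)
```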
Equations 9 and 10 require derivatives of the 2D pixel coordinates with respect to the transform $T$ (Equations 11 and 12), where $[X, Y, Z]^T = {}^{c}T_{i} X_{F_i}$ is the representation of the vertex which generated the pixel $(x, y)$, transformed from the link frame $F_i$ into camera coordinates. The derivatives of these terms with respect to the translation and rotation are well known [24]; however, the derivatives with respect to the parameters of the articulated components merit further discussion. They are obtainable in closed form by differentiating the kinematic chain with respect to each articulated component parameter. The variable in Equations 11 and 12 which depends on these components is the projected 3D vertex position $x = K\,{}^{c}T_{i} X_{F_i}$, where $X_{F_i}$ is defined in the local coordinate system of the link $i$ on which $X$ lies and ${}^{c}T_{i}$ defines the transform from the camera frame to this frame. The frame-to-camera transform in this expression breaks down into a product of the transforms ${}^{j-1}T_{j}$ along the kinematic chain, where ${}^{j-1}T_{j}$ is the transform from the parent of frame $F_j$ to $F_j$. If we consider the parameter $\theta_j$ which is responsible for rotating the $j$th link around the z axis of its frame (see Section II-C), then the derivative is obtained by applying the product rule to each transform of the kinematic chain and, as each parameter directly influences only a single $T$, all but a single term are zero. The vertex $X_{F_i}$ is effectively transformed into the coordinate frame $F_j$, as this derivative measures how motion of the frame $j$ influences vertices in frames towards the distal end of the kinematic chain.
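The product-rule argument can be checked numerically. The sketch below builds a small chain in which each link transform is a fixed part followed by a rotation about the local z axis, replaces the rotation of one joint by its element-wise derivative, and compares the result with a central finite difference. The chain geometry and joint values are hypothetical, and the camera projection is left out so that only the kinematic part is tested.

```python
import math

def matmul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def matvec(T, x):
    return [sum(T[i][k] * x[k] for k in range(4)) for i in range(4)]

def translation(x, y, z):
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0, 0.0], [0.0, c, -s, 0.0],
            [0.0, s, c, 0.0], [0.0, 0.0, 0.0, 1.0]]

def rot_z(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0, 0.0], [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

def d_rot_z(t):
    """Element-wise derivative of rot_z with respect to its angle."""
    c, s = math.cos(t), math.sin(t)
    return [[-s, -c, 0.0, 0.0], [c, -s, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]

def chain_point(fixed, thetas, X, d_index=None):
    """Map homogeneous X from the last link frame to the base frame.

    If d_index is given, the rotation of that joint is swapped for its
    derivative, which gives d(point) / d(thetas[d_index]) because every
    other factor of the product is independent of that parameter.
    """
    T = translation(0.0, 0.0, 0.0)  # identity
    for n, (P, t) in enumerate(zip(fixed, thetas)):
        R = d_rot_z(t) if n == d_index else rot_z(t)
        T = matmul(matmul(T, P), R)
    return matvec(T, X)

# Hypothetical 3-joint chain and a vertex on the last link.
fixed = [translation(0.0, 0.0, 50.0),
         matmul(rot_x(math.pi / 2), translation(9.0, 0.0, 0.0)),
         matmul(rot_x(-math.pi / 2), translation(0.0, 3.0, 0.0))]
thetas = [0.3, -0.5, 0.8]
X = [1.0, 2.0, 0.5, 1.0]

j = 1
analytic = chain_point(fixed, thetas, X, d_index=j)

h = 1e-6
plus = list(thetas)
plus[j] += h
minus = list(thetas)
minus[j] -= h
numeric = [(p - m) / (2.0 * h)
           for p, m in zip(chain_point(fixed, plus, X),
                           chain_point(fixed, minus, X))]
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The homogeneous component of the analytic derivative is zero, as expected for the derivative of a point position.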

III. EXPERIMENTS
To evaluate the accuracy of the articulated tracking we perform quantitative ex-vivo and qualitative in-vivo studies. However, as several recently published methods of articulated instrument tracking provide comparison datasets, we can also perform a quantitative comparison with these methods.

A. Implementation details
Our implementation makes use of OpenGL/GLSL and we describe our model as a tree of nodes in a parent-child relationship. For the example da Vinci LND model, this consists of a base frame containing the shaft which has a single child node containing the wrist model (see Fig. 5). This again has a single child node containing the clasper axis but no geometry, which in turn has 2 child nodes containing each clasper. At each successive pose iteration, the vertices of each node are projected to an index image which contains the numerical index of the node which owns the geometry of the vertex. This is used to determine which vertices influence each term in the Jacobian computation. Currently our non-optimized method is not real-time, with processing of a single 720x576 image taking ≈ 0.3 seconds per gradient descent step and between 10 and 20 steps required for convergence. However, the cost function gradients are evaluated as an independent sum over pixels and are therefore highly parallelizable, with similar implementations achieving real-time performance [24]. We solve our cost function by reinitializing from the pose in the previous frame but do not incorporate any motion modelling to make forward predictions. Our method requires manual initialization in the first frame, which we achieve with a GUI-based tool (see footnote 2). This is used to initialize the pose of the instrument model, which in turn is used to generate the initial ground truth image segmentation to train the RF.
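The parent-child model layout and the per-node indices written to the index image can be sketched as follows. The class, the node names and the depth-first numbering are hypothetical illustrations of the structure described above and are not taken from the released implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    has_geometry: bool = True
    children: List["Node"] = field(default_factory=list)
    index: int = -1

def assign_indices(root):
    """Number nodes depth-first; the index is the value drawn into the index image."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        node.index = len(order)
        order.append(node)
        stack.extend(reversed(node.children))
    return order

def influenced_nodes(node):
    """A joint parameter attached to `node` moves that node and all its descendants."""
    out = [node]
    for child in node.children:
        out.extend(influenced_nodes(child))
    return out

# Tree for an LND-like instrument: shaft -> wrist -> clasper axis -> 2 claspers.
claspers = [Node("clasper_left"), Node("clasper_right")]
clasper_axis = Node("clasper_axis", has_geometry=False, children=claspers)
wrist = Node("wrist", children=[clasper_axis])
shaft = Node("shaft", children=[wrist])

nodes = assign_indices(shaft)
moved_by_wrist = [n.name for n in influenced_nodes(wrist)]
```

Looking up the index of the node that owns a pixel then tells the Jacobian computation which articulation parameters can affect that pixel: a shaft pixel is unaffected by the wrist parameters, whereas a clasper pixel is affected by all of them.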

B. Ex-vivo experiments
We construct 2 ex-vivo experiments using the da Vinci LND instrument and several different animal tissue samples. The camera maintains a static position and observes 1000-frame sequences showing an instrument moving with articulation of the wrist and claspers. The DVRK platform is used to capture synchronised joint and video data and we use the GUI-based manual initialization technique to correct errors in the joint configuration and obtain a more accurate ground truth. Plots showing the translation and rotation parameters of the instrument reference frames, the errors in the wrist and clasper position and errors in the relative position of 3 static points on the MR LK tracked model and the ground truth model (see Figure 1b) are shown in Figures 6 and 8. We evaluate parameter errors in 3D space directly, rather than measuring 2D projection error, given that most applications of 3D tracking are impacted more heavily by errors in world space. Furthermore, using the error between corresponding points allows us to represent the accuracy of our algorithm without dependence on an arbitrarily chosen origin. Renderings of the instrument pose over the video frames are shown in Figures 7 and 9.
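An error between corresponding model points can be sketched as below: the same static points are transformed by the estimated and the ground truth pose and the Euclidean distances are averaged. The poses, the point coordinates and the use of a mean (rather than per-point plots) are hypothetical choices for illustration.

```python
import math

def transform(R, t, X):
    """Apply a rigid transform (3x3 rotation R, translation t) to a 3-D point."""
    return [sum(R[i][k] * X[k] for k in range(3)) + t[i] for i in range(3)]

def mean_point_error(pose_est, pose_gt, model_points):
    """Mean Euclidean distance between corresponding points under two poses."""
    total = 0.0
    for X in model_points:
        a = transform(pose_est[0], pose_est[1], X)
        b = transform(pose_gt[0], pose_gt[1], X)
        total += math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return total / len(model_points)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pose_gt = (identity, [0.0, 0.0, 100.0])
pose_est = (identity, [3.0, 4.0, 100.0])   # estimate offset by 3 mm and 4 mm

# Three hypothetical static points on the instrument model (mm).
model_points = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [-5.0, 0.0, 10.0]]
error_mm = mean_point_error(pose_est, pose_gt, model_points)
```

Because the distances are measured between corresponding points, the result does not change if the model origin is moved, which is the property motivated in the text.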

C. Quantitative Comparison Results
Recent articulated robotic tracking methods [9], [19], [45] allow us to provide a quantitative comparison between our fully visual technique and methods that combine visual tracking with robotic kinematic information. Our first comparison is between our method and that of [9], which provided a method of tracking general 3D articulated objects and contained a validation section on robotic surgical instruments. This method used a region overlap metric similar to that of our technique, incorporating multiple instrument regions to provide added robustness. However, this was formulated within a gradient-free optimization as the simple overlap metric did not allow for analytical Jacobians to be computed. This led to slow and often inaccurate solutions for robotic instruments, although the method worked well for retinal instruments and human hands. We show results using the 4-frame evaluation used in the original paper where the 25th, 75th, 125th and 175th frames are manually segmented. We use classification metrics of precision, recall and the F1 score to compare the overlap between the manual segmentation and the rendering of the instrument in that frame. Precision (P), Recall (R) and F1 score (F1) are computed as

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2(P \times R)}{P + R} \quad (16)$$

where the F1 score is the harmonic mean of the precision and recall and is often used as a weighted average of the two measures. The original work of [9] tends to underlap the ground truth slightly, whereas our method tends to overlap slightly, which is reflected in the higher precision value for [9] and the higher recall value for our work. However, when taken together, the F1 score shows much higher performance for our method. In this dataset, we make one modification to our method: as the first frame of video does not show a good view of the instrument clasper, the color distribution for this class was badly learned from the first frame. To counter this, we chose a later frame to learn our RF; this is similar to the original authors, who chose frames from across the video to learn their color model.

Footnote 2: https://github.com/surgical-vision/viz/
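The overlap metrics of Equation 16 can be computed from a rendered mask and a manual segmentation as sketched below. The tiny binary masks are hypothetical and only illustrate the case in which the rendering slightly overlaps the ground truth.

```python
def overlap_scores(predicted, ground_truth):
    """Precision, recall and F1 between two binary masks (nested lists of 0/1)."""
    tp = fp = fn = 0
    for p_row, g_row in zip(predicted, ground_truth):
        for p, g in zip(p_row, g_row):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g and not p:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Manual segmentation and a rendering that extends one column too far.
ground_truth = [[0, 1, 1, 0],
                [0, 1, 1, 0]]
rendered = [[0, 1, 1, 1],
            [0, 1, 1, 1]]

precision, recall, f1 = overlap_scores(rendered, ground_truth)
```

In this example every ground truth pixel is covered (recall of 1) while a third of the rendered pixels fall outside it, which lowers the precision and gives an F1 score of 0.8.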

TABLE II: Overlap precision, recall and F1 score for the 4 frames used in the evaluation in [9], reported for the method of [9] and for our method. As we performed this evaluation ourselves using hand-crafted masks, the results reported in this table for the method of [9] are slightly different from, albeit better than, the results in the original paper.
The recent method and data of [19] allow us to compare with the state-of-the-art for 3D articulated instrument tracking, which combines robot kinematics with a point-based detector to provide accurate real-time tracking. We evaluate on 2 phantom sequences with LND instruments which contain complex articulations that make visual tracking extremely challenging. The results are evaluated quantitatively in Table III, where the authors manually labelled the centre locations of several tool parts that were used in their point-based detection system to obtain a ground truth. The authors then computed the relative pose between the predicted instrument location and the manually labelled instrument location for all frames in the video. Qualitative evaluation is shown in Figure 11. In our analysis of dataset 2, we encountered 1 tracking failure for our method at frame 1200 when the left instrument obtained an inaccurate pose due to a challenging period of articulation. Although both instruments go through periods of the video when they exhibit inaccurate tracking, this particular sequence was followed by a period when the instruments crossed over one another. This caused large drift in the left instrument which was deemed unrecoverable and a manual initialization was required.

Figure (comparison with [9]): This dataset shows a challenging in-vivo sequence with 2 da Vinci LND instruments. The top row shows the raw video frames 25, 75, 125 and 175, the corresponding frames from the method of [9] are in row 2 and the frames from our method are in row 3. Although the data is challenging, both methods show good alignment. Typically our method has better alignment but the right instrument fails to track the clasper opening in frame 175, which is correctly tracked by [9].

IV. CONCLUSION AND DISCUSSION
In this work, we present a novel system for tracking the articulated DOFs of surgical robotic instruments in 3D using a fully vision-based region and point based solution. Our system trivially extends to different instrument models and color schemes, which greatly increases the range of robotic systems it can be tested on. Our extensive comparative evaluation draws together data from a wide variety of sources and demonstrates the superior performance of our method against the only other published 3D articulated instrument tracking method that does not make use of robot joint encoders, demonstrating the advantage of using gradient-based searches for pose estimation. We also obtain competitive results when compared with state-of-the-art methods which, unlike our method, rely heavily on the data from the robot joint encoders, which is a well documented drawback [20]. The method however shows errors in the roll rotation DOF due to the visual symmetry of the instrument about this DOF, which prevents the region-based tracker from locking onto reliable shape information. In principle this is best solved by incorporating more reliable detection information on the instrument surface, for instance making use of recent robust feature detection methods [13]. Additionally, depth estimation is a challenge, particularly due to the small baseline of robotic surgical cameras.

Table (comparison with [19]): The rotation and translation error is computed for each frame from the manually labelled ground truth part locations. Although our results are not as accurate as the method of [19], we are still able to obtain good tracking over the majority of the sequence and critically are not relying on kinematics to perform our estimation.
The main limitation of our method is its requirement for a manual initialization; however, this can potentially be provided with user interaction, for instance using the GUI tool we have developed. Additionally, we noticed in our experiments that the model suffers from drift, which is a common problem in model-based trackers which incorporate temporal information. Future work will look mainly at the integration of prior information to constrain the rigid pose space from a 6 DOF transform to a restricted space, and in principle these priors can be learned from kinematic data offline.