Object tracking has numerous applications, including traffic surveillance [1], [2], [3], [4], augmented reality [5], mobile robot navigation [6], and robotic assembly on a moving line [7]. For many of these applications involving 3-D objects, it is not sufficient to just do 2-D tracking; the tracking algorithm must also provide the 3-D pose of the object. For example, for the case of robotic assembly on a moving line in a modern factory, it is essential that the 3-D pose of the object being tracked, such as a car engine cover, be fully known at all times so that the robot end-effector can interact with the object in meaningful ways. Since the 3-D pose of a rigid object involves 6 DOF, three for translation and three for rotation, the tracking algorithm for such applications must yield all six parameters of the pose. These parameters must be estimated despite occlusions, background clutter, and varying illumination.

The contributions that have been made in the past on tracking that allow for the estimation of the 3-D pose of an object fall into two categories depending on whether or not backtracking is used in matching model and scene features. In the first category, we have approaches that use point features. The matching strategies used in this category are usually one-shot, meaning the scene features are paired up with the best possible candidates from the model (but this is done only once), and iterative in pose space, meaning a gradient-based approach is used to find the best possible 3-D pose that minimizes some error functional between the model and the scene. This synopsis applies to the work reported in [8], [9], [10].

With such a one-shot correspondence search strategy, the pose estimate often drifts away from the true pose, especially when the target object moves nonsmoothly and the predicted pose for each frame has a large discrepancy from the true pose. In order to alleviate this problem, some approaches employ a robust estimator, such as the M-estimator [9], for minimizing the error function, or a voting-based strategy, such as the generalized Hough transform in the pose space [1]. Marchand *et al.* [11] estimate a rough location of the target in the scene by calculating the 2-D affine transformation between consecutive frames and then apply a multiresolution generalized Hough transform to estimate the finer pose. Vacchetti *et al.* [12] use an appearance-based offline registration method to get around the drift problem associated with the one-shot approaches to pose estimation. Recently, there have been attempts to get around the need for explicit matches between the model and the scene by directly estimating the location of model contours in the scene [4].

These approaches are not suitable for accurate estimation of 3-D pose on a continuing basis as an object is tracked against cluttered backgrounds. The main source of difficulty with these approaches appears to be a lack of a backtracking-based search framework for matching model features with scene features. A backtracking-based solution to the problem must of necessity include some sort of an uncertainty model for locational and other properties of scene features. We believe that the problems that can be caused by the lack of backtracking also apply to the recent work of Lippiello *et al.* [13].

The second category of approaches for tracking while the 3-D pose is constantly updated combines a backtracking-based strategy for matching with pose-uncertainty modeling in order to achieve greater robustness in matching [2], [6], [14], [15]. The contribution by Lowe [14] uses the Gauss–Newton method for minimizing the error between the predicted pose and the true pose. Koller *et al.* [2] and Tonko and Nagel [15] use the extended Kalman filter (EKF) for updating the positional uncertainties associated with the model features. Although these approaches are similar to ours in using EKF for estimating the target object pose, the feature-correspondence-seeking strategies in these approaches are not as elaborate as needed to accommodate jerky motions of the sort we address in this paper. The backtracking strategy used in [2] is similar to that described in [6]. Such one-feature-at-a-time backtracking often fails when the motion is too jerky, as we will argue in the rest of this section. And the EKF implementation described in [15] does not even use any backtracking. So, it too cannot be expected to deal with sudden large changes in object pose during tracking. As we will explain in this paper, backtracking is necessary for coping with large sudden variations in object pose, but the strategy used for backtracking must allow the system to completely abandon a pose hypothesis as opposed to merely undoing a previous model-to-scene match for a single feature.

Of the approaches listed above, the prior contribution by Kosaka and Kak [6] is particularly relevant to the new research reported here. Although this Kalman-filter-based formalism was originally developed for vision-based mobile robot navigation, it was later shown to be useful for 3-D object tracking also [16]. The work reported in both [6] and [16] is based on an incremental pose-update scheme in a prediction–verification framework. In this framework, the pose of an object in each input scene is predicted with uncertainty. As the features in the model of the object are sequentially matched with the features in the input scene, an EKF is used to reduce the pose uncertainty by observing the error between the matched features. As more and more features are matched, the estimation of the target pose becomes increasingly accurate.

The goal in this paper is to use the work reported in [6] as a starting point for developing a fast and accurate 3-D tracker that also continuously yields the 3-D pose of the object being tracked even when the object motions are large and jerky. Our research here goes beyond what was reported in [6] in the following important ways.

When a target object moves with a large variation in its motion, the predicted statistics of the object pose for each image frame tend to deviate significantly from the true pose. To compensate, a large amount of motion uncertainty has to be assigned to the predicted pose. As our experiments have shown, large uncertainty in the predicted pose causes maximum-likelihood frameworks for feature correspondence estimation, such as the one in [6], to break down. To understand what we mean by “break down,” note that all a Kalman filter does is update the pose mean and covariance. The uncertainty associated with such updates shrinks with each iteration even when the pairings between the model features and the scene features are inappropriate. *Inappropriate pairings between the model and the scene features are more likely to take place in the presence of large motion uncertainties.* To get around this problem in the research reported here, after we have updated the pose, we reexamine the model-to-scene feature pairings that went into the update calculations. If the new pose (and the new bounds on the uncertainties) does not support these pairings, they are undone in their entirety (*as opposed to one at a time, as in the traditional implementation of the backtracking step in EKF [6]*) and new pairings are sought. This process is repeated until the updated pose and the set of matched model-to-scene features support each other fully and reciprocally. A detailed description of the hypothesis generation and verification scheme is presented in Sections III-F and III-G.

Although more robust, the framework described above is iterative and can therefore exact a performance penalty unless care is taken in the initial selection of model-to-scene feature matchings. To minimize this potential performance penalty, our system first rank-orders the model features on the basis of a number of criteria. At each iteration, scene features are sought for only the top-ranked model features. Experiments have shown that this significantly reduces the number of backtrackings needed in our framework. Rank ordering of the model features is described in Section III-F.

In the next section, we present an overview of our tracking system. In Sections III, IV, and V, we present a detailed description of our pose estimation algorithm, which is used iteratively in the tracking system. In Section VI, we present pose estimation and tracking results with a few target objects. Finally, we conclude in Section VII.

## II. Tracking Algorithm

### A. Workspace Description and Definition of Pose

We use three coordinate frames to represent features in our workspace: the world coordinate frame, the camera coordinate frame, and the target object coordinate frame. The world coordinate frame, which we denote as *W*, is the reference coordinate frame for points in the workspace. This frame is usually attached to a fixed reference in the workspace. The camera coordinate frame *C* is the camera-centered coordinate frame whose *x* and *y* axes are aligned with horizontal and vertical directions of the camera image plane, respectively, and *z* axis is aligned perpendicular to the image plane. The target coordinate frame *T* is a coordinate frame that all model feature points of a target object are defined with respect to. Fig. 1 shows these coordinate frames and how they are related to each other.

The transformation of the feature vectors from *T* to *C* has 6 DOF: three for translation and three for rotation. We define the pose of an object as the 6-D random vector **p** = (*t*_{x}, *t*_{y}, *t*_{z}, φ_{x}, φ_{y}, φ_{z})^{T}, where *t*_{x}, *t*_{y}, *t*_{z} are the translational components and φ_{x}, φ_{y}, φ_{z} are the rotational components of the transformation from *T* to *C.* The three rotational components represent the Euler-III-type angles of rotation about the three axes *x*, *y*, *z* of *C*, respectively, as defined in [17]. The use of boldface font for **p** signifies that **p** is a random vector. We assume **p** has a Gaussian distribution with mean $\bar{\bf p}$ and covariance matrix Σ_{p}. Alternatively (and more usefully), the pose vector **p** is represented in the form of a homogeneous transformation matrix from *T* to *C* and denoted as ^{C}*H*_{T} using the Denavit–Hartenberg notation [18]. When we want to show that the elements of this matrix are directly related to the pose vector, we write the matrix as ^{C}*H*_{T}(**p**).

Although we formulate our pose estimation algorithm for the relative pose of *T* with respect to *C*, the algorithm can be easily adapted to estimate the pose with respect to *W* by replacing ^{C}*H*_{T}(**p**) in (1) in the next section with ^{C}*H*_{W}^{W}*H*_{T}(**p**), where ^{W}*H*_{T}(**p**) represents the pose of the target with respect to *W.* ^{C}*H*_{W} is given by camera calibration.

### B. Modeling Target Objects

The model features extracted from the wireframe model, meaning the actual straight-line edges on the boundary surface of the object, are represented by the Cartesian coordinates of the two extremities in *T.* That is, a model feature *m* is represented by two 3-D vectors *m*^{k} = (*x*^{k},*y*^{k},*z*^{k})^{T}, *k* = 1,2 that are the 3-D Cartesian coordinates of the two extremities of *m* in the coordinate frame *T.* The superscript *k* of each vector denotes the extremity that it represents.

Fig. 2(a) shows a simple, mostly polyhedral object at the top and its wireframe model at the bottom. Fig. 2(b) shows a more complex object at the top and its wireframe model at the bottom. We refer to the latter object as the train station object. This object will be used to illustrate the various steps of our tracking algorithm in the rest of this paper.

### C. Tracking System Overview

Tracking in our system is executed by applying a model-based pose estimation algorithm to each consecutive frame in the input image sequence. An overview of our tracking system is depicted in Fig. 3. As shown in this figure, our tracking system consists of three modules: the feature extraction module, the pose estimation module, and the pose prediction module.

The feature extraction module extracts straight-line feature descriptors from the input scene image along with the associated measurement uncertainties. This module is presented in Section IV.

The pose estimation module searches for the best match between the features received from the feature extraction module and the model features projected into the camera image by the pose prediction module. The pose estimation module uses an EKF to estimate the pose in such a way as to minimize the error in the image space between the object as perceived through the extracted features and the object model as projected into the camera image. Details of this module are presented in Section III.

The pose prediction module predicts the pose of the target for the next image frame using a linear extrapolation method based on motion estimates of the target. Such a predicted pose of the target is used to project the object model into the camera image for constructing the expected view of the object. The initial value of the target object pose in the beginning of the tracking sequence is assumed to be given.^{1} Details of this module are presented in Section V.

## III. Pose Estimation Module

Although feature extraction is the first step that a camera image is subjected to, we first explain how we carry out pose estimation in our system. The discussion on pose estimation will permit us to explain the overall uncertainty calculus used by our system, which in turn makes the explanation of the feature extraction module more efficient. Representing and manipulating scene and model uncertainties are key aspects of our pose estimation algorithm.

Overall, our system is aware of two different kinds of uncertainties and it keeps track of them separately: the feature extraction module associates a *measurement uncertainty* with each straight-line feature. The measurement uncertainty depends on what it takes to group together a series of edge fragments into a single straight-line feature. The other kind of uncertainty—the kind that is the focus of this section—is due to the discrepancy between the true pose of the model object and its currently known pose as the object is in motion. This is the uncertainty that must be associated with the model features that are projected into the camera image by the pose prediction module. We will refer to this uncertainty as the *pose prediction uncertainty.*

For describing our pose estimation algorithm in detail, we start with presenting the definition of the image error between the projected model features and their corresponding scene features in the following section.

### A. Constraint Equation for Pose Error in Image Space

Previously, we talked about a model object as being defined in an object-centered coordinate frame denoted as *T.* As an object moves in space, this coordinate frame moves with the object. In other words, the pose of a moving target object in its own coordinate frame *T* never changes. Our goal in tracking is to constantly update the pose vector **p**, which is equivalent to updating the transformation matrix ^{C}*H*_{T}(**p**). The pose of an object is estimated by predicting the current value and uncertainty of the ^{C}*H*_{T}(**p**) matrix from its previous value and the motion uncertainty parameters currently in effect, and then, using this predicted matrix to project the relevant model features into the camera frame *C.* For obvious reasons, we can refer to these projected model features in the camera frame as the expectation map. A difference between the expectation map and what the camera sees at the current moment, the difference being caused by the pose error, is then used for updating the **p** vector, the uncertainties, and the various motion parameters. Since ^{C}*H*_{T}(**p**) is random, the locations of the projected model features are denoted by their means and covariances that can be derived from the mean and covariance of **p** at the time of the projection.

The expectation map for a given value of **p** is constructed by applying ^{C}*H*_{T}(**p**) to the vectors representing the end points of the straight-line model features. For such a projection, we use a perspective projection model that is widely used for such purposes [20]. Using the perspective projection model, the projection of the two end points *m*_{i}^{k}, *k* = 1,2 of a model feature *m*_{i} is described by the following equation:
$$\left(\matrix{u_{m_i}^kw\cr\cr v_{m_i}^kw\cr\cr w}\right) = \left(\matrix{\alpha_u & 0 & u_0 & 0\cr 0 & \alpha_v & v_0 & 0\cr 0 & 0 & 1 & 0}\right) {^CH_T({\bf p})}\left(\matrix{ x^k_i\cr\cr y^k_i\cr\cr z^k_i\cr\cr 1}\right)\eqno{\hbox{(1)}}$$where *u*_{mi}^{k},*v*_{mi}^{k} are the image coordinates, *w* the scaling parameter, and (*x*_{i}^{k},*y*_{i}^{k},*z*_{i}^{k}) the actual coordinates in the coordinate frame *T* for the end points *m*^{k}_{i}. α_{u}, α_{v}, *u*_{0}, *v*_{0} are the intrinsic camera parameters that are given by the camera calibration, for which we use the algorithm presented in [21].^{2} Using this equation, we denote the projection of a model feature *m*_{i} for a given value of **p** as a 4-D vector *g*_{mi,p} as follows:
$$g_{m_i,{\bf p}} = [u_{m_i}^1, v_{m_i}^1, u_{m_i}^2, v_{m_i}^2]^T.\eqno{\hbox{(2)}}$$
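As an illustration of (1) and (2), the following minimal sketch projects the two end points of a model edge into the image. It is not the authors' implementation: the function names, the plain nested-list representation of ^{C}*H*_{T}, and the numerical values are all assumptions made for the example. The pose matrix is taken as given, so no particular Euler-angle convention is implied.

```python
def project_point(H, point_T, alpha_u, alpha_v, u0, v0):
    """Project a 3-D point given in frame T to pixel coordinates, following (1)."""
    x, y, z = point_T
    # Apply the 4x4 homogeneous transform H (T to C) to the point (x, y, z, 1).
    xc, yc, zc = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in H[:3])
    if zc <= 0:
        raise ValueError("point is not in front of the camera")
    # The scaling parameter w in (1) equals zc; dividing by it yields (u, v).
    return alpha_u * xc / zc + u0, alpha_v * yc / zc + v0


def project_edge(H, edge_T, alpha_u, alpha_v, u0, v0):
    """Return the 4-D vector g = [u1, v1, u2, v2] of (2) for a two-end-point edge."""
    g = []
    for end_point in edge_T:
        g.extend(project_point(H, end_point, alpha_u, alpha_v, u0, v0))
    return g


# Hypothetical setup: target frame 2 units in front of the camera, no rotation.
H = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 2],
     [0, 0, 0, 1]]
g = project_edge(H, [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0)], 800.0, 800.0, 320.0, 240.0)
print(g)  # [320.0, 240.0, 720.0, 440.0]
```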

For a given distribution of **p** with mean $\bar{\bf p}$ and covariance matrix Σ_{p}, the mean of *g*_{mi,p}, which we denote as $\bar{g}_{m_i,{\bf p}}$, is calculated by replacing ^{C}*H*_{T}(**p**) with ^{C}*H*_{T}($\bar{\bf p}$) in (1). The uncertainty of *g*_{mi,p}, which we denote as Σ_{gmi,p}, is approximated using Σ_{p} as follows:
$$\Sigma_{g_{m_i,{\bf p}}} = J(g_{m_i,{\bf p}},{\bf p})\Sigma_{\bf p}J(g_{m_i,{\bf p}},{\bf p})^T\eqno{\hbox{(3)}}$$where *J*(*g*_{mi,p},**p**) is the Jacobian matrix of the pixel coordinates of *g*_{mi,p} with respect to **p**. For estimating the pose of the object, the projected model features in the expectation map must be matched with the straight-line features that are extracted from the edge map of the input scene. Let *z*_{j} be the scene feature that is selected for matching with the camera projection of the model feature *m*_{i}. We denote this scene feature as a 4-D vector with the image coordinates of its two end points as follows:
$$z_j = [u_{z_j}^1, v_{z_j}^1, u_{z_j}^2, v_{z_j}^2]^T.\eqno{\hbox{(4)}}$$Because of the various uncertainties that are involved in edge detection and straight-line extraction, *z*_{j} is also a random vector. We assume that this vector can also be characterized by a Gaussian distribution. Details on estimating the mean and covariance for *z*_{j} are presented later in Section IV when we describe our algorithm for extracting such scene features from input images.
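The first-order propagation in (3) can be sketched as follows. This is only an illustration under stated assumptions: the Jacobian is approximated by central differences rather than derived analytically, and `toy_projection` is a hypothetical linear stand-in for the projection of (1), chosen because the propagation is exact for a linear map.

```python
def numerical_jacobian(func, p, eps=1e-6):
    """Central-difference Jacobian of func at p (rows: outputs, columns: pose entries)."""
    n_out = len(func(p))
    jac = [[0.0] * len(p) for _ in range(n_out)]
    for c in range(len(p)):
        p_plus, p_minus = list(p), list(p)
        p_plus[c] += eps
        p_minus[c] -= eps
        f_plus, f_minus = func(p_plus), func(p_minus)
        for r in range(n_out):
            jac[r][c] = (f_plus[r] - f_minus[r]) / (2.0 * eps)
    return jac


def propagate_covariance(func, p_mean, sigma_p):
    """Approximate the covariance of func(p) as J * Sigma_p * J^T, as in (3)."""
    J = numerical_jacobian(func, p_mean)
    n_out, n_in = len(J), len(p_mean)
    J_sigma = [[sum(J[r][k] * sigma_p[k][c] for k in range(n_in)) for c in range(n_in)]
               for r in range(n_out)]
    return [[sum(J_sigma[r][k] * J[c][k] for k in range(n_in)) for c in range(n_out)]
            for r in range(n_out)]


def toy_projection(p):
    """Hypothetical linear map standing in for the projection g(p)."""
    return [2.0 * p[0] + p[1], 3.0 * p[1]]


sigma_g = propagate_covariance(toy_projection, [0.5, -1.0], [[1.0, 0.0], [0.0, 4.0]])
# For this linear map, J = [[2, 1], [0, 3]], so sigma_g is close to [[8, 12], [12, 36]].
```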

The vector *g*_{mi,p} gives us the predicted locations of the end points of the model straight-line feature *m*_{i}. On the other hand, the vector *z*_{j} corresponds to the actual measured locations of such end points in the image space. If the predicted pose corresponds exactly to the current pose of the target, then obviously, *g*_{mi,p}− *z*_{j} will be zero. So, when that is not the case, any differences between the two must be minimized. Therefore, the following constraint equation must be satisfied by any pose update mechanism:
$$f({\bf p}, m_i, z_j) = g_{m_i,{\bf p}}-z_j = 0.\eqno{\hbox{(5)}}$$

### B. EKF-Based Recursive Pose Update Framework

We use a recursive framework that is similar to the framework used in [6] for updating the pose of the target given the error between the model features and the corresponding scene features. For each model and scene feature pair in a given set of feature correspondences, our framework uses an EKF [23] to transform the pose parameters to presumably more accurate pose parameters that optimally minimize the error between the corresponding features. The updated pose parameters serve as the initial state for the next pose update with another feature correspondence. The fact that the updated pose parameters are used as the initial state for the next update explains why we call our framework recursive.

Let *C* = {(*m*_{1}, *z*_{j1}),…,(*m*_{NC},*z*_{jNC})} be a set of model and scene feature correspondences where (*m*_{i},*z*_{ji}), *i* = 1,…,*N*_{C} denotes that the model feature *m*_{i} is matched with the scene feature *z*_{ji}. *N*_{C} is the cardinality of *C.* We also denote the pose vector after pose update using the match (*m*_{i},*z*_{ji}) as **p**_{i} and its corresponding mean and covariance as $\bar{\bf p}_{i}$ and Σ_{pi}, respectively. Our pose update equations transform **p**_{i−1} into **p**_{i} while minimizing the error between *z*_{ji} and *g*_{mi,pi−1}. With regard to the pose update processing for each new image frame, the statistics of the initial pose **p**_{0} are given by the pose prediction module using the estimated pose from the previous image frame as described in Section V. Let $\hat{z}_{j_i}$ represent the actual measured *z*_{ji}. We assume that the feature measurement error is additive white Gaussian and we denote this error as ξ_{ji} with error covariance *V*_{ji}. By linearizing and rearranging (5) in the vicinity of ($\bar{\bf p}_{i-1}$, $\hat{z}_{j_i}$) using a Taylor series expansion, we get the following equation:
$$y_i = M_i {\bf p} + e_{j_i}\eqno{\hbox{(6)}}$$where
$$\eqalignno{y_i &= -f({\bar{{\bf p}}_{\bf i-1}}, m_i, \hat{z}_{j_i}) + { \partial f({\bf p}, m_i, z_{j_i})\over \partial \bf p} \bar{{\bf p}}_{\bf i-1}\cr M_i &= { \partial f({\bf p}, m_i, z_{j_i})\over \partial \bf p}\cr e_{j_i} &= { \partial f({\bf p}, m_i, z_{j_i})\over \partial z_{j_i}} (z_{j_i} - \hat{z}_{j_i}).&\hbox{(7)}}$$We denote the covariance matrix of *e*_{ji} as *E*_{ji}, which can be easily calculated from the covariance matrix *V*_{ji} of ξ_{ji}.

Using the EKF theory, the minimization of the constraint in (5) via the linearizations in (6) through (7) is achieved if the statistics of the state vector **p**_{i} are updated by the following equations:
$$\eqalignno{{\bar{{\bf p}}_{\bf i}} &= {\bar{{\bf p}}_{\bf i-1}} - K_if({\bar{{\bf p}}_{\bf i-1}}, m_i, \hat{z}_{j_i})\cr K_i &= \Sigma_{{\bf p}_{\bf i-1}} M_i^T (E_{j_i} + M_i \Sigma_{{\bf p}_{\bf i-1}} M_i^T)^{-1}\cr\Sigma_{{\bf p}_{\bf i}} &= (I - K_i M_i) \Sigma_{{\bf p}_{\bf i-1}}.&\hbox{(8)}}$$
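A single update of (8) can be sketched as below. This is a dependency-free illustration, not the authors' code: the helper names (`matmul`, `transpose`, `inverse`, `ekf_update`) are assumptions, matrices are plain nested lists, and the caller is assumed to supply the residual *f*, the Jacobian *M*_{i}, and the covariance *E*_{ji} already evaluated at the current pose mean.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inverse(A):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(A)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ekf_update(p_mean, sigma_p, f_val, M, E):
    """One step of (8). f_val is f(p_mean, m_i, z_hat); M is df/dp; E is cov(e)."""
    Mt = transpose(M)
    S = matmul(matmul(M, sigma_p), Mt)
    S = [[s + e for s, e in zip(s_row, e_row)] for s_row, e_row in zip(S, E)]
    K = matmul(matmul(sigma_p, Mt), inverse(S))          # Kalman gain K_i
    correction = matmul(K, [[v] for v in f_val])
    p_new = [p - c[0] for p, c in zip(p_mean, correction)]
    KM = matmul(K, M)
    n = len(p_mean)
    I_minus_KM = [[float(r == c) - KM[r][c] for c in range(n)] for r in range(n)]
    return p_new, matmul(I_minus_KM, sigma_p)
```

Because the returned mean and covariance have the same form as the inputs, the function can be called once per correspondence, feeding each result into the next call, which mirrors the recursive structure described above.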

### C. Building the Expectation Map

For constructing the expectation map, we identify two groups of model straight-line features that should not be projected onto the camera image plane. The first is the group of model features that are self-occluded by other parts of the object for a given pose matrix. For identifying this type of model features, we use the binary space partitioning (BSP) tree representation of a polyhedral model [24].

The second group is the group of model straight-line features that are parallel to the optic axis of the camera. Although such a group of line segments is expected to be visible in the expectation map, it is viewed as a group of very short line segments or points. It is obviously undesirable to match these kinds of model features to the scene. Currently, we exclude the line segments whose direction is within 20° of the optic axis of the camera. Fig. 4 shows an example of an expectation map constructed for the target object of Fig. 2(b). This map is superimposed on the scene edges extracted from an image frame. The expectation map consists of thick black lines.
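The second exclusion test can be sketched as follows, assuming the edge end points have already been transformed into the camera frame *C*, whose *z* axis is the optic axis. The function name and the strict inequality at exactly 20° are assumptions made for illustration.

```python
import math


def is_nearly_parallel_to_optic_axis(p1_C, p2_C, threshold_deg=20.0):
    """True if the edge direction lies within threshold_deg of the camera z axis."""
    d = [b - a for a, b in zip(p1_C, p2_C)]
    length = math.sqrt(sum(c * c for c in d))
    if length == 0.0:
        return True  # a degenerate edge has no direction, so it is dropped as well
    # An edge has no preferred sense, so |cos| is compared rather than cos.
    cos_angle = min(1.0, abs(d[2]) / length)
    return math.degrees(math.acos(cos_angle)) < threshold_deg
```

Edges for which this predicate holds would simply be left out of the expectation map, alongside the self-occluded edges identified with the BSP tree.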

### D. Selecting Match Candidates for Projected Model Features

The locations of the projected model features in the image space possess uncertainty owing to the uncertainty associated with the predicted pose of the target. This positional uncertainty for the projected model features defines regions in the image space in which the system should search for the scene feature candidates to be matched with the projected model features.

For each projected model edge *g*_{mi,p0}, the covariance matrices for the positions of its two end points are the 2 × 2 submatrices on the diagonal of Σ_{gmi,p0} that is calculated by (3). These covariance matrices define elliptical regions around the end points of *g*_{mi,p0} in the image space. We define an approximate convex hull that encloses these elliptical regions and use this convex hull as the search region for match candidates. Fig. 5 shows an example of defining the search region for a projected model feature in the image space. Note that our convex hull is approximate in the sense that it is polygonal, which allows for efficient computations.
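One possible way to build such a polygonal search region is sketched below. The construction is an assumption, since the exact polygon is not specified here: each end-point ellipse is bounded by its axis-aligned box at a hypothetical scale of `k` standard deviations, and the convex hull of the eight box corners is taken with Andrew's monotone chain algorithm.

```python
import math


def cross(o, a, b):
    """z component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def search_region(g_mean, sigma_g, k=2.0):
    """Polygon enclosing the k-sigma ellipses around both projected end points."""
    corners = []
    for e in (0, 2):  # offset of (u, v) for end point 1 and end point 2 in g
        u, v = g_mean[e], g_mean[e + 1]
        # Half-widths of the axis-aligned box that bounds the k-sigma ellipse.
        du = k * math.sqrt(sigma_g[e][e])
        dv = k * math.sqrt(sigma_g[e + 1][e + 1])
        corners += [(u - du, v - dv), (u + du, v - dv), (u + du, v + dv), (u - du, v + dv)]
    return convex_hull(corners)


def contains(polygon, point):
    """True if point lies inside or on a counter-clockwise convex polygon."""
    n = len(polygon)
    return all(cross(polygon[i], polygon[(i + 1) % n], point) >= 0 for i in range(n))
```

A bounding box is looser than the ellipse it encloses, so this region admits slightly more candidates than a tight hull would; those extra candidates are still subject to the distance test that follows.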

For the search region defined around *g*_{mi,p0}, the extracted scene edges that are inside this region are tested for match candidacy. Let $\hat{z}_{j}$ be a scene feature inside the search region of *g*_{mi,p0}. We evaluate the Mahalanobis distance measure for the image error between $\hat{z}_{j}$ and *g*_{mi,p0} as defined by the following equation:
$$d_{f({\bar{{\bf p}}_{\bf 0}}, m_i,\hat{z}_j)} = f({\bar{{\bf p}}_{\bf 0}}, m_i,\hat{z}_j)^T\Sigma_{f({{\bf p}_{\bf 0}}, m_i, z_j)}^{-1}f({\bar{{\bf p}}_{\bf 0}}, m_i,\hat{z}_j)\eqno{\hbox{(9)}}$$where Σ_{f(p0,mi,zj)} is the covariance matrix of *f*(**p**_{0}, *m*_{i}, *z*_{j}), which is calculated by the first-order approximation as follows:
$$\Sigma_{f({{\bf p}_{\bf 0}}, m_i, z_j)} = { \partial f({{\bf p}_{\bf 0}}, m_i, z_j)\over \partial {\bf p}_{\bf 0}} \Sigma_{{\bf p}_{\bf 0}}{ \partial f({{\bf p}_{\bf 0}}, m_i, z_j)\over \partial {\bf p}_{\bf 0}}^T + V_{j}.\eqno{\hbox{(10)}}$$

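Assuming *f*(**p**_{0}, *m*_{i}, *z*_{j}) is locally Gaussian in the vicinity of $\bar{\bf p}_{0}$ and $\hat{z}_{j}$, the distance measure $d_{f(\bar{\bf p}_{0}, m_i, \hat{z}_j)}$ has a chi-squared distribution with 4 DOF, since *f*(**p**_{0}, *m*_{i}, *z*_{j}) is defined to be a 4-D vector. With a confidence level of 0.5, we choose *z*_{j} as a match candidate for *g*_{mi,p0} if $d_{f(\bar{\bf p}_{0}, m_i, \hat{z}_j)}$ is less than χ^{2}_{4,0.50}, which is 3.357.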
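The gating test of (9) can be sketched as follows. The sketch forms the covariance of *f* as the sum of the projected-feature covariance from (3) and the measurement covariance *V*_{j}, which is what the first-order approximation reduces to because *f* = *g* − *z*. The function names and the list-of-pairs layout for scene features are assumptions made for the example.

```python
CHI2_4DOF_50 = 3.357  # threshold quoted in the text for 4 DOF at the 0.5 level


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def mahalanobis_sq(g_pred, z_meas, sigma_f):
    """Distance of (9): f^T Sigma_f^{-1} f with f = g_pred - z_meas."""
    f = [g - z for g, z in zip(g_pred, z_meas)]
    return sum(fi * xi for fi, xi in zip(f, solve(sigma_f, f)))


def match_candidates(g_pred, sigma_g, scene_features, threshold=CHI2_4DOF_50):
    """Indices of scene features (z_hat, V) that pass the chi-squared gate."""
    kept = []
    for j, (z_hat, V) in enumerate(scene_features):
        # Covariance of f: projected-feature covariance plus measurement covariance.
        sigma_f = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(sigma_g, V)]
        if mahalanobis_sq(g_pred, z_hat, sigma_f) < threshold:
            kept.append(j)
    return kept
```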

The match candidates for each projected model feature in the expectation map constitute a set of model and scene feature correspondences. For convenience of notation, we denote such a set by *C*_{0} in the rest of this paper.

### E. Estimating Model and Scene Feature Correspondence Using Hypothesis Generation and Verification Scheme

After the set of model and scene feature correspondences is constructed, we determine the true correspondences, these being correspondences that satisfy certain criteria that we will present in the following subsections. A matching hypothesis, which we denote as *C*_{H} for convenience of notation, is a subset of the initial feature correspondence set *C*_{0}. For selecting *C*_{H}, we use a priority selection scheme that uses a certain weight measure for each feature correspondence pair in *C*_{0}. The weight measure is calculated with three heuristic rules that are described in the next subsection. While selecting feature correspondences for *C*_{H}, the feature pair that has a higher weight measure is given higher priority. Each time a feature correspondence pair is selected for a match hypothesis, the predicted pose **p**_{0} is updated with the feature pair. Hence, such updated pose, which we denote as **p**_{H}, represents the best estimate of the object pose for the feature correspondences that are currently selected for *C*_{H}. The feature correspondence selection procedure continues until the pose uncertainty associated with **p**_{H} is reduced below a certain threshold.

After *C*_{H} is constructed, it is verified with the two criteria that are described in Section III-G. If *C*_{H} is rejected, then the system regenerates *C*_{H} based on the criterion violated. Such regeneration procedures are also described in Section III-G. The hypothesis generation and verification process iterates until both verification criteria are satisfied or no more model and scene feature pairs are available for generating hypotheses.

In Fig. 6, the overall control flow of the hypothesis generation and verification scheme is shown. In the next subsection, we describe the details of how the matching hypothesis *C*_{H} is generated.

### F. Hypothesis Generation

Regarding the number of feature correspondences needed for a hypothesis, one widely accepted strategy, exemplified by the random sample consensus (RANSAC) approach [25], is to use only the minimum number of feature correspondences that guarantees a certain level of confidence in the estimated pose and then to verify the hypothesized feature correspondences with the estimated pose. The pose uncertainty associated with the estimated pose translates directly into the confidence level. Since we estimate the pose by minimizing the error between the projection of the model and the corresponding scene edges in the image space, we must also calculate the pose uncertainty in the image space. This requires that we project the 6 DOF uncertainty into the image plane.

In our EKF-based pose update framework, the extent to which each pose update reduces the uncertainty associated with the pose is controlled by the Kalman gain that is subject to the measurement uncertainty for the scene features, as shown in (8). Since the measurement uncertainty for scene features is subject to the image noise and the errors in the straight-edge detection process, it cannot be predicted in advance as to how much uncertainty would be reduced by a single iteration of pose update. We therefore update the pose incrementally for each matched pair of model and scene edge. At the same time, we compute the new uncertainty associated with the updated pose and project the uncertainty into the image plane. This iterative process stops either when we run out of all model edges, or when the updated pose uncertainty drops below a certain threshold, whichever comes first. The threshold for the projected pose uncertainty is chosen considering the level of pose estimation accuracy required for our application.
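The stopping rule described above can be sketched as a short loop. Everything specific here is a placeholder: `update_pose` stands for one EKF step of (8), `image_uncertainty` stands for the projection of the pose uncertainty into the image plane, and the toy usage collapses the pose to a single variance that each match halves.

```python
def generate_hypothesis(ranked_pairs, pose, update_pose, image_uncertainty, threshold):
    """Add pairs in priority order until the projected uncertainty is small enough."""
    hypothesis = []
    for pair in ranked_pairs:
        if image_uncertainty(pose) < threshold:
            break  # the pose is already certain enough; stop adding matches
        pose = update_pose(pose, pair)
        hypothesis.append(pair)
    return hypothesis, pose


# Toy stand-ins: the "pose" is a single variance, and every match halves it.
pairs = [("m1", "z3"), ("m4", "z2"), ("m2", "z9"), ("m5", "z1")]
selected, final_variance = generate_hypothesis(
    pairs,
    8.0,
    update_pose=lambda variance, pair: variance / 2.0,
    image_uncertainty=lambda variance: variance,
    threshold=1.5,
)
print(selected, final_variance)
```

The loop also ends when the ranked pairs are exhausted, which corresponds to running out of model edges before the threshold is reached.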

When selecting the model and scene feature correspondences for the hypothesized correspondence set *C*_{H}, giving higher priority to the feature pairs that are more likely to be correct matches reduces the number of hypotheses that must be generated and verified before the correct feature correspondences are found. For such priority assignment, we calculate a weight measure for each model and scene feature correspondence using three heuristic rules. The heuristic rules are as follows:

1. Give high priority to a model feature that has a small number of matching candidates.
2. Give high priority to a model feature and scene feature pair if the Mahalanobis distance measure between these two features is small.
3. Give high priority to a model feature that is distant from its neighboring model features in the expectation map.

The first heuristic rule gives a higher weight to a model and scene feature pair if the model feature of the pair has fewer scene candidates than the other model features. If a projected model feature has many matching candidates, it is ambiguous which of the candidates it should be matched with. Hence, model features with fewer candidates have a higher chance of being correctly matched.

The second heuristic rule gives a higher weight to a feature pair (*m*_{i}, *z*_{j}) if it has a smaller value for the distance measure presented in (9). Although the predicted pose **p**_{0} is likely to contain errors, as we mentioned previously, **p**_{0} remains our current best estimate of the object pose. For this reason, we assume that any feature pair that has a small image error with regard to the current best estimate of the object pose has a higher chance of being a correct match.

The third heuristic rule is based on the observation that a model feature whose projection in the expectation map is geometrically distant, both in location and in direction, from the other projected model features is less likely to be mismatched. If two model features are located close to each other in the image space, significant parts of their search regions may overlap. Hence, it is likely that two different, similarly shaped candidate scene features fall in both search regions, making it more difficult to choose correct matches for the two model features. For example, as shown in Fig. 7, there is significant overlap between the search regions of the two projected model features labeled *MF*1 and *MF*2. Note here that there exist multiple scene features that have similar lengths and orientations in both search regions.

There have been previous approaches, including the one by Tonko and Nagel [15], that disregard model features that are geometrically close to each other in generating the expectation map. Our approach is different from those approaches in the sense that such geometrically close model features are included in the expectation map with low priority instead of being completely disregarded. With this strategy, the matches for the low-priority features are sought when the higher priority features fail to match, hence increasing the level of fault tolerance.

In order to use the third heuristic rule for calculating the matching weight, we define for each projected model feature *g*_{mi,p0} the distance measure *d*_{ς}(*g*_{mi,p0}) that describes how distant *g*_{mi,p0} is to other projected model features in the expectation map. The definition of *d*_{ς}(*g*_{mi,p0}) is as follows:
$$d_{\varsigma }(g_{m_i,{\bf p}_{\bf 0}}) = \min_{\forall g_{m_q,{\bf p}_{\bf 0}} \in G}g_{\varsigma }^T \Sigma_{\varsigma }^{-1}g_{\varsigma }\eqno{\hbox{(11)}}$$
where *G* is the set of projected model features in the expectation map, *g*_{ς} = *g*_{mi,p0} − *g*_{mq,p0}, and Σ_{ς} = Σ_{gmi,p0} + Σ_{gmq,p0}.
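As an illustration of (11), the following minimal sketch computes the isolation distance for a few projected features. It rests on assumptions that are not in the original text: the covariances are taken to be diagonal so that the quadratic form reduces to a sum of squared differences over summed variances, a feature is not compared with itself (*q* ≠ *i*), and the function name and toy data are hypothetical.

```python
def isolation_distance(i, means, variances):
    """Sketch of Eq. (11): smallest Mahalanobis-type distance from
    projected feature i to any other projected feature."""
    best = float("inf")
    for q, (mean_q, var_q) in enumerate(zip(means, variances)):
        if q == i:
            continue  # assumption: a feature is not compared with itself
        # Diagonal covariances assumed: Sigma = diag(var_i) + diag(var_q).
        d = sum((a - b) ** 2 / (va + vb)
                for a, b, va, vb in zip(means[i], mean_q, variances[i], var_q))
        best = min(best, d)
    return best


# Hypothetical 2-D features with unit variances; the third one is isolated.
means = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
variances = [(1.0, 1.0)] * 3
distances = [isolation_distance(i, means, variances) for i in range(3)]
```

In this toy example the third feature receives by far the largest distance, so under the third heuristic rule it would be given the highest priority.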

With the three heuristic rules listed earlier, we calculate the weight measure for each model feature and scene feature pair (*m*_{i},*z*_{j}) as follows:
$$W({{\bf p}_{\bf 0}}, m_i,\hat{z}_j) = { d_{\varsigma }(g_{m_i,{\bf p}_{\bf 0}})\over d_{f({{\bf p}_{\bf 0}}, m_i,\hat{z}_j)}n_{\rm cand}(g_{m_i,{\bf p}_{\bf 0}})}\eqno{\hbox{(12)}}$$
where *d*_{f(p0,mi,zj)} is the Mahalanobis distance of the image error between *g*_{mi,p0} and *z*_{j} as defined in (9) and *n*_{cand}(*g*_{mi,p0}) is the number of match candidates for *g*_{mi,p0}. For all members of *C*, we evaluate this weight measure before we start selecting the correspondence pairs for *C*_{H}. We then sequentially choose the feature pairs from *C* as sorted by the values of the composite weight *W*(**p**_{0}, *m*_{i}, *z*_{j}). Then, as we mentioned earlier, the current best estimate of pose **p**_{H} with regard to the current hypothesis set *C*_{H} is recursively updated with the newly added feature pair. This selection procedure iterates until the uncertainty associated with the updated pose **p**_{H} falls below a certain threshold.
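The ranking implied by (12) can be sketched as follows. This is only an illustration: the function name, the tuple layout, and the numeric values are hypothetical, and the recursive pose update that follows each selection is not shown.

```python
def matching_weight(d_isolation, d_image_error, n_candidates):
    """Sketch of Eq. (12): a large isolation distance, a small image error,
    and few candidates all raise the weight of a pair.
    A real implementation would need to guard against a zero image error."""
    return d_isolation / (d_image_error * n_candidates)


# Hypothetical candidate pairs:
# (model id, scene id, isolation distance, image error, number of candidates)
candidates = [
    ("m1", "z4", 40.5, 2.0, 1),
    ("m2", "z7", 0.5, 1.0, 3),
    ("m3", "z2", 8.0, 4.0, 2),
]

# Pairs are visited in order of decreasing weight.
ranked = sorted(candidates, key=lambda c: matching_weight(*c[2:]), reverse=True)
order = [(m, z) for m, z, *_ in ranked]
```

Here the isolated, unambiguous feature `m1` is considered first, and the crowded feature `m2` last.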

### G. Verifying Hypotheses

In the following two sections, we describe the details of the two criteria we use for verifying the hypothesis set *C*_{H} and how we backtrack over this set (by examining its subsets) when the criteria are violated.

*1) Hypothesis Verification and Modification With the Matching Consensus Criterion*

When a hypothesized feature correspondence set *C*_{H} is generated from the scene image, we evaluate the sum of the squared image plane errors between the model features and the scene features in the set *C*_{H} as follows:
$$d_{C_H,{{\bf p}_{\bf H}}} = \sum_{i=1}^{\vert C_{H}\vert }d_{f({{\bf p}_{\bf H}}, m_{H_i},\hat{z}_{j_{H_i}})}\eqno{\hbox{(13)}}$$
where the term *d*_{f(pH,mHi,zjHi)} is the image error between a particular model feature *m*_{Hi} and its corresponding scene feature *z*_{jHi}, as defined in (9) for the case of initial pose, and |*C*_{H}| denotes the cardinality of *C*_{H}. As we have previously mentioned, each such term has a Chi-squared distribution with 4 DOF. Hence, *d*_{CH,pH} also has a Chi-squared distribution with DOFs equal to four times the cardinality of *C*_{H}. With the confidence level of 95%, we reject the hypothesis *C*_{H} if *d*_{CH,pH} is greater than χ_{4|CH|,0.95}.
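The acceptance test built on (13) can be sketched with the standard library alone. Note the assumptions: the chi-squared quantile is computed with the Wilson-Hilferty approximation, which is swapped in here for an exact quantile function and is not part of the original method, and the function names and error values are hypothetical.

```python
from statistics import NormalDist


def chi2_quantile(dof, prob):
    """Wilson-Hilferty approximation to the chi-squared quantile
    (an approximation used here in place of an exact routine)."""
    z = NormalDist().inv_cdf(prob)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * a ** 0.5) ** 3


def passes_matching_consensus(pair_errors, confidence=0.95, dof_per_pair=4):
    """Accept the hypothesis if the summed image error of Eq. (13) does not
    exceed the quantile for 4 * |C_H| degrees of freedom."""
    total = sum(pair_errors)
    return total <= chi2_quantile(dof_per_pair * len(pair_errors), confidence)


# Hypothetical per-pair Mahalanobis errors for a three-pair hypothesis.
consistent = passes_matching_consensus([3.0, 2.5, 4.0])
with_outlier = passes_matching_consensus([3.0, 2.5, 30.0])
```

With three pairs the threshold corresponds to 12 DOF (roughly 21), so the first hypothesis is accepted and the second, which contains one large error, is rejected.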

In order to explain why this criterion is used for verifying *C*_{H}, we use the notion of the matching consensus. If the feature correspondence pairs in *C*_{H} are true matches, then all the matching pairs must be consistent with a certain estimate of the object pose. In other words, the scene features of the correspondence pairs in *C*_{H} must have reasonably small image errors with the corresponding model features projected with the true estimate of the object pose. In that sense, for the hypothesized feature correspondence set *C*_{H} to be accepted, the feature correspondence pairs in *C*_{H} should form a consensus set with regard to the updated pose **p**_{H}. For this reason, we call this criterion the matching consensus criterion.

When the matching consensus test fails for a hypothesis *C*_{H}, there are one or more feature correspondence pairs in *C*_{H} that are not consistent with the pose hypothesis **p**_{H}; these result in large image errors.

In order to remove these inconsistent correspondences from the hypothesis *C*_{H}, we use the following “leave one out” approach to detect the model-scene feature pairing that is most inconsistent with the rest of the pairings. This is done by applying the matching consensus criterion to each of the subsets of *C*_{H} in the following manner.^{3}

For each model and scene feature pair (*m*_{Hi},*z*_{jHi}) in *C*_{H}, we make a subset *C*_{Hi}, which is defined as follows:
$$C_{H_i} = C_H - \{ (m_{H_i}, z_{j_{H_i}})\}.\eqno{\hbox{(14)}}$$

For each subset *C*_{Hi}, we update the predicted pose **p**_{0}. Obviously, this pose calculation only uses the model-scene feature pairings in *C*_{Hi}. That is, the new updated pose would *not* include the feature pair (*m*_{Hi},*z*_{jHi}). The new updated pose is called **p**_{Hi}.

We evaluate the matching consensus criterion for each subset *C*_{Hi} with **p**_{Hi}. Let *C*_{Hmin} be the subset with the minimum matching consensus criterion value. The feature pair (*m*_{Hmin},*z*_{jHmin}) that corresponds to *C*_{Hmin} is chosen as an inconsistent correspondence pair. *C*_{Hmin} constitutes the modified hypothesis after removing the inconsistent feature pair from *C*_{H}.

This approach is based on the assumption that the inconsistent correspondences do not form a consensus set by themselves. Hence, if an inconsistent correspondence is removed from the hypothesis set *C*_{H}, the updated pose with the new hypothesis set is closer to the pose for the consensus subset in the hypothesis.

If *C*_{H} includes more than one inconsistent correspondence pair, the new hypothesis set *C*_{Hmin} may not satisfy the matching consensus criterion. In this case, we repeat the inconsistent pair detection procedure described earlier until the modified hypothesis subset satisfies the matching consensus criterion.
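The leave-one-out pruning described above can be sketched as a generic loop. Everything concrete in the example is a toy stand-in and an assumption: the "pose" is a single scalar estimated as the mean of the observations, the consensus error is a plain sum of squared residuals, and the acceptance threshold is fixed rather than tied to the cardinality of the set. Only the control flow is meant to mirror the text.

```python
def prune_inconsistent_pairs(pairs, estimate_pose, consensus_error, is_accepted):
    """Repeatedly drop the pair whose removal gives the lowest consensus
    error, until the remaining set is accepted."""
    current = list(pairs)
    removed = []
    while len(current) > 1 and not is_accepted(current, estimate_pose(current)):
        best = None
        for i in range(len(current)):
            subset = current[:i] + current[i + 1:]      # Eq. (14)
            error = consensus_error(subset, estimate_pose(subset))
            if best is None or error < best[0]:
                best = (error, i, subset)
        removed.append(current[best[1]])                # most inconsistent pair
        current = best[2]
    return current, removed


# Toy stand-ins (hypothetical): scalar pose, mean estimator, squared residuals.
def estimate_pose(obs):
    return sum(obs) / len(obs)


def consensus_error(obs, pose):
    return sum((o - pose) ** 2 for o in obs)


def is_accepted(obs, pose):
    return consensus_error(obs, pose) <= 1.0


kept, removed = prune_inconsistent_pairs(
    [10.0, 10.2, 9.9, 14.0, 10.1, 5.0],
    estimate_pose, consensus_error, is_accepted)
```

In this toy run the two outlying observations are removed one at a time, and the four mutually consistent observations remain.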

After we find the subset of *C*_{H} that satisfies the matching consensus criterion, the system uses the hypothesis generation algorithm that was described in Section III-F to add more feature correspondences to *C*_{H}, and verifies the modified hypothesis set again with the matching consensus criterion.

*2) Assigning Nil-Mappings and Verifying the Hypothesis Based on the Number of Nil-Maps*

If *C*_{H} satisfies the matching consensus criterion, the model features that are not included in *C*_{H} are projected into the image with pose **p**_{H}, and the matching candidates for such projected model features are sought.

Since an accepted *C*_{H} guarantees a certain bound on the projected uncertainty in the image plane, the remaining model features projected with **p**_{H} have small search regions. For example, Fig. 8(b) shows the search regions for the remaining model features when projected into the camera image with the pose updated with an accepted *C*_{H}. Fig. 8(a) displays the model-scene feature pairings in an accepted *C*_{H}. Ordinarily, on account of the tight bounds on the uncertainties associated with the projected model features at this point, there will exist at most a single candidate scene feature within the uncertainty region associated with any remaining model feature. If that is the case, that scene feature is chosen for matching with the model feature. If multiple scene features are found in this uncertainty region, the system selects the closest scene feature. And, if no scene feature is inside the uncertainty region, the model feature is assigned a nil-map.

We place a constraint on how many nil-maps are allowed for a given set of model features. It is entirely possible that a *C*_{H} was accepted for reasons of accidental alignment between a partial region of the model with a partial region of the scene without the object and its image being in true global alignment. So, if the number of nil-maps exceeds a threshold, we reject the entire *C*_{H} and regenerate another *C*_{H}. When a *C*_{H} is rejected on account of too many nil-maps, the model-to-scene correspondence in the rejected *C*_{H} is not allowed to occur again. For that reason, a rejected *C*_{H} is guaranteed not to appear again.
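The assignment of the remaining model features and the nil-map check can be sketched as follows. The names, the one-dimensional toy features, the absolute-difference distance, and the gate and threshold values are all hypothetical. In the method described above, the search region follows from the projected uncertainty rather than from a fixed gate.

```python
def assign_remaining(model_features, scene_features, distance, gate, max_nil):
    """Return (matches, accepted). matches maps each model id to the closest
    scene id inside its search region, or to None for a nil-map."""
    matches = {}
    for m_id, m in model_features.items():
        in_gate = [(distance(m, z), z_id)
                   for z_id, z in scene_features.items()
                   if distance(m, z) <= gate]
        # Closest candidate wins; no candidate means a nil-map.
        matches[m_id] = min(in_gate)[1] if in_gate else None
    nil_count = sum(1 for z_id in matches.values() if z_id is None)
    return matches, nil_count <= max_nil


# Toy 1-D features (hypothetical values).
model = {"a": 0.0, "b": 5.0, "c": 20.0}
scene = {"z1": 0.2, "z2": 0.6, "z3": 5.1}


def abs_distance(m, z):
    return abs(m - z)


matches, accepted = assign_remaining(model, scene, abs_distance, gate=1.0, max_nil=1)
_, accepted_strict = assign_remaining(model, scene, abs_distance, gate=1.0, max_nil=0)
```

Feature `c` has no scene feature within its gate and receives a nil-map, so the hypothesis is kept when one nil-map is tolerated and rejected when none is.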

SECTION VI

## Experimental Results

To best experience our experimental results, the reader is invited to visit the web site http://cobweb.ecn.purdue.edu/RVL/Projects/ModelBasedTracking/index.htm.

There, the reader will see a human shaking an object as it is being tracked in real time. Another demonstration at that site shows successful tracking even when the object is significantly occluded by a human waving his hand between the camera and the object.^{5}

In all of the experiments we present in this section, we must provide the tracker with the initial pose of the target object. How this initial pose information is supplied is different for different types of experiments. For the visual-servoing-based assembly-on-the-fly experiments, the initial pose is provided by the “coarse module” as described in [19]. The coarse module uses a ceiling-mounted camera for rough estimation of the location of the target. When the target moves into the servo range of the robot end-effector-mounted camera, the control is automatically handed over to the “fine control” module that is also described in [19]. The “fine control” module is based on the tracking algorithms described in this paper. For the tracking of handheld objects in real time, we have developed a GUI that gives the user control over translations and rotations of the wireframe model as projected onto a terminal screen. The user can manually bring the model into correspondence with the first camera image and thus initialize the pose. A similar GUI-based approach is used for pose initialization when tracking an object in video sequences offline.

In the rest of this section, we will first present quantitative results on the accuracy of tracking for two different objects. For each object, we analyzed two video sequences for estimating tracking accuracy. Subsequently, we will show qualitative results on these two objects and two additional objects possessing complex shapes.

### A. Quantitative Analysis of Pose Estimation Errors

As stated in the preamble to this section, we report our tracking accuracy results with the help of two video sequences for each of two different objects, the train-station object and a truck object. We will refer to the two video sequences of the train-station object as *Station-Smooth* and *Station-Nonsmooth.* Similarly, we will refer to the two video sequences of the truck object as *Truck-Smooth* and *Truck-Nonsmooth.* The “smooth” and “nonsmooth” qualifiers in the names reflect the nature of the motion of the object with respect to the camera. In particular, the motion for the “nonsmooth” case is very jerky, as should be evident to those visiting the URL mentioned at the beginning of this section. For the “nonsmooth” case, the results shown later also include plots of the translational and rotational parameters as functions of time to give the reader an idea of the jerkiness of the motions. The overall size of the train-station object is 175 × 100 × 58 and that of the truck 410 × 250 × 200, all in millimeters.

For calculating the ground truth pose for these image sequences, we mount the camera on a high-performance robotic arm, and we move the robotic arm while keeping the target object stationary. Since the calculation of the pose of the object is always relative to the coordinate frame of the camera, moving the camera while the object is stationary is equivalent to tracking a moving object with a stationary camera. With this approach, the ground truth pose is calculated from the robot kinematics and the relative pose of the camera with respect to the robot end-effector, which is given by hand–eye calibration.
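The ground-truth computation described above can be written as a chain of homogeneous transforms. The following is a sketch under assumed notation that does not appear in the original text: *B* denotes the robot base frame, *E* the end-effector frame, *C* the camera frame, and *O* the object frame, and the pose of the stationary object in the base frame is assumed to be known and constant.

$${}^{C}T_{O}(t) = \left({}^{B}T_{E}(t)\,{}^{E}T_{C}\right)^{-1}\,{}^{B}T_{O}$$

Under these assumptions, the first factor inside the parentheses would come from the robot forward kinematics at time *t* and the second from hand–eye calibration, so the only time-varying quantity is the end-effector pose reported by the robot.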

For the *Station-Smooth* and *Truck-Smooth* experiments, the camera mounted on the robot end-effector moved along a designated path. The average distance from the camera to the target object was 350 mm for the *Station-Smooth* sequence and 700 mm for the *Truck-Smooth* sequence. Each video sequence contained 100 images. A frame from each of the two sequences is shown in the composite in Fig. 11. Shown below the images are two sets of numbers. The first set is the true pose of the object in the camera coordinate frame and the second set is the estimated pose. The average rms error for the *Station-Smooth* sequence is 3.3 mm in translation and 0.27° in rotation, and for the *Truck-Smooth* sequence it is 6.2 mm in translation and 0.20° in rotation.

For the other two image sequences, *Station-Nonsmooth* and *Truck-Nonsmooth*, the robot end-effector with the camera mounted on it was made to execute sudden large random changes in its direction. In order to give the reader a sense of the magnitude of the motion jerkiness thus induced between the camera and the objects, Fig. 12 shows the translation and rotation values of the ground truth pose (solid line with circular markers) and the corresponding estimates for the tracked pose (dashed line with cross markers) for the two sequences. The horizontal axis in all the plots shown is the time in the video sequences.

To give the reader an even better sense of the extent of motion jerkiness injected manually into the two Nonsmooth experiments, Fig. 13 shows a frame from each video sequence. For each frame, we show the object as it appears to the camera, its superimposed predicted pose with gray line segments, and its estimated pose with dark line segments.

As shown in Table I, the average rms error for the *Station-Nonsmooth* sequence is 4.8 mm in translation and 0.36° in rotation, and 9.2 mm and 0.67° for the *Truck-Nonsmooth* sequence. The table also includes entries for the average rms error for the case of smooth motions.

### B. Pose Estimation Performance Analysis

While the previous section reported quantitative results on the tracking accuracy, we will now address the issue of tracking performance, meaning the speed with which the objects can be tracked. The performance numbers will be presented for the same four video sequences used in the previous section. The computer hardware used in those tracking experiments was a Pentium-4 3.6 GHz processor with 512 MB of system memory.

Obviously, the time it takes to update the object pose depends on the number of features in the model for the target object. The number of physical edges used for the train station model was 148 and for the truck model 67. A significant portion of the processing time is spent on low-level image processing such as smoothing, edge detection, and straight-edge extraction. The mean of the scene feature extraction time was 153 ms for the train station object and 151 ms for the truck object. The pose estimation time obviously depends on the number of EKF iterations to achieve convergence in the matching of scene features and model features. The average number of EKF iterations is seven for the *Station* sequences and six for the *Truck* sequences. The mean of the pose estimation time was only 53 ms for the train station object and 10 ms for the truck object for the four image sequences used in the previous section. Processing time for each of the four image sequences is listed in Table I.

### C. Some Further Tracking Experiments in the Presence of Occlusion and Cluttered Background

We will now present some additional experimental results to demonstrate how robust our tracking method is to occlusion and cluttered backgrounds. Since each of these phenomena is difficult to quantify for nontrivial experimental conditions, our results in the rest of the paper are only qualitative. That is, we will show some example frames from experimental data taken under conditions that represent the phenomena. With overlays, these example frames will demonstrate that our system is able to track a target object despite the presence of highly adverse circumstances. But, for obvious reasons, it is difficult to convey the full sense of the capabilities of our approach using just static images. An interested reader is therefore urged to visit the Web site whose URL was mentioned at the beginning of Section VI.

For each of the two objects used in the previous section, the train-station object and the truck object, we captured one image sequence with highly cluttered background, named *Station-Clutter* for the train-station object and *Truck-Clutter* for the truck object, and one image sequence in the presence of severe occlusion, named *Station-Occlusion* and *Truck-Occlusion*, respectively.

A frame from each of the two video sequences *Station-Clutter* and *Truck-Clutter*, presented in Fig. 14, qualitatively demonstrates the robustness of our technique when the background is highly cluttered. Superimposed on each frame is the projection of the model into the camera image using the calculated pose of the object.

Along the same lines, a frame for each of the two video sequences, *Station-Occlusion* and *Truck-Occlusion*, in Fig. 15 demonstrates the robustness of the system with regard to heavy occlusion. The superimposed wireframe (thick black line segments) in each image shows that the object is being tracked correctly despite the fact that a significant portion of the object is occluded.

We also tested our algorithm for tracking two visually challenging objects. A jeep object, shown in the left panel of Fig. 16, has a complicated shape making the feature matching process confusing. The other object is a digital camera, shown in the right panel of Fig. 16, which has metallic and dark surfaces that make it difficult to extract scene features under normal lighting conditions. Experiments show that the system successfully tracks these two objects. Complete tracking sequences for these objects are also posted at the URL mentioned earlier.

### D. Visual Servoing Experiment

Finally, we will present an experiment in real-time visual servoing using the object tracking algorithm presented in this paper. The goal of this experiment is to carry out peg-in-hole assembly while the “hole” is undergoing large and nonsmooth motions. Fig. 17(a) shows an engine-cover object that hangs from a gantry mounted on the ceiling. The object contains a “hole” into which the robot must insert a “peg.” In a typical experiment, the engine-cover object moves along a linear slide with an average speed of 43.5 mm/s. A couple of strings are attached to the engine cover so that a human can pull them differentially to induce large jerkiness in the motion of the “hole” as the robot end-effector tries to insert the peg into it. Fig. 17(a) shows an example of successful peg insertion, and a screen shot of the camera images before and after feature extraction for successful insertion is shown in Fig. 17(b). Further details regarding these visual servoing experiments for the purpose of robotic assembly are presented in [19]. The servoing results can also be seen at http://cobweb.ecn.purdue.edu/RVL/movies/LineTracking/ICRA06.wmv.