• Abstract

# Discovering Higher Level Structure in Visual SLAM

In this paper, we describe a novel method for discovering and incorporating higher level map structure in a real-time visual simultaneous localization and mapping (SLAM) system. Previous approaches use sparse maps populated by isolated features such as 3-D points or edgelets. Although this facilitates efficient localization, it yields very limited scene representation and ignores the inherent redundancy among features resulting from physical structure in the scene. In this paper, higher level structure, in the form of lines and surfaces, is discovered concurrently with SLAM operation, and then, incorporated into the map in a rigorous manner, attempting to maintain important cross-covariance information and allow consistent update of the feature parameters. This is achieved by using a bottom-up process, in which subsets of low-level features are “folded in” to a parameterization of an associated higher level feature, thus collapsing the state space as well as building structure into the map. We demonstrate and analyze the effects of the approach for the cases of line and plane discovery, both in simulation and within a real-time system operating with a handheld camera in an office environment.

SECTION I

## Introduction

VISUAL simultaneous localization and mapping (SLAM) aims to use cameras to recover a representation of the environment that concurrently enables accurate pose estimation. Cameras are powerful additions to the range of sensors available for robotics, not only because of their small size and cost, but also for their ability to provide a rich set of world features, such as color, texture, motion, and structure. In mobile robotics, vision has proved to be an effective companion for more traditional sensing devices. For example, in [1], where vision is coupled with laser ranging, the visual information is reliable enough to perform loop closure for an outdoor robot in a large section of an urban environment. Indeed, such is the utility of vision, that cameras have also been used as the main or only SLAM sensor, as demonstrated in the semin al work of Davison [2], [3], [4]. This enables applications for a wide range of moving platforms, where the main mapping sensor could be a single camera, such as small domestic robots and a variety of personal devices. For these, visual SLAM offers the fundamental competence of being spatially aware, a necessity for informed motion and task planning.

### A. Motivation

Although there have been significant advances in visual SLAM, these have been primarily concerned with localization. In contrast, progress in developing mapping output has been somewhat neglected. Systems have been based on maps consisting of sparse, isolated features, predomin antly 3-D points [2], [5], [6], and also 3-D edges [7], [8], [9] and localized planar facets [10]. Such minimal representations are attractive from the localization point of view. With judicious choice of features, image processing can be kept to a minimum, enabling real-time operation while maintaining good tracking performance. Recent work has further improved on localization to give better robustness by adopting different frameworks such as factored sampling [11] by using more discrimin ative features [6], [12] or by incorporating relocalization methods for when normal tracking fails [5]. While this has certainly improved the performance significantly, the information contained in the resulting maps is still very limited in terms of the representation of the surrounding environment. In this paper, we seek to address this issue by investigating mechanisms to discover, and then, incorporate higher level map structure, such as lines and surfaces, in visual SLAM systems.

### B. Related Work

The detection of higher level structure has been considered before, using both visual and nonvisual sensors. In most cases, this happens after the map has been generated. For example, in [13] and [14], planes are fitted to clouds of mapped points determined from laser ranging. However, the discovered structures are not incorporated into the map as new features and so do not benefit from correlation with the rest of the map and must be refitted every time when the underlying map changes.

In other examples, higher level features are recovered before they are added into the map, such as those described in [15], [16], [17], where 2-D geometric primitives are fitted to sonar returns and then used as features, and [18], [19] where 3-D planes are fitted to laser range data. The method described in [20] and [21] takes this approach a stage further and fuses measurements from a laser range finder with vertical lines from a camera to discover 2-D lines, semiplanes, and corners that can be treated together as single features. In [22], visual map features, such as lines, walls, and points, are represented in a measurement subspace that can vary its dimensions as more information is gathered about the feature.

In vision-based SLAM, maps have been enhanced by virtue of users providing annotations [23], [24], manually initializing regions with geometric primitives [25], or by embedding and recognizing known objects [26]. These systems require user intervention, and importantly, the structures are not able to grow to include other features as the map continues to expand.

### C. Contribution

In contrast to the previous systems, we propose a system in which higher level structure is automatically introduced concurrently with SLAM operation, and importantly, the newly discovered structures become part of the map, coexisting seamlessly with existing features. Moreover, this is done within a single framework, maintaining full cross covariance between the different feature types, thus enabling consistent update of feature parameters. This is achieved by augmenting the system state with parameters representing higher level structures in a manner similar to that described in [20] and [21]. The novelty of the approach is that we use existing features in the map to infer the existence of higher level structure, and then, proceed to “fold in”subsets of lower order features that already exist in the system state and are consistent with the parameterization. This collapses the state space, reducing redundancy and computational demands, as well as giving a higher level scene description. The approach therefore differs both in the timing at which higher level structure is introduced and in the way the newly discovered structure is maintained and continually extended.

The paper is organized as follows. We first introduce a generic SLAM framework, based around Kalman filtering, in which higher level structure can be incorporated into a map concurrently with localization and mapping. This is followed by details of how this can be achieved for the cases of planes and lines, based on collapsing subsets of 3-D points and edgelets, respectively. Results of experiments in simulation and for real-time operation using a handheld camera within an office environment are then presented, including an analysis of the different tradeoffs within the formulation. The paper concludes with a discussion on the directions for future work. Initial findings of the work were previously reported in [27].

SECTION II

## Visual SLAM With Higher Level Structure

We assume a calibrated camera moving with agile motion while viewing a static scene. The aim is to estimate the camera pose while simultaneously estimating the 3-D parameters of structural features, such as points, lines, or surfaces. This is done using a Kalman filter framework. The system state x = [v,m]T has two partitions: one for the camera pose v = [q,t]T and one for the structural features m = [m1,m2,…]T. The camera rotation is represented by the quaternion q and t is its 3-D position vector, both defined w.r.t. a world coordin ate system. We adopt a generalized framework for the structural partition to allow multiple feature types to be mapped. Each mi denotes a parameterization of a feature, for example, a position vector for a 3-D point or the position and orientation of a planar surface. In general, we expect the number and the types of features to vary as the camera explores the environment, with new features being added and old ones being removed so as to minimize computation and maintain stability.

### A. Kalman Filter Formulation

The Kalman filter requires a process model and an observation model, encoding the assumed evolution of the state between time steps and the relationship between the state and the filter measurements, respectively [28]. Application of the filter to real-time visual SLAM is well documented (see, e.g., [2], [29], and [6]); we, therefore, provide only a brief outline here. We assume a constant position model for the camera motion, i.e., a random walk, and since we are dealing with a static scene, this gives the following process model: TeX Source $${\bf x}^{\rm new} ={\bf f}({\bf x},{\bf w})=[\Delta {\bf q}({\bf w}_\omega)\otimes{\bf q},{\bf t}+{\bf w}_\tau,{\bf m}]^T \eqno{\hbox{(1)}}$$where w = [w_ω,w_τ]T is a 6-D noise vector, assumed to be from N(0,Q), Δq(w_ω) is the incremental quaternion corresponding to the Euler angles defined by w_ω, and ⊗ denotes quaternion multiplication. Note that the map parameters are predicted to remain unchanged (we assume a static scene) and that the nonadditive noise component in the quaternion part is necessary to give an unbiased distribution in rotation space.

At each time step, we collect a set of measurements from the next frame obtained from the camera, z = [z1,z2,…]T, where, in general, each zi will be related to the state via a separate measurement function, i.e., zi = hi(x,e), where e is a multivariate noise vector from N(0,R). For example, when mapping point features, the measurements are assumed to be corrupted 2-D positions of projected points and for the i th measurement TeX Source $${\bf h}_i({\bf x},{\bf e})=\Pi({\bf y}({\bf v},{\bf m}_j))+{\bf e}_i \eqno{\hbox{(2)}}$$where mj is the 3-D position vector of the point associated with the measurement zi,y(v,mj) denotes this position vector in the camera coordin ate system, Π denotes pinhole projection for a calibrated camera, and ei is a 2-D noise vector from N(0,Ri).

Application of the filter at each time step gives mean and covariance estimates for the state based on the process and observation models. Both of these are nonlinear, and thus, we use the extended Kalman filter (EKF) formulation, in which updates are derived from linear approximations based on the Jacobians of f and hi[28]. The role of the covariances within the filter is critical. Proper maintenance ensures that updates are propagated among the state elements, particularly the structural components, which are correlated through their mutual dependence on the camera pose [30], [31]. This ensures consistency and stability of the filter. Propagation of the state covariance through the observation model also allows for better data association, constraining the search for image features. As discussed next, it is therefore crucial to ensure that covariances are properly maintained for a successful filter operation.

### B. Augmenting and Collapsing the State

We adopt a bottom-up process in order to incorporate higher level structure into a map. This works as follows. Given a map populated by isolated lower level features, we seek subsets that support the existence of higher level structure such as lines or surfaces. With sufficient support, a parameterization of the higher level feature is then augmented to the state with an initial estimate derived from the supporting features. Subsequently, lower level features consistent with the parameterization are linked to that feature and transformed according to the constraint that it imposes. This results in a collapsing of the state space, reducing redundancy. Details of how this is achieved in practice are given in Section III.

A key aspect of the work described in this paper is that the earlier process is implemented in a rigorous manner within the filter. For this, we draw on the procedures for transforming and expanding states [32], which also forms the basis for initializing isolated features in existing systems [33]. Augmentation and collapsing of the state is done in such a way so as to preserve full correlation among existing and new features within the map. This ensures that parameter updates are carried out in a consistent fashion. It can be formalized as follows. Assume that we have an existing state estimate , where there are n features in the map, and that we wish to augment the state with a new (higher level) feature mn+1 based on an initialization measurement zo. In general, the initial estimate for the feature will be derived from a combin ation of the measurement and the existing state, i.e., , and the augmented state covariance then becomes TeX Source \eqalignno{{\bf P}^{\rm new} &= {\bf J}\left[\matrix{{\bf P} & {\bf 0} \cr {\bf 0} & {\bf R}_{o}}\right]{\bf J}^T&{\hbox{(3)}}\cr {\bf J}&=\left[\left. \matrix{{\bf I} \cr \cr \hline \cr \nabla{\bf s}_{{\bf v}} \;\; \nabla{\bf s}_{{\bf m}_1} \;\; \ldots \;\; \nabla{\bf s}_{{\bf m}_n}\;\;}\!\!\right\vert \!\!\!\matrix{{\bf 0} \cr \cr \hline \cr \;\;\nabla{\bf s}_{{\bf z}_o}}\right]&{\hbox{(4)}}}where Ro is the covariance of the measurement and ∇sv = ∂s/∂v. Thus, the covariance matrix grows, as illustrated in Fig. 1, and introduces the important correlations between the new feature and the existing features in the map. For example, in the case of augmentation with a new planar feature derived from a subset of point features, only the relevant Jacobians ∇smi will be nonzero and the feature will be correlated with the camera pose through that subset of points.

In a similar way, we can also transform an existing feature to a different representation, e.g., , and then, the covariance update is given by TeX Source \eqalignno{{\bf P}^{\rm new}&={\bf J}{\bf P}{\bf J}^T&{\hbox{(5)}}\cr {\bf J}&=\left[\left.\left. \matrix{{\bf I}\cr \cr \hline \cr \nabla{\bf r}_{{\bf v}} \;\;\ldots \;\;\cr \cr \hline \cr {\bf 0}\;\; }\!\!\right\vert \!\!\!\matrix{{\bf 0}\cr \cr \hline \cr \;\; \nabla{\bf r}_{{\bf m}_i}\;\;\cr \cr \hline \cr {\bf 0}\;\; }\!\!\right\vert \!\!\!\matrix{{\bf 0} \cr \cr \hline \cr \;\; \ldots \;\;\nabla{\bf r}_{{\bf m}_{n}}\;\;\cr \cr \hline \cr {\bf I}}\right]&{\hbox{(6)}}}where any change in the dimensions of the state is reflected in the dimensions of the Jacobian J.

Both of the previous relationships allow us to introduce higher level features and collapse the state size while maintaining the cross correlations within the filter. For example, as illustrated in Fig. 1, consider a new feature mn+1 that represents a higher level structure and imposes a constraint on a subset of existing features (a line and a set of points, respectively, in Fig. 1). Given an initial estimate of mn+1, we can introduce it into the system using (3) and the constraint can then be imposed by transforming the existing features into the new constrained representation using (5). This collapses the state by removing redundant constrained variables, while also maintaining the correlations between the new features, the camera pose, and the remainder of the map.

Fig. 1. Collapsing the state space when introducing higher level structure into the SLAM map. (Left) Map consists of point features. (Middle) Parameterization of higher level line feature introduced into the map, expanding the state x and the covariance Σ.(Right) State space collapses as points on the line are “folded in” to the higher level feature.
SECTION III

## Building in Higher Level Structure

In practice, building higher level structure into a system map has three distinct components: discovery—determining subsets of isolated features that support the premise that a higher level structure is present; incorporation—initializing a parameterization of the higher level feature into the system state and linking existing features to it; and measurement—deriving measurements for the higher level feature in order that it and its linked features can be updated within the filter.

### A. Discovery

The discovery component is essentially a segmentation task among existing features in a map, and we adopt a random sample consensus (RANSAC) approach [34]. This involves generating hypotheses from minimal sets of features randomly sampled from the subset of features having sufficient convergence, i.e., with variance less than some suitable threshold σRANSAC2. Each hypothesis is then tested for consensus with the rest of the set. Thus, given a minimal set of features, (m1,m2,…,mK), we generate the parameters for a higher level feature via p = g(m1,m}2,…, mK), and a feature mi is then deemed to be in consensus with the hypothesis if a distance measure d(p,mi) from the hypothesized feature is less than some threshold dRANSAC. The choice of thresholds here is important and essentially controls the system's willingness to introduce new structure into the map. We discuss this further and analyze the tradeoffs in Section V.

In practice, RANSAC search is not performed over the global map but uses a small local set of the N most recently observed features. At the beginning of each search, a random feature from this set is chosen as a base reference and used to temporarily transform the other features in the set and discover their relative uncertainties using the technique described in [17]. The local covariances of the features are used for thresholding, allowing the discovery of local structure even when the global uncertainty is large. We use a standard implementation of RANSAC and do not currently use the variance information within the algorithm itself, although this could be incorporated in the future, e.g., by temporarily transforming features prior to a Mahalanobis distance consensus test by using the hypothesized higher level features as a base frame of reference.

### B. Incorporation

The hypothesis with the most consensus yields a set of inlying features from which “best-fit” parameters can be derived in order to initialize the associated higher level feature. This marks the start of the incorporation component.

First, a χ2 similarity test with 95% bounds is performed to ensure that the proposed higher level structure is not too similar to any existing structures in the system. If the test is passed, then the state is augmented with the initial parameter estimates and the covariance is then expanded according to (3), where s(x,zo) denotes the derivation of the parameters from the set of inlying features. Note that since the parameters are derived only from the inliers, then both zo = 0 and ∇sv = 0, and we need only to compute the Jacobians ∇smi. This ensures that the parameters are correlated with the rest of the SLAM state through the inlying features.

Having initialized a higher level feature, the system then seeks to link existing features to it and transform them according to the constraint that it imposes. While we could immediately incorporate all of the inlying features into the new structure, we choose not to do this because the current RANSAC search does not fully take into account the uncertainty of the features, as discussed in Section III-A.Instead, we gradually add features to the structure, adopting the following procedure for each feature in the map (or a local subset of features around the new structure).

1. Temporarily transform the feature using the new structure as a base reference [17] to give us the state and covariance relative to the new structure.

2. Reject the feature if σmax2 > σT2, where σmax is the maximum standard deviation of the feature relative to the structure and σT is a threshold standard deviation.

3. Reject the feature if d(p,mi) > dT, where d(p,mi)is a measure of the distance from the structure, e.g., the RANSAC consensus distance measure, and dT is a distance threshold.

4. Accept the feature if its Mahalanobis distance from the structure is less than the 95% χ2 threshold.

Features accepted for incorporation are transformed into a new representation, e.g.,, and the state covariance updated according to (5). This typically results in a collapsing of the state due to the constraint imposed by the higher level feature. For example, as described in the next section, 3-D point features linked to a planar feature are transformed into 2-D points lying on the plane.

It is important to emphasize that once the higher level feature is added to the map, we are able to continue to seek for new features to link to it, even if they were not initially used to estimate the higher level feature or were subsequently added to the map after incorporation.

### C. Measurement

The parameters for both the higher level feature and its linked features are updated in the usual manner within the filter. To do so, we require an observation function, linking the parameters to measurements obtained from the incoming image frames. The details of this function will be dependent on the type of higher level features and specific examples are given in Section IV-A4 and IV-B4 for the cases of planes and lines, respectively. Due to the cross-covariance terms introduced between features and the higher level structure, measurement of one will necessarily update the other. Therefore, it is possible to update the higher level feature even if we are unable to take direct measurements of it.

SECTION IV

## Planes and Lines

In this section, we describe the specific cases of discovering and incorporating planar and straight line structure within maps consisting of isolated 3-D points and edgelets, respectively. The mechanisms closely follow the steps described in Section III. We, therefore, only note the key differences and implementation details here.

### A. 3-D Planes From Points

In the case of planes, we aim to discover the location, extent, and orientation of planar structure formed by 3-D points being concurrently mapped. Once a plane has been initialized, 3-D points deemed to be consistent with it are collapsed into 2-D planar points, reducing the state size, and may even have their location on the plane fixed so that the point can be completely removed from the state. However, note that since we are interested in reducing redundancy in the system, we do not at present verify that the discovered planes correspond to physical planes in the scene (c.f. Section VII).

#### Representation

Since we want to use the plane as a coordin ate frame for 2-D points that are “folded into” the plane, we need to parameterize the orientation of the plane around its normal direction as well as its origin. For convenience, we use a nonminimal representation with a 9-D state vector so that a new planar feature augmented to the state has the form mn+1 = [po,c1,c2]T, where po is the plane origin and the orientation is defined by two orthonormal basis vectors,c1 and c2, which lie on the plane. The normal to the plane is then simply the cross product between the two basis vectors, i.e., n = c1×c2.A 3-D point in the map that is deemed to lie on the plane and whose feature vector mi defines its 3-D position vector can then be transformed into a 2-D planar point using TeX Source $${\bf m}_i^{\rm new} = [({\bf m}_{i}-{\bf p}_o) \cdot {\bf c}_1,\;\; ({\bf m}_{i}-{\bf p}_o) \cdot {\bf c}_2]^T \eqno{\hbox{(7)}}$$where · denotes the dot product. Thus, for l such points on a plane, this gives a state size of 9+2l, compared with 3l, giving a reduction in state size for l > 9. In the filter, the orthonormality constraint of the basis vectors is applied after every update by displacing the mean to satisfy the constraint and updating the covariance to reflect this displacement (c.f. the “displOff”algorithm in [35]).

#### Discovery

The discovery of potential planar structure using the RANSAC process is similar in approach to that used in [36]. We require a minimum of three 3-D points to generate the parameters for a plane hypothesis, i.e., TeX Source $${\bf p}_o={\bf m}_{1} \;\;\; {\bf c}_1={\bf m}_{2}-{\bf m}_{1} \;\;\; {\bf c}_2= {\bf m}_{3}-{\bf m}_{1}. \eqno{\hbox{(8)}}$$Consensus for the hypothesis among other points is then determined using a distance measure based on a perpendicular distance from the plane, i.e., for a point mi, the distance d = (mipon.

We also impose the additional criterion that the Euclidean distance from the plane origin must be less than a second threshold dmax, which ensures that we only initialize planes with strong local support. This biases the system toward introducing planes that are likely to match local physical structure, rather than fitting a geometric plane through a disperse set of points. This is beneficial because it is reasonable to assume that a physical plane will contain other, currently undiscovered, coplanar points that can be “folded in” to the plane in the future.

To determine the best-fit plane from the inliers resulting from the RANSAC process, we use a principal component approach similar to [19]. The origin is set to the mean and the orientation parameters are determined from the principal components. Specifically, if the position vectors for l inlying points w.r.t. the mean are stacked into an l× 3 matrix M, then the eigenvector of MTM corresponding to the smallest eigenvalue gives the normal to the plane and the other two eigenvectors give the basis vectors within the plane. The smallest eigenvalue λmin is the variance of the inliers in the normal direction and provides a convenient measure of the quality of the fit. It is possible for this approach to return poorly conditioned eigenvectors if the system is degenerate. This case can be tested by rejecting planes with pairs of similar eigenvalues.

#### Incorporation

In order to avoid adding poor estimates of planes to the system state, the best-fit plane generated by the RANSAC process is only initialized if l > lT and λmin < λT. The best-fit parameters are used to initialize a plane feature in the state and the covariance is updated according to (3), where the Jacobian w.r.t. an inlying point feature mi is computed as TeX Source $$\nabla{\bf s}_{{\bf m}_i} =\left[\matrix {\displaystyle {{\partial{\bf p}_{o}}\over {\partial{\bf m}_{i}}},\quad {{\partial{\bf c}_1}\over {\partial{\bf m}_{i}}},\quad {{\partial{\bf c}_2}\over {\partial{\bf m}_{i}}}}\right]^T \eqno{\hbox{(9)}}$$where ∂c1/∂mi and∂c2/∂mj are the Jacobians of the two eigenvectors computed using the method described in [37]. This ensures that the plane parameters are correlated with the rest of the state through the inlying point features.

As an additional criterion for incorporation of the point onto the plane, we reject the point if its distance from the plane origin or any other point on the plane is greater than dmax. Again, this is to bias the system toward accepting local planes.

#### Measurement

Here, we do not consider making a direct observation of the plane. Instead, we base the plane measurement on observations of the 2-D points linked to the plane. The measurement model for a 2-D planar point mi is very similar to that of a standard 3-D point feature, but involves an additional prelimin ary step to convert it to a 3-D position vector p in the world frame of reference prior to projection into the camera, i.e., from (7), we have TeX Source $${\bf p}=\left[{\bf c}_1 \;\; {\bf c}_2\right]{\bf m}_i \;+\; {\bf p}_o. \eqno{\hbox{(10)}}$$The predicted measurement for the planar point can then be obtained by passing p through the standard perspective projection model in (2). The similarity between the measurement models makes the implementation of planar points in the considered EKF SLAM system very simple. The measurement Jacobian required by the EKF is almost the same as that for 3-D point features, except that we need to take account of the conversion in (10) when computing the Jacobian relating the observation to the 2-D point feature, i.e., TeX Source $${{\partial{\bf h}_{i}}\over {\partial{\bf m}_{i}}}={{\partial{\bf h}_{i}}\over {\partial{\bf p}}}\quad {{\partial{\bf p}}\over {\partial{\bf m}_{i}}}\eqno{\hbox{(11)}}$$where hi is the measurement function associated with a 3-D point (2) and ∂p/∂mi is the Jacobian of the 3-D point w.r.t. the 2-D planar point derived from (10). Fin ally, since the predicted observation is also dependent on the position and orientation of the plane, a Jacobian ∂hi/∂mj relating hi to the relevant plane feature mj is also derived from (10).

#### Fixing Plane Points

If the maximum of the variances of a 2-D planar point becomes very small relative to the plane, i.e., less than a threshold σfix, then we can consider further collapsing the state space by removing it completely from the state. Instead of maintaining an estimate of its 2-D position on the plane, we consider it as a fixed point in the plane and use its measurements to update the associated planar feature directly. This leads to further reduction in the size of the state space.

### B. 3-D Lines From Edgelets

Another natural option to enhance scene description beyond sparse points are edges. In this case, we consider edgelet features, which are locally straight sections of a gradient image proposed in [7]. An edgelet is composed of a point in 3-D and its direction. By their nature, edgelets can be initialized along a variety of edge shapes by assuming that even curved edges can be approximated by locally straight sections. In this case, we are interested in discovering straight lines formed from collections of edgelets and perform a reduction in state size similar to that of points incorporated into planes.

#### Representation

We parameterize an edgelet in 6-D, composed of its 3-D position ce, and unnormalized 3-D direction vector de. Similarly, a 3-D line feature is defined by mn+1 = [cl,dl]T, where cldenotes its origin point and dl is its direction.

#### Discovery

In the RANSAC discovery component, we use pairs of edgelets to generate line hypotheses, based on their 3-D position vectors, which we found to give more reliable results than using a single edgelet. The consensus distance measure is then based on the perpendicular distance of an edgelet's 3-D position from the line and the difference in its 3-D orientation. Principal components of the inlier set with most consensus then provides an initial estimate of the line direction and the mean 3-D position of the inliers is taken as the line origin.

#### Incorporation

After a line is put in the state, edgelets deemed to lie on the line are linked to it and transformed into 1-D line points parameterized by their distance from the line origin. These 1-D edgelets can then be removed from the state, reducing the state size dramatically. Transforming directly from the 6-D edgelet representation to a fixed, 1-D representation on the line makes sense because the remaining degree of freedom, which allows the edgelet to move along the line, is not observable in the edgelet measurements [7], which only measure perpendicular offset from the line. Also, the edgelet does not have a descriptor linking it to a specific point on the line, so small errors in position along the line will have no effect on the performance of the system. This is a marked contrast to the case of points on a plane, where each point measurement is associated through a visual descriptor to a specific feature in the image. To transform the edgelet, we simply project it onto the line along the perpendicular direction.

#### Measurement

In contrast to the case with planes, we do not pass all of the individual observations of the line points into the EKF. Instead, a single observation for the line is computed by projecting each of its fixed edgelets into the image plane (with 3-D orientation set to that of the line) and applying a RANSAC line-fitting method to robustly fit a single line to individual edgelet measurements. The direction and position of this single line in the image is passed into the EKF as the observation of the 3-D line. Measurement Jacobians for the EKF are then computed in a similar manner to (11).

This approach has two distinct benefits: 1) the number of observations that the EKF has to process each frame is significantly reduced and 2) the RANSAC line fitting ensures that we only accept mutually compatible measurements of the edgelets along the line, similar to the mutual compatibility test between the features of a given landmark in [21].

SECTION V

## Simulations

### A. Methodology

We have analyzed the approach using two different simulated environments. The simulations model a 6-DOF camera moving through space with a 81° horizontal field of view (FOV) and image size 320 × 240 pixels. The simulated data have been used to measure the system consistency and the accuracy of the tracking and reconstruction against the known ground truth. In each case, the simulated environment contains a small set of known landmarks that are initially visible from the camera. This “template” provides a known scale for the map.

#### Consistency

The consistency of the system is measured using the normalized estimation error squared (NEES) [28] averaged over multiple Monte Carlo runs of the simulations. If the system is consistent, then we expect this average value to converge toward the dimension of the system state, which we can test using a χ2 test with 95% bounds. An average value above the upper bound indicates that the system is optimistic, a case that we particularly want to avoid because it reduces the positive effects of loop closure and prevents successful data association if we use feature innovation to constrain the search for image features.

TABLE I Plane and Line Simulation Results Over 50 Runs on a Small Map

#### State Size Reduction

The state size reduction from adding higher level features is expressed as a percentage of the maximum reduction we could achieve, if all of the simulated features initialized on ground-truth higher level structures were added correctly to their higher level structure by the system. For example, in a simulation of a set of points initialized to lie on a plane, we would achieve 100% of the maximum state size reduction, if all of these points were successfully added to one discovered plane in the system map.

### B. Small Map Simulation

Fig. 2. Small map simulation results. (Top) Edgelet mapping and line discovery. (Bottom) Point mapping and plane discovery. A known template of points/edgelets is positioned at the beginning and end of the wall to give a known scale to the map and provide loop closure at each end. Covariance is drawn as red ellipsoids. Ground-truth lines are drawn in pale red/cyan according to whether they are currently observed by the camera, and the estimated lines are superimposed on the top as thicker blue lines. The estimated plane is drawn using a pale green bounding box around the included points. Full sequence videos and a portable document format (PDF)version of this paper with color images are available in the IEEE Xplore.

The first simulation (Table I and Figs. 3 and 4) demonstrates the improvement in mean absolute error (MAE)of the map and reduction in state size caused by adding higher level structure in a small map with multiple loop closures, as might be typical in a desktop visual SLAM system or within a local map of a submapping SLAM system. The environment contains ground-truth line or point features arranged randomly in a 4-m-long volume and the camera translates along a sinusoidal path in front of them (Fig. 2). The positions of newly observed points and edgelets are initialized as 6-D inverse depth points [33] and converted to 3-D points once their depth uncertainty becomes Gaussian [38]. We assume perfect data association and Gaussian measurement noise of 0.5 pixel2. Thresholds for the planes were set at dT = 0.1 cm,σT = 1.0 cm, σRANSAC = 2σT, dmax = 200 cm, and lT = 7. Thresholds for lines were set at dT = 0.5 cm, 24°,σT = 1.0 cm, 24°, σRANSAC = σT, dmax = 120 cm, and lT = 4. The other thresholds were set in relation to these first ones as dRANSAC = dT, λT = dT2, and σfix = dT.

In the line simulations, the values in Table I correspond to a significant mean reduction in state size from 780 to 204. The MAE in orientation and position is also improved by adding lines, as shown in Fig. 3. In particular, the relatively large covariance on edgelet direction measurements and limited motion of the camera means that the edgelets do not quickly converge to a good estimation on their own, so applying a line constraint helps to overcome these limitations.

Fig. 3. Small map simulation results for lines over 50 Monte Carlo runs. (Top-left)Average camera NEES remains below the upper threshold indicating that the camera does not become inconsistent. (Top-right, bottom-left, and bottom-right) Evolution of MSE for the camera position, edgelet positions, and edgelet orientations respectively. Loop closure occurs around frame 750.

In the plane simulations, we also investigated adding clutter around the plane by positioning 50% of the points randomly within ± 20 cm of the ground-truth plane (Planes B in Table I). When clutter points are prevented from being added to the plane using ground-truth knowledge (Planes A), the results in Fig. 4 show similar characteristics to the line experiments, demonstrating reduction in MAE and state size, and no significant effect on consistency of the camera or the map. When clutter points are allowed to be erroneously added to the plane (Planes B and C), this performance degrades due to the added inconsistency in the map that becomes visible at loop closure (around frame 750, Fig. 4). However, the system still achieves 0.4 cm accuracy over a 4-m map and is capable of respectable state size reductions when points are fixed on the plane.

Fig. 4. Small map simulation results for planes over 50 Monte Carlo runs. (Top-left)Average camera NEES remains below the upper threshold indicating that the camera does not become inconsistent. (Top-right) Proportion of features that are inconsistent increases after loop closure if outliers are erroneously added to the plane. (Bottom-left, bottom-right)Evolution of MSE for the camera position and map, respectively. Loop closure occurs around frame 750.

### C. Room Simulation

In the second simulation, the camera moves through a larger environment with 3-D points randomly arranged on and around the walls of a 4-m square room (see Fig. 5). Clutter distribution and thresholds are left the same as for the previous experiments (c.f. Section V-B).The challenge in this simulation is that loop closures occur less regularly and camera orientation error is given time to grow much larger.

Fig. 5. Plane simulation results for Planes B. (Far-left) Initial point mapping and plane discovery. (Center-left) Large covariance prior to loop closure. (Center-right) Covariance and map error reduction after loop closure. (Far-right) Further reduction in covariance and slight inconsistency after second loop. Point and camera covariance shown in red and plane covariance in blue. Green lines denote ground-truth walls and estimated planes shown as blue bounding boxes around included points, with thickness indicating angle error. A full sequence video and a PDF version of this paper with color images is available in the IEEE Xplore.

When we allow the map to converge before adding higher level structure (Planes E), then consistency and MAE are comparable to points only operation (no planes), as shown in Table II and Fig. 6. However, we get the additional benefit of reduced state size and improved understanding of the environment. Similar performance is achieved when we add planes to the map but do not incorporate points onto the plane (Planes D): consistency and MAE are preserved, although in this case, the state size increases since planes are added but no points are removed.

TABLE II Room Simulation Results Over 30 Runs on a Room Map
Fig. 6. Consistency results over 30 Monte Carlo runs for the room simulation. (Left)Average NEES of the camera state, the horizontal lines indicate the 95% bounds on the average consistency, and the system is inconsistent if the NEES exceeds the upper bound. (Right) Proportion of points in the map with optimistic average consistency at each frame. Loop closure occurs around frames 2700 and 5400.

Differences in consistency only occur when the system attempts to incorporate points before loop closure (Planes B and C in Fig. 6). In these cases, the MAE is slightly larger but still comparable to the other cases. It seems as though the additional inconsistency is caused by a combin ation of two factors: adding outlier points to the plane and introducing errors when projecting 3-D points onto the plane to enforce the planarity constraint. We intend to investigate this further in future work.

SECTION VI

## Real-World Experiments

### A. Methodology

To illustrate the operation of the approach with real-world data, we present results obtained for plane discovery from points within a real-time visual SLAM system. This operates with a handheld camera and experiments were carried out in an office environment containing textured planar surfaces. We use a firewire webcam with frames of size 320× 240 pixels and a 43° FOV. The system begins by mapping 3-D points, starting from known points on a calibration pattern [2]. As the camera moves away, new points are initialized and added to the map, allowing tracking to continue as the camera explores the scene. Points are added using the inverse depth representation [33] and augmented to the state in a manner that maintains full covariance with the rest of the map [39] (c.f. Section II-B). For measurements, we use the features from accelerated segment test (FAST) salient point detector [40] to identify potential features and scale-invariant feature transform (SIFT) style descriptors with scale prediction [12] for repeatable matching across frames.

Fig. 7. Examples from a run of real-time visual SLAM with plane discovery using a handheld camera in an office environment. (Left)View of estimated camera position, 3-D mapped points, and planar features. (Right) Views through the camera with discovered planar features superimposed (boundaries in white) with converged 3-D points in black and 2-D planar points in white. A full sequence video is available in the IEEE Xplore.

### B. Results

Fig. 7 shows results from an example run. On the left are external views of the estimated camera position and map estimates in terms of 3-D points and planes at different time frames. The images on the right show the view through the camera with mapped point and planar features superimposed. Note that the three planes in the scene have been successfully discovered and incorporated into the map. The vast majority of mapped 3-D points were transformed into 2-D points (except those on the calibration pattern, which for fair comparison, we force to remain as 3-D points). The system operated at around 20 frames/s, on a 2.4 GHz PC, including graphics rendering and without software optimization, for maps of up to 50 features and with ten visible features per frame on average.

Fig. 8 shows examples of planar features converging within the SLAM map following initialization. The left image shows the state of the map following the introduction of a planar feature that links the 3-D points mapped on the side wall. Note the thickness of the graphic plane that illustrates the combined uncertainty in the plane position and orientation. As shown in the right image, after further observations, the plane has converged as indicated by the thin graphic. We have used orthographic, rather than perspective, projection to show the clear reduction in uncertainty. The plane discovery here is effective and we have utilized this as the basis for an augmented reality (AR) game demonstration [41].

Fig. 8. Convergence of planar features within the SLAM framework. Note the reduction in the graphic thickness for the left wall plane, indicating a reduction in uncertainty of plane position and orientation.

### C. Computational Demands

For real-time operation, the additional computational overhead of structure discovery becomes important. There are three main sources of this: 1) the initialization of structure using RANSAC; 2) the search for additional features to add to incorporated structure; and 3) the incorporation of new structure into the filter. In the experiments, structure search was run every frame using a set of 40 features, and all features in the map were considered for incorporation into the existing higher level structures. Using a nonoptimized software implementation, we found that the RANSAC initialization formed the bulk of the overhead, requiring around 10 ms per run. However, note that a more intelligent approach, which distributes the searches over multiple frames or runs less frequently or on smaller datasets, for example, would allow computational overhead to be controlled and balanced against the potential savings from incorporating higher level structure into the map. This is also something that we intend to investigate in future studies.

SECTION VII

## Conclusion

We have presented a method for discovering and incorporating higher level map structure, in the forms of planes and lines, in a real-time visual SLAM system. In small maps, introducing higher level structure reduces error and allows us to collapse the state size by combining subsets of low-level features. The higher level structures are able to grow and shrink dynamically as features are added and removed from the map, and provide more meaning to the SLAM map than the standard sparse collections of points. In larger maps, where error is allowed to grow significantly between loop closures, it is still possible to introduce the higher level structures but more care must be taken to avoid inconsistency.

However, in its current form, the method does have limitations. Incorrect incorporation of points into higher level features can have an adverse effect on consistency, and although we found that, in practice, this could be tolerated to some extent, it will need to be addressed if greater robustness is to be achieved. In future work, we intend to make better use of the rich visual information from the camera to improve the consistency and performance of the method. In particular, we currently discover the higher level structures purely from the information contained in the map, but there is a clear opportunity to combine this with information from the visual sensor, giving the potential to reduce the number of features erroneously added to higher level structures.

The current system also uses a number of thresholds, and these have a noticeable effect on the system performance. It would be desirable to reduce the number of thresholds that can be varied, ideally adopting a probabilistic threshold on the likelihood of a higher level structure existing in the map given the information we have about the low-level features. This type of probabilistic approach could also combine information from other sources, such as image information from the camera, before making a decision about whether to instantiate the new structure.

Fin ally, although high-level structure discovery facilitates useful reduction of the system state size, particularly in small-scale maps, it will not solve the computational problem of SLAM in large-scale environments, which has also been shown to introduce inconsistency into the filter. Submapping strategies may offer an approach to dealing with both of these problems at the same time. It will be interesting to consider how high-level structures could be discovered, incorporated, and managed within a submapped SLAM system, where low-level features belonging to the structure may potentially be distributed across multiple submaps. Recent work [42] examining submapping with conditionally independent submaps may provide a useful approach.

### Acknowledgment

The authors would like to thank the reviewers for their many useful comments and their suggestions for improvements.

## Footnotes

Manuscript received December 18, 2007; revised July 18, 2008. This paper was recommended for publication by Associate Editor J. Leonard and Editor L. Parker upon evaluation of the reviewers, comments. The work of A. P. Gee was supported by the Engineering and Physical Sciences Research Council (EPSRC) Equator Interdisciplin ary Research Collaboration (IRC). The work of D. Chekhlov was supported in part by the Overseas Research Students Awards Scheme (ORSAS), in part by the University of Bristol Postgraduate Research Scholarship, and in part by the Leonard Wakeham Endowment.

The authors are with the Department of Computer Science, University of Bristol, Bristol BS8 1UB, U.K. (e-mail: gee@cs.bris.ac.uk; chekhlov@cs.bris.ac.uk; andrew@cs.bris.ac.uk; wmayol@cs.bris.ac.uk).

This paper has supplementary downloadable material available at http://ieeexplore.ieee.org provided by the authors. The supplementary multimedia contains a set of four videos demonstrating the three different simulations that were run and the system in operation discovering planes in a real office environment. Its size is 36 MB.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

## References

1. Outdoor SLAM using visual appearance and laser ranging

P. Newman, D. Cole, K. Ho

Proc. Int. Conf. Robot. Autom., Orlando, FL 2006, pp. 1180–1187

2. Real-time simultaneous localisation and mapping with a single camera

A. J. Davison

Proc. Int. Conf. Comput. Vision 2003 pp. 1403–1410

3. 3-D simultaneous localisation and map-building using active vision for a robot moving on undulating terrain

A. J. Davison, N. Kita

Comput. Vision Pattern Recog., Vol. 1 pp. 384–391 2001

4. Simultaneous localisation and map-building using active vision

A. Davison, D. Murray

IEEE Trans. Pattern Anal. Mach. Intell., Vol. 24, issue (7) pp. 865–880 2002-07

5. Automatic relocalisation for a single-camera simultaneous localisation and mapping system

B. Williams, P. Smith, I. Reid

Proc. Int. Conf. Robot. Autom. Rome, Italy, 2007 pp. 2784–2790

6. Robust real-time visual SLAM using scale prediction and exemplar based feature description

D. Chekhlov, M. Pupilli, W. Mayol, A. Calway

Proc. Int. Conf. Comput. Vision Pattern Recog. 2007 pp. 1–7

7. Edge landmarks in monocular SLAM

presented at the Brit. Mach. Vision Conf., Brit. Mach. Vision Conf. 2006

8. Real-time monocular SLAM with straight lines

P. Smith, I. Reid, A. Davison

Proc. Brit. Mach. Vision Conf. 2006, pp. 17–26

9. Real-time model-based SLAM using line segments

A. P. Gee, W. Mayol-Cuevas

Proc. Int. Symp. Visual Comput. 2006 pp. 354–363

10. Locally planar patch features for real-time structure from motion

N. Molton, I. Reid, A. Davison

presented at the Brit. Mach. Vision Conf., Brit. Mach. Vision Conf. 2004

11. Scalable monocular SLAM

Proc. Int. Conf. Comput. Vision Pattern Recog. 2006 pp. 469–476

12. Real-time and robust monocular SLAM using predictive multi-resolution descriptors

D. Chekhlov, M. Pupilli, W. Mayol-Cuevas, A. Calway

presented at the 2nd Int. Symp. Visual Comput., 2nd Int. Symp. Visual Comput. 2006-11

13. Obtaining 3-D models of indoor environments with a mobile robot by estimating local surface directions

M. M. Nevado, J. G. García-Bermejo, E. Zalama

Robot. Auton. Syst., Vol. 48, issue (2/3) pp. 131–143 2004-09

14. 3D plane-based egomotion for SLAM on semi-structured environment

D. Viejo, M. Cazorla

Proc. IEEE Int. Conf. Intell. Robots Syst. San Diego CA 2007 pp. 2761–2766

15. Mobile robot relocation from echolocation constraints

J. Lim, J. Leonard

IEEE Trans. Pattern Anal. Mach. Intell., Vol. 22, issue (9) pp. 1035–1041 2000-09

16. Towards robust data association and feature modeling for concurrent mapping and localization

J. J. Leonard, P. M. Newman, R. J. Rikoski, J. Neira, J. D. Tardos

Proc. Tenth Int. Symp. Robot. Res. Lorne Australia, 2001-11 pp. 9–12

17. Robust mapping and localization in indoor environments using sonar data

J. Tardos, J. Neira, P. Newman, J. Leonard

Int. J. Robot. Res., Vol. 21, issue (4) pp. 311–330 2002

18. EKF-based 3D SLAM for structured environment reconstruction

J. Weingarten, R. Siegwart

Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. 2005 pp. 3834–3839

19. Probabilistic plane fitting in 3-D and an application to robotic mapping

J. Weingarten, G. Gruener, R. Siegwart

Proc. IEEE Int. Conf. Robot. Autom., New Orleans, LA, 2004 pp. 927–932

20. Mobile Robot Localization and Map Building: A Multisensor Fusion Approach

J. Castellanos J. Tardós

Norwell, MA
Mobile Robot Localization and Map Building: A Multisensor Fusion Approach Kluwer 2000

21. Multisensor fusion for simultaneous localization and map building

J. Castellanos, J. Neira, J. Tardos

IEEE Trans. Robot. Autom., Vol. 17, issue (6) pp. 908–914 2001-12

22. Vision SLAM in the measurement subspace

J. Folkesson, P. Jensfelt, H. Christensen

Proc. IEEE Int. Conf. Robot. Autom., Barcelona, Spain, 2005 pp. 30–35

23. Interaction between hand and wearable camera in 2-D and 3-D environments

W. Mayol, A. Davison, B. Tordoff, N. Molton, D. Murray

presented at the Brit. Mach. Vision Conf., Brit. Mach. Vision Conf. 2004

24. Real-time localisation and mapping with wearable active vision

A. Davison, W. Mayol, D. Murray

Proc. IEEE ACM Int. Symp. Mixed Augmented Reality 2003 pp. 18–27

25. Semi-automatic annotations in unknown environments

E. E. G. Reitmayr, T. Drummond

Proc. IEEE ACM Int. Symp. Mixed Augmented Reality Nara Japan, 2007-11 pp. 67–70

26. Towards simultaneous recognition, localization and mapping for hand-held and wearable cameras

R. O. Castle, D. J. Gawley, G. Klein, D. W. Murray

Proc. Int. Conf. Robot. Autom. Roma Italy 2007 pp. 4102–4107

27. Discovering planes and collapsing the state space in visual SLAM

A. P. Gee, D. Chekhlov, W. Mayol, A. Calway

presented at the Brit. Mach. Vision Conf., Proc. Brit. Mach. Vision Conf. 2007

28. Estimation with Applications to Tracking and Navigation

Y. Bar-Shalom, T. Kirubarajan, X. Li

New York
Estimation with Applications to Tracking and Navigation Wiley 2002

29. MonoSLAM: Real-time single camera SLAM

A. Davison, I. Reid, N. Molton, O. Stasse

IEEE Trans. Pattern Anal. Mach. Intell., Vol. 29, issue (6) pp. 1052–1067 2007-06

30. A stochastic map for uncertain spatial relationships

R. Smith, M. Self, P. Cheeseman

presented at the Int. Symp. Robot. Res., Int. Symp. Robot. Res. 1987

31. Simultaneous localisation and mapping (SLAM): Part I—The essential algorithms

H. Durrant-Whyte, T. Bailey

IEEE Robot. Autom. Mag., Vol. 13, issue (2) pp. 99–110 2006-06

32. Simultaneous localisation and mapping (SLAM): Part II—State of the art

T. Bailey, H. Durrant-Whyte

IEEE Robot. Autom. Mag., Vol. 13, issue (3) pp. 108–117 2006-09

33. Unified inverse depth parametrization for monocular SLAM

J. Montiel, J. Civera, A. Davison

Proc. Robot.: Sci. Syst. Conf. 2006 pp. 16–19

35. On Kalman filtering with nonlinear equality constraints

S. Julier, J. LaViola

IEEE Trans. Signal Process., Vol. 55, issue (6) pp. 2774–2784 2007-06

36. A random sampling strategy for piecewise planar scene segmentation

A. Bartoli

Comput. Vision Image Understanding, Vol. 105, issue (1) pp. 42–59 2007

37. Estimating the jacobian of the singular value decomposition: Theory and applications

Proc. Euro Conf. Comput. Vision 2000 pp. 554–570

38. Inverse depth to depth conversion for monocular SLAM

J. Civera, A. Davison, J. Montiel

presented at the Int. Conf. Robot. Autom., Int. Conf. Robot. Autom. 2007

39. Consistency of the EKF-SLAM algorithm

T. Bailey, J. Nieto, J. Guivant, M. Stevens, E. Nebot

Proc. Int. Conf. Intell. Robots Syst. 2006 pp. 3562–3568

40. Machine learning for high-speed corner detection

E. Rosten, T. Drummond

Proc. Euro Conf. Comput. Vision 2006 pp. 430–443

41. Ninja on a plane: Automatic discovery of physical planes for augmented reality using visual SLAM

D. Chekhlov, A. P. Gee, A. Calway, W. Mayol

Proc. Int. Symp. Mixed Augmented Reality (ISMAR) Nara Japan 2007 pp. 153–156

42. Scalable SLAM building conditionally independent local maps

P. Piniés, J. Tardós

presented at the IEE/RSJ Int. Conf. Intell. Robots Syst., IEE/RSJ Int. Conf. Intell. Robots Syst., 2007

## Cited By

No Citations Available

## Keywords

### IEEE Keywords

No Keywords Available

### INSPEC: Controlled Indexing

Kalman filters, SLAM (robots), covariance analysis

### More Keywords

No Keywords Available

No Corrections

## Media

Video

18,744 KB
Video

8,430 KB
Video

6,429 KB
Video

### simSmallMapPlanes

3,294 KB
This paper appears in:
IEEE Transactions on Robotics
Issue Date:
OCTOBER 2008
On page(s):
980 - 990
ISBN:
1552-3098
Print ISBN:
N/A
INSPEC Accession Number:
10291076
Digital Object Identifier:
10.1109/TRO.2008.2004641
Date of Current Version:
31 Oct, 2008
Date of Origin al Publication:
10 Oct, 2008

Barbosa, L.C.