VISUAL simultaneous localization and mapping (SLAM) aims to use cameras to recover a representation of the environment that concurrently enables accurate pose estimation. Cameras are powerful additions to the range of sensors available for robotics, not only because of their small size and cost, but also for their ability to provide a rich set of world features, such as color, texture, motion, and structure. In mobile robotics, vision has proved to be an effective companion for more traditional sensing devices. For example, in [1], where vision is coupled with laser ranging, the visual information is reliable enough to perform loop closure for an outdoor robot in a large section of an urban environment. Indeed, such is the utility of vision that cameras have also been used as the main or only SLAM sensor, as demonstrated in the seminal work of Davison [2], [3], [4]. This enables applications for a wide range of moving platforms, where the main mapping sensor could be a single camera, such as small domestic robots and a variety of personal devices. For these, visual SLAM offers the fundamental competence of being *spatially aware*, a necessity for informed motion and task planning.

### A. Motivation

Although there have been significant advances in visual SLAM, these have been primarily concerned with localization. In contrast, progress in developing the mapping output has been somewhat neglected. Systems have been based on maps consisting of sparse, isolated features, predominantly 3-D points [2], [5], [6], and also 3-D edges [7], [8], [9] and localized planar facets [10]. Such minimal representations are attractive from the *localization* point of view. With judicious choice of features, image processing can be kept to a minimum, enabling real-time operation while maintaining good tracking performance. Recent work has further improved on localization to give better robustness by adopting different frameworks such as factored sampling [11], by using more discriminative features [6], [12], or by incorporating relocalization methods for when normal tracking fails [5]. While this has certainly improved performance significantly, the information contained in the resulting maps is still very limited in terms of the representation of the surrounding environment. In this paper, we seek to address this issue by investigating mechanisms to discover, and then, incorporate higher level map structure, such as lines and surfaces, in visual SLAM systems.

### B. Related Work

The detection of higher level structure has been considered before, using both visual and nonvisual sensors. In most cases, this happens *after* the map has been generated. For example, in [13] and [14], planes are fitted to clouds of mapped points determined from laser ranging. However, the discovered structures are not incorporated into the map as new features, and so do not benefit from correlation with the rest of the map and must be refitted whenever the underlying map changes.

In other examples, higher level features are recovered *before* they are added into the map, such as those described in [15], [16], [17], where 2-D geometric primitives are fitted to sonar returns and then used as features, and [18], [19] where 3-D planes are fitted to laser range data. The method described in [20] and [21] takes this approach a stage further and fuses measurements from a laser range finder with vertical lines from a camera to discover 2-D lines, semiplanes, and corners that can be treated together as single features. In [22], visual map features, such as lines, walls, and points, are represented in a *measurement subspace* that can vary its dimensions as more information is gathered about the feature.

In vision-based SLAM, maps have been enhanced by virtue of users providing annotations [23], [24], manually initializing regions with geometric primitives [25], or by embedding and recognizing known objects [26]. These systems require user intervention, and importantly, the structures are not able to grow to include other features as the map continues to expand.

### C. Contribution

In contrast to the previous systems, we propose a system in which higher level structure is automatically introduced *concurrently* with SLAM operation, and importantly, the newly discovered structures become part of the map, coexisting seamlessly with existing features. Moreover, this is done within a single framework, maintaining full cross covariance between the different feature types, thus enabling consistent update of feature parameters. This is achieved by augmenting the system state with parameters representing higher level structures in a manner similar to that described in [20] and [21]. The novelty of the approach is that we use existing features in the map to infer the existence of higher level structure, and then, proceed to “fold in” subsets of lower order features that already exist in the system state and are consistent with the parameterization. This collapses the state space, reducing redundancy and computational demands, as well as giving a higher level scene description. The approach therefore differs both in the timing at which higher level structure is introduced and in the way the newly discovered structure is maintained and continually extended.

The paper is organized as follows. We first introduce a generic SLAM framework, based around Kalman filtering, in which higher level structure can be incorporated into a map concurrently with localization and mapping. This is followed by details of how this can be achieved for the cases of planes and lines, based on collapsing subsets of 3-D points and edgelets, respectively. Results of experiments in simulation and for real-time operation using a handheld camera within an office environment are then presented, including an analysis of the different tradeoffs within the formulation. The paper concludes with a discussion on the directions for future work. Initial findings of the work were previously reported in [27].

## II. Visual SLAM With Higher Level Structure

We assume a calibrated camera moving with agile motion while viewing a static scene. The aim is to estimate the camera pose while simultaneously estimating the 3-D parameters of structural features, such as points, lines, or surfaces. This is done using a Kalman filter framework. The system state **x** = [**v**,**m**]^{T} has two partitions: one for the camera pose **v** = [**q**,**t**]^{T} and one for the structural features **m** = [**m**_{1},**m**_{2},…]^{T}. The camera rotation is represented by the quaternion **q** and **t** is its 3-D position vector, both defined w.r.t. a world coordinate system. We adopt a generalized framework for the structural partition to allow multiple feature types to be mapped. Each **m**_{i} denotes a parameterization of a feature, for example, a position vector for a 3-D point or the position and orientation of a planar surface. In general, we expect the number and the types of features to vary as the camera explores the environment, with new features being added and old ones being removed so as to minimize computation and maintain stability.

### A. Kalman Filter Formulation

The Kalman filter requires a *process model* and an *observation model*, encoding the assumed evolution of the state between time steps and the relationship between the state and the filter measurements, respectively [28]. Application of the filter to real-time visual SLAM is well documented (see, e.g., [2], [29], and [6]); we, therefore, provide only a brief outline here. We assume a constant position model for the camera motion, i.e., a random walk, and since we are dealing with a static scene, this gives the following process model:
$${\bf x}^{\rm new} ={\bf f}({\bf x},{\bf w})=[\Delta {\bf q}({\bf w}_\omega)\otimes{\bf q},\;{\bf t}+{\bf w}_\tau,\;{\bf m}]^T \eqno{\hbox{(1)}}$$

where **w** = [**w**_ω,**w**_τ]^{T} is a 6-D noise vector, assumed to be drawn from **N**(0,**Q**), Δ**q**(**w**_ω) is the incremental quaternion corresponding to the Euler angles defined by **w**_ω, and ⊗ denotes quaternion multiplication. Note that the map parameters are predicted to remain unchanged (we assume a static scene) and that the nonadditive noise component in the quaternion part is necessary to give an unbiased distribution in rotation space.
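As an illustration, the prediction step of (1) can be sketched as follows. This is a minimal Python sketch, not the authors' implementation; the `[w, x, y, z]` quaternion convention and the angle-axis form of the quaternion increment (used here in place of the paper's Euler-angle parameterization) are our assumptions.

```python
import numpy as np

def quat_mult(a, b):
    # Hamilton product of quaternions stored as [w, x, y, z]
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def delta_quat(w_omega):
    # Incremental rotation quaternion from a small rotation-vector noise sample
    angle = np.linalg.norm(w_omega)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = w_omega / angle
    return np.concatenate(([np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis))

def process_model(q, t, m, w_omega, w_tau):
    # Eq. (1): random-walk camera pose; the map partition m is untouched
    return quat_mult(delta_quat(w_omega), q), t + w_tau, m
```

With zero noise the state is simply carried forward, which is what the constant-position model prescribes.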

At each time step, we collect a set of measurements from the next frame obtained from the camera, **z** = [**z**_{1},**z**_{2},…]^{T}, where, in general, each **z**_{i} will be related to the state via a separate measurement function, i.e., **z**_{i} = **h**_{i}(**x**,**e**), where **e** is a multivariate noise vector from **N**(0,**R**). For example, when mapping point features, the measurements are assumed to be corrupted 2-D positions of projected points, and for the *i*th measurement

$${\bf h}_i({\bf x},{\bf e})=\Pi({\bf y}({\bf v},{\bf m}_j))+{\bf e}_i \eqno{\hbox{(2)}}$$

where **m**_{j} is the 3-D position vector of the point associated with the measurement **z**_{i}, **y**(**v**,**m**_{j}) denotes this position vector in the camera coordinate system, Π denotes pinhole projection for a calibrated camera, and **e**_{i} is a 2-D noise vector from **N**(0,**R**_{i}).
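The noise-free part of (2) can be sketched as below. This assumes a normalized calibrated camera (unit focal length, zero principal point) and a quaternion giving the camera-to-world rotation; both choices are ours, not fixed by the paper.

```python
import numpy as np

def quat_to_rot(q):
    # Rotation matrix (camera-to-world) from a unit quaternion [w, x, y, z]
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def observe_point(v_q, v_t, m_j):
    # Eq. (2) without noise: y(v, m_j) = R^T (m_j - t) is the point in the
    # camera frame; Pi is normalized pinhole projection
    y = quat_to_rot(v_q).T @ (m_j - v_t)
    return np.array([y[0] / y[2], y[1] / y[2]])
```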

Application of the filter at each time step gives mean and covariance estimates for the state based on the process and observation models. Both of these are nonlinear, and thus, we use the extended Kalman filter (EKF) formulation, in which updates are derived from linear approximations based on the Jacobians of **f** and **h**_{i}[28]. The role of the covariances within the filter is critical. Proper maintenance ensures that updates are propagated among the state elements, particularly the structural components, which are correlated through their mutual dependence on the camera pose [30], [31]. This ensures consistency and stability of the filter. Propagation of the state covariance through the observation model also allows for better data association, constraining the search for image features. As discussed next, it is therefore crucial to ensure that covariances are properly maintained for a successful filter operation.

### B. Augmenting and Collapsing the State

We adopt a bottom-up process in order to incorporate higher level structure into a map. This works as follows. Given a map populated by isolated lower level features, we seek subsets that support the existence of higher level structure such as lines or surfaces. With sufficient support, a parameterization of the higher level feature is then augmented to the state with an initial estimate derived from the supporting features. Subsequently, lower level features consistent with the parameterization are linked to that feature and transformed according to the constraint that it imposes. This results in a collapsing of the state space, reducing redundancy. Details of how this is achieved in practice are given in Section III.

A key aspect of the work described in this paper is that the earlier process is implemented in a rigorous manner within the filter. For this, we draw on the procedures for transforming and expanding states [32], which also form the basis for initializing isolated features in existing systems [33]. Augmentation and collapsing of the state is done in such a way as to preserve full correlation among existing and new features within the map. This ensures that parameter updates are carried out in a consistent fashion. It can be formalized as follows. Assume that we have an existing state estimate with *n* features in the map, and that we wish to augment the state with a new (higher level) feature **m**_{n+1} based on an initialization measurement **z**_{o}. In general, the initial estimate for the feature will be derived from a combination of the measurement and the existing state, i.e., **m**_{n+1} = **s**(**x**,**z**_{o}), and the augmented state covariance then becomes
$$\eqalignno{{\bf P}^{\rm new} &= {\bf J}\left[\matrix{{\bf P} & {\bf 0} \cr {\bf 0} & {\bf R}_{o}}\right]{\bf J}^T &{\hbox{(3)}}\cr {\bf J} &= \left[\matrix{{\bf I} & {\bf 0} \cr \nabla{\bf s}_{{\bf v}}\;\;\nabla{\bf s}_{{\bf m}_1}\;\;\ldots\;\;\nabla{\bf s}_{{\bf m}_n} & \nabla{\bf s}_{{\bf z}_o}}\right] &{\hbox{(4)}}}$$

where **R**_{o} is the covariance of the measurement and ∇**s**_{v} = ∂**s**/∂**v**. Thus, the covariance matrix grows, as illustrated in Fig. 1, introducing the important correlations between the new feature and the existing features in the map. For example, in the case of augmentation with a new planar feature derived from a subset of point features, only the relevant Jacobians ∇**s**_{m_i} will be nonzero and the feature will be correlated with the camera pose through that subset of points.

In a similar way, we can also transform an existing feature to a different representation, e.g., **m**_{i}^{new} = **r**(**x**), and then, the covariance update is given by
$$\eqalignno{{\bf P}^{\rm new} &= {\bf J}{\bf P}{\bf J}^T &{\hbox{(5)}}\cr {\bf J} &= \left[\matrix{{\bf I} & {\bf 0} & {\bf 0} \cr \nabla{\bf r}_{{\bf v}}\;\ldots & \nabla{\bf r}_{{\bf m}_i} & \ldots\;\nabla{\bf r}_{{\bf m}_{n}} \cr {\bf 0} & {\bf 0} & {\bf I}}\right] &{\hbox{(6)}}}$$

where any change in the dimensions of the state is reflected in the dimensions of the Jacobian **J**.

Both of the previous relationships allow us to introduce higher level features and collapse the state size while maintaining the cross correlations within the filter. For example, as illustrated in Fig. 1, consider a new feature **m**_{n+1} that represents a higher level structure and imposes a constraint on a subset of existing features (a line and a set of points, respectively, in Fig. 1). Given an initial estimate of **m**_{n+1}, we can introduce it into the system using (3) and the constraint can then be imposed by transforming the existing features into the new constrained representation using (5). This collapses the state by removing redundant constrained variables, while also maintaining the correlations between the new features, the camera pose, and the remainder of the map.

## III. Building in Higher Level Structure

In practice, building higher level structure into a system map has three distinct components: *discovery*—determining subsets of isolated features that support the premise that a higher level structure is present; *incorporation*—initializing a parameterization of the higher level feature into the system state and linking existing features to it; and *measurement*—deriving measurements for the higher level feature in order that it and its linked features can be updated within the filter.

### A. Discovery

The discovery component is essentially a segmentation task among existing features in a map, and we adopt a random sample consensus (RANSAC) approach [34]. This involves generating hypotheses from minimal sets of features randomly sampled from the subset of features having sufficient convergence, i.e., with variance less than some suitable threshold σ_{RANSAC}^{2}. Each hypothesis is then tested for consensus with the rest of the set. Thus, given a minimal set of features, (**m**_{1},**m**_{2},…,**m**_{K}), we generate the parameters for a higher level feature via **p** = **g**(**m**_{1},**m**_{2},…,**m**_{K}), and a feature **m**_{i} is then deemed to be in consensus with the hypothesis if a distance measure *d*(**p**,**m**_{i}) from the hypothesized feature is less than some threshold *d*_{RANSAC}. The choice of thresholds here is important and essentially controls the system's willingness to introduce new structure into the map. We discuss this further and analyze the tradeoffs in Section V.

In practice, RANSAC search is not performed over the global map but uses a small local set of the *N* most recently observed features. At the beginning of each search, a random feature from this set is chosen as a base reference and used to temporarily transform the other features in the set and discover their relative uncertainties using the technique described in [17]. The local covariances of the features are used for thresholding, allowing the discovery of local structure even when the global uncertainty is large. We use a standard implementation of RANSAC and do not currently use the variance information within the algorithm itself, although this could be incorporated in the future, e.g., by temporarily transforming features prior to a Mahalanobis distance consensus test by using the hypothesized higher level features as a base frame of reference.
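The hypothesize-and-verify loop can be sketched for the planar case (anticipating Section IV-A). This is plain RANSAC over point positions, ignoring the per-feature covariances, matching the current implementation described above; the function name and iteration count are our own.

```python
import numpy as np

def ransac_plane(points, d_ransac, n_hyp=200, rng=None):
    # Hypothesize planes from minimal 3-point samples; score each by
    # perpendicular-distance consensus and keep the best inlier mask.
    rng = np.random.default_rng(0) if rng is None else rng
    best = np.zeros(len(points), dtype=bool)
    for _ in range(n_hyp):
        i, j, k = rng.choice(len(points), size=3, replace=False)
        n = np.cross(points[j] - points[i], points[k] - points[i])
        if np.linalg.norm(n) < 1e-9:
            continue  # degenerate (collinear) minimal set
        n = n / np.linalg.norm(n)
        inliers = np.abs((points - points[i]) @ n) < d_ransac
        if inliers.sum() > best.sum():
            best = inliers
    return best
```

In the full system this search would run over the local set of the *N* most recently observed features, in the base frame of a randomly chosen reference feature.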

### B. Incorporation

The hypothesis with the most consensus yields a set of inlying features from which “best-fit” parameters can be derived in order to initialize the associated higher level feature. This marks the start of the incorporation component.

First, a χ^{2} similarity test with 95% bounds is performed to ensure that the proposed higher level structure is not too similar to any existing structures in the system. If the test is passed, then the state is augmented with the initial parameter estimates and the covariance is expanded according to (3), where **s**(**x**,**z**_{o}) denotes the derivation of the parameters from the set of inlying features. Note that since the parameters are derived only from the inliers, both **z**_{o} = **0** and ∇**s**_{v} = **0**, and we need only compute the Jacobians ∇**s**_{m_i}. This ensures that the parameters are correlated with the rest of the SLAM state through the inlying features.

Having initialized a higher level feature, the system then seeks to link existing features to it and transform them according to the constraint that it imposes. While we could immediately incorporate all of the inlying features into the new structure, we choose not to do this because the current RANSAC search does not fully take into account the uncertainty of the features, as discussed in Section III-A. Instead, we gradually add features to the structure, adopting the following procedure for each feature in the map (or a local subset of features around the new structure).

1) *Transform* the feature temporarily using the new structure as a base reference [17] to give its state and covariance relative to the new structure.

2) *Reject* the feature if σ_{max}^{2} > σ_{T}^{2}, where σ_{max} is the maximum standard deviation of the feature relative to the structure and σ_{T} is a threshold standard deviation.

3) *Reject* the feature if *d*(**p**,**m**_{i}) > *d*_{T}, where *d*(**p**,**m**_{i}) is a measure of the distance from the structure, e.g., the RANSAC consensus distance measure, and *d*_{T} is a distance threshold.

4) *Accept* the feature if its Mahalanobis distance from the structure is less than the 95% χ^{2} threshold.
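The gating sequence can be sketched as follows. The function name and scalar inputs are our own; `chi2_95` stands for the 95% χ² value for the feature dimension (e.g., 5.99 for 2 DOF), and the uncertainty and distance values are assumed to be precomputed in the structure's base frame.

```python
def accept_feature(sigma_max_sq, dist, maha_sq,
                   sigma_t_sq, d_t, chi2_95):
    # Gating sequence: reject on relative uncertainty, reject on
    # distance from the structure, then accept on a Mahalanobis test
    if sigma_max_sq > sigma_t_sq:
        return False          # feature too uncertain relative to structure
    if dist > d_t:
        return False          # too far from the structure
    return maha_sq < chi2_95  # statistically consistent with the structure
```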

Features accepted for incorporation are transformed into a new representation and the state covariance is updated according to (5). This typically results in a collapsing of the state due to the constraint imposed by the higher level feature. For example, as described in the next section, 3-D point features linked to a planar feature are transformed into 2-D points lying on the plane.

It is important to emphasize that once the higher level feature is added to the map, we are able to continue seeking new features to link to it, even if they were not initially used to estimate the higher level feature or were added to the map only after its incorporation.

### C. Measurement

The parameters for both the higher level feature and its linked features are updated in the usual manner within the filter. To do so, we require an observation function, linking the parameters to measurements obtained from the incoming image frames. The details of this function depend on the type of higher level feature, and specific examples are given in Sections IV-A4 and IV-B4 for the cases of planes and lines, respectively. Due to the cross-covariance terms introduced between features and the higher level structure, measurement of one will necessarily update the other. Therefore, it is possible to update the higher level feature even if we are unable to take direct measurements of it.

## IV. Planes and Lines

In this section, we describe the specific cases of discovering and incorporating planar and straight line structure within maps consisting of isolated 3-D points and edgelets, respectively. The mechanisms closely follow the steps described in Section III. We, therefore, only note the key differences and implementation details here.

### A. 3-D Planes From Points

In the case of planes, we aim to discover the location, extent, and orientation of planar structure formed by 3-D points being concurrently mapped. Once a plane has been initialized, 3-D points deemed to be consistent with it are collapsed into 2-D planar points, reducing the state size, and may even have their location on the plane fixed so that the point can be completely removed from the state. However, note that since we are interested in reducing redundancy in the system, we do not at present verify that the discovered planes correspond to physical planes in the scene (cf. Section VII).

*Representation*

Since we want to use the plane as a coordinate frame for 2-D points that are “folded into” the plane, we need to parameterize the orientation of the plane around its normal direction as well as its origin. For convenience, we use a nonminimal representation with a 9-D state vector so that a new planar feature augmented to the state has the form **m**_{n+1} = [**p**_{o},**c**_{1},**c**_{2}]^{T}, where **p**_{o} is the plane origin and the orientation is defined by two orthonormal basis vectors, **c**_{1} and **c**_{2}, which lie on the plane. The normal to the plane is then simply the cross product of the two basis vectors, i.e., **n** = **c**_{1}×**c**_{2}. A 3-D point in the map that is deemed to lie on the plane and whose feature vector **m**_{i} defines its 3-D position vector can then be transformed into a 2-D planar point using
$${\bf m}_i^{\rm new} = [({\bf m}_{i}-{\bf p}_o) \cdot {\bf c}_1,\;\; ({\bf m}_{i}-{\bf p}_o) \cdot {\bf c}_2]^T \eqno{\hbox{(7)}}$$

where · denotes the dot product. Thus, for *l* such points on a plane, this gives a state size of 9+2*l*, compared with 3*l*, giving a reduction in state size for *l* > 9. In the filter, the orthonormality constraint on the basis vectors is applied after every update by displacing the mean to satisfy the constraint and updating the covariance to reflect this displacement (cf. the “displOff” algorithm in [35]).

*Discovery*

The discovery of potential planar structure using the RANSAC process is similar in approach to that used in [36]. We require a minimum of three 3-D points to generate the parameters for a plane hypothesis, i.e.,
$${\bf p}_o={\bf m}_{1} \;\;\; {\bf c}_1={\bf m}_{2}-{\bf m}_{1} \;\;\; {\bf c}_2= {\bf m}_{3}-{\bf m}_{1}. \eqno{\hbox{(8)}}$$

Consensus for the hypothesis among other points is then determined using a distance measure based on the perpendicular distance from the plane, i.e., for a point **m**_{i}, the distance *d* = (**m**_{i} − **p**_{o})·**n**.

We also impose the additional criterion that the Euclidean distance from the plane origin must be less than a second threshold *d*_{max}, which ensures that we only initialize planes with strong local support. This biases the system toward introducing planes that are likely to match local physical structure, rather than fitting a geometric plane through a disperse set of points. This is beneficial because it is reasonable to assume that a physical plane will contain other, currently undiscovered, coplanar points that can be “folded in” to the plane in the future.

To determine the best-fit plane from the inliers resulting from the RANSAC process, we use a principal component approach similar to [19]. The origin is set to the mean and the orientation parameters are determined from the principal components. Specifically, if the position vectors for *l* inlying points w.r.t. the mean are stacked into an *l*× 3 matrix **M**, then the eigenvector of **M**^{T}**M** corresponding to the smallest eigenvalue gives the normal to the plane and the other two eigenvectors give the basis vectors within the plane. The smallest eigenvalue λ_{min} is the variance of the inliers in the normal direction and provides a convenient measure of the quality of the fit. It is possible for this approach to return poorly conditioned eigenvectors if the system is degenerate; this case can be detected, and the plane rejected, by checking for pairs of similar eigenvalues.
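The principal component fit can be sketched using `numpy.linalg.eigh`, which returns eigenvalues in ascending order. The 10% similarity tolerance for the degeneracy check is our assumption, and the sign/handedness of the returned basis is left unconstrained in this sketch.

```python
import numpy as np

def fit_plane_pca(points):
    # Best-fit plane: origin at the mean; eigenvectors of M^T M give the
    # plane basis, with the smallest-eigenvalue eigenvector as the normal
    p_o = points.mean(axis=0)
    M = points - p_o
    evals, evecs = np.linalg.eigh(M.T @ M)  # ascending eigenvalues
    n = evecs[:, 0]                          # normal direction
    c1, c2 = evecs[:, 2], evecs[:, 1]        # in-plane basis vectors
    lam_min = evals[0]                       # fit-quality measure
    degenerate = np.isclose(evals[0], evals[1], rtol=0.1)
    return p_o, c1, c2, n, lam_min, degenerate
```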

*Incorporation*

In order to avoid adding poor estimates of planes to the system state, the best-fit plane generated by the RANSAC process is only initialized if *l* > *l*_{T} and λ_{min} < λ_{T}. The best-fit parameters are used to initialize a plane feature in the state and the covariance is updated according to (3), where the Jacobian w.r.t. an inlying point feature **m**_{i} is computed as
$$\nabla{\bf s}_{{\bf m}_i} =\left[\matrix{\displaystyle{{\partial{\bf p}_{o}}\over{\partial{\bf m}_{i}}},\quad {{\partial{\bf c}_1}\over{\partial{\bf m}_{i}}},\quad {{\partial{\bf c}_2}\over{\partial{\bf m}_{i}}}}\right]^T \eqno{\hbox{(9)}}$$

where ∂**c**_{1}/∂**m**_{i} and ∂**c**_{2}/∂**m**_{i} are the Jacobians of the two eigenvectors computed using the method described in [37]. This ensures that the plane parameters are correlated with the rest of the state through the inlying point features.

As an additional criterion for incorporation of the point onto the plane, we reject the point if its distance from the plane origin or any other point on the plane is greater than *d*_{max}. Again, this is to bias the system toward accepting local planes.

*Measurement*

Here, we do not consider making a direct observation of the plane. Instead, we base the plane measurement on observations of the 2-D points linked to the plane. The measurement model for a 2-D planar point **m**_{i} is very similar to that of a standard 3-D point feature, but involves an additional preliminary step to convert it to a 3-D position vector **p** in the world frame of reference prior to projection into the camera, i.e., from (7), we have
$${\bf p}=\left[{\bf c}_1 \;\; {\bf c}_2\right]{\bf m}_i \;+\; {\bf p}_o. \eqno{\hbox{(10)}}$$

The predicted measurement for the planar point can then be obtained by passing **p** through the standard perspective projection model in (2). The similarity between the measurement models makes the implementation of planar points in the considered EKF SLAM system very simple. The measurement Jacobian required by the EKF is almost the same as that for 3-D point features, except that we need to take account of the conversion in (10) when computing the Jacobian relating the observation to the 2-D point feature, i.e.,
$${{\partial{\bf h}_{i}}\over{\partial{\bf m}_{i}}}={{\partial{\bf h}_{i}}\over{\partial{\bf p}}}\;{{\partial{\bf p}}\over{\partial{\bf m}_{i}}}\eqno{\hbox{(11)}}$$

where **h**_{i} is the measurement function associated with a 3-D point (2) and ∂**p**/∂**m**_{i} is the Jacobian of the 3-D point w.r.t. the 2-D planar point derived from (10). Finally, since the predicted observation is also dependent on the position and orientation of the plane, a Jacobian ∂**h**_{i}/∂**m**_{j} relating **h**_{i} to the relevant plane feature **m**_{j} is also derived from (10).

*Fixing Plane Points*

If the maximum of the variances of a 2-D planar point becomes very small relative to the plane, i.e., less than a threshold σ_{fix}, then we can consider further collapsing the state space by removing the point completely from the state. Instead of maintaining an estimate of its 2-D position on the plane, we consider it as a fixed point in the plane and use its measurements to update the associated planar feature directly. This leads to a further reduction in the size of the state space.

### B. 3-D Lines From Edgelets

Another natural option for enhancing scene description beyond sparse points is to use edges. In this case, we consider edgelet features, proposed in [7], which are locally straight sections of a gradient image. An edgelet is composed of a point in 3-D and its direction. By their nature, edgelets can be initialized along a variety of edge shapes by assuming that even curved edges can be approximated by locally straight sections. Here, we are interested in discovering straight lines formed from collections of edgelets and performing a reduction in state size similar to that of points incorporated into planes.

*Representation*

We parameterize an edgelet in 6-D, composed of its 3-D position **c**_{e} and an unnormalized 3-D direction vector **d**_{e}. Similarly, a 3-D line feature is defined by **m**_{n+1} = [**c**_{l},**d**_{l}]^{T}, where **c**_{l} denotes its origin point and **d**_{l} is its direction.

*Discovery*

In the RANSAC discovery component, we use pairs of edgelets to generate line hypotheses, based on their 3-D position vectors, which we found to give more reliable results than using a single edgelet. The consensus distance measure is then based on the perpendicular distance of an edgelet's 3-D position from the line and the difference in its 3-D orientation. Principal component analysis of the inlier set with the most consensus then provides an initial estimate of the line direction, and the mean 3-D position of the inliers is taken as the line origin.

*Incorporation*

After a line is added to the state, edgelets deemed to lie on the line are linked to it and transformed into 1-D line points parameterized by their distance from the line origin. These 1-D edgelets can then be removed from the state, reducing the state size dramatically. Transforming directly from the 6-D edgelet representation to a fixed, 1-D representation on the line makes sense because the remaining degree of freedom, which allows the edgelet to move along the line, is not observable in the edgelet measurements [7], which only measure the perpendicular offset from the line. Also, the edgelet does not have a descriptor linking it to a specific point on the line, so small errors in position along the line will have no effect on the performance of the system. This is in marked contrast to the case of points on a plane, where each point measurement is associated through a visual descriptor to a specific feature in the image. To transform the edgelet, we simply project it onto the line along the perpendicular direction.
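The perpendicular projection reduces to a single dot product: the 1-D parameter is the signed distance of the edgelet's position along the normalized line direction. A sketch (function name is ours):

```python
import numpy as np

def edgelet_to_line_point(c_e, c_l, d_l):
    # Perpendicular projection of the edgelet position c_e onto the line
    # (c_l, d_l): returns the signed distance along the line from c_l
    u = d_l / np.linalg.norm(d_l)
    return float((c_e - c_l) @ u)
```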

*Measurement*

In contrast to the case with planes, we do not pass all of the individual observations of the line points into the EKF. Instead, a single observation for the line is computed by projecting each of its fixed edgelets into the image plane (with 3-D orientation set to that of the line) and applying a RANSAC line-fitting method to robustly fit a single line to individual edgelet measurements. The direction and position of this single line in the image is passed into the EKF as the observation of the 3-D line. Measurement Jacobians for the EKF are then computed in a similar manner to (11).

This approach has two distinct benefits: 1) the number of observations that the EKF has to process each frame is significantly reduced and 2) the RANSAC line fitting ensures that we only accept mutually compatible measurements of the edgelets along the line, similar to the mutual compatibility test between the features of a given landmark in [21].

### A. Methodology

We have analyzed the approach using two different simulated environments. The simulations model a 6-DOF camera moving through space with an 81° horizontal field of view (FOV) and an image size of 320 × 240 pixels. The simulated data have been used to measure the system consistency and the accuracy of the tracking and reconstruction against the known ground truth. In each case, the simulated environment contains a small set of known landmarks that are initially visible from the camera. This “template” provides a known scale for the map.

*Consistency*

The consistency of the system is measured using the *normalized estimation error squared* (NEES) [28] averaged over multiple Monte Carlo runs of the simulations. If the system is consistent, then we expect this average value to converge toward the dimension of the system state, which we can test using a χ^{2} test with 95% bounds. An average value above the upper bound indicates that the system is optimistic, a case that we particularly want to avoid because it reduces the positive effects of loop closure and prevents successful data association if we use feature innovation to constrain the search for image features.
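The consistency test can be made concrete with a small sketch. For illustration only, we assume a scalar (or diagonalized) state, so that each NEES value reduces to a squared error over a variance, and we use the Wilson–Hilferty approximation in place of exact chi-square quantiles; function names are ours.

```python
import math

Z975 = 1.959964  # standard normal 97.5% quantile

def chi2_quantile(k, z):
    """Wilson-Hilferty approximation to the chi-square quantile with
    k degrees of freedom at standard-normal quantile z."""
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + z * math.sqrt(c)) ** 3

def average_nees(errors, variances):
    """Average NEES over N Monte Carlo runs of a 1-D state:
    e_i = err_i**2 / P_i.  Returns the average together with the 95%
    acceptance interval [chi2_{N}(0.025)/N, chi2_{N}(0.975)/N]; an
    average above the upper bound flags an optimistic filter."""
    n = len(errors)
    nees = [e * e / p for e, p in zip(errors, variances)]
    avg = sum(nees) / n
    lo = chi2_quantile(n, -Z975) / n
    hi = chi2_quantile(n, Z975) / n
    return avg, lo, hi
```

For a consistent 1-D estimator the average should converge toward 1 (the state dimension) and stay inside the interval; with 50 runs, the 95% bounds evaluate to roughly [0.65, 1.43].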

*State Size Reduction*

The state size reduction from adding higher level features is expressed as a percentage of the maximum reduction we could achieve, if all of the simulated features initialized on ground-truth higher level structures were added correctly to their higher level structure by the system. For example, in a simulation of a set of points initialized to lie on a plane, we would achieve 100% of the maximum state size reduction, if all of these points were successfully added to one discovered plane in the system map.
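The metric can be stated as a one-line helper (the function name and figures are ours, for illustration):

```python
def reduction_pct(achieved_reduction, max_reduction):
    """Achieved state-size reduction as a percentage of the maximum
    possible reduction, i.e., the reduction obtained if every feature
    initialized on a ground-truth structure were correctly absorbed
    into its higher level structure."""
    return 100.0 * achieved_reduction / max_reduction
```

For example, if absorbing all plane points could remove at most 50 state dimensions and the system in fact removes 40, the score is `reduction_pct(40, 50)`, i.e., 80%.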

### B. Small Map Simulation

The first simulation (Table I and Figs. 3 and 4) demonstrates the improvement in mean absolute error (MAE) of the map and reduction in state size caused by adding higher level structure in a small map with multiple loop closures, as might be typical in a desktop visual SLAM system or within a local map of a submapping SLAM system. The environment contains ground-truth line or point features arranged randomly in a 4-m-long volume and the camera translates along a sinusoidal path in front of them (Fig. 2). The positions of newly observed points and edgelets are initialized as 6-D inverse depth points [33] and converted to 3-D points once their depth uncertainty becomes Gaussian [38]. We assume perfect data association and Gaussian measurement noise of 0.5 pixel^{2}. Thresholds for the planes were set at *d*_{T} = 0.1 cm, σ_{T} = 1.0 cm, σ_{RANSAC} = 2σ_{T}, *d*_{max} = 200 cm, and *l*_{T} = 7. Thresholds for lines were set at *d*_{T} = 0.5 cm, 24°, σ_{T} = 1.0 cm, 24°, σ_{RANSAC} = σ_{T}, *d*_{max} = 120 cm, and *l*_{T} = 4. The other thresholds were set in relation to these first ones as *d*_{RANSAC} = *d*_{T}, λ_{T} = *d*_{T}^{2}, and σ_{fix} = *d*_{T}.
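The inverse-depth conversion used at initialization can be sketched as follows, assuming the standard 6-D parameterization of [33] (anchor position, azimuth, elevation, inverse depth); this is a minimal illustration with our own function name, not the system's code.

```python
import math

def inverse_depth_to_point(x, y, z, theta, phi, rho):
    """Convert a 6-D inverse-depth feature (anchor x, y, z; azimuth
    theta; elevation phi; inverse depth rho) to a Euclidean 3-D point:
    the anchor plus the unit ray m(theta, phi) scaled by depth 1/rho."""
    m = (math.cos(phi) * math.sin(theta),
         -math.sin(phi),
         math.cos(phi) * math.cos(theta))
    return tuple(a + mi / rho for a, mi in zip((x, y, z), m))
```

For example, a feature anchored at the origin looking straight down the optical axis (theta = phi = 0) with rho = 0.5 m^-1 converts to the point (0, 0, 2). In the filter, the conversion is only applied once the depth uncertainty has become well approximated by a Gaussian [38].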

In the line simulations, the values in Table I correspond to a significant mean reduction in state size from 780 to 204. The MAE in orientation and position is also improved by adding lines, as shown in Fig. 3. In particular, the relatively large covariance on edgelet direction measurements and the limited motion of the camera mean that the edgelets do not quickly converge to a good estimate on their own, so applying a line constraint helps to overcome these limitations.

In the plane simulations, we also investigated adding clutter around the plane by positioning 50% of the points randomly within ± 20 cm of the ground-truth plane (Planes B in Table I). When clutter points are prevented from being added to the plane using ground-truth knowledge (Planes A), the results in Fig. 4 show similar characteristics to the line experiments, demonstrating reduction in MAE and state size, and no significant effect on consistency of the camera or the map. When clutter points are allowed to be erroneously added to the plane (Planes B and C), this performance degrades due to the added inconsistency in the map that becomes visible at loop closure (around frame 750, Fig. 4). However, the system still achieves 0.4 cm accuracy over a 4-m map and is capable of respectable state size reductions when points are fixed on the plane.

### C. Room Simulation

In the second simulation, the camera moves through a larger environment with 3-D points randomly arranged on and around the walls of a 4-m square room (see Fig. 5). Clutter distribution and thresholds are left the same as for the previous experiments (cf. Section V-B). The challenge in this simulation is that loop closures occur less regularly and camera orientation error is given time to grow much larger.

When we allow the map to converge before adding higher level structure (Planes E), then consistency and MAE are comparable to points-only operation (no planes), as shown in Table II and Fig. 6. However, we get the additional benefit of reduced state size and improved understanding of the environment. Similar performance is achieved when we add planes to the map but do not incorporate points onto the plane (Planes D): consistency and MAE are preserved, although in this case, the state size increases since planes are added but no points are removed.

Differences in consistency only occur when the system attempts to incorporate points before loop closure (Planes B and C in Fig. 6). In these cases, the MAE is slightly larger but still comparable to the other cases. It seems as though the additional inconsistency is caused by a combination of two factors: adding outlier points to the plane and introducing errors when projecting 3-D points onto the plane to enforce the planarity constraint. We intend to investigate this further in future work.

## VI. Real-World Experiments

### A. Methodology

To illustrate the operation of the approach with real-world data, we present results obtained for plane discovery from points within a real-time visual SLAM system. This operates with a handheld camera, and experiments were carried out in an office environment containing textured planar surfaces. We use a FireWire webcam with frames of size 320 × 240 pixels and a 43° FOV. The system begins by mapping 3-D points, starting from known points on a calibration pattern [2]. As the camera moves away, new points are initialized and added to the map, allowing tracking to continue as the camera explores the scene. Points are added using the inverse depth representation [33] and augmented to the state in a manner that maintains full covariance with the rest of the map [39] (cf. Section II-B). For measurements, we use the features from accelerated segment test (FAST) salient point detector [40] to identify potential features and scale-invariant feature transform (SIFT)-style descriptors with scale prediction [12] for repeatable matching across frames.

### B. Results

Fig. 7 shows results from an example run. On the left are external views of the estimated camera position and map estimates in terms of 3-D points and planes at different time frames. The images on the right show the view through the camera with mapped point and planar features superimposed. Note that the three planes in the scene have been successfully discovered and incorporated into the map. The vast majority of mapped 3-D points were transformed into 2-D points (except those on the calibration pattern, which, for fair comparison, we force to remain as 3-D points). The system operated at around 20 frames/s on a 2.4-GHz PC, including graphics rendering and without software optimization, for maps of up to 50 features and with ten visible features per frame on average.

Fig. 8 shows examples of planar features converging within the SLAM map following initialization. The left image shows the state of the map following the introduction of a planar feature that links the 3-D points mapped on the side wall. Note the thickness of the graphic plane that illustrates the combined uncertainty in the plane position and orientation. As shown in the right image, after further observations, the plane has converged as indicated by the thin graphic. We have used orthographic, rather than perspective, projection to show the clear reduction in uncertainty. The plane discovery here is effective and we have utilized this as the basis for an augmented reality (AR) game demonstration [41].

### C. Computational Demands

For real-time operation, the additional computational overhead of structure discovery becomes important. There are three main sources of this: 1) the initialization of structure using RANSAC; 2) the search for additional features to add to incorporated structure; and 3) the incorporation of new structure into the filter. In the experiments, structure search was run every frame using a set of 40 features, and all features in the map were considered for incorporation into the existing higher level structures. Using a nonoptimized software implementation, we found that the RANSAC initialization formed the bulk of the overhead, requiring around 10 ms per run. However, note that a more intelligent approach, which distributes the searches over multiple frames or runs less frequently or on smaller datasets, for example, would allow computational overhead to be controlled and balanced against the potential savings from incorporating higher level structure into the map. This is also something that we intend to investigate in future studies.
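The frame-distribution idea suggested above can be sketched as a simple round-robin slice of the candidate feature set; this is a hypothetical helper of ours, not part of the system, and real scheduling would need to balance slice size against discovery latency.

```python
def amortized_search_slice(features, frame_idx, period=3):
    """Return the 1/period slice of candidate features to search for
    higher level structure on this frame.  Over `period` consecutive
    frames, every feature is considered exactly once, bounding the
    per-frame RANSAC overhead at the cost of slower discovery."""
    return [f for i, f in enumerate(features)
            if i % period == frame_idx % period]
```

With `period = 3` and 40 candidate features, each frame runs RANSAC on roughly 13 features instead of 40, cutting the per-frame overhead to about a third.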

## VII. Conclusions

We have presented a method for discovering and incorporating higher level map structure, in the forms of planes and lines, in a real-time visual SLAM system. In small maps, introducing higher level structure reduces error and allows us to collapse the state size by combining subsets of low-level features. The higher level structures are able to grow and shrink dynamically as features are added and removed from the map, and provide more meaning to the SLAM map than the standard sparse collections of points. In larger maps, where error is allowed to grow significantly between loop closures, it is still possible to introduce the higher level structures but more care must be taken to avoid inconsistency.

However, in its current form, the method does have limitations. Incorrect incorporation of points into higher level features can have an adverse effect on consistency, and although we found that, in practice, this could be tolerated to some extent, it will need to be addressed if greater robustness is to be achieved. In future work, we intend to make better use of the rich visual information from the camera to improve the consistency and performance of the method. In particular, we currently discover the higher level structures purely from the information contained in the map, but there is a clear opportunity to combine this with information from the visual sensor, giving the potential to reduce the number of features erroneously added to higher level structures.

The current system also uses a number of thresholds, and these have a noticeable effect on the system performance. It would be desirable to reduce the number of thresholds that can be varied, ideally adopting a probabilistic threshold on the likelihood of a higher level structure existing in the map given the information we have about the low-level features. This type of probabilistic approach could also combine information from other sources, such as image information from the camera, before making a decision about whether to instantiate the new structure.

Finally, although high-level structure discovery facilitates useful reduction of the system state size, particularly in small-scale maps, it will not solve the computational problem of SLAM in large-scale environments, which has also been shown to introduce inconsistency into the filter. Submapping strategies may offer an approach to dealing with both of these problems at the same time. It will be interesting to consider how high-level structures could be discovered, incorporated, and managed within a submapped SLAM system, where low-level features belonging to the structure may potentially be distributed across multiple submaps. Recent work [42] examining submapping with conditionally independent submaps may provide a useful approach.

### Acknowledgment

The authors would like to thank the reviewers for their many useful comments and their suggestions for improvements.