SECTION I

## Introduction: State-of-the Art in Visual SLAM

THE INTEREST in using cameras in simultaneous localization and mapping (SLAM) has grown tremendously in recent times. Cameras have become much more inexpensive than lasers, and also provide texture rich information about scene elements at practically any distance from the camera. 6-DOF SLAM systems based on 3-D laser scanners plus odometry have been demonstrated feasible both indoors and outdoors [2], [3], as well as vision aided by laser without odometry [4] and vision aided by an inertial navigation system [5], [6]. But in applications where it is not practical to carry heavy and bulky sensors, such as egomotion for people tracking and environment modeling in rescue operations, cameras seem the only light weight sensors that can be easily adapted to helmets used by rescuers, or simply worn.

Current visual SLAM research has been focused on the use of either monocular or stereo vision to obtain 3-D information from the environment. Quite a few monocular visual SLAM systems have been demonstrated to be viable for small environments [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. Most are essentially standard extended Kalman filter (EKF) SLAM systems, and vary in the technique used to initialize a feature, given the partiality of the bearing only information provided by one camera, or in the type of interest points extracted from the images (be it Harris corners, Shi–Tomasi corners, scale-invariant feature transform (SIFT) features, or some combination). Some works have also considered segment features [17], [18]. Larger environments have been tackled in hierarchical visual SLAM [19].

A single camera is used in all of these systems, and although very distant features are potentially detectable, scale unobservability is a fundamental limitation. Either the scale is fixed in some way(for example, by observing a known object [16]), or drift in scale can occur as is reported in the hierarchical visual SLAM system [19]. Panoramic cameras are also being used in visual SLAM [20], [21]. Here, the limitation of scale unobservability is overcome using an additional stereo vision bench for motion estimation between consecutive frames. In the work of Royer *et at.* [22], only monocular images are used. Mapping is achieved using a batch hierarchical bundle adjustment algorithm to compute all camera as well as interest points locations. The scale is introduced in the system by manually entering the length of the path.

Stereo visual systems provide scale through the baseline between the cameras, known from calibration. Davison and Murray demonstrated the first active stereo visual SLAM system [23], [24], [25]. It is based on standard EKF, and thus, has low scalability also. Under restrictive planar environment assumptions, Iocchi *et al. * built an environment map using stereo [26]. Se *et al. * demonstrated a visual stereo SLAM system using SIFT features in a small laboratory environment [27]. This system is also unlikely to scale adequately to large environments or work in more challenging outdoor scenarios as cross-correlations were neglected for computational reasons. In [28] and [29], the authors demonstrate an autonomous blimp system for terrain mapping using stereo as the only sensor, also using a standard EKF SLAM algorithm. Saez*et al. * [30] presented a 6-DOF stereo visual SLAM system, where egomotion estimation is done by a 3-D point matching algorithm, and mapping through a global entropy minimization algorithm in indoor orthogonal scenarios, with difficult extension to more complex nonorthogonal environments.

In [31] and [32], Sim *et al. * describe a dense visual SLAM system using Rao–Blackwellized particle filters and SIFT features (a similar effort in using Rao–Blackwellized particle filters and SIFT features for visual SLAM was reported in [15]). Visual odometry [structure from motion (SFM)] is used to generate proposals for the sensor motion and global pose estimation algorithms for loop closing. This system works in either monocular or stereo mode, with cameras mounted on a robot moving in 2-D;sensor trajectories with 6 DOF will require large amounts of particles for their representation. In [33], the authors also compare the advantages of separate monocular and stereo approaches in traditional SLAM frameworks.

In this paper, we show the advantages of being able to accommodate both monocular and stereo information in carrying out a 6-DOF SLAM with a handheld camera. In the works of Sola *et al. * [34] and Lemaire *et al. * [20], it is also pointed out that combining visual information at close range as well as at infinity should improve the performance of visual SLAM.

Since the initial results of [35], great progress has been made in the related problem of visual odometry [36], [37], [38], [39]. Visual odometry systems have the important advantage of constant time execution. Furthermore, during exploratory trajectories, in which an environment feature is seen for a certain window of time and never more, visual odometry can obtain the same precision in the estimation of the sensor location as a SLAM system, with a great reduction in cost. Unfortunately, visual odometry does not cope with loop closings, and thus, eventual drift in these cases is inevitable. Stereo visual odometry combined with GPS can result in a mapping system that avoids long-term drift [40], [41], but unfortunately GPS is not always available. Improving the precision in sensor location through loop closing is one of the main advantages of SLAM.

An important limitation of current SLAM systems that use the standard EKF algorithm is that when mapping large environments very soon, they face computational as well as consistency problems [42]. Many efforts have been invested in reducing the *O*(*n*^{2}) cost of the EKF updates. In [43], an information filter, the dual of the Kalman filter, was used, allowing constant time updates irrespective of the size of the map. An approximation is carried out to sparsify the information matrix, which may lead to map divergency [44]. The treemap algorithm [45] performs updates in *O*(log *n*) also by forcing information matrix sparseness by weak link breakage. In more complicated trajectories, such as lawn mowing, the cost can be more than log linear [46]. In the smoothing and mapping method [47], the authors observed that the information matrix is exactly sparse when all vehicle locations are considered in the stochastic map, and thus, very efficient techniques can be used to compute the batch solution (a recent incremental version is described in [48]).

All of these algorithms use the information form, and thus, the state and covariance are not readily available. There are alternatives that work on the covariance form, such as the map joining algorithm [49]. It works on a sequence of local maps of limited size, and thus, it can cut down the cost of EKF SLAM considerably, although remaining *O*(*n*^{2}). It has the additional advantage of improving the consistency of the resulting estimation [42]. The divide and conquer algorithm [50] is able to compute the covariance form of the stochastic map in an amortized time linear with the size of the map, improving further the consistency of the solution. However, in these systems, local maps are required to be statistically independent. This requires creating a new local map from scratch every time the current local map size limit has been reached. Consequently, no sharing of valuable information is possible in a 6-DOF visual SLAM, such as the camera velocity, or information about features currently being tracked. This issue has been tackled in a recent work [51] by using the *conditional independence* property.

In this paper, we describe a robust and scalable 6-DOF visual SLAM system that can be carried in hand at normal walking speeds of4–5 km/h, and used to map large indoor and outdoor environments. In Section II, we summarize the main characteristics of our system. In Section III, we describe the details of the visual SLAM system that provides the sequence of conditionally independent (CI) local maps, the basic building blocks of our mapping algorithm. This algorithm, CI divide and conquer (D&C) SLAM, is explained in Section IV. In Section V, we describe the two experiments carried out to test the system, an indoor 200 m loop and an outdoor 140 m loop. In Section VI, we discuss the results obtained, and finally, in Section VII, we draw the main conclusions of our work.

The fundamental characteristics of the system that we describe in this paper are as follows.

Unlike any other visual SLAM system, we consider information from features, both close and far from the cameras. A stereo provides 3-D information from nearby scene points, and each camera can also provide bearing only information from distant scene points. Both types of information are incorporated into the map and used to improve the estimation of both the camera pose and velocity, as well as the map.

Nearby scene points provide scale information through the stereo baseline, eliminating the intrinsic scale unobservability problem of monocular systems.

We use Conditionally Independent Divide and Conquer SLAM algorithm that allows the system to maintain both camera velocity information and current feature information during local map initialization. This adds robustness to the system without sacrificing precision or consistency in any way. Being a D&C algorithm, it also allows linear time execution, enabling the system to be used for large-scale indoor/outdoor SLAM.

Our 6-DOF hardware system consists of a stereo camera carried in hand and a laptop to record and process a sequence of images (see Fig. 1). Since the camera moves in 6 DOF, we define the camera state using 12 variables: camera position in 3-D Cartesian coordinates, camera orientation in Euler angles, and linear and angular velocities. It is known that a stereo camera can provide depth estimation of points up to a certain distance determined by the baseline between left and right cameras. Therefore, two regions can be differentiated: a region close to the cameras and visible by both, in which the stereo behaves as a range and bearing sensor. The second is the region of features far from the cameras or seen by only one, in which the stereo becomes a monocular camera, providing bearing only measurements of such points. To take advantage of both types of information, we combine 3-D points and inverse depth (ID) points (introduced in [52]) in the state vector in order to build a map and estimate the camera trajectory. The system produces sequences of local maps of limited size containing both types of features using an EKF SLAM algorithm. As we detail in Section IV, these local maps are joined into a full map using the CI D&C SLAM algorithm, obtaining as final result a full stochastic map containing all tracked features, and the final and intermediate camera states from each local map. This system is highly scalable: local maps are built in constant time, regardless of the size of the environment, and the CI D&C algorithm requires amortized linear time.

During the feature tracking process, the right image is chosen as reference to initialize new features. Interest points are detected and classified according to their disparity with the left image. Those points whose disparity reveals a close distance are initialized as 3-D features, otherwise they are modeled as ID points and initialized using the bearing information obtained from the right image. When the camera moves, these features are tracked in order to update the filter and produce the corresponding corrections. To track a feature, its position is predicted in both images inside a bounded region given by the uncertainty in the camera motion and the corresponding uncertainty of the feature.

The process to select, initialize, and manage these features is detailed in the next section.

SECTION III

## Visual SLAM System

### A. State Representation

The state vector that represents a local submap **x**_{B} contains the final camera location **x**_{c} and the location of all features **x**_{f1:n} with respect to the map base reference *B*, the initial camera location. Some features are codified using the *ID parametrization* that model points that are at infinity in **x**_{ID}. Additionally, Cartesian *3-D parametrization* is used to represent depth points in **x**_{3D}:
TeX Source
$${\bf x}_B = \left[\matrix{ {\bf x}_c \cr {\bf x}_{f_{1:n}}}\right]= \left[\matrix{{\bf x}_c \cr {\bf x}_{ID} \cr {\bf x}_{3D}}\right].\eqno{\hbox{(1)}}$$

The camera is described by the position of its optical center in Cartesian coordinates **r**, its orientation in Euler angles Ψ, its linear velocity **v**, and its angular velocity **w**. In order to carry out the prediction process, the camera motion follows a constant velocity model with zero mean Gaussian noise in the linear and angular accelerations
TeX Source
$${\bf x}_c = \left[\matrix{{\bf r} \cr \Psi \cr {\bf v} \cr {\bf w} }\right].\eqno{\hbox{(2)}}$$

Image corners classified as depth points are transformed to 3-D points, given the disparity information provided by the stereo pair. Section III-D describes the criterion adopted to select points as depth points. Since the stereo camera provides rectified images, the backprojection equations to obtain a 3-D point are based on a pinhole camera model that relates image points and 3-D points using the following transformation function:
TeX Source
$$\eqalignno{
{\bf x}_{\rm 3D} & = f(u_r, v_r, u_l, v_l)\cr
&= [x, y, z]^T \cr
&= \left[
{b (u_r - u_0) \over d},
{b (v_r - v_0) \over d},
{f b \over d}\right]^T
&\hbox{(3)}
}$$where (*u*_{r},*v*_{r}) and (*u*_{l},*v*_{l}) are the pixels on the right and left images, and *d* = (*u*_{l}−*u*_{r}) is the horizontal disparity. The remainder terms in the equations are the calibrated parameters of the camera, i.e., the central pixel of the image
(*u*_{0}, *v*_{0}), the baseline *b*, and the focal length *f.*

Given the camera location **x**_{ci}, an ID point is defined in [52] as
TeX Source
$${\bf x}_{\rm ID} = \left[\matrix{{\bf r}_i \cr\theta_i \cr\phi_i \cr\rho_i}\right].\eqno{\hbox{(4)}}$$This vector depends on the optical center **r**_{i} of the camera from which the feature was first observed, the direction of the ray passing through the image point (i.e., azimuth θ_{i}, elevation φ_{i}), and the inverse of its depth, ρ_{i} = 1/*d*_{i}.

### B. Selection and Management of Trackable Points

To ensure tracking stability of map features, distinctive points have to be selected. Following a similar idea as the one presented in [53], we use the Shi–Tomasi variation of the Harris corner detector to select good trackable image points and their corresponding 11 × 11 surrounding patch.

From the first step, the right image is split using a regular grid; the point with the best detector response per grid cell is selected (see Fig. 2). At each step, we use only those features that fall in the field of view (FOV) of the camera when they are projected along with their uncertainties on right and left images. Using the patch associated with each feature, a matching search based on normalized cross-correlation is performed inside the projected uncertainty region, as introduced in [24]. During the following steps, those cells that become and remain empty for a given time are monitorized to initialize a new feature when a good point is detected. In this way, features can be uniformly distributed in the image, improving the amount of information gathered from the scene, and therefore the map estimate. The approach is accompanied by a feature management strategy so that nonpersistent features are deleted from the state vector to avoid an unnecessary growth in population.

### C. Measurement Equation

At each step, we apply the active search process described before such that, for each projected feature in the stereo image, a match is found after performing normalized cross-correlation. Thus, a new observation **z** given by the matched pixel is used to update the state of the camera and the map.

In the right camera, the equation that defines the relation between the *i* th ID feature **x**_{ID}^{i} and its observation **z**_{ID}^{ri} is given by the following measurement equation:
TeX Source
$$\eqalignno{{\bf z}_{\rm ID}^{r_i}&= h_{\rm ID}^r({\bf x}_c, {\bf x}_{\rm ID}^i) + \upsilon \cr&= {\rm projection}(\ominus {\bf x}_c \oplus {\bf x}_{\rm ID}^i) + \upsilon&\hbox{(5)}}$$where *h*_{ID}^{r} is the function that projects the ID feature to the right camera and υ is a zero mean Gaussian noise with σ_{p} standard deviation that represents the projection error in pixels. Alternatively, we can define the measurement equation that relates the inverse point observation on the left image by
TeX Source
$$\eqalignno{{\bf z}_{\rm ID}^{l_i}&= h_{\rm ID}^l({\bf x}_c, {\bf x}_{\rm ID}^i) + \upsilon \cr&= {\rm projection}(\ominus {\bf x}_c \oplus {\bf x}_{c_rc_l}\oplus {\bf x}_{\rm ID}^i) + \upsilon&\hbox{(6)}}$$where the displacement of the left camera optical center with respect to the right camera is given by the rigid transformation **x**_{cr cl} = [0 *b* 0]^{T}.

In a similar way, we describe observations corresponding to 3-D map features in the right and left cameras as
TeX Source
$$\eqalignno{{\bf z}_{\rm 3D}^{r_i} &= h_{\rm 3D}^r({\bf x}_c, {\bf x}_{\rm 3D}^i) + \upsilon \cr&= {\rm projection}(\ominus {\bf x}_c \oplus {\bf x}_{\rm 3D}^i) + \upsilon \cr{\bf z}_{\rm 3D}^{l_i} &= h_{\rm 3D}^l({\bf x}_c, {\bf x}_{\rm 3D}^i) \cr&= {\rm projection}(\ominus {\bf x}_c \oplus {\bf x}_{c_rc_l}\oplus {\bf x}_{\rm ID}^i) + \upsilon.}$$Note that we use ⊕ and ⊖ operators in order to denote the corresponding compositions and inversions of transformations. They represent different transformations depending on the kind of parametrization used to express a feature. In [49], the definitions for 2-D transformations were introduced, dealing mainly with point features and line features. In [54], the operations have been extended for 3-D ID and depth points. Details of the calculation of the corresponding Jacobians to propagate the uncertainties correctly can also be found in [54].

Fig. 2 shows the prediction of these 3-D ID features that fall inside the FOV of each of the cameras. A good advantage of using a stereo camera is that although a feature can disappear from the FOV of one camera, information to update the state is available if the feature can be still found in the other. As it will be shown in the experiments, this fact is of extreme importance when the camera rotates or turns around a corner, since features escape very fast from the FOV of a single camera, making the estimation of the camera location in these moments very weak.

### D. Depth Points Versus ID Points

Current research on monocular SLAM has shown that the ID parametrization is suitable to represent the distribution of features at infinity as well as close points, allowing to perform an undelayed initialization of features. Despite its properties, each ID point needs an overparametrization of six values instead of a simpler three coordinates spatial representation [55]. This produces a computational overhead in the EKF. Working with a stereo camera, which can estimate the depth of points close to the camera, raises the subtle question of when a feature should be initialized using a 3-D or an ID representation.

In order to clarify this issue, we have designed a simulated experiment to study the effect of the linearization in both representations when a point is initialized using the stereo information. In this simulated experiment, the variance of the pixel noise (σ_{p} = 1 pixel) and the actual intrinsic parameters of the stereo camera used, such as the baseline, are taken into account to implement the simulation. The experimental setup consists of a stereo pair where the left camera is located at the origin of the reference frame, with its principal axis pointing along *Z* and the *X* axes pointing to the right. The right camera is at *b* = 12 cm in *X.* We consider a point that is in the middle between both cameras at different distances in *Z.*Given a noisy pixel observation, the uncertainty region of a reconstructed point is sampled and plotted in Fig. 3 for three different point distances: 5,10, and 15 m. The uncertainty region of the 3-D representation, which is calculated using a linearization of (3) and evaluated in the ground truth, is represented by the dark red ellipse. The corresponding uncertainty region of the linearized ID representation is bounded by the light gray lines in the plot. Notice that the ID parametrization models very accurately the real uncertainty for the studied distances. However, although the dark ellipse covers the real distribution at 5 m quite accurately, for longer distances, the ellipse overestimates the uncertainty in the region close to the cameras and is overconfident for far distances.

This empirical analysis suggests choosing a threshold of 5 m. A point closer than 5 m is initialized using a 3-D representation, a more distant point is parameterized as an ID point.

ID features can be transitioned to 3-D points, reducing significantly the number of DOF. Conversion requires an analysis of the linearity of the functions that model both depth point and ID point distributions. In [55], this issue is considered by using a linearity index. Such analysis makes it possible to decide when an inverse point distribution is well approximated with the overparameterized coding. Switching from ID to depth depends on a linearity threshold derived from the analysis.

D&C SLAM has proved to be a good algorithm in minimizing the computational complexity of EKF-based SLAM and improving consistency of the resulting estimate [50]. The algorithm allows us to efficiently join several local maps into a single state vector using map joining in a hierarchical tree structure (see Fig. 4). Local maps can be obtained in constant time, regardless of the size of the environment, and the map joining operations can be performed in an amortized linear time. The D&C SLAM algorithm was, however, conceived for statistically independent sequences of local maps. This requires creating a new local map from scratch every time the current local map size limit has been reached. Consequently, it is not possible to share valuable information in a 6-DOF visual SLAM, such as the camera velocity, or information about features currently being tracked.

In this section, we describe the *CI* D&C SLAM algorithm, which is able to work with maps that are not statistically independent, but rather *conditionally independent*, and thus, allow sharing of the valuable information with no increment in computational cost or loss of precision whatsoever.

### A. CI Local Maps

In visual SLAM, it can be very useful to share some state vector components between consecutive submaps: some camera states, such as linear and angular velocities, as well as features that are in the transition region between adjacent submaps and are currently being tracked. This allows us to improve the estimate of relative location between the submaps and continue tracking the observed features with no interruptions. Nevertheless, special care is needed to join the submaps in a single map since their estimates are not independent anymore.

The novel technique to achieve these requirements is based on the concept of CI local maps presented in [51]. Here, we present a brief summary of the technique.

Suppose that a local map 1 has been built and we want to start a new submap 2 not from scratch, but sharing some elements in common with 1. Submap 1 is described by the following probability density function:
TeX Source
$$p({\bf x}_A,{\bf x}_C\vert {\bf z}_a) = {\cal N} \left(\left[\matrix{\hat{{\bf x}}_{A_a} \cr\hat{{\bf x}}_{C_a} \cr}\right],\left[\matrix{P_{A_a} & P_{{ AC}_a} \cr P_{{CA}_a} & P_{C_a} \cr}\right]\right)\eqno{\hbox{(7)}}$$where **x**_{A} are the components of the current submap that only belong to map 1, **x**_{C} are the elements that will be shared with map 2, and **z**_{a} the observations gathered during the map construction. Notice that upper case subindices are for state vector components whereas lower case subindices describe which observations **z** have been used to obtain the estimate.

Submap 2 is then initialized with the result of marginalizing out the noncommon elements from submap 1:
TeX Source
$$p({\bf x}_C\vert {\bf z}_a) = \int p({\bf x}_A,{\bf x}_C\vert {\bf z}_a)\, d{\bf x}_A = {\cal N}(\hat{{\bf x}}_{C_a}, P_{C_a}).\eqno{\hbox{(8)}}$$During the trajectory along map 2, new observations **z**_{b} are gathered about the common components **x**_{C} as well as observations of new elements **x**_{B} that are incorporated into the map. When map 2 is finished, its estimate is finally described by
TeX Source
$$p({\bf x}_C,{\bf x}_B\vert {\bf z}_a,{\bf z}_b)={\cal N} \left(\left[\matrix{\hat{{\bf x}}_{C_{ab}} \cr\hat{{\bf x}}_{B_{ab}} \cr}\right],\left[\matrix{P_{C_{ab}} & P_{CB_{ab}} \cr P_{BC_{ab}} & P_{B_{ab}} \cr}\right]\right)\eqno{\hbox{(9)}}$$where the subindices in the estimates and reveal that both sets of observations **z**_{a} and **z**_{b} have been used in the estimation process. This means that submap 2 is updated with all the information gathered by the sensor. But observe that map 1 in (7) has been updated with the observation **z**_{a} but not with the more recent observations **z**_{b}.

Fig. 5 shows a Bayesian network that describes the probabilistic dependencies between elements of submaps 1 and 2. As it can be seen, the only connection between the set of nodes (**x**_{A}, **z**_{a}) and (**x**_{B}, **z**_{b}) is through node **x**_{C}, i.e. , both subgraphs are *d-separated* given **x**_{C}[56]. This implies that nodes **x**_{A} and **z**_{a} are*CI* of nodes **x**_{B} and **z**_{b}given node **x**_{C}. Intuitively, this means that if **x**_{C} is known, submaps 1 and 2 do not carry any additional information about each other.

### B. CI Map Joining

Consider two consecutive CI local maps. We are interested in joining the maps into a single stochastic map described by
TeX Source
$$\eqalignno{& {p({\bf x}_A,{\bf x}_B,{\bf x}_C\vert {\bf z}_a,{\bf z}_b)} \cr&\quad = {\cal N} \left(\left[\matrix{\hat{{\bf x}}_{A_{ab}} \cr\hat{{\bf x}}_{C_{ab}} \cr\hat{{\bf x}}_{B_{ab}} \cr}\right],\left[\matrix{P_{A_{ab}} & P_{AC_{ab}} & P_{AB_{ab}} \cr P_{CA_{ab}} & P_{C_{ab}} & P_{CB_{ab}} \cr P_{BA_{ab}} & P_{BC_{ab}} & P_{B_{ab}} \cr}\right]\right). \qquad&\hbox{(10)}}$$Taking into account the submap conditional independence property, it can be demonstrated [51] that the optimal map result of the joining can be computed using
TeX Source
$$\eqalignno{K &= P_{{AC}_a}P_{C_a}^{-1} \cr&= P_{{AC}_{ab}}P_{C_{ab}}^{-1}&\hbox{(11)}\cr\hat{{\bf x}}_{A_{ab}}&= \hat{{\bf x}}_{A_a}+ K(\hat{{\bf x}}_{C_{ab}}-\hat{{\bf x}}_{C_a})&\hbox{(12)}\cr P_{A_{ab}} &= P_{A_a} + K(P_{{CA}_{ab}} - P_{{CA}_a})&\hbox{(13)}\cr P_{{AC}_{ab}} &= K P_{C_{ab}}&\hbox{(14)}\cr P_{AB_{ab}}&= K P_{CB_{ab}}.&\hbox{(15)}}$$Using this technique, we can build local maps that have elements in common, and then retrieve the global information in a consistent manner. After the joining, the elements belonging to the second map are transformed to the base reference of the first map.

### C. Actual Implementation for Stereo

The D&C SLAM algorithm of [50] can be adapted to work with conditional independent local maps simply by using the CI map joining operation described before. As we mentioned before, since the camera moves in 6 DOF, the camera state is composed of its position using 3-D Cartesian coordinates, the orientation in Euler angles, and its linear and angular velocities. 3-D points and ID points are included as features in the state vector. When a local map **m**_{i} is finished, the final map estimate is given by
TeX Source
$${\bf m}_i.\hat{{\bf x}} = \left[\matrix{\hat{{\bf x}}_{R_i R_j} \cr\hat{{\bf v}}_{R_i R_j} \cr\hat{{\bf x}}_{R_i F_{1:m}} \cr\hat{{\bf x}}_{R_i F_{m+1:n}} \cr}\right]\eqno{\hbox{(16)}}$$where is the final camera location *R*_{j} with respect to the initial one, *R*_{i} and are the linear and angular velocities, are 3-D and ID features that will only remain in the current map, and are 3-D and ID features that will be shared with the next submap **m**_{j}.

Since the current camera velocity > and some features are used to initialize the next local map, these elements have to be computed with respect to the base reference of the second map *R*_{j}:
TeX Source
$${\bf m}_i.\hat{{\bf x}} = \left[\matrix{\hat{{\bf x}}_{R_i R_j} \cr\hat{{\bf v}}_{R_i R_j} \cr\hat{{\bf x}}_{R_i F_{1:m}} \cr\hat{{\bf x}}_{R_i F_{m+1:n}} \cr\cdots \cr\ominus\hat{{\bf x}}_{R_i R_j}\oplus\hat{{\bf v}}_{R_i R_j} \cr\ominus\hat{{\bf x}}_{R_i R_j}\oplus\hat{{\bf x}}_{R_i F_{m+1:n}}}\right]= \left[\matrix{\hat{{\bf x}}_{A_a} \cr\cdots \cr\hat{{\bf x}}_{C_a} \cr}\right]\eqno{\hbox{(17)}}$$where the new elements define the common part and the original map defines . Notice that the appropriate composition operation has to be applied for each transformed component and that the corresponding covariance elements have to be added to the map.

In local mapping, a base reference has to be identified to start a new map. This common reference is represented by the final vehicle position, which is the case of *R*_{j} between **m**_{i} and **m**_{j}.

The initial state vector of the next submap is then given by
TeX Source
$${\bf m}_{j}.\hat{{\bf x}} = \left[\matrix{\hat{{\bf x}}_{R_j R_j} \cr\cr\ominus\hat{{\bf x}}_{R_i R_j}\oplus\hat{{\bf v}}_{R_i R_j} \cr\cr\ominus\hat{{\bf x}}_{R_i R_j}\oplus\hat{{\bf v}}_{R_i R_j} \cr\cr\ominus\hat{{\bf x}}_{R_i R_j}\oplus\hat{{\bf x}}_{R_i F_{m+1:n}}}\right]\eqno{\hbox{(18)}}$$where represents the location of the camera in the new reference frame with initial zero uncertainty and zero correlation with the rest of the elements of the initial map. Notice that the initial velocity brought from the previous map has been replicated twice. One of the copies will change as the camera moves through the new map carrying the current camera velocity. The other copy will remain fixed and, together with the transformed features, will be the common elements with the previous map. The same process is successively repeated with all local maps.

### D. Continuous Data Association in Each Local Map

Recent work on large environments [19] has shown that the joint compatibility test [57] helps avoiding map corruption in the visual SLAM by rejecting measurements that come from moving objects. This framework is suitable in environments with a limited number of observations. However, a branch and bound algorithm implementation of(*JCBB*) has limited use when the number of observations per step is large. In this paper, we have obtained more efficient results using the *randomized joint compatibility* version *RJC* proposed in [50], in which, in the spirit of RANSAC, a *joint compatibility* *(JC)* test is run with a fixed set of *p* randomly selected measurements. In this case, correlation between patches and individual χ^{2} tests is used to obtain candidate matches. If all *p* measurements and their matches are jointly compatible, we apply the nearest neighbor rule to match the remaining measurements. Once a full hypothesis *H* is obtained, we check *JC* to avoid false positives. The process is repeated *t* times with adaptive RANSAC, limiting the probability of missing a correct association.

### E. Map Matching

The property of sharing common elements solves the data association problem between consecutive local maps [50]. This requires us to solve data association only in loop closing situations. We use the map matching algorithm of [19] in order to detect a previously visited area. The algorithm finds correspondences between features in different local maps, taking into account the texture and the relative geometry between the features. If sufficient corresponding features are found, an ideal measurement equation that imposes the loop closing constraint is applied in the final map.

SECTION V

## Experiments in Urban Outdoor and Indoor Environments

In order to demonstrate the robustness and scalability of the visual SLAM system that we propose, we have gathered two 320 × 240 image sequences with a stereo system (see Fig. 1). The system provides a65 × 50 degree FOV per camera, and has a baseline of12 cm, limiting the 3-D point features initialization up to a distance close to 5 m.

An indoor loop (at 48 fps) and an urban outdoor (at 25 fps) loop sequences were captured carrying the camera in hand, at normal walking speeds of 4–5 km/h. Both sequences were processed in MATLAB with the proposed algorithms on a desktop computer with an Intel 4 processor at 2.4 GHz. The higher frame rate for the indoor experiment helps in reducing the probability of mismatches given that the environment includes brick walls providing ambiguous texture information.

The outdoor sequence is composed of 3441 stereo pairs gathered in a public square of our home town (see Fig. 6 top row). The full trajectory is approximately 140 m long from the initial camera position. Fig. 6, left column, shows the sequence of conditional independent local maps obtained with the technique described in Section IV-A. Each map contains 100features combining ID and 3-D points. The total number of maps built during the stereo sequence is 11. The result of D&C without applying the loop closing constraint is shown in Fig. 6, middle column. As it can be observed, the precision of the map obtained is good enough to almost align the first and last submaps after all the trajectory has been traversed, even without applying loop closing constraints. Fig. 6, right column, presents the final result after closing the loop.

The second experiment was carried out inside one of our campus buildings in a walk of approximately 210 m (see Fig. 6, bottom row). The same process was run in order to obtain a full map from 8135 stereo pairs. This environment has a particular degree of difficulty due to ambiguous texture and the presence of extensive zones of glass windows such as offices, corridors, and cafeterias. This can be noticed in the long distance points estimated in some of the maps, which are actually inside offices and the cafeteria (see Fig. 6, left column). The result of CI D&C is shown in Fig. 6, middle column, and the final result after loop closing is shown in Fig. 6, right column.

Our 6-DOF SLAM system, even implemented in MATLAB, does not exceed 2s per step, which is the worst case when building CI local maps. Fig. 7 shows how the system running time remains constant in most of the steps. Moreover, time peaks that appear when CI D&C takes place are below 8 s for the square experiment and14 s for the indoor experiment, which are the maximum times required in the last step.

Using the Google Earth tool, we can see that the map scale obtained and the trajectory followed by the camera is very close to the real scale. Fig. 9 illustrates comparative results. We loaded the MATLAB figure in Google Earth and set the scale parameter to the real scale. Given that we had neither GPS nor compass measurements for the initial locations of the camera that are the base reference of each map, the position and orientation of the figure over the map were adjusted by hand. It can be noticed that angles between the square sides and the shape of the walls of the surrounding environment have been captured with precision.

As presented in Section I, several works have demonstrated successful visual SLAM systems in small environments using monocular or stereo cameras. There are several important factors that limit the extension of these results to large-scale environments.

First, the computational complexity and consistency of the underlying SLAM technique. In this paper, we have presented a novel algorithm that builds CI local maps in constant time and combines them in an optimal way in amortized linear time. Although the experiments presented here were processed in MATLAB, we expect that the extension to stereo of our current real-time implementation [19] will be able to build local maps up to 100features in real time, with updates at 25 Hz. The D&C map joining, loop detection, and loop closing can be implemented on a separate thread, taking advantage of current multiple core processors.

In the case of monocular SLAM, another important limiting factor is the intrinsic unobservability of the scale. This problem can be addressed using additional sensors such as the vehicle odometry, GPS, or inertial units. When they are not available, the scale can be initialized using some *a priori* knowledge about the environment such as the size of a known object visible at the start [16] or the initial speed of the camera. However, in large environments, unless scale information is injected on the system periodically, the scale of the map can slowly drift (see, for example, the experiments in [19]). Another critical issue appears when the scene is mostly planar and perpendicular to the optical axis. In this situation, with a monocular camera, it is very difficult to distinguish between camera translation and rotation, unless a wide FOV is used.

To illustrate these difficulties, we have processed our indoor and outdoor experiments using only the information from the right camera. As we are now using a bearing only system, all the features are initialized using the ID representation. To bootstrap the system, we have introduced a initial estimated speed for the camera of 1 m/s. Apart from that, our visual SLAM algorithm remains unchanged. The resulting maps are represented in the left column of Fig. 8. As it can be seen, the scale obtained by the system drifts (compare the beginning of the loop with the end). Also, in the outdoor experiment, at a certain point, the system misinterprets the camera translation as a rotation, and the map gets corrupted. Here, we are using a camera with FOV of 65°. The results obtained in the same environment with an FOV of 90° are significantly more robust [51]. In the indoor experiment with a monocular camera, as the objects are much closer to the camera, most of the features disappear fast from the FOV when the camera turns, leading to a bad estimation of its position and consequently divergence in the map estimate.

We have also processed the sequences with our SLAM algorithm using conventional stereo, i.e., changed to initialize all the features whose disparity is larger than one pixel as 3-D points. Features without disparity are discarded because its depth cannot be computed by stereo. The immediate benefit is that the true environment scale is observable and the map corruption disappears (Fig. 8, middle column). However, for points that are more than 10 m away from the camera, a Gaussian in *xyz* is a bad approximation for its true uncertainty. This is the reason for the map deformation that is clearly visible in the lower part of the outdoor experiment, where many features are at about 20 m from the camera.

The proposed system (Fig. 8, right column)combines the advantages of stereo and bearing only vision. On the one hand, the true scale is precisely obtained due to the 3-D information obtained by the stereo camera from close point features. On the other hand, the region with useful point features extends up to infinity due to the ID representation developed for bearing-only SLAM. The depth of the features that are far from the camera can be precisely recovered by the system if they are seen from viewpoints that are separated enough. In that case, they can be upgraded to 3-D points for better efficiency [55]. Otherwise, they remain as ID points and still provide very valuable orientation information that improves map precision and keeps the SLAM system stable when few close features are observed.

In this paper, we have shown that 6-DOF visual mapping of large environments can be efficiently and accurately carried out using a stereo camera as the only sensor. One of the contributions of the paper is that information from features nearby and far from the cameras can be simultaneously incorporated to represent the 3-D structure more precisely. Using close points provides scale information through the stereo baseline avoiding “scale-drift,”while ID points are useful to obtain angular information from distant scene points.

Another contribution of the paper is the combination of two recent local mapping techniques to improve consistency and reduce complexity in the SLAM process. Using CI local maps [51], our system is able to properly share information related to the camera motion model and common features between consecutive maps. Smoother transitions from map to map are achieved as well as better relative locations between local maps. By means of the simplicity and efficiency of the CI D&C SLAM algorithm, we can recover the full map very efficiently. The combination of both techniques adds robustness to the process without sacrificing precision.

In [50], we describe the performance of D&C SLAM when the vehicle carries out different types of trajectories. For some trajectories, the cost of map joining can increase at some steps, depending of the size of the overlap between the maps to be joined:doing exploration, the overlap is constant and the cost of map joining is small, when completing a loop traversal for a second time the overlap between the maps is total and the cost of joining will be much higher. Although we are able to close large indoor and outdoor loops, the algorithm used for loop closing strongly depends on detecting sets of features already stored in the map when the same area is revisited. It would be interesting to analyze other types of algorithms for loop closing, for instance, the image to map algorithm proposed in [58].

Moreover, as we assume smooth motions, the relocation algorithm presented in [58] would enable the system to avoid failures in case of jitter.

There is also a restriction of the system to estimate pitch orientation due to the use of Euler angles. A combined solution using quaternions can mitigate the problem. This will be part of our future research.

Apart from upward looking cameras and jitter, there are no limitations to manoeuver the camera freely: it can be used in environments that include stairs and other terrain accidents. This kind of experiment will be part of the evaluation process for future work.

We will also focus on comparing our system with other stereo vision techniques such as visual odometry. We are very interested in studying the fusion of the stereo camera with other sensors like GPS or inertial systems in order to compare the precision obtained. We will consider other types of feature detectors as well, and their effect in the final result.

### Acknowledgment

The authors would like to thank J. M. M. Montiel and J. Civera for the fruitful discussions.