• Abstract

# A Minimalistic Approach to Appearance-Based Visual SLAM

This paper presents a vision-based approach to simultaneous localization and mapping (SLAM) in indoor/outdoor environments with minimalistic sensing and computational requirements. The approach is based on a graph representation of robot poses, using a relaxation algorithm to obtain a globally consistent map. Each link corresponds to a relative measurement of the spatial relation between the two nodes it connects. The links describe the likelihood distribution of the relative pose as a Gaussian distribution. To estimate the covariance matrix for links obtained from an omnidirectional vision sensor, a novel method is introduced based on the relative similarity of neighboring images. This new method does not require the determination of distances to image features using multiple-view geometry, for example. Combined indoor and outdoor experiments demonstrate that the approach can handle different environments (without modification of the parameters), and it can cope with violations of the “flat floor assumption” to some degree and scales well with increasing size of the environment, producing topologically correct and geometrically accurate maps at low computational cost. Further experiments demonstrate that the approach is also suitable for combining multiple overlapping maps, e.g., for solving the multirobot SLAM problem with unknown initial poses.

SECTION I

## Introduction

THIS PAPER presents a new vision-based approach to the problem of simultaneous localization and mapping (SLAM). Especially compared with SLAM approaches using a 2-D laser scanner, the rich information provided by a vision-based approach about a substantial part of the environment allows for dealing with high levels of occlusion [1] and enables solutions that do not rely strictly on a flat floor assumption. Cameras can also offer a longer range and are therefore advantageous in environments that contain large open spaces.

The proposed method is called “mini-SLAM” since it is minimalistic in several ways. On the hardware side, it relies solely on odometry and an omnidirectional camera as the external source of information. This allows for less expensive systems compared to methods that use 2-D or 3-D laser scanners. Note that the robot used for the experiments was also equipped with a 2-D laser scanner. This laser scanner, however, was not used in the SLAM algorithm but only to visualize the consistency of the created maps.

Apart from the frugal hardware requirements, the method is also minimalistic in its computational demands. Map estimation is performed online by a linear-time SLAM algorithm on an efficient graph representation. The main difference to other vision-based SLAM approaches is that there is no estimate of the positions of a set of landmarks involved, enabling the algorithm to scale up better with the size of the environment. Instead, a measure of image similarity is used to estimate the relative pose between corresponding images (“visual relations”) and the uncertainty of this estimate. Given these “visual relations” and relative pose estimates between consecutive images obtained from the odometry of the robot (“odometry relations”), the multilevel relaxation algorithm [2] is then used to determine the maximum likelihood estimate of all image poses. The relations are expressed as a relative pose estimate and the corresponding covariance. A key insight is that the estimate of the relative pose in the “visual relations” does not need to be very accurate as long as the corresponding covariance is modeled appropriately. This is because the relative pose is used only as an initial estimate that the multilevel relaxation algorithm can adjust according to the covariance of the relation. Therefore, even with fairly imprecise initial estimates of the relative poses, it is possible to geometrically build accurate maps using the geometric information in the covariance of the relative pose estimates. Mini-SLAM was found to produce consistent maps in various environments, including, for example, a dataset of an environment containing indoor and outdoor passages (path length of 1.4 km) and an indoor dataset covering five floor levels of a department building.

Further to our previously published work [3], we extended the mini-SLAM approach to the multirobot SLAM problem, demonstrating its ability to combine multiple overlapping maps with unknown initial poses. We also provide an evaluation of the robustness of the suggested approach with respect to poor odometry or a less reliable measure of visual similarity.

### A. Related Work

Using a camera as the external source of information in SLAM has received increasing attention during the past years. Many approaches extract landmarks using local features in the images and track the positions of these landmarks. As the feature descriptor, Lowe's scale-invariant feature transform (SIFT) [4] has been used widely [5], [6]. An initial estimate of the relative pose change is often obtained from odometry [6], [7], [8], or where multiple cameras are available as in [9] and [10], multiple-view geometry can be applied to obtain depth estimates of the extracted features. To update and maintain visual landmarks, extended Kalman filters (EKFs) [7], [11] and Rao-Blackwellized particle filters (RBPFs) [6], [9] have been used. In the visual SLAM method proposed in [11], particle filters were utilized to obtain the depth of landmarks, while the landmark positions were updated with an EKF. Initial landmark positions had to be provided by the user. A similar approach described in [8] applies a converse methodology. The landmark positions were estimated with a KF, and a particle filter was used to estimate the path.

Due to their suitability for addressing the correspondence problem, vision-based systems have been applied as an addition to laser-scanning-based SLAM approaches for detecting loop closure. The principle has been applied to SLAM systems based on a 2-D laser scanner [12] and a 3-D laser scanner [13].

In the approach proposed in this paper, the SLAM optimization problem is solved at the graph level with the multilevel relaxation (MLR) method of Frese and Duckett [2]. This method could be replaced by alternative graph-based SLAM methods, for example, the online method proposed by Grisetti et al. [14] based on the stochastic gradient descent method proposed by Olson et al. [15].

The rest of this paper is structured as follows. Section II describes the proposed SLAM approach. Then, the experimental setup is detailed, and the results are presented in Section III. The paper ends with conclusions and suggestions for future work (Section IV).

SECTION II

## Mini-SLAM

### A. Multilevel Relaxation

The SLAM optimization problem is solved at the graph level with the MLR method of Frese and Duckett [2]. A map is represented as a set of nodes connected in a graph structure. An example is shown in Fig. 1. Each node corresponds to the robot pose at a particular time and each link to a relative measurement of the spatial relation between the two nodes it connects. A node is created for each omni-image in this paper, and the terms node and frame are used interchangeably in this paper.

The MLR algorithm can be briefly explained as follows. The input is a set R of m = | R relations on n planar frames (i.e., a 2-D representation is used). Each relation rR describes the likelihood distribution of the pose of frame a relative to frame b. Relations are modeled as a Gaussian distribution with mean μr and covariance Cr. The output of the MLR algorithm is the maximum likelihood estimation vector for the poses of all the frames. Thus, a globally consistent set of Cartesian coordinates is obtained for the nodes of the graph based on local (relative) and inconsistent (noisy) measurements, by maximizing the total likelihood of all measurements.

Fig. 1. Graph representation used. The figure shows frames (nodes) and relations (edges) of both the odometry ro and the visual relations rv. Visual relations are indicated with dotted lines. Each frame a contains a reference to a set of features Fa extracted from an omnidirectional image Ia, an odometry pose xoa, a covariance estimate of the odometry pose Cxoa, the estimated pose , and an estimate of its covariance . Fig. 2 shows images corresponding to the region represented by the graph in this figure.

### B. Odometry Relations

The mini-SLAM approach is based on two principles. First, odometry is sufficiently accurate if the distance traveled is short. Second, by using visual matching, correspondence between robot poses can be detected reliably even though the search region around the current pose estimate is large. Accordingly, two different types of relations are created in the MLR graph, relations based on odometry ro and relations based on visual similarity rv.

Odometry relations ro are created between successive frames. The relative pose μro is obtained directly from the odometry readings, and the covariance Cro is estimated using the motion model suggested in [16] as TeX Source $$C_{r_o} = \left[\matrix{d^2\delta^2_{X_d} + t^2\delta^2_{X_t} & 0 & 0 \cr0 & d^2\delta^2_{Y_d} + t^2\delta^2_{Y_t} & 0 \cr0 & 0 & d^2\delta^2_{\theta_d} + t^2\delta^2_{\theta_t}} \right]\eqno{\hbox{(1)}}$$where d and t are the total distance traveled and total angle rotated between two successive frames. The δX parameters relate to the forward motion, the δY parameters to the side motion, and the δ_θ parameters to the rotation of the robot. The six δ-parameters adjust the influence of the distance d and rotation t in the calculation of the covariance matrix. They were tuned manually once, and then kept constant throughout the experiments.

### C. Visual Similarity Relations

#### Similarity Measure

Given two images Ia and Ib, features are first extracted using the SIFT algorithm [4]. This results in two sets of features Fa and Fb for frame a and b. Each feature F = [x,y],H comprises the pixel position [x,y] and a histogram H containing the SIFT descriptor. The similarity measure Sa,b is based on the number of features that match between Fa and Fb.

The feature matching algorithm calculates the Euclidean distance between each feature in image Ia and all the features in image Ib. A potential match is found if the smallest distance is less than 60% of the second smallest distance. This criterion was found empirically and was also used in [17]. It guarantees that interest point matches are substantially better than all other possible matches. We also do not allow features to be matched against more than one other feature. If a feature has more than one candidate match, the match that has the lowest Euclidean distance among the candidates is selected. Examples of matched features are shown in Fig. 2.

Fig. 2. Examples of (top) loop closure detection outdoors and (bottom) indoors. In the outdoor example, the distance to the extracted features is larger than in the indoor example. (Left) Feature matches at the peak of the similarity value, (top) S678,758 = 0.728 and (bottom) S7,360 = 0.322. (Middle) Feature matches two steps (equivalent to ∼3-m distance) away, (top) S680,758 = 0.286 and (bottom) S9, 360 = 0.076. The pose standard deviation σxrv = σyrv was estimated as (top) 2.06 m and (bottom) 1.09 m, respectively, and the mean dμ as (top) 0.199 m and (bottom) −0.534 m. (Right) Evolution of the similarity measure S against the distance travelled (obtained from odometry) together with the fitted Gaussian.

The matching step results in a set of feature pairs Pa,b with a total number Ma,b of matched pairs. Since the number of features varies heavily depending on the image content, the number of matches is normalized to Sa,b∊ [0,1] as TeX Source $$S_{a,b} = {M_{a,b}\over {1/2}(n_{F_a} + n_{F_b})}\eqno{\hbox{(2)}}$$where nFa and nFb are the number of features in Fa and Fb, respectively. A high-similarity measure indicates a perceptually similar position.

#### Estimation of the Relative Rotation and Variance

The relative rotation between two panoramic images Ia and Ib can be estimated directly from the horizontal displacement of the matched feature pairs Pa,b. If the flat floor assumption is violated, this will be only an approximation. Here, the relative rotations θp for all matched pairs pPa,b are sorted into a 10-bin histogram and the relative rotation estimate μθrv is determined as the maximum of a parabola fitted to the largest bin and its left and right neighbor (see Fig. 3).

Fig. 3. Relative orientation histogram from two omnidirectional images taken 2 m apart. The dotted line marks the relative orientation estimate μθrv.

To evaluate the accuracy of relative rotation estimates θp, we collected panoramic images in an indoor laboratory environment and computed the relative orientation with respect to a reference image I0. Panoramic images were recorded at a translational distance of 0.5, 1.0, and 2.0 m to the reference image I0. The ground truth rotation was obtained by manually measuring the displacement of corresponding pixels in areas along the displacement of the camera. The results in Table I demonstrate the good accuracy obtained. Even at a displacement of 2 m the mean error is only 7.15°.

TABLE I Errors of Relative Rotation θ Estimate in Radians

The rotation variance σθrv2 is estimated by the sum of squared differences between the estimate of the relative rotation μθrv and the relative rotation of the matched pairs Pa,b TeX Source $$\sigma_{\theta^{r_v}}^2 = {1\over M_{a,b}-1}\sum_{p\in P_{a,b}}(\mu_{\theta}^{r_v} - \theta_{p})^2.\eqno{\hbox{(3)}}$$To increase the robustness toward outliers, a 10% Winsorized mean is applied. For the evalutated data, this had only a minor effect on the results compared to using an untruncated mean.

#### Estimation of Relative Position and Covariance

The mini-SLAM approach does not attempt to determine the position of the detected features. Therefore, the relative position between two frames a and b cannot be determined very accurately. Instead, we use only image similarity of the surrounding images to estimate [μxrv, μyrv], as described shortly. It would be possible to estimate the relative position using multiple-view geometry, but this would introduce additional complexity that we want to avoid.

Instead, geometric information is obtained from an estimate of the covariance of the relative position between a current frame b and a previously recorded frame a. This covariance estimate is computed using only the similarity measures S of frame b with a and the neighboring frames of a.

The number of matched features between successive frames will vary depending on the physical distance to the features (see Figs. 2 and 4). Consider, for example, a robot located in an empty car park where the physical distance to the features is large, and therefore, the appearance of the environment does not change quickly if the robot is moved a certain distance. If, on the other hand, the robot is located in a narrow corridor where the physical distance to the extracted features is small, the number of feature matches in successive frames tends to be smaller if the robot was moved the same distance.

Fig. 4. (Left) Physical distance to the features will influence the number of features that can be identified from different poses of the robot. The filled squares represent features that could be matched in all three robot poses, while the unfilled squares represent the features for which correspondences could not be found from all poses. The left wall in the figure is closer to the robot. Thus, due to the faster change in appearance, the number of features of the left wall, which can be matched over successive images, tends to be less compared to the number of matched features of the right wall. (Right) Outdoor robot used in this paper, equipped with a Canon EOS 350D camera and a panoramic lens from 0-360.com, which were used to collect the data, a DGPS unit to determine ground truth positions, and an SICK LMS 200 scanner used for visualization and for obtaining ground truth.

The covariance of the robot pose estimate [x,y] TeX Source $$C_{r_v} = \left[\matrix{\sigma_{x^{r_v}}^2 & \sigma_{x^{r_v}}\sigma_{y^{r_v}} \cr\sigma_{x^{r_v}}\sigma_{y^{r_v}} & \sigma_{y^{r_v}}^2}\right]\eqno{\hbox{(4)}}$$is computed based on how the similarity measure varies over the set N(a), which contains frame a and its neighboring frames. The analyzed sequence of similarity measures is indicated in the zoomed-in visualization of a similarity matrix shown in Fig. 5. In order to avoid issues estimating the covariance orthogonal to the path of the robot if the robot was driven along a straight path, the covariance matrix is simplified by setting σxrv2 = σyrv2 and σxrvσyrv = 0. The remaining covariance parameter is estimated by fitting a 1-D Gaussian to the similarity measures SN(a),b and the distance traveled as obtained from odometry (see Fig. 6). Two parameters are determined from the nonlinear least-squares fitting process: mean (d_μ) and variance (σ[x,y]rv2). The initial estimate of the relative position [μxrvyrv] of a visual relation is calculated as TeX Source \eqalignno{\mu_x^{r_v} & = {\rm cos}(\mu_{\theta}^{r_v})d_\mu& \hbox{(5)}\cr\mu_y^{r_v} & = {\rm sin}(\mu_{\theta}^{r_v})d_\mu& \hbox{(6)}}where d_μ is the calculated mean of the fitted Gaussian and μθ the estimated relative orientation (Section II-C2).

Fig. 5. (Left) Full similarity matrix for the laboratory dataset. Brighter entries indicate a higher similarity measure S. (Right) Zoomed-in image. The left area (enclosed in a blue frame) corresponds to a sequence of similarity measures that gives a larger position covariance than the right sequence (red frame).
Fig. 6. Gaussian fitted to the distance traveled d (as obtained from odometry) and the similarity measures between frame b and the frames of the neighborhood N(a) = {a−2,a−1,a,a+1,a+2}. From the similarity measures, both a relative pose estimate μrv and a covariance estimate Crv are calculated between node a and node b. The orientation and orientation variance are not visualized in this figure.

In the experimental evaluation, the Gaussian was estimated using five consecutive frames. To evaluate whether the evolution of the similarity measure in the vicinity of a visual relation can be reasonably approximated by a Gaussian, the mean error between the five similarity measures and the fitted Gaussian was calculated for the outdoor/indoor dataset (the dataset is described in Section III-A). The results in Table II indicate that the Gaussian represents the evolution of the similarity in a reasonable way. Please note that frame b is recorded at a later time than frame a, meaning that the covariance estimate Crva,b can be calculated directly without any time lag.

TABLE II Statistics of the Error ∊ Between the Gaussian Fit and the Similarity Measures Sa−2,b, …, Sa+2,b for Each Node for Which the Fit was Performed in the Outdoor/Indoor Dataset

#### Selecting Frames to Match

In order to speed up the algorithm and make it more robust to perceptual aliasing (the problem that different regions have similar appearance), only those frames are selected for matching that are likely to be located close to each other.

Consider the current frame b and a previously recorded frame a. If the similarity measure was to be calculated between b and all previously added frames, the number of frames to be compared would increase linearly (see Fig. 7). Instead, frames are only compared if the current frame b is within a search area around the pose estimate of frame a. The size of this search area is computed from the estimated pose covariance.

Fig. 7. Number of similarity calculations performed at each frame in the outdoor/indoor dataset. The first frames were compared around frame 240, since up to then none of the previous frames were within the search area around the current pose estimate defined by the estimated pose covariance. The diagonal line indicates the linear increase for the case that the frames to match are not preselected.

From the MLR algorithm (see Section II-A) we obtain the maximum likelihood estimate for frame b. There is, however, no estimate of the corresponding covariance that could be used to distinguish whether frame a is likely to be close enough to frame b so that it can be considered a candidate for a match, i.e., a frame for which the similarity measure Sa,b should be calculated. So far, we have defined two types of covariances: the odometry covariance Cro and the visual relation covariance Crv. To obtain an overall estimate of the relative covariance between frame a and b, we first consider the covariances of the odometry relations ro between a and b, and compute relative covariance Cxoa,b as TeX Source $$C_{x^o_{a, b}} = \sum_{j\in(a, b-1)}{\bf R}_j C_{r_{o_j}}{\bf R}_j^T .\eqno{\hbox{(7)}}$$Rj is a rotation matrix, which is defined as TeX Source $${\bf R}_j = \left(\matrix{{\rm cos}(\hat{x}_{j+1}^{\theta} - \hat{x}_j^{\theta}) & -{\rm sin}(\hat{x}_{j+1}^{\theta} - \hat{x}_j^{\theta}) & 0 \cr{\rm sin}(\hat{x}_{j+1}^{\theta} - \hat{x}_j^{\theta}) & {\rm cos}(\hat{x}_{j+1}^{\theta} - \hat{x}_j^{\theta}) & 0 \cr0 & 0 & 1} \right)\eqno{\hbox{(8)}}$$where is the orientation estimated for frame j.

As long as no visual relation rv has been added, either between a and b or any of the frames between a and b, the relative covariance can be determined directly from the odometry covariance Cxoa and Cxob, as described earlier. However, when a visual relation rva,b between a and b is added, the covariance of the estimate decreases. Using the covariance intersection method [18], the covariance for frame b is therefore updated as TeX Source $$C_{\hat{x}_b} = C_{\hat{x}_b} \oplus (C_{\hat{x}_a} + C_{r_v^{a, b}})\eqno{\hbox{(9)}}$$where ⊕ is the covariance intersection operator. The covariance intersection method weighs the influence of both covariances Ca and Cb as TeX Source $$C_A \oplus C_B = [\omega C_A^{-1} + (1-\omega) C_B^{-1}]^{-1} . \eqno{\hbox{(10)}}$$The parameter ω ∊ [0,1] is chosen so that the determinant of the resulting covariance is minimized [19].

The new covariance estimate is also used to update the frames between a and b by adding the odometry covariances Cxoa·.b in opposite order (i.e., simulate that the robot is moving backwards from frame b to a). The new covariance estimate for frame j ∊ (a,b) is calculated as TeX Source $$C_{\hat{x}_j} = C_{\hat{x}_j} \oplus (C_{\hat{x}_b} + C_{x^o_{b, j}}).\eqno{\hbox{(11)}}$$

#### Visual Relation Filtering

To avoid adding visual relations with low similarity, visual similarity relations rva,b between frame a and frame b are added only if the similarity measure exceeds a threshold tvs: Sa,b > tvs. In addition, similarity relations are added only if the similarity value Sa,b has its peak at frame a [compared to the neighboring frames N(a)]. There is no limitation on the number of visual relations that can be added for each frame.

### D. Fusing Multiple Datasets

Fusion of multiple datasets recorded at different times is related to the problem of multirobot mapping where each of the datasets is collected concurrently with a different robot. The motivation for multirobot mapping is not only to reduce the time required to explore an environment but also to merge the different sensor readings in order to obtain a more accurate map. The problem addressed here is equivalent to “multirobot SLAM with unknown initial poses” [20], because the relative poses between the datasets are not given. The exploration problem is not considered in this paper.

Only a minor modification of the standard method described earlier is necessary to address the problem of fusing multiple datasets. The absence of relative pose estimates between the datasets is compensated for by not limiting the search region for which the similarity measures S are computed. This is implemented by adding datasets incrementally and setting the relative pose between consecutively added datasets initially to (0 0 0) with an infinite pose covariance. Such odometry relations between datasets appear as long, diagonal lines in Fig. 16, representing the transition between lab to studarea and studarea to lab-studarea.

SECTION III

## Experimental Results

In this section, we present results from five different datasets with varying properties. An overview of all datasets is presented in Table III. All datasets were collected with our mobile robot Tjorven (see Fig. 4). The platform uses “skid steering,” which is prone to bad odometry. In the different datasets different wheel types (indoor/outdoor) were used. The robot's odometry was calibrated (for each wheel type) by first driving forward 5 m to obtain a distance per encoder tick value, and second by completing one full revolution to determine the number of differential encoder ticks per angular rotation. Finally, the drift parameter was adjusted so that the robot would drive forward in a straight line, i.e., to compensate for the slightly different size of the wheel pairs.

TABLE III For Each Dataset: Number of Nodes , Visual Relations #rv, Performed Similarity Calculations #S, Average Number of Extracted Visual Features μF Per Node With Variance σF, Evaluation Run Time T (Excluding the Similarity Computation)

The omnidirectional images were first converted to panoramic images with a resolution of 1000 × 289. When extracting SIFT features, the initial doubling of the images was not performed, i.e., SIFT features from the first octave were ignored, simply to lower the amount of extracted features.

The results are presented both visually with maps obtained by superimposing laser range data using the poses estimated with mini-SLAM and quantitatively by the MSE from ground truth data. Since the corresponding pose pairs 〈x}i,xGTi〉 between the estimated pose and the corresponding ground truth pose xGTi are known, the optimal rigid transformation between pose estimates and ground truth data can be determined directly. We applied the method suggested by Arun et al. [21].

To investigate the influence of the threshold tvs, described in Section II-C5, the MSE was calculated for all datasets for which ground truth data were available. The result in Fig. 8 shows that the value of the threshold tvs can be selected so that it is nearly optimal for all datasets and that there is a region in which minor changes of the tvs do not strongly influence the accuracy of the map. Throughout the remainder of this section, a constant threshold tvs = 0.2 is used.

Fig. 8. Influence of the threshold parameter tvs on the relative MSE.

In order to give a better idea of the function of the mini-SLAM algorithm, the number of visual relations per node depending on the threshold tvs is shown in Fig. 9. The overview of all datasets presented in Table III also contains the number of similarity calculations performed and the evaluation run time on a Pentium 4 (2 GHz) processor with 512 MB of RAM. This time does not include the time required for the similarity computation. Each similarity calculation (including relative rotation and variance estimation) took 0.30 s using a dataset with an average of 522.3 features with standard deviation of 21.4. However, note that the implementation used for feature matching in this paper was not optimized for computational efficiency.

Fig. 9. Amount of visual nodes added to the graph depending on the threshold tvs.

### A. Outdoor/Indoor Dataset

A large set of 945 omnidirectional images was collected over a total distance of 1.4 km with height differences of up to 3 m. The robot was driven manually, and the data were collected in both indoor and outdoor areas over a period of two days (due to the limited capacity of the camera battery).

#### Comparison to Ground Truth Obtained From DGPS

To evaluate the accuracy of the created map, the robot position was measured with differential GPS (DGPS) while collecting the omnidirectional images. Thus, for every SLAM pose estimate, there is a corresponding DGPS position 〈xi, xDGPSi〉.

DGPS gives a smaller position error than GPS. However, since only the signal noise is corrected, the problem with multipath reflections still remains. DGPS is also available only if the radio link between the robot and the stationary GPS is functional. Thus, only a subset of pose pairs 〈{xi, xDGPSii = 1,…, N can be used for ground truth evaluation. DGPS measurements were considered only when at least five satellites were visible, and the radio link to the stationary GPS was functional. The valid DGPS readings are indicated as light dots in Fig. 10. The total number of pairs used to calculate the MSE for the whole map was 377 compared to the total number of frames of 945. To measure the difference between the poses estimated with mini-SLAM and the DGPS positions xDGPS (using UTM WGS84, which provides a metric coordinate system), the two datasets have to be aligned. Since the correspondence of the filtered pose pairs is known, 〈xi, xDGPSi〉, an optimal rigid alignment can be determined directly with the method by Arun et al. [21], as described earlier.

Fig. 10. DGPS data xDGPS with aligned SLAM estimates displayed on an aerial image of the area. The darker squares show the mini-SLAM pose estimates, and the lighter squares show the DGPS poses for which the number of satellites was considered acceptable. The deviation seen at the bottom (the car park) is mainly caused by the fact that the car park is elevated compared with the rest of the environment.
Fig. 11. Evolution of the MSE between the ground truth position obtained from DGPS readings xDGPS and the mini-SLAM estimate of the robot pose as frames is added to the map. Drops in the MSE indicate that the consistency of the map has been increased. The final MSE of the raw odometry was 377.5 m2.

The MSE between xDGPS and for the dataset shown in Fig. 10 is 4.89 m. To see how it evolves over time when creating the map, the MSE was calculated from the new estimates after each new frame was added. The result is shown in Fig. 11 and compared to the MSE obtained using only odometry to estimate the robot's position. Note that the MSE was evaluated for each frame added. Therefore, when DGPS data are not available, the odometry MSE xo will stay constant for these frames. This can be seen, for example, for the frames 250–440 in Fig. 11. For the same frames, the MSE of the SLAM estimate is not constant since new estimates are computed for each frame added and loop closing also occurs indoors, or generally, when no DGPS is available. The first visual relation rv was added around frame 260. Until then, the error of the mini-SLAM estimate and the odometry MSE xo were the same.

### B. Multiple Floor Levels

This dataset was collected inside a department building at Örebro University. It includes all five floor levels and connections between the floor levels by three elevators. The data contain loops in 2-D coordinates and also involving different floor levels. This dataset consists of 419 panoramic images and covers a path with a length of 618 m. The geometrical layout differs for the different floors (see Fig. 13). No information about the floor level is used as an input to the system; hence, the robot pose is still described using (x,y,θ).

#### Visualized Results

There are no ground truth data available for this dataset. However, it is possible to get a visual impression of the accuracy of the results from Fig. 12. The figure shows occupancy grid maps obtained from laser scanner readings and raw odometry poses (left), or the mini-SLAM pose estimates (right), respectively. All floors are drawn on top of each other without any alignment. To further illustrate the mini-SLAM results, an occupancy map was also created separately for each floor from the laser scanner readings and mini-SLAM pose estimates (see Fig. 13). Here, each pose was assigned to the corresponding floor level manually.

Fig. 12. Occupancy grid map of all five floors drawn on top of each other. (Left) Gridmap created using pose information from raw odometry. (Right) Using the estimated robot poses from mini-SLAM.
Fig. 13. Occupancy maps for floor levels 1–5, computed using laser scanner data at each estimated pose. The assignment of initial poses to floor levels was done manually and is used only to visualize these maps.

This experiment mainly illustrates the robustness of data association that is achieved using omnidirectional vision data. The similarity matrix and a similarity access matrix for the “multiple floor levels” dataset are shown in Fig. 14.

Fig. 14. (Left) Pose similarity matrix for the “multiple floor levels” dataset. (Right) Similarity access matrix showing which similarity measures were used in the mini-SLAM computation. Brighter pixels were used more often.

### C. Partly Overlapping Data

This dataset consists of three separate indoor sets: laboratory (lab), student area (studarea), and a combination of both (labstudarea) (see Fig. 15). Similar to the dataset described in Section III-B, omnidirectional images, 2-D laser range data, and odometry were recorded. Ground truth poses xGT were determined using the laser scanner and odometry together with the MLR approach as in [2].

Fig. 15. Submaps for the partly overlapping data. (Left) lab. (Middle) studarea. (Right) labstudarea, overlapping both lab and studarea.

#### Visualized Results

Fig. 16 shows the final graph (left), a plot of laser scanner readings merged using poses from odometry (middle), and poses obtained with mini-SLAM (right). Fig. 17 shows the similarity matrix and the similarity access matrix for the labstudarea dataset.

#### Comparison to Ground Truth Obtained From Laser-Based SLAM

As described in Section II-D, fusion of multiple maps is motivated both by its need in multirobot mapping and by the increased accuracy of the resulting maps. Instead of simply adding the different maps onto each other, the fused maps also use additional information from the overlapping parts to improve the accuracy of the submaps. This is illustrated in Table IV, which shows the MSE (again obtained by determining the rigid alignment between and xGT) before and after the fusion was performed. While the datasets lab and studarea shows a negligible change in accuracy, labstudarea clearly demonstrate a large improvement.

#### Robustness Evaluation

The suggested method relies on incremental pose estimates (odometry) and a visual similarity measure S. The robustness of the method is evaluated by corrupting these two inputs and evaluating the performance. For this evaluation, the studarea dataset is used, and the tests were repeated ten times.

In the first test, the similarity measures S were corrupted by adding a random value drawn from a Gaussian distribution N}(0,σ) with varying standard deviation σ (see Table V). The amount of added noise has to be compared to the range of [0,1] in which the similarity measure S lies [see (2)].

Fig. 16. (Left) Part of the final MLR graph containing the three different datasets. (Middle) Laser-range scanning-based map using the raw odometry. (Right) Laser-range scanning-based map using the mini-SLAM poses.

The robustness evaluation with respect to the similarity measure S shows that the system can handle additional noise to some extent, but incorrect visual relations will affect the accuracy of the final map. This illustrates that the proposed method, as many others, would have difficulties in perceptually similar locations in case the uncertainty of the pose estimates is high.

In the second test, the odometry values were corrupted by adding additional noise to the incremental distance d and the orientation θ. The corrupted incremental distance d′ is calculated as TeX Source $$d^{\prime} = d + 0.1d{\cal N}(0,\sigma) + 0.2\theta{\cal N}(0,\sigma)\eqno{\hbox{(12)}}$$and the orientation θ as TeX Source $$\theta^{\prime}= \theta + 0.2d{\cal N}(0,\sigma)+\theta{\cal N}(0,\sigma).\eqno{\hbox{(13)}}$$Since the odometry pose estimates are computed incrementally, the whole later trajectory is affected when adding noise at a particular time step.

The results of the robustness evaluation with the corrupted odometry are shown in Fig. 18 together with the MSE of the corrupted odometry. These results show that the system is robust to substantial odometry errors. A failure case is shown in Fig. 19.

Fig. 17. (Left) Pose similarity matrix for the labstudarea dataset. (Right) Similarity access matrix showing which similarity measures are used in the proposed method. Brighter pixels were used more often.
TABLE IV MSE Results Before and After Merging of the Datasets and Using Odometry Only
TABLE V MSE Results (mean and stddev) After Adding a Random Variable Drawn From N(0,σ) to Each Similarity Measure Sa,b
Fig. 18. MSE results (mean and stddev) for x (odometry) and (estimated poses) after corrupting the odometry by adding random values drawn from N(0,σ). The plot also shows the MSE when the odometry covariance is increased with the added noise.
Fig. 19. Failure case where the corrupted odometry error became too large resulting in a corrupted map. (Left) SLAM map. (Right) Raw odometry.
SECTION IV

## Conclusion and Future Work

Mini-SLAM combines the principle of using similarity of panoramic images to close loops at the topological level with a graph relaxation method to obtain a metrically accurate map representation and with a novel method to determine the covariance for visual relations based on visual similarity of neighboring poses. The proposed method uses visual similarity to compensate for the lack of range information about local image features, avoiding computationally expensive and less general methods such as tracking of individual image features.

Experimentally, the method scales well to the investigated environments. The experimental results are presented by visual means (as occupancy maps rendered from laser scans and poses determined by the mini-SLAM algorithm) and by comparison with ground truth (obtained from DGPS outdoors or laser-based SLAM indoors). The results demonstrate that the mini-SLAM method is able to produce topologically correct and geometrically accurate maps at low computational cost. A simple extension of the method was used to fuse multiple datasets so as to obtain improved accuracy. The method has also been used without any modifications to successfully map a building consisting of five floor levels.

Mini-SLAM generates a 2-D map based on 2-D input from odometry. It is worth noting that the “outdoor/indoor” dataset includes variations of up to 3 m in height. This indicates that the mini-SLAM can cope with violations of the flat floor assumption to a certain extent. We expect a graceful degradation in map accuracy as the roughness of the terrain increases. The representation should still be useful for self-localization using 2-D odometry and image similarity, e.g. , using the global localization method in [1], which, in addition, could be used to improve the robustness toward perceptual aliasing when fusing multiple datasets. In extreme cases, of course, it is possible that the method would create inconsistent maps, and a 3-D representation should be considered.

The bottleneck of the current implementation in terms of computation time is the calculation of image similarity, which involves the comparison of many local features. The suggested approach, however, is not limited to the particular measure of image similarity used in this paper. There are many possibilities to increase the computation speed either by using alternative similarity measures that are faster to compute while still being distinctive enough, or by optimizing the implementation, for example, by executing image comparisons on a graphics processing unit (GPU) [22].

Further plans for future work include an investigation of the possibility of using a standard camera instead of an omnidirectional camera, and incorporation of vision-based odometry to realize a completely vision-based system.

## Footnotes

Manuscript received December 14, 2007; revised July 12, 2008. First published September 26, 2008; current version published nulldate. This paper was recommended for publication by Associate Editor J. Leonard and Editor L. Parker upon evaluation of the reviewers' comments.

H. Andreasson and A. J. Lilienthal are with the Applied Autonomous Sensor System (AASS) Research Center, Örebro University, Örebro SE-701 82, Sweden (e-mail: henrik.andreasson@tech.oru.se; achim.lilienthal@tech.oru.se).

T. Duckett is with the Department of Computer Science, University of Lincoln, Lincoln LN6 7TS, U.K. (e-mail: tduckett@lincoln.ac.uk).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

## References

1. Localization for mobile robots using panoramic vision, local features and particle filter

H. Andreasson, A. Treptow, T. Duckett

Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2005, 3348–3353

2. A multilevel relaxation algorithm for simultaneous localisation and mapping

U. Frese, P. Larsson, T. Duckett

IEEE Trans. Robot., vol. 21, issue (2), p. 196–207, 2005-04

3. Mini-SLAM: Minimalistic visual SLAM in large-scale environments based on a new interpretation of image similarity

H. Andreasson, T. Duckett, A. Lilienthal

Proc. IEEE Int. Conf. Robot. Autom. (ICRA 2007), Rome, Italy, pp. 4096–4101

4. Object recognition from local scale-invariant features

D. Lowe

Corfu, Greece
Proc. Int. Conf. Comput. Vision (ICCV), 1999, 1150–1157

5. Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks

S. Se, D. Lowe, J. Little

Int. J. Robot. Res., vol. 21, issue (8), p. 735–758, 2002

6. Online visual motion estimation using FastSLAM with SIFT features

T. Barfoot

Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2005, 579–585

7. A framework for vision based bearing only 3D SLAM

P. Jensfelt, D. Kragic, J. Folkesson, M. Björkman

Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2006, 1944–1950

8. The vSLAM algorithm for robust localization and mapping

N. Karlsson, E. D. Bernardo, J. Ostrowski, L. Goncalves, P. Pirjanian, M. E. Munich

Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2005, 24–29

9. σSLAM: Stereo vision SLAM using the Rao-Blackwellised particle filter and a novel mixture proposal distribution

P. Elinas, R. Sim, J. Little

Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2006, 1564–1570

10. 6dof entropy minimization slam

J. Sez, F. Escolano

Orlando, FL
Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2006, 1548–1555

11. Real-time simultaneous localisation and mapping with a single camera

A. Davison

Proc. IEEE Int. Conf. Comput. Vis. (ICCV 2003), pp. 1403–1410

12. Loop closure detection in SLAM by combining visual and spatial appearance

K. L. Ho, P. Newman

Robot. Auton. Syst., vol. 54, issue (9), p. 740–749, 2006-09

13. Outdoor SLAM using visual appearance and laser ranging

P. M. Newman, D. M. Cole, K. L. Ho

Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2006, 1180–1187

14. Online constraint network optimization for efficient maximum likelihood map learning

G. Grisetti, D. Lordi Rizzini, C. Stachniss, E. Olson, W. Burgard

Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2008, 1880–1885

15. Fast iterative optimization of pose graphs with poor initial estimates

E. Olson, J. Leonard, S. Teller

Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2006, 2262–2269

16. Learning probabilistic motion models for mobile robots

A. I. Eliazar, R. Parr

Proc. 21st Int. Conf. Mach. Learning (ICML), Banff, AB, Canada, 2004, pp. 32–39

17. Rover localization in natural environments by indexing panoramic images

J. Gonzalez-Barbosa, S. Lacroix

Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2002, 1365–1370

18. Dynamic map building and localization for autonomous vehicles,

J. Uhlmann

Dynamic map building and localization for autonomous vehicles,, Ph.D. dissertation, Univ. Oxford, Oxford, U.K., 1995

19. Using covariance intersection for SLAM

S. J. Julier, J. K. Uhlmann

Robot. Auton. Syst., vol. 55, issue (1), p. 3–20, 2007

20. Multi-robot simultaneous localization and mapping using particle filters

A. Howard

Robot.: Sci. Syst. Conf., presented at the, Cambridge, MA, 2005-06

21. Least-squares fitting of two 3-d point sets

K. S. Arun, T. S. Huang, S. D. Blostein

IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-9, issue (5), p. 698–700, 1987-09

22. GPU-based video feature tracking and matching

S. N. Sinha, J.-M. Frahm, M. Pollefeys, Y. Genc

Chapel Hill, NC
Proc. Workshop Edge Comput. Using New Commodity Archit. (EDGE 2006), 2006

## Cited By

No Citations Available

## Keywords

### IEEE Keywords

No Keywords Available

### INSPEC: Controlled Indexing

Gaussian distribution, SLAM (robots), covariance matrices, image sensors, robot vision

### More Keywords

No Keywords Available

No Corrections

## Media

No Content Available
This paper appears in:
IEEE Transactions on Robotics
Issue Date:
OCTOBER 2008
On page(s):
991 - 1001
ISBN:
1552-3098
Print ISBN:
N/A
INSPEC Accession Number:
10301446
Digital Object Identifier:
10.1109/TRO.2008.2004642
Date of Current Version:
31 Oct, 2008

Someda, C.G.

Machler, P.