This paper presents a new vision-based approach to the problem of simultaneous localization and mapping (SLAM). Compared with SLAM approaches that use a 2-D laser scanner, a vision-based approach provides rich information about a substantial part of the environment, which allows it to deal with high levels of occlusion [1] and enables solutions that do not rely strictly on a flat-floor assumption. Cameras also offer a longer range and are therefore advantageous in environments that contain large open spaces.

The proposed method is called “mini-SLAM” since it is minimalistic in several ways. On the hardware side, it relies solely on odometry and an omnidirectional camera as the external source of information. This allows for less expensive systems compared to methods that use 2-D or 3-D laser scanners. Note that the robot used for the experiments was also equipped with a 2-D laser scanner. This laser scanner, however, was not used in the SLAM algorithm but only to visualize the consistency of the created maps.

Apart from the frugal hardware requirements, the method is also minimalistic in its computational demands. Map estimation is performed online by a linear-time SLAM algorithm on an efficient graph representation. The main difference from other vision-based SLAM approaches is that no landmark positions are estimated, which enables the algorithm to scale better with the size of the environment. Instead, a measure of image similarity is used to estimate the relative pose between corresponding images (“visual relations”) and the uncertainty of this estimate. Given these “visual relations” and relative pose estimates between consecutive images obtained from the odometry of the robot (“odometry relations”), the multilevel relaxation algorithm [2] is then used to determine the maximum likelihood estimate of all image poses. Each relation is expressed as a relative pose estimate together with the corresponding covariance. A key insight is that the estimate of the relative pose in the “visual relations” does not need to be very accurate as long as the corresponding covariance is modeled appropriately. This is because the relative pose is used only as an initial estimate that the multilevel relaxation algorithm can adjust according to the covariance of the relation. Therefore, even with fairly imprecise initial estimates of the relative poses, it is possible to build geometrically accurate maps using the information in the covariance of the relative pose estimates. Mini-SLAM was found to produce consistent maps in various environments, including, for example, a dataset of an environment containing indoor and outdoor passages (path length of 1.4 km) and an indoor dataset covering five floor levels of a department building.

Further to our previously published work [3], we extended the mini-SLAM approach to the multirobot SLAM problem, demonstrating its ability to combine multiple overlapping maps with unknown initial poses. We also provide an evaluation of the robustness of the suggested approach with respect to poor odometry or a less reliable measure of visual similarity.

### A. Related Work

Using a camera as the external source of information in SLAM has received increasing attention in recent years. Many approaches extract landmarks using local features in the images and track the positions of these landmarks. As the feature descriptor, Lowe's scale-invariant feature transform (SIFT) [4] has been widely used [5], [6]. An initial estimate of the relative pose change is often obtained from odometry [6], [7], [8], or, where multiple cameras are available as in [9] and [10], multiple-view geometry can be applied to obtain depth estimates of the extracted features. To update and maintain visual landmarks, extended Kalman filters (EKFs) [7], [11] and Rao-Blackwellized particle filters (RBPFs) [6], [9] have been used. In the visual SLAM method proposed in [11], particle filters were utilized to obtain the depth of landmarks, while the landmark positions were updated with an EKF; initial landmark positions had to be provided by the user. The approach described in [8] applies the converse methodology: the landmark positions were estimated with a Kalman filter, and a particle filter was used to estimate the path.

Due to their suitability for addressing the correspondence problem, vision-based systems have been applied as an addition to laser-scanning-based SLAM approaches for detecting loop closure. The principle has been applied to SLAM systems based on a 2-D laser scanner [12] and a 3-D laser scanner [13].

In the approach proposed in this paper, the SLAM optimization problem is solved at the graph level with the multilevel relaxation (MLR) method of Frese and Duckett [2]. This method could be replaced by alternative graph-based SLAM methods, for example, the online method proposed by Grisetti *et al.* [14] based on the stochastic gradient descent method proposed by Olson *et al.* [15].

The rest of this paper is structured as follows. Section II describes the proposed SLAM approach. Then, the experimental setup is detailed, and the results are presented in Section III. The paper ends with conclusions and suggestions for future work (Section IV).

### A. Multilevel Relaxation

The SLAM optimization problem is solved at the graph level with the MLR method of Frese and Duckett [2]. A map is represented as a set of nodes connected in a graph structure. An example is shown in Fig. 1. Each node corresponds to the robot pose at a particular time, and each link to a relative measurement of the spatial relation between the two nodes it connects. In this paper, a node is created for each omni-image, and the terms node and frame are used interchangeably.

The MLR algorithm can be briefly explained as follows. The input is a set *R* of *m* = |*R*| relations on *n* planar frames (i.e., a 2-D representation is used). Each relation *r* ∊ *R* describes the likelihood distribution of the pose of frame *a* relative to frame *b.* Relations are modeled as a Gaussian distribution with mean μ^{r} and covariance *C*^{r}. The output of the MLR algorithm is the maximum likelihood estimation vector for the poses of all the frames. Thus, a globally consistent set of Cartesian coordinates is obtained for the nodes of the graph based on local (relative) and inconsistent (noisy) measurements, by maximizing the total likelihood of all measurements.

### B. Odometry Relations

The mini-SLAM approach is based on two principles. First, odometry is sufficiently accurate if the distance traveled is short. Second, by using visual matching, correspondence between robot poses can be detected reliably even though the search region around the current pose estimate is large. Accordingly, two different types of relations are created in the MLR graph, relations based on odometry *r*_{o} and relations based on visual similarity *r*_{v}.

Odometry relations *r*_{o} are created between successive frames. The relative pose μ_{ro} is obtained directly from the odometry readings, and the covariance *C*_{ro} is estimated using the motion model suggested in [16] as
$$C_{r_o} = \left[\matrix{d^2\delta^2_{X_d} + t^2\delta^2_{X_t} & 0 & 0 \cr0 & d^2\delta^2_{Y_d} + t^2\delta^2_{Y_t} & 0 \cr0 & 0 & d^2\delta^2_{\theta_d} + t^2\delta^2_{\theta_t}} \right]\eqno{\hbox{(1)}}$$where *d* and *t* are the total distance traveled and total angle rotated between two successive frames. The δ_{X} parameters relate to the forward motion, the δ_{Y} parameters to the side motion, and the δ_θ parameters to the rotation of the robot. The six δ-parameters adjust the influence of the distance *d* and rotation *t* in the calculation of the covariance matrix. They were tuned manually once, and then kept constant throughout the experiments.
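The motion model above can be sketched as follows; the six δ-parameters here are illustrative placeholders, not the manually tuned values used in the experiments.

```python
import numpy as np

# Illustrative values for the six delta parameters of Eq. (1);
# the paper tuned these manually once and kept them constant.
DELTAS = {
    "Xd": 0.05, "Xt": 0.01,   # forward-motion noise per metre / per radian
    "Yd": 0.02, "Yt": 0.01,   # side-motion noise
    "Td": 0.01, "Tt": 0.05,   # rotational noise
}

def odometry_covariance(d, t, p=DELTAS):
    """Diagonal covariance of an odometry relation between successive
    frames, from distance travelled d and angle rotated t (Eq. (1))."""
    return np.diag([
        d**2 * p["Xd"]**2 + t**2 * p["Xt"]**2,
        d**2 * p["Yd"]**2 + t**2 * p["Yt"]**2,
        d**2 * p["Td"]**2 + t**2 * p["Tt"]**2,
    ])
```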

### C. Visual Similarity Relations

*1) Similarity Measure*

Given two images *I*_{a} and *I*_{b}, features are first extracted using the SIFT algorithm [4]. This results in two sets of features *F*_{a} and *F*_{b} for frames *a* and *b.* Each feature *F* = 〈[*x*,*y*], *H*〉 comprises the pixel position [*x*,*y*] and a histogram *H* containing the SIFT descriptor. The similarity measure *S*_{a,b} is based on the number of features that match between *F*_{a} and *F*_{b}.

The feature matching algorithm calculates the Euclidean distance between each feature in image *I*_{a} and all the features in image *I*_{b}. A potential match is found if the smallest distance is less than 60% of the second smallest distance. This criterion was found empirically and was also used in [17]. It guarantees that interest point matches are substantially better than all other possible matches. We also do not allow features to be matched against more than one other feature. If a feature has more than one candidate match, the match that has the lowest Euclidean distance among the candidates is selected. Examples of matched features are shown in Fig. 2.

The matching step results in a set of feature pairs *P*_{a,b} with a total number *M*_{a,b} of matched pairs. Since the number of features varies heavily depending on the image content, the number of matches is normalized to *S*_{a,b}∊ [0,1] as
$$S_{a,b} = {M_{a,b}\over {1\over 2}(n_{F_a} + n_{F_b})}\eqno{\hbox{(2)}}$$where *n*_{Fa} and *n*_{Fb} are the number of features in *F*_{a} and *F*_{b}, respectively. A high similarity measure indicates a perceptually similar position.
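The matching and normalization steps can be sketched as follows; `match_features` is a hypothetical helper implementing the 60% ratio test and the one-to-one matching rule described above, operating on plain descriptor arrays rather than full SIFT features.

```python
import numpy as np

def match_features(Fa, Fb, ratio=0.6):
    """Nearest-neighbour matching with the 60% ratio test; each feature
    in Fb is matched at most once (the candidate with the lowest
    Euclidean distance wins). Fa, Fb: descriptor arrays, shape (n, dim)."""
    candidates = {}  # Fb index -> (distance, Fa index)
    for i, fa in enumerate(Fa):
        dists = np.linalg.norm(Fb - fa, axis=1)
        order = np.argsort(dists)
        # Ratio test: best match must be clearly better than second best.
        if len(order) > 1 and dists[order[0]] < ratio * dists[order[1]]:
            j = int(order[0])
            if j not in candidates or dists[j] < candidates[j][0]:
                candidates[j] = (float(dists[j]), i)
    return [(i, j) for j, (_, i) in candidates.items()]

def similarity(Fa, Fb, pairs):
    """Normalised similarity of Eq. (2): matches over mean feature count."""
    return len(pairs) / (0.5 * (len(Fa) + len(Fb)))
```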

*2) Estimation of the Relative Rotation and Variance*

The relative rotation between two panoramic images *I*_{a} and *I*_{b} can be estimated directly from the horizontal displacement of the matched feature pairs *P*_{a,b}. If the flat floor assumption is violated, this will be only an approximation. Here, the relative rotations θ_{p} for all matched pairs *p* ∊ *P*_{a,b} are sorted into a 10-bin histogram and the relative rotation estimate μ_{θ}^{rv} is determined as the maximum of a parabola fitted to the largest bin and its left and right neighbor (see Fig. 3).
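A minimal sketch of the histogram-and-parabola estimator described above, assuming pair rotations are given in radians in [−π, π); the bin count follows the 10-bin histogram of the paper.

```python
import math

def relative_rotation(thetas, bins=10):
    """Rotation estimate from matched-pair rotations: vote into a
    histogram over [-pi, pi) and refine the largest bin by fitting a
    parabola through it and its two neighbours (wrapping at the ends)."""
    width = 2 * math.pi / bins
    hist = [0] * bins
    for t in thetas:
        hist[int((t + math.pi) / width) % bins] += 1
    k = max(range(bins), key=hist.__getitem__)
    y0, y1, y2 = hist[(k - 1) % bins], hist[k], hist[(k + 1) % bins]
    # Sub-bin offset of the parabola's maximum, in units of bin width.
    denom = y0 - 2 * y1 + y2
    offset = 0.0 if denom == 0 else 0.5 * (y0 - y2) / denom
    return -math.pi + (k + 0.5 + offset) * width
```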

To evaluate the accuracy of relative rotation estimates θ_{p}, we collected panoramic images in an indoor laboratory environment and computed the relative orientation with respect to a reference image *I*_{0}. Panoramic images were recorded at a translational distance of 0.5, 1.0, and 2.0 m to the reference image *I*_{0}. The ground truth rotation was obtained by manually measuring the displacement of corresponding pixels in areas along the displacement of the camera. The results in Table I demonstrate the good accuracy obtained. Even at a displacement of 2 m the mean error is only 7.15°.

The rotation variance σ_{θrv}^{2} is estimated by the sum of squared differences between the estimate of the relative rotation μ_{θ}^{rv} and the relative rotation of the matched pairs *P*_{a,b}
$$\sigma_{\theta^{r_v}}^2 = {1\over M_{a,b}-1}\sum_{p\in P_{a,b}}(\mu_{\theta}^{r_v} - \theta_{p})^2.\eqno{\hbox{(3)}}$$To increase the robustness toward outliers, a 10% Winsorized mean is applied. For the evaluated data, this had only a minor effect on the results compared to using an untruncated mean.
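One plausible reading of (3) with Winsorization is sketched below; the paper does not spell out exactly how the 10% Winsorizing is applied, so clamping the extreme squared deviations to their nearest neighbours, as done here, is an assumption.

```python
def rotation_variance(mu, thetas, winsor=0.1):
    """Sample variance of matched-pair rotations around the estimate mu
    (Eq. (3)). The largest and smallest 10% of squared deviations are
    clamped (Winsorized) for robustness to outliers -- an assumed
    interpretation of the paper's 10% Winsorized mean."""
    sq = sorted((mu - t) ** 2 for t in thetas)
    k = int(winsor * len(sq))
    if k > 0:
        # Replace the k most extreme values at each end with the
        # nearest retained value instead of discarding them.
        sq = [sq[k]] * k + sq[k:len(sq) - k] + [sq[-k - 1]] * k
    return sum(sq) / (len(sq) - 1)
```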

*3) Estimation of Relative Position and Covariance*

The mini-SLAM approach does not attempt to determine the position of the detected features. Therefore, the relative position between two frames *a* and *b* cannot be determined very accurately. Instead, we use only image similarity of the surrounding images to estimate [μ_{x}^{rv}, μ_{y}^{rv}], as described shortly. It would be possible to estimate the relative position using multiple-view geometry, but this would introduce additional complexity that we want to avoid.

Instead, geometric information is obtained from an estimate of the covariance of the relative position between a current frame *b* and a previously recorded frame *a.* This covariance estimate is computed using only the similarity measures *S* of frame *b* with *a* and the neighboring frames of *a.*

The number of matched features between successive frames will vary depending on the physical distance to the features (see Figs. 2 and 4). Consider, for example, a robot located in an empty car park where the physical distance to the features is large, and therefore, the appearance of the environment does not change quickly if the robot is moved a certain distance. If, on the other hand, the robot is located in a narrow corridor where the physical distance to the extracted features is small, the number of feature matches in successive frames tends to be smaller if the robot was moved the same distance.

The covariance of the robot pose estimate [*x*,*y*]
$$C_{r_v} = \left[\matrix{\sigma_{x^{r_v}}^2 & \sigma_{x^{r_v}}\sigma_{y^{r_v}} \cr\sigma_{x^{r_v}}\sigma_{y^{r_v}} & \sigma_{y^{r_v}}^2}\right]\eqno{\hbox{(4)}}$$is computed based on how the similarity measure varies over the set *N*(*a*), which contains frame *a* and its neighboring frames. The analyzed sequence of similarity measures is indicated in the zoomed-in visualization of a similarity matrix shown in Fig. 5. In order to avoid issues estimating the covariance orthogonal to the path of the robot if the robot was driven along a straight path, the covariance matrix is simplified by setting σ_{xrv}^{2} = σ_{yrv}^{2} and σ_{xrv}σ_{yrv} = 0. The remaining covariance parameter is estimated by fitting a 1-D Gaussian to the similarity measures *S*_{N(a),b} and the distance traveled as obtained from odometry (see Fig. 6). Two parameters are determined from the nonlinear least-squares fitting process: mean (*d*_μ) and variance (σ_{[x,y]rv}^{2}). The initial estimate of the relative position [μ_{x}^{rv},μ_{y}^{rv}] of a visual relation is calculated as
$$\eqalignno{\mu_x^{r_v} & = {\rm cos}(\mu_{\theta}^{r_v})d_\mu& \hbox{(5)}\cr\mu_y^{r_v} & = {\rm sin}(\mu_{\theta}^{r_v})d_\mu& \hbox{(6)}}$$where *d*_μ is the calculated mean of the fitted Gaussian and μ_{θ} the estimated relative orientation (Section II-C2).
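The Gaussian fit can be sketched as follows; for brevity, this uses a parabola fit in log-similarity space as a stand-in for the nonlinear least-squares fitting used in the paper. It recovers the same *d*_μ and variance whenever the similarity profile is exactly Gaussian.

```python
import numpy as np

def fit_similarity_gaussian(dists, sims):
    """Fit an unnormalised 1-D Gaussian S(d) = A*exp(-(d-mu)^2/(2*var))
    to similarity measures over odometric distances, via a quadratic
    fit in log space (a light-weight stand-in for nonlinear least
    squares). Returns (mu, var) = (d_mu, sigma^2_{[x,y]^{r_v}})."""
    d = np.asarray(dists, float)
    y = np.log(np.asarray(sims, float))
    a, b, c = np.polyfit(d, y, 2)      # y ~ a d^2 + b d + c, with a < 0
    mu = -b / (2 * a)
    var = -1.0 / (2 * a)
    return mu, var
```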

In the experimental evaluation, the Gaussian was estimated using five consecutive frames. To evaluate whether the evolution of the similarity measure in the vicinity of a visual relation can be reasonably approximated by a Gaussian, the mean error between the five similarity measures and the fitted Gaussian was calculated for the outdoor/indoor dataset (the dataset is described in Section III-A). The results in Table II indicate that the Gaussian represents the evolution of the similarity in a reasonable way. Please note that frame *b* is recorded at a later time than frame *a*, meaning that the covariance estimate *C*_{rv}^{a,b} can be calculated directly without any time lag.

*4) Selecting Frames to Match*

In order to speed up the algorithm and make it more robust to perceptual aliasing (the problem that different regions have similar appearance), only those frames are selected for matching that are likely to be located close to each other.

Consider the current frame *b* and a previously recorded frame *a.* If the similarity measure was to be calculated between *b* and all previously added frames, the number of frames to be compared would increase linearly (see Fig. 7). Instead, frames are only compared if the current frame *b* is within a search area around the pose estimate of frame *a.* The size of this search area is computed from the estimated pose covariance.

From the MLR algorithm (see Section II-A), we obtain the maximum likelihood estimate for frame *b.* There is, however, no estimate of the corresponding covariance that could be used to judge whether frame *a* is likely to be close enough to frame *b* to be considered a candidate for a match, i.e., a frame for which the similarity measure *S*_{a,b} should be calculated. So far, we have defined two types of covariances: the odometry covariance *C*_{ro} and the visual relation covariance *C*_{rv}. To obtain an overall estimate of the relative covariance between frames *a* and *b*, we first consider the covariances of the odometry relations *r*_{o} between *a* and *b* and compute the relative covariance *C*_{x^o_{a,b}} as
$$C_{x^o_{a, b}} = \sum_{j\in(a, b-1)}{\bf R}_j C_{r_{o_j}}{\bf R}_j^T .\eqno{\hbox{(7)}}$$**R**_{j} is a rotation matrix, which is defined as
$${\bf R}_j = \left(\matrix{{\rm cos}(\hat{x}_{j+1}^{\theta} - \hat{x}_j^{\theta}) & -{\rm sin}(\hat{x}_{j+1}^{\theta} - \hat{x}_j^{\theta}) & 0 \cr{\rm sin}(\hat{x}_{j+1}^{\theta} - \hat{x}_j^{\theta}) & {\rm cos}(\hat{x}_{j+1}^{\theta} - \hat{x}_j^{\theta}) & 0 \cr0 & 0 & 1} \right)\eqno{\hbox{(8)}}$$where x̂_{j}^{θ} is the orientation estimated for frame *j.*
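Equations (7) and (8) can be sketched as follows; `headings[j]` stands for the estimated orientation x̂_{j}^{θ}, and the per-step covariances would come from the odometry model of (1).

```python
import numpy as np

def rotation_matrix(dtheta):
    """3x3 rotation of Eq. (8) for the heading change dtheta between
    consecutive frame estimates."""
    c, s = np.cos(dtheta), np.sin(dtheta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def relative_odometry_covariance(headings, covs):
    """Eq. (7): accumulate the rotated per-step odometry covariances
    between frames a and b. headings[j] is the estimated orientation of
    frame j; covs[j] is the covariance of the relation j -> j+1."""
    C = np.zeros((3, 3))
    for j, Cj in enumerate(covs):
        R = rotation_matrix(headings[j + 1] - headings[j])
        C += R @ Cj @ R.T
    return C
```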

As long as no visual relation *r*_{v} has been added, either between *a* and *b* or any of the frames between *a* and *b*, the relative covariance can be determined directly from the odometry covariance *C*_{xoa} and *C*_{xob}, as described earlier. However, when a visual relation *r*_{v}^{a,b} between *a* and *b* is added, the covariance of the estimate decreases. Using the covariance intersection method [18], the covariance for frame *b* is therefore updated as
$$C_{\hat{x}_b} = C_{\hat{x}_b} \oplus (C_{\hat{x}_a} + C_{r_v^{a, b}})\eqno{\hbox{(9)}}$$where ⊕ is the covariance intersection operator. The covariance intersection method weighs the influence of both covariances *C*_{a} and *C*_{b} as
$$C_A \oplus C_B = [\omega C_A^{-1} + (1-\omega) C_B^{-1}]^{-1} . \eqno{\hbox{(10)}}$$The parameter ω ∊ [0,1] is chosen so that the determinant of the resulting covariance is minimized [19].
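A minimal sketch of covariance intersection with determinant-minimizing ω, using a simple grid scan over ω in place of the proper 1-D optimization in [19].

```python
import numpy as np

def covariance_intersection(CA, CB, steps=101):
    """Eq. (10): fuse two covariances, choosing omega in [0, 1] to
    minimise the determinant of the result. A fixed grid over omega is
    scanned here for simplicity."""
    best, best_det = None, np.inf
    IA, IB = np.linalg.inv(CA), np.linalg.inv(CB)
    for w in np.linspace(0.0, 1.0, steps):
        M = w * IA + (1.0 - w) * IB
        if np.linalg.det(M) <= 0:
            continue  # skip degenerate combinations
        C = np.linalg.inv(M)
        d = np.linalg.det(C)
        if d < best_det:
            best, best_det = C, d
    return best
```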

The new covariance estimate is also used to update the frames between *a* and *b* by adding the odometry covariances *C*_{x^o_{b,j}} in opposite order (i.e., simulating that the robot moves backwards from frame *b* to *a*). The new covariance estimate for frame *j* ∊ (*a*,*b*) is calculated as
$$C_{\hat{x}_j} = C_{\hat{x}_j} \oplus (C_{\hat{x}_b} + C_{x^o_{b, j}}).\eqno{\hbox{(11)}}$$

*5) Visual Relation Filtering*

To avoid adding visual relations with low similarity, visual similarity relations *r*_{v}^{a,b} between frame *a* and frame *b* are added only if the similarity measure exceeds a threshold *t*_{vs}: *S*_{a,b} > *t*_{vs}. In addition, similarity relations are added only if the similarity value *S*_{a,b} has its peak at frame *a* [compared to the neighboring frames *N*(*a*)]. There is no limitation on the number of visual relations that can be added for each frame.
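The filtering rule can be stated compactly; `S` here is a hypothetical mapping from candidate frame index to its similarity with the current frame *b*.

```python
def accept_relation(S, a, neighbors, t_vs=0.2):
    """Filter of Section II-C: accept a visual relation to frame a only
    if the similarity exceeds the threshold t_vs and peaks at a among
    the neighbouring frames N(a). S maps frame index -> similarity."""
    return S[a] > t_vs and all(S[a] >= S[n] for n in neighbors)
```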

### D. Fusing Multiple Datasets

Fusion of multiple datasets recorded at different times is related to the problem of multirobot mapping where each of the datasets is collected concurrently with a different robot. The motivation for multirobot mapping is not only to reduce the time required to explore an environment but also to merge the different sensor readings in order to obtain a more accurate map. The problem addressed here is equivalent to “multirobot SLAM with unknown initial poses” [20], because the relative poses between the datasets are not given. The exploration problem is not considered in this paper.

Only a minor modification of the standard method described earlier is necessary to address the problem of fusing multiple datasets. The absence of relative pose estimates between the datasets is compensated for by not limiting the search region for which the similarity measures *S* are computed. This is implemented by adding datasets incrementally and setting the relative pose between consecutively added datasets initially to (0 0 0) with an infinite pose covariance. Such odometry relations between datasets appear as long, diagonal lines in Fig. 16, representing the transitions from *lab* to *studarea* and from *studarea* to *lab*–*studarea.*

## III. Experimental Results

In this section, we present results from five different datasets with varying properties. An overview of all datasets is presented in Table III. All datasets were collected with our mobile robot Tjorven (see Fig. 4). The platform uses “skid steering,” which is prone to bad odometry. In the different datasets different wheel types (indoor/outdoor) were used. The robot's odometry was calibrated (for each wheel type) by first driving forward 5 m to obtain a distance per encoder tick value, and second by completing one full revolution to determine the number of differential encoder ticks per angular rotation. Finally, the drift parameter was adjusted so that the robot would drive forward in a straight line, i.e., to compensate for the slightly different size of the wheel pairs.

The omnidirectional images were first converted to panoramic images with a resolution of 1000 × 289 pixels. When extracting SIFT features, the initial doubling of the images was not performed, i.e., SIFT features from the first octave were ignored, simply to lower the number of extracted features.

The results are presented both visually, with maps obtained by superimposing laser range data using the poses estimated with mini-SLAM, and quantitatively, by the MSE from ground truth data. Since the corresponding pose pairs 〈*x̂*_{i}, *x*^{GT}_{i}〉 between each estimated pose *x̂*_{i} and the corresponding ground truth pose *x*^{GT}_{i} are known, the optimal rigid transformation between pose estimates and ground truth data can be determined directly. We applied the method suggested by Arun *et al.* [21].
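The SVD-based alignment of Arun *et al.* can be sketched for the 2-D case as follows; `mse` then evaluates the error after the optimal rigid alignment.

```python
import numpy as np

def align_rigid(P, Q):
    """Least-squares rigid alignment (Arun et al., via SVD): find R, t
    minimising sum ||R @ p + t - q||^2 over corresponding 2-D points."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)          # cross-covariance of centred sets
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections in degenerate configurations.
    D = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t

def mse(P, Q):
    """Mean squared error between aligned estimates and ground truth."""
    R, t = align_rigid(P, Q)
    err = np.asarray(P) @ R.T + t - np.asarray(Q)
    return float(np.mean(np.sum(err ** 2, axis=1)))
```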

To investigate the influence of the threshold *t*_{vs}, described in Section II-C5, the MSE was calculated for all datasets for which ground truth data were available. The result in Fig. 8 shows that the value of the threshold *t*_{vs} can be selected so that it is nearly optimal for all datasets and that there is a region in which minor changes of the *t*_{vs} do not strongly influence the accuracy of the map. Throughout the remainder of this section, a constant threshold *t*_{vs} = 0.2 is used.

In order to give a better idea of the function of the mini-SLAM algorithm, the number of visual relations per node depending on the threshold *t*_{vs} is shown in Fig. 9. The overview of all datasets presented in Table III also contains the number of similarity calculations performed and the evaluation run time on a Pentium 4 (2 GHz) processor with 512 MB of RAM. This time does not include the time required for the similarity computation. Each similarity calculation (including relative rotation and variance estimation) took 0.30 s using a dataset with an average of 522.3 features with standard deviation of 21.4. However, note that the implementation used for feature matching in this paper was not optimized for computational efficiency.

### A. Outdoor/Indoor Dataset

A large set of 945 omnidirectional images was collected over a total distance of 1.4 km with height differences of up to 3 m. The robot was driven manually, and the data were collected in both indoor and outdoor areas over a period of two days (due to the limited capacity of the camera battery).

*Comparison to Ground Truth Obtained From DGPS*

To evaluate the accuracy of the created map, the robot position was measured with differential GPS (DGPS) while collecting the omnidirectional images. Thus, for every SLAM pose estimate, there is a corresponding DGPS position 〈*x*_{i}, *x*^{DGPS}_{i}〉.

DGPS gives a smaller position error than GPS. However, since only the signal noise is corrected, the problem with multipath reflections still remains. DGPS is also available only if the radio link between the robot and the stationary GPS is functional. Thus, only a subset of pose pairs 〈*x̂*_{i}, *x*^{DGPS}_{i}〉_{i = 1,…, N} can be used for ground truth evaluation. DGPS measurements were considered only when at least five satellites were visible and the radio link to the stationary GPS was functional. The valid DGPS readings are indicated as light dots in Fig. 10. The total number of pairs used to calculate the MSE for the whole map was 377, compared to the total number of frames of 945. To measure the difference between the poses estimated with mini-SLAM and the DGPS positions *x*^{DGPS} (using UTM WGS84, which provides a metric coordinate system), the two datasets have to be aligned. Since the correspondence of the filtered pose pairs 〈*x̂*_{i}, *x*^{DGPS}_{i}〉 is known, an optimal rigid alignment can be determined directly with the method by Arun *et al.* [21], as described earlier.

The MSE between *x*^{DGPS} and the mini-SLAM estimates *x̂* for the dataset shown in Fig. 10 is 4.89 m. To see how the error evolves over time while creating the map, the MSE was recalculated from the new estimates after each new frame was added. The result is shown in Fig. 11 and compared to the MSE obtained using only odometry to estimate the robot's position. Note that the MSE was evaluated for each frame added. Therefore, when DGPS data are not available, the odometry MSE will stay constant for these frames. This can be seen, for example, for frames 250–440 in Fig. 11. For the same frames, the MSE of the SLAM estimate is not constant, since new estimates are computed for each frame added and loop closure also occurs indoors or, generally, when no DGPS is available. The first visual relation *r*_{v} was added around frame 260. Until then, the error of the mini-SLAM estimate and the odometry MSE were the same.

### B. Multiple Floor Levels

This dataset was collected inside a department building at Örebro University. It includes all five floor levels and the connections between them by three elevators. The data contain loops both within individual floors and across different floor levels. The dataset consists of 419 panoramic images and covers a path with a length of 618 m. The geometrical layout differs between the floors (see Fig. 13). No information about the floor level is used as an input to the system; hence, the robot pose is still described using (*x*,*y*,θ).

*Visualized Results*

There are no ground truth data available for this dataset. However, it is possible to get a visual impression of the accuracy of the results from Fig. 12. The figure shows occupancy grid maps obtained from laser scanner readings using the raw odometry poses (left) and the mini-SLAM pose estimates (right). All floors are drawn on top of each other without any alignment. To further illustrate the mini-SLAM results, an occupancy map was also created separately for each floor from the laser scanner readings and mini-SLAM pose estimates (see Fig. 13). Here, each pose was assigned to the corresponding floor level manually.

This experiment mainly illustrates the robustness of data association that is achieved using omnidirectional vision data. The similarity matrix and a similarity access matrix for the “multiple floor levels” dataset are shown in Fig. 14.

### C. Partly Overlapping Data

This dataset consists of three separate indoor sets: laboratory (*lab*), student area (*studarea*), and a combination of both (*lab*–*studarea*) (see Fig. 15). Similar to the dataset described in Section III-B, omnidirectional images, 2-D laser range data, and odometry were recorded. Ground truth poses *x*^{GT} were determined using the laser scanner and odometry together with the MLR approach as in [2].

*Visualized Results*

Fig. 16 shows the final graph (left), a plot of laser scanner readings merged using poses from odometry (middle), and poses obtained with mini-SLAM (right). Fig. 17 shows the similarity matrix and the similarity access matrix for the *lab*–*studarea* dataset.

*Comparison to Ground Truth Obtained From Laser-Based SLAM*

As described in Section II-D, fusion of multiple maps is motivated both by its need in multirobot mapping and by the increased accuracy of the resulting maps. Instead of simply overlaying the different maps, the fusion also uses the additional information from the overlapping parts to improve the accuracy of the submaps. This is illustrated in Table IV, which shows the MSE (again obtained by determining the rigid alignment between *x̂* and *x*^{GT}) before and after the fusion was performed. While the datasets *lab* and *studarea* show a negligible change in accuracy, *lab*–*studarea* clearly demonstrates a large improvement.

*Robustness Evaluation*

The suggested method relies on incremental pose estimates (odometry) and a visual similarity measure *S.* The robustness of the method is evaluated by corrupting these two inputs and evaluating the performance. For this evaluation, the *studarea* dataset is used, and the tests were repeated ten times.

In the first test, the similarity measures *S* were corrupted by adding a random value drawn from a Gaussian distribution *N*(0,σ) with varying standard deviation σ (see Table V). The amount of added noise has to be compared to the range [0,1] in which the similarity measure *S* lies [see (2)].

The robustness evaluation with respect to the similarity measure *S* shows that the system can handle additional noise to some extent, but incorrect visual relations will affect the accuracy of the final map. This illustrates that the proposed method, like many others, would have difficulties in perceptually similar locations if the uncertainty of the pose estimates is high.

In the second test, the odometry values were corrupted by adding additional noise to the incremental distance *d* and the orientation θ. The corrupted incremental distance *d*′ is calculated as
$$d^{\prime} = d + 0.1d{\cal N}(0,\sigma) + 0.2\theta{\cal N}(0,\sigma)\eqno{\hbox{(12)}}$$and the orientation θ^{′} as
$$\theta^{\prime}= \theta + 0.2d{\cal N}(0,\sigma)+\theta{\cal N}(0,\sigma).\eqno{\hbox{(13)}}$$Since the odometry pose estimates are computed incrementally, the whole later trajectory is affected when adding noise at a particular time step.
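The corruption model of (12) and (13) is straightforward to reproduce:

```python
import random

def corrupt_odometry(d, theta, sigma, rng=random):
    """Noise model of Eqs. (12)-(13) for the robustness test: distance
    and rotation noise are coupled, so translation errors also corrupt
    the heading and vice versa."""
    d2 = d + 0.1 * d * rng.gauss(0, sigma) + 0.2 * theta * rng.gauss(0, sigma)
    th2 = theta + 0.2 * d * rng.gauss(0, sigma) + theta * rng.gauss(0, sigma)
    return d2, th2
```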

The results of the robustness evaluation with the corrupted odometry are shown in Fig. 18 together with the MSE of the corrupted odometry. These results show that the system is robust to substantial odometry errors. A failure case is shown in Fig. 19.

## IV. Conclusion and Future Work

Mini-SLAM combines the principle of using similarity of panoramic images to close loops at the topological level with a graph relaxation method to obtain a metrically accurate map representation and with a novel method to determine the covariance for visual relations based on visual similarity of neighboring poses. The proposed method uses visual similarity to compensate for the lack of range information about local image features, avoiding computationally expensive and less general methods such as tracking of individual image features.

Experimentally, the method scales well to the investigated environments. The experimental results are presented by visual means (as occupancy maps rendered from laser scans and poses determined by the mini-SLAM algorithm) and by comparison with ground truth (obtained from DGPS outdoors or laser-based SLAM indoors). The results demonstrate that the mini-SLAM method is able to produce topologically correct and geometrically accurate maps at low computational cost. A simple extension of the method was used to fuse multiple datasets so as to obtain improved accuracy. The method has also been used without any modifications to successfully map a building consisting of five floor levels.

Mini-SLAM generates a 2-D map based on 2-D input from odometry. It is worth noting that the “outdoor/indoor” dataset includes variations of up to 3 m in height. This indicates that mini-SLAM can cope with violations of the flat floor assumption to a certain extent. We expect a graceful degradation in map accuracy as the roughness of the terrain increases. The representation should still be useful for self-localization using 2-D odometry and image similarity, e.g., using the global localization method in [1], which, in addition, could be used to improve the robustness toward perceptual aliasing when fusing multiple datasets. In extreme cases, of course, it is possible that the method would create inconsistent maps, and a 3-D representation should be considered.

The bottleneck of the current implementation in terms of computation time is the calculation of image similarity, which involves the comparison of many local features. The suggested approach, however, is not limited to the particular measure of image similarity used in this paper. There are many possibilities to increase the computation speed either by using alternative similarity measures that are faster to compute while still being distinctive enough, or by optimizing the implementation, for example, by executing image comparisons on a graphics processing unit (GPU) [22].

Further plans for future work include an investigation of the possibility of using a standard camera instead of an omnidirectional camera, and incorporation of vision-based odometry to realize a completely vision-based system.