• Abstract

# Information-Efficient 3-D Visual SLAM for Unstructured Domains

This paper presents a novel vision-based sensory package and an information-efficient simultaneous localization and mapping (SLAM)algorithm. Together, we offer a solution for building 3-D dense map in an unknown and unstructured environment with minimal computational costs. The sensory package we adopt consists of a conventional camera and a range imager, which provide range and bearing and elevation inputs as commonly used by 3-D feature-based SLAM. In addition, we propose an algorithm to give the robots theintelligence” to select, out of the steadily collected data, the maximally informative observations to be used in the estimation process. We show that, although the actual evaluation of information gain for each frame introduces an additional computational cost, the overall efficiency is significantly increased by keeping the matrix compact. The noticeable advantage of this strategy is that the continuously gathered data are not heuristically segmented prior to being input to the filter. Quite the opposite, the scheme lends itself to be statistically optimal and is capable of handling large datasets collected at realistic sampling rates.

SECTION I

## Introduction and Related Work

ONE of the many important applications of mobile robots is to reach and explore terrains that are inaccessible or considered dangerous to humans. Such environments are frequently encountered in search and rescue scenarios where prior knowledge of the environment is unknown but required before any rescue operation can be deployed. A small mobile robot equipped with an appropriate sensor package can possibly be regarded as the best aid in such scenario. The robot is expected to navigate itself through the site and generate maps of the environment that human rescuers can then use for navigating and locating victims. Despite the wealth of research in planar robot mapping, the limitations of traditional 2-D feature maps in providing useful and understandable information in these scenarios have propelled increasing research efforts toward the generation of richer and more descriptive 3-D maps instead.

3-D mapping of a search and rescue scenario using a lightweight and highly mobile autonomous robot is a challenging predicament that can be framed within the generic simultaneous localization and mapping (SLAM) problem [1], where a robot is expected to move with 6 DOF in a 3-D environment. This demanding picture is further complicated by the lack of odometry information from the wheel encoders, as this tends to be totally unreliable due to the nature of the disaster environment. The emerging approach based on the ubiquitous laser range scanner lies in deriving odometry from 3-D scan matching, possibly fusing textured images a posteriori to recover the final shape [2]. When the noise on the relative locations is small, as is the case of an accurate range sensor such as tilting laser range finders or other forms of egomotion sensors, such as large and expensive inertial measurement units(IMUs), a good quality estimate of the final 3-D map can be expected if the displacement between robot poses is limited. By doing so the SLAM problem is reduced to that of the estimation of a state vector containing all the camera/robot poses. This is the approach used by many researchers in the SLAM community [3], [10], [12].

Further exploitation of appearance sensors, capable of providing scene texture information to the rescuer as well as salient visual features for “place recognition” in the estimation process itself, is becoming more prominent as they provide useful and understandable information which humans can then use for navigating and locating victims in rescue scenarios. The use of a stereo camera rig for 3-D mapping is advantageous as that makes the system fully observable in that the sensor provides enough information (range, bearing, and elevation) to compute the full3-D state of the observed landmarks. Some approaches rely on 2-D projections of the features, such as the vertical edges corresponding to corners and door frames as proposed in [15], which are subsequently tracked using an extended Kalman filter (EKF). This type of visual feature representation is not sufficiently distinctive and elaborate data association solutions are required to reject spurious matches. The real-time, intelligently controlled, optical positioning system(TRICLOPS) system was used in [16] and [22] where three images are obtained at each frame. Scale-invariant feature transformation (SIFT) features are detected, matched, and used in an EKF. The approach was however not globally consistent as the cross-correlations were not fully maintained for computational reasons. A Rao–Blackwellised particle filter was instead proposed in [20]. A motion model based purely on visual odometry was used, effectively generalizing the problem to unconstrained 3-D motion. Feature management and computational complexity, which grow exponentially with the number of particles, is likely to make this strategy infeasible in a large environment. Other approaches rely on iterative minimization algorithms from vision techniques such as iterative closest point (ICP) to perform local 3-D alignments to solve the SLAM problem [18], which produce good results when restricted to fairly structured environments and limited egomotion dynamics. Similar limitations are reported in [21], where 3-D line segments became landmarks suitable to be used in a particle filter implementation in which each particle also carried with it an EKF. Although these are promising approaches to solve the SLAM problem using visual cues, camera calibration and stereo correlation are still arbitrarily fickle. But more relevant to the harsh conditions encountered in search and rescue scenarios, determining correspondences between pixels in stereo pairs is a slow process that strongly depends on accommodating surfaces for the presence of texture, as well as the particular ambient lighting conditions [26].

Recently, more research effort has been directed toward the idea of fusing visual and range sensorial data to obtain dense 3-D maps. For example, a rotating laser scanner was employed in [2] to generate an initial rough 3-D map using ICP registrations, improved with dense texture images to recover the final 3-D shape. A tilting 2-D laser scanner and a camera for SLAM in an outdoor environment was also used in [24] and [25], where the camera was mainly used in the detection of loop closure. Our work in [6] was also focused on sequential 3-D dense map reconstruction within a visual SLAM framework. The novelty was in the shape of a combined visual sensor package proposed to enhance the salient features and scene textures captured from a conventional passive camera, with the 3-D data of the corresponding scene as provided by a low-resolution range imager—successfully proved in [17] to have the ability to be used for some simple local navigation. The combined observations made by these two cameras were then used as the sole input to a conventional extended information filter (EIF) based SLAM algorithm thereafter. The key idea is to use the traditional pinhole camera image for textural 3-D reconstruction and to help in the data association, and use a selection of the combined 3-D features from the aggregated camera and range scans to update the filter. Once the accurate camera pose sequence (in the global frame) at which the scans were taken was available, the textured ranging images could be combined to produce the final 3-D point cloud map.

In the proposed technique, the robot and feature poses needed to be recovered at the end of each “acquire, update” cycle. However, the inversion of the information matrix evoked during each estimation cycle came at a significant computational cost. Although the sensor package was capable of delivering data at a frame rate of 5–10 Hz, which theoretically guaranteed in itself appropriate frame registration, the increasing sampling delay between consecutive frames caused by extensive computation could yield inadequate data association and therefore unsuccessful frame registration. Such problem is particularly magnified in an unstructured environment where the robot's sights change rapidly and unpredictably along its undulating path.

Many efforts have been made in recent years to reduce the computational encumbrance generally faced by most SLAM algorithms, particularly in its most efficient information form. In related work [11], Eustice et al. implemented an exactly sparse delayed-state filter (ESDSF) that maintained a sequence of delayed robot poses at places where low overlap images were captured manually. A delayed-state EKF was also used in [12] to fuse the data acquired with a 3-D laser range finder. A segmentation algorithm was employed to separate the data stream into distinct point clouds based on orientation constraints, each referenced to a vehicle position. Both implementations significantly reduced the computational cost by eliminating features from the state vector to achieve practical performance. However, one noticeable common problem of these strategies is that loop closure cannot be automatically detected. Separate loop closure methods were required in conjunction with their proposed techniques, unlike the visual (and outlier removal)data association contemplated in [6]. Furthermore, both methods require either human supervision over the data acquisition process or raw odometry measurements to minimize the number of critical robot poses that should be maintained, none of which are available in the settings of a search and rescue scenario. An alternative strategy has recently been contemplated in [19] with a global batch rectification strategy that minimizes mutual information between images in order to achieve a manageable number of variables for the fusion process. The scheme essentially provides an efficient optimization solution at a claimed high-action compression rate. However, the robot and feature uncertainty beliefs are not accounted for and actions cannot be decompressed when needed.

SECTION II

## Proposed Methodology

While it has been demonstrated [6] that dense 3-D maps can be constructed with carefully prepared datasets collected in a static fashion, in the sense that each camera pose needs to be manually situated to ensure extensive overlapping between consecutive frames as well as the full coverage of the arena, in real applications such as search and rescue, it is unrealistic to expect such “ideal”datasets. The sensor package is more likely to be operated at its maximum rate to overcome problems such as motion blurriness and drastic changes in the scene due to the inherent nature of unstructured environments. Thus, an alternative approach to address the computational issues encountered in more realistic settings is proposed here: instead of focusing on minimizing the information gathered and trying to compute them in mathematically efficient ways, we seek a solution where we can collect information at maximum sensor rates and give the robot the“intelligence” to choose the critical observations that should be incorporated in the estimation process. To accomplish that, in this paper, we extend our current methodology with an improved filtering technique whereby for some given estimation error bounds, a buffer of continuous visual scans can be sampled but only those providing maximal information gain need actually be introduced in the filter.

With this technique, the filter incorporates only a minimal number of robot poses, but critically distributed along the trajectory in an automatic manner based on the robot uncertainty belief. As discussed in [14], by incorporating all of the robot poses in the state vector, the potential for divergence compared to an EKF or an EIF SLAM that maintains only the last pose of the robot in the state vector is reduced. Moreover, we can also afford to maintain both robot and feature poses in the state vector, in contrast with the state vector proposed by other researchers [3], [10], [12].By adding features in the state vector, the estimation accuracy of the camera poses is expected to improve compared to strategies where only robot poses are kept, since there are uncertainties associated with the feature locations. Hence, including feature locations in the state vector is expected to aid the filter in adjusting both camera poses and feature locations to better fit the observation data. Moreover, adding features to the state vector also means that observation noise can also be dealt with directly in the filter. If only the camera poses are included in the state vector, the raw observations need to be converted back into the relative camera pose transformations and the uncertainty of the transformation needs to be computed. Usually, some approximation is needed to compute the covariance of the transformation. For example, it is assumed in [3] that range noise for each point has the same Gaussian distribution, whereas the covariance matrix is derived in [12] by sampling the error surface around the converged minima, and then, fitting a Gaussian to that local surface. For 3-D laser scans, it is probably acceptable to assume that the range noise is the same for each point. But in the case of the SwissRanger, range noise associated with different pixels is quite different.

The rest of this paper is structured as follows. Section III describes the visual sensor package. The3-D feature extraction and registration process is explained in Section IV. Section V covers the mathematical formulation of the EIF SLAM algorithm and the proposed information-efficient strategy. Section VIpresents both simulation and experimental results. Discussion and concluding remarks are drawn in Sections VII and VIII where improvements and future work directions are also proposed.

SECTION III

## Visual 3-D Sensor Package

In this paper, we have employed an improved version of the sensor package used in [6] that consists of a time-of-flight range camera (SwissRanger SR-3000, low resolution, 176 × 144 pixels) and a higher resolution conventional camera (Point Grey Dragonfly2, 1024 × 768pixels). The two cameras are fixed relative to each other, as illustrated in Fig. 1.

Fig. 1. Sensor package: a conventional pinhole camera aligned with the SwissRanger SR-3000 ranger.

The SwissRanger works on the principle of emitting modulated infrared light on the scene with a 20 MHz frequency, and then, measuring the phase shift of the reflection to provide 3-D range data without an additional tilting/panning mechanism, albeit within a limited range (the known nonambiguity range of the sensor [4] is 7.5 m). In addition to distance information, the SwissRanger is also able to return information about the reflected signal's amplitude, hence capturing an intensity image of the scene. However, this is currently too noisy and subject to substantial changes in illumination as the camera pose changes. Hence, the proposal for a conventional camera, insensitive to the infrared light emitted by the SwissRanger, is to capture scene texture and extract salient visual features to aid the SLAM algorithm.

SECTION IV

## Feature Extraction and Frame Registration

With no prior knowledge of the robot motion nor the scene, an efficient mechanism to estimate the relative pose between two images is required as an input to the filter. A popular choice drawn from computer vision as a fundamental component of many image registration and object recognition algorithms is the SIFT [9]. The main strength of SIFT is to produce a feature descriptor that allows quick comparisons with other features and is rich enough to allow these comparisons to be highly discriminatory. Robust data association between SIFT features is particularly effective as the descriptor representation is designed to avoid problems due to boundary effects, i.e., smooth changes in location, orientation, and scale do not cause radical changes in the feature vector. Furthermore, the representation is surprisingly resilient to deformations such as those caused by perspective effects [5].The process for the frame registrations is as follows: first, the two cameras are stereo calibrated (the MATLAB camera calibration toolbox at www.vision.caltech.edu/bouguetj/calib_doc` was used for that purpose). SIFT features are then detected in the 2-D camera image and matched across those in the previous images. Due to the offset between the two cameras, there is not a one-to-one corresponding pixel that can be obtained directly from the SwissRanger's intensity image. However, given the calibration information, we can compute the 3-D position at the point where the feature visual cue (bearing) should intersect the 3-D point cloud. If a point can indeed be located around the intersection point and its distance is within the known SwissRanger's measurement precision at that depth, we register this 3-D point as a potential feature. This process is illustrated in Fig. 2 for a single SIFT point. Applying a least square 3-D point set registration algorithm [8] and an outlier removal [13], we obtain a subset of features that can then be used for estimating the initial value of the new camera pose, with the previous camera pose as prior. All features are introduced as individual observations. While under normal illumination condition hundreds of SIFT features can be extracted from each image, the number of features used at the actual filtering stage is significantly reduced by the aforementioned pairwise matching and registration with depth information.

Fig. 2. Example of feature 2-D to 3-D registration. (a) 2-D SIFT feature is first extracted from the traditional camera image before their 3-D location is registered. (b) Bearing to feature intersecting with point cloud. Darker points indicate area of search in the point cloud around the bearing line depending on effective ranger resolution around that range. (c) Potential 3-D feature projected back onto the intensity image. Feature location is the closest match within given measurement uncertainty boundaries, yet it does not match exactly that of the pinhole camera due to resolution differences and calibration accuracies.
SECTION V

## Efficient EIF SLAM

### A. Information-Efficient Filtering Strategy

As described in Section I, in the search and rescue scenario, the desire is for a system that can deliver not only maximal information but also a human comprehensible presentation of the environment in minimal time. To do so, a“look-ahead and search backwards” algorithm is proposed. The idea is maximizing data collection by buffering up all the information available but only choosing the most crucial data to be processed for mapping and localization. Pseudocode for the proposed algorithm is described in Algorithm 1.Assuming the robot begins at the origin [0,0,0] of the global coordinate frame at time t = 0, the feature global poses can be established and used as the first “base” frame Fbase.For the following individual frames Fi, matching features are found between Fbase and each Fi, unless the number of common features reaches a predefined minimum or the number of frames being examined exceeds the look-ahead buffer size. The minimal number of common features is restricted by the 3-D registration algorithm [8] while buffer size is an empirical number determined by the traveling speed of the device in order to guarantee a set of images with sufficient overlapping across the sequence, as well as the desired coarseness of the map.3-D registration is performed on each frame with respect to their matching Fbase (all global coordinates). Given new observations made at each new robot pose, information gain can be obtained without recovering the state vector, which is the major computational expense in EIF SLAM. Since each new frame introduces a different number of new features, the information matrix is expanded with the minimal number of new features found across all frames in the buffer. This procedure is to guarantee that the expanded information matrices contain the same number of elements so that their determinants can be compared. The camera pose at which the observations provide maximal information gain is added to the filter, and the corresponding frame is included as a new entry in the Fbase database. Frames in the look-ahead buffer previous to the new Fbase are dropped, and a new buffer is started from the consecutive frame after the last entry in Fbase. The same procedure is repeated until the end of the trajectory.

For the occasional circumstance when there are no matches between frames in the look-ahead buffer and frames in the Fbasedatabase, matching is attempted with the previously dropped frames. If one of these frames provides sufficient matching features, it will be treated as a new frame and both will be updated in the filter. This mechanism ensures that crucial information can be added back to the filter at any time to mitigate such undesirable situation in search and rescue as when rescuers become rescuees themselves. Although the robot is not processing every frame it acquired during the traversed course, its comprehensive collection of information about the environment becomes handy when the “kidnap” situation manifests itself. The robot's knowledge is not limited to a set of known positions as it is also in possession of additional nonfiltered information from the scenes in between base positions which it can retrieve to potentially recover to a known position. While the sudden risk of not knowing its whereabouts is certainly always looming in such risky and unpredictable environment, the proposed mechanism nevertheless increases the chances for a robot to regain its estimated location based on past knowledge in an efficient manner.

### B. EIF SLAM

Computational advantages of using an EIF rather than an EKF are now well known, particularly in situations where excessive information is to be processed. This work employs an EIF that maintains all the features as well as the entire sequence of camera poses in the state vector. New camera poses are initialized with respect to the best matching frame at a known pose, and measurement updates are additive in the information form. The sensor package is assumed to operate in full 6 DOF without a process model; therefore, the formulation of the filter becomes simpler and results in a naturally sparse information matrix: there is no motion prediction step to correlate the current state with previous states. For full 6 DOF SLAM, the state vector X contains a set of 3-D features and a set of camera poses. The camera poses are represented as TeX Source $$\left(x_C, y_C, z_C,\alpha_C, \beta_C, \gamma_C\right)\eqno{\hbox{(1)}}$$ in which the notation αC, βC, and γCrepresents the ZYX Euler angle rotation and the corresponding rotation matrix is referred to asRPYCCC).A 3-D point feature in the state vector is represented by TeX Source $$\left(x_F, y_F, z_F\right)\eqno{\hbox{(2)}}$$expressed in the global coordinate frame. Let i represent the information vector and I be the associated information matrix. The relationship between the estimated state vector , the corresponding covariance matrix P, the information vector i, and the information matrix I is TeX Source $$\hat{X} = I^{-1}i, \qquad P = I^{-1}.\eqno{\hbox{(3)}}$$The first camera pose is chosen as the origin of the global coordinate system. At time t = 0, the state vector X contains only the initial camera pose [0,0,0,0,0,0]T, and the corresponding 6 × 6 diagonal information matrix I is filled with large diagonal values representing the camera starting at a known position. A 3-D feature is detected by the sensor package as a pixel location u,v and a depth d. These observed parameters can be written as a function of the local coordinates (xL, yL, zL)of the 3-D feature with respect to the current camera pose TeX Source $$\left(\matrix{u\cr v\cr d}\right)=\left(\matrix{\displaystyle -{f x_L\over z_L}\cr\displaystyle -{f y_L\over z_L}\cr\displaystyle \sqrt{x_L^2+y_L^2+z_L^2}}\right)\eqno{\hbox{(4)}}$$where f is the focal length. For convenience, this relationship is referred to as TeX Source $$\left(\matrix{u\cr v\cr d}\right) =H_1\left(\left(\matrix{x_L\cr y_L\cr z_L}\right)\right).\eqno{\hbox{(5)}}$$The relation between the feature local coordinates and the state vector X, represented in global coordinates, is TeX Source $$\left(\matrix{x_L \cr y_L\cr z_L }\right)=RPY(\alpha_C, \beta_C, \gamma_C)^T \left(\left(\matrix{ x_F\cr y_F \cr z_F }\right)- \left(\matrix{ x_C \cr y_C\cr z_C }\right) \right)\eqno{\hbox{(6)}}$$referred to as TeX Source $$\left(\matrix{ x_L \cr y_L \cr z_L}\right)=H_2(X).\eqno{\hbox{(7)}}$$Thus, the observation model becomes TeX Source $$\left(\matrix{u\cr v\cr d}\right)=H(X)=H_1(H_2(X))+w\eqno{\hbox{(8)}}$$where w is the observation noise. It is assumed that w is(approximately) a zero-mean Gaussian noise (3-D vector). In the update step, the information vector and information matrix update can be described by TeX Source \eqalignno{I(k+1) & = I(k) + \nabla H_{k+1}^T Q_{k+1}^{-1} \nabla H_{k+1}\cr i(k+1) & = i(k) + \nabla H_{k+1}^TQ_{k+1}^{-1} \cr& \quad\times [z(k+1) {-} H_{k+1}(\hat{X}(k)) {+} \nabla H_{k+1}\hat{X}(k)]&\hbox{(9)}}where Qk+1 is the covariance matrix of the observation noisewk+1 and z(k+1) is the observation vector. The corresponding state vector estimation can be computed by solving a linear equation TeX Source $$I(k+1)\hat{X}(k+1)=i(k+1). \eqno{\hbox{(10)}}$$Detailed derivation of the Jacobian can be found in [6]. In covariance form, the determinant of the N × N covariance matrix indicates the volume of theN-dimensional uncertainty polyhedron of the filter. The smaller the volume, the more confident the filter is about its estimation. As the information matrix has an inverse relationship with the covariance matrix, as described by (3), the maximally informative frame must update the information matrix to have the largest determinant. We use the natural logarithm of the information matrix determinant, denoted as log(det(I(k+1))), as the measurable quantity of this information update. As described in Section V-A, in a sequence of overlapped images containing common features, each image is evaluated with respect to the base frame database Fbase. Thus, in order to proceed with the actual update of the filter, the pose corresponding to the frame such that log(det(I(k+1))) is maximized becomes the one added to the filter. An empirical threshold to gauge the update quality is also defined based on the desired coarseness of the map. When the maximum determinant is smaller than this threshold, meaning there is little information gain in updating the filter with the current sequence in the look-ahead buffer, all the frames in the sequence are updated to maximize the information gain.

TABLE I Performance Comparison
SECTION VI

## Results

The algorithm was first validated with computer simulated data to demonstrate its capabilities to produce a consistent SLAM output with known true robot and feature locations. Real data were then collected by the proposed sensor package while operating at a realistic sampling rate. The outcome of a reconstructed 3-D dense map that closely resembles the environment being explored alongside pictures from the scene are presented to show the qualities of the strategy.

### A. Simulation Results

The simulation environment was generated in MATLAB. The camera traveled along a piecewise polygonal trajectory in 3-D while continually rotating in 6-D for two loops, and took observations at 300 equally distributed poses. Two thousand features were randomly populated in the 20 m × 6 m × 20 m environment. Table I summarizes the key performance improvements between traditional EIF and the proposed technique for two measures of map smoothness. It can be seen how, for instance, with a buffer size of three frames (the next consecutive frame plus two look-ahead frames), computational time is reduced to almost 1/2 of the consecutive case, as the dimensionality of the information matrix is reduced by nearly 1/3 against the full size experiment. The estimated error in the robot pose when compared against the ground truth is pictured in Figs. 3 and 4 for the 95% uncertainty bounds test, which show the consistency of the estimation over the robot pose for the two buffer size cases depicted. 3-D views of the camera poses and map can be seen in Fig. 5(a) and (b). A detail for the case of a sharp rectangular turn is also depicted in Fig. 5(c) to gain a better understanding of the evolution of the pose distributions.

Fig. 3. Estimated covariance and estimation error in xC,yC,zC coordinates, using all 300 frames in the piecewise polygonal trajectory.
Fig. 4. Estimated covariance and estimation error in xC,yC,zC coordinates, using 163 out of 300 frames in the piecewise polygonal trajectory.
Fig. 5. Example trajectories depicting a polygon-shaped 3-D motion. (a) 3-D view of the full trajectory (two loops). (b) Zoomed-in detail. Dark (blue) ellipsoids show 95% uncertainty boundaries of the robot's estimated trajectories. Light (aqua) lines indicate robot's current heading. Stars (red) illustrate feature locations. (c) Zoomed-in section of a simulation with a simple rectangle trajectory. It can be seen in the zoomed-in view how more estimations were needed after the sharp turn indicating a decrease in the confidence levels. Once that was regained, estimations once again became more sparse along the trajectory. It also demonstrated how feature distribution influenced the frame selection.

### B. Experimental Results

For experimental evaluation, scans were collected from a mock-up search and rescue arena. The sensor package was handheld and operated at a combined sampling rate of approximately 5 Hz in a modern laptop, partly due to the extra time consumed by saving data to the hard drive for later batch processing. The sensor package is maneuvered in relatively slow motion, and it is therefore assumed to produce synchronized data from both cameras at this frequency. Three hundred sixty six scans were collected in about 1 min and 13 s. A sequence of four frames from the dataset is shown in Fig. 6. By applying the proposed approach, 118 out of 366 scans were automatically selected to be added to the filtering process based on the estimator's uncertainty belief. The final distribution of the selected frames over the entire data sequence is depicted

Fig. 6. Sample sequence of four images, where (d) is the one finally chosen to be incorporated into the filter.
Fig. 7. Distribution of selected frames over the entire data sequence.
Fig. 8. Sparse information matrix, where dark (blue) blocks show the correlations.
in Fig. 7. The state vector ended up containing a total of 2550 elements [118 camera poses (6-D) and 614 features (3-D)]. The final information matrix is illustrated in Fig. 8. The entire processing time was 1530 s, which could be further reduced by exploiting the sparseness of the information matrix. The final dense 3-D map constructed by superimposing the local point clouds to the filtered camera trajectory is shown in Fig. 9(a), where texture has been projected back into the cloud points visible in the 2-D camera images (field of view of the SwissRanger is very close to but slightly larger than the field of view of the traditional camera). A more detailed reconstruction of a small corner section and some of its constituent frames is displayed in Fig. 10. Results were also collected for comparison by implementing a standard EIF SLAM and the direct registration of 3-D features between consecutive frames over the full dataset. The 3-D maps constructed by these methods are presented in Fig. 9(b) and (c), respectively, and further discussed in the following section.
Fig. 9. 3-D map obtained from filtering of 118 (out of 366) frames of the 6 m × 3 m search and rescue arena. (a) Resulting 3-D point cloud map of the entire area superimposed on the selected estimated camera poses, as viewed from one of the corners. (b) EIF SLAM result obtained by incorporating all frames. (c) Final 3-D point cloud reconstructed by direct 3-D registration between consecutive frames. (d) Overall view of the search and rescue arena.
Fig. 10. Partial 3-D map reconstructed for the corner area covered by frames 94–140 (46 frames). Only 14 frames were actually processed in the filter that was sufficient to produce the highly detailed map seen in (a) (we choose to present this corner with an advertising tool for easy scene recognition for the readers). (b) Frame 94. (c) Frame 118. (d) Frame 140.
SECTION VII

## Discussion

Unlike most conventional fixed time step or fixed displacement approaches, our proposed technique exhibits the ability to fuse the minimal information required based on the robot uncertainty belief, which is, in turn, dynamically influenced by the complexity of the robot trajectory, the observation quality, and the estimation of the last known robot position. Results depicted in zoomed-in Fig. 5(b)and (c) are particularly relevant, as they clearly show how at areas with a reduced number of features and sharper motions, such as corners in the depicted trajectory, information losses due to sudden changes in the scenes are compensated for by the nature of the proposed filtering mechanism. As robot pose uncertainty increases and information gets scarcer, a larger number of pose estimations are necessary to compensate for the reduced quality of the observations within the robot's field of view. Thus, the consistency of the filter is maintained despite this information loss, as shown earlier in Fig. 4 during simulation.

Although the concept of information gain is very abstract, it becomes more intuitive when the filter's decision is compared with visual information side by side, as shown in Fig. 6. Frame (a) shown in Fig. 6 is taken at the last known robot position. Frames (b)–(d) in Fig. 6 are the following frames in the sequence. By visual examination, it is apparent to a human observer that frame (d) in Fig. 6 has better clarity and covers more unexplored territory compared to the other two preceding images, which in the filter's “comprehension”means better quality of matching features and greater number of more accurately initialized new features. To justify the quality of our proposed algorithm, a traditional EIF SLAM and direct 3-D registration has also been implemented using the same dataset. Results are illustrated in Fig. 9(b) and (c), respectively. Despite the long computational time required to perform EIF SLAM over the entire frame sequence (4 h and 20 min in a fast parallel cluster machine, with a final matrix of 7275 × 7275 elements), a reasonable result is obtained with full SLAM, while a large accumulated error of more than 2 m in the X-direction was the result of direct 3-D registration. From these figures, one may observe that the 3-D map obtained by using the proposed efficient methodology is even slightly better than the one using all the frames, and point clouds alignments are visually neater upon loop closure. This may appear somewhat surprising at first, as it can be reasonably expected that more observations should theoretically lead toward better filter estimations. In fact, further analysis of the simulation presented in Section VI-A demonstrated that using all the frames produced a superior pose and map estimation as expected. Pose covariance xC of one of the furthest away (in more error) poses in the simulated robot trajectory for the full and reduced number of frames is presented in Fig. 11. Covariances of other state variables also followed the same tendency. However, in the real world, observation model uncertainties and roundoff errors introduced by the correspondences between 3-D point clouds and 2-D features have practically meant that incorporating all the information not only did indeed prolong the computing process as expected, but also introduced more observation errors in the system that for longer runs, ultimately resulted in marginally poorer results. Due to the nature of the handheld experiment in 6-D, it is unrealistic to compare the estimation with the ground truth, so results with real data can only be provided at a qualitative level. Camera covariances xC, yC, and zC are nevertheless illustrated in Fig. 12 to show to some extent the bounded nature of the errors and the filter corrections. From all the experiments presented, we can conclude that the proposed algorithm is most valid for handling real datasets and constitutes a significant step in visually improving the map quality of 3-D unstructured environments in an efficient manner.

Fig. 11. xC covariance evolution of frame 89 in simulation experiment. Red dashed line indicates covariances when all frames were processed while blue line shows the covariances when only using selected frames.
SECTION VIII

## Conclusion and Future Work

We have presented an approach for producing 3-D dense map with a combination of vision and range sensors. The proposed algorithm not only produces consistent SLAM outputs but also dynamically incorporates observations into the estimation process for robust 3-D navigation in unstructured terrain. Results have shown that by gauging the information gain in each frame, we can automatically incorporate the most apt observations for the purpose of SLAM and extract comprehensive findings about the collective environment we intend to explore.

Although the current results appear promising, there are still some limitations to the proposed algorithm due to various factors.

1. SIFT is of limited assistance when dealing with more dramatic changes in view angles, especially with narrower field of view cameras. More challenging egomotions are expected when the sensor package is mounted on a continuously moving urban search and rescue (USAR) platform, instead of handheld as shown here.

2. The SwissRanger tends to return noisy measurements when it encounters glass surfaces. It also returns ambiguous measurements when measuring beyond its range, hence limiting its use to constrained environments.

3. Exploration of larger areas. Integration with another method such as submapping is a rational approach to address the scalability issue.

Fig. 12. xC (red circle), yC (blue cross), zC (green square) covariances of camera poses. Markers indicate the frames that are incorporated in the filter.

The proposed knowledge-driven algorithm can be regarded as tradeoff between computational efficiency and information loss. We believe it is critical for intelligent systems in the field to be able to attribute a measure of relevance to the information attained given the objective at hand. It is not hard to imagine how when exploring inside a collapsed building, it seems of higher priority to place exits, stairways, large cavities, etc., with relative accuracy, rather than yielding undue emphasis on generating perfectly straight walls. Active trajectory planning and exploration seem like natural companions to our proposed strategy in order to best comprehend the environment surrounding the robot, in minimum time and within bounded estimation errors. This is decidedly the case in the domain of autonomous robotics for search and rescue contemplated under the umbrella of this research paper.

## Footnotes

Manuscript received December 20, 2007; revised May 20, 2008. First published September 30, 2008; current version published nulldate. This paper was recommended for publication by Associate Editor J. Neira and Editor L. Parker upon evaluation of the reviewers' comments. This work was supported in part by the Australian Research Council (ARC) (the ARC Centre of Excellence for Autonomous Systems (CAS) is a partnership between the University of Technology Sydney, the University of Sydney, and the University of New South Wales) through its Centre of Excellence Programme and in part by the New South Wales State Government.

The authors are withthe Australian Research Council (ARC) Centre of Excellence for Autonomous Systems, University of Technology, Sydney, N.S.W. 2007, Australia (e-mail: w.zhou@cas.edu.au; j.vallsmiro@cas.edu.au; g.dissanayake@cas.edu.au).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

## References

1. A solution to the simultaneous localization and map building (SLAM) problem

G. Dissanayake, P. Newman, S. Clark, H. Durrant-Whyte, M. Csorba

IEEE Trans. Robot. Autom. Vol. 17, issue (3) pp. 229–241, 2001-06

2. Dense 3D map building based on LRF data and color image fusion

Proc. IEEE Int. Conf. Intell. Robot Syst., 2005, 1774–1779

3. Globally consistent range scan alignment for environment mapping

F. Lu, E. Milios

Auton. Robots, Vol. 4, issue (4) pp. 333–349 1997

4. An all-solid-state optical range camera for 3D real-time imaging with sub-centimeter depth resolution (SwissRanger)

T. Oggier, M. Lehmann, R. Kaufmann, M. Schweizer, M. Richter, P. Metzler, G. Lang, F. Lustenberger, N. Blanc

Proc. SPIE Vol. 5249, pp. 534–545, 2004

5. A performance evaluation of local descriptors

K. Mikolajczyk, C. Schmid

Proc. IEEE Comp. Soc. Conf. Comput. Vision Pattern Recognit., 2003, 257–263

6. Dense 3D map construction for indoor search and rescue

L. P. Ellekilde, S. Huang, J. Valls Miró, G. Dissanayake

J. Field Robot., vol. 24, issue (1/2), p. 71–89, 2007

7. Information efficient 3D visual SLAM in unstructured domains

W. Zhou, J. Valls Miró, G. Dissanayake

Proc. IEEE Int. Conf. Intel. Sensors, Sensor Netw. Inf. Process., Melbourne, Qld., Australia, 2007, pp. 323–328

8. Least square fitting of two 3-D point sets

K. S. Arun, T. S. Huang, S. D. Blostein

IEEE Pattern Anal. Mach. Intell., Vol. 9, issue (5) pp. 698–700, 1987-09

9. Distinctive image features from scale-invariant keypoints

D. G. Lowe

Int. J. Comput. Vision Vol. 60, issue (2) pp. 91–110 2004

10. Visually augmented navigation in an unstructured environment using a delayed state history

R. M. Eustice, O. Pizarro, H. Singh

Proc. IEEE Int. Conf. Robot. Autom., 2004, 25–32

11. Exactly sparse delayed-state filters

R. M. Eustice, H. Singh, J. J. Leonard

Proc. IEEE Int. Conf. Robot. Autom., 2005, 2417–2424

12. Using laser range data for 3D SLAM in outdoor environments

D. M. Cole, P. M. Newman

Proc. IEEE Int. Conf. Robot. Autom., 2006, 1556–1563

13. Random sampling consensus: A paradigm for model fitting with application to image analysis and automated cartography

M. Fischler, R. Bolles

Commun. Assoc. Comput. Mach., Vol. 24 pp. 381–395 1981

14. Square root SAM

F. Dellaert

Proc. Robot.: Sci. Syst., Cambridge, MA, 2005

15. Sensor influence in the performance of simultaneous mobile robot localization and map building

A. J. Castellanos, J. M. M. Montiel, J. Neira, J. D. Tardós

Proc. 6th Int. Symp. Exp. Robot., 1999, 203–212

16. Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks

S. Se, D. G. Lowe, J. Little

Int. J. Robot. Res., Vol. 21, issue (8) pp. 735–758 2002

17. A state-of-art 3D sensor for robot navigation

J. W. Weingarten, G. Gruener, R. Siegwart

Sendai, Japan
Proc. IEEE Int. Conf. Intell. Robot Syst., 2004, 2155–2160

18. Entropy minimization SLAM using stereo vision

J. M. Saez, F. Escolano

Barcelona, Spain
Proc. IEEE Int. Conf. Robot. Autom., 2005, 36–43

19. 6DOF entropy minimization SLAM

J. M. Saez, F. Escolano

Proc. IEEE Int. Conf. Robot. Autom., 2006, 1548–1555

20. Vision-based SLAM using Rao–Blackwellised particle filter

R. Sim, P. Elinas, M. Griffin, J. J. Little

Int. Joint Conf. Artif. Intell. Workshop Reasoning Uncertainty Robot., presented at the, Edinburgh, U.K., 2005

21. Landmark-based simultaneous localization and mapping with stereo vision

M. N. Dailey, M. Parnichkun

Proc. IEEE Asian Conf. Ind. Autom. Robot., 2005, 108–113

22. Vision-based global localization and mapping for mobile robots

S. Se, D. G. Lowe, J. J. Little

IEEE Trans. Robot., Vol. 21, issue (3) pp. 364–375, 2005-06

23. Mapping large scale environments using relative position information among landmarks

S. Huang, Z. Wang, G. Dissanayake

Proc. IEEE Int. Conf. Robot. Autom., 2006, 2297–2302

24. Outdoor SLAM using visual appearance and laser ranging

P. Newman, D. Cole, K. Ho

Proc. IEEE Int. Conf. Robot. Autom., 2006, 1180–1187

25. Using laser range data for 3D SLAM in outdoor environments

D. M. Cole, P. Newman

Proc. IEEE Int. Conf. Robot. Autom., 2006, 1556–1563

26. Photo-realistic 3D model reconstruction

S. Se, P. Jasiobedzki

Proc. IEEE Int. Conf. Robot. Autom. Workshop SLAM, 2006 pp. 3076–3082

## Cited By

No Citations Available

## Keywords

### IEEE Keywords

No Keywords Available

### INSPEC: Controlled Indexing

SLAM (robots), robot vision

### More Keywords

No Keywords Available

No Corrections

## Media

No Content Available
This paper appears in:
IEEE Transactions on Robotics
Issue Date:
OCTOBER 2008
On page(s):
1078 - 1087
ISBN:
1552-3098
Print ISBN:
N/A
INSPEC Accession Number:
10291071
Digital Object Identifier:
10.1109/TRO.2008.2004834
Date of Current Version:
31 Oct, 2008
Date of Original Publication:
30 Sep, 2008

Machler, P.

Tomono, M.